first commit

2026-03-19 18:16:20 +01:00
commit 584b2e07b4
34 changed files with 4381 additions and 0 deletions

@@ -0,0 +1,189 @@
# Pokemon Text-to-JSON Pipeline
This project converts free-form Pokemon description text into:
1. A normalized keyword list
2. A populated Pokemon JSON object (from a blank/key-only template)
The pipeline is deterministic and rule-based.
## Architecture
### Stage 1: Keyword Extraction
File: `keyword_extractor.py`
Input: raw text description
Core logic:
- spaCy tokenization and POS tagging
- POS filtering (`NOUN`, `ADJ`, `VERB`)
- stopword and punctuation removal
- lemma-based normalization
- domain synonym normalization (example: `flames -> fire`)
- optional YAKE relevance scoring
- conservative retention policy so detail is not over-pruned
Output: ordered list of normalized keywords
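The filtering and normalization steps above can be sketched without loading spaCy by working on tokens that are already tagged. This is a minimal illustration, not the code in `keyword_extractor.py`: the `Token` tuple, the `STOPWORDS` and `SYNONYMS` tables, the de-duplication step, and the `extract_keywords` name are all assumptions made for the example. In the real pipeline spaCy would supply the POS tags, lemmas, and stopword flags.

```python
from typing import NamedTuple


class Token(NamedTuple):
    """Stand-in for a tagged token (hypothetical; spaCy provides this in practice)."""
    text: str
    pos: str    # coarse POS tag such as "NOUN"
    lemma: str


KEEP_POS = {"NOUN", "ADJ", "VERB"}            # POS filter named in the README
STOPWORDS = {"be", "have", "do"}              # hypothetical stopword list
SYNONYMS = {"flame": "fire"}                  # hypothetical table; mirrors `flames -> fire`


def extract_keywords(tokens):
    """Return normalized keywords in first-seen order."""
    seen = set()
    keywords = []
    for tok in tokens:
        if tok.pos not in KEEP_POS:
            continue                          # drops punctuation, pronouns, etc.
        lemma = tok.lemma.lower()
        if lemma in STOPWORDS or not lemma.isalpha():
            continue
        lemma = SYNONYMS.get(lemma, lemma)    # domain synonym normalization
        if lemma not in seen:                 # assumed: duplicates are dropped
            seen.add(lemma)
            keywords.append(lemma)
    return keywords


tagged = [
    Token("It", "PRON", "it"),
    Token("breathes", "VERB", "breathe"),
    Token("hot", "ADJ", "hot"),
    Token("flames", "NOUN", "flame"),
    Token(".", "PUNCT", "."),
]
print(extract_keywords(tagged))  # ['breathe', 'hot', 'fire']
```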
### Stage 2: JSON Inference
File: `json_inference.py`
Input: keyword list + optional JSON template
Core logic:
- infer primary/secondary type
- infer name candidate
- infer attacks, abilities, habitat, personality
- infer basic stats (`hp`, `attack`, `defense`, `speed`)
- fill nested TCG-like template fields (`types`, `attacks`, `weaknesses`, `stage`, `retreat`, etc.)
- preserve already non-empty values in the provided template
Output: inferred JSON profile
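As a rough illustration of rule-based inference, the sketch below ranks candidate types by how many keywords point at them. The `TYPE_HINTS` table and the `infer_types` function are hypothetical; the actual mappings and scoring in `json_inference.py` may differ.

```python
# Hypothetical keyword -> type hints, for illustration only.
TYPE_HINTS = {
    "fire": "Fire",
    "burn": "Fire",
    "water": "Water",
    "swim": "Water",
    "fur": "Normal",
    "agile": "Normal",
}


def infer_types(keywords):
    """Return (primary, secondary) types ranked by keyword hits."""
    counts = {}
    for keyword in keywords:
        hinted = TYPE_HINTS.get(keyword)
        if hinted is not None:
            counts[hinted] = counts.get(hinted, 0) + 1
    # sorted() is stable, so ties keep the order in which types were first seen.
    ranked = sorted(counts, key=lambda name: -counts[name])
    primary = ranked[0] if ranked else None
    secondary = ranked[1] if len(ranked) > 1 else None
    return primary, secondary


print(infer_types(["furret", "agile", "fur", "swim"]))  # ('Normal', 'Water')
```

Because the ranking depends only on the input order and a fixed table, the same keyword list always yields the same result, which is the property the README calls deterministic.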
### Stage 3: Orchestration CLI
File: `infer_json_usage.py`
This is the main entrypoint for end-to-end usage.
Default behavior:
1. prints extracted keyword list
2. prints inferred JSON
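One way such a CLI could be wired with `argparse` is sketched below. The flag names (`--template`, `--json-only`, `--keywords`) come from this README; the `build_parser` function, the help strings, and the `nargs` choices are assumptions and may not match `infer_json_usage.py`.

```python
import argparse


def build_parser():
    """Build a parser mirroring the documented flags (a sketch, not the real CLI)."""
    parser = argparse.ArgumentParser(description="Text -> keywords -> Pokemon JSON")
    parser.add_argument("text", nargs="?", help="free-form description text")
    parser.add_argument("--template", help="path to a JSON template file")
    parser.add_argument("--json-only", action="store_true",
                        help="print only the inferred JSON")
    parser.add_argument("--keywords", nargs="+",
                        help="use these keywords and skip text extraction")
    return parser


args = build_parser().parse_args(
    ["--json-only", "--template", "json_template_example.json", "furret long slender"]
)
print(args.json_only, args.template, args.text)
```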
## Project Structure
- `keyword_extractor.py`: keyword extraction engine
- `json_inference.py`: keyword-to-JSON inference logic
- `infer_json_usage.py`: end-to-end CLI
- `example_usage.py`: keyword extraction only CLI
- `json_template_example.json`: sample blank/key-only template
- `test_keyword_extractor.py`: extraction tests
- `test_json_inference.py`: inference tests
- `requirements.txt`: Python dependencies
## Requirements
- Python 3.13 or lower is recommended for spaCy compatibility
- pip
Dependencies in `requirements.txt`:
- `spacy>=3.7.0`
- `yake>=0.4.2`
## Setup
1. Create and activate a virtual environment (recommended)
```bash
python -m venv .venv
source .venv/bin/activate   # on Windows: .venv\Scripts\activate
```
2. Install dependencies
```bash
pip install -r requirements.txt
```
3. Install spaCy English model
```bash
python -m spacy download en_core_web_sm
```
## How To Run
### A) Extract keywords only
```bash
python example_usage.py "furret long slender agile creature with soft fur"
```
Output: JSON list of keywords.
### B) End-to-end: text -> keywords -> JSON
```bash
python infer_json_usage.py --template json_template_example.json "furret long slender agile creature with soft fur"
```
Output order:
1. keyword list
2. inferred JSON
### C) End-to-end, JSON output only
```bash
python infer_json_usage.py --json-only --template json_template_example.json "furret long slender agile creature with soft fur"
```
### D) Start from keywords directly
```bash
python infer_json_usage.py --template json_template_example.json --keywords furret normal tail smash tunnel agile cheerful explore endurance
```
Tip: If you pass `--keywords`, text extraction is skipped.
## Template Behavior
If `--template` is omitted, inference returns a full inferred profile object.
If `--template` is provided:
- empty fields are populated from inferred values
- non-empty fields are preserved
The current sample template (`json_template_example.json`) supports nested card-like data, including:
- `types`
- `attacks` with `cost`, `name`, `effect`, `damage`
- `weaknesses` with `type`, `value`
- `stage`, `retreat`, `legal`
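The fill-and-preserve rule can be sketched as a small recursive merge. This is an illustration under assumptions, not the project's implementation: the `is_empty` and `fill_template` names, the definition of "empty", the handling of nested objects, and the shape of the `legal` field in the example are all hypothetical. Lists of objects such as `attacks` are treated as a single value here for brevity.

```python
def is_empty(value):
    """Treat None, "", [] and {} as blank; 0 and False count as real values."""
    return value is None or value == "" or value == [] or value == {}


def fill_template(template, inferred):
    """Populate blank template fields from inferred values, keeping filled ones."""
    result = {}
    for key, current in template.items():
        candidate = inferred.get(key)
        if isinstance(current, dict) and current and isinstance(candidate, dict):
            # Recurse into nested objects so inner blanks are filled too.
            result[key] = fill_template(current, candidate)
        elif is_empty(current) and candidate is not None:
            result[key] = candidate
        else:
            result[key] = current             # non-empty values are preserved
    return result


template = {
    "name": "",
    "types": [],
    "stage": "Basic",
    "legal": {"standard": None, "expanded": True},   # hypothetical shape
}
inferred = {
    "name": "Furret",
    "types": ["Normal"],
    "stage": "Stage1",
    "legal": {"standard": True, "expanded": False},
}
filled = fill_template(template, inferred)
print(filled)
```

In this sketch `stage` stays `"Basic"` and `legal.expanded` stays `True` because the template already set them, while `name`, `types`, and `legal.standard` are taken from the inferred values.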
## Tests
Run all tests:
```bash
python -m unittest -q
```
## Troubleshooting
### 1) spaCy model not found
Symptom: spaCy raises an error saying it cannot find the `en_core_web_sm` model.
Fix:
```bash
python -m spacy download en_core_web_sm
```
### 2) spaCy import/runtime problems on very new Python versions
Switch to Python 3.13 or lower, then reinstall the requirements in a fresh virtual environment.
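A quick way to confirm which interpreter is active is a version check like the one below. The `MAX_RECOMMENDED` constant simply encodes the recommendation from this README, and `python_is_recommended` is a hypothetical helper, not part of the project.

```python
import sys

MAX_RECOMMENDED = (3, 13)  # taken from the README's recommendation


def python_is_recommended(version_info=sys.version_info):
    """Return True when the (major, minor) version is within the recommended range."""
    return tuple(version_info[:2]) <= MAX_RECOMMENDED


if not python_is_recommended():
    print(f"Python {sys.version_info.major}.{sys.version_info.minor} is newer than "
          "the recommended 3.13; spaCy may fail to install or import.")
```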
### 3) `--template` path errors
Ensure `--template` points to an existing JSON file, for example:
```bash
python infer_json_usage.py --template json_template_example.json "furret long slender agile creature with soft fur"
```
If your input is already a keyword list, use `--keywords` instead of putting the list in `--template`.
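Both mistakes (a bad path, or a keyword list passed where a template is expected) can be caught early with a small loader. The sketch below is an assumption about how such validation might look; `load_template` and its error messages are hypothetical and not taken from `infer_json_usage.py`.

```python
import json
from pathlib import Path


def load_template(path):
    """Load a JSON template, failing early with a readable message."""
    template_path = Path(path)
    if not template_path.is_file():
        raise FileNotFoundError(f"Template not found: {template_path}")
    with template_path.open(encoding="utf-8") as handle:
        data = json.load(handle)
    if not isinstance(data, dict):
        # A JSON list here usually means keywords were passed as a template.
        raise ValueError("Template must be a JSON object; pass keyword lists via --keywords")
    return data


try:
    load_template("no_such_template.json")
except FileNotFoundError as exc:
    print(exc)
```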
## Design Notes
- deterministic and explainable (no LLM calls)
- domain mappings are easy to extend in `keyword_extractor.py` and `json_inference.py`
- scoring and template fill rules are intentionally simple and stable for game-content generation