Juicepyter/clean-text-to-keywords
2026-03-19 18:16:20 +01:00

Pokemon Text-to-JSON Pipeline

This project converts free-form Pokemon description text into:

  1. A normalized keyword list
  2. A populated Pokemon JSON object (from a blank/key-only template)

The pipeline is deterministic and rule-based.

Architecture

Stage 1: Keyword Extraction

File: keyword_extractor.py

Input: raw text description

Core logic:

  • spaCy tokenization and POS tagging
  • POS filtering (NOUN, ADJ, VERB)
  • stopword and punctuation removal
  • lemma-based normalization
  • domain synonym normalization (example: flames -> fire)
  • optional YAKE relevance scoring
  • conservative retention policy so detail is not over-pruned

Output: ordered list of normalized keywords
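
The filtering and normalization steps above can be sketched without spaCy by starting from already-tagged (lemma, POS) pairs. This is a minimal illustration, not the code in keyword_extractor.py: the function name normalize_keywords, the stopword set, and the synonym table are assumptions made for the example, and YAKE scoring is left out.

```python
from typing import Iterable

# Hypothetical values for illustration; the real lists live in keyword_extractor.py.
KEEP_POS = {"NOUN", "ADJ", "VERB"}
STOPWORDS = {"with", "a", "the", "and", "be"}
SYNONYMS = {"flames": "fire", "flame": "fire"}


def normalize_keywords(tagged: Iterable[tuple[str, str]]) -> list[str]:
    """Turn (lemma, POS) pairs into an ordered, de-duplicated keyword list."""
    seen: set[str] = set()
    keywords: list[str] = []
    for lemma, pos in tagged:
        word = lemma.lower()
        # Drop tokens with an unwanted part of speech, and stopwords.
        if pos not in KEEP_POS or word in STOPWORDS:
            continue
        # Map domain synonyms onto one canonical keyword.
        word = SYNONYMS.get(word, word)
        # Keep only the first occurrence so the output order is stable.
        if word not in seen:
            seen.add(word)
            keywords.append(word)
    return keywords


tagged = [("breathe", "VERB"), ("hot", "ADJ"), ("flames", "NOUN"),
          ("with", "ADP"), ("fire", "NOUN"), (",", "PUNCT")]
print(normalize_keywords(tagged))  # ['breathe', 'hot', 'fire']
```

Keeping the first occurrence of each keyword is what makes the result an ordered list rather than a set, so the same input always yields the same output.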

Stage 2: JSON Inference

File: json_inference.py

Input: keyword list + optional JSON template

Core logic:

  • infer primary/secondary type
  • infer name candidate
  • infer attacks, abilities, habitat, personality
  • infer basic stats (hp, attack, defense, speed)
  • fill nested TCG-like template fields (types, attacks, weaknesses, stage, retreat, etc.)
  • preserve already non-empty values in the provided template

Output: inferred JSON profile
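
As a rough illustration of rule-based inference, primary and secondary type could be picked by counting how many keywords match each type's cue words. The cue table and the infer_types function below are hypothetical; the actual rules in json_inference.py may differ.

```python
from collections import Counter
from typing import Optional

# Hypothetical cue table; the real mappings live in json_inference.py.
TYPE_CUES = {
    "fire": {"fire", "burn", "ember"},
    "water": {"water", "swim", "wave"},
    "normal": {"fur", "tail", "agile"},
}


def infer_types(keywords: list[str]) -> tuple[Optional[str], Optional[str]]:
    """Return (primary, secondary) types ranked by number of matching cues."""
    scores: Counter = Counter()
    for keyword in keywords:
        for type_name, cues in TYPE_CUES.items():
            if keyword in cues:
                scores[type_name] += 1
    # most_common keeps first-seen order on ties, so the result is deterministic.
    ranked: list[Optional[str]] = [name for name, _ in scores.most_common(2)]
    ranked += [None] * (2 - len(ranked))
    return ranked[0], ranked[1]


print(infer_types(["furret", "fur", "tail", "swim"]))  # ('normal', 'water')
```

A plain count like this is easy to explain and to extend: adding a type or a cue word only means editing the table.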

Stage 3: Orchestration CLI

File: infer_json_usage.py

This is the main entrypoint for end-to-end usage.

Default behavior:

  1. prints extracted keyword list
  2. prints inferred JSON

Project Structure

  • keyword_extractor.py: keyword extraction engine
  • json_inference.py: keyword-to-JSON inference logic
  • infer_json_usage.py: end-to-end CLI
  • example_usage.py: keyword-extraction-only CLI
  • json_template_example.json: sample blank/key-only template
  • test_keyword_extractor.py: extraction tests
  • test_json_inference.py: inference tests
  • requirements.txt: Python dependencies

Requirements

  • Python 3.13 or lower is recommended for spaCy compatibility
  • pip

Dependencies in requirements.txt:

  • spacy>=3.7.0
  • yake>=0.4.2

Setup

  1. Create and activate a virtual environment (recommended)
python -m venv .venv
source .venv/bin/activate
  2. Install dependencies
pip install -r requirements.txt
  3. Install spaCy English model
python -m spacy download en_core_web_sm

How To Run

A) Extract keywords only

python example_usage.py "furret long slender agile creature with soft fur"

Output: JSON list of keywords.

B) End-to-end: text -> keywords -> JSON

python infer_json_usage.py --template json_template_example.json "furret long slender agile creature with soft fur"

Output order:

  1. keyword list
  2. inferred JSON

C) End-to-end, JSON output only

python infer_json_usage.py --json-only --template json_template_example.json "furret long slender agile creature with soft fur"

D) Start from keywords directly

python infer_json_usage.py --template json_template_example.json --keywords furret normal tail smash tunnel agile cheerful explore endurance

Tip: If you pass --keywords, text extraction is skipped.
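
The flags used in the commands above could be declared with argparse roughly as follows. This is a sketch under assumptions, not the parser in infer_json_usage.py: only the flag names --template, --json-only, and --keywords come from this README, and build_parser and the help strings are illustrative.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the flags shown in this README."""
    parser = argparse.ArgumentParser(description="text -> keywords -> JSON")
    # The description text is optional because --keywords can replace it.
    parser.add_argument("text", nargs="?", help="free-form description")
    parser.add_argument("--template", help="path to a JSON template file")
    parser.add_argument("--json-only", action="store_true",
                        help="print only the inferred JSON")
    parser.add_argument("--keywords", nargs="+",
                        help="keywords to use directly; skips extraction")
    return parser


args = build_parser().parse_args(
    ["--template", "json_template_example.json",
     "--keywords", "furret", "normal", "tail"]
)
print(args.keywords, args.text)  # ['furret', 'normal', 'tail'] None
```

With nargs="+", --keywords consumes every following bare word, which is why it is convenient to put it last on the command line.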

Template Behavior

If --template is omitted, inference returns a full inferred profile object.

If --template is provided:

  • empty fields are populated from inferred values
  • non-empty fields are preserved
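
The fill rule can be sketched as a recursive merge that only writes into empty slots. The helper names, the definition of "empty" (None, empty string, empty list, empty dict), and the sample data, including the nested stats object, are assumptions for illustration; json_inference.py may treat edge cases differently.

```python
from typing import Any


def is_empty(value: Any) -> bool:
    """Treat None, empty strings, and empty containers as unfilled."""
    return value is None or value == "" or value == [] or value == {}


def fill_template(template: dict, inferred: dict) -> dict:
    """Fill empty template fields from inferred values; keep everything else."""
    result = {}
    for key, current in template.items():
        candidate = inferred.get(key)
        if isinstance(current, dict) and current and isinstance(candidate, dict):
            # Non-empty nested object: apply the same rule one level down.
            result[key] = fill_template(current, candidate)
        elif is_empty(current) and candidate is not None:
            result[key] = candidate
        else:
            # Already filled, or nothing was inferred: preserve the template value.
            result[key] = current
    return result


template = {"name": "", "stage": "Basic", "types": [],
            "stats": {"hp": None, "speed": 90}}
inferred = {"name": "Furret", "stage": "Stage 1", "types": ["normal"],
            "stats": {"hp": 85, "speed": 70}}
print(fill_template(template, inferred))
# {'name': 'Furret', 'stage': 'Basic', 'types': ['normal'], 'stats': {'hp': 85, 'speed': 90}}
```

In this sketch only keys present in the template appear in the result, and a numeric 0 counts as a filled value.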

The current sample template supports nested card-like data, including:

  • types
  • attacks with cost, name, effect, damage
  • weaknesses with type, value
  • stage, retreat, legal

Tests

Run all tests:

python -m unittest -q

Troubleshooting

1) spaCy model not found

The error message says that en_core_web_sm is not installed.

Fix:

python -m spacy download en_core_web_sm
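
To confirm that the model is visible to the interpreter you are running, a small standard-library check may help. It relies on spaCy models being installed as regular Python packages; the function name model_installed is made up for this example.

```python
import importlib.util


def model_installed(name: str = "en_core_web_sm") -> bool:
    """Return True if a top-level package with this name can be found."""
    return importlib.util.find_spec(name) is not None


if not model_installed():
    print("Model missing. Run: python -m spacy download en_core_web_sm")
```

If this reports the model as missing right after a successful download, the download most likely went into a different environment than the one running the script.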

2) spaCy import/runtime problems on very new Python versions

Use Python 3.13 or lower and reinstall requirements.
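
A quick way to check the running interpreter against that recommendation is sketched below. The names MAX_SUPPORTED and python_supported are hypothetical, and the 3.13 limit is taken from this README, not from spaCy's own metadata.

```python
import sys

# Upper bound recommended in this README; adjust if the recommendation changes.
MAX_SUPPORTED = (3, 13)


def python_supported(version=None) -> bool:
    """Return True if the (major, minor) version is at or below MAX_SUPPORTED."""
    major_minor = tuple((version or sys.version_info)[:2])
    return major_minor <= MAX_SUPPORTED


if not python_supported():
    print("Python %d.%d is newer than the recommended maximum." % sys.version_info[:2])
```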

3) --template path errors

Ensure --template points to a valid file path, for example:

--template json_template_example.json

If your input is already a keyword list, use --keywords instead of putting the list in --template.

Design Notes

  • deterministic and explainable (no LLM calls)
  • domain mappings are easy to extend in keyword_extractor.py and json_inference.py
  • scoring and template fill rules are intentionally simple and stable for game-content generation