NLTK resources re-downloaded on every clean_pokemon_text() call #10

Open
opened 2026-03-19 17:30:35 +00:00 by llabeyrie · 0 comments
Owner

Description

In text-cleaner/text_cleaning_pipeline.py line 121:

def clean_pokemon_text(raw_text: str, min_len: int = 3) -> Dict[str, Any]:
    ...
    ensure_nltk_resources(quiet=True)   # called EVERY time!

ensure_nltk_resources() calls nltk.download() for 6 resources on every invocation of the cleaning function.

Problem

  • Although nltk.download() checks whether each resource is already present, that check itself has I/O overhead (stat calls per resource)
  • In the Streamlit app, this runs on every user click
  • This adds unnecessary latency to each pipeline run

Fix

Use a module-level flag:

_nltk_ready = False

def ensure_nltk_resources(quiet: bool = True) -> None:
    global _nltk_ready
    if _nltk_ready:
        return
    ...
    _nltk_ready = True

Or call it once at import/app startup time.
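
For reference, a fuller sketch of the flag-based guard might look like the following. It is only an illustration under stated assumptions: the resource names in `_RESOURCES` are placeholders (the six resources the pipeline actually needs are not listed in this issue), and the `download` and `resources` parameters are hypothetical additions that make the guard testable without network access. They are not part of the current function signature.

```python
from typing import Callable, Iterable, Optional

# Placeholder list: the issue mentions six resources but does not name them.
_RESOURCES = ("punkt", "stopwords")

# Module-level flag: flips to True after the first successful pass.
_nltk_ready = False


def ensure_nltk_resources(
    quiet: bool = True,
    download: Optional[Callable[..., bool]] = None,  # hypothetical test hook
    resources: Iterable[str] = _RESOURCES,           # hypothetical parameter
) -> None:
    """Run the NLTK resource check once per process, then return immediately."""
    global _nltk_ready
    if _nltk_ready:
        return  # fast path: no download check, no filesystem access

    if download is None:
        # Imported lazily so the module still loads when a test injects a fake.
        import nltk
        download = nltk.download

    for name in resources:
        download(name, quiet=quiet)

    # Set only after every resource was processed without raising.
    _nltk_ready = True
```

With this in place, `clean_pokemon_text()` can keep calling `ensure_nltk_resources(quiet=True)` unchanged, and only the first call in a process pays the cost. The flag is not protected by a lock, so two threads that race on the very first call may both run the check once; that is redundant but harmless.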

llabeyrie added the performance, priority: medium labels 2026-03-19 17:31:45 +00:00