feat(entity): filter common English content words from entity detection #1605
Common English content words ("Code", "Brutal", "Phase", "Line", "Note",
"Planning", "Chat", ...) frequently appear capitalized at sentence
start, in headings, or in markdown — and the existing regex-based entity
detector treats every capitalized word appearing 3+ times as a proper
noun candidate. This produces false positives that pollute
known_entities.json, the per-drawer entities metadata, and closet
pointer entity lists.
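
The failure mode described above can be reproduced with a small sketch. This is not the detector in entity_detector.py: the regex, the `MIN_OCCURRENCES` constant, and the `candidate_entities` name are assumptions chosen only to illustrate the "capitalized word appearing 3+ times" rule.

```python
import re
from collections import Counter

# Assumed pattern for "capitalized word"; the real detector's regex is not shown here.
CAPITALIZED = re.compile(r"\b[A-Z][a-z]+\b")
MIN_OCCURRENCES = 3  # the "3+ times" threshold from the description


def candidate_entities(text):
    """Return capitalized words that occur at least MIN_OCCURRENCES times."""
    counts = Counter(CAPITALIZED.findall(text))
    return sorted(word for word, n in counts.items() if n >= MIN_OCCURRENCES)


corpus = "\n".join(
    [
        "Line 3: Riley opened the file.",
        "Line 4: Riley fixed the parser.",
        "Line 5: Riley pushed the change.",
    ]
)

# "Line" is only a sentence-initial common word, yet it clears the threshold
# alongside the real name "Riley".
print(candidate_entities(corpus))  # ['Line', 'Riley']
```

Without any wordlist, the frequency rule alone cannot tell "Line" from "Riley", which is the gap the filter is meant to close.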
Adds a curated list of ~1000 common English content words (nouns,
adjectives, verbs — no proper nouns, no known product names) at
mempalace/data/coca_content_words.json, derived from the COCA top-2000
frequency list and from observed false positives in real palaces.
The list is loaded once at module import via _get_coca_filter() (cached
frozenset) and consulted at three sites:
- entity_detector.extract_candidates — init-time entity detection
- palace.build_closet_lines — closet pointer construction
- miner._extract_entities_for_metadata — per-drawer entity tagging
Matching is case-insensitive: the candidate is lowercased before
membership check, so "Code", "CODE", and "code" are all blocked.
Only single-word candidates are filtered; multi-word phrases (e.g.
"Claude Code") are NOT filtered, so legitimate compound names continue
to be detected. A future PR will add a known-systems lexicon that
explicitly protects compound names — for now, the multi-word pass
provides the implicit protection.
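
The matching rule above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code in this PR: `COMMON_WORDS` is a tiny stand-in for the shipped wordlist, and `filter_candidates` is a hypothetical helper name.

```python
# Tiny stand-in for the shipped wordlist (assumption: entries are stored lowercase).
COMMON_WORDS = frozenset({"code", "phase", "line", "note", "planning", "chat"})


def filter_candidates(candidates, common_words=COMMON_WORDS):
    """Drop single-word candidates that are common content words.

    Hypothetical helper: multi-word phrases are passed through untouched,
    which is what keeps "Claude Code" detectable.
    """
    kept = []
    for candidate in candidates:
        is_single_word = len(candidate.split()) == 1
        if is_single_word and candidate.lower() in common_words:
            continue  # blocked regardless of the original casing
        kept.append(candidate)
    return kept


result = filter_candidates(["Code", "CODE", "code", "Claude Code", "Riley"])
print(result)  # ['Claude Code', 'Riley']
```

Lowercasing only at lookup time keeps the candidate's original casing intact for anything that survives the filter.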
Adds 8 new tests in tests/test_entity_detector.py covering:
- "Code" appearing 5x is filtered
- All known false positives from real palaces are filtered
- Case-insensitive matching works
- Real names ("Aya", "Riley") are NOT filtered
- Multi-word phrases like "Claude Code" are NOT filtered
- The data file ships in the wheel with the expected schema
- The data file contains every known false positive
Note: this filter prevents NEW false positives at detection time. It
does NOT retroactively scrub existing entries from a user's
~/.mempalace/known_entities.json. Stale entries from previous runs
must be removed via a future entity-management CLI (planned).
Verification: pytest 2191 passed / 3 skipped / 0 failed on macOS.
ruff check + format --check both clean. Cross-platform via
pip install -e ".[dev]" on python:3.9-slim / 3.11-slim / 3.13-slim
containers, each running 65 entity-detector tests including the new
data-file-presence test that imports from the installed package.
End-to-end pipeline test mining a deterministic corpus into an
isolated palace confirmed entity metadata is empty (no false
positives surface in real chromadb-backed storage).
Code Review
This pull request introduces a COCA content-word filter to improve entity detection by excluding common English words that are frequently misidentified as proper nouns. The implementation includes a cached data loader in entity_detector.py and integration into the extraction pipelines in miner.py and palace.py, along with comprehensive unit tests. Review feedback suggests enhancing the robustness of the JSON loading logic by catching TypeError to prevent potential crashes on malformed data structures.
…n malformed JSON

The previous commit landed the code referencing mempalace/data/coca_content_words.json, but the data file itself was never staged (likely a multi-line paste mishap during the initial commit), so CI failed on every platform because the file was not in the repo. This commit ships the wordlist and adds TypeError to the COCA loader's graceful-degradation exception tuple, per the gemini-code-assist review on PR #1605.
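
To illustrate why TypeError belongs in that exception tuple, here is a hedged sketch of a cached loader that degrades to an empty filter. The `{"words": [...]}` schema, the `load_wordlist` name, and the exact exception tuple are assumptions for the example; the actual loader and data-file schema may differ.

```python
import json
import tempfile
from functools import lru_cache
from pathlib import Path


@lru_cache(maxsize=None)
def load_wordlist(path):
    """Return the wordlist as a frozenset, or an empty frozenset on any load problem."""
    try:
        with open(path, encoding="utf-8") as fh:
            data = json.load(fh)
        # Assumed schema: {"words": ["code", "line", ...]}.
        return frozenset(word.lower() for word in data["words"])
    except (OSError, ValueError, KeyError, AttributeError, TypeError):
        # ValueError covers json.JSONDecodeError. TypeError covers wrong shapes,
        # e.g. a top-level JSON list (data["words"] on a list) or "words": null.
        return frozenset()


with tempfile.TemporaryDirectory() as tmp:
    good = Path(tmp) / "good.json"
    good.write_text(json.dumps({"words": ["Code", "line"]}), encoding="utf-8")

    bad_shape = Path(tmp) / "bad_shape.json"
    bad_shape.write_text(json.dumps(["code", "line"]), encoding="utf-8")

    good_words = load_wordlist(good)
    bad_words = load_wordlist(bad_shape)  # valid JSON, wrong structure
    missing_words = load_wordlist(Path(tmp) / "missing.json")

print(sorted(good_words), bad_words, missing_words)
```

An empty filter means detection falls back to its pre-filter behavior rather than crashing at import time.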
gemini-code-assist /review
Code Review
This pull request implements a COCA content-word filter to reduce false-positive entity detections by excluding common English nouns, adjectives, and verbs. The changes include a new JSON wordlist, a cached loading mechanism, and integration into the entity extraction and mining processes. Review feedback recommends moving local imports and filter initializations to the top level of the modules to improve performance in hot paths and adhere to standard Python practices.
    from .entity_detector import _get_coca_filter
    from .palace import _candidate_entity_words

    coca_filter = _get_coca_filter()
The local imports (_get_coca_filter, _candidate_entity_words) and the call to _get_coca_filter() are located inside _extract_entities_for_metadata, which is a hot path called for every drawer (chunk) during mining. In a large project, this can be executed tens of thousands of times.
Local imports and repeated function calls (even if cached via lru_cache) in such a hot loop introduce unnecessary overhead. Since there are no circular dependencies preventing it, please move these imports to the top of the file and consider calling _get_coca_filter() once at the module level or passing the filter down from a higher-level function.
Additionally, note that import re at line 853 and from .palace import _ENTITY_STOPLIST at line 855 (inside the same function) are also local imports that should be moved to the top level for efficiency.
References
- Imports are always put at the top of the file, just after any module comments and docstrings, and before module globals and constants. Moving imports out of functions avoids repeated import overhead in hot loops.
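
The pattern discussed in this review comment can be sketched with a standard-library stand-in. The function names and regex below are hypothetical, and `re` stands in for the package-relative imports in the PR; the size of the timing gap depends on the interpreter and workload and may be small.

```python
import re
import timeit

# Hoisted: the import and the compiled pattern are resolved once at module load.
_CAPITALIZED = re.compile(r"\b[A-Z][a-z]+\b")


def extract_with_local_import(text):
    # Function-level import: the import statement is re-executed on every call
    # (a sys.modules lookup after the first time, but still per-call work).
    import re

    return re.findall(r"\b[A-Z][a-z]+\b", text)


def extract_hoisted(text):
    # Uses only module-level names, so nothing is re-resolved per call.
    return _CAPITALIZED.findall(text)


SAMPLE = "Riley opened Claude Code today."

local_time = timeit.timeit(lambda: extract_with_local_import(SAMPLE), number=20000)
hoisted_time = timeit.timeit(lambda: extract_hoisted(SAMPLE), number=20000)
print(f"local import: {local_time:.4f}s, hoisted: {hoisted_time:.4f}s")
```

Both functions return the same matches; only the per-call setup work differs.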
    from .entity_detector import _get_coca_filter

    coca_filter = _get_coca_filter()
The local import of _get_coca_filter and the subsequent call can be moved to the top level of the module. While build_closet_lines is called once per file (rather than once per chunk), moving imports to the top level is standard practice in Python to avoid the overhead of repeated local imports and to keep dependencies clear, unless lazy loading is specifically required to break circular imports.
References
- Imports should be placed at the top of the file to maintain clarity and follow standard Python conventions (PEP-8).
Bumps version 3.3.5 → 3.3.6 across pyproject.toml, version.py, plugin manifests (.claude-plugin/plugin.json, .claude-plugin/marketplace.json, .codex-plugin/plugin.json), README badge, and uv.lock. Flips CHANGELOG.md from ``[Unreleased]`` to ``[3.3.6] — 2026-05-24`` and backfills the major user-facing entries that landed without changelog entries during the cycle:

Features:
- MemPalace#1555 office-document mining via --mode extract + virtual line numbers
- MemPalace#1584 surgical closet pointers with date+line locators (Tier 6a)
- MemPalace#1558 + MemPalace#1560 within-wing hallways (entity co-occurrence graph)
- MemPalace#1565 cross-wing tunnels auto-promoted from hallways
- MemPalace#1578 Hebbian potentiation + Ebbinghaus decay on hallways/tunnels
- MemPalace#1236 API-tool transcripts auto-route to wing_api
- MemPalace#711 hooks.auto_save toggle for silent-mode sessions
- MemPalace#1605 COCA content-word filter for entity detection
- MemPalace#1557 case-insensitive entity matching at mine time
- MemPalace#1483 multilingual embeddings (embeddinggemma-300m) by default

Bug Fixes (selected, user-visible):
- MemPalace#1540 silent data loss in three unchunked upsert sites
- MemPalace#1538 paragraph chunker oversized chunks
- MemPalace#1554 per-file chunk cap too low for transcripts
- MemPalace#1562 Windows hook subprocess/ChromaDB deadlock
- MemPalace#1529 create_tunnel corrupted hyphenated wing names
- MemPalace#1424 save-hook truncated hyphenated project folders
- MemPalace#1383 KG cache duplicated graphs for symlinked/cased paths
- MemPalace#1466 silent symlink skip now logged
- MemPalace#1441 macOS stock-bash 3.2 hook compatibility
- MemPalace#1500 / MemPalace#1513 structured JSON-RPC errors on bad MCP input
- MemPalace#1523 VACUUM + FTS5 rebuild after repair
- MemPalace#1548 FTS5 validation at end of mine
- plus MemPalace#1216, MemPalace#1408, MemPalace#1438, MemPalace#1439, MemPalace#1445, MemPalace#1452, MemPalace#1459, MemPalace#1461, MemPalace#1466, MemPalace#1470, MemPalace#1477, MemPalace#1485, MemPalace#1500, MemPalace#1513, MemPalace#1528, MemPalace#1532, MemPalace#1543, MemPalace#1546, MemPalace#1585

Performance:
- MemPalace#1474 convo miner pre-fetches mined-set
- MemPalace#1487 rebuild_index progress callback
- MemPalace#1530 MCP cold-start diagnostics + opt-in warmup

Lint passes (ruff 0.15.14); mempalace-mcp entry point alignment verified per RELEASING.md.
…aths

The COCA content-word filter shipped in PR MemPalace#1605 imported `_get_coca_filter` and `_candidate_entity_words` locally inside two hot paths:

- `palace.build_closet_lines`: runs per source file during mine
- `miner._extract_entities_for_metadata`: runs per drawer during mine

Both imports are now at module top, where they're resolved once at import time instead of on every per-drawer call. Module-top imports also make the dependency graph visible to static analysis (pylint's C0415 was flagging the locals).

No behavior change. The `_get_coca_filter()` call is unchanged; only the import statement moved. End-to-end mining produces identical chromadb output.

Addresses the MEDIUM finding gemini-code-assist raised on PR MemPalace#1605 review.

Verification: full pytest 2258 passed / 3 skipped / coverage 85.35%. ruff check + format clean. Linux Py 3.9 / 3.11 / 3.13 via CI-matching `pip install -e ".[dev]"`: 2249 passed each. End-to-end mine of a test corpus produces the expected drawer + closet pointer.
…ot paths

Two release-blocking fixes for v3.3.6:

1. CI ruff pin drift

.github/workflows/ci.yml installed ruff==0.15.9 while pyproject.toml [dev] extras and .pre-commit-config.yaml both pin 0.15.14. Ruff's formatter output can change between minor versions, so a contributor running `pip install -e ".[dev]"` and formatting locally with 0.15.14 would produce output the 0.15.9 lint job rejects. Same failure mode that surfaced on PR MemPalace#1579 (2026-05-22). Aligning CI to 0.15.14 keeps the three pin sites in lock-step.

2. COCA filter imports inside per-drawer hot paths

PR MemPalace#1605 (COCA content-word filter, shipping in 3.3.6) introduced `from .entity_detector import _get_coca_filter` and `from .palace import _candidate_entity_words` inside _extract_entities_for_metadata (called per drawer) and build_closet_lines (called per closet). Python caches module imports so the runtime cost after the first call is small, but the import machinery still runs Python bytecode every invocation; gemini flagged this on the original PR. Hoisting to module-level removes the per-call import overhead.

The hoist is identical to PR MemPalace#1612, which targets develop. Folding it into the release so 3.3.6 doesn't ship the perf regression that 3.3.7 would immediately have to fix.

Verification: ruff check + format clean on 0.15.14, full pytest (2258 passed / 12 skipped) on Linux Py 3.9 / 3.11 / 3.13 via `pip install -e ".[dev]"` (CI-matching).
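
The pin drift described in this commit is the kind of thing a small consistency check can catch. The sketch below is hypothetical: the file excerpts are invented for illustration (only the file names and version numbers come from the commit message), and `find_pins` / `pins_agree` are not part of the repository.

```python
import re

# Invented excerpts; the real files may spell their ruff pins differently.
SOURCES = {
    ".github/workflows/ci.yml": "run: pip install ruff==0.15.9",
    "pyproject.toml": 'dev = ["ruff==0.15.14", "pytest"]',
    ".pre-commit-config.yaml": "rev: v0.15.14",
}

# Matches either a pip-style pin (ruff==X.Y.Z) or a pre-commit rev (rev: vX.Y.Z).
PIN = re.compile(r"(?:ruff==|rev:\s*v)(\d+\.\d+\.\d+)")


def find_pins(sources):
    """Map each file name to the first ruff version pinned in its text."""
    pins = {}
    for name, text in sources.items():
        match = PIN.search(text)
        if match:
            pins[name] = match.group(1)
    return pins


def pins_agree(pins):
    """True when every pin site names the same version."""
    return len(set(pins.values())) <= 1


pins = find_pins(SOURCES)
print(pins_agree(pins))  # False: ci.yml is on 0.15.9, the other two on 0.15.14
```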
…ot paths (#222)

Two release-blocking fixes for v3.3.6:

1. CI ruff pin drift

.github/workflows/ci.yml installed ruff==0.15.9 while pyproject.toml [dev] extras and .pre-commit-config.yaml both pin 0.15.14. Ruff's formatter output can change between minor versions, so a contributor running `pip install -e ".[dev]"` and formatting locally with 0.15.14 would produce output the 0.15.9 lint job rejects. Same failure mode that surfaced on PR MemPalace#1579 (2026-05-22). Aligning CI to 0.15.14 keeps the three pin sites in lock-step.

2. COCA filter imports inside per-drawer hot paths

PR MemPalace#1605 (COCA content-word filter, shipping in 3.3.6) introduced `from .entity_detector import _get_coca_filter` and `from .palace import _candidate_entity_words` inside _extract_entities_for_metadata (called per drawer) and build_closet_lines (called per closet). Python caches module imports so the runtime cost after the first call is small, but the import machinery still runs Python bytecode every invocation; gemini flagged this on the original PR. Hoisting to module-level removes the per-call import overhead.

The hoist is identical to PR MemPalace#1612, which targets develop. Folding it into the release so 3.3.6 doesn't ship the perf regression that 3.3.7 would immediately have to fix.

Verification: ruff check + format clean on 0.15.14, full pytest (2258 passed / 12 skipped) on Linux Py 3.9 / 3.11 / 3.13 via `pip install -e ".[dev]"` (CI-matching).

Co-authored-by: Milla J <232237854+milla-jovovich@users.noreply.github.com>
What does this PR do?

Filters common English content words ("Code", "Brutal", "Phase", "Line", "Note", "Planning", "Chat", …) from entity detection so they are no longer falsely tagged as people or projects.

Adds `mempalace/data/coca_content_words.json` (1016 lowercased common content words from the COCA top-2000 frequency list, POS-filtered to noun/adjective/verb; no proper nouns, no known product names). The filter is consulted at three sites where capitalized-word entity detection happens:

- `entity_detector.extract_candidates`: init-time detection
- `palace.build_closet_lines`: closet pointer construction
- `miner._extract_entities_for_metadata`: per-drawer entity tagging

Matching is case-insensitive. Multi-word phrases ("Claude Code") are not filtered, so legitimate compound names still get detected. Existing stale entries in a user's `known_entities.json` are not retroactively scrubbed; that is a separate planned PR (Entity CLI).

How to test

- `pip install -e ".[dev]"`
- `python -m pytest tests/test_entity_detector.py -v`: 65 tests, including 8 new tests covering the COCA filter behavior (case insensitivity, real-name preservation, multi-word phrase preservation, data file shape).
- Full suite: `python -m pytest tests/ -q`: 2191 passed, 3 skipped.
- Linux cross-platform: containers `python:3.9-slim`, `python:3.11-slim`, `python:3.13-slim` with `pip install -e ".[dev]"`, then `python -m pytest tests/test_entity_detector.py`; all 65 tests pass on each Python version.
- End-to-end live-palace test: mine a corpus where "Line" appears 3+ times (e.g., a markdown doc with "Line 3:", "Line 4:" prefixes), and confirm the resulting drawers' `entities` metadata does NOT include "Line".

Checklist

- Tests pass (`python -m pytest tests/ -v`): 2191 passed, 3 skipped
- Lint passes (`ruff check .`): clean