feat(entity): filter common English content words from entity detection by milla-jovovich · Pull Request #1605 · MemPalace/mempalace

milla-jovovich · 2026-05-24T11:31:20Z

Common English content words ("Code", "Brutal", "Phase", "Line", "Note", "Planning", "Chat", ...) frequently appear capitalized at sentence start, in headings, or in markdown — and the existing regex-based entity detector treats every capitalized word appearing 3+ times as a proper noun candidate. This produces false positives that pollute known_entities.json, the per-drawer entities metadata, and closet pointer entity lists.

Adds a curated list of ~1000 common English content words (nouns, adjectives, verbs — no proper nouns, no known product names) at mempalace/data/coca_content_words.json, derived from the COCA top-2000 frequency list and from observed false positives in real palaces. The list is loaded once at module import via _get_coca_filter() (cached frozenset) and consulted at three sites:

entity_detector.extract_candidates — init-time entity detection
palace.build_closet_lines — closet pointer construction
miner._extract_entities_for_metadata — per-drawer entity tagging

Matching is case-insensitive: the candidate is lowercased before membership check, so "Code", "CODE", and "code" are all blocked. Only single-word candidates are filtered; multi-word phrases (e.g. "Claude Code") are NOT filtered, so legitimate compound names continue to be detected. A future PR will add a known-systems lexicon that explicitly protects compound names — for now, the multi-word pass provides the implicit protection.
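The matching rule above can be sketched as follows. This is a minimal illustration, not the code from the PR: `SAMPLE_CONTENT_WORDS`, `is_filtered`, and `keep_candidates` are hypothetical names, and the tiny in-memory set stands in for the real wordlist in mempalace/data/coca_content_words.json.

```python
from typing import FrozenSet, Iterable, List

# Hypothetical stand-in for the shipped wordlist (assumption: lowercased entries).
SAMPLE_CONTENT_WORDS: FrozenSet[str] = frozenset(
    {"code", "brutal", "phase", "line", "note", "planning", "chat"}
)


def is_filtered(candidate: str, content_words: FrozenSet[str]) -> bool:
    """Return True when a single-word candidate is a common content word."""
    words = candidate.split()
    if len(words) != 1:
        # Multi-word phrases such as "Claude Code" are never filtered.
        return False
    # Lowercase before the membership check so "Code", "CODE", "code" all match.
    return words[0].lower() in content_words


def keep_candidates(candidates: Iterable[str], content_words: FrozenSet[str]) -> List[str]:
    """Drop filtered candidates while preserving the input order."""
    return [c for c in candidates if not is_filtered(c, content_words)]


kept = keep_candidates(
    ["Code", "CODE", "code", "Claude Code", "Riley", "Aya"],
    SAMPLE_CONTENT_WORDS,
)
```

Here `kept` retains "Claude Code", "Riley", and "Aya", while all three casings of "code" are dropped.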

Adds 8 new tests in tests/test_entity_detector.py covering:

"Code" appearing 5x is filtered
All known false positives from real palaces are filtered
Case-insensitive matching works
Real names ("Aya", "Riley") are NOT filtered
Multi-word phrases like "Claude Code" are NOT filtered
The data file ships in the wheel with the expected schema
The data file contains every known false positive

Note: this filter prevents NEW false positives at detection time. It does NOT retroactively scrub existing entries from a user's ~/.mempalace/known_entities.json. Stale entries from previous runs must be removed via a future entity-management CLI (planned).

Verification: pytest 2191 passed / 3 skipped / 0 failed on macOS. ruff check + format --check both clean. Cross-platform via pip install -e ".[dev]" on python:3.9-slim / 3.11-slim / 3.13-slim containers, each running 65 entity-detector tests including the new data-file-presence test that imports from the installed package. End-to-end pipeline test mining a deterministic corpus into an isolated palace confirmed entity metadata is empty (no false positives surface in real chromadb-backed storage).

What does this PR do?

Filters common English content words ("Code", "Brutal", "Phase", "Line", "Note", "Planning", "Chat", …) from entity detection so they no
longer get falsely tagged as people or projects.

Adds mempalace/data/coca_content_words.json (1016 lowercased common content words from the COCA top-2000 frequency list, POS-filtered to
noun/adjective/verb — no proper nouns, no known product names). The filter is consulted at three sites where capitalized-word entity
detection happens:

entity_detector.extract_candidates — init-time detection
palace.build_closet_lines — closet pointer construction
miner._extract_entities_for_metadata — per-drawer entity tagging

Case-insensitive matching. Multi-word phrases ("Claude Code") are not filtered, so legitimate compound names still get detected. Existing stale entries in a user's known_entities.json are not retroactively scrubbed — that's a separate planned PR (Entity CLI).

How to test

pip install -e ".[dev]"
python -m pytest tests/test_entity_detector.py -v — 65 tests, including 8 new tests covering the COCA filter behavior (case insensitivity, real-name preservation, multi-word phrase preservation, data file shape).
Full suite: python -m pytest tests/ -q — 2191 passed, 3 skipped.
Linux cross-platform: containers python:3.9-slim, python:3.11-slim, python:3.13-slim with pip install -e ".[dev]" then python -m pytest tests/test_entity_detector.py — all 65 tests pass on each Python version.
End-to-end live-palace test: mine a corpus where "Line" appears 3+ times (e.g., a markdown doc with "Line 3:", "Line 4:" prefixes), confirm the resulting drawers' entities metadata does NOT include "Line".
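The end-to-end check can be approximated without a palace. The sketch below assumes a simplified detector (a capitalized-word regex plus the 3+ occurrence threshold described above); `sketch_extract_candidates` and `CAPITALIZED_WORD` are hypothetical and are not the real `entity_detector.extract_candidates`.

```python
import re
from collections import Counter
from typing import FrozenSet, List

# Assumed pattern: one capital letter followed by lowercase letters.
CAPITALIZED_WORD = re.compile(r"\b[A-Z][a-z]+\b")


def sketch_extract_candidates(
    text: str, content_words: FrozenSet[str], min_count: int = 3
) -> List[str]:
    """Return capitalized words seen min_count+ times that are not content words."""
    counts = Counter(CAPITALIZED_WORD.findall(text))
    return sorted(
        word
        for word, n in counts.items()
        if n >= min_count and word.lower() not in content_words
    )


# A markdown-like corpus with "Line N:" prefixes, as in the manual test.
corpus = "\n".join(f"Line {i}: Riley reviewed the draft." for i in range(1, 5))

without_filter = sketch_extract_candidates(corpus, frozenset())
with_filter = sketch_extract_candidates(corpus, frozenset({"line"}))
```

Without the filter, both "Line" and "Riley" clear the threshold; with it, only "Riley" remains.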

Checklist

[x] Tests pass (python -m pytest tests/ -v): 2191 passed, 3 skipped
[x] No hardcoded paths: data file resolved via Path(__file__).parent / "data" / ...
[x] Linter passes (ruff check .): clean

gemini-code-assist

Code Review

This pull request introduces a COCA content-word filter to improve entity detection by excluding common English words that are frequently misidentified as proper nouns. The implementation includes a cached data loader in entity_detector.py and integration into the extraction pipelines in miner.py and palace.py, along with comprehensive unit tests. Review feedback suggests enhancing the robustness of the JSON loading logic by catching TypeError to prevent potential crashes on malformed data structures.

…n malformed JSON

The previous commit landed the code referencing mempalace/data/coca_content_words.json, but the data file itself was never staged (likely a multi-line paste mishap during the initial commit). CI failed on every platform because the file was not in the repo.

This commit ships the wordlist and adds TypeError to the COCA loader's graceful-degradation exception tuple, per the gemini-code-assist review on PR #1605.
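To illustrate why TypeError belongs in the exception tuple, here is a minimal loader sketch. It is an assumption-laden illustration rather than the PR's `_get_coca_filter`: the function name `load_wordlist` and the `{"words": [...]}` schema are hypothetical, since the actual file schema is not shown here.

```python
import json
import tempfile
from pathlib import Path
from typing import FrozenSet


def load_wordlist(path: Path) -> FrozenSet[str]:
    """Load a lowercased wordlist, degrading to an empty set on any failure."""
    try:
        with path.open(encoding="utf-8") as fh:
            data = json.load(fh)
        # Assumed schema: {"words": ["code", "line", ...]}.
        return frozenset(str(w).lower() for w in data["words"])
    except (OSError, ValueError, KeyError, TypeError):
        # OSError: file missing or unreadable.
        # ValueError: invalid JSON (json.JSONDecodeError subclasses it).
        # KeyError: JSON object without a "words" key.
        # TypeError: valid JSON of the wrong shape, e.g. a list or null.
        return frozenset()


with tempfile.TemporaryDirectory() as tmp:
    good = Path(tmp) / "good.json"
    good.write_text('{"words": ["Code", "line"]}', encoding="utf-8")
    wrong_shape = Path(tmp) / "wrong_shape.json"
    wrong_shape.write_text("[1, 2, 3]", encoding="utf-8")  # valid JSON, wrong shape

    loaded = load_wordlist(good)
    degraded = load_wordlist(wrong_shape)  # data["words"] raises TypeError
    missing = load_wordlist(Path(tmp) / "absent.json")
```

With an empty filter, detection falls back to its unfiltered behavior instead of crashing at import time.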

milla-jovovich · 2026-05-24T12:49:32Z

gemini-code-assist /review

gemini-code-assist

Code Review

This pull request implements a COCA content-word filter to reduce false-positive entity detections by excluding common English nouns, adjectives, and verbs. The changes include a new JSON wordlist, a cached loading mechanism, and integration into the entity extraction and mining processes. Review feedback recommends moving local imports and filter initializations to the top level of the modules to improve performance in hot paths and adhere to standard Python practices.

gemini-code-assist · 2026-05-24T12:51:51Z

+    from .entity_detector import _get_coca_filter
    from .palace import _candidate_entity_words

+    coca_filter = _get_coca_filter()


The local imports (_get_coca_filter, _candidate_entity_words) and the call to _get_coca_filter() are located inside _extract_entities_for_metadata, which is a hot path called for every drawer (chunk) during mining. In a large project, this can be executed tens of thousands of times.

Local imports and repeated function calls (even if cached via lru_cache) in such a hot loop introduce unnecessary overhead. Since there are no circular dependencies preventing it, please move these imports to the top of the file and consider calling _get_coca_filter() once at the module level or passing the filter down from a higher-level function.

Additionally, note that import re at line 853 and from .palace import _ENTITY_STOPLIST at line 855 (inside the same function) are also local imports that should be moved to the top level for efficiency.

References

Imports are always put at the top of the file, just after any module comments and docstrings, and before module globals and constants. Moving imports out of functions avoids repeated import overhead in hot loops.

gemini-code-assist · 2026-05-24T12:51:51Z

+    from .entity_detector import _get_coca_filter
+
+    coca_filter = _get_coca_filter()


The local import of _get_coca_filter and the subsequent call can be moved to the top level of the module. While build_closet_lines is called once per file (rather than once per chunk), moving imports to the top level is standard practice in Python to avoid the overhead of repeated local imports and to keep dependencies clear, unless lazy loading is specifically required to break circular imports.

References

Imports should be placed at the top of the file to maintain clarity and follow standard Python conventions (PEP-8).

…1590, #1605)

Bumps version 3.3.5 → 3.3.6 across pyproject.toml, version.py, plugin manifests (.claude-plugin/plugin.json, .claude-plugin/marketplace.json, .codex-plugin/plugin.json), README badge, and uv.lock. Flips CHANGELOG.md from ``[Unreleased]`` to ``[3.3.6] — 2026-05-24`` and backfills the major user-facing entries that landed without changelog entries during the cycle:

Features:
- MemPalace#1555 office-document mining via --mode extract + virtual line numbers
- MemPalace#1584 surgical closet pointers with date+line locators (Tier 6a)
- MemPalace#1558 + MemPalace#1560 within-wing hallways (entity co-occurrence graph)
- MemPalace#1565 cross-wing tunnels auto-promoted from hallways
- MemPalace#1578 Hebbian potentiation + Ebbinghaus decay on hallways/tunnels
- MemPalace#1236 API-tool transcripts auto-route to wing_api
- MemPalace#711 hooks.auto_save toggle for silent-mode sessions
- MemPalace#1605 COCA content-word filter for entity detection
- MemPalace#1557 case-insensitive entity matching at mine time
- MemPalace#1483 multilingual embeddings (embeddinggemma-300m) by default

Bug Fixes (selected, user-visible):
- MemPalace#1540 silent data loss in three unchunked upsert sites
- MemPalace#1538 paragraph chunker oversized chunks
- MemPalace#1554 per-file chunk cap too low for transcripts
- MemPalace#1562 Windows hook subprocess/ChromaDB deadlock
- MemPalace#1529 create_tunnel corrupted hyphenated wing names
- MemPalace#1424 save-hook truncated hyphenated project folders
- MemPalace#1383 KG cache duplicated graphs for symlinked/cased paths
- MemPalace#1466 silent symlink skip now logged
- MemPalace#1441 macOS stock-bash 3.2 hook compatibility
- MemPalace#1500 / MemPalace#1513 structured JSON-RPC errors on bad MCP input
- MemPalace#1523 VACUUM + FTS5 rebuild after repair
- MemPalace#1548 FTS5 validation at end of mine
- plus MemPalace#1216, MemPalace#1408, MemPalace#1438, MemPalace#1439, MemPalace#1445, MemPalace#1452, MemPalace#1459, MemPalace#1461, MemPalace#1466, MemPalace#1470, MemPalace#1477, MemPalace#1485, MemPalace#1500, MemPalace#1513, MemPalace#1528, MemPalace#1532, MemPalace#1543, MemPalace#1546, MemPalace#1585

Performance:
- MemPalace#1474 convo miner pre-fetches mined-set
- MemPalace#1487 rebuild_index progress callback
- MemPalace#1530 MCP cold-start diagnostics + opt-in warmup

Lint passes (ruff 0.15.14); mempalace-mcp entry point alignment verified per RELEASING.md.

…aths

The COCA content-word filter shipped in PR MemPalace#1605 imported `_get_coca_filter` and `_candidate_entity_words` locally inside two hot paths:
- `palace.build_closet_lines`: runs per source file during mine
- `miner._extract_entities_for_metadata`: runs per drawer during mine

Both imports are now at module top, where they're resolved once at import time instead of on every per-drawer call. Module-top imports also make the dependency graph visible to static analysis (pylint's C0415 was flagging the locals).

No behavior change. The `_get_coca_filter()` call is unchanged; only the import statement moved. End-to-end mining produces identical chromadb output.

Addresses the MEDIUM finding gemini-code-assist raised on PR MemPalace#1605 review.

Verification: full pytest 2258 passed / 3 skipped / coverage 85.35%. ruff check + format clean. Linux Py 3.9 / 3.11 / 3.13 via CI-matching `pip install -e ".[dev]"`: 2249 passed each. End-to-end mine of a test corpus produces the expected drawer + closet pointer.
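The difference between a function-local import and a hoisted one can be sketched with the standard library alone. This is an illustration under stated assumptions, not the mempalace code: `words_local` and `words_hoisted` are hypothetical functions, and `import re` stands in for the relative imports that were moved.

```python
import re  # hoisted: the import statement runs once, when the module loads
import timeit
from typing import List

SAMPLE = "Line 3: Riley met Aya. Line 4: Riley left."


def words_local(text: str) -> List[str]:
    # Function-local import: the statement executes on every call.
    # The module is already cached in sys.modules, so this is a lookup
    # plus a local name binding, not a full re-import.
    import re

    return re.findall(r"[A-Z][a-z]+", text)


def words_hoisted(text: str) -> List[str]:
    # Uses the module-level name; no import work per call.
    return re.findall(r"[A-Z][a-z]+", text)


if __name__ == "__main__":
    # Timings vary by machine and Python version, so none are quoted here.
    local_s = timeit.timeit(lambda: words_local(SAMPLE), number=10_000)
    hoisted_s = timeit.timeit(lambda: words_hoisted(SAMPLE), number=10_000)
    print(f"local import:   {local_s:.4f}s")
    print(f"hoisted import: {hoisted_s:.4f}s")
```

Both functions return the same result, which matches the commit's "no behavior change" claim; the per-call saving is typically small, and the clearer dependency graph is arguably the larger benefit.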

…ot paths

Two release-blocking fixes for v3.3.6:

1. CI ruff pin drift

.github/workflows/ci.yml installed ruff==0.15.9 while pyproject.toml [dev] extras and .pre-commit-config.yaml both pin 0.15.14. Ruff's formatter output can change between minor versions, so a contributor running `pip install -e ".[dev]"` and formatting locally with 0.15.14 would produce output the 0.15.9 lint job rejects. Same failure mode that surfaced on PR MemPalace#1579 (2026-05-22). Aligning CI to 0.15.14 keeps the three pin sites in lock-step.

2. COCA filter imports inside per-drawer hot paths

PR MemPalace#1605 (COCA content-word filter, shipping in 3.3.6) introduced `from .entity_detector import _get_coca_filter` and `from .palace import _candidate_entity_words` inside _extract_entities_for_metadata (called per drawer) and build_closet_lines (called per closet). Python caches module imports so the runtime cost after the first call is small, but the import machinery still runs Python bytecode every invocation; gemini flagged this on the original PR. Hoisting to module-level removes the per-call import overhead.

The hoist is identical to PR MemPalace#1612, which targets develop. Folding it into the release so 3.3.6 doesn't ship the perf regression that 3.3.7 would immediately have to fix.

Verification: ruff check + format clean on 0.15.14, full pytest (2258 passed / 12 skipped) on Linux Py 3.9 / 3.11 / 3.13 via `pip install -e ".[dev]"` (CI-matching).

…ot paths (#222)

Two release-blocking fixes for v3.3.6:

1. CI ruff pin drift

.github/workflows/ci.yml installed ruff==0.15.9 while pyproject.toml [dev] extras and .pre-commit-config.yaml both pin 0.15.14. Ruff's formatter output can change between minor versions, so a contributor running `pip install -e ".[dev]"` and formatting locally with 0.15.14 would produce output the 0.15.9 lint job rejects. Same failure mode that surfaced on PR MemPalace#1579 (2026-05-22). Aligning CI to 0.15.14 keeps the three pin sites in lock-step.

2. COCA filter imports inside per-drawer hot paths

PR MemPalace#1605 (COCA content-word filter, shipping in 3.3.6) introduced `from .entity_detector import _get_coca_filter` and `from .palace import _candidate_entity_words` inside _extract_entities_for_metadata (called per drawer) and build_closet_lines (called per closet). Python caches module imports so the runtime cost after the first call is small, but the import machinery still runs Python bytecode every invocation; gemini flagged this on the original PR. Hoisting to module-level removes the per-call import overhead.

The hoist is identical to PR MemPalace#1612, which targets develop. Folding it into the release so 3.3.6 doesn't ship the perf regression that 3.3.7 would immediately have to fix.

Verification: ruff check + format clean on 0.15.14, full pytest (2258 passed / 12 skipped) on Linux Py 3.9 / 3.11 / 3.13 via `pip install -e ".[dev]"` (CI-matching).

Co-authored-by: Milla J <232237854+milla-jovovich@users.noreply.github.com>

milla-jovovich self-assigned this May 24, 2026

milla-jovovich requested a review from igorls as a code owner May 24, 2026 11:31

gemini-code-assist Bot reviewed May 24, 2026

Comment thread mempalace/entity_detector.py Outdated

milla-jovovich merged commit c0a7da1 into develop May 24, 2026
6 checks passed

gemini-code-assist Bot reviewed May 24, 2026

igorls added a commit that referenced this pull request May 24, 2026

Merge origin/develop into feat/benchmark-multilingual (#1548, #711, #…

b931151

…1590, #1605)

igorls mentioned this pull request May 24, 2026

Release v3.3.6 #1610

Merged

3 tasks

milla-jovovich mentioned this pull request May 25, 2026

feat(entity): opt-in spaCy NER augmentation via mempalace[nlp] extra #1616

Open

5 tasks

milla-jovovich merged 2 commits into develop from feat/coca-content-word-filter
