feat(entity): filter common English content words from entity detection #1605

Merged: milla-jovovich merged 2 commits into develop from feat/coca-content-word-filter on May 24, 2026

Conversation

@milla-jovovich

Collaborator

Common English content words ("Code", "Brutal", "Phase", "Line", "Note", "Planning", "Chat", ...) frequently appear capitalized at sentence start, in headings, or in markdown — and the existing regex-based entity detector treats every capitalized word appearing 3+ times as a proper noun candidate. This produces false positives that pollute known_entities.json, the per-drawer entities metadata, and closet pointer entity lists.
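To make the failure mode concrete, here is a minimal sketch of a count-based capitalized-word detector. The regex and the `naive_candidates` helper are illustrative assumptions, not the actual code in entity_detector.py; only the "capitalized word appearing 3+ times" rule comes from the description above.

```python
import re
from collections import Counter

# Assumed pattern for illustration: one capital letter followed by lowercase letters.
CAPITALIZED = re.compile(r"\b[A-Z][a-z]+\b")


def naive_candidates(text, min_count=3):
    """Return capitalized words seen at least `min_count` times (hypothetical helper)."""
    counts = Counter(CAPITALIZED.findall(text))
    return sorted(word for word, n in counts.items() if n >= min_count)


sample = (
    "Note the change. Note the date. Note the owner.\n"
    "Line 3: Riley fixed it. Line 4: Riley tested it. Line 5: Riley shipped it.\n"
)

# "Note" and "Line" surface next to the real name "Riley".
print(naive_candidates(sample))  # ['Line', 'Note', 'Riley']
```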

Adds a curated list of ~1000 common English content words (nouns, adjectives, verbs — no proper nouns, no known product names) at mempalace/data/coca_content_words.json, derived from the COCA top-2000 frequency list and from observed false positives in real palaces. The list is loaded once at module import via _get_coca_filter() (cached frozenset) and consulted at three sites:

  • entity_detector.extract_candidates — init-time entity detection
  • palace.build_closet_lines — closet pointer construction
  • miner._extract_entities_for_metadata — per-drawer entity tagging
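The load-once behaviour can be sketched roughly as follows. This is an approximation under stated assumptions: `get_content_word_filter` is a hypothetical stand-in for `_get_coca_filter()`, it takes an explicit path so the demo can use a throwaway file, and the `{"words": [...]}` schema is assumed because the real file layout is not shown here.

```python
import json
import tempfile
from functools import lru_cache
from pathlib import Path


@lru_cache(maxsize=None)
def get_content_word_filter(path: str) -> frozenset:
    """Read the wordlist at `path` once; later calls reuse the cached frozenset."""
    with open(path, encoding="utf-8") as fh:
        payload = json.load(fh)
    # Assumed schema: {"words": [...]}; the real data file may be laid out differently.
    return frozenset(word.lower() for word in payload["words"])


# Demo with a temporary file standing in for coca_content_words.json.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "demo_words.json"
    demo_path.write_text(json.dumps({"words": ["Code", "phase", "line"]}), encoding="utf-8")
    first = get_content_word_filter(str(demo_path))
    second = get_content_word_filter(str(demo_path))  # cache hit, the file is not re-read

print(sorted(first), first is second)  # ['code', 'line', 'phase'] True
```

A frozenset gives constant-time membership checks and cannot be mutated by a caller, which suits a filter shared across three call sites.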

Matching is case-insensitive: the candidate is lowercased before membership check, so "Code", "CODE", and "code" are all blocked. Only single-word candidates are filtered; multi-word phrases (e.g. "Claude Code") are NOT filtered, so legitimate compound names continue to be detected. A future PR will add a known-systems lexicon that explicitly protects compound names — for now, the multi-word pass provides the implicit protection.
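The matching rule can be sketched as below. `filter_candidates` is a hypothetical helper written for illustration, and the one-word filter set stands in for the real wordlist.

```python
def filter_candidates(candidates, content_words):
    """Drop single-word candidates whose lowercase form is a common content word."""
    kept = []
    for candidate in candidates:
        is_single_word = len(candidate.split()) == 1
        if is_single_word and candidate.lower() in content_words:
            continue  # blocked regardless of original casing
        kept.append(candidate)  # multi-word phrases and unknown words pass through
    return kept


demo_filter = frozenset({"code"})  # stand-in for the real wordlist
print(filter_candidates(["Code", "CODE", "code", "Claude Code", "Aya"], demo_filter))
# ['Claude Code', 'Aya']
```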

Adds 8 new tests in tests/test_entity_detector.py covering:

  • "Code" appearing 5x is filtered
  • All known false positives from real palaces are filtered
  • Case-insensitive matching works
  • Real names ("Aya", "Riley") are NOT filtered
  • Multi-word phrases like "Claude Code" are NOT filtered
  • The data file ships in the wheel with the expected schema
  • The data file contains every known false positive

Note: this filter prevents NEW false positives at detection time. It does NOT retroactively scrub existing entries from a user's ~/.mempalace/known_entities.json. Stale entries from previous runs must be removed via a future entity-management CLI (planned).

Verification: pytest 2191 passed / 3 skipped / 0 failed on macOS. ruff check + format --check both clean. Cross-platform via pip install -e ".[dev]" on python:3.9-slim / 3.11-slim / 3.13-slim containers, each running 65 entity-detector tests including the new data-file-presence test that imports from the installed package. End-to-end pipeline test mining a deterministic corpus into an isolated palace confirmed entity metadata is empty (no false positives surface in real chromadb-backed storage).

What does this PR do?

Filters common English content words ("Code", "Brutal", "Phase", "Line", "Note", "Planning", "Chat", …) from entity detection so they no longer get falsely tagged as people or projects.

Adds mempalace/data/coca_content_words.json (1016 lowercased common content words from the COCA top-2000 frequency list, POS-filtered to noun/adjective/verb; no proper nouns, no known product names). The filter is consulted at three sites where capitalized-word entity detection happens:

  • entity_detector.extract_candidates — init-time detection
  • palace.build_closet_lines — closet pointer construction
  • miner._extract_entities_for_metadata — per-drawer entity tagging

Case-insensitive matching. Multi-word phrases ("Claude Code") are not filtered, so legitimate compound names still get detected. Existing stale entries in a user's known_entities.json are not retroactively scrubbed — that's a separate planned PR (Entity CLI).

How to test

  1. pip install -e ".[dev]"

  2. python -m pytest tests/test_entity_detector.py -v — 65 tests, including 8 new tests covering the COCA filter behavior (case insensitivity, real-name preservation, multi-word phrase preservation, data file shape).

  3. Full suite: python -m pytest tests/ -q — 2191 passed, 3 skipped.

  4. Linux cross-platform: containers python:3.9-slim, python:3.11-slim, python:3.13-slim with pip install -e ".[dev]" then python -m pytest tests/test_entity_detector.py — all 65 tests pass on each Python version.

  5. End-to-end live-palace test: mine a corpus where "Line" appears 3+ times (e.g., a markdown doc with "Line 3:", "Line 4:" prefixes), confirm the resulting drawers' entities metadata does NOT include "Line".

Checklist

  • [x] Tests pass (python -m pytest tests/ -v): 2191 passed, 3 skipped
  • [x] No hardcoded paths: data file resolved via Path(__file__).parent / "data" / ...
  • [x] Linter passes (ruff check .): clean

@milla-jovovich milla-jovovich self-assigned this May 24, 2026
@milla-jovovich milla-jovovich requested a review from igorls as a code owner May 24, 2026 11:31

@gemini-code-assist gemini-code-assist Bot left a comment

Contributor

Code Review

This pull request introduces a COCA content-word filter to improve entity detection by excluding common English words that are frequently misidentified as proper nouns. The implementation includes a cached data loader in entity_detector.py and integration into the extraction pipelines in miner.py and palace.py, along with comprehensive unit tests. Review feedback suggests enhancing the robustness of the JSON loading logic by catching TypeError to prevent potential crashes on malformed data structures.
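As a rough illustration of why `TypeError` belongs in the loader's exception handling, the sketch below feeds a few malformed payloads to a hypothetical parser and reports which exception each one raises. The `{"words": [...]}` schema and both helper names are assumptions; the actual loader and its exception tuple are not shown in this thread.

```python
import json


def parse_wordlist(raw):
    """Parse a wordlist payload; assumed schema is {"words": [...]}."""
    payload = json.loads(raw)
    return frozenset(word.lower() for word in payload["words"])


def failure_type(raw):
    """Return the name of the exception raised while parsing, or None on success."""
    try:
        parse_wordlist(raw)
    except Exception as exc:  # broad on purpose: this helper only reports the type
        return type(exc).__name__
    return None


print(failure_type('{"words": ["code"]}'))  # None
print(failure_type('{not json'))            # JSONDecodeError (a ValueError subclass)
print(failure_type('{}'))                   # KeyError
print(failure_type('["code"]'))             # TypeError: a list cannot be indexed by "words"
print(failure_type('{"words": null}'))      # TypeError: None is not iterable
```

Valid JSON with the wrong shape slips past handlers that only expect decode errors or missing keys, which is the case the review feedback targets.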

Comment thread mempalace/entity_detector.py Outdated
…n malformed JSON

The previous commit landed the code referencing mempalace/data/coca_content_words.json
but the data file itself was never actually staged (likely a multi-line paste mishap
during the initial commit). CI on every platform failed because the file isn't in
the repo. This commit ships the wordlist + adds TypeError to the COCA loader's
graceful-degradation exception tuple per gemini-code-assist review on PR #1605.
@milla-jovovich milla-jovovich merged commit c0a7da1 into develop May 24, 2026
6 checks passed
@milla-jovovich

Collaborator Author

gemini-code-assist /review

@gemini-code-assist gemini-code-assist Bot left a comment

Contributor

Code Review

This pull request implements a COCA content-word filter to reduce false-positive entity detections by excluding common English nouns, adjectives, and verbs. The changes include a new JSON wordlist, a cached loading mechanism, and integration into the entity extraction and mining processes. Review feedback recommends moving local imports and filter initializations to the top level of the modules to improve performance in hot paths and adhere to standard Python practices.

Comment thread mempalace/miner.py
Comment on lines +868 to +871
from .entity_detector import _get_coca_filter
from .palace import _candidate_entity_words

coca_filter = _get_coca_filter()

Contributor

medium

The local imports (_get_coca_filter, _candidate_entity_words) and the call to _get_coca_filter() are located inside _extract_entities_for_metadata, which is a hot path called for every drawer (chunk) during mining. In a large project, this can be executed tens of thousands of times.

Local imports and repeated function calls (even if cached via lru_cache) in such a hot loop introduce unnecessary overhead. Since there are no circular dependencies preventing it, please move these imports to the top of the file and consider calling _get_coca_filter() once at the module level or passing the filter down from a higher-level function.

Additionally, note that import re at line 853 and from .palace import _ENTITY_STOPLIST at line 855 (inside the same function) are also local imports that should be moved to the top level for efficiency.

References
  1. Imports are always put at the top of the file, just after any module comments and docstrings, and before module globals and constants. Moving imports out of functions avoids repeated import overhead in hot loops. (link)
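The difference between the two import placements can be sketched with a stdlib module standing in for the project's own imports. Both function names are hypothetical, and the timing loop is only a way to observe the overhead locally; no particular numbers are implied.

```python
import timeit
from json import dumps  # module-level import, resolved once when the module loads


def encode_hoisted(value):
    """Uses the module-level name; no import work per call."""
    return dumps(value)


def encode_local(value):
    """Re-executes the import statement on every call (served from the module cache)."""
    from json import dumps as local_dumps
    return local_dumps(value)


if __name__ == "__main__":
    for fn in (encode_local, encode_hoisted):
        elapsed = timeit.timeit(lambda: fn({"a": 1}), number=100_000)
        print(f"{fn.__name__}: {elapsed:.3f}s")
```

Both functions return the same result; only the per-call import lookup differs.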

Comment thread mempalace/palace.py
Comment on lines +267 to +269
from .entity_detector import _get_coca_filter

coca_filter = _get_coca_filter()

Contributor

medium

The local import of _get_coca_filter and the subsequent call can be moved to the top level of the module. While build_closet_lines is called once per file (rather than once per chunk), moving imports to the top level is standard practice in Python to avoid the overhead of repeated local imports and to keep dependencies clear, unless lazy loading is specifically required to break circular imports.

References
  1. Imports should be placed at the top of the file to maintain clarity and follow standard Python conventions (PEP-8). (link)

@igorls igorls mentioned this pull request May 24, 2026
3 tasks
arnoldwender pushed a commit to arnoldwender/mempalace that referenced this pull request May 24, 2026
Bumps version 3.3.5 → 3.3.6 across pyproject.toml, version.py, plugin
manifests (.claude-plugin/plugin.json, .claude-plugin/marketplace.json,
.codex-plugin/plugin.json), README badge, and uv.lock. Flips CHANGELOG.md
from ``[Unreleased]`` to ``[3.3.6] — 2026-05-24`` and backfills the
major user-facing entries that landed without changelog entries during
the cycle:

Features:
- MemPalace#1555 office-document mining via --mode extract + virtual line numbers
- MemPalace#1584 surgical closet pointers with date+line locators (Tier 6a)
- MemPalace#1558 + MemPalace#1560 within-wing hallways (entity co-occurrence graph)
- MemPalace#1565 cross-wing tunnels auto-promoted from hallways
- MemPalace#1578 Hebbian potentiation + Ebbinghaus decay on hallways/tunnels
- MemPalace#1236 API-tool transcripts auto-route to wing_api
- MemPalace#711 hooks.auto_save toggle for silent-mode sessions
- MemPalace#1605 COCA content-word filter for entity detection
- MemPalace#1557 case-insensitive entity matching at mine time
- MemPalace#1483 multilingual embeddings (embeddinggemma-300m) by default

Bug Fixes (selected, user-visible):
- MemPalace#1540 silent data loss in three unchunked upsert sites
- MemPalace#1538 paragraph chunker oversized chunks
- MemPalace#1554 per-file chunk cap too low for transcripts
- MemPalace#1562 Windows hook subprocess/ChromaDB deadlock
- MemPalace#1529 create_tunnel corrupted hyphenated wing names
- MemPalace#1424 save-hook truncated hyphenated project folders
- MemPalace#1383 KG cache duplicated graphs for symlinked/cased paths
- MemPalace#1466 silent symlink skip now logged
- MemPalace#1441 macOS stock-bash 3.2 hook compatibility
- MemPalace#1500 / MemPalace#1513 structured JSON-RPC errors on bad MCP input
- MemPalace#1523 VACUUM + FTS5 rebuild after repair
- MemPalace#1548 FTS5 validation at end of mine
- plus MemPalace#1216, MemPalace#1408, MemPalace#1438, MemPalace#1439, MemPalace#1445, MemPalace#1452, MemPalace#1459, MemPalace#1461, MemPalace#1466,
  MemPalace#1470, MemPalace#1477, MemPalace#1485, MemPalace#1500, MemPalace#1513, MemPalace#1528, MemPalace#1532, MemPalace#1543, MemPalace#1546, MemPalace#1585

Performance:
- MemPalace#1474 convo miner pre-fetches mined-set
- MemPalace#1487 rebuild_index progress callback
- MemPalace#1530 MCP cold-start diagnostics + opt-in warmup

Lint passes (ruff 0.15.14); mempalace-mcp entry point alignment
verified per RELEASING.md.
mvalentsev pushed a commit to mvalentsev/mempalace that referenced this pull request May 24, 2026
…aths

The COCA content-word filter shipped in PR MemPalace#1605 imported
`_get_coca_filter` and `_candidate_entity_words` locally inside two
hot paths:

  - `palace.build_closet_lines` — runs per source file during mine
  - `miner._extract_entities_for_metadata` — runs per drawer during mine

Both imports are now at module top, where they're resolved once at
import time instead of on every per-drawer call. Module-top imports
also make the dependency graph visible to static analysis (pylint's
C0415 was flagging the locals).

No behavior change. The `_get_coca_filter()` call is unchanged — only
the import statement moved. End-to-end mining produces identical
chromadb output. Addresses the MEDIUM finding gemini-code-assist
raised on PR MemPalace#1605 review.

Verification: full pytest 2258 passed / 3 skipped / coverage 85.35%.
ruff check + format clean. Linux Py 3.9 / 3.11 / 3.13 via CI-matching
`pip install -e ".[dev]"`: 2249 passed each. End-to-end mine of a
test corpus produces the expected drawer + closet pointer.
kekse1 pushed a commit to kekse1/mempalace that referenced this pull request May 25, 2026
…ot paths

Two release-blocking fixes for v3.3.6:

1. CI ruff pin drift
   .github/workflows/ci.yml installed ruff==0.15.9 while pyproject.toml
   [dev] extras and .pre-commit-config.yaml both pin 0.15.14. Ruff's
   formatter output can change between minor versions, so a contributor
   running `pip install -e ".[dev]"` and formatting locally with 0.15.14
   would produce output the 0.15.9 lint job rejects. Same failure mode
   that surfaced on PR MemPalace#1579 (2026-05-22). Aligning CI to 0.15.14 keeps
   the three pin sites in lock-step.

2. COCA filter imports inside per-drawer hot paths
   PR MemPalace#1605 (COCA content-word filter, shipping in 3.3.6) introduced
   `from .entity_detector import _get_coca_filter` and
   `from .palace import _candidate_entity_words` inside
   _extract_entities_for_metadata (called per drawer) and
   build_closet_lines (called per closet). Python caches module imports
   so the runtime cost after the first call is small, but the import
   machinery still runs Python bytecode every invocation — gemini
   flagged this on the original PR. Hoisting to module-level removes
   the per-call import overhead.

   The hoist is identical to PR MemPalace#1612, which targets develop. Folding
   it into the release so 3.3.6 doesn't ship the perf regression that
   3.3.7 would immediately have to fix.

Verification: ruff check + format clean on 0.15.14, full pytest
(2258 passed / 12 skipped) on Linux Py 3.9 / 3.11 / 3.13 via
`pip install -e ".[dev]"` (CI-matching).
jphein pushed a commit to techempower-org/mempalace that referenced this pull request May 26, 2026
…ot paths

jphein added a commit to techempower-org/mempalace that referenced this pull request May 26, 2026
…ot paths (#222)


Co-authored-by: Milla J <232237854+milla-jovovich@users.noreply.github.com>