Skip to content

feat: add OpenAI Codex CLI JSONL normalizer#5

Closed
dpschen wants to merge 1 commit into
MemPalace:mainfrom
dpschen:feat/add-codex-normalizer
Closed

feat: add OpenAI Codex CLI JSONL normalizer#5
dpschen wants to merge 1 commit into
MemPalace:mainfrom
dpschen:feat/add-codex-normalizer

Conversation

@dpschen

@dpschen dpschen commented Apr 7, 2026

Copy link
Copy Markdown

What this does

Adds support for parsing OpenAI Codex CLI session files (JSONL format stored at ~/.codex/sessions/YYYY/MM/DD/rollout-*.jsonl).

How it works

Codex uses a different structure than Claude Code:

  • Each line has {type: "response_item", payload: {type: "message", role: "user"|"assistant", content: [...]}}
  • This extracts role from payload.role and content from payload.content

Matches the existing Claude Code parser structure — same line-by-line JSONL approach, same content extraction, same transcript output format.

Why it matters

Users accumulate hundreds of Codex sessions locally (each containing real conversations, decisions, and debugging sessions). This lets mempalace mine all of them directly.

Testing

  • Tested against actual Codex session files from ~/.codex/sessions/
  • No regressions to existing Claude Code, ChatGPT, or Slack parsers

Parse Codex session JSONL files with response_item/payload structure.
Extracts user and assistant messages from Codex's nested payload format
where role is 'user' or 'assistant' inside payload.role, matching the
same structure as the existing Claude Code parser but adapted for Codex.
@dpschen

dpschen commented Apr 7, 2026

Copy link
Copy Markdown
Author

sry this wasn’t ready

@dpschen dpschen closed this Apr 7, 2026
This was referenced Apr 8, 2026
brandonhon added a commit to brandonhon/mempalace that referenced this pull request Apr 10, 2026
…tion

Addresses PR MemPalace#548 review feedback about scan amplification on large
palaces. The previous implementation made up to six ChromaDB scans per
clean operation:

  1. count_drawers(drawers_col)       — scan MemPalace#1
  2. count_drawers(compressed_col)    — scan MemPalace#2
  3. delete_drawers(drawers_col)      — internal get(where=...) scan MemPalace#3
  4. delete_drawers(drawers_col)      — delete(where=...) scan MemPalace#4
  5. delete_drawers(compressed_col)   — scan MemPalace#5
  6. delete_drawers(compressed_col)   — scan MemPalace#6

Each call was doing its own metadata filter over the collection,
meaning a 100K-drawer palace paid the filter cost six times for a
single cleanup. This refactor drops it to exactly two scans — one per
collection — regardless of palace size.

Changes:

  * palace.py: introduce `find_drawer_ids(col, wing, room)` which
    returns the matching ID list in a single `get(where=..., include=[])`
    call. ChromaDB fetches only IDs — no documents, embeddings, or
    metadatas — so the scan is as cheap as ChromaDB can make it.
  * palace.py: `count_drawers` is now a thin wrapper around
    `find_drawer_ids`. The old standalone `delete_drawers` helper is
    removed because its count-then-delete pattern is exactly what we
    are trying to avoid.
  * cli.py::cmd_clean: call `find_drawer_ids` once per collection,
    reuse the returned ID lists for both the preview counts and the
    subsequent delete. Deletes go through `col.delete(ids=[...])`
    which is an O(n) primary-key delete, not another metadata scan.
    Empty lists are guarded to stay compatible with ChromaDB versions
    that reject `delete()` with no ids or where filter.

Why two scans and not one:

ChromaDB collections are independent — there is no cross-collection
query. `mempalace_drawers` and `mempalace_compressed` must each be
filtered separately. A single-scan variant would have to assume that
compressed IDs are a strict subset of drawer IDs and delete compressed
by the drawer IDs we already found, but `tool_delete_drawer` in
mcp_server.py does not cascade to compressed, so real palaces can
contain orphaned compressed rows whose drawer is already gone. Going
to one scan would silently leak those orphans. Two scans is the
minimum that preserves correctness.

Tests:

  * New `test_find_drawer_ids_*` unit tests cover wing-only, wing+room,
    and no-match cases.
  * New `test_find_drawer_ids_single_scan` monkey-patches
    `collection.get` to assert exactly one call.
  * New `test_clean_scans_each_collection_exactly_once` is a
    regression test that wraps `chromadb.PersistentClient` and counts
    `where`-filtered `get()` calls per collection during a full
    `cmd_clean` invocation, failing if either collection is scanned
    more than once.
  * The existing 15 CLI black-box tests stay identical — the behavior
    is unchanged, only the scan count dropped.

Full suite: 551 passed (was 549, +2 new perf regression tests).

Also in this commit: `-V` / `--version` flags, because every good app
needs a version flag and we somehow shipped three minor releases
without one. The installed version is now embedded in the `-h`
description line too, so `mempalace -h` answers "what am I running?"
without a separate invocation.
jphein referenced this pull request in techempower-org/mempalace Apr 12, 2026
Document Claude Code's two memory layers (auto-memory flat files vs
MemPalace archive) and correct Auto Dream status — it's unreleased
code behind a disabled feature flag, not a shipped feature. TODO #4
(decay) and #5 (feedback) remain full priority.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
OmkarKirpan added a commit to OmkarKirpan/mempalace that referenced this pull request Apr 15, 2026
- ChromaBackend.create_collection() now accepts embedding_function
  and embedding_model_name params
- cli.py repair, repair.py rebuild_index: read embedding model from
  existing collection before delete/recreate, preserve it
- migrate.py: stamp new_palace_model() on migrated palaces
- palace.get_collection(): accept optional config param so CLI mining
  respects config.json embedding_model setting
- Update test_rebuild_index_success to verify new embedding args

Addresses code review findings MemPalace#4, MemPalace#5, MemPalace#7 for MemPalace#903
rusel95 added a commit to rusel95/mempalace that referenced this pull request Apr 15, 2026
Addresses review items MemPalace#5 and MemPalace#6 from @igorls:

1. Extract core sync logic from cmd_sync (~200 lines) into mempalace/sync.py
   as sync_palace(...) returning a SyncReport dataclass. cmd_sync is now a
   thin CLI wrapper. Makes sync callable from MCP tools, tests, and future
   change-detection features (PII Guard, KG sync).

2. Replace direct chromadb.PersistentClient calls in _force_clean and
   cmd_sync with ChromaBackend.get_collection. All storage access now goes
   through the backend abstraction. _force_clean is also now a thin wrapper
   around sync.force_clean.

3. Document mempalace_sync_status in website/reference/mcp-tools.md so it
   passes test_no_undocumented_tools.

Also ran ruff format with CI-pinned 0.4.x.

All 956 tests pass.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
felipetruman added a commit to felipetruman/mempalace that referenced this pull request Apr 17, 2026
…ut, tests

Addresses the six Copilot review comments on the initial commit.

1) MemPalace#6 (critical) — mcp_server.py `_get_collection` bypassed ChromaBackend

   The MCP server creates its palace collection directly via
   `chromadb.PersistentClient.get_or_create_collection` in `_get_collection`,
   not through `ChromaBackend.get_collection`. That path was missing the
   `hnsw:num_threads=1` metadata, so the primary crash surface for MemPalace#974
   and MemPalace#965 was untouched by the original patch. Fixed by passing
   `hnsw:num_threads=1` at the mcp_server create site too. Documented
   in a code comment that the setting is only honored at creation
   time — existing palaces created before this fix still need a
   `mempalace nuke` + re-mine to gain the protection.

2) MemPalace#3 — mine_global_lock over-serialized mines across unrelated palaces

   Replaced the single global lock file `mine_global.lock` with a
   per-palace lock keyed by `sha256(os.path.abspath(palace_path))`
   (`mine_palace_<hash>.lock`). Mines against the same palace still
   collapse to a single runner (the correctness boundary), but mines
   against *different* palaces are now free to run in parallel.
   `mine_global_lock` is kept as a backward-compatible alias for
   `mine_palace_lock` so any external callers that imported the
   previous name keep working.

3) MemPalace#1 — hook_precompact swallowed OSError but not subprocess.TimeoutExpired

   `subprocess.run(..., timeout=60)` raises `TimeoutExpired` on slow
   palaces. The previous `except OSError` clause didn't catch it, so
   the hook could raise and fail to emit any JSON decision — leaving
   the harness without a block/passthrough signal. Fixed by catching
   `(OSError, subprocess.TimeoutExpired)` together and always falling
   through to the block decision so the hook reliably emits a response.

4) MemPalace#2 + MemPalace#4 — tests

   - tests/test_hooks_cli.py: added
     `test_precompact_first_two_attempts_block`,
     `test_precompact_passes_through_after_cap`, and
     `test_precompact_counter_is_per_session` to lock in the MemPalace#955
     deadlock fix.
   - tests/test_palace_locks.py (new): covers `mine_palace_lock`
     single-acquire, reuse-after-release, cross-process serialization
     on the same palace, non-interference across different palaces,
     path normalization, and the `mine_global_lock` back-compat alias.

5) MemPalace#5 — known limitation, documented but not auto-fixed

   Copilot suggested detecting collections missing `hnsw:num_threads=1`
   and calling `collection.modify(metadata=...)` to retrofit existing
   palaces. Verified against chromadb 1.5.7: `modify(metadata=...)`
   replaces metadata rather than merging, and re-passing
   `hnsw:space="cosine"` then raises `ValueError: Changing the
   distance function of a collection once it is created is not
   supported currently.` The HNSW runtime configuration
   (`configuration_json`) also does not expose `num_threads` in
   chromadb 1.5.x, so the flag appears to be read only at creation
   time. Rather than paper over the limitation with a best-effort
   `modify` that silently drops `hnsw:space`, documented in the
   mcp_server comment that pre-existing palaces need a
   `mempalace nuke` + re-mine to gain the protection. Fresh palaces
   are always protected.

Testing
- pytest tests/test_palace_locks.py tests/test_hooks_cli.py
  tests/test_backends.py tests/test_cli.py → **98 passed, 0 failed**.
- Runtime validation with two concurrent `mempalace mine` calls:
  - Different palaces → both complete in parallel ✓
  - Same palace     → one completes, the other exits with
    "another `mine` is already running against <palace> — exiting
    cleanly." ✓
felipetruman added a commit to felipetruman/mempalace that referenced this pull request Apr 17, 2026
…ut, tests

Addresses the six Copilot review comments on the initial commit.

1) MemPalace#6 (critical) — mcp_server.py `_get_collection` bypassed ChromaBackend

   The MCP server creates its palace collection directly via
   `chromadb.PersistentClient.get_or_create_collection` in `_get_collection`,
   not through `ChromaBackend.get_collection`. That path was missing the
   `hnsw:num_threads=1` metadata, so the primary crash surface for MemPalace#974
   and MemPalace#965 was untouched by the original patch. Fixed by passing
   `hnsw:num_threads=1` at the mcp_server create site too. Documented
   in a code comment that the setting is only honored at creation
   time — existing palaces created before this fix still need a
   `mempalace nuke` + re-mine to gain the protection.

2) MemPalace#3 — mine_global_lock over-serialized mines across unrelated palaces

   Replaced the single global lock file `mine_global.lock` with a
   per-palace lock keyed by `sha256(os.path.abspath(palace_path))`
   (`mine_palace_<hash>.lock`). Mines against the same palace still
   collapse to a single runner (the correctness boundary), but mines
   against *different* palaces are now free to run in parallel.
   `mine_global_lock` is kept as a backward-compatible alias for
   `mine_palace_lock` so any external callers that imported the
   previous name keep working.

3) MemPalace#1 — hook_precompact swallowed OSError but not subprocess.TimeoutExpired

   `subprocess.run(..., timeout=60)` raises `TimeoutExpired` on slow
   palaces. The previous `except OSError` clause didn't catch it, so
   the hook could raise and fail to emit any JSON decision — leaving
   the harness without a block/passthrough signal. Fixed by catching
   `(OSError, subprocess.TimeoutExpired)` together and always falling
   through to the block decision so the hook reliably emits a response.

4) MemPalace#2 + MemPalace#4 — tests

   - tests/test_hooks_cli.py: added
     `test_precompact_first_two_attempts_block`,
     `test_precompact_passes_through_after_cap`, and
     `test_precompact_counter_is_per_session` to lock in the MemPalace#955
     deadlock fix.
   - tests/test_palace_locks.py (new): covers `mine_palace_lock`
     single-acquire, reuse-after-release, cross-process serialization
     on the same palace, non-interference across different palaces,
     path normalization, and the `mine_global_lock` back-compat alias.

5) MemPalace#5 — known limitation, documented but not auto-fixed

   Copilot suggested detecting collections missing `hnsw:num_threads=1`
   and calling `collection.modify(metadata=...)` to retrofit existing
   palaces. Verified against chromadb 1.5.7: `modify(metadata=...)`
   replaces metadata rather than merging, and re-passing
   `hnsw:space="cosine"` then raises `ValueError: Changing the
   distance function of a collection once it is created is not
   supported currently.` The HNSW runtime configuration
   (`configuration_json`) also does not expose `num_threads` in
   chromadb 1.5.x, so the flag appears to be read only at creation
   time. Rather than paper over the limitation with a best-effort
   `modify` that silently drops `hnsw:space`, documented in the
   mcp_server comment that pre-existing palaces need a
   `mempalace nuke` + re-mine to gain the protection. Fresh palaces
   are always protected.

Testing
- pytest tests/test_palace_locks.py tests/test_hooks_cli.py
  tests/test_backends.py tests/test_cli.py → **98 passed, 0 failed**.
- Runtime validation with two concurrent `mempalace mine` calls:
  - Different palaces → both complete in parallel ✓
  - Same palace     → one completes, the other exits with
    "another `mine` is already running against <palace> — exiting
    cleanly." ✓
igorls pushed a commit to felipetruman/mempalace that referenced this pull request Apr 25, 2026
…ut, tests

Addresses the six Copilot review comments on the initial commit.

1) MemPalace#6 (critical) — mcp_server.py `_get_collection` bypassed ChromaBackend

   The MCP server creates its palace collection directly via
   `chromadb.PersistentClient.get_or_create_collection` in `_get_collection`,
   not through `ChromaBackend.get_collection`. That path was missing the
   `hnsw:num_threads=1` metadata, so the primary crash surface for MemPalace#974
   and MemPalace#965 was untouched by the original patch. Fixed by passing
   `hnsw:num_threads=1` at the mcp_server create site too. Documented
   in a code comment that the setting is only honored at creation
   time — existing palaces created before this fix still need a
   `mempalace nuke` + re-mine to gain the protection.

2) MemPalace#3 — mine_global_lock over-serialized mines across unrelated palaces

   Replaced the single global lock file `mine_global.lock` with a
   per-palace lock keyed by `sha256(os.path.abspath(palace_path))`
   (`mine_palace_<hash>.lock`). Mines against the same palace still
   collapse to a single runner (the correctness boundary), but mines
   against *different* palaces are now free to run in parallel.
   `mine_global_lock` is kept as a backward-compatible alias for
   `mine_palace_lock` so any external callers that imported the
   previous name keep working.

3) MemPalace#1 — hook_precompact swallowed OSError but not subprocess.TimeoutExpired

   `subprocess.run(..., timeout=60)` raises `TimeoutExpired` on slow
   palaces. The previous `except OSError` clause didn't catch it, so
   the hook could raise and fail to emit any JSON decision — leaving
   the harness without a block/passthrough signal. Fixed by catching
   `(OSError, subprocess.TimeoutExpired)` together and always falling
   through to the block decision so the hook reliably emits a response.

4) MemPalace#2 + MemPalace#4 — tests

   - tests/test_hooks_cli.py: added
     `test_precompact_first_two_attempts_block`,
     `test_precompact_passes_through_after_cap`, and
     `test_precompact_counter_is_per_session` to lock in the MemPalace#955
     deadlock fix.
   - tests/test_palace_locks.py (new): covers `mine_palace_lock`
     single-acquire, reuse-after-release, cross-process serialization
     on the same palace, non-interference across different palaces,
     path normalization, and the `mine_global_lock` back-compat alias.

5) MemPalace#5 — known limitation, documented but not auto-fixed

   Copilot suggested detecting collections missing `hnsw:num_threads=1`
   and calling `collection.modify(metadata=...)` to retrofit existing
   palaces. Verified against chromadb 1.5.7: `modify(metadata=...)`
   replaces metadata rather than merging, and re-passing
   `hnsw:space="cosine"` then raises `ValueError: Changing the
   distance function of a collection once it is created is not
   supported currently.` The HNSW runtime configuration
   (`configuration_json`) also does not expose `num_threads` in
   chromadb 1.5.x, so the flag appears to be read only at creation
   time. Rather than paper over the limitation with a best-effort
   `modify` that silently drops `hnsw:space`, documented in the
   mcp_server comment that pre-existing palaces need a
   `mempalace nuke` + re-mine to gain the protection. Fresh palaces
   are always protected.

Testing
- pytest tests/test_palace_locks.py tests/test_hooks_cli.py
  tests/test_backends.py tests/test_cli.py → **98 passed, 0 failed**.
- Runtime validation with two concurrent `mempalace mine` calls:
  - Different palaces → both complete in parallel ✓
  - Same palace     → one completes, the other exits with
    "another `mine` is already running against <palace> — exiting
    cleanly." ✓
lealvona pushed a commit to lealvona/mempalace that referenced this pull request Apr 29, 2026
…ut, tests

Addresses the six Copilot review comments on the initial commit.

1) MemPalace#6 (critical) — mcp_server.py `_get_collection` bypassed ChromaBackend

   The MCP server creates its palace collection directly via
   `chromadb.PersistentClient.get_or_create_collection` in `_get_collection`,
   not through `ChromaBackend.get_collection`. That path was missing the
   `hnsw:num_threads=1` metadata, so the primary crash surface for MemPalace#974
   and MemPalace#965 was untouched by the original patch. Fixed by passing
   `hnsw:num_threads=1` at the mcp_server create site too. Documented
   in a code comment that the setting is only honored at creation
   time — existing palaces created before this fix still need a
   `mempalace nuke` + re-mine to gain the protection.

2) MemPalace#3 — mine_global_lock over-serialized mines across unrelated palaces

   Replaced the single global lock file `mine_global.lock` with a
   per-palace lock keyed by `sha256(os.path.abspath(palace_path))`
   (`mine_palace_<hash>.lock`). Mines against the same palace still
   collapse to a single runner (the correctness boundary), but mines
   against *different* palaces are now free to run in parallel.
   `mine_global_lock` is kept as a backward-compatible alias for
   `mine_palace_lock` so any external callers that imported the
   previous name keep working.

3) MemPalace#1 — hook_precompact swallowed OSError but not subprocess.TimeoutExpired

   `subprocess.run(..., timeout=60)` raises `TimeoutExpired` on slow
   palaces. The previous `except OSError` clause didn't catch it, so
   the hook could raise and fail to emit any JSON decision — leaving
   the harness without a block/passthrough signal. Fixed by catching
   `(OSError, subprocess.TimeoutExpired)` together and always falling
   through to the block decision so the hook reliably emits a response.

4) MemPalace#2 + MemPalace#4 — tests

   - tests/test_hooks_cli.py: added
     `test_precompact_first_two_attempts_block`,
     `test_precompact_passes_through_after_cap`, and
     `test_precompact_counter_is_per_session` to lock in the MemPalace#955
     deadlock fix.
   - tests/test_palace_locks.py (new): covers `mine_palace_lock`
     single-acquire, reuse-after-release, cross-process serialization
     on the same palace, non-interference across different palaces,
     path normalization, and the `mine_global_lock` back-compat alias.

5) MemPalace#5 — known limitation, documented but not auto-fixed

   Copilot suggested detecting collections missing `hnsw:num_threads=1`
   and calling `collection.modify(metadata=...)` to retrofit existing
   palaces. Verified against chromadb 1.5.7: `modify(metadata=...)`
   replaces metadata rather than merging, and re-passing
   `hnsw:space="cosine"` then raises `ValueError: Changing the
   distance function of a collection once it is created is not
   supported currently.` The HNSW runtime configuration
   (`configuration_json`) also does not expose `num_threads` in
   chromadb 1.5.x, so the flag appears to be read only at creation
   time. Rather than paper over the limitation with a best-effort
   `modify` that silently drops `hnsw:space`, documented in the
   mcp_server comment that pre-existing palaces need a
   `mempalace nuke` + re-mine to gain the protection. Fresh palaces
   are always protected.

Testing
- pytest tests/test_palace_locks.py tests/test_hooks_cli.py
  tests/test_backends.py tests/test_cli.py → **98 passed, 0 failed**.
- Runtime validation with two concurrent `mempalace mine` calls:
  - Different palaces → both complete in parallel ✓
  - Same palace     → one completes, the other exits with
    "another `mine` is already running against <palace> — exiting
    cleanly." ✓
rergards pushed a commit to rergards/mempalace that referenced this pull request Apr 30, 2026
Merge PR MemPalace#5 after local merge rehearsal, full lint/format, full pytest, GitHub checks, and LanceDB schema smoke passed.
FBISiri added a commit to FBISiri/mempalace that referenced this pull request May 9, 2026
Adds _try_gemini_json parser to normalize.py for three layouts:

  1. Gemini API contents format (~/.gemini/sessions/*.json):
     {"contents": [{"role": "user", "parts": [{"text": "..."}]}, ...]}
  2. Messages-wrapper variant:
     {"messages": [{"role": "user", ...}, {"role": "model", ...}]}
  3. Flat top-level list with role="model".

This complements the existing _try_gemini_jsonl parser (which handles
~/.gemini/tmp/<hash>/chats/session-*.jsonl with session_metadata
sentinel) — JSONL covers Gemini CLI runtime sessions, JSON covers
exported / Studio-saved transcripts.

## Review feedback addressed (PR MemPalace#204)

bgauryy review:
- MemPalace#1 Parser-precedence bug: _try_gemini_json runs *before*
  _try_claude_ai_json so the {"messages":[..., role=model, ...]}
  layout is no longer silently claimed by the Claude parser. The
  Gemini parser's has_model_role guard prevents false-positives
  against Claude / ChatGPT data.
- MemPalace#2 Layout 2a coverage: TestGeminiJson.test_messages_wrapper_format
  + test_messages_wrapper_does_not_get_claimed_by_claude pin the
  fix in place.
- MemPalace#3 Test conflicts with current main: rebased onto develop;
  tests restructured into TestGeminiJson class.
- MemPalace#4 tempfile/os.unlink → pytest tmp_path everywhere.
- MemPalace#5 elif not text → else (the elif branch was dead).
- MemPalace#6 Module docstring updated to mention Google AI Studio.

Tests: 9 new cases in TestGeminiJson covering all three layouts,
multi-part text joining, non-text part skipping, has_model_role
disambiguation, dispatch-chain regression for review MemPalace#1.
brandonhon added a commit to brandonhon/mempalace that referenced this pull request May 13, 2026
…tion

Addresses PR MemPalace#548 review feedback about scan amplification on large
palaces. The previous implementation made up to six ChromaDB scans per
clean operation:

  1. count_drawers(drawers_col)       — scan MemPalace#1
  2. count_drawers(compressed_col)    — scan MemPalace#2
  3. delete_drawers(drawers_col)      — internal get(where=...) scan MemPalace#3
  4. delete_drawers(drawers_col)      — delete(where=...) scan MemPalace#4
  5. delete_drawers(compressed_col)   — scan MemPalace#5
  6. delete_drawers(compressed_col)   — scan MemPalace#6

Each call was doing its own metadata filter over the collection,
meaning a 100K-drawer palace paid the filter cost six times for a
single cleanup. This refactor drops it to exactly two scans — one per
collection — regardless of palace size.

Changes:

  * palace.py: introduce `find_drawer_ids(col, wing, room)` which
    returns the matching ID list in a single `get(where=..., include=[])`
    call. ChromaDB fetches only IDs — no documents, embeddings, or
    metadatas — so the scan is as cheap as ChromaDB can make it.
  * palace.py: `count_drawers` is now a thin wrapper around
    `find_drawer_ids`. The old standalone `delete_drawers` helper is
    removed because its count-then-delete pattern is exactly what we
    are trying to avoid.
  * cli.py::cmd_clean: call `find_drawer_ids` once per collection,
    reuse the returned ID lists for both the preview counts and the
    subsequent delete. Deletes go through `col.delete(ids=[...])`
    which is an O(n) primary-key delete, not another metadata scan.
    Empty lists are guarded to stay compatible with ChromaDB versions
    that reject `delete()` with no ids or where filter.

Why two scans and not one:

ChromaDB collections are independent — there is no cross-collection
query. `mempalace_drawers` and `mempalace_compressed` must each be
filtered separately. A single-scan variant would have to assume that
compressed IDs are a strict subset of drawer IDs and delete compressed
by the drawer IDs we already found, but `tool_delete_drawer` in
mcp_server.py does not cascade to compressed, so real palaces can
contain orphaned compressed rows whose drawer is already gone. Going
to one scan would silently leak those orphans. Two scans is the
minimum that preserves correctness.

Tests:

  * New `test_find_drawer_ids_*` unit tests cover wing-only, wing+room,
    and no-match cases.
  * New `test_find_drawer_ids_single_scan` monkey-patches
    `collection.get` to assert exactly one call.
  * New `test_clean_scans_each_collection_exactly_once` is a
    regression test that wraps `chromadb.PersistentClient` and counts
    `where`-filtered `get()` calls per collection during a full
    `cmd_clean` invocation, failing if either collection is scanned
    more than once.
  * The existing 15 CLI black-box tests stay identical — the behavior
    is unchanged, only the scan count dropped.

Full suite: 551 passed (was 549, +2 new perf regression tests).

Also in this commit: `-V` / `--version` flags, because every good app
needs a version flag and we somehow shipped three minor releases
without one. The installed version is now embedded in the `-h`
description line too, so `mempalace -h` answers "what am I running?"
without a separate invocation.
mvalentsev pushed a commit to mvalentsev/mempalace that referenced this pull request May 21, 2026
…#1555

PR MemPalace#1555 (format coverage + virtual line numbering) merged with twelve
inline polish comments from Copilot + gemini-code-assist that weren't
load-bearing enough to block the original ship but are real cleanups.
This PR addresses them.

Twelve items in scope; one item (drawer ID delimiter — Copilot MemPalace#13) is
deferred to its own dedicated PR because it's a breaking schema change
that requires migration design beyond the scope of a polish PR.

## Behavioral fixes (5 items, RED-tested first)

1. **FileNotFoundError vs broken symlink (Copilot MemPalace#8).** ``extract_text``
   previously mapped every ``FileNotFoundError`` from ``stat()`` to
   ``SKIP_BROKEN_SYMLINK``. That's misleading for the common case of a
   regular file deleted between scan and extract. Now distinguishes:
   ``SKIP_BROKEN_SYMLINK`` only when ``p.is_symlink()`` is true;
   ``SKIP_UNREADABLE`` otherwise.

2. **``file_already_mined`` extract_mode scoping (Copilot MemPalace#11, MemPalace#12).**
   Both call sites in ``mine_formats`` and ``_file_chunks_locked`` now
   pass ``extract_mode="format"``. Previously the format miner could
   falsely treat drawers from project / convo miner on the same source
   file as "already mined" (and vice versa). Scopes idempotency to the
   correct drawer subset.

3. **Sentinel skip for transient missing-dep statuses (Copilot MemPalace#14).**
   New ``_TRANSIENT_MISSING_DEP_STATUSES`` set + ``_register_skip_sentinel_if_appropriate``
   helper. Skip variants like ``SKIP_NO_MARKITDOWN`` /
   ``SKIP_NO_STRIPRTF`` / ``SKIP_MISSING_FORMAT_DEPS`` /
   ``SKIP_NETWORK_TIMEOUT`` no longer write the "already-mined" sentinel.
   Otherwise installing the missing extra later wouldn't trigger a re-mine.

4. **Outer ``except Exception`` in ``mine_formats`` (Gemini MemPalace#5).** The
   outer try around the loop previously caught only ``KeyboardInterrupt``,
   leaving any setup-time error (e.g., ``scan_formats`` raising) to
   propagate as a bare traceback. Now catches ``Exception`` defensively,
   logs it, prints a partial-progress summary, and lets the ``finally``
   PID-cleanup run. Mirrors miner.py's belt-and-suspenders pattern.

5. **Thread user's ``chunk_size`` / ``chunk_overlap`` / ``min_chunk_size``
   through to ``chunk_text`` (Gemini MemPalace#3).** ``MempalaceConfig`` was loaded
   only to validate readability; users who tuned their config saw no
   effect in format-mode mining. Now properly threaded.

## Trivial cleanups (5 items)

6. **Path expanduser in ``extract_text`` (Copilot MemPalace#7).** ``Path(path)`` →
   ``Path(path).expanduser()`` so CLI inputs like ``~/docs/file.pdf``
   resolve correctly.

7. **Path expanduser+resolve in ``scan_formats`` (Copilot MemPalace#9).** Same
   fix; ``~/docs`` and relative paths now work consistently.

8. **Use resolved ``format_path`` in ``mine_formats`` (Copilot MemPalace#10).**
   ``scan_formats(format_dir)`` → ``scan_formats(format_path)`` so the
   already-resolved path is used.

9. **``render_with_line_numbers`` type annotation (Copilot MemPalace#15).**
   ``text: "str | None"`` reflects the documented + tested ``None``
   handling.

10. **Test + docs claims (Copilot MemPalace#16, MemPalace#17, MemPalace#18).** Stale framings
    removed:
    - ``docs/format-coverage.md`` — 14 fringe cases + "see the file for
      the current test inventory" (no more frozen test count).
    - ``tests/test_line_numbers.py`` — drops "proposed for mempalace
      3.3.6" + "run from the proposal directory" references.
    - ``tests/test_format_miner.py`` — drops "MarkItDown is mocked
      throughout" (live integration tests exist) + proposal-directory
      framing.

## Module-level hoists (enables clean test patching)

- ``MempalaceConfig`` (from ``.config``) hoisted from lazy local import
  to module-level so tests can patch ``mempalace.format_miner.MempalaceConfig``.
- ``chunk_text`` (from ``.miner``) hoisted similarly.

Both follow the pattern PR MemPalace#1565 used for ``compute_hallways_for_wing``.

## Complexity refactor

Extracted ``_print_mine_summary`` from ``mine_formats`` so the orchestrator
stays under the project's ``max-complexity = 25`` ceiling (per
``pyproject.toml [tool.ruff.lint.mccabe]``). Behavior unchanged; pure
extraction.

## Out of scope (intentionally deferred)

- **Drawer ID delimiter collision (Copilot MemPalace#13)** — ``f"{source_file}{chunk_index}"``
  can theoretically collide (``"/path/a1" + "23"`` == ``"/path/a" + "123"``).
  Fixing this is a breaking schema change to drawer IDs and requires a
  migration plan; will land as its own PR after design.

- The four bot comments that were ALREADY addressed by amendment MemPalace#3
  before the PR MemPalace#1555 merge (``_SKIP_DIRS`` dedup, ``scan_formats``
  symlink skip, ``source_mtime`` tracking, hall+entities metadata) —
  no action needed; verified during audit.

## Tests (RED-first)

Six new RED-first tests in ``tests/test_format_miner.py``:

  test_extract_text_nonexistent_regular_file_returns_unreadable_not_broken_symlink
  test_mine_formats_passes_extract_mode_format_to_file_already_mined
  test_mine_formats_does_not_write_sentinel_for_skip_no_markitdown
  test_mine_formats_does_not_write_sentinel_for_skip_missing_format_deps
  test_mine_formats_catches_unexpected_exception_and_prints_summary
  test_mine_formats_threads_chunk_size_from_user_config

All six RED before this commit (failures correctly identified the bugs
they're targeting), all six GREEN after.

One existing test (``test_mine_formats_continues_after_per_file_error``)
updated to patch the new module-level binding
``mempalace.format_miner.chunk_text`` instead of the old
``mempalace.miner.chunk_text`` source location, and to accept the
``**kwargs`` the call now passes through. Behavior unchanged.

## Verification

  pytest -q (full mempalace suite)
    → 2065 passed, 3 skipped, 0 regressions
  ruff check mempalace/format_miner.py mempalace/searcher.py tests/
    → All checks passed!
  ruff format --check ...
    → 4 files already formatted (pinned 0.15.9)
  mine_formats complexity
    → ≤ 25 (under the project ceiling)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant