feat(3.3.6): virtual line numbering + format coverage via --mode extract#1555
Conversation
Two additive features, both following the same read-time-transform pattern:
1. Virtual line numbering — new render_with_line_numbers() and
extract_line_range() in mempalace/searcher.py. Closet pointers like
2026-01-18:L55-L72 resolve to drawer slices rendered with [55] through
[72] line prefixes without modifying any stored content. Lines already
prefixed with [<digits>] pass through unchanged. Pure functions, no I/O.
2. Format coverage (mempalace mine --mode extract) — new mempalace/format_miner.py
reads binary office formats and files drawers via the lock + purge +
upsert pattern convo_miner uses. Source files never modified.
Per-format transformer routing: MarkItDown 0.1.5 does not actually
convert .rtf (returns raw control codes unchanged, verified live), so
.rtf is routed to striprtf which does convert. Other formats stay on
MarkItDown:
.pdf .docx .pptx .xlsx .epub -> MarkItDown
.rtf -> striprtf
13 fringe cases handled with dedicated ExtractionStatus codes:
SKIP_NO_MARKITDOWN, SKIP_NO_STRIPRTF, SKIP_TOO_LARGE, SKIP_CLOUD_ONLY,
SKIP_ENCRYPTED, SKIP_EMPTY, SKIP_PERMISSION, SKIP_BROKEN_SYMLINK,
SKIP_UNRECOGNIZED, SKIP_EXTRACTION_ERROR, SKIP_NETWORK_TIMEOUT,
SKIP_UNREADABLE, plus the encoding fallback handled internally.
Drawers carry ingest_mode=extract + extract_mode=format so they are
distinguishable from project / convo drawers in the palace.
Both transformers are optional extras: pip install mempalace[extract].
MarkItDown requires Python >= 3.10 (env marker ensures it only installs
where supported; Python 3.9 users still get RTF coverage via striprtf).
Tests: 94 new (21 line-numbering + 73 format-miner, including 9
mine_formats orchestrator tests). Full mempalace suite still green
(1992 passed locally). Coverage 86% on mempalace/format_miner.py.
ruff check + ruff format clean on the pinned 0.15.9.
Live verification: mempalace mine --mode extract on a directory with 2
RTFs + 1 PDF produced 90 drawers correctly; mempalace search found
content from both file types in the resulting wing.
Documented limits (out of scope for 3.3.6): custom PDF parsers, OCR on
scanned PDFs, DRM-locked files, pathological corrupt files. These get
reported via skip codes and skipped.
Refs: docs/format-coverage.md, docs/virtual-line-numbering.md
There was a problem hiding this comment.
Code Review
This pull request adds a new 'extract' mode for mining binary office documents and introduces a virtual line numbering system for read-time display. Feedback identifies a bug where updated documents are skipped due to missing modification time tracking and suggests improvements for maintainability, such as deduplicating constants and respecting global configuration. Further recommendations include enhancing error handling, ensuring consistent symlink processing, and adding metadata tagging to match the quality of existing miners.
| _SKIP_DIRS = { | ||
| ".git", | ||
| "node_modules", | ||
| "__pycache__", | ||
| ".venv", | ||
| "venv", | ||
| "env", | ||
| "dist", | ||
| "build", | ||
| ".next", | ||
| "coverage", | ||
| ".mempalace", | ||
| ".ruff_cache", | ||
| ".mypy_cache", | ||
| ".pytest_cache", | ||
| ".cache", | ||
| ".tox", | ||
| ".nox", | ||
| ".idea", | ||
| ".vscode", | ||
| ".ipynb_checkpoints", | ||
| ".eggs", | ||
| "htmlcov", | ||
| "target", | ||
| } |
| for dirpath, dirnames, filenames in os.walk(root): | ||
| # Mutate dirnames in place to prune the walk. | ||
| dirnames[:] = [d for d in dirnames if d not in _SKIP_DIRS] | ||
| for name in filenames: |
There was a problem hiding this comment.
For consistency with miner.py and convo_miner.py, scan_formats should explicitly skip symlinks. This prevents potential issues with circular links or processing the same file multiple times via different paths.
p = Path(dirpath) / name
if p.is_symlink():
continue
if p.suffix.lower() not in SUPPORTED_FORMATS:
continue| # Local import — chunk_text is the same chunker miner.py uses for | ||
| # text-format project files. Imported inside the function to avoid a | ||
| # module-load cycle. | ||
| from .miner import chunk_text |
There was a problem hiding this comment.
The mine_formats function currently uses hardcoded defaults for chunking because it doesn't load the user's configuration. It should respect the chunk_size, chunk_overlap, and min_chunk_size settings from MempalaceConfig to ensure consistent behavior across all mining modes.
| from .miner import chunk_text | |
| from .config import MempalaceConfig | |
| from .miner import chunk_text | |
| palace_config = MempalaceConfig() | |
| cfg_chunk_size = palace_config.chunk_size | |
| cfg_chunk_overlap = palace_config.chunk_overlap | |
| cfg_min_chunk_size = palace_config.min_chunk_size |
| for i, filepath in enumerate(files, 1): | ||
| source_file = str(filepath) | ||
|
|
||
| if not dry_run and file_already_mined(collection, source_file): |
There was a problem hiding this comment.
The format miner is missing source_mtime tracking. Unlike the project miner, it calls file_already_mined with the default check_mtime=False, meaning updated documents (e.g., an edited PDF or DOCX) will be skipped during re-mining. Please capture the file's mtime, store it in the drawer metadata, and enable mtime-based idempotency checks.
| return drawers_added, False | ||
|
|
||
|
|
||
| def mine_formats( |
There was a problem hiding this comment.
The mine_formats orchestrator lacks the robust error handling found in miner.py. It should wrap the processing loop in a try...except KeyboardInterrupt...except Exception block to provide a clean summary on interruption/failure and ensure that the PID lock file is properly cleaned up via _cleanup_mine_pid_file() if spawned by a hook.
| { | ||
| "wing": wing, | ||
| "room": room, | ||
| "source_file": source_file, | ||
| "chunk_index": chunk["chunk_index"], | ||
| "added_by": agent, | ||
| "filed_at": filed_at, | ||
| "ingest_mode": "extract", | ||
| "extract_mode": "format", | ||
| "normalize_version": NORMALIZE_VERSION, | ||
| } | ||
| ) |
There was a problem hiding this comment.
There was a problem hiding this comment.
Pull request overview
This PR introduces two additive ingestion/read-time features aimed at the 3.3.6 release: (1) virtual line numbering helpers for drawer rendering, and (2) a new “format extraction” miner surfaced via mempalace mine --mode extract with optional dependencies for PDF/Office/RTF/EPUB conversion.
Changes:
- Add
render_with_line_numbers/extract_line_rangepure helpers to support line-addressable drawer rendering. - Add
format_miner.py(scan → extract → chunk → upsert) and wire it into the CLI as--mode extract. - Add an
extractoptional-dependencies extra and extensive new unit tests + specs.
Reviewed changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 12 comments.
Show a summary per file
| File | Description |
|---|---|
mempalace/searcher.py |
Adds virtual line-number rendering helpers (pure functions). |
mempalace/format_miner.py |
New format extraction miner (scan + transformer routing + mine orchestration). |
mempalace/cli.py |
Adds --mode extract dispatch path to invoke mine_formats. |
pyproject.toml |
Adds extract optional extra for striprtf + markitdown (py>=3.10). |
tests/test_line_numbers.py |
New tests for virtual line numbering helpers. |
tests/test_format_miner.py |
New tests for format miner, including orchestrator tests. |
docs/virtual-line-numbering.md |
Feature spec for line-number rendering. |
docs/format-coverage.md |
Feature spec for extract-mode mining and skip semantics. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| never modified. | ||
| """ | ||
| # Accept str or Path; pathlib handles Windows separators correctly. | ||
| p = Path(path) |
| logger.info("skip:broken_symlink (stat) %s", p) | ||
| return None, ExtractionStatus.SKIP_BROKEN_SYMLINK |
| (sorted by path) so a re-mine processes files in the same order each | ||
| time — useful for reproducing bug reports. | ||
| """ | ||
| root = Path(directory) |
|
|
||
| wing = normalize_wing_name(format_path.name) | ||
|
|
||
| files = scan_formats(format_dir) |
| for i, filepath in enumerate(files, 1): | ||
| source_file = str(filepath) | ||
|
|
||
| if not dry_run and file_already_mined(collection, source_file): |
| if status != ExtractionStatus.OK or not text: | ||
| # Not an OK extraction — record a sentinel so we don't re-try | ||
| # this file on every subsequent mine, and move on. | ||
| if not dry_run: | ||
| _register_file(collection, source_file, wing, agent) | ||
| print(f" - [{i:4}/{len(files)}] {filepath.name[:50]:50} {status.name}") | ||
| continue |
| def render_with_line_numbers(text, start_line: int = 1) -> str: | ||
| """Prefix each line of ``text`` with ``[N] `` for read-time grid display. | ||
|
|
||
| Lines that already begin with ``[<digits>]`` pass through unchanged, | ||
| but the counter still advances on them so callers can rely on positional | ||
| alignment with the original line indices. | ||
|
|
||
| ``None`` is treated as empty string. Pure function. | ||
| """ | ||
| if not text: |
| **Proposed for mempalace 3.3.6.** | ||
| **Target file (new):** `mempalace/format_miner.py` | ||
| **Target file (edited, one-line addition):** `mempalace/cli.py` (dispatcher in `cmd_mine`) | ||
| **New tests:** `tests/test_format_miner.py` (32 cases, all green; see `PROOF.md`). |
| """Tests for virtual line numbering (proposed for mempalace 3.3.6). | ||
|
|
||
| Run from the proposal directory: | ||
| pytest tests/test_line_numbers.py -v | ||
|
|
||
| After integration into the mempalace package, change the import below to: | ||
| from mempalace.searcher import render_with_line_numbers, extract_line_range |
| happy paths. MarkItDown is mocked throughout so tests run without the | ||
| library installed — same discipline as the existing test_line_numbers.py. |
Six follow-up fixes addressing gemini-code-assist review on PR #1555. All medium-priority parity gaps between format_miner.py and the existing miner.py / convo_miner.py patterns. 1. Reuse palace.SKIP_DIRS instead of duplicating it locally. Replaces the local _SKIP_DIRS set with the shared constant so the format miner's directory-skip set stays in sync automatically. 2. scan_formats now explicitly skips symlinks. Prevents circular links, following symlinks to /dev/urandom, and processing the same file via multiple paths. Mirrors miner.py:1068 exactly. 3. mine_formats loads MempalaceConfig at start. The current chunker (miner.chunk_text) uses module-level CHUNK_SIZE constants and doesn't accept overrides, so the loaded config values aren't yet threaded into chunking; the load is wired so future chunk_text refactors that accept per-call sizing pick this up automatically. Honest scope note: this gives format_miner the same config-load stance as miner.py without overreaching what chunk_text supports. 4. source_mtime tracking + check_mtime=True. Each drawer's metadata now carries source_mtime, and file_already_mined is called with check_mtime=True so an edited PDF/DOCX is correctly re-mined on the next pass. Mirrors miner.py semantics exactly. Previously, updated documents would have been silently skipped. 5. Per-file try/except wrap + KeyboardInterrupt handling + PID cleanup. One malformed file in a mine of thousands no longer crashes the whole orchestrator — the error is logged, a counter bumps, and the loop continues with the next file. KeyboardInterrupt still produces a clean summary on Ctrl-C. _cleanup_mine_pid_file is called in a finally block so hook-spawned mines don't leak a stale PID file on exit. 6. hall + entities metadata on every drawer. Format-mined drawers now carry the same hall (via miner.detect_hall) and entities (via miner._extract_entities_for_metadata) tags as project-mined drawers. Matches the search-result quality the other miners ship. Tests: 5 new tests covering the new behaviors: - test_scan_formats_skips_symlinks - test_mine_formats_uses_check_mtime_true - test_mine_formats_records_source_mtime_in_drawer_metadata - test_mine_formats_records_hall_in_drawer_metadata - test_mine_formats_continues_after_per_file_error Total: 78 tests passing (was 73). Full mempalace test suite green modulo one pre-existing flaky test (test_chroma_close_palace_..., a sqlite-lock-timing flake unrelated to this change — passes 4/4 in isolation). ruff check + ruff format clean on pinned 0.15.9. Coverage on mempalace/format_miner.py: 84% (clears the 80% gate). Refs: PR #1555 review comments from gemini-code-assist[bot]
…parity) Third amendment to PR #1555. Closes the remaining parity gaps between format_miner and the existing project miner that Aya's palace audit surfaced after amendment #2. ## What was broken A smoke test against the just-mined v6 palace confirmed: WING: wing_aya ROOM: documents 2904 drawers ← ALL in one room Tunnels for wing_aya: 0 ← no cross-wing links Cause: format_miner hardcoded `room = "documents"` for every drawer and never called `_compute_topic_tunnels_for_wing()` post-mine. miner.py calls `detect_room()` per drawer and computes tunnels after the loop. format_miner did neither. ## What this commit changes 1. Module-level imports of `detect_room`, `load_config`, and `_compute_topic_tunnels_for_wing` from `.miner`. Module-level (not lazy) so test seams `patch("mempalace.format_miner.detect_room", ...)` work — lazy imports inside a function don't expose attributes on the module object. 2. `mine_formats()` loads the project's `mempalace.yaml` via `load_config()` to get the rooms list. Falls back to a single "documents" room if no config exists. 3. The hardcoded `room = "documents"` line is replaced with: room = detect_room(filepath, text, rooms, format_path) Mirrors miner.py:904 exactly — folder-match → filename-match → content-keyword scoring → fallback "general". 4. After the per-file loop completes (in an `else` branch on the outer try, so it does NOT run on KeyboardInterrupt), call `_compute_topic_tunnels_for_wing(wing)` in try/except. Exact mirror of miner.py:1241-1249. Tunnel-compute failures must never fail a mine. 5. Import `sys` at module level (used by the tunnel-compute error path for `print(..., file=sys.stderr)`). ## Tests 5 new RED-first tests in `tests/test_format_miner.py`: - test_mine_formats_calls_load_config_for_rooms - test_mine_formats_calls_detect_room_per_file - test_mine_formats_uses_detected_room_in_drawer_metadata - test_mine_formats_calls_compute_topic_tunnels_after_loop - test_mine_formats_tunnel_failure_does_not_crash_mine All 5 failed against pre-commit code with AttributeError on missing module attributes (proof the bug existed). All 5 pass after the implementation. Total format_miner test count: 62. Total proposal test count (format_miner + line_numbers): 83. Full mempalace test suite: 2002 passed, 1 skipped (no regressions). ## Verification ruff check mempalace/format_miner.py tests/test_format_miner.py → All checks passed! ruff format --check ... → already formatted (pinned ruff 0.15.9) pytest tests/test_format_miner.py tests/test_line_numbers.py → 83 passed pytest -q (full suite) → 2002 passed, 1 skipped ## Not in this commit (deferred) Within-wing hallway primitives are a separate architectural addition (separate PR — being designed). The current `_compute_topic_tunnels_for_wing` matches miner.py exactly but inherits miner.py's gap: it computes tunnels from raw topic words rather than from hallway primitives. That refactor lands in its own PR after the hallway primitive ships.
There was a problem hiding this comment.
@milla-jovovich we tested locally before approval. Two issues need fixing before merging.
| # only installs where supported; Python 3.9 users still get RTF coverage). | ||
| extract = [ | ||
| "striprtf>=0.0.27", | ||
| "markitdown>=0.1.5; python_version >= '3.10'", |
There was a problem hiding this comment.
I tested this locally and mempalace[extract] does not currently install enough for the formats advertised here. A real PDF hit MarkItDown's MissingDependencyException asking for markitdown[pdf]; the code then reports it as SKIP_EXTRACTION_ERROR instead of a clear missing-dependency status. Can you update this extra to include the MarkItDown extras for the supported formats (pdf, docx, pptx, xlsx, maybe all) and adjust the error handling/message so PDF/doc extraction works after pip install mempalace[extract]?
| def test_fringe_broken_symlink(tmp_path: Path): | ||
| target = tmp_path / "does-not-exist.pdf" | ||
| link = tmp_path / "broken-link.pdf" | ||
| link.symlink_to(target) |
There was a problem hiding this comment.
This fails on Windows when the test process does not have symlink privileges (WinError 1314) before it reaches the product code. Can you skip these symlink tests when Path.symlink_to() raises OSError/privilege errors? Same issue applies to the later test_scan_formats_skips_symlinks case.
…tras + skip symlink tests on Windows Addresses PR #1555 review (Igor) — two bugs that block this PR from merging cleanly: 1. ``mempalace[extract]`` extras did not pull MarkItDown's per-format sub-dependencies. A real PDF after ``pip install mempalace[extract]`` would raise ``MissingDependencyException`` asking for ``markitdown[pdf]`` — the code then routed that exception through the generic ``except Exception`` and surfaced it as ``SKIP_EXTRACTION_ERROR``, stripping the actionable signal from the user. 2. ``test_fringe_broken_symlink`` and ``test_scan_formats_skips_symlinks`` called ``Path.symlink_to()`` unguarded, which raises ``OSError`` (``WinError 1314``) on Windows test environments without ``SeCreateSymbolicLinkPrivilege`` — surfacing as hard test failures before any product code ran. ## What this commit does 1. ``mempalace/format_miner.py`` — adds a new ``ExtractionStatus.SKIP_MISSING_FORMAT_DEPS`` enum member and a matching catch in ``extract_text`` that fires BEFORE the generic ``except Exception`` block. The catch matches by exception type name (``type(exc).__name__ == "MissingDependencyException"``) so the static import surface doesn't change — MarkItDown stays an optional dependency. 2. ``pyproject.toml`` — changes ``[extract]`` to include MarkItDown's per-format sub-extras: "markitdown[docx,pdf,pptx,xlsx]>=0.1.5; python_version >= '3.10'" These pull ``pdfminer-six``, ``pdfplumber``, ``mammoth``, ``lxml``, ``python-pptx``, ``openpyxl``, ``pandas`` — the deps each per-format converter actually needs at runtime. Verified against MarkItDown 0.1.5's PyPI metadata (Provides-Extra includes ``pdf``, ``docx``, ``pptx``, ``xlsx`` among others). ``.epub`` is handled by base markitdown (``EpubConverter`` uses ``beautifulsoup4``, a base requirement, so no ``[epub]`` extra is needed — and none exists in 0.1.5). ``.rtf`` is covered by the existing ``striprtf`` entry. ``markitdown[all]`` was NOT used because it pulls audio (``pydub``, ``speechrecognition``), YouTube, and Azure deps that mempalace does not claim support for and that would bloat the install for users. 3. ``tests/test_format_miner.py`` — adds a ``_make_symlink_or_skip`` helper near the top of the file that wraps ``Path.symlink_to()`` in try/except ``OSError`` and calls ``pytest.skip(...)`` on failure. Refactors both existing symlink tests (``test_fringe_broken_symlink``, ``test_scan_formats_skips_symlinks``) to use the helper. Behavior is unchanged on macOS/Linux; on Windows without symlink privileges the tests skip cleanly instead of spuriously failing. ## Tests (RED-first) Two new RED-first tests: test_extract_text_missing_format_dep_returns_distinct_status Patches ``_extract_via_markitdown`` to raise a fake exception with ``__name__ == "MissingDependencyException"``. Asserts the dispatcher returns ``SKIP_MISSING_FORMAT_DEPS``, not ``SKIP_EXTRACTION_ERROR``. test_pyproject_extract_extra_pulls_markitdown_format_subdeps Parses ``pyproject.toml`` (via ``tomllib`` 3.11+ or ``tomli`` on 3.9/3.10 — the latter is already a base dependency under the ``python_version < '3.11'`` marker). Asserts ``[extract]`` includes ``pdf``, ``docx``, ``pptx``, ``xlsx`` inside SOME ``markitdown[...]`` bracketed group. Both RED before this commit. Both GREEN after. Also updates ``test_extraction_status_enum_has_all_documented_codes`` to include the new ``SKIP_MISSING_FORMAT_DEPS`` code in the documented set, so the enum-completeness doc-test stays honest. ## Verification pytest tests/test_format_miner.py::test_extract_text_missing_format_dep_returns_distinct_status tests/test_format_miner.py::test_pyproject_extract_extra_pulls_markitdown_format_subdeps → 2 passed (RED before, GREEN after) pytest tests/test_format_miner.py::test_fringe_broken_symlink tests/test_format_miner.py::test_scan_formats_skips_symlinks → 2 passed on macOS (Windows behaviour validated by CI test-windows) pytest tests/test_format_miner.py → 64 passed, 2 skipped, 0 regressions pytest -q (full mempalace suite) → 2002 passed, 3 skipped, 0 regressions ruff check mempalace/format_miner.py tests/test_format_miner.py → All checks passed! ruff format --check ... → 2 files already formatted (pinned 0.15.9) ## Why no OrbStack run this time OrbStack runs Linux containers, which have unprivileged symlinks. It cannot reproduce the Windows ``WinError 1314`` failure mode this amendment fixes. The only authoritative gate for the symlink fix is GitHub CI's ``test-windows`` job. For the enum + extras changes, both are pure-Python / install-time concerns the CI test-linux matrix covers identically to OrbStack. The ``tomli`` fallback in the new pyproject test was verified against the base dependency declaration in ``pyproject.toml:32``.
Wires the hallway primitive (from PR MemPalace#1558) into the project miner so that every mine that touches a wing also materializes within-wing entity hallways for that wing. Without this integration, the hallway module is dead code — no miner triggers it, no hallways ever land in ~/.mempalace/hallways.json. ## What this commit does 1. Adds a module-level import of ``compute_hallways_for_wing`` from ``.hallways`` near the top of ``miner.py``. Module-level (not lazy) so tests can patch it as ``mempalace.miner.compute_hallways_for_wing`` — lazy imports inside a function wouldn't expose the seam. 2. In ``_mine_impl``, immediately after the existing ``_compute_topic_tunnels_for_wing(wing)`` post-mine block, adds a parallel hallway block. The block: - calls ``compute_hallways_for_wing(wing, col=collection)`` - prints the count if any hallways were materialized - wraps the whole thing in try/except so a hallway-compute failure is logged + degraded, never propagated. Mirrors the tunnel block's fault-tolerance pattern exactly. Hallway computation is a derived analytic, not load-bearing for the drawer write that already committed above. ## Stacking This PR stacks on PR MemPalace#1558 (which introduces the hallway primitive module). PR MemPalace#1558 is the prerequisite — without it, the import ``from .hallways import compute_hallways_for_wing`` doesn't resolve. Base branch for this PR is ``feat/hallways-within-wing-connectors`` (PR MemPalace#1558's branch). When PR MemPalace#1558 merges to develop, GitHub will auto-update this PR's base to develop and the diff will reduce to just the miner.py and tests/test_miner.py changes. ## Out of scope (deferred to follow-up PRs) - ``format_miner.py`` integration: lives on a different branch (PR MemPalace#1555) so the parallel call there goes in as a follow-up amendment to that branch (or a separate post-merge PR). - ``convo_miner.py`` integration: convo_miner currently doesn't call ``_compute_topic_tunnels_for_wing`` either. Adding both calls is a separate concerned PR about convo_miner parity, not just hallways. - Refactoring ``_compute_topic_tunnels_for_wing`` to BUILD ON hallways (rather than computing from raw topic words): the architecturally meaningful follow-up that completes the Wing → Drawer-entities → Hallway → Tunnel sequence. Real refactor, separate PR. ## Tests Two new RED-first tests in ``tests/test_miner.py``: test_mine_computes_hallways_for_wing_post_mine Stubs ``mempalace.miner.compute_hallways_for_wing`` via monkeypatch. Runs a real mine into a tmp palace. Asserts the stub was called exactly once, with the wing name from mempalace.yaml and a live ChromaDB collection (not None). test_mine_hallway_failure_does_not_crash_mine Stubs the hallway function to raise. Runs a real mine. Asserts mine() doesn't propagate, and that the drawer write (which happens BEFORE the hallway block) still committed. Both RED before this commit (AttributeError — module had no attribute ``compute_hallways_for_wing``). Both GREEN after. ## Verification ruff check mempalace/miner.py tests/test_miner.py → All checks passed! ruff format --check ... → 2 files already formatted (pinned 0.15.9) pytest tests/test_miner.py → 47 passed (the 2 new + 45 pre-existing) pytest -q (full mempalace suite) → 1938 passed, 1 skipped, 0 regressions Linux CI parity replicated locally via OrbStack containers (Python 3.9, 3.11, 3.13): 2/2 new integration tests pass, ruff clean on all three.
Wires _validate_palace_fts5_after_mine into all three mine entry points so corrupted-FTS5 palaces cannot silently exit 0 from any of them: - _mine_impl (mempalace/miner.py) — project file miner - mine_convos (mempalace/convo_miner.py) — conversation exports - mine_formats (mempalace/format_miner.py) — binary office documents via --mode extract, introduced on develop by MemPalace#1555 (3.3.6 release) between this PR's open date and its rebase cmd_mine surfaces MineValidationError as exit 1 + the same print_sqlite_integrity_abort banner cmd_repair already prints, appended with a mine-specific stderr note that hedges attribution (quick_check cannot tell pre-existing corruption from corruption this mine produced). 17 tests in tests/test_miner_fts5_validation.py cover the helper, the three call sites, dry-run / KeyboardInterrupt skip semantics, and the MineValidationError constructor invariants. Closes MemPalace#1537. Co-authored-by: Caleb Wells <15988028+calebcwells@users.noreply.github.com>
fix(extract): polish PR — address bot review feedback on PR #1555
Wires _validate_palace_fts5_after_mine into all three mine entry points so corrupted-FTS5 palaces cannot silently exit 0 from any of them: - _mine_impl (mempalace/miner.py) — project file miner - mine_convos (mempalace/convo_miner.py) — conversation exports - mine_formats (mempalace/format_miner.py) — binary office documents via --mode extract, introduced on develop by MemPalace#1555 (3.3.6 release) between this PR's open date and its rebase cmd_mine surfaces MineValidationError as exit 1 + the same print_sqlite_integrity_abort banner cmd_repair already prints, appended with a mine-specific stderr note that hedges attribution (quick_check cannot tell pre-existing corruption from corruption this mine produced). 17 tests in tests/test_miner_fts5_validation.py cover the helper, the three call sites, dry-run / KeyboardInterrupt skip semantics, and the MineValidationError constructor invariants. Closes MemPalace#1537. Co-authored-by: Caleb Wells <15988028+calebcwells@users.noreply.github.com>
…#1555 PR MemPalace#1555 (format coverage + virtual line numbering) merged with twelve inline polish comments from Copilot + gemini-code-assist that weren't load-bearing enough to block the original ship but are real cleanups. This PR addresses them. Twelve items in scope; one item (drawer ID delimiter — Copilot MemPalace#13) is deferred to its own dedicated PR because it's a breaking schema change that requires migration design beyond the scope of a polish PR. ## Behavioral fixes (5 items, RED-tested first) 1. **FileNotFoundError vs broken symlink (Copilot MemPalace#8).** ``extract_text`` previously mapped every ``FileNotFoundError`` from ``stat()`` to ``SKIP_BROKEN_SYMLINK``. That's misleading for the common case of a regular file deleted between scan and extract. Now distinguishes: ``SKIP_BROKEN_SYMLINK`` only when ``p.is_symlink()`` is true; ``SKIP_UNREADABLE`` otherwise. 2. **``file_already_mined`` extract_mode scoping (Copilot MemPalace#11, MemPalace#12).** Both call sites in ``mine_formats`` and ``_file_chunks_locked`` now pass ``extract_mode="format"``. Previously the format miner could falsely treat drawers from project / convo miner on the same source file as "already mined" (and vice versa). Scopes idempotency to the correct drawer subset. 3. **Sentinel skip for transient missing-dep statuses (Copilot MemPalace#14).** New ``_TRANSIENT_MISSING_DEP_STATUSES`` set + ``_register_skip_sentinel_if_appropriate`` helper. Skip variants like ``SKIP_NO_MARKITDOWN`` / ``SKIP_NO_STRIPRTF`` / ``SKIP_MISSING_FORMAT_DEPS`` / ``SKIP_NETWORK_TIMEOUT`` no longer write the "already-mined" sentinel. Otherwise installing the missing extra later wouldn't trigger a re-mine. 4. **Outer ``except Exception`` in ``mine_formats`` (Gemini MemPalace#5).** The outer try around the loop previously caught only ``KeyboardInterrupt``, leaving any setup-time error (e.g., ``scan_formats`` raising) to propagate as a bare traceback. Now catches ``Exception`` defensively, logs it, prints a partial-progress summary, and lets the ``finally`` PID-cleanup run. Mirrors miner.py's belt-and-suspenders pattern. 5. **Thread user's ``chunk_size`` / ``chunk_overlap`` / ``min_chunk_size`` through to ``chunk_text`` (Gemini MemPalace#3).** ``MempalaceConfig`` was loaded only to validate readability; users who tuned their config saw no effect in format-mode mining. Now properly threaded. ## Trivial cleanups (5 items) 6. **Path expanduser in ``extract_text`` (Copilot MemPalace#7).** ``Path(path)`` → ``Path(path).expanduser()`` so CLI inputs like ``~/docs/file.pdf`` resolve correctly. 7. **Path expanduser+resolve in ``scan_formats`` (Copilot MemPalace#9).** Same fix; ``~/docs`` and relative paths now work consistently. 8. **Use resolved ``format_path`` in ``mine_formats`` (Copilot MemPalace#10).** ``scan_formats(format_dir)`` → ``scan_formats(format_path)`` so the already-resolved path is used. 9. **``render_with_line_numbers`` type annotation (Copilot MemPalace#15).** ``text: "str | None"`` reflects the documented + tested ``None`` handling. 10. **Test + docs claims (Copilot MemPalace#16, MemPalace#17, MemPalace#18).** Stale framings removed: - ``docs/format-coverage.md`` — 14 fringe cases + "see the file for the current test inventory" (no more frozen test count). - ``tests/test_line_numbers.py`` — drops "proposed for mempalace 3.3.6" + "run from the proposal directory" references. - ``tests/test_format_miner.py`` — drops "MarkItDown is mocked throughout" (live integration tests exist) + proposal-directory framing. ## Module-level hoists (enables clean test patching) - ``MempalaceConfig`` (from ``.config``) hoisted from lazy local import to module-level so tests can patch ``mempalace.format_miner.MempalaceConfig``. - ``chunk_text`` (from ``.miner``) hoisted similarly. Both follow the pattern PR MemPalace#1565 used for ``compute_hallways_for_wing``. ## Complexity refactor Extracted ``_print_mine_summary`` from ``mine_formats`` so the orchestrator stays under the project's ``max-complexity = 25`` ceiling (per ``pyproject.toml [tool.ruff.lint.mccabe]``). Behavior unchanged; pure extraction. ## Out of scope (intentionally deferred) - **Drawer ID delimiter collision (Copilot MemPalace#13)** — ``f"{source_file}{chunk_index}"`` can theoretically collide (``"/path/a1" + "23"`` == ``"/path/a" + "123"``). Fixing this is a breaking schema change to drawer IDs and requires a migration plan; will land as its own PR after design. - The four bot comments that were ALREADY addressed by amendment MemPalace#3 before the PR MemPalace#1555 merge (``_SKIP_DIRS`` dedup, ``scan_formats`` symlink skip, ``source_mtime`` tracking, hall+entities metadata) — no action needed; verified during audit. ## Tests (RED-first) Six new RED-first tests in ``tests/test_format_miner.py``: test_extract_text_nonexistent_regular_file_returns_unreadable_not_broken_symlink test_mine_formats_passes_extract_mode_format_to_file_already_mined test_mine_formats_does_not_write_sentinel_for_skip_no_markitdown test_mine_formats_does_not_write_sentinel_for_skip_missing_format_deps test_mine_formats_catches_unexpected_exception_and_prints_summary test_mine_formats_threads_chunk_size_from_user_config All six RED before this commit (failures correctly identified the bugs they're targeting), all six GREEN after. One existing test (``test_mine_formats_continues_after_per_file_error``) updated to patch the new module-level binding ``mempalace.format_miner.chunk_text`` instead of the old ``mempalace.miner.chunk_text`` source location, and to accept the ``**kwargs`` the call now passes through. Behavior unchanged. ## Verification pytest -q (full mempalace suite) → 2065 passed, 3 skipped, 0 regressions ruff check mempalace/format_miner.py mempalace/searcher.py tests/ → All checks passed! ruff format --check ... → 4 files already formatted (pinned 0.15.9) mine_formats complexity → ≤ 25 (under the project ceiling)
Wires _validate_palace_fts5_after_mine into all three mine entry points so corrupted-FTS5 palaces cannot silently exit 0 from any of them: - _mine_impl (mempalace/miner.py) — project file miner - mine_convos (mempalace/convo_miner.py) — conversation exports - mine_formats (mempalace/format_miner.py) — binary office documents via --mode extract, introduced on develop by MemPalace#1555 (3.3.6 release) between this PR's open date and its rebase cmd_mine surfaces MineValidationError as exit 1 + the same print_sqlite_integrity_abort banner cmd_repair already prints, appended with a mine-specific stderr note that hedges attribution (quick_check cannot tell pre-existing corruption from corruption this mine produced). 17 tests in tests/test_miner_fts5_validation.py cover the helper, the three call sites, dry-run / KeyboardInterrupt skip semantics, and the MineValidationError constructor invariants. Closes MemPalace#1537. Co-authored-by: Caleb Wells <15988028+calebcwells@users.noreply.github.com>
Wires _validate_palace_fts5_after_mine into all three mine entry points so corrupted-FTS5 palaces cannot silently exit 0 from any of them: - _mine_impl (mempalace/miner.py) — project file miner - mine_convos (mempalace/convo_miner.py) — conversation exports - mine_formats (mempalace/format_miner.py) — binary office documents via --mode extract, introduced on develop by MemPalace#1555 (3.3.6 release) between this PR's open date and its rebase cmd_mine surfaces MineValidationError as exit 1 + the same print_sqlite_integrity_abort banner cmd_repair already prints, appended with a mine-specific stderr note that hedges attribution (quick_check cannot tell pre-existing corruption from corruption this mine produced). 17 tests in tests/test_miner_fts5_validation.py cover the helper, the three call sites, dry-run / KeyboardInterrupt skip semantics, and the MineValidationError constructor invariants. Closes MemPalace#1537. Co-authored-by: Caleb Wells <15988028+calebcwells@users.noreply.github.com>
Wires _validate_palace_fts5_after_mine into all three mine entry points so corrupted-FTS5 palaces cannot silently exit 0 from any of them: - _mine_impl (mempalace/miner.py) — project file miner - mine_convos (mempalace/convo_miner.py) — conversation exports - mine_formats (mempalace/format_miner.py) — binary office documents via --mode extract, introduced on develop by MemPalace#1555 (3.3.6 release) between this PR's open date and its rebase cmd_mine surfaces MineValidationError as exit 1 + the same print_sqlite_integrity_abort banner cmd_repair already prints, appended with a mine-specific stderr note that hedges attribution (quick_check cannot tell pre-existing corruption from corruption this mine produced). 17 tests in tests/test_miner_fts5_validation.py cover the helper, the three call sites, dry-run / KeyboardInterrupt skip semantics, and the MineValidationError constructor invariants. Closes MemPalace#1537. Co-authored-by: Caleb Wells <15988028+calebcwells@users.noreply.github.com>
Bumps version 3.3.5 → 3.3.6 across pyproject.toml, version.py, plugin manifests (.claude-plugin/plugin.json, .claude-plugin/marketplace.json, .codex-plugin/plugin.json), README badge, and uv.lock. Flips CHANGELOG.md from ``[Unreleased]`` to ``[3.3.6] — 2026-05-24`` and backfills the major user-facing entries that landed without changelog entries during the cycle: Features: - MemPalace#1555 office-document mining via --mode extract + virtual line numbers - MemPalace#1584 surgical closet pointers with date+line locators (Tier 6a) - MemPalace#1558 + MemPalace#1560 within-wing hallways (entity co-occurrence graph) - MemPalace#1565 cross-wing tunnels auto-promoted from hallways - MemPalace#1578 Hebbian potentiation + Ebbinghaus decay on hallways/tunnels - MemPalace#1236 API-tool transcripts auto-route to wing_api - MemPalace#711 hooks.auto_save toggle for silent-mode sessions - MemPalace#1605 COCA content-word filter for entity detection - MemPalace#1557 case-insensitive entity matching at mine time - MemPalace#1483 multilingual embeddings (embeddinggemma-300m) by default Bug Fixes (selected, user-visible): - MemPalace#1540 silent data loss in three unchunked upsert sites - MemPalace#1538 paragraph chunker oversized chunks - MemPalace#1554 per-file chunk cap too low for transcripts - MemPalace#1562 Windows hook subprocess/ChromaDB deadlock - MemPalace#1529 create_tunnel corrupted hyphenated wing names - MemPalace#1424 save-hook truncated hyphenated project folders - MemPalace#1383 KG cache duplicated graphs for symlinked/cased paths - MemPalace#1466 silent symlink skip now logged - MemPalace#1441 macOS stock-bash 3.2 hook compatibility - MemPalace#1500 / MemPalace#1513 structured JSON-RPC errors on bad MCP input - MemPalace#1523 VACUUM + FTS5 rebuild after repair - MemPalace#1548 FTS5 validation at end of mine - plus MemPalace#1216, MemPalace#1408, MemPalace#1438, MemPalace#1439, MemPalace#1445, MemPalace#1452, MemPalace#1459, MemPalace#1461, MemPalace#1466, MemPalace#1470, MemPalace#1477, MemPalace#1485, MemPalace#1500, MemPalace#1513, MemPalace#1528, MemPalace#1532, MemPalace#1543, MemPalace#1546, MemPalace#1585 Performance: - MemPalace#1474 convo miner pre-fetches mined-set - MemPalace#1487 rebuild_index progress callback - MemPalace#1530 MCP cold-start diagnostics + opt-in warmup Lint passes (ruff 0.15.14); mempalace-mcp entry point alignment verified per RELEASING.md.
Two additive features for 3.3.6 — both following the same read-time-transform pattern. Full design in
docs/format-coverage.mdanddocs/virtual-line-numbering.md.Verification done locally
mempalace/format_miner.pymempalace mine --mode extracton real RTF + PDF produced 90 drawers, all searchableWhat's in the diff (2,148 lines, +5 deletions)
mempalace/format_miner.py(NEW)extract_text,scan_formats,mine_formats,ExtractionStatusmempalace/searcher.py(modified)render_with_line_numbers,extract_line_range— zero existing code touchedmempalace/cli.py(modified)elifbranch incmd_mine+"extract"added to argparse choicespyproject.toml(modified)[extract]optional dep block:striprtf+markitdown(withpython_version >= '3.10'env marker)tests/test_format_miner.py(NEW)tests/test_line_numbers.py(NEW)docs/format-coverage.md(NEW)docs/virtual-line-numbering.md(NEW)Modifications to existing code = 93 lines across 3 files, all pure additions (no deletions of existing logic).
Architecture — per-format transformer routing
A real bug surfaced during live integration testing: MarkItDown 0.1.5 does NOT actually convert
.rtf— it returns the raw control-code source unchanged. So the format miner routes per-extension:Both libraries are optional extras. MarkItDown requires Python ≥ 3.10 (env marker ensures 3.9 users still get RTF coverage via striprtf).
13 fringe cases
Each maps to a dedicated
ExtractionStatusenum value. All have explicit tests:SKIP_NO_MARKITDOWN,SKIP_NO_STRIPRTF,SKIP_TOO_LARGE,SKIP_CLOUD_ONLY,SKIP_ENCRYPTED,SKIP_EMPTY,SKIP_PERMISSION,SKIP_BROKEN_SYMLINK,SKIP_UNRECOGNIZED,SKIP_EXTRACTION_ERROR,SKIP_NETWORK_TIMEOUT,SKIP_UNREADABLE, plus encoding fallback (handled internally viadecode_robust).Documented limits (out of scope for 3.3.6)
These get reported via skip codes and skipped.
For Igor — review questions
extract_text,scan_formats,mine_formats;render_with_line_numbers,extract_line_range)pip install mempalace[extract]pulling bothstriprtf+markitdown(with the 3.10+ env marker), or do you want a different mechanism?Search-pipeline wiring for the line-numbering helpers (auto-resolving
→date:Lstart-Lendcloset pointers in search output) is deferred to a follow-up PR — this PR ships them as library helpers ready to call.CHANGELOG entry + version bump intentionally NOT in this PR — those land in the separate
chore/release-3.3.6-prepPR per the v3.3.5 release flow.