Skip to content

feat(3.3.6): virtual line numbering + format coverage via --mode extract#1555

Merged
igorls merged 4 commits into
developfrom
feat/3.3.6-line-numbers-and-format-coverage
May 21, 2026
Merged

feat(3.3.6): virtual line numbering + format coverage via --mode extract#1555
igorls merged 4 commits into
developfrom
feat/3.3.6-line-numbers-and-format-coverage

Conversation

@milla-jovovich

Copy link
Copy Markdown
Collaborator

Two additive features for 3.3.6 — both following the same read-time-transform pattern. Full design in docs/format-coverage.md and docs/virtual-line-numbering.md.

Verification done locally

Gate Result
Full mempalace test suite ✅ 1992 passed, 0 failed
94 new tests ✅ 21 line-numbering + 73 format-miner (incl. 9 mine_formats orchestrator)
ruff check + ruff format ✅ clean on pinned 0.15.9
Coverage on mempalace/format_miner.py ✅ 86%
Live CLI end-to-end mempalace mine --mode extract on real RTF + PDF produced 90 drawers, all searchable

What's in the diff (2,148 lines, +5 deletions)

Surface Lines Notes
mempalace/format_miner.py (NEW) 674 Self-contained module: extract_text, scan_formats, mine_formats, ExtractionStatus
mempalace/searcher.py (modified) +58 Two new pure functions APPENDED: render_with_line_numbers, extract_line_range — zero existing code touched
mempalace/cli.py (modified) +27 / -5 One elif branch in cmd_mine + "extract" added to argparse choices
pyproject.toml (modified) +8 New [extract] optional dep block: striprtf + markitdown (with python_version >= '3.10' env marker)
tests/test_format_miner.py (NEW) 850 73 tests covering all 13 fringe cases + happy paths + 9 orchestrator tests
tests/test_line_numbers.py (NEW) 161 21 tests
docs/format-coverage.md (NEW) 223 Feature spec
docs/virtual-line-numbering.md (NEW) 152 Feature spec

Modifications to existing code = 93 lines across 3 files, all pure additions (no deletions of existing logic).

Architecture — per-format transformer routing

A real bug surfaced during live integration testing: MarkItDown 0.1.5 does NOT actually convert .rtf — it returns the raw control-code source unchanged. So the format miner routes per-extension:

.pdf .docx .pptx .xlsx .epub  →  MarkItDown
.rtf                          →  striprtf

Both libraries are optional extras. MarkItDown requires Python ≥ 3.10 (env marker ensures 3.9 users still get RTF coverage via striprtf).

13 fringe cases

Each maps to a dedicated ExtractionStatus enum value. All have explicit tests:

SKIP_NO_MARKITDOWN, SKIP_NO_STRIPRTF, SKIP_TOO_LARGE, SKIP_CLOUD_ONLY, SKIP_ENCRYPTED, SKIP_EMPTY, SKIP_PERMISSION, SKIP_BROKEN_SYMLINK, SKIP_UNRECOGNIZED, SKIP_EXTRACTION_ERROR, SKIP_NETWORK_TIMEOUT, SKIP_UNREADABLE, plus encoding fallback (handled internally via decode_robust).

Documented limits (out of scope for 3.3.6)

  • Custom PDF parsers for specific document types
  • OCR on scanned PDFs (Whisper/Tesseract is its own scope)
  • DRM-locked files
  • Pathological corrupt files

These get reported via skip codes and skipped.

For Igor — review questions

  1. Spec review — do the APIs match what you'd want? (extract_text, scan_formats, mine_formats; render_with_line_numbers, extract_line_range)
  2. Fringe-case sanity — did I miss any failure mode? Any code that doesn't deserve its own skip status?
  3. Optional-extras shape — happy with pip install mempalace[extract] pulling both striprtf + markitdown (with the 3.10+ env marker), or do you want a different mechanism?
  4. Scope — both features in 3.3.6, or split across 3.3.6 / 3.3.7?

Search-pipeline wiring for the line-numbering helpers (auto-resolving →date:Lstart-Lend closet pointers in search output) is deferred to a follow-up PR — this PR ships them as library helpers ready to call.

CHANGELOG entry + version bump intentionally NOT in this PR — those land in the separate chore/release-3.3.6-prep PR per the v3.3.5 release flow.

Two additive features, both following the same read-time-transform pattern:

1. Virtual line numbering — new render_with_line_numbers() and
   extract_line_range() in mempalace/searcher.py. Closet pointers like
   2026-01-18:L55-L72 resolve to drawer slices rendered with [55] through
   [72] line prefixes without modifying any stored content. Lines already
   prefixed with [<digits>] pass through unchanged. Pure functions, no I/O.

2. Format coverage (mempalace mine --mode extract) — new mempalace/format_miner.py
   reads binary office formats and files drawers via the lock + purge +
   upsert pattern convo_miner uses. Source files never modified.

   Per-format transformer routing: MarkItDown 0.1.5 does not actually
   convert .rtf (returns raw control codes unchanged, verified live), so
   .rtf is routed to striprtf which does convert. Other formats stay on
   MarkItDown:

     .pdf .docx .pptx .xlsx .epub  -> MarkItDown
     .rtf                          -> striprtf

   13 fringe cases handled with dedicated ExtractionStatus codes:
   SKIP_NO_MARKITDOWN, SKIP_NO_STRIPRTF, SKIP_TOO_LARGE, SKIP_CLOUD_ONLY,
   SKIP_ENCRYPTED, SKIP_EMPTY, SKIP_PERMISSION, SKIP_BROKEN_SYMLINK,
   SKIP_UNRECOGNIZED, SKIP_EXTRACTION_ERROR, SKIP_NETWORK_TIMEOUT,
   SKIP_UNREADABLE, plus the encoding fallback handled internally.

   Drawers carry ingest_mode=extract + extract_mode=format so they are
   distinguishable from project / convo drawers in the palace.

Both transformers are optional extras: pip install mempalace[extract].
MarkItDown requires Python >= 3.10 (env marker ensures it only installs
where supported; Python 3.9 users still get RTF coverage via striprtf).

Tests: 94 new (21 line-numbering + 73 format-miner, including 9
mine_formats orchestrator tests). Full mempalace suite still green
(1992 passed locally). Coverage 86% on mempalace/format_miner.py.
ruff check + ruff format clean on the pinned 0.15.9.

Live verification: mempalace mine --mode extract on a directory with 2
RTFs + 1 PDF produced 90 drawers correctly; mempalace search found
content from both file types in the resulting wing.

Documented limits (out of scope for 3.3.6): custom PDF parsers, OCR on
scanned PDFs, DRM-locked files, pathological corrupt files. These get
reported via skip codes and skipped.

Refs: docs/format-coverage.md, docs/virtual-line-numbering.md
@milla-jovovich milla-jovovich requested a review from igorls as a code owner May 20, 2026 02:29

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request adds a new 'extract' mode for mining binary office documents and introduces a virtual line numbering system for read-time display. Feedback identifies a bug where updated documents are skipped due to missing modification time tracking and suggests improvements for maintainability, such as deduplicating constants and respecting global configuration. Further recommendations include enhancing error handling, ensuring consistent symlink processing, and adding metadata tagging to match the quality of existing miners.

Comment thread mempalace/format_miner.py Outdated
Comment on lines +121 to +145
_SKIP_DIRS = {
".git",
"node_modules",
"__pycache__",
".venv",
"venv",
"env",
"dist",
"build",
".next",
"coverage",
".mempalace",
".ruff_cache",
".mypy_cache",
".pytest_cache",
".cache",
".tox",
".nox",
".idea",
".vscode",
".ipynb_checkpoints",
".eggs",
"htmlcov",
"target",
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The _SKIP_DIRS set is identical to palace.SKIP_DIRS. To improve maintainability and avoid duplication, please import and use the shared constant from the palace module.

Comment thread mempalace/format_miner.py
for dirpath, dirnames, filenames in os.walk(root):
# Mutate dirnames in place to prune the walk.
dirnames[:] = [d for d in dirnames if d not in _SKIP_DIRS]
for name in filenames:

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

For consistency with miner.py and convo_miner.py, scan_formats should explicitly skip symlinks. This prevents potential issues with circular links or processing the same file multiple times via different paths.

            p = Path(dirpath) / name
            if p.is_symlink():
                continue
            if p.suffix.lower() not in SUPPORTED_FORMATS:
                continue

Comment thread mempalace/format_miner.py
# Local import — chunk_text is the same chunker miner.py uses for
# text-format project files. Imported inside the function to avoid a
# module-load cycle.
from .miner import chunk_text

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The mine_formats function currently uses hardcoded defaults for chunking because it doesn't load the user's configuration. It should respect the chunk_size, chunk_overlap, and min_chunk_size settings from MempalaceConfig to ensure consistent behavior across all mining modes.

Suggested change
from .miner import chunk_text
from .config import MempalaceConfig
from .miner import chunk_text
palace_config = MempalaceConfig()
cfg_chunk_size = palace_config.chunk_size
cfg_chunk_overlap = palace_config.chunk_overlap
cfg_min_chunk_size = palace_config.min_chunk_size

Comment thread mempalace/format_miner.py Outdated
for i, filepath in enumerate(files, 1):
source_file = str(filepath)

if not dry_run and file_already_mined(collection, source_file):

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The format miner is missing source_mtime tracking. Unlike the project miner, it calls file_already_mined with the default check_mtime=False, meaning updated documents (e.g., an edited PDF or DOCX) will be skipped during re-mining. Please capture the file's mtime, store it in the drawer metadata, and enable mtime-based idempotency checks.

Comment thread mempalace/format_miner.py
return drawers_added, False


def mine_formats(

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The mine_formats orchestrator lacks the robust error handling found in miner.py. It should wrap the processing loop in a try...except KeyboardInterrupt...except Exception block to provide a clean summary on interruption/failure and ensure that the PID lock file is properly cleaned up via _cleanup_mine_pid_file() if spawned by a hook.

Comment thread mempalace/format_miner.py Outdated
Comment on lines +527 to +538
{
"wing": wing,
"room": room,
"source_file": source_file,
"chunk_index": chunk["chunk_index"],
"added_by": agent,
"filed_at": filed_at,
"ingest_mode": "extract",
"extract_mode": "format",
"normalize_version": NORMALIZE_VERSION,
}
)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Format-mined drawers are missing hall and entities metadata tagging. Tagging these fields (using detect_hall and _extract_entities_for_metadata from miner.py) would improve search result richness and filtering capabilities for documents, matching the quality of project-mined drawers.

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR introduces two additive ingestion/read-time features aimed at the 3.3.6 release: (1) virtual line numbering helpers for drawer rendering, and (2) a new “format extraction” miner surfaced via mempalace mine --mode extract with optional dependencies for PDF/Office/RTF/EPUB conversion.

Changes:

  • Add render_with_line_numbers / extract_line_range pure helpers to support line-addressable drawer rendering.
  • Add format_miner.py (scan → extract → chunk → upsert) and wire it into the CLI as --mode extract.
  • Add an extract optional-dependencies extra and extensive new unit tests + specs.

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 12 comments.

Show a summary per file
File Description
mempalace/searcher.py Adds virtual line-number rendering helpers (pure functions).
mempalace/format_miner.py New format extraction miner (scan + transformer routing + mine orchestration).
mempalace/cli.py Adds --mode extract dispatch path to invoke mine_formats.
pyproject.toml Adds extract optional extra for striprtf + markitdown (py>=3.10).
tests/test_line_numbers.py New tests for virtual line numbering helpers.
tests/test_format_miner.py New tests for format miner, including orchestrator tests.
docs/virtual-line-numbering.md Feature spec for line-number rendering.
docs/format-coverage.md Feature spec for extract-mode mining and skip semantics.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread mempalace/format_miner.py
never modified.
"""
# Accept str or Path; pathlib handles Windows separators correctly.
p = Path(path)
Comment thread mempalace/format_miner.py
Comment on lines +346 to +347
logger.info("skip:broken_symlink (stat) %s", p)
return None, ExtractionStatus.SKIP_BROKEN_SYMLINK
Comment thread mempalace/format_miner.py
(sorted by path) so a re-mine processes files in the same order each
time — useful for reproducing bug reports.
"""
root = Path(directory)
Comment thread mempalace/format_miner.py

wing = normalize_wing_name(format_path.name)

files = scan_formats(format_dir)
Comment thread mempalace/format_miner.py Outdated
for i, filepath in enumerate(files, 1):
source_file = str(filepath)

if not dry_run and file_already_mined(collection, source_file):
Comment thread mempalace/format_miner.py Outdated
Comment on lines +630 to +636
if status != ExtractionStatus.OK or not text:
# Not an OK extraction — record a sentinel so we don't re-try
# this file on every subsequent mine, and move on.
if not dry_run:
_register_file(collection, source_file, wing, agent)
print(f" - [{i:4}/{len(files)}] {filepath.name[:50]:50} {status.name}")
continue
Comment thread mempalace/searcher.py
Comment on lines +1037 to +1046
def render_with_line_numbers(text, start_line: int = 1) -> str:
"""Prefix each line of ``text`` with ``[N] `` for read-time grid display.

Lines that already begin with ``[<digits>]`` pass through unchanged,
but the counter still advances on them so callers can rely on positional
alignment with the original line indices.

``None`` is treated as empty string. Pure function.
"""
if not text:
Comment thread docs/format-coverage.md
**Proposed for mempalace 3.3.6.**
**Target file (new):** `mempalace/format_miner.py`
**Target file (edited, one-line addition):** `mempalace/cli.py` (dispatcher in `cmd_mine`)
**New tests:** `tests/test_format_miner.py` (32 cases, all green; see `PROOF.md`).
Comment on lines +1 to +7
"""Tests for virtual line numbering (proposed for mempalace 3.3.6).

Run from the proposal directory:
pytest tests/test_line_numbers.py -v

After integration into the mempalace package, change the import below to:
from mempalace.searcher import render_with_line_numbers, extract_line_range
Comment on lines +4 to +5
happy paths. MarkItDown is mocked throughout so tests run without the
library installed — same discipline as the existing test_line_numbers.py.
Six follow-up fixes addressing gemini-code-assist review on PR #1555.
All medium-priority parity gaps between format_miner.py and the
existing miner.py / convo_miner.py patterns.

1. Reuse palace.SKIP_DIRS instead of duplicating it locally.
   Replaces the local _SKIP_DIRS set with the shared constant so the
   format miner's directory-skip set stays in sync automatically.

2. scan_formats now explicitly skips symlinks. Prevents circular
   links, following symlinks to /dev/urandom, and processing the same
   file via multiple paths. Mirrors miner.py:1068 exactly.

3. mine_formats loads MempalaceConfig at start. The current chunker
   (miner.chunk_text) uses module-level CHUNK_SIZE constants and
   doesn't accept overrides, so the loaded config values aren't yet
   threaded into chunking; the load is wired so future chunk_text
   refactors that accept per-call sizing pick this up automatically.
   Honest scope note: this gives format_miner the same config-load
   stance as miner.py without overreaching what chunk_text supports.

4. source_mtime tracking + check_mtime=True. Each drawer's metadata
   now carries source_mtime, and file_already_mined is called with
   check_mtime=True so an edited PDF/DOCX is correctly re-mined on
   the next pass. Mirrors miner.py semantics exactly. Previously,
   updated documents would have been silently skipped.

5. Per-file try/except wrap + KeyboardInterrupt handling + PID
   cleanup. One malformed file in a mine of thousands no longer
   crashes the whole orchestrator — the error is logged, a counter
   bumps, and the loop continues with the next file. KeyboardInterrupt
   still produces a clean summary on Ctrl-C. _cleanup_mine_pid_file
   is called in a finally block so hook-spawned mines don't leak a
   stale PID file on exit.

6. hall + entities metadata on every drawer. Format-mined drawers
   now carry the same hall (via miner.detect_hall) and entities (via
   miner._extract_entities_for_metadata) tags as project-mined
   drawers. Matches the search-result quality the other miners ship.

Tests: 5 new tests covering the new behaviors:
  - test_scan_formats_skips_symlinks
  - test_mine_formats_uses_check_mtime_true
  - test_mine_formats_records_source_mtime_in_drawer_metadata
  - test_mine_formats_records_hall_in_drawer_metadata
  - test_mine_formats_continues_after_per_file_error

Total: 78 tests passing (was 73). Full mempalace test suite green
modulo one pre-existing flaky test (test_chroma_close_palace_..., a
sqlite-lock-timing flake unrelated to this change — passes 4/4 in
isolation). ruff check + ruff format clean on pinned 0.15.9.
Coverage on mempalace/format_miner.py: 84% (clears the 80% gate).

Refs: PR #1555 review comments from gemini-code-assist[bot]
…parity)

Third amendment to PR #1555. Closes the remaining parity gaps between
format_miner and the existing project miner that Aya's palace audit
surfaced after amendment #2.

## What was broken

A smoke test against the just-mined v6 palace confirmed:

  WING: wing_aya
    ROOM: documents             2904 drawers   ← ALL in one room
  Tunnels for wing_aya:         0              ← no cross-wing links

Cause: format_miner hardcoded `room = "documents"` for every drawer and
never called `_compute_topic_tunnels_for_wing()` post-mine. miner.py
calls `detect_room()` per drawer and computes tunnels after the loop.
format_miner did neither.

## What this commit changes

1. Module-level imports of `detect_room`, `load_config`, and
   `_compute_topic_tunnels_for_wing` from `.miner`. Module-level (not
   lazy) so test seams `patch("mempalace.format_miner.detect_room", ...)`
   work — lazy imports inside a function don't expose attributes on the
   module object.

2. `mine_formats()` loads the project's `mempalace.yaml` via
   `load_config()` to get the rooms list. Falls back to a single
   "documents" room if no config exists.

3. The hardcoded `room = "documents"` line is replaced with:
       room = detect_room(filepath, text, rooms, format_path)
   Mirrors miner.py:904 exactly — folder-match → filename-match →
   content-keyword scoring → fallback "general".

4. After the per-file loop completes (in an `else` branch on the outer
   try, so it does NOT run on KeyboardInterrupt), call
   `_compute_topic_tunnels_for_wing(wing)` in try/except. Exact mirror
   of miner.py:1241-1249. Tunnel-compute failures must never fail a mine.

5. Import `sys` at module level (used by the tunnel-compute error path
   for `print(..., file=sys.stderr)`).

## Tests

5 new RED-first tests in `tests/test_format_miner.py`:

  - test_mine_formats_calls_load_config_for_rooms
  - test_mine_formats_calls_detect_room_per_file
  - test_mine_formats_uses_detected_room_in_drawer_metadata
  - test_mine_formats_calls_compute_topic_tunnels_after_loop
  - test_mine_formats_tunnel_failure_does_not_crash_mine

All 5 failed against pre-commit code with AttributeError on missing
module attributes (proof the bug existed). All 5 pass after the
implementation. Total format_miner test count: 62. Total proposal test
count (format_miner + line_numbers): 83. Full mempalace test suite:
2002 passed, 1 skipped (no regressions).

## Verification

  ruff check mempalace/format_miner.py tests/test_format_miner.py
    → All checks passed!
  ruff format --check ...
    → already formatted (pinned ruff 0.15.9)
  pytest tests/test_format_miner.py tests/test_line_numbers.py
    → 83 passed
  pytest -q  (full suite)
    → 2002 passed, 1 skipped

## Not in this commit (deferred)

Within-wing hallway primitives are a separate architectural addition
(separate PR — being designed). The current `_compute_topic_tunnels_for_wing`
matches miner.py exactly but inherits miner.py's gap: it computes
tunnels from raw topic words rather than from hallway primitives. That
refactor lands in its own PR after the hallway primitive ships.

@igorls igorls left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@milla-jovovich we tested locally before approval. Two issues need fixing before merging.

Comment thread pyproject.toml Outdated
# only installs where supported; Python 3.9 users still get RTF coverage).
extract = [
"striprtf>=0.0.27",
"markitdown>=0.1.5; python_version >= '3.10'",

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tested this locally and mempalace[extract] does not currently install enough for the formats advertised here. A real PDF hit MarkItDown's MissingDependencyException asking for markitdown[pdf]; the code then reports it as SKIP_EXTRACTION_ERROR instead of a clear missing-dependency status. Can you update this extra to include the MarkItDown extras for the supported formats (pdf, docx, pptx, xlsx, maybe all) and adjust the error handling/message so PDF/doc extraction works after pip install mempalace[extract]?

Comment thread tests/test_format_miner.py Outdated
def test_fringe_broken_symlink(tmp_path: Path):
target = tmp_path / "does-not-exist.pdf"
link = tmp_path / "broken-link.pdf"
link.symlink_to(target)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This fails on Windows when the test process does not have symlink privileges (WinError 1314) before it reaches the product code. Can you skip these symlink tests when Path.symlink_to() raises OSError/privilege errors? Same issue applies to the later test_scan_formats_skips_symlinks case.

…tras + skip symlink tests on Windows

Addresses PR #1555 review (Igor) — two bugs that block this PR from
merging cleanly:

1. ``mempalace[extract]`` extras did not pull MarkItDown's per-format
   sub-dependencies. A real PDF after ``pip install mempalace[extract]``
   would raise ``MissingDependencyException`` asking for ``markitdown[pdf]``
   — the code then routed that exception through the generic
   ``except Exception`` and surfaced it as ``SKIP_EXTRACTION_ERROR``,
   stripping the actionable signal from the user.

2. ``test_fringe_broken_symlink`` and ``test_scan_formats_skips_symlinks``
   called ``Path.symlink_to()`` unguarded, which raises ``OSError``
   (``WinError 1314``) on Windows test environments without
   ``SeCreateSymbolicLinkPrivilege`` — surfacing as hard test failures
   before any product code ran.

## What this commit does

1. ``mempalace/format_miner.py`` — adds a new
   ``ExtractionStatus.SKIP_MISSING_FORMAT_DEPS`` enum member and a
   matching catch in ``extract_text`` that fires BEFORE the generic
   ``except Exception`` block. The catch matches by exception type name
   (``type(exc).__name__ == "MissingDependencyException"``) so the static
   import surface doesn't change — MarkItDown stays an optional
   dependency.

2. ``pyproject.toml`` — changes ``[extract]`` to include MarkItDown's
   per-format sub-extras:

     "markitdown[docx,pdf,pptx,xlsx]>=0.1.5; python_version >= '3.10'"

   These pull ``pdfminer-six``, ``pdfplumber``, ``mammoth``, ``lxml``,
   ``python-pptx``, ``openpyxl``, ``pandas`` — the deps each per-format
   converter actually needs at runtime. Verified against MarkItDown
   0.1.5's PyPI metadata (Provides-Extra includes ``pdf``, ``docx``,
   ``pptx``, ``xlsx`` among others). ``.epub`` is handled by base
   markitdown (``EpubConverter`` uses ``beautifulsoup4``, a base
   requirement, so no ``[epub]`` extra is needed — and none exists in
   0.1.5). ``.rtf`` is covered by the existing ``striprtf`` entry.
   ``markitdown[all]`` was NOT used because it pulls audio
   (``pydub``, ``speechrecognition``), YouTube, and Azure deps that
   mempalace does not claim support for and that would bloat the
   install for users.

3. ``tests/test_format_miner.py`` — adds a ``_make_symlink_or_skip``
   helper near the top of the file that wraps ``Path.symlink_to()`` in
   try/except ``OSError`` and calls ``pytest.skip(...)`` on failure.
   Refactors both existing symlink tests
   (``test_fringe_broken_symlink``, ``test_scan_formats_skips_symlinks``)
   to use the helper. Behavior is unchanged on macOS/Linux; on Windows
   without symlink privileges the tests skip cleanly instead of
   spuriously failing.

## Tests (RED-first)

Two new RED-first tests:

  test_extract_text_missing_format_dep_returns_distinct_status
      Patches ``_extract_via_markitdown`` to raise a fake exception
      with ``__name__ == "MissingDependencyException"``. Asserts the
      dispatcher returns ``SKIP_MISSING_FORMAT_DEPS``, not
      ``SKIP_EXTRACTION_ERROR``.

  test_pyproject_extract_extra_pulls_markitdown_format_subdeps
      Parses ``pyproject.toml`` (via ``tomllib`` 3.11+ or ``tomli`` on
      3.9/3.10 — the latter is already a base dependency under the
      ``python_version < '3.11'`` marker). Asserts ``[extract]``
      includes ``pdf``, ``docx``, ``pptx``, ``xlsx`` inside SOME
      ``markitdown[...]`` bracketed group.

Both RED before this commit. Both GREEN after.

Also updates ``test_extraction_status_enum_has_all_documented_codes`` to
include the new ``SKIP_MISSING_FORMAT_DEPS`` code in the documented set,
so the enum-completeness doc-test stays honest.

## Verification

  pytest tests/test_format_miner.py::test_extract_text_missing_format_dep_returns_distinct_status
        tests/test_format_miner.py::test_pyproject_extract_extra_pulls_markitdown_format_subdeps
    → 2 passed (RED before, GREEN after)
  pytest tests/test_format_miner.py::test_fringe_broken_symlink
        tests/test_format_miner.py::test_scan_formats_skips_symlinks
    → 2 passed on macOS (Windows behaviour validated by CI test-windows)
  pytest tests/test_format_miner.py
    → 64 passed, 2 skipped, 0 regressions
  pytest -q (full mempalace suite)
    → 2002 passed, 3 skipped, 0 regressions
  ruff check mempalace/format_miner.py tests/test_format_miner.py
    → All checks passed!
  ruff format --check ...
    → 2 files already formatted (pinned 0.15.9)

## Why no OrbStack run this time

OrbStack runs Linux containers, which have unprivileged symlinks. It
cannot reproduce the Windows ``WinError 1314`` failure mode this
amendment fixes. The only authoritative gate for the symlink fix is
GitHub CI's ``test-windows`` job. For the enum + extras changes, both
are pure-Python / install-time concerns the CI test-linux matrix covers
identically to OrbStack. The ``tomli`` fallback in the new pyproject
test was verified against the base dependency declaration in
``pyproject.toml:32``.
@igorls igorls merged commit 95caf80 into develop May 21, 2026
7 checks passed
mvalentsev pushed a commit to mvalentsev/mempalace that referenced this pull request May 21, 2026
Wires the hallway primitive (from PR MemPalace#1558) into the project miner so
that every mine that touches a wing also materializes within-wing
entity hallways for that wing. Without this integration, the hallway
module is dead code — no miner triggers it, no hallways ever land in
~/.mempalace/hallways.json.

## What this commit does

1. Adds a module-level import of ``compute_hallways_for_wing`` from
   ``.hallways`` near the top of ``miner.py``. Module-level (not lazy)
   so tests can patch it as ``mempalace.miner.compute_hallways_for_wing``
   — lazy imports inside a function wouldn't expose the seam.

2. In ``_mine_impl``, immediately after the existing
   ``_compute_topic_tunnels_for_wing(wing)`` post-mine block, adds a
   parallel hallway block. The block:

   - calls ``compute_hallways_for_wing(wing, col=collection)``
   - prints the count if any hallways were materialized
   - wraps the whole thing in try/except so a hallway-compute failure
     is logged + degraded, never propagated. Mirrors the tunnel block's
     fault-tolerance pattern exactly. Hallway computation is a derived
     analytic, not load-bearing for the drawer write that already
     committed above.

## Stacking

This PR stacks on PR MemPalace#1558 (which introduces the hallway primitive
module). PR MemPalace#1558 is the prerequisite — without it, the import
``from .hallways import compute_hallways_for_wing`` doesn't resolve.

Base branch for this PR is ``feat/hallways-within-wing-connectors``
(PR MemPalace#1558's branch). When PR MemPalace#1558 merges to develop, GitHub will
auto-update this PR's base to develop and the diff will reduce to
just the miner.py and tests/test_miner.py changes.

## Out of scope (deferred to follow-up PRs)

- ``format_miner.py`` integration: lives on a different branch (PR MemPalace#1555)
  so the parallel call there goes in as a follow-up amendment to that
  branch (or a separate post-merge PR).
- ``convo_miner.py`` integration: convo_miner currently doesn't call
  ``_compute_topic_tunnels_for_wing`` either. Adding both calls is a
  separate concerned PR about convo_miner parity, not just hallways.
- Refactoring ``_compute_topic_tunnels_for_wing`` to BUILD ON hallways
  (rather than computing from raw topic words): the architecturally
  meaningful follow-up that completes the Wing → Drawer-entities →
  Hallway → Tunnel sequence. Real refactor, separate PR.

## Tests

Two new RED-first tests in ``tests/test_miner.py``:

  test_mine_computes_hallways_for_wing_post_mine
      Stubs ``mempalace.miner.compute_hallways_for_wing`` via monkeypatch.
      Runs a real mine into a tmp palace. Asserts the stub was called
      exactly once, with the wing name from mempalace.yaml and a live
      ChromaDB collection (not None).

  test_mine_hallway_failure_does_not_crash_mine
      Stubs the hallway function to raise. Runs a real mine. Asserts
      mine() doesn't propagate, and that the drawer write (which
      happens BEFORE the hallway block) still committed.

Both RED before this commit (AttributeError — module had no attribute
``compute_hallways_for_wing``). Both GREEN after.

## Verification

  ruff check mempalace/miner.py tests/test_miner.py
    → All checks passed!
  ruff format --check ...
    → 2 files already formatted (pinned 0.15.9)
  pytest tests/test_miner.py
    → 47 passed (the 2 new + 45 pre-existing)
  pytest -q (full mempalace suite)
    → 1938 passed, 1 skipped, 0 regressions

Linux CI parity replicated locally via OrbStack containers
(Python 3.9, 3.11, 3.13): 2/2 new integration tests pass, ruff clean
on all three.
mvalentsev added a commit to mvalentsev/mempalace that referenced this pull request May 21, 2026
Wires _validate_palace_fts5_after_mine into all three mine entry
points so corrupted-FTS5 palaces cannot silently exit 0 from any
of them:

- _mine_impl (mempalace/miner.py) — project file miner
- mine_convos (mempalace/convo_miner.py) — conversation exports
- mine_formats (mempalace/format_miner.py) — binary office documents
  via --mode extract, introduced on develop by MemPalace#1555 (3.3.6 release)
  between this PR's open date and its rebase

cmd_mine surfaces MineValidationError as exit 1 + the same
print_sqlite_integrity_abort banner cmd_repair already prints,
appended with a mine-specific stderr note that hedges attribution
(quick_check cannot tell pre-existing corruption from corruption
this mine produced). 17 tests in tests/test_miner_fts5_validation.py
cover the helper, the three call sites, dry-run / KeyboardInterrupt
skip semantics, and the MineValidationError constructor invariants.

Closes MemPalace#1537.

Co-authored-by: Caleb Wells <15988028+calebcwells@users.noreply.github.com>
milla-jovovich added a commit that referenced this pull request May 21, 2026
fix(extract): polish PR — address bot review feedback on PR #1555
mvalentsev added a commit to mvalentsev/mempalace that referenced this pull request May 21, 2026
Wires _validate_palace_fts5_after_mine into all three mine entry
points so corrupted-FTS5 palaces cannot silently exit 0 from any
of them:

- _mine_impl (mempalace/miner.py) — project file miner
- mine_convos (mempalace/convo_miner.py) — conversation exports
- mine_formats (mempalace/format_miner.py) — binary office documents
  via --mode extract, introduced on develop by MemPalace#1555 (3.3.6 release)
  between this PR's open date and its rebase

cmd_mine surfaces MineValidationError as exit 1 + the same
print_sqlite_integrity_abort banner cmd_repair already prints,
appended with a mine-specific stderr note that hedges attribution
(quick_check cannot tell pre-existing corruption from corruption
this mine produced). 17 tests in tests/test_miner_fts5_validation.py
cover the helper, the three call sites, dry-run / KeyboardInterrupt
skip semantics, and the MineValidationError constructor invariants.

Closes MemPalace#1537.

Co-authored-by: Caleb Wells <15988028+calebcwells@users.noreply.github.com>
mvalentsev pushed a commit to mvalentsev/mempalace that referenced this pull request May 21, 2026
…#1555

PR MemPalace#1555 (format coverage + virtual line numbering) merged with twelve
inline polish comments from Copilot + gemini-code-assist that weren't
load-bearing enough to block the original ship but are real cleanups.
This PR addresses them.

Twelve items in scope; one item (drawer ID delimiter — Copilot MemPalace#13) is
deferred to its own dedicated PR because it's a breaking schema change
that requires migration design beyond the scope of a polish PR.

## Behavioral fixes (5 items, RED-tested first)

1. **FileNotFoundError vs broken symlink (Copilot MemPalace#8).** ``extract_text``
   previously mapped every ``FileNotFoundError`` from ``stat()`` to
   ``SKIP_BROKEN_SYMLINK``. That's misleading for the common case of a
   regular file deleted between scan and extract. Now distinguishes:
   ``SKIP_BROKEN_SYMLINK`` only when ``p.is_symlink()`` is true;
   ``SKIP_UNREADABLE`` otherwise.

2. **``file_already_mined`` extract_mode scoping (Copilot MemPalace#11, MemPalace#12).**
   Both call sites in ``mine_formats`` and ``_file_chunks_locked`` now
   pass ``extract_mode="format"``. Previously the format miner could
   falsely treat drawers from project / convo miner on the same source
   file as "already mined" (and vice versa). Scopes idempotency to the
   correct drawer subset.

3. **Sentinel skip for transient missing-dep statuses (Copilot MemPalace#14).**
   New ``_TRANSIENT_MISSING_DEP_STATUSES`` set + ``_register_skip_sentinel_if_appropriate``
   helper. Skip variants like ``SKIP_NO_MARKITDOWN`` /
   ``SKIP_NO_STRIPRTF`` / ``SKIP_MISSING_FORMAT_DEPS`` /
   ``SKIP_NETWORK_TIMEOUT`` no longer write the "already-mined" sentinel.
   Otherwise installing the missing extra later wouldn't trigger a re-mine.

4. **Outer ``except Exception`` in ``mine_formats`` (Gemini MemPalace#5).** The
   outer try around the loop previously caught only ``KeyboardInterrupt``,
   leaving any setup-time error (e.g., ``scan_formats`` raising) to
   propagate as a bare traceback. Now catches ``Exception`` defensively,
   logs it, prints a partial-progress summary, and lets the ``finally``
   PID-cleanup run. Mirrors miner.py's belt-and-suspenders pattern.

5. **Thread user's ``chunk_size`` / ``chunk_overlap`` / ``min_chunk_size``
   through to ``chunk_text`` (Gemini MemPalace#3).** ``MempalaceConfig`` was loaded
   only to validate readability; users who tuned their config saw no
   effect in format-mode mining. Now properly threaded.

## Trivial cleanups (5 items)

6. **Path expanduser in ``extract_text`` (Copilot MemPalace#7).** ``Path(path)`` →
   ``Path(path).expanduser()`` so CLI inputs like ``~/docs/file.pdf``
   resolve correctly.

7. **Path expanduser+resolve in ``scan_formats`` (Copilot MemPalace#9).** Same
   fix; ``~/docs`` and relative paths now work consistently.

8. **Use resolved ``format_path`` in ``mine_formats`` (Copilot MemPalace#10).**
   ``scan_formats(format_dir)`` → ``scan_formats(format_path)`` so the
   already-resolved path is used.

9. **``render_with_line_numbers`` type annotation (Copilot MemPalace#15).**
   ``text: "str | None"`` reflects the documented + tested ``None``
   handling.

10. **Test + docs claims (Copilot MemPalace#16, MemPalace#17, MemPalace#18).** Stale framings
    removed:
    - ``docs/format-coverage.md`` — 14 fringe cases + "see the file for
      the current test inventory" (no more frozen test count).
    - ``tests/test_line_numbers.py`` — drops "proposed for mempalace
      3.3.6" + "run from the proposal directory" references.
    - ``tests/test_format_miner.py`` — drops "MarkItDown is mocked
      throughout" (live integration tests exist) + proposal-directory
      framing.

## Module-level hoists (enables clean test patching)

- ``MempalaceConfig`` (from ``.config``) hoisted from lazy local import
  to module-level so tests can patch ``mempalace.format_miner.MempalaceConfig``.
- ``chunk_text`` (from ``.miner``) hoisted similarly.

Both follow the pattern PR MemPalace#1565 used for ``compute_hallways_for_wing``.

## Complexity refactor

Extracted ``_print_mine_summary`` from ``mine_formats`` so the orchestrator
stays under the project's ``max-complexity = 25`` ceiling (per
``pyproject.toml [tool.ruff.lint.mccabe]``). Behavior unchanged; pure
extraction.

## Out of scope (intentionally deferred)

- **Drawer ID delimiter collision (Copilot MemPalace#13)** — ``f"{source_file}{chunk_index}"``
  can theoretically collide (``"/path/a1" + "23"`` == ``"/path/a" + "123"``).
  Fixing this is a breaking schema change to drawer IDs and requires a
  migration plan; will land as its own PR after design.

- The four bot comments that were ALREADY addressed by amendment MemPalace#3
  before the PR MemPalace#1555 merge (``_SKIP_DIRS`` dedup, ``scan_formats``
  symlink skip, ``source_mtime`` tracking, hall+entities metadata) —
  no action needed; verified during audit.

## Tests (RED-first)

Six new RED-first tests in ``tests/test_format_miner.py``:

  test_extract_text_nonexistent_regular_file_returns_unreadable_not_broken_symlink
  test_mine_formats_passes_extract_mode_format_to_file_already_mined
  test_mine_formats_does_not_write_sentinel_for_skip_no_markitdown
  test_mine_formats_does_not_write_sentinel_for_skip_missing_format_deps
  test_mine_formats_catches_unexpected_exception_and_prints_summary
  test_mine_formats_threads_chunk_size_from_user_config

All six RED before this commit (failures correctly identified the bugs
they're targeting), all six GREEN after.

One existing test (``test_mine_formats_continues_after_per_file_error``)
updated to patch the new module-level binding
``mempalace.format_miner.chunk_text`` instead of the old
``mempalace.miner.chunk_text`` source location, and to accept the
``**kwargs`` the call now passes through. Behavior unchanged.

## Verification

  pytest -q (full mempalace suite)
    → 2065 passed, 3 skipped, 0 regressions
  ruff check mempalace/format_miner.py mempalace/searcher.py tests/
    → All checks passed!
  ruff format --check ...
    → 4 files already formatted (pinned 0.15.9)
  mine_formats complexity
    → ≤ 25 (under the project ceiling)
mvalentsev added a commit to mvalentsev/mempalace that referenced this pull request May 22, 2026
Wires _validate_palace_fts5_after_mine into all three mine entry
points so corrupted-FTS5 palaces cannot silently exit 0 from any
of them:

- _mine_impl (mempalace/miner.py) — project file miner
- mine_convos (mempalace/convo_miner.py) — conversation exports
- mine_formats (mempalace/format_miner.py) — binary office documents
  via --mode extract, introduced on develop by MemPalace#1555 (3.3.6 release)
  between this PR's open date and its rebase

cmd_mine surfaces MineValidationError as exit 1 + the same
print_sqlite_integrity_abort banner cmd_repair already prints,
appended with a mine-specific stderr note that hedges attribution
(quick_check cannot tell pre-existing corruption from corruption
this mine produced). 17 tests in tests/test_miner_fts5_validation.py
cover the helper, the three call sites, dry-run / KeyboardInterrupt
skip semantics, and the MineValidationError constructor invariants.

Closes MemPalace#1537.

Co-authored-by: Caleb Wells <15988028+calebcwells@users.noreply.github.com>
mvalentsev added a commit to mvalentsev/mempalace that referenced this pull request May 23, 2026
Wires _validate_palace_fts5_after_mine into all three mine entry
points so corrupted-FTS5 palaces cannot silently exit 0 from any
of them:

- _mine_impl (mempalace/miner.py) — project file miner
- mine_convos (mempalace/convo_miner.py) — conversation exports
- mine_formats (mempalace/format_miner.py) — binary office documents
  via --mode extract, introduced on develop by MemPalace#1555 (3.3.6 release)
  between this PR's open date and its rebase

cmd_mine surfaces MineValidationError as exit 1 + the same
print_sqlite_integrity_abort banner cmd_repair already prints,
appended with a mine-specific stderr note that hedges attribution
(quick_check cannot tell pre-existing corruption from corruption
this mine produced). 17 tests in tests/test_miner_fts5_validation.py
cover the helper, the three call sites, dry-run / KeyboardInterrupt
skip semantics, and the MineValidationError constructor invariants.

Closes MemPalace#1537.

Co-authored-by: Caleb Wells <15988028+calebcwells@users.noreply.github.com>
mvalentsev added a commit to mvalentsev/mempalace that referenced this pull request May 24, 2026
Wires _validate_palace_fts5_after_mine into all three mine entry
points so corrupted-FTS5 palaces cannot silently exit 0 from any
of them:

- _mine_impl (mempalace/miner.py) — project file miner
- mine_convos (mempalace/convo_miner.py) — conversation exports
- mine_formats (mempalace/format_miner.py) — binary office documents
  via --mode extract, introduced on develop by MemPalace#1555 (3.3.6 release)
  between this PR's open date and its rebase

cmd_mine surfaces MineValidationError as exit 1 + the same
print_sqlite_integrity_abort banner cmd_repair already prints,
appended with a mine-specific stderr note that hedges attribution
(quick_check cannot tell pre-existing corruption from corruption
this mine produced). 17 tests in tests/test_miner_fts5_validation.py
cover the helper, the three call sites, dry-run / KeyboardInterrupt
skip semantics, and the MineValidationError constructor invariants.

Closes MemPalace#1537.

Co-authored-by: Caleb Wells <15988028+calebcwells@users.noreply.github.com>
@igorls igorls mentioned this pull request May 24, 2026
3 tasks
arnoldwender pushed a commit to arnoldwender/mempalace that referenced this pull request May 24, 2026
Bumps version 3.3.5 → 3.3.6 across pyproject.toml, version.py, plugin
manifests (.claude-plugin/plugin.json, .claude-plugin/marketplace.json,
.codex-plugin/plugin.json), README badge, and uv.lock. Flips CHANGELOG.md
from ``[Unreleased]`` to ``[3.3.6] — 2026-05-24`` and backfills the
major user-facing entries that landed without changelog entries during
the cycle:

Features:
- MemPalace#1555 office-document mining via --mode extract + virtual line numbers
- MemPalace#1584 surgical closet pointers with date+line locators (Tier 6a)
- MemPalace#1558 + MemPalace#1560 within-wing hallways (entity co-occurrence graph)
- MemPalace#1565 cross-wing tunnels auto-promoted from hallways
- MemPalace#1578 Hebbian potentiation + Ebbinghaus decay on hallways/tunnels
- MemPalace#1236 API-tool transcripts auto-route to wing_api
- MemPalace#711 hooks.auto_save toggle for silent-mode sessions
- MemPalace#1605 COCA content-word filter for entity detection
- MemPalace#1557 case-insensitive entity matching at mine time
- MemPalace#1483 multilingual embeddings (embeddinggemma-300m) by default

Bug Fixes (selected, user-visible):
- MemPalace#1540 silent data loss in three unchunked upsert sites
- MemPalace#1538 paragraph chunker oversized chunks
- MemPalace#1554 per-file chunk cap too low for transcripts
- MemPalace#1562 Windows hook subprocess/ChromaDB deadlock
- MemPalace#1529 create_tunnel corrupted hyphenated wing names
- MemPalace#1424 save-hook truncated hyphenated project folders
- MemPalace#1383 KG cache duplicated graphs for symlinked/cased paths
- MemPalace#1466 silent symlink skip now logged
- MemPalace#1441 macOS stock-bash 3.2 hook compatibility
- MemPalace#1500 / MemPalace#1513 structured JSON-RPC errors on bad MCP input
- MemPalace#1523 VACUUM + FTS5 rebuild after repair
- MemPalace#1548 FTS5 validation at end of mine
- plus MemPalace#1216, MemPalace#1408, MemPalace#1438, MemPalace#1439, MemPalace#1445, MemPalace#1452, MemPalace#1459, MemPalace#1461, MemPalace#1466,
  MemPalace#1470, MemPalace#1477, MemPalace#1485, MemPalace#1500, MemPalace#1513, MemPalace#1528, MemPalace#1532, MemPalace#1543, MemPalace#1546, MemPalace#1585

Performance:
- MemPalace#1474 convo miner pre-fetches mined-set
- MemPalace#1487 rebuild_index progress callback
- MemPalace#1530 MCP cold-start diagnostics + opt-in warmup

Lint passes (ruff 0.15.14); mempalace-mcp entry point alignment
verified per RELEASING.md.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants