Feat: respect .gitignore files during mining by sha2fiddy · Pull Request #80 · MemPalace/mempalace

sha2fiddy · 2026-04-07T14:31:13Z

Summary

Closes #56.

Adds .gitignore support to scan_project() so that files and directories already ignored by each repo's .gitignore are automatically excluded during mining. This is especially useful when mining a parent directory containing multiple repos — previously, thousands of build artifacts, caches, and data files would get indexed.

Uses the pathspec library (pure Python, no transitive deps) for gitignore pattern matching
Stack-based: .gitignore files are loaded as os.walk descends, so nested rules in subdirectories are honored additively
SKIP_DIRS remains as a fast baseline filter applied first
On by default, with --no-gitignore CLI flag to opt out
Expands SKIP_DIRS with common Python/JS/IDE cache directories (.ruff_cache, .mypy_cache, .pytest_cache, etc.)

Note on new dependency: CONTRIBUTING.md says "ChromaDB + PyYAML only — don't add new deps without discussion." pathspec is lightweight (pure Python, zero transitive deps, ~30KB), and is the de facto standard for gitignore pattern matching. A stdlib-only approach would require hand-rolling the full gitignore spec (negation patterns, ** globbing, directory-only patterns, rooted patterns), which is non-trivial and error-prone.

Before / After (real-world test)

Mining a directory containing 14 git repos:

	Files	Drawers
Before	7,431	130,953
After	3,683	89,103
Reduction	-50%	-32%

Test plan

7 new/existing tests pass (pytest tests/test_miner.py -v)
- .gitignore patterns exclude files and directories
- --no-gitignore flag disables the feature
- Nested .gitignore files stack correctly
- No .gitignore present = current behavior unchanged
- SKIP_DIRS still apply alongside gitignore
- Wildcard patterns work
Dry run verified against real multi-repo directory
No dbt/build artifacts in dry run output

🤖 Generated with Claude Code

Add pathspec-based .gitignore support to scan_project(). Gitignore files are loaded as a stack during os.walk, so nested .gitignore rules in subdirectories are honored additively. SKIP_DIRS remains as a fast baseline filter applied first. - Add pathspec>=0.11.0 dependency - Add _load_gitignore() and _is_gitignored() helpers - Add respect_gitignore param to scan_project() and mine() - Add --no-gitignore CLI flag to opt out - Expand SKIP_DIRS with common cache/build directories - Add 6 new scan_project() tests Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Adds pathspec-based .gitignore support to scan_project(). Nested .gitignore files are honored additively. Expands SKIP_DIRS with common cache/build directories. Use --no-gitignore to override. Upstream: MemPalace#80 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

bensig · 2026-04-07T19:34:27Z

Thanks for this — we merged #78 which covers the same gitignore support. Closing as duplicate.

Closet pointer lines gain a 4th pipe-separated segment of shape ``YYYY-MM-DD:Lstart-Lend`` so retrieval can jump straight to the right span in the source file instead of opening the whole drawer. The date is selected via a content-aware hierarchy so legacy ingests (where ``filed_at`` is misleading) still produce honest pointers when the content itself carries a date. The legacy 3-segment format remains the fallback for drawers without the new metadata, so existing palaces keep working without a re-mine. ## Pointer shape Before: topic|entities|→drawer_ids After (when drawer metadata carries line range + a date): topic|entities|2024-11-08:L42-L78|→drawer_ids Backward compat: when ``drawer_metas`` is not passed, or the first meta lacks ``line_start`` / ``line_end``, ``build_closet_lines`` emits the legacy 3-segment form unchanged. ## Date-source hierarchy (NEW) A new helper ``mempalace.miner._extract_content_date(source_file, content)`` selects a date once per file and stores it as ``content_date`` on every drawer mined from that file. Priority order, first match wins: 1. **Filename** — ISO regex on stem, then dateutil fuzzy parse for natural-language formats. Handles ``2024-11-08.md``, ``April-6th-2011-notes.md``, ``Nov-8-2024.md``, etc. 2. **YAML frontmatter** — looks at ``date`` / ``created`` / ``published`` fields. yaml.safe_load already parses ISO dates as ``datetime.date`` objects; dateutil handles string values. 3. **Content body, first ~10 lines** — a. ISO regex (``YYYY-MM-DD`` / ``YYYY/MM/DD`` / ``YYYY.MM.DD``) b. Slash dates with locale auto-disambiguation: if any first-number > 12 appears anywhere in the head, lock the file's locale to DD/MM (the only consistent reading); otherwise default to US MM/DD for two-digit-year compactness. c. dateutil fuzzy parse for natural-language (``November 8, 2024``, ``8th November 2024``, etc.) 4. **Filesystem mtime** — ``os.path.getmtime`` formatted as ``YYYY-MM-DD``. 5. **None** — caller (``_build_drawer_metadata``) leaves ``content_date`` absent; ``build_closet_lines`` falls back to the ``filed_at`` ingestion timestamp's date portion. ``_build_date_line_segment`` in ``palace.py`` now prefers ``content_date`` when present, otherwise uses ``filed_at[:10]``. This is the load-bearing change that makes Tier 6a meaningful for legacy content — a Nov 8 2024 transcript mined today now emits ``2024-11-08:L42-L78`` instead of ``2026-05-22:L42-L78``. ## Changes 1. ``mempalace/miner.py::chunk_text`` — each chunk dict now also carries ``line_start`` / ``line_end`` (1-indexed line numbers in the stripped source). Computed via ``content[:offset].count("\n") + 1`` — approximate locator accurate to ±1 line at chunk boundaries. 2. ``mempalace/miner.py::_extract_content_date`` (NEW) + helpers (``_try_filename_date``, ``_try_frontmatter_date``, ``_try_content_body_date``, ``_try_mtime_date``). Pure functions, no I/O beyond what's named. Uses lazy imports of ``dateutil`` and ``yaml`` to keep top-level import surface unchanged. 3. ``mempalace/miner.py::_build_drawer_metadata`` — gains ``line_start`` / ``line_end`` / ``content_date`` optional kwargs. When provided, stored in drawer metadata; when omitted, keys are absent from the dict — backward compatible. 4. ``mempalace/miner.py`` batched-upsert path — calls ``_extract_content_date(source_file, content)`` ONCE per file (not once per chunk) and passes the result through to every chunk's ``_build_drawer_metadata`` call. 5. ``mempalace/format_miner.py::_file_chunks_locked`` — gains a ``content`` param so the format-miner path can also run ``_extract_content_date`` on the full extracted text. Plumbs ``content_date`` into every batched drawer's metadata. Caller passes ``text`` (the full extracted markdown). 6. ``mempalace/palace.py::build_closet_lines`` — accepts optional ``drawer_metas`` (parallel to ``drawer_ids``). When present and the first meta has both ``line_start`` and ``line_end``, the new segment is spliced in; otherwise legacy 3-segment form. Helper ``_build_date_line_segment`` prefers ``content_date`` over ``filed_at`` for the date portion. ## Tests added (RED-first, then GREEN) tests/test_miner.py::TestChunkTextLineRanges (4 tests) tests/test_miner.py::TestBuildDrawerMetadataLineRange (2 tests) tests/test_miner.py::TestExtractContentDate (15 tests) - filename ISO / natural-language / compact - YAML frontmatter date / created - Claude session preamble - ISO + natural-language content body - Slash-date locale disambiguation (DD/MM lock vs default MM/DD) - Priority order (filename > frontmatter > content > mtime) - mtime fallback - Graceful None returns for missing file / empty content tests/test_closets.py::TestBuildClosetLines (5 new tests) - includes_date_line_segment_when_metas_provided - falls_back_to_3_segment_format_when_metas_missing (compat) - falls_back_to_3_segment_format_when_metas_lack_line_keys (compat) - date_segment_uses_filed_at_date_portion_only - content_date_preferred_over_filed_at (new — Tier 6a hierarchy) 26 new tests total. All RED before this commit, all GREEN after. ## Verification pytest tests/test_miner.py::TestChunkTextLineRanges tests/test_miner.py::TestBuildDrawerMetadataLineRange tests/test_miner.py::TestExtractContentDate tests/test_closets.py::TestBuildClosetLines → 32 passed pytest tests/test_miner.py tests/test_closets.py tests/test_format_miner.py tests/test_palace.py → 230 passed, 2 skipped, 0 regressions in directly-affected files ruff check + ruff format --check → All checks passed, 5 files already formatted End-to-end smoke test on a YAML-frontmatter file (date: 2024-11-08, mined "today"): content_date extracted: 2024-11-08 chunks produced: 1 drawer meta: content_date=2024-11-08, filed_at='2026-05-22...', lines=1-15 closet pointer: "conversation with lumi about brands of dog food|Filler;Lumi;Aya|2024-11-08:L1-L15|→drawer_a,drawer_b" → content_date correctly preferred over filed_at end-to-end. OrbStack triple-Linux verify (Py 3.9 / 3.11 / 3.13) → 32 targeted Tier 6a + content-date tests pass on each. ## Out of scope (deliberate) - No schema migration of existing drawers. Pre-Tier-6a drawers lack ``line_start`` / ``content_date`` and silently fall back to the legacy 3-segment pointer shape. - No re-mine required. Users keep existing palaces; new files mined after this lands carry the new fields. - No drawer ID change (Task MemPalace#80, separate concern). - No exact-quote line positioning within a chunk (future tier). - No NLP-based date extraction (only regex + dateutil fuzzy). - Linguistic-similarity / concept-recognition retrieval (L5 zettal layer) is the natural next step but lives in its own PR family.

Drawer/closet/triple IDs built by concatenating strings without a delimiter before hashing can collide: hash((s1 + str(i1))) == hash((s2 + str(i2))) whenever s1+str(i1) == s2+str(i2). Concretely, source_file="foo" + chunk_index=12 produces the same hash input as source_file="foo1" + chunk_index=2 — both yield "foo12". Under ChromaDB's primary-key constraint the second write silently overwrites the first; one chunk's content is lost without surfacing an error (the surrounding broad `except Exception:` blocks swallow any backend complaint). The defect class is "string + string concat fed to a hash without a delimiter." Per the styleguide's partial-scope-key-migration rule, every site sharing the pattern is a candidate and must be triaged — not just the ones the original issue named. FIX — 6 sites - mempalace/miner.py:1253 drawer_id, add_drawer() back-compat - mempalace/miner.py:1386 drawer_id, batched mine loop - mempalace/miner.py:1416 drawer_ids, closet-build re-derivation - mempalace/format_miner.py:643 drawer_id, format_miner batched loop - mempalace/mcp_server.py:1136 drawer_id, tool_add_drawer - mempalace/knowledge_graph.py:305 triple_id, KG triple insertion MIGRATED FOR CONSISTENCY — 2 sites - mempalace/convo_miner.py:87 sentinel_key — was `:`, now `|` - mempalace/convo_miner.py:422 drawer_key — was `:`, now `|` Pre-existing delimiter was `:`. Migrated because (a) source_file can contain literal `:` on Windows paths (`C:\Users\…`) and URL-like sources (`https://host:8080`), weakening `:` as a separator; (b) `|` is reserved in Windows filenames so source_file cannot contain it, making it strictly safer. DELIMITER CHOICE - `|` chosen over `:` (Windows / URL source_file edge cases) - `|` chosen over `\x00` (NUL breaks log greppability + print() debug + downstream string handling for marginal extra safety) - Matches the established precedent in mempalace/diary_ingest.py (lines 52, 76, 91, 98 — all already on `|`) EXEMPT — audited and correct as-is Single-input hashes (nothing to delimit): - mempalace/miner.py:1432 closet_id (source_file only) - mempalace/format_miner.py:559 sentinel_id (source_file only) - mempalace/palace.py:433 lock filename (source_file only) - mempalace/palace.py:629 palace_key (lock_key_source only) - mempalace/diary_ingest.py:158 content_hash (text only) - mempalace/hooks_cli.py:329 pidfile digest (joined cmd only) - mempalace/sources/context.py:141 record digest (source_file only) Already correctly delimited: - mempalace/hallways.py:157 `f"{wing}::{a}::{b}"` (`::`) - mempalace/palace_graph.py:454 `f"{a}↔{b}"` (`↔`) - mempalace/diary_ingest.py:52,76,91,98 (`|` precedent) Protected by composition (uniqueness guaranteed by the ID prefix, not by the hash slice): - mempalace/mcp_server.py:1635 entry_id is `diary_{wing}_{strftime('%Y%m%d_%H%M%S%f')}_{sha[:12]}`. Microsecond-resolution timestamp prefix supplies uniqueness; the trailing hash is a content-discriminator, not the write-time uniqueness guarantor. NEW MODULES - mempalace/ids.py — single source of truth for ID construction. Five helpers (make_drawer_id_from_chunk, make_drawer_id_from_content, make_convo_drawer_id, make_convo_sentinel_id, make_triple_id) plus an ID_RECIPE = "v2" constant. - mempalace/collision_scan.py — pre-mining defense. Runs immediately before each batched ChromaDB upsert; raises CollisionError naming the colliding (source_file, chunk_index) pairs if any proposed drawer_id appears more than once with conflicting metadata across the union of incoming and existing rows. DETECTION ChromaDB's primary-key constraint means past collisions were silently resolved at upsert time (last-write-wins), erasing evidence. A post-hoc audit cannot reconstruct what was lost from palace state alone. Therefore: - Pre-mining risk scan. Before each batched upsert, compute the proposed drawer_ids for the incoming chunk set AND query existing drawer_ids from the collection. If any proposed id appears more than once in the union (incoming-vs-incoming or incoming-vs- existing) with conflicting (source_file, chunk_index), abort the mine with an actionable error naming the colliding pairs. Collision is caught BEFORE it destroys data, which is the only point at which palace state still carries the evidence. - New metadata key: `"id_recipe": "v2"` on every drawer written under the delimited recipe. Audits compare like-for-like; drawers without `id_recipe` are treated as v1 legacy (undelimited or `:`-delimited), not as collisions. - Honest disclosure: palaces mined under any pre-v2 mempalace may carry silent past collisions whose original content is unrecoverable from palace state. Future library tier work will give users a per-drawer audit + opt-in archival path. TESTS - tests/test_ids.py: 17 unit tests covering all 5 helpers, the ID_RECIPE constant, the private `_delimited_sha256` helper, and the four defect-class collision shapes (chunk_index boundary, content boundary, extract_mode boundary, ISO datetime boundary). RED collision tests confirm the v2 recipe breaks the defect class. - tests/test_collision_scan.py: 10 tests covering clean batches, idempotent re-mines, incoming-vs-incoming collisions, incoming-vs- existing collisions, error-message quality, empty batches, metadata without chunk_index, and ChromaDB backend errors propagating cleanly. - tests/test_convo_miner_unit.py: FakeCol gains a stub `get()` so the pre-mining scan can probe an empty in-test collection. BACKWARDS COMPATIBILITY - Existing v1 drawers remain queryable; no rewrite of stored IDs. - New mining passes write v2 drawers alongside v1 via PR #1628's additive-mining model. - No user action required; opt-in cleanup ships separately. VERIFICATION - macOS Python 3.12 (local): 2304 passed, 3 skipped, 0 failed, coverage 85.44% (above 80% threshold). - Linux Python 3.9/3.11/3.13 (OrbStack with full `pip install -e '.[dev]'` flow): see PR body for per-version pass counts. - `ruff check .` + `ruff format --check .`: clean. - pylint score: ids.py 10.00/10, collision_scan.py 10.00/10; all modified files >= 8.92 (no regression). - bandit: clean on all new files; the one MEDIUM finding in knowledge_graph.py is on lines 385/407 (pre-existing SQL string construction in timeline queries), not in code touched by this PR. - pre_push_check.sh: ALL CHECKS PASSED. Refs: deferred from PR #1572 (Copilot finding #13). Styleguide audit documented above: 8 fix/migration + 11 enumerated exempt sites.

…limiter fix(ids): delimit hash inputs to prevent drawer_id collisions (#80)

sha2fiddy mentioned this pull request Apr 7, 2026

Respect nested .gitignore rules when mining project files #78

Merged

3 tasks

bensig closed this Apr 7, 2026

sha2fiddy deleted the feature/gitignore-support branch April 18, 2026 16:20

igorls added a commit that referenced this pull request Jun 6, 2026

Merge pull request #1666 from MemPalace/fix/80-drawer-id-collision-de…

6434370

…limiter fix(ids): delimit hash inputs to prevent drawer_id collisions (#80)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Feat: respect .gitignore files during mining#80

Feat: respect .gitignore files during mining#80
sha2fiddy wants to merge 1 commit into
MemPalace:mainfrom
sha2fiddy:feature/gitignore-support

sha2fiddy commented Apr 7, 2026

Uh oh!

bensig commented Apr 7, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

sha2fiddy commented Apr 7, 2026

Summary

Before / After (real-world test)

Test plan

Uh oh!

bensig commented Apr 7, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants