Skip to content

Feat: respect .gitignore files during mining#80

Closed
sha2fiddy wants to merge 1 commit into
MemPalace:mainfrom
sha2fiddy:feature/gitignore-support
Closed

Feat: respect .gitignore files during mining#80
sha2fiddy wants to merge 1 commit into
MemPalace:mainfrom
sha2fiddy:feature/gitignore-support

Conversation

@sha2fiddy

Copy link
Copy Markdown
Contributor

Summary

Closes #56.

Adds .gitignore support to scan_project() so that files and directories already ignored by each repo's .gitignore are automatically excluded during mining. This is especially useful when mining a parent directory containing multiple repos — previously, thousands of build artifacts, caches, and data files would get indexed.

  • Uses the pathspec library (pure Python, no transitive deps) for gitignore pattern matching
  • Stack-based: .gitignore files are loaded as os.walk descends, so nested rules in subdirectories are honored additively
  • SKIP_DIRS remains as a fast baseline filter applied first
  • On by default, with --no-gitignore CLI flag to opt out
  • Expands SKIP_DIRS with common Python/JS/IDE cache directories (.ruff_cache, .mypy_cache, .pytest_cache, etc.)

Note on new dependency: CONTRIBUTING.md says "ChromaDB + PyYAML only — don't add new deps without discussion." pathspec is lightweight (pure Python, zero transitive deps, ~30KB), and is the de facto standard for gitignore pattern matching. A stdlib-only approach would require hand-rolling the full gitignore spec (negation patterns, ** globbing, directory-only patterns, rooted patterns), which is non-trivial and error-prone.

Before / After (real-world test)

Mining a directory containing 14 git repos:

Files Drawers
Before 7,431 130,953
After 3,683 89,103
Reduction -50% -32%

Test plan

  • 7 new/existing tests pass (pytest tests/test_miner.py -v)
    • .gitignore patterns exclude files and directories
    • --no-gitignore flag disables the feature
    • Nested .gitignore files stack correctly
    • No .gitignore present = current behavior unchanged
    • SKIP_DIRS still apply alongside gitignore
    • Wildcard patterns work
  • Dry run verified against real multi-repo directory
  • No dbt/build artifacts in dry run output

🤖 Generated with Claude Code

Add pathspec-based .gitignore support to scan_project(). Gitignore
files are loaded as a stack during os.walk, so nested .gitignore
rules in subdirectories are honored additively. SKIP_DIRS remains
as a fast baseline filter applied first.

- Add pathspec>=0.11.0 dependency
- Add _load_gitignore() and _is_gitignored() helpers
- Add respect_gitignore param to scan_project() and mine()
- Add --no-gitignore CLI flag to opt out
- Expand SKIP_DIRS with common cache/build directories
- Add 6 new scan_project() tests

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
GoodOlClint added a commit to GoodOlClint/mempalace that referenced this pull request Apr 7, 2026
Adds pathspec-based .gitignore support to scan_project(). Nested
.gitignore files are honored additively. Expands SKIP_DIRS with
common cache/build directories. Use --no-gitignore to override.

Upstream: MemPalace#80

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@bensig

bensig commented Apr 7, 2026

Copy link
Copy Markdown
Contributor

Thanks for this — we merged #78 which covers the same gitignore support. Closing as duplicate.

@bensig bensig closed this Apr 7, 2026
@sha2fiddy sha2fiddy deleted the feature/gitignore-support branch April 18, 2026 16:20
sha2fiddy pushed a commit to sha2fiddy/mempalace that referenced this pull request May 23, 2026
Closet pointer lines gain a 4th pipe-separated segment of shape
``YYYY-MM-DD:Lstart-Lend`` so retrieval can jump straight to the right
span in the source file instead of opening the whole drawer. The date
is selected via a content-aware hierarchy so legacy ingests (where
``filed_at`` is misleading) still produce honest pointers when the
content itself carries a date.

The legacy 3-segment format remains the fallback for drawers without
the new metadata, so existing palaces keep working without a re-mine.

## Pointer shape

Before:
  topic|entities|→drawer_ids

After (when drawer metadata carries line range + a date):
  topic|entities|2024-11-08:L42-L78|→drawer_ids

Backward compat: when ``drawer_metas`` is not passed, or the first
meta lacks ``line_start`` / ``line_end``, ``build_closet_lines`` emits
the legacy 3-segment form unchanged.

## Date-source hierarchy (NEW)

A new helper ``mempalace.miner._extract_content_date(source_file, content)``
selects a date once per file and stores it as ``content_date`` on every
drawer mined from that file. Priority order, first match wins:

  1. **Filename** — ISO regex on stem, then dateutil fuzzy parse for
     natural-language formats. Handles ``2024-11-08.md``,
     ``April-6th-2011-notes.md``, ``Nov-8-2024.md``, etc.

  2. **YAML frontmatter** — looks at ``date`` / ``created`` /
     ``published`` fields. yaml.safe_load already parses ISO dates as
     ``datetime.date`` objects; dateutil handles string values.

  3. **Content body, first ~10 lines** —
       a. ISO regex (``YYYY-MM-DD`` / ``YYYY/MM/DD`` / ``YYYY.MM.DD``)
       b. Slash dates with locale auto-disambiguation: if any
          first-number > 12 appears anywhere in the head, lock the
          file's locale to DD/MM (the only consistent reading);
          otherwise default to US MM/DD for two-digit-year compactness.
       c. dateutil fuzzy parse for natural-language
          (``November 8, 2024``, ``8th November 2024``, etc.)

  4. **Filesystem mtime** — ``os.path.getmtime`` formatted as
     ``YYYY-MM-DD``.

  5. **None** — caller (``_build_drawer_metadata``) leaves
     ``content_date`` absent; ``build_closet_lines`` falls back to the
     ``filed_at`` ingestion timestamp's date portion.

``_build_date_line_segment`` in ``palace.py`` now prefers ``content_date``
when present, otherwise uses ``filed_at[:10]``. This is the load-bearing
change that makes Tier 6a meaningful for legacy content — a Nov 8 2024
transcript mined today now emits ``2024-11-08:L42-L78`` instead of
``2026-05-22:L42-L78``.

## Changes

1. ``mempalace/miner.py::chunk_text`` — each chunk dict now also carries
   ``line_start`` / ``line_end`` (1-indexed line numbers in the stripped
   source). Computed via ``content[:offset].count("\n") + 1`` —
   approximate locator accurate to ±1 line at chunk boundaries.

2. ``mempalace/miner.py::_extract_content_date`` (NEW) + helpers
   (``_try_filename_date``, ``_try_frontmatter_date``,
   ``_try_content_body_date``, ``_try_mtime_date``). Pure functions, no
   I/O beyond what's named. Uses lazy imports of ``dateutil`` and
   ``yaml`` to keep top-level import surface unchanged.

3. ``mempalace/miner.py::_build_drawer_metadata`` — gains
   ``line_start`` / ``line_end`` / ``content_date`` optional kwargs.
   When provided, stored in drawer metadata; when omitted, keys are
   absent from the dict — backward compatible.

4. ``mempalace/miner.py`` batched-upsert path — calls
   ``_extract_content_date(source_file, content)`` ONCE per file (not
   once per chunk) and passes the result through to every chunk's
   ``_build_drawer_metadata`` call.

5. ``mempalace/format_miner.py::_file_chunks_locked`` — gains a
   ``content`` param so the format-miner path can also run
   ``_extract_content_date`` on the full extracted text. Plumbs
   ``content_date`` into every batched drawer's metadata. Caller passes
   ``text`` (the full extracted markdown).

6. ``mempalace/palace.py::build_closet_lines`` — accepts optional
   ``drawer_metas`` (parallel to ``drawer_ids``). When present and the
   first meta has both ``line_start`` and ``line_end``, the new
   segment is spliced in; otherwise legacy 3-segment form. Helper
   ``_build_date_line_segment`` prefers ``content_date`` over
   ``filed_at`` for the date portion.

## Tests added (RED-first, then GREEN)

  tests/test_miner.py::TestChunkTextLineRanges          (4 tests)
  tests/test_miner.py::TestBuildDrawerMetadataLineRange (2 tests)
  tests/test_miner.py::TestExtractContentDate           (15 tests)
    - filename ISO / natural-language / compact
    - YAML frontmatter date / created
    - Claude session preamble
    - ISO + natural-language content body
    - Slash-date locale disambiguation (DD/MM lock vs default MM/DD)
    - Priority order (filename > frontmatter > content > mtime)
    - mtime fallback
    - Graceful None returns for missing file / empty content

  tests/test_closets.py::TestBuildClosetLines           (5 new tests)
    - includes_date_line_segment_when_metas_provided
    - falls_back_to_3_segment_format_when_metas_missing  (compat)
    - falls_back_to_3_segment_format_when_metas_lack_line_keys (compat)
    - date_segment_uses_filed_at_date_portion_only
    - content_date_preferred_over_filed_at               (new — Tier 6a hierarchy)

26 new tests total. All RED before this commit, all GREEN after.

## Verification

  pytest tests/test_miner.py::TestChunkTextLineRanges
         tests/test_miner.py::TestBuildDrawerMetadataLineRange
         tests/test_miner.py::TestExtractContentDate
         tests/test_closets.py::TestBuildClosetLines
    → 32 passed

  pytest tests/test_miner.py tests/test_closets.py
         tests/test_format_miner.py tests/test_palace.py
    → 230 passed, 2 skipped, 0 regressions in directly-affected files

  ruff check + ruff format --check
    → All checks passed, 5 files already formatted

  End-to-end smoke test on a YAML-frontmatter file (date: 2024-11-08,
  mined "today"):
    content_date extracted: 2024-11-08
    chunks produced: 1
    drawer meta: content_date=2024-11-08, filed_at='2026-05-22...', lines=1-15
    closet pointer:
      "conversation with lumi about brands of dog food|Filler;Lumi;Aya|2024-11-08:L1-L15|→drawer_a,drawer_b"
    → content_date correctly preferred over filed_at end-to-end.

  OrbStack triple-Linux verify (Py 3.9 / 3.11 / 3.13)
    → 32 targeted Tier 6a + content-date tests pass on each.

## Out of scope (deliberate)

- No schema migration of existing drawers. Pre-Tier-6a drawers lack
  ``line_start`` / ``content_date`` and silently fall back to the
  legacy 3-segment pointer shape.
- No re-mine required. Users keep existing palaces; new files mined
  after this lands carry the new fields.
- No drawer ID change (Task MemPalace#80, separate concern).
- No exact-quote line positioning within a chunk (future tier).
- No NLP-based date extraction (only regex + dateutil fuzzy).
- Linguistic-similarity / concept-recognition retrieval (L5 zettal
  layer) is the natural next step but lives in its own PR family.
milla-jovovich added a commit that referenced this pull request May 30, 2026
Drawer/closet/triple IDs built by concatenating strings without a
delimiter before hashing can collide: hash((s1 + str(i1))) ==
hash((s2 + str(i2))) whenever s1+str(i1) == s2+str(i2). Concretely,
source_file="foo" + chunk_index=12 produces the same hash input as
source_file="foo1" + chunk_index=2 — both yield "foo12". Under
ChromaDB's primary-key constraint the second write silently
overwrites the first; one chunk's content is lost without surfacing
an error (the surrounding broad `except Exception:` blocks swallow
any backend complaint).

The defect class is "string + string concat fed to a hash without
a delimiter." Per the styleguide's partial-scope-key-migration rule,
every site sharing the pattern is a candidate and must be triaged —
not just the ones the original issue named.

FIX — 6 sites
- mempalace/miner.py:1253        drawer_id, add_drawer() back-compat
- mempalace/miner.py:1386        drawer_id, batched mine loop
- mempalace/miner.py:1416        drawer_ids, closet-build re-derivation
- mempalace/format_miner.py:643  drawer_id, format_miner batched loop
- mempalace/mcp_server.py:1136   drawer_id, tool_add_drawer
- mempalace/knowledge_graph.py:305  triple_id, KG triple insertion

MIGRATED FOR CONSISTENCY — 2 sites
- mempalace/convo_miner.py:87    sentinel_key — was `:`, now `|`
- mempalace/convo_miner.py:422   drawer_key   — was `:`, now `|`

Pre-existing delimiter was `:`. Migrated because (a) source_file can
contain literal `:` on Windows paths (`C:\Users\…`) and URL-like
sources (`https://host:8080`), weakening `:` as a separator; (b) `|`
is reserved in Windows filenames so source_file cannot contain it,
making it strictly safer.

DELIMITER CHOICE
- `|` chosen over `:` (Windows / URL source_file edge cases)
- `|` chosen over `\x00` (NUL breaks log greppability + print()
  debug + downstream string handling for marginal extra safety)
- Matches the established precedent in mempalace/diary_ingest.py
  (lines 52, 76, 91, 98 — all already on `|`)

EXEMPT — audited and correct as-is
  Single-input hashes (nothing to delimit):
    - mempalace/miner.py:1432         closet_id (source_file only)
    - mempalace/format_miner.py:559   sentinel_id (source_file only)
    - mempalace/palace.py:433         lock filename (source_file only)
    - mempalace/palace.py:629         palace_key (lock_key_source only)
    - mempalace/diary_ingest.py:158   content_hash (text only)
    - mempalace/hooks_cli.py:329      pidfile digest (joined cmd only)
    - mempalace/sources/context.py:141  record digest (source_file only)

  Already correctly delimited:
    - mempalace/hallways.py:157     `f"{wing}::{a}::{b}"`  (`::`)
    - mempalace/palace_graph.py:454 `f"{a}↔{b}"`           (`↔`)
    - mempalace/diary_ingest.py:52,76,91,98                (`|` precedent)

  Protected by composition (uniqueness guaranteed by the ID prefix,
  not by the hash slice):
    - mempalace/mcp_server.py:1635  entry_id is
      `diary_{wing}_{strftime('%Y%m%d_%H%M%S%f')}_{sha[:12]}`.
      Microsecond-resolution timestamp prefix supplies uniqueness;
      the trailing hash is a content-discriminator, not the
      write-time uniqueness guarantor.

NEW MODULES
- mempalace/ids.py — single source of truth for ID construction.
  Five helpers (make_drawer_id_from_chunk, make_drawer_id_from_content,
  make_convo_drawer_id, make_convo_sentinel_id, make_triple_id) plus
  an ID_RECIPE = "v2" constant.
- mempalace/collision_scan.py — pre-mining defense. Runs immediately
  before each batched ChromaDB upsert; raises CollisionError naming
  the colliding (source_file, chunk_index) pairs if any proposed
  drawer_id appears more than once with conflicting metadata across
  the union of incoming and existing rows.

DETECTION
ChromaDB's primary-key constraint means past collisions were
silently resolved at upsert time (last-write-wins), erasing
evidence. A post-hoc audit cannot reconstruct what was lost from
palace state alone. Therefore:

- Pre-mining risk scan. Before each batched upsert, compute the
  proposed drawer_ids for the incoming chunk set AND query existing
  drawer_ids from the collection. If any proposed id appears more
  than once in the union (incoming-vs-incoming or incoming-vs-
  existing) with conflicting (source_file, chunk_index), abort the
  mine with an actionable error naming the colliding pairs.
  Collision is caught BEFORE it destroys data, which is the only
  point at which palace state still carries the evidence.

- New metadata key: `"id_recipe": "v2"` on every drawer written
  under the delimited recipe. Audits compare like-for-like;
  drawers without `id_recipe` are treated as v1 legacy (undelimited
  or `:`-delimited), not as collisions.

- Honest disclosure: palaces mined under any pre-v2 mempalace may
  carry silent past collisions whose original content is
  unrecoverable from palace state. Future library tier work will
  give users a per-drawer audit + opt-in archival path.

TESTS
- tests/test_ids.py: 17 unit tests covering all 5 helpers, the
  ID_RECIPE constant, the private `_delimited_sha256` helper, and
  the four defect-class collision shapes (chunk_index boundary,
  content boundary, extract_mode boundary, ISO datetime boundary).
  RED collision tests confirm the v2 recipe breaks the defect class.
- tests/test_collision_scan.py: 10 tests covering clean batches,
  idempotent re-mines, incoming-vs-incoming collisions, incoming-vs-
  existing collisions, error-message quality, empty batches,
  metadata without chunk_index, and ChromaDB backend errors
  propagating cleanly.
- tests/test_convo_miner_unit.py: FakeCol gains a stub `get()` so
  the pre-mining scan can probe an empty in-test collection.

BACKWARDS COMPATIBILITY
- Existing v1 drawers remain queryable; no rewrite of stored IDs.
- New mining passes write v2 drawers alongside v1 via PR #1628's
  additive-mining model.
- No user action required; opt-in cleanup ships separately.

VERIFICATION
- macOS Python 3.12 (local): 2304 passed, 3 skipped, 0 failed,
  coverage 85.44% (above 80% threshold).
- Linux Python 3.9/3.11/3.13 (OrbStack with full `pip install -e
  '.[dev]'` flow): see PR body for per-version pass counts.
- `ruff check .` + `ruff format --check .`: clean.
- pylint score: ids.py 10.00/10, collision_scan.py 10.00/10; all
  modified files >= 8.92 (no regression).
- bandit: clean on all new files; the one MEDIUM finding in
  knowledge_graph.py is on lines 385/407 (pre-existing SQL string
  construction in timeline queries), not in code touched by this PR.
- pre_push_check.sh: ALL CHECKS PASSED.

Refs: deferred from PR #1572 (Copilot finding #13). Styleguide audit
documented above: 8 fix/migration + 11 enumerated exempt sites.
igorls added a commit that referenced this pull request Jun 6, 2026
…limiter

fix(ids): delimit hash inputs to prevent drawer_id collisions (#80)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Allow external exlude list for mining operations

2 participants