Feat: respect .gitignore files during mining#80
Closed
sha2fiddy wants to merge 1 commit into
Closed
Conversation
Add pathspec-based .gitignore support to scan_project(). Gitignore files are loaded as a stack during os.walk, so nested .gitignore rules in subdirectories are honored additively. SKIP_DIRS remains as a fast baseline filter applied first. - Add pathspec>=0.11.0 dependency - Add _load_gitignore() and _is_gitignored() helpers - Add respect_gitignore param to scan_project() and mine() - Add --no-gitignore CLI flag to opt out - Expand SKIP_DIRS with common cache/build directories - Add 6 new scan_project() tests Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
3 tasks
GoodOlClint
added a commit
to GoodOlClint/mempalace
that referenced
this pull request
Apr 7, 2026
Adds pathspec-based .gitignore support to scan_project(). Nested .gitignore files are honored additively. Expands SKIP_DIRS with common cache/build directories. Use --no-gitignore to override. Upstream: MemPalace#80 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Contributor
|
Thanks for this — we merged #78 which covers the same gitignore support. Closing as duplicate. |
sha2fiddy
pushed a commit
to sha2fiddy/mempalace
that referenced
this pull request
May 23, 2026
Closet pointer lines gain a 4th pipe-separated segment of shape
``YYYY-MM-DD:Lstart-Lend`` so retrieval can jump straight to the right
span in the source file instead of opening the whole drawer. The date
is selected via a content-aware hierarchy so legacy ingests (where
``filed_at`` is misleading) still produce honest pointers when the
content itself carries a date.
The legacy 3-segment format remains the fallback for drawers without
the new metadata, so existing palaces keep working without a re-mine.
## Pointer shape
Before:
topic|entities|→drawer_ids
After (when drawer metadata carries line range + a date):
topic|entities|2024-11-08:L42-L78|→drawer_ids
Backward compat: when ``drawer_metas`` is not passed, or the first
meta lacks ``line_start`` / ``line_end``, ``build_closet_lines`` emits
the legacy 3-segment form unchanged.
## Date-source hierarchy (NEW)
A new helper ``mempalace.miner._extract_content_date(source_file, content)``
selects a date once per file and stores it as ``content_date`` on every
drawer mined from that file. Priority order, first match wins:
1. **Filename** — ISO regex on stem, then dateutil fuzzy parse for
natural-language formats. Handles ``2024-11-08.md``,
``April-6th-2011-notes.md``, ``Nov-8-2024.md``, etc.
2. **YAML frontmatter** — looks at ``date`` / ``created`` /
``published`` fields. yaml.safe_load already parses ISO dates as
``datetime.date`` objects; dateutil handles string values.
3. **Content body, first ~10 lines** —
a. ISO regex (``YYYY-MM-DD`` / ``YYYY/MM/DD`` / ``YYYY.MM.DD``)
b. Slash dates with locale auto-disambiguation: if any
first-number > 12 appears anywhere in the head, lock the
file's locale to DD/MM (the only consistent reading);
otherwise default to US MM/DD for two-digit-year compactness.
c. dateutil fuzzy parse for natural-language
(``November 8, 2024``, ``8th November 2024``, etc.)
4. **Filesystem mtime** — ``os.path.getmtime`` formatted as
``YYYY-MM-DD``.
5. **None** — caller (``_build_drawer_metadata``) leaves
``content_date`` absent; ``build_closet_lines`` falls back to the
``filed_at`` ingestion timestamp's date portion.
``_build_date_line_segment`` in ``palace.py`` now prefers ``content_date``
when present, otherwise uses ``filed_at[:10]``. This is the load-bearing
change that makes Tier 6a meaningful for legacy content — a Nov 8 2024
transcript mined today now emits ``2024-11-08:L42-L78`` instead of
``2026-05-22:L42-L78``.
## Changes
1. ``mempalace/miner.py::chunk_text`` — each chunk dict now also carries
``line_start`` / ``line_end`` (1-indexed line numbers in the stripped
source). Computed via ``content[:offset].count("\n") + 1`` —
approximate locator accurate to ±1 line at chunk boundaries.
2. ``mempalace/miner.py::_extract_content_date`` (NEW) + helpers
(``_try_filename_date``, ``_try_frontmatter_date``,
``_try_content_body_date``, ``_try_mtime_date``). Pure functions, no
I/O beyond what's named. Uses lazy imports of ``dateutil`` and
``yaml`` to keep top-level import surface unchanged.
3. ``mempalace/miner.py::_build_drawer_metadata`` — gains
``line_start`` / ``line_end`` / ``content_date`` optional kwargs.
When provided, stored in drawer metadata; when omitted, keys are
absent from the dict — backward compatible.
4. ``mempalace/miner.py`` batched-upsert path — calls
``_extract_content_date(source_file, content)`` ONCE per file (not
once per chunk) and passes the result through to every chunk's
``_build_drawer_metadata`` call.
5. ``mempalace/format_miner.py::_file_chunks_locked`` — gains a
``content`` param so the format-miner path can also run
``_extract_content_date`` on the full extracted text. Plumbs
``content_date`` into every batched drawer's metadata. Caller passes
``text`` (the full extracted markdown).
6. ``mempalace/palace.py::build_closet_lines`` — accepts optional
``drawer_metas`` (parallel to ``drawer_ids``). When present and the
first meta has both ``line_start`` and ``line_end``, the new
segment is spliced in; otherwise legacy 3-segment form. Helper
``_build_date_line_segment`` prefers ``content_date`` over
``filed_at`` for the date portion.
## Tests added (RED-first, then GREEN)
tests/test_miner.py::TestChunkTextLineRanges (4 tests)
tests/test_miner.py::TestBuildDrawerMetadataLineRange (2 tests)
tests/test_miner.py::TestExtractContentDate (15 tests)
- filename ISO / natural-language / compact
- YAML frontmatter date / created
- Claude session preamble
- ISO + natural-language content body
- Slash-date locale disambiguation (DD/MM lock vs default MM/DD)
- Priority order (filename > frontmatter > content > mtime)
- mtime fallback
- Graceful None returns for missing file / empty content
tests/test_closets.py::TestBuildClosetLines (5 new tests)
- includes_date_line_segment_when_metas_provided
- falls_back_to_3_segment_format_when_metas_missing (compat)
- falls_back_to_3_segment_format_when_metas_lack_line_keys (compat)
- date_segment_uses_filed_at_date_portion_only
- content_date_preferred_over_filed_at (new — Tier 6a hierarchy)
26 new tests total. All RED before this commit, all GREEN after.
## Verification
pytest tests/test_miner.py::TestChunkTextLineRanges
tests/test_miner.py::TestBuildDrawerMetadataLineRange
tests/test_miner.py::TestExtractContentDate
tests/test_closets.py::TestBuildClosetLines
→ 32 passed
pytest tests/test_miner.py tests/test_closets.py
tests/test_format_miner.py tests/test_palace.py
→ 230 passed, 2 skipped, 0 regressions in directly-affected files
ruff check + ruff format --check
→ All checks passed, 5 files already formatted
End-to-end smoke test on a YAML-frontmatter file (date: 2024-11-08,
mined "today"):
content_date extracted: 2024-11-08
chunks produced: 1
drawer meta: content_date=2024-11-08, filed_at='2026-05-22...', lines=1-15
closet pointer:
"conversation with lumi about brands of dog food|Filler;Lumi;Aya|2024-11-08:L1-L15|→drawer_a,drawer_b"
→ content_date correctly preferred over filed_at end-to-end.
OrbStack triple-Linux verify (Py 3.9 / 3.11 / 3.13)
→ 32 targeted Tier 6a + content-date tests pass on each.
## Out of scope (deliberate)
- No schema migration of existing drawers. Pre-Tier-6a drawers lack
``line_start`` / ``content_date`` and silently fall back to the
legacy 3-segment pointer shape.
- No re-mine required. Users keep existing palaces; new files mined
after this lands carry the new fields.
- No drawer ID change (Task MemPalace#80, separate concern).
- No exact-quote line positioning within a chunk (future tier).
- No NLP-based date extraction (only regex + dateutil fuzzy).
- Linguistic-similarity / concept-recognition retrieval (L5 zettal
layer) is the natural next step but lives in its own PR family.
milla-jovovich
added a commit
that referenced
this pull request
May 30, 2026
Drawer/closet/triple IDs built by concatenating strings without a
delimiter before hashing can collide: hash((s1 + str(i1))) ==
hash((s2 + str(i2))) whenever s1+str(i1) == s2+str(i2). Concretely,
source_file="foo" + chunk_index=12 produces the same hash input as
source_file="foo1" + chunk_index=2 — both yield "foo12". Under
ChromaDB's primary-key constraint the second write silently
overwrites the first; one chunk's content is lost without surfacing
an error (the surrounding broad `except Exception:` blocks swallow
any backend complaint).
The defect class is "string + string concat fed to a hash without
a delimiter." Per the styleguide's partial-scope-key-migration rule,
every site sharing the pattern is a candidate and must be triaged —
not just the ones the original issue named.
FIX — 6 sites
- mempalace/miner.py:1253 drawer_id, add_drawer() back-compat
- mempalace/miner.py:1386 drawer_id, batched mine loop
- mempalace/miner.py:1416 drawer_ids, closet-build re-derivation
- mempalace/format_miner.py:643 drawer_id, format_miner batched loop
- mempalace/mcp_server.py:1136 drawer_id, tool_add_drawer
- mempalace/knowledge_graph.py:305 triple_id, KG triple insertion
MIGRATED FOR CONSISTENCY — 2 sites
- mempalace/convo_miner.py:87 sentinel_key — was `:`, now `|`
- mempalace/convo_miner.py:422 drawer_key — was `:`, now `|`
Pre-existing delimiter was `:`. Migrated because (a) source_file can
contain literal `:` on Windows paths (`C:\Users\…`) and URL-like
sources (`https://host:8080`), weakening `:` as a separator; (b) `|`
is reserved in Windows filenames so source_file cannot contain it,
making it strictly safer.
DELIMITER CHOICE
- `|` chosen over `:` (Windows / URL source_file edge cases)
- `|` chosen over `\x00` (NUL breaks log greppability + print()
debug + downstream string handling for marginal extra safety)
- Matches the established precedent in mempalace/diary_ingest.py
(lines 52, 76, 91, 98 — all already on `|`)
EXEMPT — audited and correct as-is
Single-input hashes (nothing to delimit):
- mempalace/miner.py:1432 closet_id (source_file only)
- mempalace/format_miner.py:559 sentinel_id (source_file only)
- mempalace/palace.py:433 lock filename (source_file only)
- mempalace/palace.py:629 palace_key (lock_key_source only)
- mempalace/diary_ingest.py:158 content_hash (text only)
- mempalace/hooks_cli.py:329 pidfile digest (joined cmd only)
- mempalace/sources/context.py:141 record digest (source_file only)
Already correctly delimited:
- mempalace/hallways.py:157 `f"{wing}::{a}::{b}"` (`::`)
- mempalace/palace_graph.py:454 `f"{a}↔{b}"` (`↔`)
- mempalace/diary_ingest.py:52,76,91,98 (`|` precedent)
Protected by composition (uniqueness guaranteed by the ID prefix,
not by the hash slice):
- mempalace/mcp_server.py:1635 entry_id is
`diary_{wing}_{strftime('%Y%m%d_%H%M%S%f')}_{sha[:12]}`.
Microsecond-resolution timestamp prefix supplies uniqueness;
the trailing hash is a content-discriminator, not the
write-time uniqueness guarantor.
NEW MODULES
- mempalace/ids.py — single source of truth for ID construction.
Five helpers (make_drawer_id_from_chunk, make_drawer_id_from_content,
make_convo_drawer_id, make_convo_sentinel_id, make_triple_id) plus
an ID_RECIPE = "v2" constant.
- mempalace/collision_scan.py — pre-mining defense. Runs immediately
before each batched ChromaDB upsert; raises CollisionError naming
the colliding (source_file, chunk_index) pairs if any proposed
drawer_id appears more than once with conflicting metadata across
the union of incoming and existing rows.
DETECTION
ChromaDB's primary-key constraint means past collisions were
silently resolved at upsert time (last-write-wins), erasing
evidence. A post-hoc audit cannot reconstruct what was lost from
palace state alone. Therefore:
- Pre-mining risk scan. Before each batched upsert, compute the
proposed drawer_ids for the incoming chunk set AND query existing
drawer_ids from the collection. If any proposed id appears more
than once in the union (incoming-vs-incoming or incoming-vs-
existing) with conflicting (source_file, chunk_index), abort the
mine with an actionable error naming the colliding pairs.
Collision is caught BEFORE it destroys data, which is the only
point at which palace state still carries the evidence.
- New metadata key: `"id_recipe": "v2"` on every drawer written
under the delimited recipe. Audits compare like-for-like;
drawers without `id_recipe` are treated as v1 legacy (undelimited
or `:`-delimited), not as collisions.
- Honest disclosure: palaces mined under any pre-v2 mempalace may
carry silent past collisions whose original content is
unrecoverable from palace state. Future library tier work will
give users a per-drawer audit + opt-in archival path.
TESTS
- tests/test_ids.py: 17 unit tests covering all 5 helpers, the
ID_RECIPE constant, the private `_delimited_sha256` helper, and
the four defect-class collision shapes (chunk_index boundary,
content boundary, extract_mode boundary, ISO datetime boundary).
RED collision tests confirm the v2 recipe breaks the defect class.
- tests/test_collision_scan.py: 10 tests covering clean batches,
idempotent re-mines, incoming-vs-incoming collisions, incoming-vs-
existing collisions, error-message quality, empty batches,
metadata without chunk_index, and ChromaDB backend errors
propagating cleanly.
- tests/test_convo_miner_unit.py: FakeCol gains a stub `get()` so
the pre-mining scan can probe an empty in-test collection.
BACKWARDS COMPATIBILITY
- Existing v1 drawers remain queryable; no rewrite of stored IDs.
- New mining passes write v2 drawers alongside v1 via PR #1628's
additive-mining model.
- No user action required; opt-in cleanup ships separately.
VERIFICATION
- macOS Python 3.12 (local): 2304 passed, 3 skipped, 0 failed,
coverage 85.44% (above 80% threshold).
- Linux Python 3.9/3.11/3.13 (OrbStack with full `pip install -e
'.[dev]'` flow): see PR body for per-version pass counts.
- `ruff check .` + `ruff format --check .`: clean.
- pylint score: ids.py 10.00/10, collision_scan.py 10.00/10; all
modified files >= 8.92 (no regression).
- bandit: clean on all new files; the one MEDIUM finding in
knowledge_graph.py is on lines 385/407 (pre-existing SQL string
construction in timeline queries), not in code touched by this PR.
- pre_push_check.sh: ALL CHECKS PASSED.
Refs: deferred from PR #1572 (Copilot finding #13). Styleguide audit
documented above: 8 fix/migration + 11 enumerated exempt sites.
igorls
added a commit
that referenced
this pull request
Jun 6, 2026
…limiter fix(ids): delimit hash inputs to prevent drawer_id collisions (#80)
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Closes #56.
Adds
.gitignoresupport toscan_project()so that files and directories already ignored by each repo's.gitignoreare automatically excluded during mining. This is especially useful when mining a parent directory containing multiple repos — previously, thousands of build artifacts, caches, and data files would get indexed.pathspeclibrary (pure Python, no transitive deps) for gitignore pattern matching.gitignorefiles are loaded asos.walkdescends, so nested rules in subdirectories are honored additivelySKIP_DIRSremains as a fast baseline filter applied first--no-gitignoreCLI flag to opt outSKIP_DIRSwith common Python/JS/IDE cache directories (.ruff_cache,.mypy_cache,.pytest_cache, etc.)Note on new dependency: CONTRIBUTING.md says "ChromaDB + PyYAML only — don't add new deps without discussion."
pathspecis lightweight (pure Python, zero transitive deps, ~30KB), and is the de facto standard for gitignore pattern matching. A stdlib-only approach would require hand-rolling the full gitignore spec (negation patterns,**globbing, directory-only patterns, rooted patterns), which is non-trivial and error-prone.Before / After (real-world test)
Mining a directory containing 14 git repos:
Test plan
pytest tests/test_miner.py -v).gitignorepatterns exclude files and directories--no-gitignoreflag disables the feature.gitignorefiles stack correctly.gitignorepresent = current behavior unchangedSKIP_DIRSstill apply alongside gitignore🤖 Generated with Claude Code