feat: add mempalace prune to detect and remove stale drawers#522

Open
vanachterjacob wants to merge 1 commit into MemPalace:develop from vanachterjacob:feat/prune-stale-drawers

@vanachterjacob

Summary

Stale drawers accumulate when source files are deleted or modified after mining. This causes outdated content to surface in mempalace_search results, which can inject contradictory information into agent context — a memory correctness risk, not just a maintenance inconvenience.

This PR adds:

  • mempalace/pruner.py — core prune logic with three detection strategies:
    • existence: finds drawers whose source file no longer exists on disk
    • mtime: finds drawers whose source file has been modified since mining
    • orphans: finds leftover chunks when a file shrinks after re-mining
  • CLI command: mempalace prune --strategy <all|existence|mtime|orphans> --wing <w> --dry-run
  • MCP tool: mempalace_prune — callable from Claude Code / Cursor
  • Clean-before-remine: patches miner.py to delete all old drawers for a source file before re-mining, preventing orphaned chunks from accumulating in the first place
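
As a rough illustration of the existence and mtime strategies listed above, the following sketch flags drawers from recorded metadata. It is not the code in mempalace/pruner.py: the `Drawer` fields, `find_stale`, and the `MTIME_EPSILON` tolerance are hypothetical names chosen for this example.

```python
import os
from dataclasses import dataclass

MTIME_EPSILON = 0.01  # assumed tolerance for comparing float mtimes


@dataclass
class Drawer:
    """Hypothetical minimal view of a drawer's metadata."""
    drawer_id: str
    source_file: str
    source_mtime: float  # mtime recorded when the file was mined


def find_stale(drawers, strategy="all"):
    """Return {drawer_id: reason} for drawers flagged by the chosen strategy."""
    stale = {}
    for d in drawers:
        if not os.path.exists(d.source_file):
            # Source file is gone: only the existence strategy applies.
            if strategy in ("all", "existence"):
                stale[d.drawer_id] = "existence"
            continue
        if strategy in ("all", "mtime"):
            # Source file still exists but was modified after mining.
            if os.path.getmtime(d.source_file) - d.source_mtime > MTIME_EPSILON:
                stale[d.drawer_id] = "mtime"
    return stale
```

A dry run would print the returned mapping and stop; an actual prune would delete the listed drawer IDs.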

Addresses

Usage

# Preview stale drawers without deleting
mempalace prune --dry-run

# Delete drawers from deleted source files only
mempalace prune --strategy existence

# Full cleanup: deleted files + modified files + orphaned chunks
mempalace prune --strategy all

# Scope to one wing
mempalace prune --wing my_project --dry-run

Via MCP:

mempalace_prune(strategy="all", dry_run=true)
mempalace_prune(strategy="existence", dry_run=false)

Test plan

  • Verified existence strategy detects drawers with missing source files
  • Verified mtime strategy detects drawers with outdated modification times
  • Verified --wing filter scopes detection correctly
  • Verified --dry-run reports without deleting
  • Verified actual deletion removes only stale drawers, keeps fresh ones
  • Verified CLI output formatting
  • Clean-before-remine prevents orphaned chunks on modified files

🤖 Generated with Claude Code

Stale drawers accumulate when source files are deleted or modified after
mining. This causes outdated content to surface in search results, which
can inject contradictory information into agent context.

Adds three detection strategies:
- existence: finds drawers whose source file no longer exists on disk
- mtime: finds drawers whose source file has been modified since mining
- orphans: finds leftover chunks when a file shrinks after re-mining

Also patches the project miner to clean old drawers before re-mining a
modified file (clean-before-remine), preventing orphaned chunks from
accumulating in the first place.

Available as CLI (`mempalace prune --dry-run`) and MCP tool
(`mempalace_prune`).

Addresses MemPalace#224, MemPalace#420, MemPalace#331

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@web3guru888

Stale drawer cleanup has been on my wishlist since before soft-archive (#336) landed — glad to see a first-class prune command. A few notes from running on a corpus that gets re-mined regularly:

The clean-before-remine patch to miner.py is the most important part of this PR. Orphaned chunks from shrinking files are a subtle correctness problem — they don't show up in status outputs, they just silently contribute irrelevant hits to search results. The delete-before-insert approach is the right fix.

One concern: the mtime strategy and float precision. PR #518 (merged?) just fixed the == float comparison in file_already_mined() — the same epsilon-comparison fix should apply here if you're comparing stored mtime metadata to os.path.getmtime(). The epsilon from #475/#518 (abs(diff) < 0.01) is the right baseline.
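
A minimal sketch of the tolerance comparison mentioned above, assuming the stored mtime is a float that may have lost precision on its way through metadata storage. The helper name `mtime_unchanged` is hypothetical; only the 0.01 threshold comes from the comment.

```python
MTIME_EPSILON = 0.01  # threshold quoted above: abs(diff) < 0.01


def mtime_unchanged(stored_mtime: float, current_mtime: float) -> bool:
    """Treat two mtimes as equal when they differ by less than the tolerance."""
    return abs(stored_mtime - current_mtime) < MTIME_EPSILON


current = 1712750000.123456
stored = round(current, 3)  # simulates precision lost in storage

exact_match = stored == current                    # False: strict equality is too strict
tolerant_match = mtime_unchanged(stored, current)  # True: within the tolerance
real_change = mtime_unchanged(stored, current + 1.0)  # False: a genuine modification
```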

orphans detection and chunk numbering: the orphan detection relies on chunk index continuity (finding chunk N where N-1 doesn't exist). This works but it's brittle if chunk IDs aren't generated sequentially or if a re-mine leaves gaps for another reason. Would be worth documenting the invariant this relies on, or adding a per-file chunk count stored in metadata that the orphan check can use directly.
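
The per-file chunk count suggested above could look roughly like this. It assumes each drawer carries a `chunk_index` and that the miner records how many chunks the latest mine produced per source file; both the tuple layout and `find_orphans` are hypothetical.

```python
def find_orphans(drawers, chunk_counts):
    """Flag drawers whose chunk index is beyond the latest recorded count.

    drawers: iterable of (drawer_id, source_file, chunk_index)
    chunk_counts: {source_file: chunks produced by the most recent mine}
    """
    orphans = []
    for drawer_id, source_file, chunk_index in drawers:
        expected = chunk_counts.get(source_file)
        # Files with no recorded count are skipped rather than guessed at.
        if expected is not None and chunk_index >= expected:
            orphans.append(drawer_id)
    return orphans
```

Unlike a continuity check, this does not depend on where gaps appear in the index sequence, only on the recorded count being kept up to date.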

Interaction with soft-archive (#336) and Synapse consolidation candidates (#451): a drawer that's been soft-archived probably shouldn't be flagged as stale by the existence check, since it was intentionally removed from active retrieval. The prune logic should check drawer status metadata (status: "archived") and exclude archived drawers from the existence strategy (mtime/orphan checks are still valid). Otherwise prune will eagerly delete what archive intentionally demoted.
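
One way to express that exclusion, as a sketch only: drawers are modeled as plain dicts, and the `status` key with the value "archived" follows the metadata convention described in this comment rather than any confirmed schema.

```python
import os


def existence_candidates(drawers):
    """Return IDs of drawers whose source file is missing, skipping archived ones.

    drawers: iterable of dicts with 'id', 'source_file', and optional 'status'.
    """
    return [
        d["id"]
        for d in drawers
        if d.get("status") != "archived" and not os.path.exists(d["source_file"])
    ]
```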

Dry-run first: the --dry-run flag is essential — +1 for including it. I'd also add a --min-age-days filter so prune doesn't flag drawers that were just mined (e.g., give new drawers a 24h grace period before they're eligible for existence pruning). Helps avoid race conditions during active mining sessions.
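
The grace period could be a simple age check applied before any strategy runs. This is a sketch of the proposed `--min-age-days` behavior, not an existing flag; `mined_at` is assumed to be a Unix timestamp stored with each drawer.

```python
import time

SECONDS_PER_DAY = 86400


def old_enough(mined_at, min_age_days, now=None):
    """Return True when a drawer was mined at least min_age_days ago."""
    now = time.time() if now is None else now
    return (now - mined_at) >= min_age_days * SECONDS_PER_DAY
```

Passing `now` explicitly keeps the check deterministic in tests.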

Good foundation — once the archive/status interaction is handled, this fills a real gap.


MemPalace-AGI dashboard

@StefanKremen

Heads up: this PR's miner.py clean-before-remine change also happens to fix the hnswlib updatePoint / repairConnectionsForUpdate race documented in #521 (PR #523).

Mechanism: modified-file re-mines previously upsert'd over existing deterministic drawer IDs, pushing ChromaDB through hnswlib's thread-unsafe updatePoint / repairConnectionsForUpdate path, which reliably segfaults the mining subprocess on macOS ARM64 / Python 3.13 / chromadb 0.6.3. The unambiguous fingerprint is repairConnectionsForUpdate in the crashed thread's stack; that function is only called on the update path.

Deleting by source_file before the re-insert loop (what both PRs do) converts re-mines into pure INSERT operations, bypassing the update path entirely. Different motivation (stale-drawer correctness here vs. segfault in #521), same effective fix on the miner hunk.
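
The delete-before-insert pattern can be sketched as follows. `InMemoryCollection` is a stand-in written for this example, not ChromaDB; it only mirrors the shape of a `delete(where=...)` followed by an `add(ids=..., documents=..., metadatas=...)`. It raises on duplicate IDs purely to show that no insert collides with an existing row, which a real store may handle differently. The ID scheme in `remine` is also an assumption.

```python
class InMemoryCollection:
    """Stand-in for a vector-store collection (illustration only)."""

    def __init__(self):
        self.rows = {}  # id -> (document, metadata)

    def delete(self, where):
        # Remove every row whose metadata matches all key/value pairs.
        doomed = [
            row_id
            for row_id, (_, meta) in self.rows.items()
            if all(meta.get(k) == v for k, v in where.items())
        ]
        for row_id in doomed:
            del self.rows[row_id]

    def add(self, ids, documents, metadatas):
        for row_id, doc, meta in zip(ids, documents, metadatas):
            if row_id in self.rows:
                raise ValueError(f"duplicate id {row_id}: this would be an update")
            self.rows[row_id] = (doc, meta)


def remine(collection, source_file, chunks):
    """Delete all drawers for source_file, then insert the fresh chunks."""
    collection.delete(where={"source_file": source_file})
    ids = [f"{source_file}#{n}" for n in range(len(chunks))]
    metas = [{"source_file": source_file, "chunk_index": n} for n in range(len(chunks))]
    collection.add(ids=ids, documents=chunks, metadatas=metas)
    return ids
```

Because the old rows are gone before `add` runs, a file that shrinks leaves no leftover chunks and every write is a plain insert.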

Coordination:

Urgency note: PR #518 (still open) fixes the float-mtime epsilon in file_already_mined(). Once that merges, modified-file re-mines will fire more reliably — which fires the hnswlib race more reliably too. Worth landing either this PR or #523 before #518.

Regression test worth adding (not currently in either PR's test plan): mine a file → os.utime() → re-mine → assert no crash and new chunks replace the old ones with the updated mtime. That's the scenario where the race fires.
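
A skeleton for that regression test, using a toy in-memory miner so the shape of the scenario is runnable on its own. The `mine` function here is hypothetical and does not exercise hnswlib; a real test would call the project's miner against an actual palace to cover the crash path.

```python
import os
import tempfile


def mine(store, path):
    """Toy miner: one drawer per line, delete-before-insert by source file."""
    for key in [k for k in store if k[0] == path]:
        del store[key]
    mtime = os.path.getmtime(path)
    with open(path, encoding="utf-8") as fh:
        for n, line in enumerate(fh.read().splitlines()):
            store[(path, n)] = {"text": line, "mtime": mtime}


def test_remine_after_touch():
    store = {}
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "notes.md")
        with open(path, "w", encoding="utf-8") as fh:
            fh.write("one\ntwo\nthree\n")
        mine(store, path)
        first_mtime = os.path.getmtime(path)

        # Modify the file, then force its mtime forward by one second.
        with open(path, "w", encoding="utf-8") as fh:
            fh.write("uno\n")
        os.utime(path, (first_mtime + 1, first_mtime + 1))
        mine(store, path)

        # Old chunks are gone and the new chunk carries the updated mtime.
        assert list(store) == [(path, 0)]
        assert store[(path, 0)]["text"] == "uno"
        assert store[(path, 0)]["mtime"] > first_mtime
    return True
```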

@web3guru888

@StefanKremen — this is exactly the kind of cross-PR analysis that saves reviewers time. The mechanism is right: pre-delete converts re-mines from update operations to pure inserts, which never touches hnswlib's updatePoint path. Same fix, different motivations.

The merge-order recommendation is important and I want to amplify it: land this PR or #523 before #518. The #518 float-mtime epsilon fix makes file_already_mined() return False more accurately for modified files — which is correct behavior, but it increases the frequency of re-mines on files that previously slipped through, which fires the hnswlib race more reliably on affected platforms (macOS ARM64 / Python 3.13 / chromadb 0.6.3).

The regression test scenario you described is exactly right:

mine file → os.utime(file, now+1) → re-mine → assert no crash + new chunks replace old

That test would have caught this race in CI. Worth adding to whichever PR lands first.

This was referenced Apr 10, 2026
@nanoscopic

Isn't pruning stale data against the stated purpose of the project "to retain everything"?

Instead of pruning, perhaps archiving instead, so that information could theoretically be brought back if desired later?

@web3guru888

@nanoscopic — this is a fair tension to flag, and it's worth distinguishing between the two use cases:

Prune targets genuinely stale metadata, not chosen memories. When a source file is deleted from disk, the drawers it generated have no recoverable original — the text is already in ChromaDB but the file it came from no longer exists. Pruning those is less like "discarding a memory" and more like "removing a dangling reference." The content isn't gone from the palace yet — it's the source link that's broken.

The dry-run default is key. This PR defaults to preview mode. Nothing is deleted without explicit --confirm. So the danger of accidental loss is low.

The soft-archive approach you're describing is real though. There's ongoing discussion in #336 about status: archived as a first-class metadata value for "retired but retained" content. A prune command could soft-archive by default (set status=archived) rather than hard-delete, and let users hard-delete separately. That would satisfy both retention for recall and stale content not surfacing in search — archived drawers could be excluded from default search unless include_archived=True.

That said, for the existence strategy (source file no longer on disk at all) — I'd argue hard delete is appropriate since re-mine is impossible and the content would just accumulate forever.
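
The archive-by-default idea from this comment might look like the sketch below. Drawers are plain dicts, and `soft_archive`, `search`, and the substring match are hypothetical simplifications; only `status=archived` and `include_archived` come from the discussion.

```python
def soft_archive(drawers, stale_ids):
    """Mark stale drawers as archived instead of deleting them."""
    for d in drawers:
        if d["id"] in stale_ids:
            d["status"] = "archived"


def search(drawers, term, include_archived=False):
    """Return matching drawer IDs, hiding archived drawers unless requested."""
    return [
        d["id"]
        for d in drawers
        if term in d["text"]
        and (include_archived or d.get("status") != "archived")
    ]
```

With this split, prune keeps stale content out of default search while a separate hard-delete step remains an explicit user decision.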

@wafuzio

wafuzio commented Apr 10, 2026

watching this one! thanks

bensig changed the base branch from main to develop April 11, 2026 22:21
igorls added the area/cli (CLI commands), area/mcp (MCP server and tools), area/mining (File and conversation mining), and enhancement (New feature or request) labels Apr 14, 2026
@igorls

igorls commented May 8, 2026

Member

Hi, thanks for the contribution.

This PR has merge conflicts with develop, and the branch has not been updated in over 7 days, which puts it before our most recent release. The conflicts are likely against work that landed in that release.

Could you rebase onto develop so we can take another look?

If this change is no longer relevant, feel free to close the PR.

(This message is part of a periodic backlog pass, sent to all open PRs that match this state.)

igorls added the needs-rebase (PR has merge conflicts with develop and needs rebase) label May 8, 2026