Envelope D3: provenance mining integration — wing_lineage from convo_miner#6
Merged
…iner ENVELOPE D3 from 2026-05-11 paired build (Task MemPalace#15, Phase 1 final).

Wires extract_candidates + qwen3_classifier into mempalace.convo_miner so new diary mining produces wing_lineage drawers in addition to the operational wing. Phase 1 of Task MemPalace#15 closes with this PR; Phase 2 (60k existing-drawer backfill) is its own scoping task.

Changes:
- New mempalace/provenance/mining.py with mine_chunk_for_provenance: take a chunk, run extract_candidates -> validate with classifier (default: qwen3_classifier from D2) -> rewrite transitive attributions -> dedupe -> upsert into wing_lineage.
- Transitive-attribution rewrite (architect-flagged from D2 calibration case MemPalace#11): when the classifier returns the speaker name (e.g., "James") for text containing "<possessive> <relation>'s" (e.g., "his father's saying"), redirect to room=<relation> (e.g., "father"). Without the rewrite, "Tonight James reminded me: 'measure twice' — his father's saying" files under room='james' and a future search for "father saying" misses it.
- Dedup by (person, quote, source_file) hash baked into the drawer_id. Re-mining the same source -> existing drawer; the same attribution in different source files -> distinct drawers (intentional: distinct attribution events are tracked separately).
- MEMPALACE_PROVENANCE_DISABLED env var (truthy: 1/true/yes, case-insensitive) makes mine_chunk_for_provenance a no-op. For environments where the classifier substrate is unavailable: CI, fresh checkouts, or backfill jobs that handle their own pass.
- convo_miner._file_chunks_locked: after the operational upsert inside the per-chunk loop, call mine_chunk_for_provenance. It runs AFTER operational durability is established so a slow classifier call doesn't delay the canonical write. Failure-soft at three layers: the inner call is itself failure-soft, the convo_miner wrapper catches anything that escapes, and operational mining proceeds regardless.
- DEFAULT_CONFIDENCE_THRESHOLD = 0.7 per design doc §D1. D2 calibration showed positives at 0.90-0.95 and negatives at 0.00, so 0.7 sits cleanly in the gap. Tunable via kwarg.

Schema (per Provenance-Preservation-Design §D3):
Drawer content rendered as a YAML-ish PROVENANCE: block with Person / Relation / Quote / Context / Source lines. Metadata includes wing=wing_lineage, room=<person_slug>, person, relation_type, is_quote, confidence, extracted_by, source_file, source_session, filed_at, filed_at_ts.

Tests (14 new in test_provenance_mining.py; 62 total mempalace provenance tests):
- Happy path: chunk + accepting classifier -> 1 wing_lineage drawer with correct meta + design-doc content shape.
- Threshold: below-default-threshold rejected; custom threshold lets lower-confidence through.
- Dedup: same chunk+source twice -> 1 drawer; different sources -> distinct drawers.
- Disabled mode: MEMPALACE_PROVENANCE_DISABLED with 1/true/yes variants all yield 0 drawers.
- No-candidates returns 0; operational mining unaffected.
- Failure-soft: classifier raising -> 0 drawers, no crash.
- Transitive-attribution rewrite (case MemPalace#11): classifier surfaces speaker name, _rewrite_speaker_to_source redirects to relation when "<possessive> <relation>'s" appears in candidate or context.
- Unit tests on _rewrite_speaker_to_source directly (positive, negative, None-input cases).
- End-to-end convo_miner integration: _file_chunks_locked with a chunk produces BOTH operational drawer (wing=wing_test) AND wing_lineage drawer (wing=wing_lineage).

62/62 pass in <100ms (no live substrate required; tests inject mock classifiers).

Phase 1 status after this merges:
- D1 (PR #4): heuristic + classifier interface, MERGED
- D2 (PR #5): qwen3_classifier + Pass-3 + calibration, MERGED
- D3 (this PR): mining integration, pending

After merge: forward-only provenance preservation is operational. No new diary mining loses biographical/relational lineage. Phase 2 (60k existing-drawer backfill) is a separate scoped task.
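The rewrite, dedup, and disable-flag rules described in the commit message can be sketched in a few lines. This is only an illustration under stated assumptions, not the code in mempalace/provenance/mining.py: the relation list, the possessive list, the `lineage_` id prefix, the 24-character truncation, and all three function names are hypothetical, and returning None for a None speaker is a guess at what the "None-input" test covers.

```python
import hashlib
import os
import re

# Hypothetical relation vocabulary; the real module's list is not shown in this PR.
RELATIONS = ("father", "mother", "grandfather", "grandmother", "brother", "sister")

# Matches "<possessive> <relation>'s", e.g. "his father's" (straight or curly apostrophe).
_POSSESSIVE_SOURCE = re.compile(
    r"\b(?:his|her|their|my|our|your)\s+(" + "|".join(RELATIONS) + r")['\u2019]s\b",
    re.IGNORECASE,
)


def rewrite_speaker_to_source(person, text):
    """Redirect a speaker attribution to the relation named in the text, if any."""
    if person is None:
        return None  # assumption: nothing to rewrite without a speaker
    match = _POSSESSIVE_SOURCE.search(text or "")
    return match.group(1).lower() if match else person


def lineage_drawer_id(person, quote, source_file):
    """Deterministic id from (person, quote, source_file), so re-mining upserts in place."""
    payload = " | ".join((person, quote, source_file)).encode("utf-8")
    # Prefix and truncation length are illustrative choices, not taken from the PR.
    return "lineage_" + hashlib.sha256(payload).hexdigest()[:24]


def provenance_disabled(env=None):
    """True when MEMPALACE_PROVENANCE_DISABLED is 1/true/yes (case-insensitive)."""
    env = os.environ if env is None else env
    value = env.get("MEMPALACE_PROVENANCE_DISABLED", "")
    return value.strip().lower() in {"1", "true", "yes"}
```

Because the id depends on source_file, the same quote mined from two different diaries yields two ids, which matches the "distinct attribution events" behaviour described above.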
Summary
Envelope D3 from 2026-05-11 paired build (Task MemPalace#15, Phase 1 final). Wires extract_candidates + qwen3_classifier into mempalace.convo_miner so new diary mining produces wing_lineage drawers in addition to operational wings. After this merges, forward-only provenance preservation is operational: no new diary mining loses biographical/relational lineage. Phase 2 (60k existing-drawer backfill) is a separate scoped task.

Changes
mempalace/provenance/mining.py (new module)
- mine_chunk_for_provenance(collection, chunk_content, source_file, *, source_session=None, classifier=None, confidence_threshold=0.7, extractor_label=None) -> int: extracts candidates from a chunk, validates each with the classifier (defaults to qwen3_classifier from D2), applies the transitive-attribution rewrite, dedupes by (person, quote, source_file), and upserts records ≥ threshold into wing_lineage.
- Transitive-attribution rewrite: when the classifier returns the speaker name for text containing <possessive> <relation>'s (e.g., "his father's saying"), redirect the lineage room to the relation. Without this, "Tonight James reminded me: 'measure twice' — his father's saying" files under room='james' and a future search for father saying misses it.
- Dedup: the drawer id is derived from sha256(person | quote | source_file). Re-mining the same source → existing drawer (chromadb upsert is idempotent on id). Different sources → distinct drawers (intentional: distinct attribution events are tracked separately).
- MEMPALACE_PROVENANCE_DISABLED (truthy: 1/true/yes, case-insensitive) → no-op, returns 0. For environments without the classifier substrate (CI / fresh checkouts / backfill jobs that handle their own pass).

mempalace/convo_miner.py: hook in _file_chunks_locked
- After the operational upsert inside the per-chunk loop, calls mine_chunk_for_provenance(collection, chunk_content=..., source_file=...).

Wing_lineage drawer schema (per design doc §D3, implemented)
Drawer content rendered as a PROVENANCE: block with Person / Relation / Quote / Context / Source lines. Metadata: wing=wing_lineage, room=<person_slug>, plus person, relation_type, is_quote, confidence, extracted_by, source_file, source_session, filed_at, filed_at_ts.

Tests (14 new in test_provenance_mining.py, 62 total mempalace provenance suite)
- Transitive-attribution rewrite: _rewrite_speaker_to_source redirects to relation when possessive-source pattern present in candidate or context
- Unit tests on _rewrite_speaker_to_source directly (positive, negative, None-input)
- End-to-end: _file_chunks_locked with a chunk produces BOTH operational drawer (wing=wing_test) AND wing_lineage drawer (wing=wing_lineage)

62/62 pass in <100ms (no live substrate required; tests inject mock classifiers).
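Injecting a mock classifier is what lets the suite run without a live substrate. The following is a rough sketch of that pattern, not the PR's test code: `Verdict`, `accepted`, `mock_classifier`, and `raising_classifier` are hypothetical names, and the real classifier interface from D1 may return a different shape.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Verdict:
    """Hypothetical classifier result: who the text is attributed to, and how surely."""
    person: Optional[str]
    confidence: float


Classifier = Callable[[str], Verdict]


def accepted(candidates: List[str], classifier: Classifier,
             threshold: float = 0.7) -> List[Verdict]:
    """Keep verdicts at or above the threshold; swallow classifier failures."""
    kept = []
    for candidate in candidates:
        try:
            verdict = classifier(candidate)
        except Exception:
            continue  # failure-soft: a broken classifier yields no drawers, no crash
        if verdict.person and verdict.confidence >= threshold:
            kept.append(verdict)
    return kept


def mock_classifier(confidence: float) -> Classifier:
    """Build a classifier that always attributes to 'father' at a fixed confidence."""
    return lambda text: Verdict(person="father", confidence=confidence)


def raising_classifier(text: str) -> Verdict:
    """Stand-in for an unavailable substrate."""
    raise RuntimeError("substrate unavailable")
```

With helpers like these, the threshold and failure-soft cases reduce to one-line assertions, for example checking that `accepted(["x"], mock_classifier(0.5))` is empty while passing `threshold=0.4` lets the same verdict through.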
Phase 1 status after merge
After merge: any session mined through mempalace.convo_miner automatically produces wing_lineage drawers for person-attributions in its chunks. mempalace_search wing=wing_lineage room=father returns father attributions across all mined sessions.

Discipline
- ~/mempalace-worktrees/d3.
- jpwinans/mempalace (used -R flag).