Envelope D3: provenance mining integration — wing_lineage from convo_miner by jpwinans · Pull Request #6 · jpwinans/mempalace

jpwinans · 2026-05-11T22:10:00Z

Summary

Envelope D3 from 2026-05-11 paired build (Task MemPalace#15, Phase 1 final). Wires extract_candidates + qwen3_classifier into mempalace.convo_miner so new diary mining produces wing_lineage drawers in addition to operational wings. After this merges, forward-only provenance preservation is operational — no new diary mining loses biographical/relational lineage. Phase 2 (60k existing-drawer backfill) is a separate scoped task.

Changes

`mempalace/provenance/mining.py` (new module)

mine_chunk_for_provenance(collection, chunk_content, source_file, *, source_session=None, classifier=None, confidence_threshold=0.7, extractor_label=None) -> int — extracts candidates from a chunk, validates each with the classifier (defaults to qwen3_classifier from D2), applies the transitive-attribution rewrite, dedupes by (person, quote, source_file), upserts records ≥ threshold into wing_lineage.
Transitive-attribution rewrite (architect-flagged from D2 calibration case MemPalace#11): when the classifier returns a speaker name (e.g., "James") for text containing "<possessive> <relation>'s" (e.g., "his father's saying"), redirect the lineage room to the relation (e.g., "father"). Without this, "Tonight James reminded me: 'measure twice' — his father's saying" files under room='james' and a future search for "father saying" misses it.
Dedup: drawer_id derived from sha256(person | quote | source_file). Re-mining same source → existing drawer (chromadb upsert is idempotent on id). Different sources → distinct drawers (intentional — distinct attribution events tracked separately).
Env disable: MEMPALACE_PROVENANCE_DISABLED (truthy: 1/true/yes, case-insensitive) → no-op return 0. For environments without the classifier substrate (CI / fresh checkouts / backfill jobs that handle their own pass).
Failure-soft at every error path: classifier import error / extract exception / validate exception / upsert exception all yield 0 and log at DEBUG. Operational mining is never affected.
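The dedup key, the environment switch, and the threshold gate from the list above can be sketched as follows. This is a minimal illustration, not the module's code: the helper names (`provenance_disabled`, `drawer_id_for`, `passes_threshold`), the `|` join, and the `lineage_` id prefix are assumptions. The PR itself only states that the id derives from sha256(person | quote | source_file), that 1/true/yes are truthy case-insensitively, and that records ≥ threshold are upserted.

```python
import hashlib
import os

DEFAULT_CONFIDENCE_THRESHOLD = 0.7  # value stated in the PR


def provenance_disabled(environ=None) -> bool:
    """True when MEMPALACE_PROVENANCE_DISABLED is 1/true/yes (any case)."""
    env = os.environ if environ is None else environ
    value = env.get("MEMPALACE_PROVENANCE_DISABLED", "")
    return value.strip().lower() in {"1", "true", "yes"}


def drawer_id_for(person: str, quote: str, source_file: str) -> str:
    """Deterministic id: same (person, quote, source_file) -> same drawer.

    The separator and prefix are hypothetical; only the sha256 input
    fields come from the PR description.
    """
    key = "|".join((person, quote, source_file))
    return "lineage_" + hashlib.sha256(key.encode("utf-8")).hexdigest()


def passes_threshold(confidence: float,
                     threshold: float = DEFAULT_CONFIDENCE_THRESHOLD) -> bool:
    """Records at or above the threshold are kept."""
    return confidence >= threshold
```

Because the id is a pure function of the three fields, re-mining the same source produces the same id and the upsert overwrites the existing drawer, while the same quote in a different source file hashes to a different id.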

`mempalace/convo_miner.py` — hook in `_file_chunks_locked`

After the operational upsert inside the per-chunk loop, call mine_chunk_for_provenance(collection, chunk_content=..., source_file=...).
Run after operational durability is established so a slow classifier call doesn't delay the canonical write.
Failure-soft at three layers: inner call is failure-soft, convo_miner wrapper catches anything that escapes, operational mining proceeds regardless.
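The ordering and the outer failure-soft layer described above might look roughly like this. It is a sketch under assumptions: `file_chunk_sketch`, its parameters, and the logger name are hypothetical stand-ins for the real `_file_chunks_locked` loop body, and `mine_fn` stands in for `mine_chunk_for_provenance`. The only library call assumed is a `collection.upsert(ids=..., documents=..., metadatas=...)` method in the chromadb style.

```python
import logging

logger = logging.getLogger("mempalace.convo_miner")  # logger name is an assumption


def file_chunk_sketch(collection, chunk_id, chunk_content, source_file, mine_fn):
    """Write the operational drawer first, then attempt provenance mining.

    Returns the number of lineage drawers written (0 on any mining failure).
    """
    # Canonical write happens first; errors here are NOT swallowed.
    collection.upsert(
        ids=[chunk_id],
        documents=[chunk_content],
        metadatas=[{"source_file": source_file}],
    )
    # Provenance mining runs only after the operational write succeeded.
    try:
        return mine_fn(collection, chunk_content=chunk_content,
                       source_file=source_file)
    except Exception:
        # Outer failure-soft layer: log at DEBUG and carry on.
        logger.debug("provenance mining failed for %s", source_file,
                     exc_info=True)
        return 0
```

Keeping the `try` block around the mining call only, and not around the operational upsert, is what lets a classifier failure be ignored without hiding a failed canonical write.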

Wing_lineage drawer schema (per design doc §D3, implemented)

Drawer content rendered as PROVENANCE: block with Person / Relation / Quote / Context / Source lines. Metadata: wing=wing_lineage, room=<person_slug>, plus person, relation_type, is_quote, confidence, extracted_by, source_file, source_session, filed_at, filed_at_ts.
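For orientation, a drawer with this shape could be assembled as below. The field names come from the schema above; everything else is an assumption made for illustration: the function names, the two-space indentation of the block, the slug rule, the ISO-8601 format for `filed_at`, and storing a missing `source_session` as an empty string.

```python
import re
from datetime import datetime, timezone


def person_slug(person: str) -> str:
    """Hypothetical slug rule: lowercase, non-alphanumerics collapsed to '_'."""
    return re.sub(r"[^a-z0-9]+", "_", person.lower()).strip("_")


def render_lineage_drawer(person, relation_type, quote, context, source_file, *,
                          confidence, extracted_by, is_quote=True,
                          source_session=None, now=None):
    """Return (content, metadata) for one wing_lineage drawer."""
    now = now or datetime.now(timezone.utc)
    content = "\n".join([
        "PROVENANCE:",
        f"  Person: {person}",
        f"  Relation: {relation_type}",
        f"  Quote: {quote}",
        f"  Context: {context}",
        f"  Source: {source_file}",
    ])
    metadata = {
        "wing": "wing_lineage",
        "room": person_slug(person),
        "person": person,
        "relation_type": relation_type,
        "is_quote": is_quote,
        "confidence": confidence,
        "extracted_by": extracted_by,
        "source_file": source_file,
        # Assumption: empty string instead of None, to keep values scalar.
        "source_session": source_session or "",
        "filed_at": now.isoformat(),
        "filed_at_ts": now.timestamp(),
    }
    return content, metadata
```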

Tests (14 new in `test_provenance_mining.py`, 62 total mempalace provenance suite)

Happy path: chunk + accepting classifier → 1 wing_lineage drawer with correct meta + design-doc content shape
Threshold gating (below-default rejected; custom threshold lowers floor)
Dedup: same chunk+source twice → 1 drawer; different sources → distinct drawers
Disabled mode: all truthy env-var variants yield 0 drawers
No-candidates returns 0; operational mining unaffected
Failure-soft: classifier raising → 0 drawers, no crash
Transitive-attribution rewrite (case MemPalace#11): classifier surfaces speaker name → _rewrite_speaker_to_source redirects to relation when the "<possessive> <relation>'s" pattern is present in candidate or context
Unit tests on _rewrite_speaker_to_source directly (positive, negative, None-input)
End-to-end convo_miner integration: _file_chunks_locked with a chunk produces BOTH operational drawer (wing=wing_test) AND wing_lineage drawer (wing=wing_lineage)

62/62 pass in <100ms (no live substrate required — tests inject mock classifiers).
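The transitive-attribution rewrite covered by the tests above can be approximated with a possessive-plus-relation regex. This is a hedged sketch, not the PR's `_rewrite_speaker_to_source`: the function name, the possessive list, and the relation vocabulary are all hypothetical, and the real helper may match differently.

```python
import re

# Hypothetical relation vocabulary; the PR does not list the real one.
_RELATIONS = ("grandfather", "grandmother", "father", "mother", "brother",
              "sister", "uncle", "aunt", "mentor", "teacher", "friend")

# Matches e.g. "his father's", "my grandmother's" (straight or curly apostrophe).
_POSSESSIVE_RELATION = re.compile(
    r"\b(?:my|his|her|their|our|your)\s+(" + "|".join(_RELATIONS) + r")['’]s\b",
    re.IGNORECASE,
)


def rewrite_speaker_to_source_sketch(person, candidate_text, context=""):
    """Redirect a speaker name to the relation the quote is attributed to.

    Returns the relation (lowercased) if a possessive-relation pattern is
    found in the candidate or its context; otherwise returns person unchanged.
    """
    if not person:
        return person  # None or empty input passes through untouched
    for text in (candidate_text, context):
        match = _POSSESSIVE_RELATION.search(text or "")
        if match:
            return match.group(1).lower()
    return person
```

With the example from this PR, a classifier result of "James" for text containing "his father's saying" is redirected to "father", while text with no possessive-relation phrase leaves "James" as-is.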

Phase 1 status after merge

| Envelope | PR | Status |
| --- | --- | --- |
| D1: heuristic + classifier interface | #4 | MERGED |
| D2: Qwen3 classifier + Pass-3 + calibration (precision 1.0, recall 1.0) | #5 | MERGED |
| D3: mining integration (this PR) | #6 (this) | pending |

After merge: any session mined through mempalace.convo_miner automatically produces wing_lineage drawers for person-attributions in its chunks. mempalace_search wing=wing_lineage room=father returns father attributions across all mined sessions.

Discipline

Branch base: jpwinans/mempalace main (9349760, after the merge of PR #5, "Envelope D2: provenance classifier (Qwen3 substrate) + Pass-3 regex").
Fresh worktree ~/mempalace-worktrees/d3.
PR targets jpwinans/mempalace (used -R flag).
Tests cover behavior without requiring live substrate (mock classifiers throughout); calibration test from D2 still passes against live substrate.
Cross-coder LGTM requested via hearing channel.

…iner

ENVELOPE D3 from 2026-05-11 paired build (Task MemPalace#15, Phase 1 final). Wires extract_candidates + qwen3_classifier into mempalace.convo_miner so new diary mining produces wing_lineage drawers in addition to the operational wing. Phase 1 of Task MemPalace#15 closes with this PR; Phase 2 (60k existing-drawer backfill) is its own scoping task.

Changes:

- New mempalace/provenance/mining.py with mine_chunk_for_provenance: take a chunk, run extract_candidates -> validate with classifier (default: qwen3_classifier from D2) -> rewrite transitive attributions -> dedupe -> upsert into wing_lineage.
- Transitive-attribution rewrite (architect-flagged from D2 calibration case MemPalace#11): when classifier returns speaker name (e.g., "James") for text containing "<possessive> <relation>'s" (e.g., "his father's saying"), redirect to room=<relation> (e.g., "father"). Without rewrite, "Tonight James reminded me: 'measure twice' — his father's saying" files under room='james' and a future search for "father saying" misses it.
- Dedup by (person, quote, source_file) hash baked into the drawer_id. Re-mining same source -> existing drawer; same attribution in different source files -> distinct drawers (intentional: distinct attribution events tracked separately).
- MEMPALACE_PROVENANCE_DISABLED env var (truthy: 1/true/yes, case-insensitive) makes mine_chunk_for_provenance a no-op. For environments where the classifier substrate is unavailable, CI, fresh checkouts, or backfill jobs that handle their own pass.
- convo_miner._file_chunks_locked: after the operational upsert inside the per-chunk loop, call mine_chunk_for_provenance. Run AFTER operational durability is established so a slow classifier call doesn't delay the canonical write. Failure-soft at three layers: the inner call is itself failure-soft, the convo_miner wrapper catches anything that escapes, operational mining proceeds regardless.
- DEFAULT_CONFIDENCE_THRESHOLD = 0.7 per design doc §D1. D2 calibration showed positives at 0.90-0.95 and negatives at 0.00, so 0.7 sits cleanly in the gap. Tunable via kwarg.

Schema (per Provenance-Preservation-Design §D3):

Drawer content rendered as YAML-ish PROVENANCE: block with Person / Relation / Quote / Context / Source lines. Metadata includes wing=wing_lineage, room=<person_slug>, person, relation_type, is_quote, confidence, extracted_by, source_file, source_session, filed_at, filed_at_ts.

Tests (14 new in test_provenance_mining.py; 62 total mempalace provenance tests):

- Happy path: chunk + accepting classifier -> 1 wing_lineage drawer with correct meta + design-doc content shape.
- Threshold: below-default-threshold rejected; custom threshold lets lower-confidence through.
- Dedup: same chunk+source twice -> 1 drawer; different sources -> distinct drawers.
- Disabled mode: MEMPALACE_PROVENANCE_DISABLED with 1/true/yes variants all yield 0 drawers.
- No-candidates returns 0; operational mining unaffected.
- Failure-soft: classifier raising -> 0 drawers, no crash.
- Transitive-attribution rewrite (case MemPalace#11): classifier surfaces speaker name, _rewrite_speaker_to_source redirects to relation when "<possessive> <relation>'s" appears in candidate or context.
- Unit tests on _rewrite_speaker_to_source directly (positive, negative, None-input cases).
- End-to-end convo_miner integration: _file_chunks_locked with a chunk produces BOTH operational drawer (wing=wing_test) AND wing_lineage drawer (wing=wing_lineage).

62/62 pass in <100ms (no live substrate required; tests inject mock classifiers).

Phase 1 status after this merges:

- D1 (PR #4): heuristic + classifier interface (MERGED)
- D2 (PR #5): qwen3_classifier + Pass-3 + calibration (MERGED)
- D3 (this PR): mining integration (pending)

After merge: forward-only provenance preservation is operational. No new diary mining loses biographical/relational lineage. Phase 2 (60k existing-drawer backfill) is a separate scoped task.

jpwinans merged commit c645d00 into main May 11, 2026
0 of 6 checks passed

jpwinans merged 1 commit into main from feat/mempalace-mining-provenance
