
Envelope D3: provenance mining integration — wing_lineage from convo_miner#6

Merged
jpwinans merged 1 commit into main from feat/mempalace-mining-provenance
May 11, 2026

Summary

Envelope D3 from 2026-05-11 paired build (Task MemPalace#15, Phase 1 final). Wires extract_candidates + qwen3_classifier into mempalace.convo_miner so new diary mining produces wing_lineage drawers in addition to operational wings. After this merges, forward-only provenance preservation is operational — no new diary mining loses biographical/relational lineage. Phase 2 (60k existing-drawer backfill) is a separate scoped task.

Changes

mempalace/provenance/mining.py (new module)

  • mine_chunk_for_provenance(collection, chunk_content, source_file, *, source_session=None, classifier=None, confidence_threshold=0.7, extractor_label=None) -> int — extracts candidates from a chunk, validates each with the classifier (defaults to qwen3_classifier from D2), applies the transitive-attribution rewrite, dedupes by (person, quote, source_file), upserts records ≥ threshold into wing_lineage.
  • Transitive-attribution rewrite (architect-flagged from D2 calibration case MemPalace/mempalace#11): when the classifier returns a speaker name (e.g., "James") for text containing <possessive> <relation>'s (e.g., "his father's saying"), redirect the lineage room to the relation (e.g., room='father'). Without this, "Tonight James reminded me: 'measure twice' — his father's saying" files under room='james' and a future search for "father saying" misses it.
  • Dedup: drawer_id derived from sha256(person | quote | source_file). Re-mining same source → existing drawer (chromadb upsert is idempotent on id). Different sources → distinct drawers (intentional — distinct attribution events tracked separately).
  • Env disable: MEMPALACE_PROVENANCE_DISABLED (truthy: 1/true/yes, case-insensitive) → no-op return 0. For environments without the classifier substrate (CI / fresh checkouts / backfill jobs that handle their own pass).
  • Failure-soft at every error path: classifier import error / extract exception / validate exception / upsert exception all yield 0 and log at DEBUG. Operational mining is never affected.
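The rewrite, dedup key, and env switch described above can be sketched roughly as follows. This is a minimal illustration, not the code in mempalace/provenance/mining.py: the function names, the relation vocabulary, the regex, and the "|" separator are all assumptions made for the example, and the real _rewrite_speaker_to_source signature is not shown in this PR.

```python
import hashlib
import os
import re
from typing import Mapping, Optional

# Hypothetical pattern for "<possessive> <relation>'s", e.g. "his father's saying".
# The relation list is illustrative only; the real module may use a different one.
_POSSESSIVE_RELATION = re.compile(
    r"\b(?:my|his|her|their|our|your)\s+"
    r"(father|mother|grandfather|grandmother|brother|sister|uncle|aunt)['’]s\b",
    re.IGNORECASE,
)


def rewrite_speaker_to_source_sketch(person: Optional[str], text: Optional[str]) -> Optional[str]:
    """Return the relation if the text credits one, otherwise the person unchanged."""
    if not person or not text:
        return person
    match = _POSSESSIVE_RELATION.search(text)
    return match.group(1).lower() if match else person


def drawer_id_sketch(person: str, quote: str, source_file: str) -> str:
    """Stable id from (person, quote, source_file); the separator is an assumption."""
    key = "|".join((person, quote, source_file))
    return hashlib.sha256(key.encode("utf-8")).hexdigest()


def provenance_disabled(env: Optional[Mapping[str, str]] = None) -> bool:
    """True when MEMPALACE_PROVENANCE_DISABLED is 1/true/yes (case-insensitive)."""
    env = os.environ if env is None else env
    value = env.get("MEMPALACE_PROVENANCE_DISABLED", "")
    return value.strip().lower() in {"1", "true", "yes"}
```

Because the id depends only on the three fields, re-mining the same source maps to the same id, while the same quote in a second file gets a new one.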

mempalace/convo_miner.py — hook in _file_chunks_locked

  • After the operational upsert inside the per-chunk loop, call mine_chunk_for_provenance(collection, chunk_content=..., source_file=...).
  • Run after operational durability is established so a slow classifier call doesn't delay the canonical write.
  • Failure-soft at three layers: inner call is failure-soft, convo_miner wrapper catches anything that escapes, operational mining proceeds regardless.
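The ordering and the outer failure-soft layer can be pictured with the sketch below. It is not the actual _file_chunks_locked body: the function name, the id scheme, the wing value, and the in-memory collection are stand-ins invented for the example. Only the upsert keyword names (ids, documents, metadatas) follow the chromadb collection API.

```python
import logging
from typing import Any, Callable, Dict, List, Sequence

logger = logging.getLogger("mempalace.convo_miner")


class InMemoryCollection:
    """Tiny stand-in for a chromadb collection, for illustration only."""

    def __init__(self) -> None:
        self.rows: Dict[str, Dict[str, Any]] = {}

    def upsert(self, ids: List[str], documents: List[str], metadatas: List[dict]) -> None:
        for id_, doc, meta in zip(ids, documents, metadatas):
            self.rows[id_] = {"document": doc, "metadata": meta}


def file_chunks_sketch(
    collection: Any,
    chunks: Sequence[str],
    source_file: str,
    mine_provenance: Callable[..., int],
) -> int:
    """File each chunk operationally, then attempt provenance mining."""
    filed = 0
    for index, chunk_content in enumerate(chunks):
        # 1. Canonical operational write happens first.
        collection.upsert(
            ids=[f"{source_file}:{index}"],  # hypothetical id scheme
            documents=[chunk_content],
            metadatas=[{"wing": "wing_test", "source_file": source_file}],
        )
        filed += 1
        # 2. Provenance pass runs afterwards; anything that escapes is swallowed.
        try:
            mine_provenance(collection, chunk_content=chunk_content, source_file=source_file)
        except Exception:
            logger.debug("provenance mining failed for %s", source_file, exc_info=True)
    return filed
```

Even if the provenance callable raises on every chunk, the operational count and stored rows are unchanged.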

Wing_lineage drawer schema (per design doc §D3, implemented)

Drawer content rendered as PROVENANCE: block with Person / Relation / Quote / Context / Source lines. Metadata: wing=wing_lineage, room=<person_slug>, plus person, relation_type, is_quote, confidence, extracted_by, source_file, source_session, filed_at, filed_at_ts.
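To make the schema concrete, here is one way a drawer's content and metadata could be assembled. The field names come from the schema above. The indentation of the block, the slug rule, the timestamp formats, the default extracted_by label, and the empty-string fallback for a missing session are assumptions for illustration, not the implemented behavior.

```python
import re
from datetime import datetime, timezone
from typing import Optional, Tuple


def person_slug(person: str) -> str:
    """Hypothetical slug rule: lowercase, non-alphanumerics collapsed to '_'."""
    return re.sub(r"[^a-z0-9]+", "_", person.lower()).strip("_")


def render_lineage_drawer(
    person: str,
    relation_type: str,
    quote: str,
    context: str,
    source_file: str,
    *,
    confidence: float,
    is_quote: bool = True,
    extracted_by: str = "qwen3_classifier",  # assumed label
    source_session: Optional[str] = None,
    now: Optional[datetime] = None,
) -> Tuple[str, dict]:
    """Build the PROVENANCE: content block and the drawer metadata."""
    now = now or datetime.now(timezone.utc)
    content = "\n".join(
        [
            "PROVENANCE:",
            f"  Person: {person}",
            f"  Relation: {relation_type}",
            f"  Quote: {quote}",
            f"  Context: {context}",
            f"  Source: {source_file}",
        ]
    )
    metadata = {
        "wing": "wing_lineage",
        "room": person_slug(person),
        "person": person,
        "relation_type": relation_type,
        "is_quote": is_quote,
        "confidence": confidence,
        "extracted_by": extracted_by,
        "source_file": source_file,
        # Assumes the store prefers scalar values, so None becomes "".
        "source_session": source_session or "",
        "filed_at": now.isoformat(),
        "filed_at_ts": now.timestamp(),
    }
    return content, metadata
```

Keeping room equal to the person slug is what lets a wing_lineage search filter by relation, such as room=father.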

Tests (14 new in test_provenance_mining.py, 62 total mempalace provenance suite)

  • Happy path: chunk + accepting classifier → 1 wing_lineage drawer with correct meta + design-doc content shape
  • Threshold gating (below-default rejected; custom threshold lowers floor)
  • Dedup: same chunk+source twice → 1 drawer; different sources → distinct drawers
  • Disabled mode: all truthy env-var variants yield 0 drawers
  • No-candidates returns 0; operational mining unaffected
  • Failure-soft: classifier raising → 0 drawers, no crash
  • Transitive-attribution rewrite (case MemPalace/mempalace#11): classifier surfaces speaker name → _rewrite_speaker_to_source redirects to relation when possessive-source pattern present in candidate or context
  • Unit tests on _rewrite_speaker_to_source directly (positive, negative, None-input)
  • End-to-end convo_miner integration: _file_chunks_locked with a chunk produces BOTH operational drawer (wing=wing_test) AND wing_lineage drawer (wing=wing_lineage)

62/62 pass in <100ms (no live substrate required — tests inject mock classifiers).
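The mock-classifier approach and the threshold gate can be illustrated with a small sketch. The Verdict type, the classifier call shape, and the function name are hypothetical; only the 0.7 default and the "at or above threshold" rule come from this PR.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional


@dataclass
class Verdict:
    """Hypothetical classifier result: attributed person plus confidence."""

    person: str
    confidence: float


def accepted_verdicts(
    candidates: Iterable[str],
    classifier: Callable[[str], Optional[Verdict]],
    confidence_threshold: float = 0.7,
) -> List[Verdict]:
    """Keep only verdicts whose confidence is at or above the threshold."""
    kept = []
    for candidate in candidates:
        verdict = classifier(candidate)
        if verdict is not None and verdict.confidence >= confidence_threshold:
            kept.append(verdict)
    return kept


def mock_classifier(candidate: str) -> Verdict:
    """Mock that always answers 'father' at 0.6, below the default floor."""
    return Verdict(person="father", confidence=0.6)
```

With this mock, the default threshold rejects the candidate and an explicit confidence_threshold=0.5 accepts it, which mirrors the two threshold test cases listed above without touching a live substrate.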

Phase 1 status after merge

| Envelope | PR | Status |
|---|---|---|
| D1: heuristic + classifier interface | #4 | MERGED |
| D2: Qwen3 classifier + Pass-3 + calibration (precision 1.0, recall 1.0) | #5 | MERGED |
| D3: mining integration (this PR) | #6 (this) | pending |

After merge: any session mined through mempalace.convo_miner automatically produces wing_lineage drawers for person-attributions in its chunks. mempalace_search wing=wing_lineage room=father returns father attributions across all mined sessions.

Discipline

  • Branch base: jpwinans/mempalace main (9349760, post-PR #5 merge).
  • Fresh worktree ~/mempalace-worktrees/d3.
  • PR targets jpwinans/mempalace (used -R flag).
  • Tests cover behavior without requiring live substrate (mock classifiers throughout); calibration test from D2 still passes against live substrate.
  • Cross-coder LGTM requested via hearing channel.

…iner

ENVELOPE D3 from 2026-05-11 paired build (Task MemPalace#15, Phase 1 final).

Wires extract_candidates + qwen3_classifier into mempalace.convo_miner
so new diary mining produces wing_lineage drawers in addition to the
operational wing. Phase 1 of Task MemPalace#15 closes with this PR — Phase 2
(60k existing-drawer backfill) is its own scoping task.

Changes:

  - New mempalace/provenance/mining.py with mine_chunk_for_provenance:
    take a chunk, run extract_candidates -> validate with classifier
    (default: qwen3_classifier from D2) -> rewrite transitive
    attributions -> dedupe -> upsert into wing_lineage.

  - Transitive-attribution rewrite (architect-flagged from D2
    calibration case MemPalace#11): when classifier returns speaker name
    (e.g., "James") for text containing "<possessive> <relation>'s"
    (e.g., "his father's saying"), redirect to room=<relation>
    (e.g., "father"). Without rewrite, "Tonight James reminded me:
    'measure twice' — his father's saying" files under room='james'
    and a future search for "father saying" misses it.

  - Dedup by (person, quote, source_file) hash baked into the
    drawer_id. Re-mining same source -> existing drawer; same
    attribution in different source files -> distinct drawers
    (intentional — distinct attribution events tracked separately).

  - MEMPALACE_PROVENANCE_DISABLED env var (truthy: 1/true/yes,
    case-insensitive) makes mine_chunk_for_provenance a no-op. For
    environments where the classifier substrate is unavailable, CI,
    fresh checkouts, or backfill jobs that handle their own pass.

  - convo_miner._file_chunks_locked: after the operational upsert
    inside the per-chunk loop, call mine_chunk_for_provenance. Run
    AFTER operational durability is established so a slow classifier
    call doesn't delay the canonical write. Failure-soft at three
    layers: the inner call is itself failure-soft, the convo_miner
    wrapper catches anything that escapes, operational mining
    proceeds regardless.

  - DEFAULT_CONFIDENCE_THRESHOLD = 0.7 per design doc §D1.
    D2 calibration showed positives at 0.90-0.95 and negatives at
    0.00 — 0.7 sits cleanly in the gap. Tunable via kwarg.

Schema (per Provenance-Preservation-Design §D3):
  Drawer content rendered as YAML-ish PROVENANCE: block with
  Person / Relation / Quote / Context / Source lines. Metadata
  includes wing=wing_lineage, room=<person_slug>, person,
  relation_type, is_quote, confidence, extracted_by, source_file,
  source_session, filed_at, filed_at_ts.

Tests (14 new in test_provenance_mining.py; 62 total mempalace
provenance tests):

  - Happy path: chunk + accepting classifier -> 1 wing_lineage
    drawer with correct meta + design-doc content shape.
  - Threshold: below-default-threshold rejected; custom threshold
    lets lower-confidence through.
  - Dedup: same chunk+source twice -> 1 drawer; different sources
    -> distinct drawers.
  - Disabled mode: MEMPALACE_PROVENANCE_DISABLED with 1/true/yes
    variants all yield 0 drawers.
  - No-candidates returns 0; operational mining unaffected.
  - Failure-soft: classifier raising -> 0 drawers, no crash.
  - Transitive-attribution rewrite (case MemPalace#11): classifier surfaces
    speaker name, _rewrite_speaker_to_source redirects to relation
    when "<possessive> <relation>'s" appears in candidate or context.
  - Unit tests on _rewrite_speaker_to_source directly (positive,
    negative, None-input cases).
  - End-to-end convo_miner integration: _file_chunks_locked with a
    chunk produces BOTH operational drawer (wing=wing_test) AND
    wing_lineage drawer (wing=wing_lineage).

62/62 pass in <100ms (no live substrate required — tests inject
mock classifiers).

Phase 1 status after this merges:
  - D1 (PR #4): heuristic + classifier interface — MERGED
  - D2 (PR #5): qwen3_classifier + Pass-3 + calibration — MERGED
  - D3 (this PR): mining integration — pending
  After merge: forward-only provenance preservation is operational.
  No new diary mining loses biographical/relational lineage.
  Phase 2 (60k existing-drawer backfill) is a separate scoped task.
@jpwinans jpwinans merged commit c645d00 into main May 11, 2026
0 of 6 checks passed