Skip to content

Envelope D2: provenance classifier (Qwen3 substrate) + Pass-3 regex#5

Merged
jpwinans merged 1 commit into
mainfrom
feat/mempalace-provenance-classifier
May 11, 2026
Merged

Envelope D2: provenance classifier (Qwen3 substrate) + Pass-3 regex#5
jpwinans merged 1 commit into
mainfrom
feat/mempalace-provenance-classifier

Conversation

@jpwinans

Copy link
Copy Markdown
Owner

Summary

Envelope D2 from 2026-05-11 paired build (Task MemPalace#15, Phase 1). Wires the production local-substrate classifier (Qwen3-Coder-30B at mlx_lm.server :8802) for provenance candidate validation. D1 (PR #4) shipped the heuristic + stub; D2 ships the production classifier + calibration proof + Pass-3 regex extension for cases the heuristic was missing.

Changes

Package conversion

mempalace/provenance.pymempalace/provenance/__init__.py (via git mv, history preserved). Public import path unchanged — mempalace.provenance.extract_candidates / validate_candidate / ProvenanceCandidate / ProvenanceRecord / WING_LINEAGE_SCHEMA_DOC all still resolve.

mempalace.provenance.classifier (new module)

  • qwen3_classifier(context) -> dict — production classifier. POSTs to /v1/chat/completions with a strict JSON-output prompt; parses response; handles markdown-fence stripping; returns the dict shape validate_candidate expects.
  • Failure-soft: network / HTTP / decode / shape errors return the rejection dict {is_provenance: False, confidence: 0.0} rather than raising. Mining pipeline never crashes when substrate is unavailable.
  • Env overrides: MEMPALACE_PROVENANCE_CLASSIFIER_URL, _MODEL, _TIMEOUT.
  • temperature=0 for reproducibility across mining batches.

Pass-3 in extract_candidates

Capitalized bare-relation as subject + attribution + quote. Catches calibration fixture MemPalace#14 ("Dad always told me 'never trust a smiling investor'") that Pass-1 (requires possessive prefix) and Pass-2 (also requires possessive) miss. Capitalize-only constraint keeps false-positives manageable; classifier filters lowercase mid-sentence cases via context.

Calibration result

14 hand-labeled fixtures from architect envelope. Target: precision ≥ 0.85, recall ≥ 0.85.

Metric Value
Precision 1.000
Recall 1.000
TP / FP / TN / FN 8 / 0 / 6 / 0
Confidence range (positives) 0.90 – 0.95
Confidence range (negatives) 0.00
Avg latency / call ~0.9 s

Clean separation between positives and negatives — no threshold ambiguity. Recalibrate if either metric drifts below 0.85 on a future model swap.

Tests (49 total, 49 pass in 17s)

  • test_provenance.py — 26 existing + 4 new Pass-3 (dad-always-told, mom-used-to-say, roshi-taught, capitalization-required-negative).
  • test_classifier.py — 18 unit tests with mocked urllib.request.urlopen. Happy path, code-fence stripping (both ```json and bare ```), request-shape (model id, temp=0, max_tokens), 6 failure-soft paths (URLError, HTTPError, TimeoutError, malformed outer JSON, missing choices, malformed inner JSON, missing is_provenance key, non-dict inner), confidence coercion (string→0, >1 clamp, <0 clamp), env-var overrides.
  • test_classifier_calibration.py — 1 live-substrate test against the pinned 14-fixture set. Auto-skipped when substrate unreachable (HEAD probe on /v1/models). Asserts precision and recall ≥ 0.85.

Scope honored

  • validate_candidate(classifier=None) default still uses the D1 stub — test path unchanged for downstream consumers.
  • Production paths explicitly pass qwen3_classifier. The D3 envelope wires this into mempalace.miner.convo_miner.
  • No changes outside mempalace/provenance/ + tests/. No new dependencies (stdlib urllib.request).

Discipline

ENVELOPE D2 from 2026-05-11 paired build (Task MemPalace#15, Phase 1).

Wires the real local-substrate classifier (Qwen3-Coder-30B at
mlx_lm.server :8802) for provenance candidate validation. D1 shipped
the heuristic + stub; D2 ships the production classifier + calibration
proof.

Changes:

  - Convert mempalace/provenance.py to mempalace/provenance/ package
    via git mv to __init__.py. Public import path unchanged
    (mempalace.provenance.extract_candidates / validate_candidate /
    ProvenanceCandidate / ProvenanceRecord still resolve).

  - Add mempalace.provenance.classifier module with:
      - qwen3_classifier(context) -> dict matching the
        validate_candidate classifier interface
      - urllib-based POST to /v1/chat/completions (mirrors
        closet_llm pattern in this repo)
      - Strict JSON prompt template with rules: vague "she said"
        rejected, operational content rejected, person-mention
        without attribution rejected, conservative when in doubt
      - Failure-soft: network/HTTP/decode/shape errors return the
        rejection dict {is_provenance: False, confidence: 0.0}
        rather than raising — mining pipeline never crashes on
        unavailable substrate
      - Env overrides: MEMPALACE_PROVENANCE_CLASSIFIER_URL,
        _MODEL, _TIMEOUT
      - temperature=0 for reproducibility across mining batches
      - Markdown code-fence stripping for ```json...``` wrapped
        outputs the model occasionally emits

  - Add Pass-3 to extract_candidates: capitalized bare-relation as
    subject + attribution + quote. Catches calibration fixture MemPalace#14
    ("Dad always told me 'never trust a smiling investor'") which
    Pass-1 (requires possessive prefix) and Pass-2 (also requires
    possessive) miss. Capitalize-only constraint keeps false-
    positives manageable; classifier filters lowercase mid-sentence
    cases via context.

Calibration (live mlx_lm.server, 2026-05-11):

  14 hand-labeled fixtures from architect envelope. Target:
  precision >= 0.85, recall >= 0.85.

  Result: precision = 1.000, recall = 1.000.
    TP=8 FP=0 TN=6 FN=0.
    Confidence range: 0.90-0.95 on positives, 0.00 on negatives —
    clean separation, no calibration ambiguity.
    Average latency: 0.9s per call.

Tests (49 total, 49 pass in 17s):

  - tests/test_provenance.py: 26 existing + 4 new Pass-3 tests
    (dad-always-told-me variant, mom-used-to-say, roshi-taught-me,
    capitalization-required-negative).

  - tests/test_classifier.py: 18 unit tests covering happy-path,
    code-fence stripping (both ```json``` and bare ```),
    request-shape validation (model id, temp=0, max_tokens),
    all six failure-soft paths (URLError, HTTPError, TimeoutError,
    malformed outer JSON, missing choices, malformed inner JSON,
    missing is_provenance key, non-dict inner), confidence
    coercion (string->0, >1 clamped, <0 clamped), env-var overrides
    for endpoint + model.

  - tests/test_classifier_calibration.py: 1 live-substrate test
    against the pinned 14-fixture set. Auto-skipped when substrate
    unreachable (HEAD probe on /v1/models). Asserts precision >=
    0.85 and recall >= 0.85.

Scope honored: validate_candidate's classifier=None default still
uses the stub (test path unchanged); production callers explicitly
pass qwen3_classifier. D3 mining integration is the next envelope.
@jpwinans jpwinans merged commit 9349760 into main May 11, 2026
0 of 6 checks passed
jpwinans added a commit that referenced this pull request May 11, 2026
…iner (#6)

ENVELOPE D3 from 2026-05-11 paired build (Task MemPalace#15, Phase 1 final).

Wires extract_candidates + qwen3_classifier into mempalace.convo_miner
so new diary mining produces wing_lineage drawers in addition to the
operational wing. Phase 1 of Task MemPalace#15 closes with this PR — Phase 2
(60k existing-drawer backfill) is its own scoping task.

Changes:

  - New mempalace/provenance/mining.py with mine_chunk_for_provenance:
    take a chunk, run extract_candidates -> validate with classifier
    (default: qwen3_classifier from D2) -> rewrite transitive
    attributions -> dedupe -> upsert into wing_lineage.

  - Transitive-attribution rewrite (architect-flagged from D2
    calibration case MemPalace#11): when classifier returns speaker name
    (e.g., "James") for text containing "<possessive> <relation>'s"
    (e.g., "his father's saying"), redirect to room=<relation>
    (e.g., "father"). Without rewrite, "Tonight James reminded me:
    'measure twice' — his father's saying" files under room='james'
    and a future search for "father saying" misses it.

  - Dedup by (person, quote, source_file) hash baked into the
    drawer_id. Re-mining same source -> existing drawer; same
    attribution in different source files -> distinct drawers
    (intentional — distinct attribution events tracked separately).

  - MEMPALACE_PROVENANCE_DISABLED env var (truthy: 1/true/yes,
    case-insensitive) makes mine_chunk_for_provenance a no-op. For
    environments where the classifier substrate is unavailable, CI,
    fresh checkouts, or backfill jobs that handle their own pass.

  - convo_miner._file_chunks_locked: after the operational upsert
    inside the per-chunk loop, call mine_chunk_for_provenance. Run
    AFTER operational durability is established so a slow classifier
    call doesn't delay the canonical write. Failure-soft at three
    layers: the inner call is itself failure-soft, the convo_miner
    wrapper catches anything that escapes, operational mining
    proceeds regardless.

  - DEFAULT_CONFIDENCE_THRESHOLD = 0.7 per design doc §D1.
    D2 calibration showed positives at 0.90-0.95 and negatives at
    0.00 — 0.7 sits cleanly in the gap. Tunable via kwarg.

Schema (per Provenance-Preservation-Design §D3):
  Drawer content rendered as YAML-ish PROVENANCE: block with
  Person / Relation / Quote / Context / Source lines. Metadata
  includes wing=wing_lineage, room=<person_slug>, person,
  relation_type, is_quote, confidence, extracted_by, source_file,
  source_session, filed_at, filed_at_ts.

Tests (14 new in test_provenance_mining.py; 62 total mempalace
provenance tests):

  - Happy path: chunk + accepting classifier -> 1 wing_lineage
    drawer with correct meta + design-doc content shape.
  - Threshold: below-default-threshold rejected; custom threshold
    lets lower-confidence through.
  - Dedup: same chunk+source twice -> 1 drawer; different sources
    -> distinct drawers.
  - Disabled mode: MEMPALACE_PROVENANCE_DISABLED with 1/true/yes
    variants all yield 0 drawers.
  - No-candidates returns 0; operational mining unaffected.
  - Failure-soft: classifier raising -> 0 drawers, no crash.
  - Transitive-attribution rewrite (case MemPalace#11): classifier surfaces
    speaker name, _rewrite_speaker_to_source redirects to relation
    when "<possessive> <relation>'s" appears in candidate or context.
  - Unit tests on _rewrite_speaker_to_source directly (positive,
    negative, None-input cases).
  - End-to-end convo_miner integration: _file_chunks_locked with a
    chunk produces BOTH operational drawer (wing=wing_test) AND
    wing_lineage drawer (wing=wing_lineage).

62/62 pass in <100ms (no live substrate required — tests inject
mock classifiers).

Phase 1 status after this merges:
  - D1 (PR #4): heuristic + classifier interface — MERGED
  - D2 (PR #5): qwen3_classifier + Pass-3 + calibration — MERGED
  - D3 (this PR): mining integration — pending
  After merge: forward-only provenance preservation is operational.
  No new diary mining loses biographical/relational lineage.
  Phase 2 (60k existing-drawer backfill) is a separate scoped task.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant