
feat(aprender-core): tokenizer-bpe-v1 INV-BPE-003 PARTIAL_ALGORITHM_LEVEL (also FALSIFY-SHIP-012) #1159

Merged
noahgift merged 2 commits into main from feat/bpe-inv-003-partial-discharge
Apr 30, 2026


@noahgift

Contributor

Summary

  • Algorithm-level PARTIAL discharge for INV-BPE-003 (round-trip byte-equality on 10K held-out docs) per contracts/tokenizer-bpe-v1.yaml. Also discharges FALSIFY-SHIP-012 directly.
  • New module crates/aprender-core/src/format/bpe_inv_003.rs exporting verdict_from_roundtrip_scan(docs_scanned, roundtrip_failures) -> BpeInv003Verdict.
  • Pinned constant: AC_BPE_INV_003_REQUIRED_DOCS = 10_000.
  • 17 unit tests across 7 mutation-survey sections.

Why zero-tolerance

decode(encode(nfc(text))) != nfc(text) means the tokenizer is not injective on that input. Even one failure means MODEL-2's loss target is computed against tokens that cannot reconstruct the text, which silently corrupts the cross-entropy gradient on that document.

Why 10K floor

Round-trip failures cluster around rare Unicode (emoji ZWJ sequences, RTL combining marks, control sequences) at roughly 1-in-10K to 1-in-100K documents. A 1K-doc sample would usually miss those classes; a 100K-doc scan would over-tax smoke runs.

Cross-reference

This verdict ALSO discharges FALSIFY-SHIP-012 (round-trip gate from SHIP-TWO-001 spec §15.4) — the contract identifies them as the same gate. Two contract IDs share one PARTIAL discharge.

Five-Whys (commit-message body has full chain)

  1. Bind now → round-trip injectivity underpins MODEL-2 loss-target correctness; without a pin, a regression ships invisibly.
  2. (u64, u64) pair → algorithm-level pin; wiring in the streaming scanner is FULL_DISCHARGE work.
  3. Pin 10K → statistical-power match to the contract.
  4. Partition violation → counter corruption mustn't pass.
  5. 17 tests → 7 mutation-survey sections.

Test plan

  • cargo test -p aprender-core --lib bpe_inv_003 → 17 passed
  • PMAT pre-commit gates pass

🤖 Generated with Claude Code

…EVEL

Algorithm-level PARTIAL discharge for FALSIFY/INV-BPE-003 (round-trip
byte-equality on 10K held-out docs) per
`contracts/tokenizer-bpe-v1.yaml`. Also discharges FALSIFY-SHIP-012
directly.

## What this binds

`verdict_from_roundtrip_scan(docs_scanned, roundtrip_failures)`
returns Pass iff:

1. `docs_scanned >= 10_000` (contract statistical floor)
2. `roundtrip_failures == 0` (zero-tolerance — non-injective
   tokenization corrupts loss target)
3. `roundtrip_failures <= docs_scanned` (counter sanity)

Pinned constant: `AC_BPE_INV_003_REQUIRED_DOCS = 10_000`.
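
The three conditions above can be sketched as a small pure function. This is a minimal illustration, not the code in `bpe_inv_003.rs`: the `u64` parameter types follow the "(u64, u64) pair" wording in the Five-Whys, `Pass` is the only variant the text names, and the `Fail` variant and the order of the checks are assumptions.

```rust
/// Minimum number of held-out documents required by the contract.
pub const AC_BPE_INV_003_REQUIRED_DOCS: u64 = 10_000;

/// Verdict for the INV-BPE-003 gate.
/// `Pass` is named in the PR text; `Fail` is an assumed variant name.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum BpeInv003Verdict {
    Pass,
    Fail,
}

/// Returns `Pass` only when all three listed conditions hold.
pub fn verdict_from_roundtrip_scan(
    docs_scanned: u64,
    roundtrip_failures: u64,
) -> BpeInv003Verdict {
    // Counter sanity: failures are a subset of the scanned documents.
    if roundtrip_failures > docs_scanned {
        return BpeInv003Verdict::Fail;
    }
    // Statistical floor: fewer than 10K documents is not enough evidence.
    if docs_scanned < AC_BPE_INV_003_REQUIRED_DOCS {
        return BpeInv003Verdict::Fail;
    }
    // Zero tolerance: a single round-trip failure fails the gate.
    if roundtrip_failures != 0 {
        return BpeInv003Verdict::Fail;
    }
    BpeInv003Verdict::Pass
}
```

With only two variants the counter-sanity check cannot change the outcome, because zero failures never exceed the scanned count. It is kept explicit here to mirror the three listed conditions, and it would matter if the verdict type distinguished failure reasons.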

## Why zero-tolerance

`decode(encode(nfc(text))) != nfc(text)` means the tokenizer is not
injective on the input domain. Even one failure means MODEL-2's
training loss is computed against tokens that don't reconstruct the
original text — silently corrupting the cross-entropy gradient on
that document. The contract falsifier ("Any non-zero diff bytes
fails") admits no tolerance band.
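
To make the byte-equality predicate concrete, here is a hedged sketch using toy stand-ins. Nothing below is the project's tokenizer: `roundtrips`, `identity_normalize`, `byte_encode`, `byte_decode`, and `lossy_encode` are hypothetical helpers, and the normalizer is an identity function standing in for real NFC normalization.

```rust
/// True when decode(encode(normalize(text))) reproduces normalize(text)
/// byte for byte. All three steps are passed in, so any tokenizer fits.
pub fn roundtrips<N, E, D>(text: &str, normalize: N, encode: E, decode: D) -> bool
where
    N: Fn(&str) -> String,
    E: Fn(&str) -> Vec<u32>,
    D: Fn(&[u32]) -> String,
{
    let normalized = normalize(text);
    let decoded = decode(&encode(&normalized));
    decoded.as_bytes() == normalized.as_bytes()
}

/// Stand-in for NFC normalization (assumption: returns the input as is).
pub fn identity_normalize(text: &str) -> String {
    text.to_string()
}

/// Toy lossless tokenizer: one token per UTF-8 byte.
pub fn byte_encode(text: &str) -> Vec<u32> {
    text.bytes().map(u32::from).collect()
}

/// Inverse of `byte_encode` for tokens in the 0..=255 range.
pub fn byte_decode(tokens: &[u32]) -> String {
    let bytes: Vec<u8> = tokens.iter().map(|&t| t as u8).collect();
    String::from_utf8_lossy(&bytes).into_owned()
}

/// Toy lossy tokenizer: drops the zero-width joiner (U+200D) before
/// encoding, so two different inputs can map to the same tokens.
pub fn lossy_encode(text: &str) -> Vec<u32> {
    byte_encode(&text.replace('\u{200D}', ""))
}
```

With these stand-ins, `roundtrips("a\u{200D}b", identity_normalize, lossy_encode, byte_decode)` is `false` while the same call with `byte_encode` is `true`, which is the kind of single-document diff the zero-tolerance rule rejects.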

## Why 10K floor

The contract specifies a 10K-doc held-out corpus. Round-trip
failures cluster around rare Unicode patterns (emoji ZWJ sequences,
RTL combining marks, control sequences) that occur at frequencies
of ~1-in-10K to 1-in-100K documents. A scan over 1K docs would
usually miss those classes; a scan over 100K would over-tax every
smoke run. Pinning 10K matches the contract's stat-power requirement.
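
As a rough worked example of the sample-size argument, the chance that a scan contains at least one document from a rare class is 1 - (1 - p)^n. This assumes the class occurs independently in each document with frequency p, a simplification that the contract does not state. `prob_at_least_one` is a hypothetical helper.

```rust
/// Probability that a sample of `n_docs` documents contains at least one
/// document from a class with per-document frequency `p`, assuming
/// independent occurrences: 1 - (1 - p)^n.
pub fn prob_at_least_one(p: f64, n_docs: u64) -> f64 {
    1.0 - (1.0 - p).powf(n_docs as f64)
}
```

Under that assumption a 1-in-10K class appears in a 1K-doc scan about 10% of the time and in a 10K-doc scan about 63% of the time. A 1-in-100K class appears in a 10K-doc scan about 10% of the time.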

## Why partition-violation check

`roundtrip_failures > docs_scanned` indicates counter corruption:
the failure set cannot exceed the total scanned. This catches a
counter-rollover bug or a double-count regression.

## Five-Whys

1. Why bind INV-BPE-003 now? — Round-trip injectivity is the
   foundation of MODEL-2's training-target correctness; without a
   verdict pin, a regression in NFC handling or BPE merge-table
   loading silently corrupts every gradient.
2. Why a (u64, u64) pair, not the actual scanner? —
   Algorithm-level pin; the streaming round-trip scanner is
   FULL_DISCHARGE work for the corpus-tokenize PR.
3. Why pin 10K floor? — Statistical-power match to contract;
   smaller samples miss rare-Unicode classes.
4. Why partition-violation check? — Counter corruption mustn't
   silently pass.
5. Why 17 tests across 7 sections? — Mutation survey: provenance
   pin (×1), pass band (×4: floor, +1, CSN-Python, huge),
   zero-tolerance fail (×4 incl. one-in-million), sample-size fail (×3),
   partition violation (×2), boundary sweep (8 points in each
   dimension), and zero-tolerance property (4 sizes).

## Cross-reference

This verdict ALSO discharges FALSIFY-SHIP-012 (round-trip gate from
SHIP-TWO-001 spec §15.4) — the contract identifies the two as the
same gate. Two contract IDs share one PARTIAL discharge.

## Scope

PARTIAL_ALGORITHM_LEVEL only. Wiring this into the actual streaming
round-trip scanner that produces `(docs_scanned, roundtrip_failures)`
is FULL_DISCHARGE work for the corpus-tokenize implementation PR.
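
For orientation only, the counter pair could come from a single pass over the corpus along the lines below. This is a hypothetical sketch, not the planned scanner: `scan_roundtrips` and its predicate parameter are assumed names, and the real FULL_DISCHARGE work may look quite different.

```rust
/// Streams over documents once and returns
/// `(docs_scanned, roundtrip_failures)`.
/// `roundtrips` is any predicate that reports whether one document
/// survives the encode/decode round trip.
pub fn scan_roundtrips<'a, I, F>(docs: I, roundtrips: F) -> (u64, u64)
where
    I: IntoIterator<Item = &'a str>,
    F: Fn(&str) -> bool,
{
    let mut docs_scanned: u64 = 0;
    let mut roundtrip_failures: u64 = 0;
    for doc in docs {
        docs_scanned += 1;
        // Each document contributes at most one failure, so the
        // failure count can never exceed the scanned count here.
        if !roundtrips(doc) {
            roundtrip_failures += 1;
        }
    }
    (docs_scanned, roundtrip_failures)
}
```

The returned pair has the shape that `verdict_from_roundtrip_scan` takes. Counting each document at most once keeps the partition invariant true by construction, so the verdict-side check guards against later regressions rather than this loop.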

## Tests

17 unit tests, all green.
noahgift enabled auto-merge (squash) April 30, 2026 07:49
…ial-discharge

# Conflicts:
#	crates/aprender-core/src/format/mod.rs
noahgift merged commit 8383743 into main Apr 30, 2026
10 checks passed
noahgift deleted the feat/bpe-inv-003-partial-discharge branch April 30, 2026 08:46
noahgift added a commit that referenced this pull request May 12, 2026
…EVEL (#1159)
