feat(aprender-core): tokenizer-bpe-v1 INV-BPE-003 PARTIAL_ALGORITHM_LEVEL (also FALSIFY-SHIP-012) #1159
Merged
…EVEL
Algorithm-level PARTIAL discharge for FALSIFY/INV-BPE-003 (round-trip
byte-equality on 10K held-out docs) per
`contracts/tokenizer-bpe-v1.yaml`. Also FALSIFY-SHIP-012 directly.
## What this binds
`verdict_from_roundtrip_scan(docs_scanned, roundtrip_failures)`
returns Pass iff:
1. `docs_scanned >= 10_000` (contract statistical floor)
2. `roundtrip_failures == 0` (zero-tolerance — non-injective
tokenization corrupts loss target)
3. `roundtrip_failures <= docs_scanned` (counter sanity)
Pinned constant: `AC_BPE_INV_003_REQUIRED_DOCS = 10_000`.
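
A minimal sketch of the three conditions above, for illustration only. The function name, the constant, and the `BpeInv003Verdict` type name come from this PR; the two-variant shape of the enum and the `Fail` variant name are assumptions, and the real module may carry richer failure reasons.

```rust
/// Minimum number of held-out documents required by the contract.
pub const AC_BPE_INV_003_REQUIRED_DOCS: u64 = 10_000;

/// Verdict type. The `Fail` variant name is an assumption made for this sketch.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum BpeInv003Verdict {
    Pass,
    Fail,
}

pub fn verdict_from_roundtrip_scan(docs_scanned: u64, roundtrip_failures: u64) -> BpeInv003Verdict {
    // Counter sanity: the failure count can never exceed the scanned count.
    // Checked first so that corrupted counters are rejected explicitly.
    if roundtrip_failures > docs_scanned {
        return BpeInv003Verdict::Fail;
    }
    // Statistical floor: fewer than 10_000 documents is not enough evidence.
    if docs_scanned < AC_BPE_INV_003_REQUIRED_DOCS {
        return BpeInv003Verdict::Fail;
    }
    // Zero tolerance: a single round-trip failure fails the gate.
    if roundtrip_failures != 0 {
        return BpeInv003Verdict::Fail;
    }
    BpeInv003Verdict::Pass
}
```

Under these assumptions, `verdict_from_roundtrip_scan(10_000, 0)` passes, while `(9_999, 0)`, `(1_000_000, 1)`, and `(10, 11)` all fail.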
## Why zero-tolerance
`decode(encode(nfc(text))) != nfc(text)` means the tokenizer is not
injective on the input domain. Even one failure means MODEL-2's
training loss is computed against tokens that don't reconstruct the
original text — silently corrupting the cross-entropy gradient on
that document. The contract falsifier ("Any non-zero diff bytes
fails") admits no tolerance band.
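
To make the byte-equality check concrete, here is a toy sketch. Both tokenizers are hypothetical stand-ins, not the BPE tokenizer in this repository, and the input is assumed to be NFC-normalized already because the standard library has no normalizer.

```rust
/// Hypothetical out-of-vocabulary token id (not a valid Unicode scalar value).
const UNK: u32 = u32::MAX;

/// Toy lossy tokenizer: ASCII characters map to their code point, all else to UNK.
fn lossy_encode(text: &str) -> Vec<u32> {
    text.chars()
        .map(|c| if c.is_ascii() { c as u32 } else { UNK })
        .collect()
}

/// Decodes the toy lossy tokens; UNK becomes U+FFFD, so the original is lost.
fn lossy_decode(tokens: &[u32]) -> String {
    tokens
        .iter()
        .map(|&t| char::from_u32(t).unwrap_or('\u{FFFD}'))
        .collect()
}

/// Toy byte-level tokenizer: one token per UTF-8 byte, so nothing is lost.
fn byte_encode(text: &str) -> Vec<u32> {
    text.bytes().map(u32::from).collect()
}

fn byte_decode(tokens: &[u32]) -> String {
    let bytes: Vec<u8> = tokens.iter().map(|&t| t as u8).collect();
    String::from_utf8_lossy(&bytes).into_owned()
}

/// True when decode(encode(text)) is byte-identical to text.
fn roundtrips(
    text: &str,
    encode: impl Fn(&str) -> Vec<u32>,
    decode: impl Fn(&[u32]) -> String,
) -> bool {
    decode(&encode(text)).as_bytes() == text.as_bytes()
}
```

With these toys, an emoji ZWJ sequence survives the byte-level pair but not the lossy pair, which is the kind of single-document failure the gate refuses to tolerate.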
## Why 10K floor
The contract specifies a 10K-doc held-out corpus. Round-trip
failures cluster around rare Unicode patterns (emoji ZWJ sequences,
RTL combining marks, control sequences) that occur at frequencies
of ~1-in-10K to 1-in-100K. A scan over 1K docs would most likely miss
those classes; a scan over 100K would over-tax every smoke run.
Pinning 10K matches the contract's stat-power requirement.
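
As a rough worked example of the sample-size trade-off, the chance that a scan of `n` documents contains at least one document with a pattern of per-document rate `p` is `1 - (1 - p)^n`. This assumes documents are independent draws, which is a simplification, and the helper name is hypothetical.

```rust
/// Probability that at least one of `docs_scanned` documents contains a
/// pattern occurring independently at `per_doc_rate` per document.
/// Computes 1 - (1 - p)^n via ln_1p/exp_m1 to stay accurate for tiny p.
fn detection_probability(per_doc_rate: f64, docs_scanned: u64) -> f64 {
    -(docs_scanned as f64 * (-per_doc_rate).ln_1p()).exp_m1()
}
```

Under that assumption, a 1-in-10K pattern shows up in about 10% of 1K-doc scans, about 63% of 10K-doc scans, and more than 99.99% of 100K-doc scans; a 1-in-100K pattern shows up in about 10% of 10K-doc scans.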
## Why partition-violation check
`roundtrip_failures > docs_scanned` indicates counter corruption —
the failure set cannot exceed the total scanned. Catches a counter-
rollover bug or double-count regression.
## Five-Whys
1. Why bind INV-BPE-003 now? — Round-trip injectivity is the
foundation of MODEL-2's training-target correctness; without a
verdict pin, a regression in NFC handling or BPE merge-table
loading silently corrupts every gradient.
2. Why a (u64, u64) pair, not the actual scanner? —
Algorithm-level pin; the streaming round-trip scanner is
FULL_DISCHARGE work for the corpus-tokenize PR.
3. Why pin 10K floor? — Statistical-power match to contract;
smaller samples miss rare-Unicode classes.
4. Why partition-violation check? — Counter corruption mustn't
silently pass.
5. Why 17 tests across 7 sections? — Mutation survey: provenance
pin (×1), pass band (×4: floor, +1, CSN-Python, huge), zero-
tolerance fail (×4 incl. one-in-million), sample-size fail (×3),
partition violation (×2), boundary sweep (×8 each dimension), and
zero-tolerance property (×4 sizes).
## Cross-reference
This verdict ALSO discharges FALSIFY-SHIP-012 (round-trip gate from
SHIP-TWO-001 spec §15.4) — the contract identifies the two as the
same gate. Two contract IDs share one PARTIAL discharge.
## Scope
PARTIAL_ALGORITHM_LEVEL only. Wiring this into the actual streaming
round-trip scanner that produces `(docs_scanned, roundtrip_failures)`
is FULL_DISCHARGE work for the corpus-tokenize implementation PR.
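
For orientation only, the counter pair could be produced by a loop along these lines. This is a sketch under stated assumptions, not the planned scanner: `scan_roundtrips`, its closure-based signature, and the `u32` token type are all hypothetical, and documents are assumed to be NFC-normalized before they reach it.

```rust
/// Counts documents scanned and documents whose round trip is not byte-identical.
/// Returns `(docs_scanned, roundtrip_failures)`.
fn scan_roundtrips<'a, I, E, D>(docs: I, encode: E, decode: D) -> (u64, u64)
where
    I: IntoIterator<Item = &'a str>,
    E: Fn(&str) -> Vec<u32>,
    D: Fn(&[u32]) -> String,
{
    let mut docs_scanned: u64 = 0;
    let mut roundtrip_failures: u64 = 0;
    for doc in docs {
        docs_scanned += 1;
        // Each document contributes at most one failure, so the failure
        // count cannot exceed the scanned count.
        if decode(&encode(doc)).as_bytes() != doc.as_bytes() {
            roundtrip_failures += 1;
        }
    }
    (docs_scanned, roundtrip_failures)
}
```

Because it consumes any iterator of `&str`, a loop like this can stream documents one at a time, and its output tuple maps directly onto the arguments of the verdict function.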
## Tests
17 unit tests, all green.
…ial-discharge
# Conflicts:
#	crates/aprender-core/src/format/mod.rs
noahgift added a commit that referenced this pull request on May 12, 2026
## Summary
- Algorithm-level PARTIAL discharge for FALSIFY/INV-BPE-003 per `contracts/tokenizer-bpe-v1.yaml`. Also discharges FALSIFY-SHIP-012 directly.
- Adds `crates/aprender-core/src/format/bpe_inv_003.rs`, exporting `verdict_from_roundtrip_scan(docs_scanned, roundtrip_failures) -> BpeInv003Verdict`.
- Pinned constant: `AC_BPE_INV_003_REQUIRED_DOCS = 10_000`.
## Why zero-tolerance
`decode(encode(nfc(text))) != nfc(text)` means the tokenizer is not injective. Even one failure means MODEL-2's loss target is computed against unreconstructable tokens, which silently corrupts the cross-entropy gradient on that document.
## Why 10K floor
Round-trip failures cluster around rare Unicode (emoji ZWJ, RTL combining marks, control sequences) at ~1-in-10K to 1-in-100K. 1K samples are likely to miss those classes; 100K over-taxes smoke runs.
## Cross-reference
This verdict ALSO discharges FALSIFY-SHIP-012 (round-trip gate from SHIP-TWO-001 spec §15.4); the contract identifies them as the same gate. Two contract IDs share one PARTIAL discharge.
## Five-Whys (commit-message body has full chain)
## Test plan
- `cargo test -p aprender-core --lib bpe_inv_003` → 17 passed

🤖 Generated with Claude Code