fix(apr-cli): upfront vocab-format detection unblocks Qwen encoding (PMAT-CODE-TOKENIZE-BPE-UPSTREAM-001) #1598
Merged
…PMAT-CODE-TOKENIZE-BPE-UPSTREAM-001)

ROOT CAUSE pinned + fixed.

PR #1596 shipped a "try hex first, fall through on FALSIFY-001" strategy that depended on PR #1585's load-time fail-fast. With #1585 not yet merged, the hex loader silently succeeded on Qwen-format vocabs and produced 99% `<unk>` (entropy 0.111 bits / 17.21 max).

The encoder itself was not the bug. Two new falsifier tests confirm `aprender::text::bpe::BpeTokenizer` works correctly:

- falsify_bpe_qwen_encode_python_does_not_unk_99pct: load_from_json on real Qwen2 tokenizer.json + encode Python: 0% unk, 43 tokens, 0/43 = 0% (was the predicted 99% RED)
- falsify_bpe_load_from_files_matches_load_from_json_encode: load_from_files vs load_from_json on same vocab: identical IDs `[750, 75698, 1445, 1648, 198, 220, 220, 220, 470, 308, 198]`, 0/11 unk in both paths

Both tests are host-gated on Qwen tokenizer.json presence (skip if missing).

THE FIX

Replace the dependency-on-#1585 dispatch with UPFRONT FORMAT DETECTION. Count canonical hex-byte tokens "00".."ff" in vocab.json directly.

- ≥ 200 (legitimate hex vocabs always have all 256) → Hex path
- < 200 (HF GPT-2 byte-level vocabs have ~36) → ByteLevel path

Detection runs against vocab.json content, independent of any loader's behavior. Works whether or not PR #1585 has merged.

LIVE EVIDENCE on lambda-vector RTX 4090

100-doc Python smoke from /mnt/.../python-permissive.jsonl:

| Vocab format | BEFORE this PR | AFTER this PR |
|---|---|---|
| Hex (model-2-tokenizer-v1) | 12.009 bits, 13K distinct | 12.009 bits, 13K distinct (regression-free) |
| GPT-2 byte-level (Qwen) | 0.111 bits, 16 distinct, 99.02% unk | 6.582 bits, 6118 distinct, 0.00% unk |

The Qwen path now correctly produces real Python tokenization. This unblocks the canonical path forward for SHIP-TWO §60: re-tokenize the 5g.1 corpus → re-dispatch 5g.2 → honest val_loss → flip MODEL-2 ship % 57% → ≥58%.

Five-Whys

1. Why was PR #1596's dispatch broken? It assumed PR #1585's fail-fast was on main, but #1585 was still OPEN. Hex loader silently accepted Qwen vocab → produced 99% unk → byte-level fallback never fired.
2. Why detect upfront instead of fixing the dependency chain? PR #1585's fail-fast is a load-time signal; this PR's detection is the same logic moved one level up. Now the dispatch works regardless of which path's loader runs first. Cleaner DAG.
3. Why count hex-byte tokens specifically? The presence of all 256 "00".."ff" hex strings is the canonical signature of `apr tokenize train`'s output. Any vocab without them is either GPT-2 byte-level or some other format → byte-level encoder is the correct choice (or refuse if even that fails).
4. Why prefer tokenizer.json when present? It's the canonical HF format with `added_tokens` registered. `load_from_files` on vocab.json+merges.txt also works (verified by upstream-002 test) but tokenizer.json is the higher-fidelity input.
5. Why ship the falsifier tests alongside? They CONFIRM the encoder works correctly when invoked properly. If a future refactor breaks the byte-level path (or the load functions diverge), the tests fail fast. Drift prevention.

Quality gates (all green)

- cargo test -p aprender-core --lib falsify_bpe: 2 tests PASS
- cargo test -p apr-cli --features training --lib: 5644/5644 PASS
- cargo clippy -p apr-cli --features training --lib -- -D warnings: clean
- cargo check --workspace: clean
- rustfmt --check: clean
- LIVE: hex format 12.009 bits (regression-free)
- LIVE: byte-level format 6.582 bits, 0% unk (was 0.111 / 99% unk)

SHIP-TWO impact

- MODEL-1 ship %: unchanged at 91%
- MODEL-2 ship %: unchanged at 57%, but the path forward is NOW TECHNICALLY UNBLOCKED. Re-tokenizing the 5g.1 corpus with this fix and re-dispatching 5g.2 produces an HONEST val_loss verdict.
- §50.4 cascade: COMPLETE per #1577
- 5g.2 dispatch: OPERATOR-RUNNABLE end-to-end with WORKING encoder
- This PR closes PMAT-CODE-TOKENIZE-BPE-UPSTREAM-001 (task #20)
- Next ship-mover: PMAT-CODE-PRETRAIN-FINETUNE-LIVE-003 (re-encode 5g.1, re-dispatch 5g.2 LIVE), operator-dispatchable now.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
noahgift added a commit that referenced this pull request on May 13, 2026
…rfaces H4 (PMAT-CODE-PRETRAIN-INIT-LOAD-003) (#1600)

Records the full discharge of PMAT-CODE-PRETRAIN-FINETUNE-LIVE-003 (task #21) and the new H4 defect surface that the honest data exposed.

Two artifacts:

1. **5g.1 re-encode SUCCESS**: `apr tokenize encode-corpus` with PR #1598's upfront vocab-format detection produced a real Python corpus from the 3.0 GB JSONL source:
   - 1,241.7 M tokens
   - 405,944 documents
   - 126 shards × 10 M tokens each
   - Shard-0 first 32K: entropy 7.42 bits / 17.21 max; 3324 distinct tokens; **0% unk** (was 99.99% unk in §60's broken corpus)

   The data-bug from §60 is fully closed.

2. **5g.2 LIVE dispatch surfaces H4**: re-running fine-tune from Qwen 0.5B init on the now-real corpus aborted at GATE-TRAIN-005:
   - 500-step run: val_loss = 11.55 at epoch 0 (> 10.0 threshold)
   - 1-step diagnostic: val_loss = 19.80 (> ln(vocab) ≈ 11.93 nats, and also above log2(vocab) = 17.21 bits)

   A val_loss above the uniform baseline means the model assigns LESS than uniform probability to true tokens, i.e. *worse than random init*. The Qwen init weights load (PR #1579's populate-coverage fix is in main) but produce sub-random predictions.

Five-Whys

1. Why was val_loss = 19.80 at step 1? Industry baseline for Qwen 0.5B zero-shot on Python is ~1.5–3.0; uniform random over vocab is ln(151643) ≈ 11.93 nats (17.21 is the same baseline in bits, log2(151643)). 19.80 exceeds the baseline in either unit, so the model is *anti-aligned* with held-out tokens.
2. Why anti-aligned despite Qwen init being loaded? Some structural component of the init pipeline is broken at a layer that PR #1579 doesn't cover.
3. Four hypotheses for H4:
   A. Tied weights: `tie_word_embeddings: true` on Qwen 0.5B; if populate writes embed_tokens but doesn't propagate to lm_head (or writes them separately to random buffers), forward predictions are random while embeddings are correct.
   B. Layout mismatch: GGUF/APR are row-major (tensor-layout-v1); if init APR's lm_head is column-major, matmul produces wrong logits.
   C. Norm scale: RMSNorm weights loaded but an rms_norm_eps mismatch cascades through forward.
   D. Residual stream: some block's residual contributes zero from an uninitialized buffer.
4. Why ship the diagnosis but not the H4 fix? Each hypothesis is its own falsifier-discharge cascade per `feedback_falsifier_first_cascade_pattern.md`. Multi-PR scope.
5. Why does this matter for ship %? FALSIFY-005 status flips from NUMERICALLY-PASSED-METHODOLOGY-SUSPECT (pre-§61, fake pass on broken corpus) to RED-WITH-METHODOLOGICALLY-HONEST (post-§61, real defect on real corpus). The honest RED is itself progress: the contract now reports the binding defect.

SHIP-TWO impact

- MODEL-1 ship %: unchanged at 91% (this is MODEL-2 work)
- MODEL-2 ship %: unchanged at 57%; diagnosis correct, H4 cascade is the gate
- §60 H1C (data-bug) cascade: FULLY CLOSED. Encoder works end-to-end on real Qwen vocab + real Python corpus.

Closes PMAT-CODE-PRETRAIN-FINETUNE-LIVE-003 (task #21). Tracking PMAT-CODE-PRETRAIN-INIT-LOAD-003 (H4 cascade) as the next ship-mover.

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
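The "worse than uniform" check in this commit message can be sketched in a few lines. This is a minimal illustration, assuming val_loss is a natural-log cross-entropy (common for training loops, but not stated in the source); the function names are hypothetical and are not taken from the aprender codebase. Because the baseline can be quoted in nats or in bits, the sketch computes both: for a vocab of 151643 tokens, ln gives about 11.93 nats and log2 gives about 17.21 bits.

```rust
/// Cross-entropy, in nats, of a uniform distribution over `vocab_size` tokens.
/// (Hypothetical helper for illustration only.)
fn uniform_loss_nats(vocab_size: usize) -> f64 {
    (vocab_size as f64).ln()
}

/// Convert a quantity in nats to bits.
fn nats_to_bits(nats: f64) -> f64 {
    nats / std::f64::consts::LN_2
}

/// True when a loss (ASSUMED to be in nats) is worse than uniform guessing.
fn worse_than_uniform(val_loss_nats: f64, vocab_size: usize) -> bool {
    val_loss_nats > uniform_loss_nats(vocab_size)
}

fn main() {
    let vocab = 151_643; // vocab size quoted in the Five-Whys above
    let nats = uniform_loss_nats(vocab);
    println!("uniform baseline: {:.2} nats = {:.2} bits", nats, nats_to_bits(nats));
    for loss in [19.80, 18.55, 11.55] {
        println!("val_loss {:.2}: worse than uniform = {}", loss, worse_than_uniform(loss, vocab));
    }
}
```

Keeping the unit explicit in the helper names makes it harder to compare a loss in nats against a baseline in bits by accident.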
noahgift added a commit that referenced this pull request on May 13, 2026
…ETRAIN-INIT-LOAD-003) (#1601)

Bisects the §61 val_loss > ln(vocab) anomaly. Empirical findings on lambda-vector RTX 4090:

H4 ROOT CAUSE #1: BF16 dtype mislabel
======================================

The OLD `qwen2.5-coder-0.5b-instruct-fp16.apr` (May-4 import) tags its tensors with dtype=F16 in the APR v2 header, but the SOURCE HF safetensors `model.safetensors` uses dtype=BF16. When the loader sees dtype=F16, it dequantizes via `f16_to_f32`, producing values that diverge from the BF16-correct decode.

Element-0 cross-check on `model.norm.weight`:

  Safetensors source (BF16-decoded): 7.5625, 8.0, 7.21875, ...
  Old APR (loaded as F16):           7.0625, 7.125, 7.0, ...
  Fresh APR (loaded as BF16):        7.5625, 8.0, 7.21875, ...  ← matches source

Element-0 cross-check on `model.layers.0.self_attn.q_proj.bias`:

  Safetensors (BF16): 0.0674, -0.0859, 0.1104, -0.0605, ...
  Old APR (F16):      (different, distorted)
  Fresh APR (BF16):   0.0674, -0.0859, 0.1104, -0.0605, ...  ← matches source

Fix: re-import Qwen safetensors via current `apr import`. The current `StreamingWriter::add_raw_f16_tensor` correctly preserves BF16 (lines 100-104 of streaming_writer.rs). The old APR was created with a buggy import path that mis-tagged BF16 as F16.

H4 ROOT CAUSE #2: STILL OPEN
=============================

Even with correct BF16-decoded weights (fresh APR), val_loss at step 1 is **18.55**, still above the uniform-over-vocab baseline (ln(vocab) ≈ 11.93 nats; log2(vocab) = 17.21 bits). The dtype fix moved the dial slightly (was 19.80) but did not resolve the sub-random predictions issue.

Remaining hypotheses for the residual gap:

- H4B (layout): some tensor's row/col-major orientation may differ between Qwen export and aprender::Transformer expectations
- H4D (forward path): cuBLAS / CudaBlock forward may produce wrong logits despite correct weight values
- Other: tied embedding fall-through path (`lm_head: None` → embed_tokens reuse) may have a sign or scale issue

This PR ships the diagnostic infrastructure that PROVED root cause #1 and provides the foundation for bisecting the remaining gap.

What this PR ships
===================

`falsify_h4_init_stats_qwen_embed_norm_sensible`, a host-gated diagnostic test that:

- Loads the Qwen 0.5B init APR (prefers fresh, falls back to legacy)
- Reports tensor stats (mean, std, min, max) for embed_tokens, final norm, per-layer norms, q/k/v projections, mlp gates
- Asserts sensible bounds (embed std ∈ [0.005, 0.5], norm in [0.01, 100], etc.)
- Dumps element-0 values for cross-comparison with safetensors source

Industrial validation example output:

  embed_tokens.weight: mean=0.00014, std=0.0152, range[-0.196, 0.128] (sensible HF LLaMA init scale)
  model.norm.weight:   mean=7.46, std=0.84, range[-2.28, 17.38] (Qwen-typical, final norm scaled up)
  q_proj.bias L0:      mean=0.03, std=7.88, range[-65.5, 128] (Qwen-typical, large attention biases)

Five-Whys
==========

1. Why was the OLD Qwen APR tagged F16? It was created by a buggy import path that didn't pass the `is_bf16` flag through to the writer. Fixed in current apr-cli, but the artifact is preserved on disk.
2. Why does the fresh APR not fully fix val_loss? The dtype fix makes loaded values match safetensors, but val_loss=18.55 still exceeds the uniform baseline, meaning the forward path or some other tensor is still producing sub-random predictions.
3. Why didn't existing falsifiers catch the dtype mislabel? No falsifier asserted "loaded values match safetensors source element-by-element". The PMAT-187 NaN/Inf/explosive-mean check passes because BF16-as-F16 distortion produces values that are neither NaN nor unusually large.
4. Why ship the diagnostic before the full H4 fix? The diagnostic itself proves H4 root cause #1 and provides the bisection foundation for #2. Per `feedback_falsifier_first_cascade_pattern.md`, 1 PR ≈ 1 falsifier discharge. The dtype-mislabel discharge is real progress.
5. Why does the operator need to know? They have an old Qwen APR on disk that mis-decodes silently. With this PR's diagnostic they can verify before training; without it, the silent error wastes ~17 hours of GPU time per cycle (per §60 evidence).

Quality gates (all green)
==========================

- cargo test -p aprender-train --lib falsify_h4_init_stats: PASS
- cargo test -p aprender-train --lib: 7585+ tests PASS
- cargo clippy -p aprender-train --lib -- -D warnings: clean (--tests has 4 PRE-EXISTING errors on main; not introduced by this PR)
- rustfmt --check: clean

SHIP-TWO impact
================

- MODEL-1 ship %: unchanged at 91%
- MODEL-2 ship %: unchanged at 57%. H4 root cause #1 found and fix available (use fresh APR), but val_loss is still above the uniform baseline. The next-cycle bisection (H4B or H4D) is now well-targeted.
- §60 H1C cascade: FULLY CLOSED per #1598
- §61 evidence: 5g.1-v2 corpus is 7.42 bits entropy / 0% unk
- This PR closes part of PMAT-CODE-PRETRAIN-INIT-LOAD-003 (task #22)

Out-of-scope follow-ups
========================

PMAT-CODE-PRETRAIN-INIT-LOAD-004 (H4 residual cascade):

- Bisect H4B (layout): forward-pass element-wise compare against HF Qwen2 reference at each layer
- Bisect H4D (forward path): instrument cuBLAS GEMM outputs against a CPU reference matmul
- Fix root cause; flip MODEL-2 ship % 57% → ≥58%

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
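To illustrate why a dtype tag matters, the sketch below decodes the same 16-bit pattern once as BF16 and once as IEEE 754 half precision (F16). It is a standalone illustration with hypothetical function names, not the loader's `f16_to_f32` and not the import path described above, and it does not attempt to reproduce the exact distorted values in the cross-check, since the old import path's behaviour is not shown here. It only demonstrates that the two formats split their 16 bits differently (8 exponent + 7 mantissa bits for BF16, 5 + 10 for F16), so a mislabel yields finite but wrong values.

```rust
/// Decode a BF16 bit pattern: BF16 is the top 16 bits of an f32.
/// (Hypothetical helper for illustration only.)
fn bf16_bits_to_f32(bits: u16) -> f32 {
    f32::from_bits((bits as u32) << 16)
}

/// Decode an IEEE 754 binary16 (F16) bit pattern:
/// 1 sign bit, 5 exponent bits (bias 15), 10 mantissa bits.
fn f16_bits_to_f32(bits: u16) -> f32 {
    let sign = if bits & 0x8000 != 0 { -1.0f32 } else { 1.0f32 };
    let exp = ((bits >> 10) & 0x1f) as i32;
    let frac = (bits & 0x03ff) as f32;
    match exp {
        // Subnormal: no implicit leading 1, fixed exponent of -14.
        0 => sign * frac * 2f32.powi(-24),
        // All-ones exponent encodes infinity or NaN.
        31 => {
            if frac == 0.0 {
                sign * f32::INFINITY
            } else {
                f32::NAN
            }
        }
        // Normal number: implicit leading 1.
        _ => sign * (1.0 + frac / 1024.0) * 2f32.powi(exp - 15),
    }
}

fn main() {
    // 0x40F2 is the BF16 encoding of 7.5625, the first norm weight quoted above.
    let bits: u16 = 0x40F2;
    println!("as BF16: {}", bf16_bits_to_f32(bits));
    println!("as F16:  {}", f16_bits_to_f32(bits));
}
```

Both decodes are finite and of ordinary magnitude, which is consistent with the observation above that a NaN/Inf/explosive-mean check would not flag the mislabel; an element-by-element comparison against the source is what catches it.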
Summary
ROOT CAUSE pinned + fixed. Qwen-format vocab encoding now works correctly: 0% unk, 6.582 bits entropy on Python source (was 99% unk / 0.111 bits before).
The bug
PR #1596 shipped a "try hex first, fall through on FALSIFY-001" strategy that depended on PR #1585's load-time fail-fast. With #1585 still OPEN, the hex loader silently succeeded on Qwen-format vocabs and produced 99% `<unk>` (entropy 0.111 bits / 17.21 max).

The encoder itself was not the bug. Two new falsifier tests confirm `aprender::text::bpe::BpeTokenizer` works correctly:

- `falsify_bpe_qwen_encode_python_does_not_unk_99pct`: load_from_json on real Qwen2 tokenizer.json: 0% unk on Python ✓
- `falsify_bpe_load_from_files_matches_load_from_json_encode`: both load paths produce identical IDs ✓

The fix
Replace the dependency-on-#1585 dispatch with upfront format detection. Count canonical hex-byte tokens "00".."ff" in vocab.json directly:

- ≥ 200 (legitimate hex vocabs always have all 256) → Hex path
- < 200 (HF GPT-2 byte-level vocabs have ~36) → ByteLevel path

Detection runs against vocab.json content, independent of any loader's behavior. Works whether or not PR #1585 has merged.
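A minimal sketch of this detection rule, assuming the vocab's token strings have already been read out of vocab.json (JSON parsing is left out to keep the example dependency-free). The enum, constant, and function names are hypothetical and are not the ones in `tokenize.rs`; the lowercase-only match is also an assumption based on the "00".."ff" range quoted above.

```rust
use std::collections::HashSet;

/// Which encoder path a vocab is routed to (hypothetical enum for this sketch).
#[derive(Debug, PartialEq, Eq)]
enum VocabFormat {
    Hex,
    ByteLevel,
}

/// Threshold from the PR description: at least 200 hex-byte tokens means Hex.
const HEX_TOKEN_THRESHOLD: usize = 200;

/// True for exactly two lowercase hex digits, i.e. "00".."ff".
/// ASSUMPTION: uppercase forms such as "FF" are not counted.
fn is_hex_byte_token(tok: &str) -> bool {
    tok.len() == 2 && tok.bytes().all(|b| matches!(b, b'0'..=b'9' | b'a'..=b'f'))
}

/// Count the distinct hex-byte tokens present among the vocab's keys.
fn count_hex_byte_tokens<'a>(tokens: impl IntoIterator<Item = &'a str>) -> usize {
    tokens
        .into_iter()
        .filter(|t| is_hex_byte_token(t))
        .collect::<HashSet<&str>>()
        .len()
}

/// Decide the format from vocab content alone, before any loader runs.
fn detect_vocab_format<'a>(tokens: impl IntoIterator<Item = &'a str>) -> VocabFormat {
    if count_hex_byte_tokens(tokens) >= HEX_TOKEN_THRESHOLD {
        VocabFormat::Hex
    } else {
        VocabFormat::ByteLevel
    }
}

fn main() {
    // A synthetic hex vocab containing all 256 byte tokens plus one merged token.
    let mut hex_vocab: Vec<String> = (0..=255u8).map(|b| format!("{:02x}", b)).collect();
    hex_vocab.push("6465".to_string());
    println!("{:?}", detect_vocab_format(hex_vocab.iter().map(String::as_str)));

    // A synthetic byte-level vocab where only a few keys happen to look like hex bytes.
    let byte_level = ["def", "Ġreturn", "Ċ", "00", "ff"];
    println!("{:?}", detect_vocab_format(byte_level));
}
```

Because the decision depends only on the vocab keys, it gives the same answer regardless of which loader would have been tried first.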
LIVE evidence (lambda-vector RTX 4090)
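100-doc Python smoke from `python-permissive.jsonl`:

| Vocab format | BEFORE this PR | AFTER this PR |
|---|---|---|
| Hex (model-2-tokenizer-v1) | 12.009 bits, 13K distinct | 12.009 bits, 13K distinct (regression-free) |
| GPT-2 byte-level (Qwen) | 0.111 bits, 16 distinct, 99.02% unk | 6.582 bits, 6118 distinct, 0.00% unk |

The Qwen path now correctly produces real Python tokenization. Path forward unblocked: re-tokenize 5g.1 corpus → re-dispatch 5g.2 → honest val_loss → flip MODEL-2 ship % 57% → ≥58%.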
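The two health metrics quoted in this evidence (entropy in bits and unk rate) can be computed from a stream of token IDs roughly as follows. This is a sketch under stated assumptions: the function names are hypothetical, the unk ID is passed in by the caller, and the smoke test's actual measurement code is not shown in this PR. The "17.21 max" figure corresponds to log2 of a 151643-token vocab, the size quoted elsewhere in this thread.

```rust
use std::collections::HashMap;

/// Shannon entropy, in bits, of the empirical distribution of token IDs.
/// (Hypothetical helper for illustration only.)
fn token_entropy_bits(ids: &[u32]) -> f64 {
    if ids.is_empty() {
        return 0.0;
    }
    let mut counts: HashMap<u32, usize> = HashMap::new();
    for &id in ids {
        *counts.entry(id).or_insert(0) += 1;
    }
    let n = ids.len() as f64;
    counts
        .values()
        .map(|&c| {
            let p = c as f64 / n;
            -p * p.log2()
        })
        .sum()
}

/// Fraction of tokens equal to the unknown-token ID.
fn unk_rate(ids: &[u32], unk_id: u32) -> f64 {
    if ids.is_empty() {
        return 0.0;
    }
    ids.iter().filter(|&&id| id == unk_id).count() as f64 / ids.len() as f64
}

/// Upper bound on entropy for a vocab of the given size (uniform usage).
fn max_entropy_bits(vocab_size: usize) -> f64 {
    (vocab_size as f64).log2()
}

fn main() {
    // Toy stream: ID 0 stands in for <unk> (an assumption for this example).
    let ids = [0u32, 0, 0, 5, 0, 0, 0, 9];
    println!("entropy  = {:.3} bits", token_entropy_bits(&ids));
    println!("unk rate = {:.2}%", 100.0 * unk_rate(&ids, 0));
    println!("max      = {:.2} bits", max_entropy_bits(151_643));
}
```

A stream dominated by one ID drives the entropy toward zero, which is the signature reported for the broken Qwen path before this PR.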
Five-Whys
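1. Why was PR #1596's dispatch broken? It assumed PR #1585's fail-fast was on main, but #1585 was still OPEN. The hex loader silently accepted the Qwen vocab, produced 99% unk, and the byte-level fallback never fired.
2. Why detect upfront instead of fixing the dependency chain? PR #1585's fail-fast is a load-time signal; this PR's detection is the same logic moved one level up, so the dispatch works regardless of which loader runs first.
3. Why count hex-byte tokens specifically? The presence of all 256 "00".."ff" hex strings is the canonical signature of the output of `apr tokenize train`. Any vocab without them is byte-level format.
4. Why prefer tokenizer.json when present? It's the canonical HF format with `added_tokens` registered. `load_from_files` also works (verified by upstream-002 test) but tokenizer.json is higher-fidelity input.
5. Why ship the falsifier tests alongside? They confirm the encoder works correctly when invoked properly, and they fail fast if a future refactor breaks the byte-level path or the load functions diverge.

Test plan

- `cargo test -p aprender-core --lib falsify_bpe`: 2/2 PASS
- `cargo test -p apr-cli --features training --lib`: 5644/5644 PASS
- `cargo clippy -p apr-cli --features training --lib -- -D warnings`: clean
- `cargo check --workspace`: clean
- `rustfmt --check`: clean

SHIP-TWO impact

- MODEL-1 ship %: unchanged at 91%
- MODEL-2 ship %: unchanged at 57%, but the path forward is now technically unblocked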
Closes
PMAT-CODE-TOKENIZE-BPE-UPSTREAM-001 (task #20).
Next ship-mover (operator-dispatchable)
PMAT-CODE-PRETRAIN-FINETUNE-LIVE-003: re-encode 5g.1 corpus with this fix (~17 hr wall on RTX 4090), then re-dispatch 5g.2 LIVE.
Files
- `crates/apr-cli/src/commands/tokenize.rs` (+159 / -64, upfront format detection)
- `crates/aprender-core/src/text/bpe/tests_encode_decode.rs` (+128, two new falsifier tests)

🤖 Generated with Claude Code