fix(apr-cli): upfront vocab-format detection unblocks Qwen encoding (PMAT-CODE-TOKENIZE-BPE-UPSTREAM-001) by noahgift · Pull Request #1598 · paiml/aprender

noahgift · 2026-05-09T20:59:25Z

Summary

ROOT CAUSE pinned + fixed. Qwen-format vocab encoding now works correctly: 0% unk, 6.582 bits entropy on Python source (was 99% unk / 0.111 bits before).

The bug

PR #1596 shipped a "try hex first, fall through on FALSIFY-001" strategy that depended on PR #1585's load-time fail-fast. With #1585 still OPEN, the hex loader silently succeeded on Qwen-format vocabs and produced 99% <unk> (entropy 0.111 bits / 17.21 max).

The encoder itself was not the bug. Two new falsifier tests confirm aprender::text::bpe::BpeTokenizer works correctly:

falsify_bpe_qwen_encode_python_does_not_unk_99pct — load_from_json on real Qwen2 tokenizer.json: 0% unk on Python ✓
falsify_bpe_load_from_files_matches_load_from_json_encode — both load paths produce identical IDs ✓
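
The two falsifiers amount to a differential check: encode the same text through both load paths, require identical IDs, and require a low `<unk>` rate. A minimal sketch of that shape follows. It does not use the real `BpeTokenizer` API; `check_encoders`, `Falsified`, and the closure-based encoders are hypothetical names introduced only for illustration.

```rust
/// Ways the differential check can fail (hypothetical type, for illustration).
#[derive(Debug, PartialEq)]
pub enum Falsified {
    /// The two encoders disagree; index of the first differing position.
    PathsDiverge { first_mismatch: usize },
    /// The encoders agree but too many IDs are the unknown token.
    TooManyUnk { unk: usize, total: usize },
}

/// Encodes `text` with both encoders and returns the shared IDs, or the
/// reason the check failed. `unk_id` and `max_unk_fraction` are caller-chosen.
pub fn check_encoders<A, B>(
    text: &str,
    encode_a: A,
    encode_b: B,
    unk_id: u32,
    max_unk_fraction: f64,
) -> Result<Vec<u32>, Falsified>
where
    A: Fn(&str) -> Vec<u32>,
    B: Fn(&str) -> Vec<u32>,
{
    let a = encode_a(text);
    let b = encode_b(text);
    if a != b {
        // If one sequence is a prefix of the other, report the shorter length.
        let first_mismatch = a
            .iter()
            .zip(b.iter())
            .position(|(x, y)| x != y)
            .unwrap_or_else(|| a.len().min(b.len()));
        return Err(Falsified::PathsDiverge { first_mismatch });
    }
    let unk = a.iter().filter(|&&id| id == unk_id).count();
    if !a.is_empty() && (unk as f64) / (a.len() as f64) > max_unk_fraction {
        return Err(Falsified::TooManyUnk { unk, total: a.len() });
    }
    Ok(a)
}
```

In a real test the two closures would wrap tokenizers built by `load_from_json` and `load_from_files`, and the test would return early when the Qwen tokenizer.json fixture is absent.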

The fix

Replace the dependency-on-#1585 dispatch with upfront format detection. Count canonical hex-byte tokens "00".."ff" in vocab.json directly:

≥ 200 (legitimate hex vocabs always have all 256) → Hex path
< 200 (HF GPT-2 byte-level vocabs have ~36) → ByteLevel path

Detection runs against vocab.json content, independent of any loader's behavior. Works whether or not PR #1585 has merged.
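
The detection rule can be sketched in a few lines of Rust. This is an illustration of the counting logic only, not the code in `tokenize.rs`: the names `VocabFormat`, `count_hex_byte_tokens`, and `detect_vocab_format` are assumptions, the vocab keys are passed in as an iterator rather than parsed from vocab.json, and "canonical" is assumed to mean exactly two lowercase hex digits.

```rust
/// Which encoder path a vocabulary should be routed to (hypothetical name).
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
pub enum VocabFormat {
    Hex,
    ByteLevel,
}

/// Threshold from the PR description: at least 200 hex-byte tokens means Hex.
pub const HEX_TOKEN_THRESHOLD: usize = 200;

/// Returns the byte value if `token` is exactly two lowercase hex digits.
/// Assumption: uppercase forms such as "FF" are not counted as canonical.
fn canonical_hex_byte(token: &str) -> Option<u8> {
    let bytes = token.as_bytes();
    let is_hex = |c: &u8| matches!(c, b'0'..=b'9' | b'a'..=b'f');
    if bytes.len() != 2 || !bytes.iter().all(is_hex) {
        return None;
    }
    u8::from_str_radix(token, 16).ok()
}

/// Counts distinct "00".."ff" tokens among the vocabulary keys.
pub fn count_hex_byte_tokens<'a, I>(tokens: I) -> usize
where
    I: IntoIterator<Item = &'a str>,
{
    let mut seen = [false; 256];
    for token in tokens {
        if let Some(byte) = canonical_hex_byte(token) {
            seen[byte as usize] = true;
        }
    }
    seen.iter().filter(|&&s| s).count()
}

/// Picks the encoder path from the vocabulary keys alone, before any loader runs.
pub fn detect_vocab_format<'a, I>(tokens: I) -> VocabFormat
where
    I: IntoIterator<Item = &'a str>,
{
    if count_hex_byte_tokens(tokens) >= HEX_TOKEN_THRESHOLD {
        VocabFormat::Hex
    } else {
        VocabFormat::ByteLevel
    }
}
```

Because the decision depends only on the key set, it gives the same answer regardless of which loader would have been tried first.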

LIVE evidence (lambda-vector RTX 4090)

100-doc Python smoke from python-permissive.jsonl:

| Vocab format | BEFORE this PR | AFTER this PR |
|---|---|---|
| Hex (model-2-tokenizer-v1) | 12.009 bits, 13K distinct | 12.009 bits, 13K distinct ✓ regression-free |
| GPT-2 byte-level (Qwen) | 0.111 bits, 16 distinct, 99.02% unk | 6.582 bits, 6118 distinct, 0.00% unk |

The Qwen path now correctly produces real Python tokenization. Path forward unblocked: re-tokenize 5g.1 corpus → re-dispatch 5g.2 → honest val_loss → flip MODEL-2 ship % 57% → ≥58%.
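
The three figures reported per vocab format (entropy in bits, distinct token count, unk percentage) can be computed from an encoded ID stream as sketched below. This assumes the entropy is the unigram Shannon entropy of token IDs, which fits the quoted maximum of 17.21 bits for the Qwen vocab; `TokenStats`, `token_stats`, and the `unk_id` parameter are illustrative names, not the tool's actual interface.

```rust
use std::collections::HashMap;

/// Summary statistics over an encoded token stream (hypothetical type).
#[derive(Debug)]
pub struct TokenStats {
    /// Unigram Shannon entropy of the ID distribution, in bits.
    pub entropy_bits: f64,
    /// Number of distinct token IDs observed.
    pub distinct: usize,
    /// Fraction of IDs equal to the unknown-token ID, in [0, 1].
    pub unk_fraction: f64,
}

/// Computes entropy, distinct count, and unk fraction for `ids`.
pub fn token_stats(ids: &[u32], unk_id: u32) -> TokenStats {
    if ids.is_empty() {
        return TokenStats { entropy_bits: 0.0, distinct: 0, unk_fraction: 0.0 };
    }
    let mut counts: HashMap<u32, usize> = HashMap::new();
    for &id in ids {
        *counts.entry(id).or_insert(0) += 1;
    }
    let n = ids.len() as f64;
    // H = -sum(p * log2(p)) over the observed IDs.
    let entropy_bits = counts
        .values()
        .map(|&c| {
            let p = c as f64 / n;
            -p * p.log2()
        })
        .sum::<f64>();
    let unk = counts.get(&unk_id).copied().unwrap_or(0);
    TokenStats {
        entropy_bits,
        distinct: counts.len(),
        unk_fraction: unk as f64 / n,
    }
}
```

A stream dominated by a single ID, as in the 99.02% unk case, drives the entropy toward zero, which is why 0.111 bits was a clear signal of a broken encode.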

Five-Whys

Why was PR #1596 (feat(apr-cli): two-format tokenizer dispatch in encode-corpus, PMAT-CODE-TOKENIZE-BPE-FORMAT-001) broken? It assumed the fail-fast from PR #1585 (fix(tokenizer): fail-fast on GPT-2 byte-level vocab format mismatch, PMAT-CODE-TOKENIZE-BPE-FORMAT-001) was on main; #1585 is still OPEN.
Why detect upfront? PR #1585's fail-fast is a load-time signal; this PR's detection is the same logic moved one level up. Cleaner DAG, no inter-PR dependency.
Why count hex-byte tokens specifically? All 256 "00".."ff" hex strings is the canonical signature of apr tokenize train. Any vocab without them is byte-level format.
Why prefer tokenizer.json when present? Canonical HF format with added_tokens registered. load_from_files also works (verified by upstream-002 test) but tokenizer.json is higher-fidelity input.
Why ship the falsifier tests? They CONFIRM the encoder works correctly. If a future refactor breaks the byte-level path or the load functions diverge, tests fail-fast. Drift prevention.

Test plan

cargo test -p aprender-core --lib falsify_bpe: 2/2 PASS
cargo test -p apr-cli --features training --lib: 5644/5644 PASS
cargo clippy -p apr-cli --features training --lib -- -D warnings: clean
cargo check --workspace: clean
rustfmt --check: clean
LIVE: hex format 12.009 bits (regression-free)
LIVE: byte-level format 6.582 bits, 0% unk (was 0.111 / 99% unk)

SHIP-TWO impact

MODEL-1 ship %: unchanged at 91%
MODEL-2 ship %: unchanged at 57%, but the path forward is now TECHNICALLY UNBLOCKED. Re-tokenizing the 5g.1 corpus with this fix and re-dispatching 5g.2 produces an HONEST val_loss verdict.
§50.4 cascade: COMPLETE per #1577 (feat: §50.4 step 5f.5 CUDA --init wireup, PMAT-CODE-PRETRAIN-INIT-CUDA-WIREUP-001)
5g.2 dispatch: OPERATOR-RUNNABLE end-to-end with WORKING encoder

Closes

PMAT-CODE-TOKENIZE-BPE-UPSTREAM-001 (task #20).

Next ship-mover (operator-dispatchable)

PMAT-CODE-PRETRAIN-FINETUNE-LIVE-003: re-encode 5g.1 corpus with this fix (~17 hr wall on RTX 4090), then re-dispatch 5g.2 LIVE.

Files

crates/apr-cli/src/commands/tokenize.rs (+159 / -64, upfront format detection)
crates/aprender-core/src/text/bpe/tests_encode_decode.rs (+128, two new falsifier tests)

🤖 Generated with Claude Code

…PMAT-CODE-TOKENIZE-BPE-UPSTREAM-001)

ROOT CAUSE pinned + fixed. PR #1596 shipped a "try hex first, fall through on FALSIFY-001" strategy that depended on PR #1585's load-time fail-fast. With #1585 not yet merged, the hex loader silently succeeded on Qwen-format vocabs and produced 99% `<unk>` (entropy 0.111 bits / 17.21 max).

The encoder itself was not the bug. Two new falsifier tests confirm `aprender::text::bpe::BpeTokenizer` works correctly:

- falsify_bpe_qwen_encode_python_does_not_unk_99pct: load_from_json on real Qwen2 tokenizer.json + encode Python: 0% unk, 43 tokens, 0/43 = 0% (was the predicted 99% RED)
- falsify_bpe_load_from_files_matches_load_from_json_encode: load_from_files vs load_from_json on same vocab: identical IDs `[750, 75698, 1445, 1648, 198, 220, 220, 220, 470, 308, 198]`, 0/11 unk in both paths

Both tests are host-gated on Qwen tokenizer.json presence (skip if missing).

THE FIX

Replace the dependency-on-#1585 dispatch with UPFRONT FORMAT DETECTION. Count canonical hex-byte tokens "00".."ff" in vocab.json directly.

- ≥ 200 (legitimate hex vocabs always have all 256) → Hex path
- < 200 (HF GPT-2 byte-level vocabs have ~36) → ByteLevel path

Detection runs against vocab.json content, independent of any loader's behavior. Works whether or not PR #1585 has merged.

LIVE EVIDENCE on lambda-vector RTX 4090

100-doc Python smoke from /mnt/.../python-permissive.jsonl:

| Vocab format | BEFORE this PR | AFTER this PR |
|---|---|---|
| Hex (model-2-tokenizer-v1) | 12.009 bits, 13K distinct | 12.009 bits, 13K distinct (regression-free) |
| GPT-2 byte-level (Qwen) | 0.111 bits, 16 distinct, 99.02% unk | 6.582 bits, 6118 distinct, 0.00% unk |

The Qwen path now correctly produces real Python tokenization. This unblocks the canonical path forward for SHIP-TWO §60: re-tokenize the 5g.1 corpus → re-dispatch 5g.2 → honest val_loss → flip MODEL-2 ship % 57% → ≥58%.

Five-Whys

1. Why was PR #1596's dispatch broken? It assumed PR #1585's fail-fast was on main, but #1585 was still OPEN. Hex loader silently accepted Qwen vocab → produced 99% unk → byte-level fallback never fired.
2. Why detect upfront instead of fixing the dependency chain? PR #1585's fail-fast is a load-time signal; this PR's detection is the same logic moved one level up. Now the dispatch works regardless of which path's loader runs first. Cleaner DAG.
3. Why count hex-byte tokens specifically? The presence of all 256 "00".."ff" hex strings is the canonical signature of `apr tokenize train`'s output. Any vocab without them is either GPT-2 byte-level or some other format → byte-level encoder is the correct choice (or refuse if even that fails).
4. Why prefer tokenizer.json when present? It's the canonical HF format with `added_tokens` registered. `load_from_files` on vocab.json+merges.txt also works (verified by upstream-002 test) but tokenizer.json is the higher-fidelity input.
5. Why ship the falsifier tests alongside? They CONFIRM the encoder works correctly when invoked properly. If a future refactor breaks the byte-level path (or the load functions diverge), the tests fail-fast. Drift prevention.

Quality gates (all green)

- cargo test -p aprender-core --lib falsify_bpe: 2 tests PASS
- cargo test -p apr-cli --features training --lib: 5644/5644 PASS
- cargo clippy -p apr-cli --features training --lib -- -D warnings: clean
- cargo check --workspace: clean
- rustfmt --check: clean
- LIVE: hex format 12.009 bits (regression-free)
- LIVE: byte-level format 6.582 bits, 0% unk (was 0.111 / 99% unk)

SHIP-TWO impact

- MODEL-1 ship %: unchanged at 91%
- MODEL-2 ship %: unchanged at 57%, but the path forward is NOW TECHNICALLY UNBLOCKED. Re-tokenizing the 5g.1 corpus with this fix and re-dispatching 5g.2 produces an HONEST val_loss verdict.
- §50.4 cascade: COMPLETE per #1577
- 5g.2 dispatch: OPERATOR-RUNNABLE end-to-end with WORKING encoder
- This PR closes PMAT-CODE-TOKENIZE-BPE-UPSTREAM-001 (task #20)
- Next ship-mover: PMAT-CODE-PRETRAIN-FINETUNE-LIVE-003 (re-encode 5g.1, re-dispatch 5g.2 LIVE), operator-dispatchable now.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…ETRAIN-INIT-LOAD-003)

Bisects the §61 val_loss > ln(vocab) anomaly. Empirical findings on lambda-vector RTX 4090:

H4 ROOT CAUSE #1: BF16 dtype mislabel
======================================

The OLD `qwen2.5-coder-0.5b-instruct-fp16.apr` (May-4 import) tags its tensors with dtype=F16 in the APR v2 header, but the SOURCE HF safetensors `model.safetensors` uses dtype=BF16. When the loader sees dtype=F16, it dequantizes via `f16_to_f32`, producing values that diverge from the BF16-correct decode.

Element-0 cross-check on `model.norm.weight`:

  Safetensors source (BF16-decoded): 7.5625, 8.0, 7.21875, ...
  Old APR (loaded as F16): 7.0625, 7.125, 7.0, ...
  Fresh APR (loaded as BF16): 7.5625, 8.0, 7.21875, ... ← matches source

Element-0 cross-check on `model.layers.0.self_attn.q_proj.bias`:

  Safetensors (BF16): 0.0674, -0.0859, 0.1104, -0.0605, ...
  Old APR (F16): (different, distorted)
  Fresh APR (BF16): 0.0674, -0.0859, 0.1104, -0.0605, ... ← matches source

Fix: re-import Qwen safetensors via current `apr import`. The current `StreamingWriter::add_raw_f16_tensor` correctly preserves BF16 (line 100-104 of streaming_writer.rs). The old APR was created with a buggy import path that mis-tagged BF16 as F16.

H4 ROOT CAUSE #2: STILL OPEN
=============================

Even with correct BF16-decoded weights (fresh APR), val_loss at step 1 is **18.55**, still above the uniform-over-vocab baseline quoted as ln(vocab)=17.21. (17.21 matches log2 of the vocab size, in bits; the natural-log value is smaller, so 18.55 exceeds the baseline in either unit.) The dtype fix moved the dial slightly (was 19.80) but did not resolve the sub-random predictions issue.

Remaining hypotheses for the residual gap:

- H4B (layout): some tensor's row/col-major orientation may differ between Qwen export and aprender::Transformer expectations
- H4D (forward path): cuBLAS / CudaBlock forward may produce wrong logits despite correct weight values
- Other: tied embedding fall-through path (`lm_head: None` → embed_tokens reuse) may have a sign or scale issue

This PR ships the diagnostic infrastructure that PROVED root cause #1 and provides the foundation for bisecting the remaining gap.

What this PR ships
===================

`falsify_h4_init_stats_qwen_embed_norm_sensible`, a host-gated diagnostic test that:

- Loads the Qwen 0.5B init APR (prefers fresh, falls back to legacy)
- Reports tensor stats (mean, std, min, max) for embed_tokens, final norm, per-layer norms, q/k/v projections, mlp gates
- Asserts sensible bounds (embed std ∈ [0.005, 0.5], norm in [0.01, 100], etc.)
- Dumps element-0 values for cross-comparison with safetensors source

Industrial validation example output:

  embed_tokens.weight: mean=0.00014, std=0.0152, range[-0.196, 0.128] (sensible HF LLaMA init scale)
  model.norm.weight: mean=7.46, std=0.84, range[-2.28, 17.38] (Qwen-typical: final norm scaled up)
  q_proj.bias L0: mean=0.03, std=7.88, range[-65.5, 128] (Qwen-typical: large attention biases)

Five-Whys
==========

1. Why was the OLD Qwen APR tagged F16? Created by a buggy import path that didn't pass the `is_bf16` flag through to the writer. Fixed in current apr-cli but the artifact is preserved on disk.
2. Why does the fresh APR not fully fix val_loss? The dtype fix makes loaded values match safetensors, but val_loss=18.55 still exceeds the 17.21 uniform baseline, meaning the forward path or some other tensor is still producing sub-random predictions.
3. Why didn't existing falsifiers catch the dtype mislabel? No falsifier asserted "loaded values match safetensors source element-by-element". The PMAT-187 NaN/Inf/explosive-mean check passes because BF16-as-F16 distortion produces values that are neither NaN nor unusually large.
4. Why ship the diagnostic before the full H4 fix? The diagnostic itself proves H4 root cause #1 and provides the bisection foundation for #2. Per `feedback_falsifier_first_cascade_pattern.md`, 1 PR ≈ 1 falsifier discharge. The dtype-mislabel discharge is real progress.
5. Why does the operator need to know? They have an old Qwen APR on disk that mis-decodes silently. With this PR's diagnostic they can verify before training; without it, the silent error wastes ~17 hours of GPU time per cycle (per §60 evidence).

Quality gates (all green)
==========================

- cargo test -p aprender-train --lib falsify_h4_init_stats: PASS
- cargo test -p aprender-train --lib: 7585+ tests PASS
- cargo clippy -p aprender-train --lib -- -D warnings: clean (--tests has 4 PRE-EXISTING errors on main; not introduced by this PR)
- rustfmt --check: clean

SHIP-TWO impact
================

- MODEL-1 ship %: unchanged at 91%
- MODEL-2 ship %: unchanged at 57%. H4 root cause #1 found and fix available (use fresh APR), but val_loss is still above the uniform baseline. The next-cycle bisection (H4B or H4D) is now well-targeted.
- §60 H1C cascade: FULLY CLOSED per #1598
- §61 evidence: 5g.1-v2 corpus is 7.42 bits entropy / 0% unk
- This PR closes part of PMAT-CODE-PRETRAIN-INIT-LOAD-003 (task #22)

Out-of-scope follow-ups
========================

PMAT-CODE-PRETRAIN-INIT-LOAD-004 (H4 residual cascade):

- Bisect H4B (layout): forward-pass element-wise compare against HF Qwen2 reference at each layer
- Bisect H4D (forward path): instrument cuBLAS GEMM outputs against a CPU reference matmul
- Fix root cause; flip MODEL-2 ship % 57% → ≥58%

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
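
Why a dtype tag matters can be shown by decoding one 16-bit pattern under both formats. BF16 uses 8 exponent bits and 7 fraction bits, while F16 uses 5 and 10, so the same bits decode to different finite values that a NaN/Inf check would not flag. The sketch below is standalone: `bf16_bits_to_f32` and `f16_bits_to_f32` are illustrative helpers, not the loader's functions, and the bit pattern is an illustrative one rather than the tensor values quoted above.

```rust
/// Decodes a bfloat16 bit pattern: bf16 is the top 16 bits of an IEEE-754 f32.
pub fn bf16_bits_to_f32(bits: u16) -> f32 {
    f32::from_bits(u32::from(bits) << 16)
}

/// Decodes an IEEE-754 binary16 (f16) bit pattern:
/// 1 sign bit, 5 exponent bits (bias 15), 10 fraction bits.
pub fn f16_bits_to_f32(bits: u16) -> f32 {
    let sign = if bits & 0x8000 != 0 { -1.0f32 } else { 1.0f32 };
    let exp = i32::from((bits >> 10) & 0x1f);
    let frac = f32::from(bits & 0x03ff);
    match exp {
        // Subnormal: no implicit leading 1, fixed exponent of -14.
        0 => sign * frac * 2f32.powi(-24),
        31 if frac == 0.0 => sign * f32::INFINITY,
        31 => f32::NAN,
        _ => sign * (1.0 + frac / 1024.0) * 2f32.powi(exp - 15),
    }
}
```

For example, the pattern `0x3F80` is exactly 1.0 as BF16 but decodes to 1.875 under the F16 rules: a plausible-looking weight that is simply wrong.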

…rfaces H4 (PMAT-CODE-PRETRAIN-INIT-LOAD-003) (#1600)

Records the full discharge of PMAT-CODE-PRETRAIN-FINETUNE-LIVE-003 (task #21) and the new H4 defect surface that the honest data exposed.

Two artifacts:

1. **5g.1 re-encode SUCCESS**: `apr tokenize encode-corpus` with PR #1598's upfront vocab-format detection produced a real Python corpus from the 3.0 GB JSONL source:
   - 1,241.7 M tokens
   - 405,944 documents
   - 126 shards × 10 M tokens each
   - Shard-0 first 32K: entropy 7.42 bits / 17.21 max; 3324 distinct tokens; **0% unk** (was 99.99% unk in §60's broken corpus)

   The data-bug from §60 is fully closed.

2. **5g.2 LIVE dispatch surfaces H4**: Re-running fine-tune from Qwen 0.5B init on the now-real corpus aborted at GATE-TRAIN-005:
   - 500-step run: val_loss = 11.55 at epoch 0 (> 10.0 threshold)
   - 1-step diagnostic: val_loss = 19.80 (> 17.21, the quoted uniform-over-vocab baseline)

   A val_loss above the uniform baseline means the model assigns LESS than uniform probability to true tokens, *worse than random init*. The Qwen init weights load (PR #1579's populate-coverage fix is in main) but produce sub-random predictions.

Five-Whys

1. Why was val_loss = 19.80 at step 1? Industry baseline for Qwen 0.5B zero-shot on Python is ~1.5–3.0; uniform random over vocab is log2(151643) = 17.21 bits (ln(151643) ≈ 11.93 nats). 19.80 exceeds both, which means the model is *anti-aligned* with held-out tokens.
2. Why anti-aligned despite Qwen init being loaded? Some structural component of the init pipeline is broken at a layer that PR #1579 doesn't cover.
3. Four hypotheses for H4:
   A. Tied weights: `tie_word_embeddings: true` on Qwen 0.5B; if populate writes embed_tokens but doesn't propagate to lm_head (or writes them separately to random buffers), forward predictions are random while embeddings are correct.
   B. Layout mismatch: GGUF/APR are row-major (tensor-layout-v1); if init APR's lm_head is column-major, matmul produces wrong logits.
   C. Norm scale: RMSNorm weights loaded but rms_norm_eps mismatch cascades through forward.
   D. Residual stream: some block's residual contributes zero from an uninitialized buffer.
4. Why ship the diagnosis but not the H4 fix? Each hypothesis is its own falsifier-discharge cascade per `feedback_falsifier_first_cascade_pattern.md`. Multi-PR scope.
5. Why does this matter for ship %? FALSIFY-005 status flips from NUMERICALLY-PASSED-METHODOLOGY-SUSPECT (pre-§61, fake pass on broken corpus) to RED-WITH-METHODOLOGICALLY-HONEST (post-§61, real defect on real corpus). The honest RED is itself progress: the contract now reports the binding defect.

SHIP-TWO impact

- MODEL-1 ship %: unchanged at 91% (this is MODEL-2 work)
- MODEL-2 ship %: unchanged at 57%; diagnosis correct, H4 cascade is the gate
- §60 H1C (data-bug) cascade: FULLY CLOSED. Encoder works end-to-end on real Qwen vocab + real Python corpus.

Closes PMAT-CODE-PRETRAIN-FINETUNE-LIVE-003 (task #21). Tracking PMAT-CODE-PRETRAIN-INIT-LOAD-003 (H4 cascade) as the next ship-mover.

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
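
The 17.21 figure quoted as the uniform-over-vocab baseline equals log2(151643), a value in bits; the same baseline in nats is ln(151643) ≈ 11.93. The sketch below makes the two units explicit. The vocabulary size 151643 is taken from the Five-Whys text; the function names are illustrative, and which unit the trainer's val_loss uses is not stated in the source.

```rust
/// Cross-entropy of a uniform prediction over `vocab_size` tokens, in nats.
pub fn uniform_baseline_nats(vocab_size: u32) -> f64 {
    f64::from(vocab_size).ln()
}

/// The same baseline in bits, which is also the maximum unigram entropy.
pub fn uniform_baseline_bits(vocab_size: u32) -> f64 {
    f64::from(vocab_size).log2()
}

/// Converts a loss in nats to bits.
pub fn nats_to_bits(nats: f64) -> f64 {
    nats / std::f64::consts::LN_2
}

/// True if `loss` is worse than uniform guessing under either unit convention.
pub fn exceeds_uniform_in_both_units(loss: f64, vocab_size: u32) -> bool {
    loss > uniform_baseline_nats(vocab_size) && loss > uniform_baseline_bits(vocab_size)
}
```

Both reported step-1 losses, 19.80 and 18.55, are above the baseline whichever unit applies, so the sub-random conclusion does not depend on the unit.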


noahgift enabled auto-merge (squash) May 9, 2026 20:59

Merge branch 'main' into feat/bpe-upstream-qwen-encode-fix-clean

3767f66

noahgift mentioned this pull request May 10, 2026

docs(evidence): §61 — 5g.1 re-encode SUCCESS, 5g.2 honest dispatch surfaces H4 (PMAT-CODE-PRETRAIN-INIT-LOAD-003) #1600

Merged

5 tasks

noahgift merged commit 4ef525c into main May 10, 2026
10 checks passed

noahgift deleted the feat/bpe-upstream-qwen-encode-fix-clean branch May 10, 2026 07:42

noahgift mentioned this pull request May 10, 2026

feat(aprender-train): H4 init-load diagnostic — finds BF16 dtype mislabel root cause #1 (PMAT-CODE-PRETRAIN-INIT-LOAD-003) #1601

Merged

5 tasks

noahgift merged 2 commits into main from feat/bpe-upstream-qwen-encode-fix-clean
