fix(apr-cli): upfront vocab-format detection unblocks Qwen encoding (PMAT-CODE-TOKENIZE-BPE-UPSTREAM-001) #1598

Merged: noahgift merged 2 commits into main from feat/bpe-upstream-qwen-encode-fix-clean on May 10, 2026

noahgift (Contributor) commented May 9, 2026

Summary

ROOT CAUSE pinned + fixed. Qwen-format vocab encoding now works correctly: 0% unk, 6.582 bits entropy on Python source (was 99% unk / 0.111 bits before).

The bug

PR #1596 shipped a "try hex first, fall through on FALSIFY-001" strategy that depended on PR #1585's load-time fail-fast. With #1585 still OPEN, the hex loader silently succeeded on Qwen-format vocabs and produced 99% <unk> (entropy 0.111 bits / 17.21 max).

The encoder itself was not the bug. Two new falsifier tests confirm aprender::text::bpe::BpeTokenizer works correctly:

  • falsify_bpe_qwen_encode_python_does_not_unk_99pct — load_from_json on real Qwen2 tokenizer.json: 0% unk on Python ✓
  • falsify_bpe_load_from_files_matches_load_from_json_encode — both load paths produce identical IDs ✓

The fix

Replace the dependency-on-#1585 dispatch with upfront format detection. Count canonical hex-byte tokens "00".."ff" in vocab.json directly:

  • ≥ 200 (legitimate hex vocabs always have all 256) → Hex path
  • < 200 (HF GPT-2 byte-level vocabs have ~36) → ByteLevel path

Detection runs against vocab.json content, independent of any loader's behavior. Works whether or not PR #1585 has merged.
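
The detection rule above can be sketched as follows. This is a minimal illustration, assuming the vocab has already been parsed into a map from token string to ID; the names `VocabFormat`, `count_hex_byte_tokens`, and `detect_vocab_format` are hypothetical and are not taken from `tokenize.rs`. Only the threshold of 200 and the "00".."ff" token set come from the description.

```rust
use std::collections::HashMap;

/// Hypothetical enum naming the two dispatch targets described above.
#[derive(Debug, PartialEq, Eq)]
enum VocabFormat {
    Hex,
    ByteLevel,
}

/// Threshold from the PR description: hex vocabs carry all 256 byte tokens,
/// HF GPT-2 byte-level vocabs carry roughly 36.
const HEX_TOKEN_THRESHOLD: usize = 200;

/// Counts how many of the 256 lowercase hex-byte strings "00".."ff"
/// appear as keys in the vocab.
fn count_hex_byte_tokens(vocab: &HashMap<String, u32>) -> usize {
    (0u32..256)
        .filter(|b| vocab.contains_key(&format!("{:02x}", b)))
        .count()
}

/// Chooses the encoder path from vocab content alone, without calling any loader.
fn detect_vocab_format(vocab: &HashMap<String, u32>) -> VocabFormat {
    if count_hex_byte_tokens(vocab) >= HEX_TOKEN_THRESHOLD {
        VocabFormat::Hex
    } else {
        VocabFormat::ByteLevel
    }
}
```

Because the check only reads the vocab keys, it gives the same answer regardless of which loader would run first, which is the property the fix relies on.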

LIVE evidence (lambda-vector RTX 4090)

100-doc Python smoke from python-permissive.jsonl:

| Vocab format | BEFORE this PR | AFTER this PR |
|---|---|---|
| Hex (model-2-tokenizer-v1) | 12.009 bits, 13K distinct | 12.009 bits, 13K distinct ✓ regression-free |
| GPT-2 byte-level (Qwen) | 0.111 bits, 16 distinct, 99.02% unk | 6.582 bits, 6118 distinct, 0.00% unk |

The Qwen path now correctly produces real Python tokenization. Path forward unblocked: re-tokenize 5g.1 corpus → re-dispatch 5g.2 → honest val_loss → flip MODEL-2 ship % 57% → ≥58%.
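
For readers who want to reproduce the kind of figures in the table, the sketch below shows one way to compute token-stream entropy in bits and the unk fraction. It assumes the encoded corpus is available as a slice of `u32` IDs and that the unknown-token ID is known; `entropy_bits` and `unk_fraction` are hypothetical helper names, not functions from the repository.

```rust
use std::collections::HashMap;

/// Shannon entropy (in bits) of the empirical token-ID distribution.
fn entropy_bits(ids: &[u32]) -> f64 {
    if ids.is_empty() {
        return 0.0;
    }
    // Count occurrences of each distinct token ID.
    let mut counts: HashMap<u32, usize> = HashMap::new();
    for &id in ids {
        *counts.entry(id).or_insert(0) += 1;
    }
    let n = ids.len() as f64;
    counts
        .values()
        .map(|&c| {
            let p = c as f64 / n;
            -p * p.log2()
        })
        .sum()
}

/// Fraction of IDs equal to the unknown-token ID.
fn unk_fraction(ids: &[u32], unk_id: u32) -> f64 {
    if ids.is_empty() {
        return 0.0;
    }
    ids.iter().filter(|&&id| id == unk_id).count() as f64 / ids.len() as f64
}
```

The upper bound on this entropy is log2 of the vocabulary size, which is where the "17.21 max" figure quoted earlier comes from. A stream dominated by a single ID, such as `<unk>`, drives the entropy toward zero.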

Five-Whys

  1. Why was PR #1596 (feat(apr-cli): two-format tokenizer dispatch in encode-corpus, PMAT-CODE-TOKENIZE-BPE-FORMAT-001) broken? It assumed the fail-fast from PR #1585 (fix(tokenizer): fail-fast on GPT-2 byte-level vocab format mismatch, PMAT-CODE-TOKENIZE-BPE-FORMAT-001) was on main; #1585 is still OPEN.
  2. Why detect upfront? PR #1585's fail-fast is a load-time signal; this PR's detection is the same logic moved one level up. Cleaner DAG, no inter-PR dependency.
  3. Why count hex-byte tokens specifically? The presence of all 256 "00".."ff" hex strings is the canonical signature of apr tokenize train output. Any vocab without them is treated as byte-level format.
  4. Why prefer tokenizer.json when present? Canonical HF format with added_tokens registered. load_from_files also works (verified by upstream-002 test) but tokenizer.json is higher-fidelity input.
  5. Why ship the falsifier tests? They CONFIRM the encoder works correctly. If a future refactor breaks the byte-level path or the load functions diverge, tests fail-fast. Drift prevention.

Test plan

  • cargo test -p aprender-core --lib falsify_bpe: 2/2 PASS
  • cargo test -p apr-cli --features training --lib: 5644/5644 PASS
  • cargo clippy -p apr-cli --features training --lib -- -D warnings: clean
  • cargo check --workspace: clean
  • rustfmt --check: clean
  • LIVE: hex format 12.009 bits (regression-free)
  • LIVE: byte-level format 6.582 bits, 0% unk (was 0.111 / 99% unk)

SHIP-TWO impact

Closes

PMAT-CODE-TOKENIZE-BPE-UPSTREAM-001 (task #20).

Next ship-mover (operator-dispatchable)

PMAT-CODE-PRETRAIN-FINETUNE-LIVE-003: re-encode 5g.1 corpus with this fix (~17 hr wall on RTX 4090), then re-dispatch 5g.2 LIVE.

Files

  • crates/apr-cli/src/commands/tokenize.rs (+159 / -64, upfront format detection)
  • crates/aprender-core/src/text/bpe/tests_encode_decode.rs (+128, two new falsifier tests)

🤖 Generated with Claude Code

…PMAT-CODE-TOKENIZE-BPE-UPSTREAM-001)

ROOT CAUSE pinned + fixed.

PR #1596 shipped a "try hex first, fall through on FALSIFY-001"
strategy that depended on PR #1585's load-time fail-fast. With #1585
not yet merged, the hex loader silently succeeded on Qwen-format
vocabs and produced 99% `<unk>` (entropy 0.111 bits / 17.21 max).

The encoder itself was not the bug. Two new falsifier tests confirm
`aprender::text::bpe::BpeTokenizer` works correctly:

  falsify_bpe_qwen_encode_python_does_not_unk_99pct
    — load_from_json on real Qwen2 tokenizer.json + encode Python:
      43 tokens, 0/43 unk = 0% (the predicted RED outcome was 99% unk)
  falsify_bpe_load_from_files_matches_load_from_json_encode
    — load_from_files vs load_from_json on same vocab:
      identical IDs `[750, 75698, 1445, 1648, 198, 220, 220, 220, 470, 308, 198]`,
      0/11 unk in both paths

Both tests host-gated on Qwen tokenizer.json presence (skip if missing).
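
A host-gated check of this shape might look like the sketch below. It is an assumption-laden outline: the environment-variable mechanism and the helper names `host_gated_path` and `check_load_paths_agree` are hypothetical, and the actual tests call `BpeTokenizer` loaders whose signatures are not shown in this PR text.

```rust
use std::env;
use std::path::PathBuf;

/// Returns the path named by `env_var` only if it points at an existing file.
/// `None` means "asset not on this host", and the caller should skip the test.
/// (Reading the location from an environment variable is an assumption.)
fn host_gated_path(env_var: &str) -> Option<PathBuf> {
    let path = PathBuf::from(env::var_os(env_var)?);
    if path.is_file() {
        Some(path)
    } else {
        None
    }
}

/// Compares the IDs produced by two load paths for the same input text.
fn check_load_paths_agree(from_json: &[u32], from_files: &[u32]) -> Result<(), String> {
    if from_json == from_files {
        Ok(())
    } else {
        Err(format!(
            "load paths diverge: {:?} vs {:?}",
            from_json, from_files
        ))
    }
}
```

A test body would call `host_gated_path` first and return early on `None`, so machines without the Qwen tokenizer file report a skip rather than a failure.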

THE FIX

Replace the dependency-on-#1585 dispatch with UPFRONT FORMAT DETECTION.
Count canonical hex-byte tokens "00".."ff" in vocab.json directly.
- ≥ 200 (legitimate hex vocabs always have all 256) → Hex path
- < 200 (HF GPT-2 byte-level vocabs have ~36) → ByteLevel path

Detection runs against vocab.json content, independent of any
loader's behavior. Works whether or not PR #1585 has merged.

LIVE EVIDENCE on lambda-vector RTX 4090

100-doc Python smoke from /mnt/.../python-permissive.jsonl:

| Vocab format | BEFORE this PR | AFTER this PR |
|---|---|---|
| Hex (model-2-tokenizer-v1) | 12.009 bits, 13K distinct | 12.009 bits, 13K distinct (regression-free) |
| GPT-2 byte-level (Qwen) | 0.111 bits, 16 distinct, 99.02% unk | 6.582 bits, 6118 distinct, 0.00% unk |

The Qwen path now correctly produces real Python tokenization. This
unblocks the canonical path forward for SHIP-TWO §60: re-tokenize
the 5g.1 corpus → re-dispatch 5g.2 → honest val_loss → flip
MODEL-2 ship % 57% → ≥58%.

Five-Whys

1. Why was PR #1596's dispatch broken? It assumed PR #1585's
   fail-fast was on main, but #1585 was still OPEN. Hex loader
   silently accepted Qwen vocab → produced 99% unk → byte-level
   fallback never fired.
2. Why detect upfront instead of fixing the dependency chain?
   PR #1585's fail-fast is a load-time signal; this PR's detection
   is the same logic moved one level up. Now the dispatch works
   regardless of which path's loader runs first. Cleaner DAG.
3. Why count hex-byte tokens specifically? The presence of all 256
   "00".."ff" hex strings is the canonical signature of `apr
   tokenize train`'s output. Any vocab without them is either GPT-2
   byte-level or some other format → byte-level encoder is the
   correct choice (or refuse if even that fails).
4. Why prefer tokenizer.json when present? It's the canonical HF
   format with `added_tokens` registered. `load_from_files` on
   vocab.json+merges.txt also works (verified by upstream-002 test)
   but tokenizer.json is the higher-fidelity input.
5. Why ship the falsifier tests alongside? They CONFIRM the
   encoder works correctly when invoked properly. If a future
   refactor breaks the byte-level path (or the load functions
   diverge), the tests fail-fast. Drift prevention.

Quality gates (all green)

- cargo test -p aprender-core --lib falsify_bpe: 2 tests PASS
- cargo test -p apr-cli --features training --lib: 5644/5644 PASS
- cargo clippy -p apr-cli --features training --lib -- -D warnings: clean
- cargo check --workspace: clean
- rustfmt --check: clean
- LIVE: hex format 12.009 bits (regression-free)
- LIVE: byte-level format 6.582 bits, 0% unk (was 0.111 / 99% unk)

SHIP-TWO impact

- MODEL-1 ship %: unchanged at 91%
- MODEL-2 ship %: unchanged at 57%, but the path forward is NOW
  TECHNICALLY UNBLOCKED. Re-tokenizing the 5g.1 corpus with this fix and
  re-dispatching 5g.2 produces an HONEST val_loss verdict.
- §50.4 cascade: COMPLETE per #1577
- 5g.2 dispatch: OPERATOR-RUNNABLE end-to-end with WORKING encoder
- This PR closes PMAT-CODE-TOKENIZE-BPE-UPSTREAM-001 (task #20)
- Next ship-mover: PMAT-CODE-PRETRAIN-FINETUNE-LIVE-003 (re-encode
  5g.1, re-dispatch 5g.2 LIVE) — operator-dispatchable now.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@noahgift noahgift enabled auto-merge (squash) May 9, 2026 20:59
@noahgift noahgift merged commit 4ef525c into main May 10, 2026
10 checks passed
@noahgift noahgift deleted the feat/bpe-upstream-qwen-encode-fix-clean branch May 10, 2026 07:42
noahgift added a commit that referenced this pull request May 11, 2026
…ETRAIN-INIT-LOAD-003)

Bisects the §61 val_loss > ln(vocab) anomaly. Empirical findings on
lambda-vector RTX 4090:

H4 ROOT CAUSE #1: BF16 dtype mislabel
======================================

The OLD `qwen2.5-coder-0.5b-instruct-fp16.apr` (May-4 import) tags
its tensors with dtype=F16 in the APR v2 header — but the SOURCE
HF safetensors `model.safetensors` uses dtype=BF16. When the loader
sees dtype=F16, it dequantizes via `f16_to_f32`, producing values
that diverge from the BF16-correct decode.

Element-0 cross-check on `model.norm.weight`:
  Safetensors source (BF16-decoded): 7.5625, 8.0, 7.21875, ...
  Old APR (loaded as F16):           7.0625, 7.125, 7.0, ...
  Fresh APR (loaded as BF16):        7.5625, 8.0, 7.21875, ...  ← matches source

Element-0 cross-check on `model.layers.0.self_attn.q_proj.bias`:
  Safetensors (BF16):  0.0674, -0.0859, 0.1104, -0.0605, ...
  Old APR (F16):       (different, distorted)
  Fresh APR (BF16):    0.0674, -0.0859, 0.1104, -0.0605, ...  ← matches source

Fix: re-import Qwen safetensors via current `apr import`. The current
`StreamingWriter::add_raw_f16_tensor` correctly preserves BF16 (lines
100-104 of streaming_writer.rs). The old APR was created with a buggy
import path that mis-tagged BF16 as F16.
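
To illustrate why a dtype tag matters, the sketch below decodes the same 16-bit pattern once as BF16 and once as IEEE F16. The two helper functions are written here for illustration and are not the loader's own `f16_to_f32`; the bit pattern `0x40F2` is chosen because it decodes to 7.5625 under BF16. The sketch shows the general hazard only and does not attempt to reproduce the exact distorted values the old import path produced.

```rust
/// BF16 is the top 16 bits of an f32, so decoding is a shift into the high half.
fn bf16_to_f32(bits: u16) -> f32 {
    f32::from_bits((bits as u32) << 16)
}

/// IEEE 754 half precision: 1 sign bit, 5 exponent bits (bias 15), 10 fraction bits.
fn f16_to_f32(bits: u16) -> f32 {
    let sign = if bits & 0x8000 != 0 { -1.0f32 } else { 1.0f32 };
    let exp = ((bits >> 10) & 0x1f) as i32;
    let frac = (bits & 0x03ff) as f32;
    match exp {
        // Subnormal: no implicit leading one, value = frac * 2^-24.
        0 => sign * frac * 2f32.powi(-24),
        // All-ones exponent encodes infinity or NaN.
        0x1f => {
            if frac == 0.0 {
                sign * f32::INFINITY
            } else {
                f32::NAN
            }
        }
        // Normal: implicit leading one, exponent bias 15.
        _ => sign * (1.0 + frac / 1024.0) * 2f32.powi(exp - 15),
    }
}
```

Decoding `0x40F2` gives 7.5625 as BF16 and 2.47265625 as F16: both are finite and unremarkable, so a NaN/Inf check alone would not flag the mix-up.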

H4 ROOT CAUSE #2: STILL OPEN
=============================

Even with correct BF16-decoded weights (fresh APR), val_loss at step 1
is **18.55**, still above the uniform-over-vocab baseline quoted as 17.21
(that figure is log2(151643) in bits; ln(151643) is ≈ 11.93 nats, so
18.55 exceeds the baseline in either unit). The dtype fix moved the dial
slightly (was 19.80) but did not resolve the sub-random predictions issue.

Remaining hypotheses for the residual gap:
  H4B (layout): some tensor's row/col-major orientation may differ
       between Qwen export and aprender::Transformer expectations
  H4D (forward path): cuBLAS / CudaBlock forward may produce
       wrong logits despite correct weight values
  Other: tied embedding fall-through path (`lm_head: None` →
       embed_tokens reuse) may have a sign or scale issue

This PR ships the diagnostic infrastructure that PROVED root cause #1
and provides the foundation for bisecting the remaining gap.

What this PR ships
===================

`falsify_h4_init_stats_qwen_embed_norm_sensible` — a host-gated
diagnostic test that:
  - Loads the Qwen 0.5B init APR (prefers fresh, falls back to legacy)
  - Reports tensor stats (mean, std, min, max) for embed_tokens,
    final norm, per-layer norms, q/k/v projections, mlp gates
  - Asserts sensible bounds (embed std ∈ [0.005, 0.5], norm in
    [0.01, 100], etc.)
  - Dumps element-0 values for cross-comparison with safetensors
    source

Industrial validation example output:
  embed_tokens.weight: mean=0.00014, std=0.0152, range[-0.196, 0.128]
                       — sensible HF LLaMA init scale
  model.norm.weight:   mean=7.46, std=0.84, range[-2.28, 17.38]
                       — Qwen-typical (final norm scaled up)
  q_proj.bias L0:      mean=0.03, std=7.88, range[-65.5, 128]
                       — Qwen-typical (large attention biases)
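
The statistics and bounds check described above could be computed roughly as in this sketch. `TensorStats`, `tensor_stats`, and `embed_std_is_sensible` are hypothetical names; only the bound on the embedding standard deviation, [0.005, 0.5], is taken from the text.

```rust
/// Summary statistics for one tensor, accumulated in f64 for stability.
#[derive(Debug)]
struct TensorStats {
    mean: f64,
    std: f64,
    min: f32,
    max: f32,
}

/// Computes mean, population standard deviation, min, and max.
/// Returns `None` for an empty tensor.
fn tensor_stats(values: &[f32]) -> Option<TensorStats> {
    if values.is_empty() {
        return None;
    }
    let n = values.len() as f64;
    let mean = values.iter().map(|&v| v as f64).sum::<f64>() / n;
    let var = values
        .iter()
        .map(|&v| (v as f64 - mean).powi(2))
        .sum::<f64>()
        / n;
    let min = values.iter().copied().fold(f32::INFINITY, f32::min);
    let max = values.iter().copied().fold(f32::NEG_INFINITY, f32::max);
    Some(TensorStats { mean, std: var.sqrt(), min, max })
}

/// Bound from the description: embedding std must lie in [0.005, 0.5].
fn embed_std_is_sensible(stats: &TensorStats) -> bool {
    stats.mean.is_finite() && (0.005..=0.5).contains(&stats.std)
}
```

Bounds of this kind catch collapsed or exploded tensors, but as the Five-Whys below note, they do not catch a distortion that keeps values in a plausible range.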

Five-Whys
==========

1. Why was the OLD Qwen APR tagged F16? Created by a buggy import
   path that didn't pass `is_bf16` flag through to the writer.
   Fixed in current apr-cli but the artifact is preserved on disk.
2. Why does the fresh APR not fully fix val_loss? The dtype fix
   makes loaded values match safetensors, but val_loss=18.55 still
   exceeds ln(vocab)=17.21 — meaning forward path or some other
   tensor is still producing sub-random predictions.
3. Why didn't existing falsifiers catch the dtype mislabel? No
   falsifier asserted "loaded values match safetensors source
   element-by-element". The PMAT-187 NaN/Inf/explosive-mean check
   passes because BF16-as-F16 distortion produces values that are
   neither NaN nor unusually large.
4. Why ship the diagnostic before the full H4 fix? The diagnostic
   itself proves H4 root cause #1 and provides the bisection
   foundation for #2. Per `feedback_falsifier_first_cascade_pattern.md`,
   1 PR ≈ 1 falsifier discharge. The dtype-mislabel discharge is
   real progress.
5. Why does the operator need to know? They have an old Qwen APR
   on disk that mis-decodes silently. With this PR's diagnostic
   they can verify before training; without it, the silent error
   wastes ~17 hours of GPU time per cycle (per §60 evidence).
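
Point 3 describes a missing assertion: loaded values matching the safetensors source element by element. A minimal sketch of such a comparison is given below; the function name `first_mismatch` and the tolerance parameter are assumptions, not part of the shipped diagnostic.

```rust
/// Returns the index of the first element where `loaded` differs from `source`
/// by more than `tol` (or where the comparison involves NaN), or `None` if all
/// elements agree. A length mismatch is reported at the shorter length.
fn first_mismatch(source: &[f32], loaded: &[f32], tol: f32) -> Option<usize> {
    for (i, (&a, &b)) in source.iter().zip(loaded.iter()).enumerate() {
        // Written as a negated `<=` so that NaN counts as a mismatch.
        if !((a - b).abs() <= tol) {
            return Some(i);
        }
    }
    if source.len() != loaded.len() {
        return Some(source.len().min(loaded.len()));
    }
    None
}
```

Applied to the `model.norm.weight` values quoted earlier, the old APR fails at index 0 (7.0625 vs 7.5625) while the fresh APR passes.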

Quality gates (all green)
==========================

- cargo test -p aprender-train --lib falsify_h4_init_stats: PASS
- cargo test -p aprender-train --lib: 7585+ tests PASS
- cargo clippy -p aprender-train --lib -- -D warnings: clean
  (--tests has 4 PRE-EXISTING errors on main; not introduced by this PR)
- rustfmt --check: clean

SHIP-TWO impact
================

- MODEL-1 ship %: unchanged at 91%
- MODEL-2 ship %: unchanged at 57% — H4 root cause #1 found and fix
  available (use fresh APR), but val_loss still >ln(vocab). The
  next-cycle bisection (H4B or H4D) is now well-targeted.
- §60 H1C cascade: FULLY CLOSED per #1598
- §61 evidence: 5g.1-v2 corpus is 7.42 bits entropy / 0% unk
- This PR closes part of PMAT-CODE-PRETRAIN-INIT-LOAD-003 (task #22)

Out-of-scope follow-ups
========================

PMAT-CODE-PRETRAIN-INIT-LOAD-004 (H4 residual cascade):
  - Bisect H4B (layout): forward-pass element-wise compare against
    HF Qwen2 reference at each layer
  - Bisect H4D (forward path): instrument cuBLAS GEMM outputs
    against a CPU reference matmul
  - Fix root cause; flip MODEL-2 ship % 57% → ≥58%

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
noahgift added a commit that referenced this pull request May 13, 2026
…rfaces H4 (PMAT-CODE-PRETRAIN-INIT-LOAD-003) (#1600)

Records the full discharge of PMAT-CODE-PRETRAIN-FINETUNE-LIVE-003
(task #21) and the new H4 defect surface that the honest data
exposed.

Two artifacts:

1. **5g.1 re-encode SUCCESS** — `apr tokenize encode-corpus` with
   PR #1598's upfront vocab-format detection produced a real Python
   corpus from the 3.0 GB JSONL source:
     - 1,241.7 M tokens
     - 405,944 documents
     - 126 shards × 10 M tokens each
     - Shard-0 first 32K: entropy 7.42 bits / 17.21 max; 3324 distinct
       tokens; **0% unk** (was 99.99% unk in §60's broken corpus)
   The data-bug from §60 is fully closed.

2. **5g.2 LIVE dispatch surfaces H4** — Re-running fine-tune from
   Qwen 0.5B init on the now-real corpus aborted at GATE-TRAIN-005:
     - 500-step run: val_loss = 11.55 at epoch 0 (> 10.0 threshold)
     - 1-step diagnostic: val_loss = 19.80 (> ln(vocab) = 17.21)
   val_loss > ln(vocab) means the model assigns LESS than uniform
   probability to true tokens — *worse than random init*. The Qwen
   init weights load (PR #1579's populate-coverage fix is in main)
   but produce sub-random predictions.
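
The uniform-over-vocab baseline used in this comparison can be computed directly. The sketch below assumes the 151,643-token vocabulary size quoted in the Five-Whys that follow; the function names are hypothetical. Note that the 17.21 figure corresponds to the base-2 logarithm (bits), while the natural logarithm of the same vocabulary size is about 11.93 (nats).

```rust
/// Cross-entropy of a uniform distribution over the vocab, in nats.
fn uniform_baseline_nats(vocab_size: u32) -> f64 {
    (vocab_size as f64).ln()
}

/// The same baseline expressed in bits.
fn uniform_baseline_bits(vocab_size: u32) -> f64 {
    (vocab_size as f64).log2()
}

/// Geometric-mean probability assigned to the true token for a given
/// mean cross-entropy loss measured in nats.
fn mean_true_token_prob(loss_nats: f64) -> f64 {
    (-loss_nats).exp()
}
```

A loss above the baseline means the geometric-mean probability given to the true token is below `1 / vocab_size`, which is why such a reading points at a structural defect rather than an undertrained model.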

Five-Whys

1. Why was val_loss = 19.80 at step 1? Industry baseline for Qwen
   0.5B zero-shot on Python is ~1.5–3.0; the uniform-over-vocab baseline
   is log2(151643) = 17.21 bits (ln(151643) ≈ 11.93 nats). 19.80 exceeds
   the baseline in either unit, which means the model is
   *anti-aligned* with held-out tokens.
2. Why anti-aligned despite Qwen init being loaded? Some structural
   component of the init pipeline is broken at a layer that PR #1579
   doesn't cover.
3. Four hypotheses for H4:
     A. Tied weights — `tie_word_embeddings: true` on Qwen 0.5B; if
        populate writes embed_tokens but doesn't propagate to
        lm_head (or writes them separately to random buffers),
        forward predictions are random while embeddings are correct.
     B. Layout mismatch — GGUF/APR are row-major (tensor-layout-v1);
        if init APR's lm_head is column-major, matmul produces
        wrong logits.
     C. Norm scale — RMSNorm weights loaded but rms_norm_eps mismatch
        cascades through forward.
     D. Residual stream — some block's residual contributes zero from
        an uninitialized buffer.
4. Why ship the diagnosis but not the H4 fix? Each hypothesis is its
   own falsifier-discharge cascade per `feedback_falsifier_first_cascade_pattern.md`.
   Multi-PR scope.
5. Why does this matter for ship %? FALSIFY-005 status flips from
   NUMERICALLY-PASSED-METHODOLOGY-SUSPECT (pre-§61, fake pass on
   broken corpus) to RED-WITH-METHODOLOGICALLY-HONEST (post-§61,
   real defect on real corpus). The honest RED is itself progress
   — the contract now reports the binding defect.
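
Hypothesis B (layout mismatch) can be made concrete with a small sketch: the same flat weight buffer yields different outputs depending on whether it is indexed as row-major or column-major. The functions below are illustrative only and do not reflect aprender's matmul code.

```rust
/// y = W x, with W stored row-major: element (r, c) lives at w[r * cols + c].
fn matvec_row_major(w: &[f32], rows: usize, cols: usize, x: &[f32]) -> Vec<f32> {
    assert_eq!(w.len(), rows * cols);
    assert_eq!(x.len(), cols);
    (0..rows)
        .map(|r| (0..cols).map(|c| w[r * cols + c] * x[c]).sum::<f32>())
        .collect()
}

/// y = W x, with W stored column-major: element (r, c) lives at w[c * rows + r].
fn matvec_col_major(w: &[f32], rows: usize, cols: usize, x: &[f32]) -> Vec<f32> {
    assert_eq!(w.len(), rows * cols);
    assert_eq!(x.len(), cols);
    (0..rows)
        .map(|r| (0..cols).map(|c| w[c * rows + r] * x[c]).sum::<f32>())
        .collect()
}
```

With the buffer `[1, 2, 3, 4, 5, 6]` read as a 2×3 matrix and input `[1, 0, 0]`, the row-major reading gives `[1, 4]` and the column-major reading gives `[1, 2]`. Every weight value is correct in both cases, so value-level checks pass while the logits are wrong.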

SHIP-TWO impact

- MODEL-1 ship %: unchanged at 91% (this is MODEL-2 work)
- MODEL-2 ship %: unchanged at 57% — diagnosis correct, H4 cascade
  is the gate
- §60 H1C (data-bug) cascade: FULLY CLOSED. Encoder works
  end-to-end on real Qwen vocab + real Python corpus.

Closes PMAT-CODE-PRETRAIN-FINETUNE-LIVE-003 (task #21).

Tracking PMAT-CODE-PRETRAIN-INIT-LOAD-003 (H4 cascade) as the next
ship-mover.

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
noahgift added a commit that referenced this pull request May 13, 2026
…ETRAIN-INIT-LOAD-003) (#1601)
