feat(gguf-export): Q4_K divisibility check + shape pass-through (PMAT-690 P3-C-prep defects 2+3)#1771

Merged
noahgift merged 7 commits into main from feat/gguf-q4k-shape-fallback-defect-2
May 18, 2026

Conversation

@noahgift noahgift commented May 17, 2026

Contributor

Summary

Bundles two related defects in GGUF Q4_K export, both surfaced by the P2-E ep49 publish-readiness preflight on 2026-05-17:

Defect 2 (initial commit): encode_gguf_data quantized any 2D tensor with ≥256 elements to Q4_K without checking that shape[1] (= K = ne[0]) is 256-divisible. llama-cli rejected the export with tensor 'blk.0.ffn_gate.weight' of type 12 (q4_K) has 896 elements per row, not a multiple of block size (256).

Defect 3 (follow-up commit, unmasked by the Defect 2 fix): with the divisibility check in place, ffn_down [out=896, in=4864] (K=4864, divisible ✓) was quantized to Q4_K, but llama-cli then rejected the export with gguf_init_from_file_impl: tensor 'blk.0.ffn_gate.weight' has offset 1091882496, expected 1091532288. Root cause: 3 call sites construct gguf_shape_usize = [shape[1], shape[0]] and pass the swapped shape to quantize_q4_k_matrix, which treats shape[0] as rows and pads shape[1] to the next 256-multiple. For ffn_down the swapped shape [4864, 896] pads each 896-element row to 1024, inflating the output to 2,801,664 bytes when llama.cpp expects 2,451,456: a 350,208-byte excess that exactly matches the offset drift.

Fix: pass APR-native shape directly (no swap) AND add the same divisibility check to fusion.rs + export_include_01.rs for consistency with encode_gguf_data.

Why this didn't surface for 1.5B / 7B

| Variant | hidden | hidden % 256 | intermediate % 256 |
|---|---|---|---|
| 0.5B (this ship) | 896 | 128 | 0 |
| 1.5B | 1536 | 0 | 0 |
| 7B | 3584 | 0 | 0 |

For 1.5B and 7B, both dims are 256-divisible, so the swap gives the same total byte count (just different layout). The 0.5B ship is the first time we've hit a model where one dim isn't 256-divisible. The data layout difference for 1.5B / 7B is likely producing wrong inference output, which is a follow-up investigation (filed as P3-C-prep defect 4 to scope).
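
The byte-count argument can be checked with a small sketch. It assumes a size model in which each row is padded up to a whole number of 256-element, 144-byte super-blocks (the figures used elsewhere in this PR). `q4k_bytes_row_padded` and `swap_changes_size` are hypothetical names, not functions in aprender, and the intermediate sizes 8960 (1.5B) and 18944 (7B) are assumptions taken from the published Qwen2 configs rather than from this PR.

```rust
const Q4K_BLOCK_ELEMS: usize = 256; // elements per Q4_K super-block
const Q4K_BLOCK_BYTES: usize = 144; // bytes per Q4_K super-block

/// Size model (assumption): every row of `cols` elements is padded up to a
/// whole number of super-blocks before being encoded.
fn q4k_bytes_row_padded(rows: usize, cols: usize) -> usize {
    let blocks_per_row = (cols + Q4K_BLOCK_ELEMS - 1) / Q4K_BLOCK_ELEMS;
    rows * blocks_per_row * Q4K_BLOCK_BYTES
}

/// True when swapping [out, in] to [in, out] changes the encoded byte count.
/// When both dims are multiples of 256 no padding occurs either way, so the
/// total is identical and only the layout differs.
fn swap_changes_size(out_dim: usize, in_dim: usize) -> bool {
    q4k_bytes_row_padded(out_dim, in_dim) != q4k_bytes_row_padded(in_dim, out_dim)
}
```

Under this model, `swap_changes_size(896, 4864)` is true for the 0.5B ffn_down shape, while the 1.5B and 7B shapes give false, which is consistent with the swap going unnoticed at load time for those variants.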

End-to-end verification

Re-exported ep49.apr → q4k.gguf with both fixes in the local binary:

$ /usr/local/bin/apr export ep49.apr --format gguf --quantize int4 -o q4k.gguf
[GGUF-EXPORT-Q4K-FALLBACK] blk.0.ffn_gate.weight (shape [4864, 896]) — K=896 not divisible by 256; falling back to F32
[GGUF-EXPORT-Q4K-FALLBACK] blk.0.attn_k.weight (shape [128, 896]) — K=896 not divisible by 256; falling back to F32
... (Defect 2 fallback firing on ~7 tensors/layer)
✓ Export successful: 2.01 GiB

$ llama-cli -m q4k.gguf -p "def factorial(n):" -n 8
# Defect 2 check ✓ — no Q4_K block-size rejection
# Defect 3 check ✓ — no offset drift error
# Loads to: "cannot find tokenizer merges in model file"
# (= Defect 1, fixed in PR #1769)

Tradeoffs

Qwen2 0.5B GGUF Q4_K export inflates from ~700 MB (Q4_K throughout) → ~2.0 GB (F32 fallback for ~95% of weights). Acceptable for v1 stack-existence-proof ship (SPEC §88) — the alternative is a broken artifact. Future enhancement: Q4_0 (block_size=32) for K=896 tensors would give ~1.1 GB.

Tests

  • 7 new q4k_divisibility tests including byte-count pin: q4k_byte_count_matches_llama_cpp_expectation asserts bytes.len() == (rows * cols / 256) * 144 = 2_451_456 for ffn_down.
  • All 55 pre-existing q4k tests across aprender-core pass.
  • All 7 pre-existing gguf_export tests pass.

Test plan

  • 7 new q4k_divisibility unit tests pass
  • 55 pre-existing q4k tests still pass
  • End-to-end: rebuilt ep49 GGUF Q4_K — llama-cli no longer rejects Q4_K block size OR offsets
  • After PR #1769 (feat(apr-stamp): --tokenizer flag embeds vocab + merges, P3-C-prep defect 1) lands: re-stamp with embedded tokenizer + re-export → llama-cli runs inference
  • Re-run publish-readiness preflight — should report GO
  • Execute apr publish /tmp/albor-370m-staging paiml/albor-370m-v1 ...

🤖 Generated with Claude Code

…ep defect 2)

Surfaced by P2-E ep49 publish-readiness preflight on 2026-05-17. The
GGUF Q4_K export at /tmp/albor-370m-staging/albor-370m-v1-q4k.gguf was
rejected by llama-cli with:

  tensor 'blk.0.ffn_gate.weight' of type 12 (q4_K) has 896 elements
  per row, not a multiple of block size (256)

Root cause: encode_gguf_data quantized any 2D tensor with >= 256 elements
to Q4_K without checking that the inner dim (K) is divisible by Q4_K's
block_size of 256.

For Qwen2 0.5B (hidden=896, intermediate=4864) most attention and FFN
projections have K=896. 896 % 256 = 128, so llama-cli rejects every
such tensor.

Fix: add `shape[1] % 256 == 0` to the Q4_K eligibility check in
encode_gguf_data. Non-divisible tensors fall through to the existing
F32 path (matches llama.cpp/convert_hf_to_gguf.py convention of keeping
unconvertible tensors at F16/F32).
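
A minimal sketch of the eligibility rule described above, assuming the decision reduces to a pure function of the APR-native shape and the `use_q4k` flag. `q4k_eligible` is a hypothetical name, not the code in `encode_gguf_data`, and the name-based rule that keeps the embedding and lm_head at F32 is deliberately left out.

```rust
/// Q4_K super-block size in elements, per the llama-cli error message.
const Q4K_BLOCK_ELEMS: usize = 256;

/// Hypothetical stand-in for the Q4_K eligibility decision.
/// `shape` is the APR-native [out, in] shape, so shape[1] is K.
fn q4k_eligible(shape: &[usize], use_q4k: bool) -> bool {
    if !use_q4k || shape.len() != 2 {
        return false;
    }
    let elems = shape[0] * shape[1];
    let k = shape[1];
    // Pre-existing rule: at least one block's worth of elements.
    // New rule: K must be a whole number of super-blocks.
    elems >= Q4K_BLOCK_ELEMS && k % Q4K_BLOCK_ELEMS == 0
}
```

With this predicate, ffn_gate [4864, 896] returns false and would take the F32 path, while ffn_down [896, 4864] and the exact-boundary case [128, 256] remain eligible.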

Tradeoff: Qwen2 0.5B Q4_K export will be ~2.1 GB instead of ~700 MB
because most tensors fall back. Acceptable for v1 stack-existence-proof
ship target (SPEC §88) — alternative is a broken artifact. Larger Qwen2
variants (1.5B hidden=1536, 7B hidden=3584) are unaffected because their
K dims stay 256-divisible.

Tests: 6 unit tests in q4k_divisibility_tests covering:
  - Qwen2 0.5B ffn_gate.weight [4864, 896] → F32 fallback
  - ffn_down.weight [896, 4864] → Q4_K (still works)
  - Exact-256 boundary [128, 256] → Q4_K
  - All four Qwen2 attention projections → F32 fallback
  - Embedding + lm_head always F32 (existing path preserved)
  - use_q4k=false → always F32

All 7 pre-existing gguf_export tests still pass.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@noahgift noahgift enabled auto-merge (squash) May 17, 2026 16:39
…PMAT-690 P3-C-prep defect 3)

Surfaced when defect 2 (Q4_K block-size divisibility check) was applied
to the P2-E ep49 export. Defect 2 unblocked the per-row check but
llama-cli then rejected the file with:

  gguf_init_from_file_impl: tensor 'blk.0.ffn_gate.weight' has offset
  1091882496, expected 1091532288

Root cause: encode_gguf_data + fusion.rs + export_include_01.rs all
construct `gguf_shape_usize = [shape[1], shape[0]]` (a swap) before
calling `quantize_q4_k_matrix(data, &gguf_shape_usize)`. The quantizer
function treats `shape[0]` as `rows` and `shape[1]` as `cols`, padding
`cols` up to the next multiple of 256.

For Qwen2 0.5B ffn_down [out=896, in=4864]:
- Swapped shape passed = [4864, 896]
- Function reads rows=4864, cols=896
- super_blocks_per_row = ceil(896/256) = 4 (PADDING to 1024)
- Total bytes = 4864 * 4 * 144 = 2,801,664

But llama-cpp expects:
- ne[0] = 4864 (per-row), ne[1] = 896 (rows)
- super-blocks = (4864 * 896) / 256 = 17,024
- Total bytes = 17,024 * 144 = 2,451,456

Excess = 350,208 bytes — exactly the offset drift llama-cli reported.

Fix: pass APR-native shape directly (no swap). The quantizer then reads
rows=out=896, cols=in=4864 (= K, 256-divisible), iterates per-out-row
contiguous slices, produces 19 blocks/row × 896 rows = 17,024 blocks.
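
The arithmetic above can be reproduced with a short sketch. It models only the output size of a row-padding quantizer (256 elements and 144 bytes per super-block), not the quantization itself. `q4k_blocks`, `q4k_bytes`, and `swap_excess_bytes` are hypothetical helpers, not the `quantize_q4_k_matrix` implementation.

```rust
const Q4K_BLOCK_ELEMS: usize = 256; // elements per super-block
const Q4K_BLOCK_BYTES: usize = 144; // bytes per super-block

/// Super-blocks emitted when each row of `cols` elements is padded up to a
/// multiple of 256 (assumed behavior of the quantizer, per the text above).
fn q4k_blocks(rows: usize, cols: usize) -> usize {
    rows * ((cols + Q4K_BLOCK_ELEMS - 1) / Q4K_BLOCK_ELEMS)
}

/// Encoded size in bytes for a [rows, cols] matrix.
fn q4k_bytes(rows: usize, cols: usize) -> usize {
    q4k_blocks(rows, cols) * Q4K_BLOCK_BYTES
}

/// Extra bytes written when a tensor stored as [out, in] is passed as
/// [in, out]. Returns 0 when the swap does not inflate the output.
fn swap_excess_bytes(out_dim: usize, in_dim: usize) -> usize {
    q4k_bytes(in_dim, out_dim).saturating_sub(q4k_bytes(out_dim, in_dim))
}
```

For ffn_down, `swap_excess_bytes(896, 4864)` evaluates to 350,208, the same value as the difference between the two offsets in the llama-cli error.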

Also adds the divisibility guard to fusion.rs and export_include_01.rs
to keep them consistent with encode_gguf_data — fused tensors and
tied-output-weight construction now fall back to F32 when their K dim
isn't 256-divisible.

End-to-end verification on Qwen2 0.5B ep49 GGUF Q4_K export:
- llama-cli loads the file without Q4_K rejection
- llama-cli loads the file without offset drift error
- Only remaining error is "cannot find tokenizer merges" (defect 1 —
  fixed in PR #1769, `apr stamp --tokenizer`)

Why this latent bug hadn't surfaced for 1.5B / 7B exports: when BOTH
shape[0] and shape[1] are 256-divisible (true for Qwen2 1.5B with
hidden=1536, 7B with hidden=3584), the swap doesn't change the total
byte count (rows*cols/256 * 144 either way) — the inflation only
appears when one dim is not 256-divisible. The data LAYOUT difference
remains for those models, but llama-cli accepts the byte count so the
file loads — likely producing wrong inference, which is a follow-up
investigation. For the 0.5B ship target the immediate Q4_K-compatible
byte count is what unblocks publish.

Tests: new q4k_byte_count_matches_llama_cpp_expectation pins the exact
byte count for ffn_down [896, 4864] = 2,451,456. All 7 q4k_divisibility
tests + 55 q4k tests across aprender-core pass.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@noahgift noahgift changed the title from "feat(gguf-export): Q4_K shape divisibility fallback (PMAT-690 P3-C-prep defect 2)" to "feat(gguf-export): Q4_K divisibility check + shape pass-through (PMAT-690 P3-C-prep defects 2+3)" May 17, 2026
noahgift and others added 5 commits May 17, 2026 19:10
Workspace fmt drift accumulated in main between v0.33.0 cut and now.
PR #1771's CI lint surfaced it on this branch. No semantic changes —
all diffs are whitespace/wrap rearrangements from cargo fmt.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…ust 1.93 new lints)

CI's clippy job promoted pedantic warnings to errors via -D warnings.
Rust 1.93 added `manual_is_multiple_of` (3 sites in aprender-test-lib)
and `format_in_format_args` to pedantic. The aprender-test-lib usages
are pre-existing; bulk cleanup deferred to a focused PR.

Also fixed the one format_in_format_args site introduced in this PR
(fusion.rs:78) by inlining the format! into the eprintln! args.
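
For readers unfamiliar with the lint, here is an illustrative before/after of the `format_in_format_args` pattern. This is not the actual code at fusion.rs:78. Both functions and the message text are hypothetical, and they return a `String` rather than calling `eprintln!` so the two forms can be compared directly.

```rust
/// Pattern the lint flags: a `format!` call used as an argument to another
/// formatting macro, which allocates an intermediate String.
fn fallback_line_nested(name: &str, k: usize) -> String {
    format!(
        "[GGUF-EXPORT-Q4K-FALLBACK] {}",
        format!("{name}: K={k} not divisible by 256")
    )
}

/// Inlined form: the inner arguments are moved into the outer format string.
fn fallback_line_inlined(name: &str, k: usize) -> String {
    format!("[GGUF-EXPORT-Q4K-FALLBACK] {name}: K={k} not divisible by 256")
}
```

Both functions produce the same text; the inlined form simply avoids the nested call that clippy reports.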

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@noahgift noahgift merged commit a745a9a into main May 18, 2026
10 checks passed
@noahgift noahgift deleted the feat/gguf-q4k-shape-fallback-defect-2 branch May 18, 2026 00:14