feat(gguf-export): Q4_K divisibility check + shape pass-through (PMAT-690 P3-C-prep defects 2+3) by noahgift · Pull Request #1771 · paiml/aprender

noahgift · 2026-05-17T16:39:42Z

Summary

Bundles two related defects in GGUF Q4_K export, both surfaced by the P2-E ep49 publish-readiness preflight on 2026-05-17:

Defect 2 (initial commit): encode_gguf_data quantized any 2D tensor with ≥256 elements to Q4_K without checking that shape[1] (= K = ne[0]) is 256-divisible. llama-cli rejected the export with tensor 'blk.0.ffn_gate.weight' of type 12 (q4_K) has 896 elements per row, not a multiple of block size (256).
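A minimal sketch of the eligibility rule described above, assuming a hypothetical helper `q4k_eligible` (the real check lives inside encode_gguf_data and is not reproduced here). It only models the 2D, size, and K-divisibility conditions. The "embedding and lm_head always F32" path mentioned in the commit message is left out.

```rust
/// Q4_K packs weights into super-blocks of 256 elements.
const Q4K_BLOCK_SIZE: usize = 256;

/// Hypothetical helper (not the actual `encode_gguf_data` code): decide whether
/// a tensor with APR-native shape `[rows, k]` may be stored as Q4_K.
fn q4k_eligible(shape: &[usize], use_q4k: bool) -> bool {
    match shape {
        // Only 2D tensors whose inner dim K fills whole super-blocks qualify.
        &[rows, k] => use_q4k && rows * k >= Q4K_BLOCK_SIZE && k % Q4K_BLOCK_SIZE == 0,
        // Anything that is not 2D stays on the F32 path.
        _ => false,
    }
}

fn main() {
    // Qwen2 0.5B ffn_gate: K = 896 and 896 % 256 = 128, so fall back to F32.
    assert!(!q4k_eligible(&[4864, 896], true));
    // ffn_down: K = 4864 = 19 * 256, so Q4_K is allowed.
    assert!(q4k_eligible(&[896, 4864], true));
    // Exact boundary: K = 256.
    assert!(q4k_eligible(&[128, 256], true));
    // Quantization disabled: always F32.
    assert!(!q4k_eligible(&[896, 4864], false));
    println!("all eligibility checks passed");
}
```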

Defect 3 (follow-up commit, unmasked by the Defect 2 fix): with the divisibility check in place, ffn_down [out=896, in=4864] (K=4864, divisible ✓) was quantized to Q4_K, but llama-cli then rejected the export with gguf_init_from_file_impl: tensor 'blk.0.ffn_gate.weight' has offset 1091882496, expected 1091532288. Root cause: 3 call sites construct gguf_shape_usize = [shape[1], shape[0]] and pass the swapped shape to quantize_q4_k_matrix, which treats shape[0] as rows and pads shape[1] to the next 256-multiple. For ffn_down this inflated the output to 2,801,664 bytes when llama.cpp expects 2,451,456. The 350,208-byte excess exactly matches the offset drift (1091882496 − 1091532288 = 350,208).

Fix: pass APR-native shape directly (no swap) AND add the same divisibility check to fusion.rs + export_include_01.rs for consistency with encode_gguf_data.
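The byte counts in Defect 3 can be reproduced with a small model of the padding behaviour. This is a sketch, not the real quantize_q4_k_matrix: `padded_q4k_bytes` and `expected_q4k_bytes` are hypothetical names, and the only facts taken from the PR are the 256-element block size, the 144-byte block, and the "rows = shape[0], pad shape[1]" rule.

```rust
const Q4K_BLOCK_SIZE: usize = 256; // elements per super-block
const Q4K_BLOCK_BYTES: usize = 144; // bytes per super-block

/// Model of the quantizer's sizing rule: `shape[0]` is rows, and `shape[1]`
/// is padded up to the next multiple of 256 before blocks are counted.
fn padded_q4k_bytes(shape: [usize; 2]) -> usize {
    let [rows, cols] = shape;
    rows * cols.div_ceil(Q4K_BLOCK_SIZE) * Q4K_BLOCK_BYTES
}

/// What the loader expects: (total elements / 256) super-blocks, no padding.
fn expected_q4k_bytes(shape: [usize; 2]) -> usize {
    let [rows, cols] = shape;
    rows * cols / Q4K_BLOCK_SIZE * Q4K_BLOCK_BYTES
}

fn main() {
    let native = [896, 4864]; // ffn_down as [out, in]
    let swapped = [4864, 896]; // what the three call sites used to pass

    let buggy = padded_q4k_bytes(swapped); // 4864 rows * 4 blocks * 144
    let fixed = padded_q4k_bytes(native); // 896 rows * 19 blocks * 144
    let expected = expected_q4k_bytes(native);

    println!("swapped: {buggy} bytes, native: {fixed} bytes, expected: {expected} bytes");
    println!("excess from the swap: {} bytes", buggy - expected);
    assert_eq!(fixed, expected);
}
```

With the native shape the padded and expected counts agree, because K=4864 is already a whole number of super-blocks and no padding is applied.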

Why this didn't surface for 1.5B / 7B

Variant	hidden	hidden % 256
0.5B (this ship)	896	128
1.5B	1536	0
7B	3584	0

For 1.5B and 7B, both dims are 256-divisible, so the swap gives the same total byte count (just different layout). The 0.5B ship is the first time we've hit a model where one dim isn't 256-divisible. The data layout difference for 1.5B / 7B is likely producing wrong inference output, which is a follow-up investigation (filed as P3-C-prep defect 4 to scope).
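The byte-count argument can be checked per variant with the same sizing rule. In this sketch the hidden sizes come from the table above, while the intermediate sizes for 1.5B (8960) and 7B (18944) are assumptions taken from the public Qwen2 configs, not from this PR. `swap_changes_byte_count` is a hypothetical helper.

```rust
const Q4K_BLOCK_SIZE: usize = 256;
const Q4K_BLOCK_BYTES: usize = 144;

/// Sizing rule as described in this PR: pad the inner dim to a 256-multiple.
fn q4k_bytes(rows: usize, cols: usize) -> usize {
    rows * cols.div_ceil(Q4K_BLOCK_SIZE) * Q4K_BLOCK_BYTES
}

/// True when passing ffn_down as [intermediate, hidden] instead of
/// [hidden, intermediate] changes the encoded size.
fn swap_changes_byte_count(hidden: usize, intermediate: usize) -> bool {
    q4k_bytes(hidden, intermediate) != q4k_bytes(intermediate, hidden)
}

fn main() {
    // (label, hidden, intermediate); intermediate sizes for 1.5B and 7B are assumed.
    let variants = [("0.5B", 896, 4864), ("1.5B", 1536, 8960), ("7B", 3584, 18944)];
    for (label, hidden, intermediate) in variants {
        println!(
            "{label}: hidden % 256 = {}, swap changes size: {}",
            hidden % Q4K_BLOCK_SIZE,
            swap_changes_byte_count(hidden, intermediate)
        );
    }
}
```

Equal byte counts only mean the file passes the loader's size and offset checks. They say nothing about whether the block contents are correct, which is the open question deferred to defect 4.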

End-to-end verification

Re-exported ep49.apr → q4k.gguf with both fixes in the local binary:

$ /usr/local/bin/apr export ep49.apr --format gguf --quantize int4 -o q4k.gguf
[GGUF-EXPORT-Q4K-FALLBACK] blk.0.ffn_gate.weight (shape [4864, 896]) — K=896 not divisible by 256; falling back to F32
[GGUF-EXPORT-Q4K-FALLBACK] blk.0.attn_k.weight (shape [128, 896]) — K=896 not divisible by 256; falling back to F32
... (Defect 2 fallback firing on ~7 tensors/layer)
✓ Export successful: 2.01 GiB

$ llama-cli -m q4k.gguf -p "def factorial(n):" -n 8
# Defect 2 check ✓ — no Q4_K block-size rejection
# Defect 3 check ✓ — no offset drift error
# Loads to: "cannot find tokenizer merges in model file"
# (= Defect 1, fixed in PR #1769)

Tradeoffs

Qwen2 0.5B GGUF Q4_K export inflates from ~700 MB (Q4_K throughout) → ~2.0 GB (F32 fallback for ~95% of weights). Acceptable for v1 stack-existence-proof ship (SPEC §88) — the alternative is a broken artifact. Future enhancement: Q4_0 (block_size=32) for K=896 tensors would give ~1.1 GB.
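To make the per-tensor cost of the fallback concrete, here is a rough sizing sketch for a single K=896 tensor. The Q4_K figures (256 elements, 144 bytes) come from this PR. The Q4_0 block size of 32 is also from the PR, but the 18 bytes per Q4_0 block is taken from the ggml block layout and should be read as an assumption here. `block_format_bytes` is a hypothetical helper, and the whole-file totals quoted above are not derived by this code.

```rust
/// Bytes for a block-quantized tensor of shape `[rows, k]`, or `None` when K
/// is not a whole number of blocks (the case that forces a fallback).
fn block_format_bytes(shape: [usize; 2], block_elems: usize, block_bytes: usize) -> Option<usize> {
    let [rows, k] = shape;
    if k % block_elems != 0 {
        return None;
    }
    Some(rows * (k / block_elems) * block_bytes)
}

/// Unquantized size: 4 bytes per element.
fn f32_bytes(shape: [usize; 2]) -> usize {
    shape[0] * shape[1] * 4
}

fn main() {
    let ffn_gate = [4864, 896]; // K = 896

    let q4_k = block_format_bytes(ffn_gate, 256, 144); // None: 896 % 256 = 128
    let q4_0 = block_format_bytes(ffn_gate, 32, 18); // 896 = 28 * 32, so Some(..)
    let f32_size = f32_bytes(ffn_gate);

    println!("Q4_K: {q4_k:?}, Q4_0: {q4_0:?}, F32: {f32_size} bytes");
}
```

Both 4-bit formats cost 4.5 bits per weight (144 × 8 / 256 and 18 × 8 / 32), so a Q4_0 fallback for K=896 tensors would recover the per-tensor savings that the F32 fallback gives up.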

Tests

7 new q4k_divisibility tests including byte-count pin: q4k_byte_count_matches_llama_cpp_expectation asserts bytes.len() == (rows * cols / 256) * 144 = 2_451_456 for ffn_down.
All 55 pre-existing q4k tests across aprender-core pass.
All 7 pre-existing gguf_export tests pass.

Test plan

7 new q4k_divisibility unit tests pass
55 pre-existing q4k tests still pass
End-to-end: rebuilt ep49 GGUF Q4_K — llama-cli no longer rejects Q4_K block size OR offsets
After PR #1769 (feat(apr-stamp): --tokenizer flag embeds vocab + merges, P3-C-prep defect 1) lands: re-stamp with embedded tokenizer + re-export → llama-cli runs inference
Re-run publish-readiness preflight — should report GO
Execute apr publish /tmp/albor-370m-staging paiml/albor-370m-v1 ...

🤖 Generated with Claude Code

…ep defect 2)

Surfaced by P2-E ep49 publish-readiness preflight on 2026-05-17. The GGUF Q4_K export at /tmp/albor-370m-staging/albor-370m-v1-q4k.gguf was rejected by llama-cli with:

  tensor 'blk.0.ffn_gate.weight' of type 12 (q4_K) has 896 elements per row, not a multiple of block size (256)

Root cause: encode_gguf_data quantized any 2D tensor with >= 256 elements to Q4_K without checking that the inner dim (K) is divisible by Q4_K's block_size of 256. For Qwen2 0.5B (hidden=896, intermediate=4864) most attention and FFN projections have K=896. 896 % 256 = 128, so llama-cli rejects every such tensor.

Fix: add `shape[1] % 256 == 0` to the Q4_K eligibility check in encode_gguf_data. Non-divisible tensors fall through to the existing F32 path (matches llama.cpp/convert_hf_to_gguf.py convention of keeping unconvertible tensors at F16/F32).

Tradeoff: Qwen2 0.5B Q4_K export will be ~2.1 GB instead of ~700 MB because most tensors fall back. Acceptable for v1 stack-existence-proof ship target (SPEC §88); the alternative is a broken artifact. Larger Qwen2 variants (1.5B hidden=1536, 7B hidden=3584) are unaffected because their K dims stay 256-divisible.

Tests: 6 unit tests in q4k_divisibility_tests covering:
- Qwen2 0.5B ffn_gate.weight [4864, 896] → F32 fallback
- ffn_down.weight [896, 4864] → Q4_K (still works)
- Exact-256 boundary [128, 256] → Q4_K
- All four Qwen2 attention projections → F32 fallback
- Embedding + lm_head always F32 (existing path preserved)
- use_q4k=false → always F32

All 7 pre-existing gguf_export tests still pass.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…PMAT-690 P3-C-prep defect 3)

Surfaced when defect 2 (Q4_K block-size divisibility check) was applied to the P2-E ep49 export. Defect 2 unblocked the per-row check but llama-cli then rejected the file with:

  gguf_init_from_file_impl: tensor 'blk.0.ffn_gate.weight' has offset 1091882496, expected 1091532288

Root cause: encode_gguf_data + fusion.rs + export_include_01.rs all construct `gguf_shape_usize = [shape[1], shape[0]]` (a swap) before calling `quantize_q4_k_matrix(data, &gguf_shape_usize)`. The quantizer function treats `shape[0]` as `rows` and `shape[1]` as `cols`, padding `cols` up to the next multiple of 256.

For Qwen2 0.5B ffn_down [out=896, in=4864]:
- Swapped shape passed = [4864, 896]
- Function reads rows=4864, cols=896
- super_blocks_per_row = ceil(896/256) = 4 (PADDING to 1024)
- Total bytes = 4864 * 4 * 144 = 2,801,664

But llama.cpp expects:
- ne[0] = 4864 (per-row), ne[1] = 896 (rows)
- super-blocks = (4864 * 896) / 256 = 17,024
- Total bytes = 17,024 * 144 = 2,451,456

Excess = 350,208 bytes, exactly the offset drift llama-cli reported.

Fix: pass APR-native shape directly (no swap). The quantizer then reads rows=out=896, cols=in=4864 (= K, 256-divisible), iterates per-out-row contiguous slices, produces 19 blocks/row × 896 rows = 17,024 blocks.

Also adds the divisibility guard to fusion.rs and export_include_01.rs to keep them consistent with encode_gguf_data: fused tensors and tied-output-weight construction now fall back to F32 when their K dim isn't 256-divisible.

End-to-end verification on Qwen2 0.5B ep49 GGUF Q4_K export:
- llama-cli loads the file without Q4_K rejection
- llama-cli loads the file without offset drift error
- Only remaining error is "cannot find tokenizer merges" (defect 1, fixed in PR #1769, `apr stamp --tokenizer`)

Why this latent bug hadn't surfaced for 1.5B / 7B exports: when BOTH shape[0] and shape[1] are 256-divisible (true for Qwen2 1.5B with hidden=1536, 7B with hidden=3584), the swap doesn't change the total byte count (rows*cols/256 * 144 either way). The inflation only appears when one dim is not 256-divisible. The data LAYOUT difference remains for those models, but llama-cli accepts the byte count so the file loads, likely producing wrong inference, which is a follow-up investigation. For the 0.5B ship target the immediate Q4_K-compatible byte count is what unblocks publish.

Tests: new q4k_byte_count_matches_llama_cpp_expectation pins the exact byte count for ffn_down [896, 4864] = 2,451,456. All 7 q4k_divisibility tests + 55 q4k tests across aprender-core pass.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
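The "per-out-row contiguous slices" iteration in the fix can be illustrated by counting blocks over a flat buffer. This is only a sketch of the traversal order, with a hypothetical `count_super_blocks` helper. It performs no quantization and assumes the divisibility guard has already been applied by the caller.

```rust
const Q4K_BLOCK_SIZE: usize = 256;
const Q4K_BLOCK_BYTES: usize = 144;

/// Walk a row-major `[rows, cols]` buffer one output row at a time and count
/// the 256-element super-blocks each row contributes.
fn count_super_blocks(data: &[f32], shape: [usize; 2]) -> usize {
    let [rows, cols] = shape;
    assert_eq!(data.len(), rows * cols, "shape must match the flat buffer");
    assert_eq!(cols % Q4K_BLOCK_SIZE, 0, "caller must apply the divisibility guard first");
    data.chunks_exact(cols)
        .map(|row| row.chunks_exact(Q4K_BLOCK_SIZE).count())
        .sum()
}

fn main() {
    // ffn_down in APR-native order: 896 output rows, K = 4864 inputs per row.
    let shape = [896, 4864];
    let data = vec![0.0f32; shape[0] * shape[1]];

    let blocks = count_super_blocks(&data, shape);
    println!("{blocks} super-blocks, {} bytes", blocks * Q4K_BLOCK_BYTES);
}
```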

Workspace fmt drift accumulated in main between v0.33.0 cut and now. PR #1771's CI lint surfaced it on this branch. No semantic changes — all diffs are whitespace/wrap rearrangements from cargo fmt. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…ust 1.93 new lints) CI's clippy job promoted pedantic warnings to errors via -D warnings. Rust 1.93 added `manual_is_multiple_of` (3 sites in aprender-test-lib) and `format_in_format_args` to pedantic. The aprender-test-lib usages are pre-existing; bulk cleanup deferred to a focused PR. Also fixed the one format_in_format_args site introduced in this PR (fusion.rs:78) by inlining the format! into the eprintln! args. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
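For readers unfamiliar with the format_in_format_args lint, the general shape of the fix looks like the sketch below. The function and message text are hypothetical, loosely modelled on the fallback log line shown earlier, and the actual code at fusion.rs:78 is not reproduced.

```rust
/// Hypothetical message builder, loosely modelled on the fallback log line in
/// the export output. The real code at fusion.rs:78 is not reproduced here.
fn fallback_message(name: &str, shape: [usize; 2]) -> String {
    let k = shape[1];
    // Flagged shape of code (clippy::format_in_format_args):
    //     format!("[GGUF-EXPORT-Q4K-FALLBACK] {}", format!("{name}: K={k}"))
    // Inlined form: the inner format! arguments move into the outer macro,
    // which avoids building an intermediate String.
    format!("[GGUF-EXPORT-Q4K-FALLBACK] {name} (shape {shape:?}): K={k} not divisible by 256")
}

fn main() {
    eprintln!("{}", fallback_message("blk.0.ffn_gate.weight", [4864, 896]));
}
```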

noahgift enabled auto-merge (squash) May 17, 2026 16:39

noahgift changed the title ~~feat(gguf-export): Q4_K shape divisibility fallback (PMAT-690 P3-C-prep defect 2)~~ feat(gguf-export): Q4_K divisibility check + shape pass-through (PMAT-690 P3-C-prep defects 2+3) May 17, 2026

noahgift and others added 5 commits May 17, 2026 19:10

Merge branch 'main' into feat/gguf-q4k-shape-fallback-defect-2

9603021

Merge branch 'main' into feat/gguf-q4k-shape-fallback-defect-2

9724016

Merge branch 'main' into feat/gguf-q4k-shape-fallback-defect-2

bb3a0c8

noahgift merged commit a745a9a into main May 18, 2026
10 checks passed

noahgift deleted the feat/gguf-q4k-shape-fallback-defect-2 branch May 18, 2026 00:14

noahgift mentioned this pull request May 18, 2026

release: v0.34.0 — MODEL-2 §88 stack-existence-proof + apr publish defect cascade #1776

Merged

5 tasks

noahgift merged 7 commits into main from feat/gguf-q4k-shape-fallback-defect-2
