feat(gguf-export): Q4_K divisibility check + shape pass-through (PMAT-690 P3-C-prep defects 2+3) #1771
Merged
…ep defect 2)

Surfaced by P2-E ep49 publish-readiness preflight on 2026-05-17. The GGUF Q4_K export at /tmp/albor-370m-staging/albor-370m-v1-q4k.gguf was rejected by llama-cli with:

    tensor 'blk.0.ffn_gate.weight' of type 12 (q4_K) has 896 elements per row, not a multiple of block size (256)

Root cause: encode_gguf_data quantized any 2D tensor with >= 256 elements to Q4_K without checking that the inner dim (K) is divisible by Q4_K's block_size of 256. For Qwen2 0.5B (hidden=896, intermediate=4864) most attention and FFN projections have K=896. 896 % 256 = 128, so llama-cli rejects every such tensor.

Fix: add `shape[1] % 256 == 0` to the Q4_K eligibility check in encode_gguf_data. Non-divisible tensors fall through to the existing F32 path (matches llama.cpp/convert_hf_to_gguf.py convention of keeping unconvertible tensors at F16/F32).

Tradeoff: Qwen2 0.5B Q4_K export will be ~2.1 GB instead of ~700 MB because most tensors fall back. Acceptable for v1 stack-existence-proof ship target (SPEC §88); the alternative is a broken artifact. Larger Qwen2 variants (1.5B hidden=1536, 7B hidden=3584) are unaffected because their K dims stay 256-divisible.

Tests: 6 unit tests in q4k_divisibility_tests covering:
- Qwen2 0.5B ffn_gate.weight [4864, 896] → F32 fallback
- ffn_down.weight [896, 4864] → Q4_K (still works)
- Exact-256 boundary [128, 256] → Q4_K
- All four Qwen2 attention projections → F32 fallback
- Embedding + lm_head always F32 (existing path preserved)
- use_q4k=false → always F32

All 7 pre-existing gguf_export tests still pass.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
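The eligibility rule described in this commit can be sketched as a small predicate. This is an illustration only: `q4k_eligible` and `Q4_K_BLOCK_SIZE` are hypothetical names, not the identifiers used in encode_gguf_data, and the name-based rule that keeps embedding and lm_head at F32 is left out.

```rust
/// Q4_K super-block size in elements, as quoted in the llama-cli error.
const Q4_K_BLOCK_SIZE: usize = 256;

/// Hypothetical helper: returns true when a tensor of the given shape
/// may be stored as Q4_K, false when it should fall back to F32.
fn q4k_eligible(shape: &[usize], use_q4k: bool) -> bool {
    use_q4k
        // only 2D weight matrices are candidates
        && shape.len() == 2
        // pre-existing condition: at least one block's worth of elements
        && shape[0] * shape[1] >= Q4_K_BLOCK_SIZE
        // new condition: the inner dim K must be a whole number of blocks
        && shape[1] % Q4_K_BLOCK_SIZE == 0
}
```

Under these assumptions `q4k_eligible(&[4864, 896], true)` is false (896 % 256 = 128), while `q4k_eligible(&[896, 4864], true)` and `q4k_eligible(&[128, 256], true)` are both true.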
…PMAT-690 P3-C-prep defect 3)

Surfaced when defect 2 (Q4_K block-size divisibility check) was applied to the P2-E ep49 export. Defect 2 unblocked the per-row check but llama-cli then rejected the file with:

    gguf_init_from_file_impl: tensor 'blk.0.ffn_gate.weight' has offset 1091882496, expected 1091532288

Root cause: encode_gguf_data + fusion.rs + export_include_01.rs all construct `gguf_shape_usize = [shape[1], shape[0]]` (a swap) before calling `quantize_q4_k_matrix(data, &gguf_shape_usize)`. The quantizer function treats `shape[0]` as `rows` and `shape[1]` as `cols`, padding `cols` up to the next multiple of 256.

For Qwen2 0.5B ffn_down [out=896, in=4864]:
- Swapped shape passed = [4864, 896]
- Function reads rows=4864, cols=896
- super_blocks_per_row = ceil(896/256) = 4 (PADDING to 1024)
- Total bytes = 4864 * 4 * 144 = 2,801,664

But llama.cpp expects:
- ne[0] = 4864 (per-row), ne[1] = 896 (rows)
- super-blocks = (4864 * 896) / 256 = 17,024
- Total bytes = 17,024 * 144 = 2,451,456

Excess = 350,208 bytes, exactly the offset drift llama-cli reported (1091882496 - 1091532288).

Fix: pass APR-native shape directly (no swap). The quantizer then reads rows=out=896, cols=in=4864 (= K, 256-divisible), iterates per-out-row contiguous slices, produces 19 blocks/row × 896 rows = 17,024 blocks.

Also adds the divisibility guard to fusion.rs and export_include_01.rs to keep them consistent with encode_gguf_data: fused tensors and tied-output-weight construction now fall back to F32 when their K dim isn't 256-divisible.

End-to-end verification on Qwen2 0.5B ep49 GGUF Q4_K export:
- llama-cli loads the file without Q4_K rejection
- llama-cli loads the file without offset drift error
- Only remaining error is "cannot find tokenizer merges" (defect 1, fixed in PR #1769, `apr stamp --tokenizer`)

Why this latent bug hadn't surfaced for 1.5B / 7B exports: when BOTH shape[0] and shape[1] are 256-divisible (true for Qwen2 1.5B with hidden=1536, 7B with hidden=3584), the swap doesn't change the total byte count (rows*cols/256 * 144 either way); the inflation only appears when one dim is not 256-divisible. The data LAYOUT difference remains for those models, but llama-cli accepts the byte count so the file loads, likely producing wrong inference, which is a follow-up investigation. For the 0.5B ship target the immediate Q4_K-compatible byte count is what unblocks publish.

Tests: new q4k_byte_count_matches_llama_cpp_expectation pins the exact byte count for ffn_down [896, 4864] = 2,451,456. All 7 q4k_divisibility tests + 55 q4k tests across aprender-core pass.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
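The byte-count arithmetic in this commit can be reproduced with a short sketch. The two functions below are hypothetical stand-ins: `q4k_padded_bytes` models only the size behaviour attributed to `quantize_q4_k_matrix` (pad each row to whole super-blocks), and `q4k_expected_bytes` models the size the loader is described as expecting. Neither is the actual implementation.

```rust
/// Elements per Q4_K super-block.
const Q4_K_BLOCK_ELEMS: usize = 256;
/// Bytes per Q4_K super-block.
const Q4_K_BLOCK_BYTES: usize = 144;

/// Size produced when every row is padded up to a whole number of
/// super-blocks (the behaviour the commit message attributes to the quantizer).
fn q4k_padded_bytes(rows: usize, cols: usize) -> usize {
    let super_blocks_per_row = cols.div_ceil(Q4_K_BLOCK_ELEMS);
    rows * super_blocks_per_row * Q4_K_BLOCK_BYTES
}

/// Size the loader expects: total elements / 256 super-blocks, 144 bytes each.
fn q4k_expected_bytes(ne0: usize, ne1: usize) -> usize {
    (ne0 * ne1) / Q4_K_BLOCK_ELEMS * Q4_K_BLOCK_BYTES
}
```

With the swapped shape, `q4k_padded_bytes(4864, 896)` gives 2,801,664; with the APR-native shape, `q4k_padded_bytes(896, 4864)` gives 2,451,456, which equals `q4k_expected_bytes(4864, 896)`. The difference between the two is 350,208 bytes.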
Workspace fmt drift accumulated in main between v0.33.0 cut and now. PR #1771's CI lint surfaced it on this branch. No semantic changes — all diffs are whitespace/wrap rearrangements from cargo fmt. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…ust 1.93 new lints)

CI's clippy job promoted pedantic warnings to errors via -D warnings. Rust 1.93 added `manual_is_multiple_of` (3 sites in aprender-test-lib) and `format_in_format_args` to pedantic. The aprender-test-lib usages are pre-existing; bulk cleanup deferred to a focused PR. Also fixed the one format_in_format_args site introduced in this PR (fusion.rs:78) by inlining the format! into the eprintln! args.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Summary
Bundles two related defects in GGUF Q4_K export, both surfaced by the P2-E ep49 publish-readiness preflight on 2026-05-17:
Defect 2 (initial commit): `encode_gguf_data` quantized any 2D tensor with ≥256 elements to Q4_K without checking that `shape[1]` (= K = ne[0]) is 256-divisible. llama-cli rejected the export with `tensor 'blk.0.ffn_gate.weight' of type 12 (q4_K) has 896 elements per row, not a multiple of block size (256)`.
Defect 3 (follow-up commit, unmasked by Defect 2 fix): with the divisibility check in place, ffn_down [out=896, in=4864] (K=4864, divisible ✓) was quantized to Q4_K, but llama-cli then rejected the export with `gguf_init_from_file_impl: tensor 'blk.0.ffn_gate.weight' has offset 1091882496, expected 1091532288`. Root cause: 3 call sites construct `gguf_shape_usize = [shape[1], shape[0]]` and pass the swapped shape to `quantize_q4_k_matrix`, which treats `shape[0]` as `rows` and pads `shape[1]` to the next 256-multiple. For ffn_down this inflated the output to 2,801,664 bytes when llama.cpp expects 2,451,456, a 350,208-byte excess that exactly matches the offset drift.
Fix: pass APR-native shape directly (no swap) AND add the same divisibility check to `fusion.rs` + `export_include_01.rs` for consistency with `encode_gguf_data`.
Why this didn't surface for 1.5B / 7B
For 1.5B and 7B, both dims are 256-divisible, so the swap gives the same total byte count (just different layout). The 0.5B ship is the first time we've hit a model where one dim isn't 256-divisible. The data layout difference for 1.5B / 7B is likely producing wrong inference output, which is a follow-up investigation (filed as P3-C-prep defect 4 to scope).
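A minimal sketch of why the swap is invisible in the byte count when both dims are 256-divisible, assuming a quantizer that pads each row up to whole 256-element super-blocks of 144 bytes. The function names are hypothetical and the shape (1536, 4096) is an illustrative pair of 256-divisible dims, not a tensor taken from a specific model. The sketch compares sizes only and says nothing about data layout.

```rust
/// Elements and bytes per Q4_K super-block.
const BLOCK_ELEMS: usize = 256;
const BLOCK_BYTES: usize = 144;

/// Byte count when each row is padded to whole super-blocks.
fn padded_q4k_bytes(rows: usize, cols: usize) -> usize {
    rows * cols.div_ceil(BLOCK_ELEMS) * BLOCK_BYTES
}

/// True when swapping the two dims leaves the padded byte count unchanged.
/// If both dims are multiples of 256 no padding occurs, so both orders
/// reduce to d0 * d1 / 256 * 144.
fn swap_preserves_byte_count(d0: usize, d1: usize) -> bool {
    padded_q4k_bytes(d0, d1) == padded_q4k_bytes(d1, d0)
}
```

Here `swap_preserves_byte_count(1536, 4096)` is true, while `swap_preserves_byte_count(896, 4864)` is false because 896 is padded to 1024 in one order only.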
End-to-end verification
Re-exported ep49.apr → q4k.gguf with both fixes in the local binary:
Tradeoffs
Qwen2 0.5B GGUF Q4_K export inflates from ~700 MB (Q4_K throughout) → ~2.0 GB (F32 fallback for ~95% of weights). Acceptable for v1 stack-existence-proof ship (SPEC §88) — the alternative is a broken artifact. Future enhancement: Q4_0 (block_size=32) for K=896 tensors would give ~1.1 GB.
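To make the per-tensor tradeoff concrete, the sketch below computes storage for one [4864, 896] tensor under three formats. The block parameters are the standard ggml ones (F32: 4 bytes per element; Q4_0: 18 bytes per 32 elements; Q4_K: 144 bytes per 256 elements). `tensor_bytes` and the tuple constants are hypothetical helpers for illustration, not part of the exporter, and the sketch makes no claim about whole-file sizes.

```rust
/// Each format is described as (elements per block, bytes per block).
const F32: (usize, usize) = (1, 4);
const Q4_0: (usize, usize) = (32, 18);
const Q4_K: (usize, usize) = (256, 144);

/// Bytes needed to store a [rows, cols] tensor in the given format, or
/// None when cols (the K dim) is not a whole number of blocks.
fn tensor_bytes(format: (usize, usize), rows: usize, cols: usize) -> Option<usize> {
    let (block_elems, block_bytes) = format;
    if cols % block_elems != 0 {
        return None;
    }
    Some(rows * (cols / block_elems) * block_bytes)
}
```

For K=896, `tensor_bytes(Q4_K, 4864, 896)` is `None`, `tensor_bytes(F32, 4864, 896)` is 17,432,576 bytes, and `tensor_bytes(Q4_0, 4864, 896)` is 2,451,456 bytes, since 896 is a multiple of 32.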
Tests
`q4k_byte_count_matches_llama_cpp_expectation` asserts `bytes.len() == (rows * cols / 256) * 144 = 2_451_456` for ffn_down.
Test plan
apr publish /tmp/albor-370m-staging paiml/albor-370m-v1 ...
🤖 Generated with Claude Code