fix(apr): per-layer dequant for native q4/q8 tensors (GH-478) #750
Merged
Native APR q4/q8 tensors were eagerly expanded to F32 at load time in apr_load_quantized_tensor, producing a 4x (q8) to 8x (q4) memory blowup: a 32B model's ~16 GB of quantized weights became ~128 GB of F32 working set, an OOM on a 119 GB host.

Fix: when transpose=false, store the raw quantized bytes on OwnedQuantizedTensor and tag them with new constants APR_TYPE_Q4 (128) / APR_TYPE_Q8 (129). Per-tensor scratch dequant happens inside fused_matmul during forward, bounding peak RAM to one tensor's worth of F32 scratch (~566 MB for a 32B model). Conv1D (transpose=true) stays on the legacy dequant→transpose fallback since those models are small.

Dispatch added in three places: CPU (matmul_fused.rs), CUDA (dequantize_weight_for_cuda), and WGPU (dequant_tensor_public). Three falsifiable tests assert the memory invariant and preserve the Conv1D path contract.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
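The per-tensor scratch idea described in this commit can be sketched as follows. This is a minimal illustration, not the code from the PR: only the constant values 128 and 129 come from the commit message. The `QuantTensor` struct, its fields, the single per-tensor `scale`, the nibble order, and the `u32` tag type are all assumptions made for the example (real quantized formats typically carry per-block scales).

```rust
// Sketch only: constant values come from the PR text; everything else is assumed.
const APR_TYPE_Q4: u32 = 128;
const APR_TYPE_Q8: u32 = 129;

/// Simplified stand-in for `OwnedQuantizedTensor` (hypothetical fields).
struct QuantTensor {
    data: Vec<u8>, // raw quantized bytes, kept exactly as loaded
    qtype: u32,    // APR_TYPE_Q4 or APR_TYPE_Q8
    scale: f32,    // ASSUMPTION: one scale per tensor, not per block
    rows: usize,
    cols: usize,
}

/// Expand one tensor into a reusable F32 scratch buffer.
fn dequant_into(t: &QuantTensor, scratch: &mut Vec<f32>) {
    scratch.clear();
    match t.qtype {
        APR_TYPE_Q8 => {
            // One signed byte per weight.
            for &b in &t.data {
                scratch.push((b as i8) as f32 * t.scale);
            }
        }
        APR_TYPE_Q4 => {
            // Two weights per byte, low nibble first, offset by 8 (assumed layout).
            for &b in &t.data {
                scratch.push(((b & 0x0F) as i32 - 8) as f32 * t.scale);
                scratch.push(((b >> 4) as i32 - 8) as f32 * t.scale);
            }
        }
        other => panic!("unsupported qtype {}", other),
    }
}

/// Computes y = W * x, dequantizing W into `scratch` only for this call.
fn fused_matmul(t: &QuantTensor, x: &[f32], scratch: &mut Vec<f32>) -> Vec<f32> {
    assert_eq!(x.len(), t.cols);
    dequant_into(t, scratch);
    assert_eq!(scratch.len(), t.rows * t.cols);
    scratch
        .chunks(t.cols)
        .map(|row| row.iter().zip(x).map(|(w, xi)| w * xi).sum::<f32>())
        .collect()
}
```

Because the same `scratch` buffer is passed to every call, the F32 footprint in this sketch stays at the size of the largest single tensor, while `t.data` remains at its quantized size for the lifetime of the model.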
… (CI gate)

Required-check `ci / gate` was failing on PR #750 because `ci / security` flagged RUSTSEC-2026-0098 / RUSTSEC-2026-0099 in rustls-webpki (published 2026-04-14). Two versions were present in Cargo.lock:

- 0.103.10 (direct path via reqwest/hyper-rustls): bumped to 0.103.12 (fixed)
- 0.101.7 (transitive via legacy AWS SDK: aws-smithy-http-client 1.1.12 → rustls 0.21.12 → rustls-webpki 0.101.7): no upstream fix yet, added to audit.toml + deny.toml ignore list

Direct TLS path is now on the fixed version; legacy AWS chain is ignored with a clear reason, matching the existing pattern for transitive-only advisories.

Also carries an empirical falsification test (`gh478_real_model_load_stays_bounded`) for GH-478, gated on the `GH478_APR_Q4_MODEL` env var so it's skipped by default.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
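An env-gated test of the kind mentioned above might be structured like this. It is a sketch under assumptions: only the variable name `GH478_APR_Q4_MODEL` comes from the commit message. The helper `model_path_from` and the test body are hypothetical, and the actual load and memory measurement are left out because the source does not describe them.

```rust
use std::env;
use std::ffi::OsString;
use std::path::PathBuf;

/// Hypothetical helper: turn the raw env value into a model path,
/// treating an unset or empty variable as "not opted in".
fn model_path_from(value: Option<OsString>) -> Option<PathBuf> {
    value.filter(|v| !v.is_empty()).map(PathBuf::from)
}

#[test]
fn real_model_load_stays_bounded_sketch() {
    // Skipped by default: runs only when the opt-in variable points at a model.
    let path = match model_path_from(env::var_os("GH478_APR_Q4_MODEL")) {
        Some(p) => p,
        None => {
            eprintln!("GH478_APR_Q4_MODEL not set; skipping");
            return;
        }
    };
    assert!(path.exists(), "model file not found: {}", path.display());
    // The real test would load the model here and assert on peak memory;
    // that part is omitted because it depends on the project's loader.
}
```

Returning early keeps the default `cargo test` run green on machines without a multi-gigabyte model file, at the cost of the test reporting "passed" rather than "ignored" when it is skipped.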
…t-gh478 # Conflicts: # .cargo/audit.toml # deny.toml
Summary
Fixes GH-478: native APR q4/q8 tensors were eagerly expanded to F32 at load time, causing a 4× (q8) to 8× (q4) memory blowup (32B model: ~16 GB quantized → ~128 GB F32, OOM on 119 GB hosts).
- When `transpose=false`, `apr_load_quantized_tensor` now stores raw quantized bytes and tags `qtype` with new constants `APR_TYPE_Q4` (128) / `APR_TYPE_Q8` (129). Per-tensor scratch dequant happens inside `fused_matmul` during forward.
- Peak RAM is bounded to one tensor's worth of F32 scratch (~566 MB for a 32B model) rather than `4 × num_params` bytes.
- Conv1D (`transpose=true`) keeps the legacy dequant→transpose fallback since those models are small.
- Dispatch added in three places: CPU (`matmul_fused.rs`), CUDA (`dequantize_weight_for_cuda`), WGPU (`dequant_tensor_public`).

Test plan
- `cargo check -p aprender-serve` clean
- `cargo clippy -p aprender-serve -- -D warnings` clean
- `cargo test -p aprender-serve --lib`: 15,097 passed, 0 failed
- Three falsifiable tests (`gh478_per_layer_dequant_tests`):
  - `apr_q4_load_keeps_raw_bytes_not_f32_expansion`: asserts `tensor.data.len() == raw_bytes.len()`, not `4 × num_elements`
  - `apr_q8_load_keeps_raw_bytes_not_f32_expansion`: same invariant for q8
  - `apr_q4_conv1d_transpose_still_dequants_to_f32`: preserves the Conv1D path contract

🤖 Generated with Claude Code
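The load-time invariant these tests check can be illustrated with a toy loader. This is a hedged sketch, not the project's `apr_load_quantized_tensor`: the function `load_q8`, the `Tensor` struct, the `TYPE_F32` tag value, and the single-scale q8 layout are all hypothetical. Only the `APR_TYPE_Q8` value and the two length invariants come from the PR text.

```rust
const APR_TYPE_Q8: u32 = 129; // value from the PR description
const TYPE_F32: u32 = 0;      // hypothetical tag for already-expanded data

/// Hypothetical loaded tensor: a byte buffer plus a type tag.
struct Tensor {
    data: Vec<u8>,
    qtype: u32,
}

/// Toy loader for a row-major `rows x cols` q8 tensor (one i8 per weight).
fn load_q8(raw: &[u8], scale: f32, rows: usize, cols: usize, transpose: bool) -> Tensor {
    assert_eq!(raw.len(), rows * cols);
    if !transpose {
        // New path: keep the quantized bytes and defer dequant to forward.
        return Tensor { data: raw.to_vec(), qtype: APR_TYPE_Q8 };
    }
    // Legacy Conv1D path: dequantize to F32 and transpose to `cols x rows`.
    let mut out = Vec::with_capacity(4 * raw.len());
    for c in 0..cols {
        for r in 0..rows {
            let v = (raw[r * cols + c] as i8) as f32 * scale;
            out.extend_from_slice(&v.to_le_bytes());
        }
    }
    Tensor { data: out, qtype: TYPE_F32 }
}
```

With `transpose=false` the buffer length equals the raw byte count; with `transpose=true` it is four bytes per element, which is the distinction the three tests pin down.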