fix(apr): per-layer dequant for native q4/q8 tensors (GH-478) by noahgift · Pull Request #750 · paiml/aprender

noahgift · 2026-04-16T16:56:52Z

Summary

Fixes GH-478: native APR q4/q8 tensors were eagerly expanded to F32 (4 bytes per parameter) at load time, causing a memory blowup of roughly 8× for q4 and about 4× for q8 (32B model: ~16 GB quantized → ~128 GB F32, OOM on 119 GB hosts).

- Fix: when transpose=false, apr_load_quantized_tensor now stores raw quantized bytes and tags qtype with new constants APR_TYPE_Q4 (128) / APR_TYPE_Q8 (129). Per-tensor scratch dequant happens inside fused_matmul during forward.
- Memory bound: peak RAM = one tensor's worth of F32 scratch (~566 MB for a 32B model) instead of 4 × num_params bytes.
- Conv1D compat: transpose=true keeps the legacy dequant→transpose fallback since those models are small.
- Dispatch wired in three places: CPU (matmul_fused.rs), CUDA (dequantize_weight_for_cuda), WGPU (dequant_tensor_public).
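The load-time and forward-time behaviour described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in this PR: `QuantTensor`, `load_q8`, `dequant_q8_into`, and `APR_TYPE_F32` are hypothetical names, and the q8 encoding (one little-endian f32 scale followed by one i8 per element) is invented for the example. Only the `APR_TYPE_Q8` value (129) and the transpose=false / transpose=true split come from the description.

```rust
// Hypothetical stand-ins for illustration; not the aprender-serve types.
const APR_TYPE_F32: u32 = 0; // assumed tag for data already expanded to F32
const APR_TYPE_Q8: u32 = 129; // value taken from the PR description

struct QuantTensor {
    data: Vec<u8>, // raw quantized bytes, or F32 bytes on the legacy path
    qtype: u32,
    rows: usize,
    cols: usize,
}

// Assumed q8 encoding: 4-byte little-endian f32 scale, then one i8 per element.
fn dequant_q8_into(raw: &[u8], out: &mut Vec<f32>) {
    let scale = f32::from_le_bytes([raw[0], raw[1], raw[2], raw[3]]);
    out.clear();
    out.extend(raw[4..].iter().map(|&b| scale * (b as i8) as f32));
}

fn load_q8(raw: &[u8], rows: usize, cols: usize, transpose: bool) -> QuantTensor {
    if !transpose {
        // New path: keep the raw bytes and tag them; no F32 expansion at load.
        return QuantTensor { data: raw.to_vec(), qtype: APR_TYPE_Q8, rows, cols };
    }
    // Legacy path: dequantize, transpose, and store as F32 bytes.
    let mut f = Vec::new();
    dequant_q8_into(raw, &mut f);
    let mut t = vec![0.0f32; rows * cols];
    for r in 0..rows {
        for c in 0..cols {
            t[c * rows + r] = f[r * cols + c];
        }
    }
    let data: Vec<u8> = t.iter().flat_map(|v| v.to_le_bytes()).collect();
    QuantTensor { data, qtype: APR_TYPE_F32, rows: cols, cols: rows }
}

// Computes y = W x for a row-major W, dequantizing into a reusable scratch
// buffer so that only one tensor's worth of F32 is alive at a time.
fn fused_matmul(w: &QuantTensor, x: &[f32], scratch: &mut Vec<f32>) -> Vec<f32> {
    match w.qtype {
        APR_TYPE_Q8 => dequant_q8_into(&w.data, scratch),
        _ => {
            scratch.clear();
            scratch.extend(
                w.data
                    .chunks_exact(4)
                    .map(|b| f32::from_le_bytes([b[0], b[1], b[2], b[3]])),
            );
        }
    }
    (0..w.rows)
        .map(|r| {
            scratch[r * w.cols..(r + 1) * w.cols]
                .iter()
                .zip(x)
                .map(|(a, b)| a * b)
                .sum::<f32>()
        })
        .collect()
}
```

Passing the same `scratch` vector to every call is what keeps the F32 footprint bounded by the largest single tensor in this sketch; the resident weights stay at their quantized size.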

Test plan

- `cargo check -p aprender-serve`: clean
- `cargo clippy -p aprender-serve -- -D warnings`: clean
- `cargo test -p aprender-serve --lib`: 15,097 passed, 0 failed
- 3 new falsifiable tests pass (gh478_per_layer_dequant_tests):
  - apr_q4_load_keeps_raw_bytes_not_f32_expansion: asserts tensor.data.len() == raw_bytes.len(), not 4 × num_elements
  - apr_q8_load_keeps_raw_bytes_not_f32_expansion: same invariant for q8
  - apr_q4_conv1d_transpose_still_dequants_to_f32: preserves Conv1D path contract
- End-to-end: load a 32B APR q4 model on the 119 GB host and confirm no OOM (next step; need model on disk)
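The size invariant those tests assert comes down to simple byte arithmetic. The helpers below are a hypothetical sketch, not the test code from this PR: they assume q4 packs two elements per byte and ignore any per-block scale overhead, which is consistent with the ~16 GB figure quoted for a 32B model.

```rust
/// Bytes needed once every element is expanded to F32 (4 bytes each).
fn f32_expanded_bytes(num_elements: u64) -> u64 {
    num_elements * std::mem::size_of::<f32>() as u64
}

/// Assumed q4 footprint: two 4-bit elements per byte, scales ignored.
fn q4_packed_bytes(num_elements: u64) -> u64 {
    (num_elements + 1) / 2
}

/// Hypothetical form of the invariant: the stored buffer must have the size
/// of the raw quantized bytes and must not have the F32-expanded size.
fn keeps_raw_bytes(stored_len: u64, raw_len: u64, num_elements: u64) -> bool {
    stored_len == raw_len && stored_len != f32_expanded_bytes(num_elements)
}
```

For 32 billion parameters, `f32_expanded_bytes` gives 128 × 10⁹ bytes and `q4_packed_bytes` gives 16 × 10⁹ bytes, matching the ~128 GB and ~16 GB figures in the summary.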

🤖 Generated with Claude Code

Native APR q4/q8 tensors were eagerly expanded to F32 at load time in apr_load_quantized_tensor, producing a memory blowup of roughly 8x for q4 (about 4x for q8): a 32B model's ~16 GB of quantized weights became ~128 GB of F32 working set, an OOM on a 119 GB host.

Fix: when transpose=false, store the raw quantized bytes on OwnedQuantizedTensor and tag them with new constants APR_TYPE_Q4 (128) / APR_TYPE_Q8 (129). Per-tensor scratch dequant happens inside fused_matmul during forward, bounding peak RAM to one tensor's worth of F32 scratch (~566 MB for a 32B model). Conv1D (transpose=true) stays on the legacy dequant→transpose fallback since those models are small.

Dispatch added in three places: CPU (matmul_fused.rs), CUDA (dequantize_weight_for_cuda), and WGPU (dequant_tensor_public). Three falsifiable tests assert the memory invariant and preserve the Conv1D path contract.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

… (CI gate)

Required-check `ci / gate` was failing on PR #750 because `ci / security` flagged RUSTSEC-2026-0098 / RUSTSEC-2026-0099 in rustls-webpki (published 2026-04-14). Two versions were present in Cargo.lock:

- 0.103.10 (direct path via reqwest/hyper-rustls): bumped to 0.103.12 (fixed)
- 0.101.7 (transitive via legacy AWS SDK: aws-smithy-http-client 1.1.12 → rustls 0.21.12 → rustls-webpki 0.101.7): no upstream fix yet, added to audit.toml + deny.toml ignore list

Direct TLS path is now on the fixed version; legacy AWS chain is ignored with a clear reason, matching the existing pattern for transitive-only advisories.

Also carries an empirical falsification test (`gh478_real_model_load_stays_bounded`) for GH-478, gated on the `GH478_APR_Q4_MODEL` env var so it's skipped by default.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
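One common way to gate a test on an environment variable is an early return when the variable is unset, sketched below. How `gh478_real_model_load_stays_bounded` actually implements its gate is not shown in this PR page, so the helper `gated_model_path` and the test body are assumptions for illustration; only the variable name `GH478_APR_Q4_MODEL` comes from the commit message.

```rust
use std::env;
use std::path::PathBuf;

/// Returns the model path when the gate variable is set and non-empty,
/// otherwise None so the caller can skip. Hypothetical helper.
fn gated_model_path(var: &str) -> Option<PathBuf> {
    match env::var(var) {
        Ok(v) if !v.is_empty() => Some(PathBuf::from(v)),
        _ => None,
    }
}

#[test]
fn real_model_load_stays_bounded_sketch() {
    // Skipped by default: without the variable the test returns early and passes.
    let Some(path) = gated_model_path("GH478_APR_Q4_MODEL") else {
        eprintln!("GH478_APR_Q4_MODEL not set; skipping");
        return;
    };
    // A real test would load the model here and assert on memory use;
    // this sketch only checks that the configured file exists.
    assert!(path.exists(), "model file not found: {}", path.display());
}
```

An early return reports the test as passed rather than ignored, so the skip message on stderr is the only signal that the real check did not run.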

noahgift and others added 3 commits April 16, 2026 18:56

Merge remote-tracking branch 'origin/main' into feat/per-layer-dequan…

2795db6

…t-gh478 # Conflicts: # .cargo/audit.toml # deny.toml

noahgift mentioned this pull request Apr 16, 2026

Refactor APR CPU path to use OwnedQuantizedModel::generate (follow-up to GH-478) #752

Closed

4 tasks

noahgift merged commit 7f74543 into main Apr 16, 2026
10 checks passed

noahgift deleted the feat/per-layer-dequant-gh478 branch April 16, 2026 17:50

noahgift merged 3 commits into main from feat/per-layer-dequant-gh478
