
fix(apr): per-layer dequant for native q4/q8 tensors (GH-478) #750

Merged: noahgift merged 3 commits into main from feat/per-layer-dequant-gh478 on Apr 16, 2026

Summary

Fixes GH-478: native APR q4/q8 tensors were eagerly expanded to F32 at load time, causing a memory blowup of roughly 4× for q8 and 8× for q4 (32B model: ~16 GB quantized → ~128 GB F32, OOM on 119 GB hosts).
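
The figures above follow from simple bytes-per-parameter arithmetic. The sketch below reproduces them under two assumptions that are not stated in the PR: "32B" is taken as exactly 32 × 10^9 parameters, and per-block scales or other metadata are ignored, so real files would be somewhat larger. `weights_gb` is a hypothetical helper, not part of the codebase.

```rust
/// Approximate weight footprint in GB (10^9 bytes) for a given bit width.
/// Assumption: no per-block scales or metadata are counted.
fn weights_gb(num_params: u64, bits_per_param: u64) -> f64 {
    (num_params * bits_per_param) as f64 / 8.0 / 1e9
}

fn main() {
    // Assumption: "32B" means exactly 32e9 parameters.
    let params: u64 = 32_000_000_000;
    let q4 = weights_gb(params, 4);
    let q8 = weights_gb(params, 8);
    let f32_full = weights_gb(params, 32);
    println!("q4  ~ {:.0} GB", q4);
    println!("q8  ~ {:.0} GB", q8);
    println!("f32 ~ {:.0} GB", f32_full);
    println!("f32/q4 = {:.0}x, f32/q8 = {:.0}x", f32_full / q4, f32_full / q8);
}
```

Under these assumptions the eager F32 expansion alone exceeds the 119 GB of host memory, before any activations or KV cache are allocated.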

  • Fix: when transpose=false, apr_load_quantized_tensor now stores raw quantized bytes and tags qtype with new constants APR_TYPE_Q4 (128) / APR_TYPE_Q8 (129). Per-tensor scratch dequant happens inside fused_matmul during forward.
  • Memory bound: peak RAM = one tensor's worth of F32 scratch (~566 MB for a 32B model) instead of 4 × num_params bytes.
  • Conv1D compat: transpose=true keeps the legacy dequant→transpose fallback since those models are small.
  • Dispatch wired in three places: CPU (matmul_fused.rs), CUDA (dequantize_weight_for_cuda), WGPU (dequant_tensor_public).
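
The idea behind the bullets above can be sketched roughly as follows. Only the constant names and values (`APR_TYPE_Q4` = 128, `APR_TYPE_Q8` = 129) come from this PR; the `QTensor` struct, the single per-tensor `scale`, the nibble encoding, and the function names are hypothetical simplifications, and the real APR layout and `fused_matmul` are likely more involved.

```rust
// Constant names and values are from the PR; everything else is illustrative.
const APR_TYPE_Q4: u32 = 128;
const APR_TYPE_Q8: u32 = 129;

/// Hypothetical stand-in for a tensor that keeps its raw quantized bytes.
struct QTensor {
    qtype: u32,
    data: Vec<u8>, // raw quantized bytes, never expanded at load time
    scale: f32,    // assumption: one scale per tensor (real formats use blocks)
    rows: usize,
    cols: usize,
}

/// Dequantize one tensor into a reusable F32 scratch buffer.
fn dequant_into(t: &QTensor, scratch: &mut Vec<f32>) {
    scratch.clear();
    match t.qtype {
        APR_TYPE_Q8 => {
            // Assumption: each byte is a signed 8-bit value.
            scratch.extend(t.data.iter().map(|&b| (b as i8) as f32 * t.scale));
        }
        APR_TYPE_Q4 => {
            // Assumption: two 4-bit values per byte, low nibble first, offset by 8.
            for &b in &t.data {
                scratch.push(((b & 0x0F) as i32 - 8) as f32 * t.scale);
                scratch.push(((b >> 4) as i32 - 8) as f32 * t.scale);
            }
        }
        other => panic!("unsupported qtype {other}"),
    }
}

/// Matrix-vector product that dequantizes only the tensor it is about to use.
fn fused_matvec_sketch(t: &QTensor, x: &[f32], scratch: &mut Vec<f32>) -> Vec<f32> {
    dequant_into(t, scratch);
    (0..t.rows)
        .map(|r| {
            scratch[r * t.cols..(r + 1) * t.cols]
                .iter()
                .zip(x)
                .map(|(w, xi)| w * xi)
                .sum::<f32>()
        })
        .collect()
}

fn main() {
    let t = QTensor { qtype: APR_TYPE_Q8, data: vec![1, 2, 3, 4], scale: 0.5, rows: 2, cols: 2 };
    let mut scratch = Vec::new(); // reused across layers, so peak F32 is one tensor
    let y = fused_matvec_sketch(&t, &[1.0, 1.0], &mut scratch);
    println!("{:?}", y);
}
```

Because the scratch buffer is cleared and refilled per call, the F32 working set in this sketch never exceeds the largest single tensor, which is the memory bound the PR describes.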

Test plan

  • cargo check -p aprender-serve clean
  • cargo clippy -p aprender-serve -- -D warnings clean
  • cargo test -p aprender-serve --lib — 15,097 passed, 0 failed
  • 3 new falsifiable tests pass (gh478_per_layer_dequant_tests):
    • apr_q4_load_keeps_raw_bytes_not_f32_expansion — asserts tensor.data.len() == raw_bytes.len(), not 4 × num_elements
    • apr_q8_load_keeps_raw_bytes_not_f32_expansion — same invariant for q8
    • apr_q4_conv1d_transpose_still_dequants_to_f32 — preserves Conv1D path contract
  • End-to-end: load a 32B APR q4 model on the 119 GB host and confirm no OOM (next step; need model on disk)
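
The invariant those three tests check can be illustrated with a minimal sketch. `LoadedTensor`, `load_sketch`, and the `TYPE_F32` tag are hypothetical stand-ins for `OwnedQuantizedTensor` and `apr_load_quantized_tensor`, whose real signatures are not shown in this PR; the transposed path here only models the buffer size, not the actual values.

```rust
const APR_TYPE_Q4: u32 = 128; // value from the PR
const TYPE_F32: u32 = 0;      // assumed tag for illustration

/// Hypothetical stand-in for the loaded tensor type.
struct LoadedTensor {
    qtype: u32,
    data: Vec<u8>,
}

/// Hypothetical loader: keeps raw bytes unless a transpose is requested.
fn load_sketch(raw: &[u8], num_elements: usize, transpose: bool) -> LoadedTensor {
    if transpose {
        // Legacy Conv1D path: expand to F32 (4 bytes per element).
        // Values are elided; only the resulting size matters for this sketch.
        LoadedTensor { qtype: TYPE_F32, data: vec![0u8; 4 * num_elements] }
    } else {
        // New path: store the quantized bytes untouched.
        LoadedTensor { qtype: APR_TYPE_Q4, data: raw.to_vec() }
    }
}

fn main() {
    let raw = vec![0u8; 16]; // 32 q4 elements packed two per byte
    let kept = load_sketch(&raw, 32, false);
    let expanded = load_sketch(&raw, 32, true);
    println!("qtype {} holds {} bytes", kept.qtype, kept.data.len());
    println!("qtype {} holds {} bytes", expanded.qtype, expanded.data.len());
}
```

A test that asserts on the stored length rather than on dequantized values fails as soon as an eager F32 expansion is reintroduced, which is what makes it falsifiable.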

🤖 Generated with Claude Code

noahgift and others added 3 commits April 16, 2026 18:56
Native APR q4/q8 tensors were eagerly expanded to F32 at load time in
apr_load_quantized_tensor, producing a 4x memory blowup: a 32B model's
~16 GB of quantized weights became ~128 GB of F32 working set — OOM on
a 119 GB host. Fix: when transpose=false, store the raw quantized bytes
on OwnedQuantizedTensor and tag them with new constants APR_TYPE_Q4
(128) / APR_TYPE_Q8 (129). Per-tensor scratch dequant happens inside
fused_matmul during forward, bounding peak RAM to one tensor's worth of
F32 scratch (~566 MB for a 32B model). Conv1D (transpose=true) stays on
the legacy dequant→transpose fallback since those models are small.

Dispatch added in three places: CPU (matmul_fused.rs), CUDA
(dequantize_weight_for_cuda), and WGPU (dequant_tensor_public). Three
falsifiable tests assert the memory invariant and preserve the Conv1D
path contract.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
… (CI gate)

Required-check `ci / gate` was failing on PR #750 because `ci / security`
flagged RUSTSEC-2026-0098 / RUSTSEC-2026-0099 in rustls-webpki (published
2026-04-14). Two versions were present in Cargo.lock:

- 0.103.10 (direct path via reqwest/hyper-rustls): bumped to 0.103.12 (fixed)
- 0.101.7  (transitive via legacy AWS SDK: aws-smithy-http-client 1.1.12
           → rustls 0.21.12 → rustls-webpki 0.101.7): no upstream fix yet,
           added to audit.toml + deny.toml ignore list

Direct TLS path is now on the fixed version; legacy AWS chain is ignored
with a clear reason, matching the existing pattern for transitive-only
advisories. Also carries an empirical falsification test
(`gh478_real_model_load_stays_bounded`) for GH-478 — gated on
`GH478_APR_Q4_MODEL` env var so it's skipped by default.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
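
One common way to gate a test on an environment variable, as described for `gh478_real_model_load_stays_bounded`, is to return early when the variable is unset. The sketch below shows that pattern only; the helper names are hypothetical, the actual test body is not shown in this PR, and in a real suite the check would carry `#[test]` and perform the load and memory assertions.

```rust
use std::env;
use std::path::PathBuf;

/// Returns the model path if the variable is set and non-empty; `None` means "skip".
fn model_path_from_env(var: &str) -> Option<PathBuf> {
    env::var_os(var).filter(|v| !v.is_empty()).map(PathBuf::from)
}

/// Hypothetical gated check. Returns `true` only if it actually ran.
fn run_gated_check(var: &str) -> bool {
    let Some(path) = model_path_from_env(var) else {
        eprintln!("{var} not set; skipping real-model check");
        return false;
    };
    // Placeholder for the real work: load the model at `path` and assert
    // that resident memory stays bounded. Not implemented in this sketch.
    println!("would load model from {}", path.display());
    true
}

fn main() {
    // The variable name is taken from the commit message above.
    let ran = run_gated_check("GH478_APR_Q4_MODEL");
    println!("check ran: {ran}");
}
```

Skipping by early return keeps the default `cargo test` run green on machines without the model, at the cost that a skipped test is reported as passed rather than ignored.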
…t-gh478

# Conflicts:
#	.cargo/audit.toml
#	deny.toml
@noahgift noahgift merged commit 7f74543 into main Apr 16, 2026
10 checks passed
@noahgift noahgift deleted the feat/per-layer-dequant-gh478 branch April 16, 2026 17:50

Linked issue: Per-layer F32 dequantization for CPU inference (32B OOM on 119GB)