Problem
apr quantize loads the entire model into memory via fs::read() + dequant to f32, requiring ~3x file size in RAM. For the Qwen3-Coder-30B-A3B teacher model (57 GB F16 APR), this requires ~170 GB RAM and OOMs on a 128 GB machine.
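The ~170 GB figure can be reproduced with a quick back-of-the-envelope calculation. This is a rough sketch, not code from the repository: `full_load_peak_bytes` is a hypothetical helper, and it assumes the whole file is F16 tensor data (2 bytes per element) and ignores header and metadata overhead.

```rust
/// Bytes per element in the F16 source file.
const F16_BYTES: u64 = 2;
/// Bytes per element after dequantizing to f32.
const F32_BYTES: u64 = 4;

/// Estimated peak RAM of the full-load path (hypothetical helper):
/// the raw file buffer from fs::read() plus an f32 copy of every element.
/// Assumes the file is entirely F16 tensor data.
fn full_load_peak_bytes(file_bytes: u64) -> u64 {
    let elements = file_bytes / F16_BYTES;
    file_bytes + elements * F32_BYTES
}

/// True when the estimated peak exceeds the machine's RAM.
fn exceeds_ram(file_bytes: u64, ram_bytes: u64) -> bool {
    full_load_peak_bytes(file_bytes) > ram_bytes
}
```

For a 57 GB input this gives 57 GB + 114 GB = 171 GB, which is where the ~3x multiplier comes from and why a 128 GB machine cannot hold it.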
Five Whys
- Why does quantize OOM? It needs ~170 GB RAM for a 57 GB model
- Why does it need 170 GB? load_model_tensors() calls fs::read() (57 GB) then dequants every tensor to f32 (113 GB) into a BTreeMap
- Why does it load everything at once? apr_convert() was designed for small models (<2 GB) where full-load is fine
- Why wasn't streaming considered? The quantize path predates MoE/30B+ model support; all prior models fit in RAM
- Why is this blocking now? ALB-010 Step 7: Qwen3-Coder-30B-A3B teacher model must be Q4K quantized to fit in 24 GB VRAM for distillation
Root Cause
src/format/converter/mod.rs:351 — load_model_tensors() is monolithic: reads entire file, dequants all tensors to f32, returns BTreeMap<String, (Vec<f32>, Vec<usize>)>. No streaming/iterator API exists.
Even --plan mode calls this path, causing 88 GB RSS for a plan-only estimation.
Solution
Streaming quantize: read one tensor at a time from APR v2 (mmap index lookup), quantize to Q4K, write to output via AprV2StreamingWriter. Peak memory = single largest tensor (~26 MB for MoE expert weights), not the full model.
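The shape of the streaming loop might look roughly like the sketch below. It uses only std: `TensorEntry` and `stream_convert` are hypothetical names, a seekable reader stands in for the mmap, any `Write` stands in for AprV2StreamingWriter, and the quantizer is passed in as a closure because the real Q4K kernel is out of scope here.

```rust
use std::io::{self, Read, Seek, SeekFrom, Write};

/// Hypothetical index entry: where one tensor's bytes live in the input file.
struct TensorEntry {
    name: String,
    offset: u64,
    len: usize,
}

/// Streams tensors one at a time: read one tensor, transform it, write it,
/// then reuse the same input buffer for the next tensor.
/// Returns the largest input buffer size seen, which bounds the data footprint
/// of the read side (the transformed output of one tensor is also live briefly).
fn stream_convert<R, W, F>(
    input: &mut R,
    index: &[TensorEntry],
    output: &mut W,
    mut quantize: F,
) -> io::Result<usize>
where
    R: Read + Seek,
    W: Write,
    F: FnMut(&str, &[u8]) -> Vec<u8>,
{
    let mut buf: Vec<u8> = Vec::new();
    let mut peak = 0usize;
    for entry in index {
        // Resize the shared buffer to exactly this tensor's byte length.
        buf.resize(entry.len, 0);
        input.seek(SeekFrom::Start(entry.offset))?;
        input.read_exact(&mut buf)?;
        // Placeholder for the per-tensor Q4K quantization step.
        let packed = quantize(entry.name.as_str(), &buf[..]);
        output.write_all(&packed)?;
        peak = peak.max(buf.len());
    }
    Ok(peak)
}
```

Because only one tensor is resident at a time, the returned peak tracks the largest tensor rather than the sum of all tensors, which is the property the solution relies on.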
Implementation
- Add AprV2StreamingReader: mmap-based, yields (name, shape, &[u8]) one tensor at a time
- Add streaming_quantize_q4k() in converter: reads via streaming reader, quantizes per-tensor, writes via AprV2StreamingWriter
- Wire into run_apr_quantize() when input exceeds size threshold (e.g., >4 GB)
- Fix --plan mode to use index-only metadata scan (no tensor data load)
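The last two items (size-based dispatch and an index-only plan) could be sketched as follows. All names here are hypothetical, the 4 GB threshold is just the example value from the list above, and the size estimate assumes the GGML-style Q4_K layout of 144 bytes per 256-element super-block; APR's Q4K encoding may differ, so treat the constants as placeholders.

```rust
/// Example threshold above which the streaming path would be used (>4 GB).
const STREAMING_THRESHOLD_BYTES: u64 = 4 * 1024 * 1024 * 1024;

#[derive(Debug, PartialEq)]
enum QuantizePath {
    FullLoad,
    Streaming,
}

/// Picks the quantize path from the input file size alone.
fn choose_path(input_bytes: u64) -> QuantizePath {
    if input_bytes > STREAMING_THRESHOLD_BYTES {
        QuantizePath::Streaming
    } else {
        QuantizePath::FullLoad
    }
}

/// Assumed Q4_K block layout: 256 elements packed into 144 bytes.
const Q4K_BLOCK_ELEMS: u64 = 256;
const Q4K_BLOCK_BYTES: u64 = 144;

/// Plan-only output size estimate computed from tensor shapes in the index.
/// No tensor data is read; each tensor is rounded up to whole blocks.
fn plan_q4k_bytes(shapes: &[Vec<u64>]) -> u64 {
    shapes
        .iter()
        .map(|shape| {
            let elems: u64 = shape.iter().product();
            let blocks = (elems + Q4K_BLOCK_ELEMS - 1) / Q4K_BLOCK_ELEMS;
            blocks * Q4K_BLOCK_BYTES
        })
        .sum()
}
```

Since the plan needs only shapes, its memory cost scales with the number of index entries (18,867 for this model), not with the tensor data.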
Acceptance Criteria
- apr quantize --scheme q4k works on 57 GB APR file with <1 GB peak RSS
- apr quantize --plan works with <100 MB peak RSS
Context
- Blocks: ALB-010 Step 7 (teacher Q4K quantization)
- Model: Qwen3-Coder-30B-A3B-Instruct, 18,867 tensors, 56.9 GB F16
- Machine: 128 GB RAM, RTX 4090 24 GB VRAM