ALB-093: Streaming quantize for large models (57+ GB OOM) #434

@noahgift

Problem

apr quantize loads the entire model into memory via fs::read() + dequant to f32, requiring ~3x file size in RAM. For the Qwen3-Coder-30B-A3B teacher model (57 GB F16 APR), this requires ~170 GB RAM and OOMs on a 128 GB machine.
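The ~170 GB figure follows from simple arithmetic, sketched below. This is only an illustration: the function names are hypothetical, and it assumes F16 source data (2 bytes per element) expanded to f32 (4 bytes per element) while the raw file buffer is still resident.

```rust
/// Rough peak-RAM estimate for the full-load path.
/// Assumption: the raw file buffer and the dequantized f32 tensors
/// are both held in memory at the same time.
pub fn full_load_peak_bytes(file_bytes: u64) -> u64 {
    let raw_buffer = file_bytes; // result of fs::read()
    let f32_tensors = file_bytes * 2; // each f16 element doubles in size as f32
    raw_buffer + f32_tensors
}

/// True if the full-load path would fit in the given amount of RAM.
pub fn fits_in_ram(file_bytes: u64, ram_bytes: u64) -> bool {
    full_load_peak_bytes(file_bytes) <= ram_bytes
}
```

For a 57 GB file this gives 171 GB, which exceeds the 128 GB available on the target machine.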

Five Whys

  1. Why does quantize OOM? It needs ~170 GB RAM for a 57 GB model.
  2. Why does it need 170 GB? load_model_tensors() calls fs::read() (57 GB), then dequantizes every tensor to f32 (113 GB) into a BTreeMap.
  3. Why does it load everything at once? apr_convert() was designed for small models (<2 GB) where full-load is fine.
  4. Why wasn't streaming considered? The quantize path predates MoE/30B+ model support; all prior models fit in RAM.
  5. Why is this blocking now? ALB-010 Step 7: the Qwen3-Coder-30B-A3B teacher model must be Q4K quantized to fit in 24 GB VRAM for distillation.

Root Cause

In src/format/converter/mod.rs:351, load_model_tensors() is monolithic: it reads the entire file, dequantizes all tensors to f32, and returns BTreeMap<String, (Vec<f32>, Vec<usize>)>. No streaming/iterator API exists.

Even --plan mode calls this path, causing 88 GB RSS for a plan-only estimation.

Solution

Streaming quantize: read one tensor at a time from APR v2 (mmap index lookup), quantize to Q4K, write to output via AprV2StreamingWriter. Peak memory = single largest tensor (~26 MB for MoE expert weights), not the full model.
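The read-quantize-write loop could look roughly like the sketch below. It is not the APR implementation: TensorEntry and stream_convert are hypothetical names, the quantizer is passed in as a closure rather than being a real Q4K kernel, and plain seek + read_exact stands in for the mmap index lookup so the example needs only the standard library.

```rust
use std::io::{self, Read, Seek, SeekFrom, Write};

/// Hypothetical index entry: where one tensor's bytes live in the input file.
pub struct TensorEntry {
    pub name: String,
    pub offset: u64,
    pub len: usize,
}

/// Streams tensors one at a time: read, transform, write.
/// Returns the size in bytes of the largest input tensor held at any point.
pub fn stream_convert<R, W, F>(
    input: &mut R,
    index: &[TensorEntry],
    output: &mut W,
    mut quantize: F,
) -> io::Result<usize>
where
    R: Read + Seek,
    W: Write,
    F: FnMut(&str, &[u8]) -> Vec<u8>,
{
    let mut buf = Vec::new(); // reused, so its capacity tracks the largest tensor
    let mut peak = 0;
    for entry in index {
        buf.resize(entry.len, 0);
        input.seek(SeekFrom::Start(entry.offset))?;
        input.read_exact(&mut buf)?;
        // Only this tensor's input and output buffers are alive here.
        let packed = quantize(&entry.name, &buf);
        output.write_all(&packed)?;
        peak = peak.max(buf.len());
    }
    Ok(peak)
}
```

Because the input buffer is reused and each output buffer is dropped at the end of its iteration, resident memory in this sketch is bounded by the largest tensor plus its quantized form, independent of the total file size.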

Implementation

  1. Add AprV2StreamingReader — mmap-based, yields (name, shape, &[u8]) one tensor at a time
  2. Add streaming_quantize_q4k() in converter — reads via streaming reader, quantizes per-tensor, writes via AprV2StreamingWriter
  3. Wire into run_apr_quantize() when input exceeds size threshold (e.g., >4 GB)
  4. Fix --plan mode to use index-only metadata scan (no tensor data load)
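Steps 3 and 4 can be illustrated with a minimal sketch. All names here (QuantizePath, select_path, TensorMeta, Plan, plan_from_index) are hypothetical. The 4 GB example threshold is taken as 4 GiB, and the output estimate assumes a GGML-style Q4_K layout of 144 bytes per 256-element super-block, with every tensor quantized and padded to a whole block. The actual APR Q4K layout may differ.

```rust
/// Which quantize path to run. The enum and function names are hypothetical.
#[derive(Debug, PartialEq, Eq)]
pub enum QuantizePath {
    FullLoad,
    Streaming,
}

/// Example threshold from step 3, taken here as 4 GiB.
pub const STREAMING_THRESHOLD_BYTES: u64 = 4 * 1024 * 1024 * 1024;

pub fn select_path(input_bytes: u64) -> QuantizePath {
    if input_bytes > STREAMING_THRESHOLD_BYTES {
        QuantizePath::Streaming
    } else {
        QuantizePath::FullLoad
    }
}

/// Per-tensor metadata as it might appear in the file index (hypothetical fields).
pub struct TensorMeta {
    pub name: String,
    pub shape: Vec<usize>,
    pub bytes_per_elem: u64,
}

#[derive(Debug, PartialEq, Eq)]
pub struct Plan {
    pub tensors: usize,
    pub input_bytes: u64,
    pub q4k_bytes: u64,
}

// Assumption: GGML-style Q4_K, 144 bytes per 256-element super-block.
const Q4K_BLOCK_ELEMS: u64 = 256;
const Q4K_BLOCK_BYTES: u64 = 144;

/// Builds a size estimate from index metadata only; no tensor data is read.
pub fn plan_from_index(index: &[TensorMeta]) -> Plan {
    let mut plan = Plan { tensors: index.len(), input_bytes: 0, q4k_bytes: 0 };
    for t in index {
        let elems: u64 = t.shape.iter().map(|&d| d as u64).product();
        plan.input_bytes += elems * t.bytes_per_elem;
        // Round up to whole super-blocks.
        let blocks = (elems + Q4K_BLOCK_ELEMS - 1) / Q4K_BLOCK_ELEMS;
        plan.q4k_bytes += blocks * Q4K_BLOCK_BYTES;
    }
    plan
}
```

Since plan_from_index touches only names, shapes, and element sizes, its memory use scales with the number of tensors rather than with the size of the model.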

Acceptance Criteria

  • apr quantize --scheme q4k works on 57 GB APR file with <1 GB peak RSS
  • apr quantize --plan works with <100 MB peak RSS
  • Existing small-model quantize path unchanged
  • Output matches non-streaming path bit-for-bit
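The bit-for-bit criterion can be checked without loading either output into memory. The sketch below is one way to do it, under the assumption that both outputs are available as readers (for example, two opened files); streams_identical and fill are hypothetical helper names, not part of apr.

```rust
use std::io::{self, Read};

/// Reads until `buf` is full or EOF is reached; returns the number of bytes read.
fn fill<R: Read>(r: &mut R, buf: &mut [u8]) -> io::Result<usize> {
    let mut n = 0;
    while n < buf.len() {
        match r.read(&mut buf[n..]) {
            Ok(0) => break, // EOF
            Ok(k) => n += k,
            Err(e) if e.kind() == io::ErrorKind::Interrupted => continue,
            Err(e) => return Err(e),
        }
    }
    Ok(n)
}

/// Compares two byte streams in fixed-size chunks.
/// Memory use is two 64 KiB buffers regardless of stream length.
pub fn streams_identical<A: Read, B: Read>(mut a: A, mut b: B) -> io::Result<bool> {
    let mut buf_a = vec![0u8; 64 * 1024];
    let mut buf_b = vec![0u8; 64 * 1024];
    loop {
        let na = fill(&mut a, &mut buf_a)?;
        let nb = fill(&mut b, &mut buf_b)?;
        if na != nb || buf_a[..na] != buf_b[..nb] {
            return Ok(false);
        }
        if na == 0 {
            return Ok(true); // both streams ended at the same point
        }
    }
}
```

A regression test could run both quantize paths on a small model and pass the two output files, wrapped in std::fs::File, to this function.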

Context

  • Blocks: ALB-010 Step 7 (teacher Q4K quantization)
  • Model: Qwen3-Coder-30B-A3B-Instruct, 18,867 tensors, 56.9 GB F16
  • Machine: 128 GB RAM, RTX 4090 24 GB VRAM

Metadata

Labels: bug, performance
