ALB-093: Streaming quantize for large models (57+ GB OOM)

## Problem

`apr quantize` loads the entire model into memory via `fs::read()` and then dequantizes every tensor to f32, so peak RAM is ~3x the file size. For the Qwen3-Coder-30B-A3B teacher model (57 GB F16 APR), that is ~170 GB, which OOMs on a 128 GB machine.

## Five Whys

1. **Why does quantize OOM?** — It needs ~170 GB RAM for a 57 GB model
2. **Why does it need 170 GB?** — `load_model_tensors()` calls `fs::read()` (57 GB) then dequants every tensor to f32 (113 GB) into a `BTreeMap`
3. **Why does it load everything at once?** — `apr_convert()` was designed for small models (<2 GB) where full-load is fine
4. **Why wasn't streaming considered?** — The quantize path predates MoE/30B+ model support; all prior models fit in RAM
5. **Why is this blocking now?** — ALB-010 Step 7: Qwen3-Coder-30B-A3B teacher model must be Q4K quantized to fit in 24 GB VRAM for distillation
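
The arithmetic behind the ~170 GB figure can be sketched as follows. This is a rough estimate only: it assumes every tensor in the file is F16 and ignores headers, metadata, and allocator overhead. The function names are illustrative and do not exist in the codebase.

```rust
/// Bytes per element for the two dtypes involved in the estimate.
const F16_BYTES: u64 = 2;
const F32_BYTES: u64 = 4;

/// Estimated peak RAM (in bytes) of the full-load path for an F16 file:
/// the raw file buffer from `fs::read()` plus an f32 copy of every tensor.
/// Assumes the file is all F16 tensor data (no header or metadata).
fn full_load_peak_bytes(f16_file_bytes: u64) -> u64 {
    let elements = f16_file_bytes / F16_BYTES;
    let f32_copy = elements * F32_BYTES;
    f16_file_bytes + f32_copy
}

/// Convert bytes to decimal gigabytes for display.
fn to_gb(bytes: u64) -> f64 {
    bytes as f64 / 1e9
}
```

For a 57 GB input this gives 57 GB for the raw buffer plus 114 GB for the f32 copy, in line with the ~170 GB quoted above and well past the 128 GB available.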

## Root Cause

`src/format/converter/mod.rs:351` — `load_model_tensors()` is monolithic: reads entire file, dequants all tensors to f32, returns `BTreeMap<String, (Vec<f32>, Vec<usize>)>`. No streaming/iterator API exists.

Even `--plan` mode calls this path, causing 88 GB RSS for a plan-only estimation.

## Solution

Streaming quantize: read one tensor at a time from APR v2 (mmap index lookup), quantize it to Q4K, and write it to the output via `AprV2StreamingWriter`. Peak memory is then bounded by the single largest tensor (~26 MB for MoE expert weights) rather than by the full model.
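
A minimal sketch of the per-tensor loop, using only `std::io` so it stays self-contained. `TensorEntry` and `stream_tensors` are hypothetical names, the real APR v2 index layout is not assumed, and a seekable reader stands in for the mmap. The `transform` closure is a placeholder for the actual Q4K quantizer.

```rust
use std::io::{self, Read, Seek, SeekFrom, Write};

/// Hypothetical index entry: where one tensor's bytes live in the input.
pub struct TensorEntry {
    pub name: String,
    pub offset: u64,
    pub len: usize,
}

/// Process tensors one at a time: read, transform, write, then reuse the buffer.
/// Returns the largest input buffer length used, a proxy for peak input memory.
pub fn stream_tensors<R, W, F>(
    index: &[TensorEntry],
    input: &mut R,
    output: &mut W,
    mut transform: F,
) -> io::Result<usize>
where
    R: Read + Seek,
    W: Write,
    F: FnMut(&str, &[u8]) -> Vec<u8>,
{
    let mut buf: Vec<u8> = Vec::new();
    let mut peak = 0usize;
    for entry in index {
        // Only one tensor's raw bytes are resident at any time.
        buf.resize(entry.len, 0);
        input.seek(SeekFrom::Start(entry.offset))?;
        input.read_exact(&mut buf)?;
        peak = peak.max(buf.len());
        // Placeholder for dequantize + Q4K quantize of this tensor.
        let out = transform(&entry.name, &buf);
        output.write_all(&out)?;
    }
    Ok(peak)
}
```

The returned peak covers only the input buffer. A real implementation would also hold the transformed output for the current tensor (and possibly an f32 intermediate), so actual peak memory is a small multiple of the largest tensor, not exactly its size.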

### Implementation

1. Add `AprV2StreamingReader` — mmap-based, yields `(name, shape, &[u8])` one tensor at a time
2. Add `streaming_quantize_q4k()` in converter — reads via streaming reader, quantizes per-tensor, writes via `AprV2StreamingWriter`
3. Wire into `run_apr_quantize()` when input exceeds size threshold (e.g., >4 GB)
4. Fix `--plan` mode to use index-only metadata scan (no tensor data load)
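
Steps 3 and 4 can be sketched without touching tensor data at all. The block below assumes that APR's Q4K scheme uses the same super-block layout as GGML's Q4_K (256 weights in 144 bytes); that is an assumption, not something confirmed here. `TensorMeta`, `plan_q4k_bytes`, and `use_streaming_path` are hypothetical names, and the 4 GiB cutoff is just the example value from step 3.

```rust
/// Q4_K super-block layout as used by GGML: 256 weights packed into 144 bytes.
/// Assumed (not confirmed) to match the APR Q4K scheme.
const Q4K_BLOCK_ELEMS: u64 = 256;
const Q4K_BLOCK_BYTES: u64 = 144;

/// Example cutoff taken from the ">4 GB" suggestion; the real value is undecided.
const STREAMING_THRESHOLD_BYTES: u64 = 4 * 1024 * 1024 * 1024;

/// Hypothetical metadata record read from the tensor index only.
pub struct TensorMeta {
    pub name: String,
    pub shape: Vec<u64>,
}

/// Number of elements in a tensor; an empty shape is treated as a scalar.
pub fn element_count(shape: &[u64]) -> u64 {
    shape.iter().product()
}

/// Estimated Q4K size of one tensor, rounded up to whole super-blocks.
pub fn q4k_bytes(elements: u64) -> u64 {
    let blocks = (elements + Q4K_BLOCK_ELEMS - 1) / Q4K_BLOCK_ELEMS;
    blocks * Q4K_BLOCK_BYTES
}

/// Plan-only estimate: total output bytes computed from shapes alone.
pub fn plan_q4k_bytes(index: &[TensorMeta]) -> u64 {
    index.iter().map(|t| q4k_bytes(element_count(&t.shape))).sum()
}

/// Choose the streaming path for inputs above the size threshold.
pub fn use_streaming_path(input_file_bytes: u64) -> bool {
    input_file_bytes > STREAMING_THRESHOLD_BYTES
}
```

Because the estimate depends only on names and shapes, its memory cost scales with the number of tensors (18,867 here) rather than with the file size. A real plan would likely also need to account for any tensors the quantizer keeps at higher precision.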

### Acceptance Criteria

- [ ] `apr quantize --scheme q4k` works on 57 GB APR file with <1 GB peak RSS
- [ ] `apr quantize --plan` works with <100 MB peak RSS
- [ ] Existing small-model quantize path unchanged
- [ ] Output matches non-streaming path bit-for-bit
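
One way to check the bit-for-bit criterion without loading either output into memory is a chunked comparison. This is a sketch rather than an existing test helper: `first_mismatch` is a hypothetical name, and it would be run on a model small enough for the non-streaming path to complete.

```rust
use std::io::{self, Read};

/// Fill `buf` as far as possible; returns fewer than `buf.len()` bytes only at EOF.
fn read_full<R: Read>(r: &mut R, buf: &mut [u8]) -> io::Result<usize> {
    let mut filled = 0;
    while filled < buf.len() {
        match r.read(&mut buf[filled..]) {
            Ok(0) => break,
            Ok(n) => filled += n,
            Err(e) if e.kind() == io::ErrorKind::Interrupted => continue,
            Err(e) => return Err(e),
        }
    }
    Ok(filled)
}

/// Compare two byte streams in fixed-size chunks.
/// Returns Ok(None) if they are identical, or Ok(Some(offset)) of the first
/// differing byte (a length mismatch reports the shorter stream's length).
pub fn first_mismatch<A: Read, B: Read>(mut a: A, mut b: B) -> io::Result<Option<u64>> {
    const CHUNK: usize = 64 * 1024;
    let mut buf_a = vec![0u8; CHUNK];
    let mut buf_b = vec![0u8; CHUNK];
    let mut offset = 0u64;
    loop {
        let na = read_full(&mut a, &mut buf_a)?;
        let nb = read_full(&mut b, &mut buf_b)?;
        let common = na.min(nb);
        let diff = buf_a[..common]
            .iter()
            .zip(&buf_b[..common])
            .position(|(x, y)| x != y);
        if let Some(i) = diff {
            return Ok(Some(offset + i as u64));
        }
        if na != nb {
            return Ok(Some(offset + common as u64));
        }
        if na == 0 {
            return Ok(None);
        }
        offset += na as u64;
    }
}
```

Passing two `std::fs::File` handles (ideally wrapped in `BufReader`) compares the streaming and non-streaming outputs with a fixed 128 KiB of buffer, and the returned offset points at the first byte to inspect when they diverge.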

## Context

- Blocks: ALB-010 Step 7 (teacher Q4K quantization)
- Model: Qwen3-Coder-30B-A3B-Instruct, 18,867 tensors, 56.9 GB F16
- Machine: 128 GB RAM, RTX 4090 24 GB VRAM
