Per-layer F32 dequantization for CPU inference (32B OOM on 119GB)

## Problem

CPU inference dequantizes ALL quantized tensors (Q4K/Q8/etc.) to F32 at model load time, requiring `num_params × 4 bytes` of RAM for the F32 working set.

For 32B models: 32B parameters × 4 bytes = **128 GB F32**, which exceeds the 119 GB unified memory on Project DIGITS (GB10). The process is OOM-killed at ~103 GB RSS.

7B models work fine (7B parameters × 4 bytes = 28 GB).
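
The arithmetic above can be checked with a short sketch. The function name `f32_working_set_gb` and the constant `MEMORY_GB` are made up for illustration, and decimal gigabytes are assumed to match the figures in this issue:

```python
GB = 10**9  # decimal gigabytes (assumption, matches the numbers quoted above)
MEMORY_GB = 119  # unified memory on the GB10 box


def f32_working_set_gb(num_params: float) -> float:
    """RAM needed to hold every weight as F32 (4 bytes per parameter)."""
    return num_params * 4 / GB


for name, params in [("7B", 7e9), ("32B", 32e9)]:
    need = f32_working_set_gb(params)
    verdict = "fits" if need < MEMORY_GB else "exceeds memory"
    print(f"{name}: {need:.0f} GB F32 -> {verdict}")
```

This only counts the F32 weight copy; activations, KV cache, and the mmap'd quantized file come on top of it.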

## Proposed Fix

Implement **per-layer dequantization**: only hold one transformer layer's F32 tensors in memory at a time during the forward pass.

- At layer `i`: dequantize layer `i`'s weights from Q4K to F32, run that layer's forward pass, then release the F32 buffers
- Peak F32 weight memory: roughly 2 GB (one layer, taking 128 GB spread over the model's 64 transformer layers) instead of 128 GB (all layers)
- The Q4K weights stay mmap'd (~18 GB) throughout

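llama.cpp and similar CPU inference engines likewise avoid a full-model F32 copy: they keep the weights quantized in memory and dequantize small blocks on the fly inside the matmul kernels.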
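
A minimal sketch of the per-layer lifetime described above, in Python for brevity. It is not realizar's API: `QuantTensor`, `dequantize`, `forward_per_layer`, and the `on_dequant` hook are hypothetical names, and the `scale * code` scheme is a toy stand-in for real Q4K blocks. The elementwise multiply is a placeholder for the actual layer math.

```python
from typing import Callable, List, Tuple

# Toy stand-in for a quantized tensor: (scale, small integer codes).
# Real Q4K blocks are more involved; this only illustrates buffer lifetime.
QuantTensor = Tuple[float, List[int]]


def dequantize(q: QuantTensor) -> List[float]:
    """Expand integer codes to floats: w = scale * code."""
    scale, codes = q
    return [scale * c for c in codes]


def forward_per_layer(
    layers: List[QuantTensor],
    x: List[float],
    on_dequant: Callable[[int, int], None] = lambda i, n: None,
) -> List[float]:
    """Run layers in order, holding only one layer's dequantized weights."""
    for i, q in enumerate(layers):
        w = dequantize(q)      # dequantized weights for layer i only
        on_dequant(i, len(w))  # hook so callers can observe the footprint
        x = [xi * wi for xi, wi in zip(x, w)]  # placeholder for the layer math
        del w                  # drop the dequantized copy before the next layer
    return x


# Usage: two toy "layers"; record how many floats were live per layer.
layers = [(0.5, [2, 4]), (0.25, [4, 8])]
sizes: List[int] = []
out = forward_per_layer(layers, [1.0, 1.0], lambda i, n: sizes.append(n))
print(out, sizes)  # [1.0, 4.0] [2, 2]
```

The quantized `layers` list plays the role of the mmap'd Q4K file: it stays resident for the whole pass, while the dequantized buffer never exceeds one layer. The trade-off is that every forward pass repeats the dequantization work.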

## Context

- Hardware: NVIDIA Project DIGITS (GB10), 119 GB LPDDR5X unified memory, 20 ARM cores
- Model: Qwen2.5-Coder-32B-Instruct Q4_K_M (19 GB .apr file)
- OOM details: `total-vm:108GB, anon-rss:103GB` before kill
- 7B Q4K HumanEval result: 85.37% pass@1 on same hardware (CPU inference works for 7B)
- GPU blocked: sm_121 (Blackwell) parity gate failure, CPU-only for now

## Files

- `realizar/src/infer/` — CPU inference engine
- Forward pass tensor loading and dequantization logic

## Impact

Unlocks 32B+ model inference on consumer hardware with 64-128 GB RAM.
