Per-layer F32 dequantization for CPU inference (32B OOM on 119GB) #478

@noahgift

Problem

CPU inference dequantizes ALL quantized tensors (Q4K/Q8/etc.) to F32 at model load time, requiring num_params × 4 bytes of RAM for the F32 working set.

For 32B models: 32B params × 4 bytes = 128 GB of F32, which exceeds the 119 GB of unified memory on Project DIGITS (GB10). The process is OOM-killed at ~103 GB RSS.

7B models work fine (7B × 4 = 28 GB).
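
The arithmetic above can be checked with a short sketch. The helper names `f32_working_set_bytes` and `fits_in_memory` are made up for illustration, and the estimate covers only the F32 weights (no activations, KV cache, or the mmap'd quantized file), using decimal gigabytes as the issue does.

```python
GB = 10**9  # decimal gigabytes, matching the 32B × 4 = 128 GB arithmetic


def f32_working_set_bytes(num_params: int) -> int:
    """Bytes needed to hold every weight as a 4-byte F32 value."""
    return num_params * 4


def fits_in_memory(num_params: int, memory_bytes: int) -> bool:
    """True if the full F32 copy of the weights fits in the given memory."""
    return f32_working_set_bytes(num_params) <= memory_bytes


if __name__ == "__main__":
    memory = 119 * GB  # unified memory reported for the GB10 box
    for name, params in [("7B", 7 * 10**9), ("32B", 32 * 10**9)]:
        need_gb = f32_working_set_bytes(params) / GB
        ok = fits_in_memory(params, memory)
        print(f"{name}: {need_gb:.0f} GB of F32 weights, fits in 119 GB: {ok}")
```

The 7B case leaves plenty of headroom, while the 32B case overshoots before any activations are allocated.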

Proposed Fix

Implement per-layer dequantization: only hold one transformer layer's F32 tensors in memory at a time during the forward pass.

  • At layer i: dequantize layer i's weights Q4K→F32, run that layer's forward pass, then release the F32 buffers
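  • Peak F32 weight memory: a single layer instead of all 128 GB. Estimated at ~400 MB; an even split of 128 GB across the 64 layers of Qwen2.5-32B would be closer to 2 GB per layer, which still fits comfortably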
  • The Q4K weights stay mmap'd (~18 GB) throughout

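llama.cpp and similar CPU inference engines avoid a full F32 copy in a related way: they keep the weights quantized in memory and dequantize small blocks on the fly inside the matmul kernels.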
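
The per-layer idea can be illustrated with a toy sketch. It is not the realizar implementation: the `(scale, quants)` block layout is a simplified stand-in for Q4K, and every function name here is hypothetical. It shows only that dequantizing one layer at a time gives the same output as dequantizing everything up front, with a smaller peak F32 footprint.

```python
from typing import List, Tuple

# Hypothetical stand-in for a quantized block: (scale, integer quants).
# Real Q4K blocks are more involved; this mimics a plain scale * q scheme.
Block = Tuple[float, List[int]]
QuantLayer = List[Block]


def dequantize(layer: QuantLayer) -> List[float]:
    """Expand one layer's quantized blocks into a flat list of floats."""
    out: List[float] = []
    for scale, quants in layer:
        out.extend(scale * q for q in quants)
    return out


def forward_layer(x: float, weights: List[float]) -> float:
    """Toy layer: a placeholder for the real transformer math."""
    return x + sum(weights)


def forward_per_layer(x: float, layers: List[QuantLayer]) -> Tuple[float, int]:
    """Dequantize one layer at a time; returns (output, peak floats alive)."""
    peak = 0
    for layer in layers:
        weights = dequantize(layer)      # F32 copy of this layer only
        peak = max(peak, len(weights))
        x = forward_layer(x, weights)
        del weights                      # release before the next layer
    return x, peak


def forward_eager(x: float, layers: List[QuantLayer]) -> Tuple[float, int]:
    """Dequantize every layer up front; returns (output, peak floats alive)."""
    all_weights = [dequantize(layer) for layer in layers]
    peak = sum(len(w) for w in all_weights)
    for weights in all_weights:
        x = forward_layer(x, weights)
    return x, peak


if __name__ == "__main__":
    layers = [
        [(0.5, [1, 2, 3, 4])],
        [(2.0, [1, -1, 0, 2])],
        [(0.25, [4, 4, 4, 4])],
    ]
    print(forward_per_layer(0.0, layers))
    print(forward_eager(0.0, layers))
```

The trade-off in a design like this is that dequantization is repeated on every forward pass instead of once at load time, so it exchanges CPU time for memory.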

Context

  • Hardware: NVIDIA Project DIGITS (GB10), 119 GB LPDDR5X unified memory, 20 ARM cores
  • Model: Qwen2.5-Coder-32B-Instruct Q4_K_M (19 GB .apr file)
  • OOM details: total-vm:108GB, anon-rss:103GB before kill
  • 7B Q4K HumanEval result: 85.37% pass@1 on same hardware (CPU inference works for 7B)
  • GPU path blocked by a parity gate failure on sm_121 (Blackwell), so inference is CPU-only for now

Files

  • realizar/src/infer/ — CPU inference engine
  • Forward pass tensor loading and dequantization logic

Impact

Unlocks 32B+ model inference on consumer hardware with 64-128 GB RAM.

Labels: bug