realizar: APR transformer loader lacks Q8/Q4 dequantization for attention weights #239

@noahgift

Bug Report

Source: tiny-model-ground-truth parity checker (0/59 passing)
Severity: Critical — blocks ALL quantized inference (Int4 and Int8)
Related: GH-237 fixed the write side; this is the read side

Description

After GH-237 wired real quantization through the APR write pipeline, quantized tensors are stored correctly as Q8 (dtype=9) and Q4 (dtype=8) in .apr files. However, the inference loader in realizar/src/apr_transformer/ does not dequantize these tensors. It reads the quantized bytes as if they were f32, which produces shape mismatches for Int8 (roughly a 4:1 element deficit, because every four 1-byte Q8 values are counted as one 4-byte f32) and all-zero tensors for Int4 (nibble-packed data misread as f32 zeros).

Embeddings and lm_head load correctly because they are excluded from quantization and stay F32 (dtype=0). Layer weights fail because they are stored quantized.

Evidence

SmolLM-135M Int8 — embeddings/lm_head load fine, layer weights fail:

[APR-LOAD] Embedding loaded: 28311552 elements (dtype=0, F32)     ← OK (skip-quant)
[APR-LOAD] LM head loaded: 28311552 elements (dtype=0, F32)      ← OK (skip-quant)

error: [F-LAYOUT-CONTRACT-001] Tensor 'layers.0.qkv_weight':
  Shape mismatch: got 138243 elements, expected 552960 (960x576)

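138,243 ≈ 552,960 / 4 (exactly 138,240): the loader reads Q8 bytes (1 byte each) as f32 (4 bytes each). The 3 surplus elements correspond to roughly 12 extra bytes, possibly quantization metadata such as a scale, though that is not confirmed here.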
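The byte arithmetic can be checked with a small sketch. It assumes one byte per Q8 element and ignores any per-tensor metadata; the function name `elements_if_read_as_f32` is hypothetical and not part of realizar.

```rust
/// Number of f32 values a loader would count if it reinterpreted a raw
/// byte buffer as f32 data (4 bytes per element).
fn elements_if_read_as_f32(byte_len: usize) -> usize {
    byte_len / std::mem::size_of::<f32>()
}

/// Expected element count for a (rows x cols) weight matrix.
fn expected_elements(rows: usize, cols: usize) -> usize {
    rows * cols
}
```

For the 960x576 tensor in the log, `expected_elements(960, 576)` is 552,960, and a Q8 payload of that many bytes read as f32 yields 138,240 elements, a quarter of what the layout contract expects.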

SmolLM-135M Int4:

error: [F-DATA-QUALITY-001] Tensor 'layers.0.qkv_weight':
  DENSITY FAILURE: 100.0% zeros (max 80%)

Q4 nibble-packed data is being read as f32 zeros.
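For context, a density check of the kind reported by F-DATA-QUALITY-001 might look like the sketch below. This is an assumption about how such a check could work, not the actual realizar validator; `zero_fraction` and `passes_density` are hypothetical names, and the 0.8 threshold is taken from the "max 80%" in the error message.

```rust
/// Fraction of elements that compare equal to zero (0.0 for an empty slice).
fn zero_fraction(data: &[f32]) -> f32 {
    if data.is_empty() {
        return 0.0;
    }
    let zeros = data.iter().filter(|&&x| x == 0.0).count();
    zeros as f32 / data.len() as f32
}

/// True when the tensor has no more zeros than the allowed fraction.
fn passes_density(data: &[f32], max_zero_fraction: f32) -> bool {
    zero_fraction(data) <= max_zero_fraction
}
```

A tensor that is 100% zeros fails `passes_density(data, 0.8)`, which matches the failure reported above.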

What Needs to Happen

The APR transformer loader in realizar needs to:

  1. Check dtype field for each tensor in the APR file
  2. Dequantize Q8 tensors (dtype=9): unpack int8 → f32 using scale/zero-point
  3. Dequantize Q4 tensors (dtype=8): unpack int4 nibbles → f32 using block scale/min
  4. Pass f32 data to the existing matmul kernels
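
The four steps above can be sketched as follows. Only the dtype codes (0, 8, 9) come from this issue; everything else is an assumption made for illustration: the `QuantParams` struct, the function names, a per-tensor scale/zero-point for Q8, and a Q4 layout with 32 values per block, low nibble first, and one f32 scale and min per block. The real APR on-disk layout may differ and should be taken from the GH-237 write path.

```rust
// dtype codes as reported in this issue; all other names and layouts are assumptions.
const DTYPE_F32: u8 = 0;
const DTYPE_Q4: u8 = 8;
const DTYPE_Q8: u8 = 9;

// Assumed number of values that share one scale/min pair in Q4.
const Q4_BLOCK: usize = 32;

/// Hypothetical container for quantization parameters stored with a tensor.
struct QuantParams<'a> {
    scale: f32,              // Q8: per-tensor scale (assumed)
    zero_point: i8,          // Q8: per-tensor zero point (assumed)
    block_scales: &'a [f32], // Q4: one scale per block (assumed)
    block_mins: &'a [f32],   // Q4: one min per block (assumed)
}

/// Decode little-endian f32 values from raw bytes.
fn decode_f32(bytes: &[u8]) -> Vec<f32> {
    bytes
        .chunks_exact(4)
        .map(|c| f32::from_le_bytes([c[0], c[1], c[2], c[3]]))
        .collect()
}

/// Q8: one signed byte per element, x = scale * (q - zero_point).
fn dequantize_q8(bytes: &[u8], scale: f32, zero_point: i8) -> Vec<f32> {
    bytes
        .iter()
        .map(|&b| scale * (f32::from(b as i8) - f32::from(zero_point)))
        .collect()
}

/// Q4: two unsigned 4-bit values per byte, x = scale * q + min per block.
fn dequantize_q4(packed: &[u8], scales: &[f32], mins: &[f32], n: usize) -> Vec<f32> {
    (0..n)
        .map(|i| {
            let byte = packed[i / 2];
            // Even indices take the low nibble, odd indices the high nibble.
            let q = if i % 2 == 0 { byte & 0x0F } else { byte >> 4 };
            let block = i / Q4_BLOCK;
            scales[block] * f32::from(q) + mins[block]
        })
        .collect()
}

/// Step 1: check dtype. Steps 2-3: dequantize. Step 4: hand back f32 data.
fn load_as_f32(dtype: u8, bytes: &[u8], n: usize, p: &QuantParams) -> Result<Vec<f32>, String> {
    let blocks = (n + Q4_BLOCK - 1) / Q4_BLOCK;
    match dtype {
        DTYPE_F32 if bytes.len() == n * 4 => Ok(decode_f32(bytes)),
        DTYPE_Q8 if bytes.len() == n => Ok(dequantize_q8(bytes, p.scale, p.zero_point)),
        DTYPE_Q4
            if bytes.len() * 2 >= n
                && p.block_scales.len() >= blocks
                && p.block_mins.len() >= blocks =>
        {
            Ok(dequantize_q4(bytes, p.block_scales, p.block_mins, n))
        }
        DTYPE_F32 | DTYPE_Q4 | DTYPE_Q8 => {
            Err(format!("dtype {dtype}: payload size does not match {n} elements"))
        }
        other => Err(format!("unsupported dtype {other}")),
    }
}
```

Validating the payload size per dtype before decoding means a tensor that is still quantized can no longer reach the F32 path silently; it either decodes to the expected element count or returns an error.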

Alternatively, if realizar already has Q4K/Q6K GPU kernels (per CLAUDE.md: QuantizeKernel, Q5KKernel, Q6KKernel), the loader should route quantized tensors to those kernels instead of the F32 path.
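
A minimal sketch of that routing decision is shown below. The `Route` enum and `route_for` function are hypothetical, and whether the existing kernels accept the APR Q4/Q8 layouts without repacking is an open question this sketch does not answer.

```rust
/// Hypothetical description of where a loaded tensor should go.
#[derive(Debug, PartialEq)]
enum Route {
    /// Tensor is already f32; use the existing matmul kernels.
    F32Path,
    /// Tensor is quantized and no compatible kernel exists; dequantize first.
    DequantizeThenF32,
    /// Tensor is quantized and a compatible quantized kernel exists.
    QuantizedKernel,
}

/// Pick a route from the tensor's dtype (0 = F32, 8 = Q4, 9 = Q8, per this issue).
/// Returns None for dtypes this sketch does not know about.
fn route_for(dtype: u8, quant_kernel_available: bool) -> Option<Route> {
    match dtype {
        0 => Some(Route::F32Path),
        8 | 9 if quant_kernel_available => Some(Route::QuantizedKernel),
        8 | 9 => Some(Route::DequantizeThenF32),
        _ => None,
    }
}
```

Keeping the dequantize-then-F32 route as a fallback lets the loader work on backends that have no quantized kernel for a given dtype.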

Affected

ALL 3 models × ALL quantized layers × both Int4 and Int8 = every check that gets past embedding loading.

Reproduction

cd tiny-model-ground-truth
make clean && make convert  # With GH-237 apr
make check                  # 0/59 — all fail on layers.0.qkv_weight

Environment
