realizar: APR transformer loader lacks Q8/Q4 dequantization for attention weights #239

## Bug Report

**Source**: `tiny-model-ground-truth` parity checker (0/59 passing)
**Severity**: Critical — blocks ALL quantized inference (Int4 and Int8)
**Related**: GH-237 fixed the write side; this is the read side

## Description

After GH-237 wired real quantization through the APR write pipeline, quantized tensors are stored correctly as Q8 (dtype=9) and Q4 (dtype=8) in `.apr` files. However, the inference loader in `realizar/src/apr_transformer/` does not dequantize these tensors. It reads the quantized bytes as if they were f32, which produces shape mismatches for Int8 (a 4:1 element deficit) and all-zero tensors for Int4 (nibble-packed data misread as f32).

**Embeddings and lm_head work** because they are excluded from quantization and stay F32 (dtype=0). Layer weights fail because they are actually quantized.

## Evidence

SmolLM-135M Int8 — embeddings/lm_head load fine, layer weights fail:
```
[APR-LOAD] Embedding loaded: 28311552 elements (dtype=0, F32)     ← OK (skip-quant)
[APR-LOAD] LM head loaded: 28311552 elements (dtype=0, F32)      ← OK (skip-quant)

error: [F-LAYOUT-CONTRACT-001] Tensor 'layers.0.qkv_weight':
  Shape mismatch: got 138243 elements, expected 552960 (960x576)
```

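552,960 / 4 = 138,240, so the reported 138,243 elements is almost exactly one quarter of the expected count: the loader treats the Q8 payload (1 byte per element) as f32 (4 bytes per element). The 3 extra elements (12 bytes) are not explained by this ratio; they may be quantization metadata stored alongside the tensor, but that is unconfirmed.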
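The element-count arithmetic can be sketched in Rust as follows. This is an illustration only: the dtype codes (0, 8, 9) come from this issue, while `elements_if_read_as_f32` and `min_payload_bytes` are hypothetical helper names, not realizar APIs, and the byte counts ignore any scale or block metadata the real format stores.

```rust
/// Dtype codes as reported in this issue (0 = F32, 8 = Q4, 9 = Q8).
const DTYPE_F32: u8 = 0;
const DTYPE_Q4: u8 = 8;
const DTYPE_Q8: u8 = 9;

/// Element count a loader would report if it treated a raw payload as packed f32.
/// (Hypothetical helper, for illustration.)
pub fn elements_if_read_as_f32(payload_bytes: usize) -> usize {
    payload_bytes / 4
}

/// Minimum payload size in bytes for `elements` values of the given dtype.
/// Assumption: scale/zero-point/block metadata is NOT counted here.
pub fn min_payload_bytes(dtype: u8, elements: usize) -> Option<usize> {
    match dtype {
        DTYPE_F32 => Some(elements * 4),      // 4 bytes per element
        DTYPE_Q8 => Some(elements),           // 1 byte per element
        DTYPE_Q4 => Some((elements + 1) / 2), // two 4-bit values per byte
        _ => None,                            // unknown dtype
    }
}
```

For the 960x576 `qkv_weight` tensor, a Q8 payload needs at least 552,960 bytes, and dividing that by 4 gives the 138,240 elements that roughly match the error message.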

SmolLM-135M Int4:
```
error: [F-DATA-QUALITY-001] Tensor 'layers.0.qkv_weight':
  DENSITY FAILURE: 100.0% zeros (max 80%)
```

Q4 nibble-packed data is being read as f32 zeros.

## What Needs to Happen

The APR transformer loader in realizar needs to:

1. **Check `dtype` field** for each tensor in the APR file
2. **Dequantize Q8 tensors** (dtype=9): unpack int8 → f32 using scale/zero-point
3. **Dequantize Q4 tensors** (dtype=8): unpack int4 nibbles → f32 using block scale/min
4. **Pass f32 data** to the existing matmul kernels
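The steps above could look roughly like the following Rust sketch. It is not realizar's code, and the on-disk layout is assumed: one scale/zero-point per Q8 tensor, one scale/min per Q4 tensor (the real format is likely block-wise), and low-nibble-first packing. `QuantParams`, `dequantize_q8`, `dequantize_q4`, and `load_as_f32` are hypothetical names.

```rust
/// Quantization parameters, simplified to one set per tensor for illustration.
#[derive(Debug, Clone, Copy)]
pub enum QuantParams {
    Q8 { scale: f32, zero_point: i8 },
    Q4 { scale: f32, min: f32 },
}

/// Q8: reinterpret each byte as i8, then apply (q - zero_point) * scale.
pub fn dequantize_q8(bytes: &[u8], scale: f32, zero_point: i8) -> Vec<f32> {
    bytes
        .iter()
        .map(|&b| (i32::from(b as i8) - i32::from(zero_point)) as f32 * scale)
        .collect()
}

/// Q4: each byte holds two 4-bit values (assumed low nibble first);
/// each value maps to nibble * scale + min.
pub fn dequantize_q4(bytes: &[u8], scale: f32, min: f32, elements: usize) -> Vec<f32> {
    let mut out = Vec::with_capacity(elements);
    for &b in bytes {
        for nibble in [b & 0x0F, b >> 4] {
            // Drop the padding nibble when `elements` is odd.
            if out.len() < elements {
                out.push(f32::from(nibble) * scale + min);
            }
        }
    }
    out
}

/// Dispatch on the dtype codes from this issue and always return f32 data,
/// checking the element count before handing it to the matmul path.
pub fn load_as_f32(
    dtype: u8,
    bytes: &[u8],
    elements: usize,
    params: Option<QuantParams>,
) -> Result<Vec<f32>, String> {
    let out: Vec<f32> = match (dtype, params) {
        (0, _) => bytes
            .chunks_exact(4)
            .map(|c| f32::from_le_bytes([c[0], c[1], c[2], c[3]]))
            .collect(),
        (9, Some(QuantParams::Q8 { scale, zero_point })) => {
            dequantize_q8(bytes, scale, zero_point)
        }
        (8, Some(QuantParams::Q4 { scale, min })) => {
            dequantize_q4(bytes, scale, min, elements)
        }
        _ => return Err(format!("unsupported dtype {dtype} or missing quant params")),
    };
    if out.len() != elements {
        return Err(format!("got {} elements, expected {}", out.len(), elements));
    }
    Ok(out)
}
```

Validating the element count after dequantization keeps the existing layout contract check meaningful: a mismatch then indicates a real shape problem rather than a dtype misread.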

Alternatively, if realizar already has Q4K/Q6K GPU kernels (per CLAUDE.md: `QuantizeKernel`, `Q5KKernel`, `Q6KKernel`), the loader should route quantized tensors to those kernels instead of the F32 path.
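A minimal sketch of that routing decision, assuming the loader can tell whether a quantized kernel is available for a given tensor. `ExecPath` and `select_path` are hypothetical names for illustration; they do not describe how `QuantizeKernel`, `Q5KKernel`, or `Q6KKernel` are actually invoked.

```rust
/// Which execution path a loaded tensor should take (hypothetical type).
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
pub enum ExecPath {
    /// dtype=0: data is already f32.
    F32Direct,
    /// Quantized, no matching kernel: dequantize on load, then use the F32 path.
    DequantizeThenF32,
    /// Quantized, matching kernel available: keep the packed bytes.
    QuantizedKernel,
}

/// Pick a path from the dtype codes used in this issue (0 = F32, 8 = Q4, 9 = Q8).
pub fn select_path(dtype: u8, has_quantized_kernel: bool) -> Option<ExecPath> {
    match dtype {
        0 => Some(ExecPath::F32Direct),
        8 | 9 if has_quantized_kernel => Some(ExecPath::QuantizedKernel),
        8 | 9 => Some(ExecPath::DequantizeThenF32),
        _ => None, // unknown dtype: fail loudly instead of reading as f32
    }
}
```

Returning `None` for unknown dtypes means a new quantization format surfaces as an explicit error instead of silently falling through to the F32 reader, which is the failure mode described in this issue.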

## Affected

ALL 3 models × ALL quantized layers × both Int4 and Int8 = every check that gets past embedding loading.

## Reproduction

```bash
cd tiny-model-ground-truth
make clean && make convert  # With GH-237 apr
make check                  # 0/59 — all fail on layers.0.qkv_weight
```

## Environment

- `apr` v0.2.16 with GH-231/232/233/234/235/236/237 fixes
- `realizar` v0.6.13
- Platform: Linux x86_64
