Bug Report
Source: tiny-model-ground-truth parity checker (0/59 passing)
Severity: Critical — blocks ALL quantized inference (Int4 and Int8)
Related: GH-237 fixed the write side; this is the read side
Description
After GH-237 wired real quantization through the APR write pipeline, quantized tensors are now stored correctly as Q8 (dtype=9) and Q4 (dtype=8) in .apr files. However, the inference loader in realizar/src/apr_transformer/ does not dequantize these tensors — it reads quantized bytes as if they were f32, producing shape mismatches (Int8: 4:1 element deficit) or all-zeros (Int4: nibble data misinterpreted as float zeros).
Embeddings and lm_head work because they're skipped from quantization (dtype=0, F32). Layer weights fail because they're actually quantized.
Evidence
SmolLM-135M Int8 — embeddings/lm_head load fine, layer weights fail:
[APR-LOAD] Embedding loaded: 28311552 elements (dtype=0, F32) ← OK (skip-quant)
[APR-LOAD] LM head loaded: 28311552 elements (dtype=0, F32) ← OK (skip-quant)
error: [F-LAYOUT-CONTRACT-001] Tensor 'layers.0.qkv_weight':
Shape mismatch: got 138243 elements, expected 552960 (960x576)
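138,243 ≈ 552,960 / 4 (exactly 138,240): the loader reads Q8 bytes (1 byte each) as f32 (4 bytes each). The 3 surplus elements correspond to 12 extra bytes in the stored payload, presumably quantization metadata.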
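The 4:1 ratio can be checked with a small sketch. The helper name below is hypothetical and is not part of realizar; it only models what happens when a raw byte buffer is counted in 4-byte units.

```rust
/// Number of f32 elements a loader would report if it reinterpreted
/// `byte_len` raw bytes as 4-byte floats. Hypothetical helper for
/// illustration only; not a realizar function.
fn elements_if_read_as_f32(byte_len: usize) -> usize {
    byte_len / std::mem::size_of::<f32>()
}

/// Number of raw bytes implied by an observed f32 element count.
fn bytes_implied_by(elements: usize) -> usize {
    elements * std::mem::size_of::<f32>()
}
```

With one byte per Q8 weight, a 960x576 tensor holds 552,960 bytes, which this reinterpretation reports as 138,240 elements. The 138,243 elements in the log imply 552,972 bytes.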
SmolLM-135M Int4:
error: [F-DATA-QUALITY-001] Tensor 'layers.0.qkv_weight':
DENSITY FAILURE: 100.0% zeros (max 80%)
Q4 nibble-packed data is being read as f32 zeros.
What Needs to Happen
The APR transformer loader in realizar needs to:
- Check the dtype field for each tensor in the APR file
- Dequantize Q8 tensors (dtype=9): unpack int8 → f32 using scale/zero-point
- Dequantize Q4 tensors (dtype=8): unpack int4 nibbles → f32 using block scale/min
- Pass f32 data to the existing matmul kernels
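The steps above could look roughly like the following sketch. Only the dtype codes (0, 8, 9) come from this report. The on-disk layout is an assumption: one scale and zero-point per Q8 buffer, one scale and min per Q4 block, low nibble first, little-endian F32. The names QuantParams, dequantize_q8, dequantize_q4 and tensor_to_f32 are hypothetical and are not realizar APIs.

```rust
// dtype codes as reported in this issue.
const DTYPE_F32: u8 = 0;
const DTYPE_Q4: u8 = 8;
const DTYPE_Q8: u8 = 9;

/// Hypothetical quantization parameters; the real APR layout may differ.
#[derive(Clone, Copy, Debug)]
struct QuantParams {
    scale: f32,
    zero_point: i8, // used by the Q8 path
    min: f32,       // used by the Q4 path
}

/// Q8: one signed byte per weight, value = (q - zero_point) * scale.
fn dequantize_q8(data: &[u8], scale: f32, zero_point: i8) -> Vec<f32> {
    data.iter()
        .map(|&b| (i32::from(b as i8) - i32::from(zero_point)) as f32 * scale)
        .collect()
}

/// Q4: two unsigned 4-bit weights per byte (low nibble first, assumed),
/// value = q * scale + min.
fn dequantize_q4(data: &[u8], scale: f32, min: f32) -> Vec<f32> {
    let mut out = Vec::with_capacity(data.len() * 2);
    for &b in data {
        out.push(f32::from(b & 0x0F) * scale + min);
        out.push(f32::from(b >> 4) * scale + min);
    }
    out
}

/// Dispatch on dtype so that every path hands f32 data to the matmul kernels.
fn tensor_to_f32(dtype: u8, data: &[u8], p: QuantParams) -> Result<Vec<f32>, String> {
    match dtype {
        DTYPE_F32 => {
            if data.len() % 4 != 0 {
                return Err(format!("F32 payload of {} bytes is not a multiple of 4", data.len()));
            }
            // Little-endian byte order is assumed here.
            Ok(data
                .chunks_exact(4)
                .map(|c| f32::from_le_bytes([c[0], c[1], c[2], c[3]]))
                .collect())
        }
        DTYPE_Q8 => Ok(dequantize_q8(data, p.scale, p.zero_point)),
        DTYPE_Q4 => Ok(dequantize_q4(data, p.scale, p.min)),
        other => Err(format!("unsupported dtype {}", other)),
    }
}
```

A real implementation would apply the Q4 parameters per block rather than per buffer. The sketch handles a single block to keep the arithmetic visible.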
Alternatively, if realizar already has Q4K/Q6K GPU kernels (per CLAUDE.md: QuantizeKernel, Q5KKernel, Q6KKernel), the loader should route quantized tensors to those kernels instead of the F32 path.
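The routing decision could be sketched as below. ComputePath and select_path are hypothetical names, and the has_quantized_kernel flag stands in for whatever check realizar would use to decide whether a suitable quantized kernel exists. The sketch does not call or model the actual kernel APIs.

```rust
// dtype codes as reported in this issue.
const DTYPE_F32: u8 = 0;
const DTYPE_Q4: u8 = 8;
const DTYPE_Q8: u8 = 9;

/// Hypothetical description of how a tensor should be consumed.
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
enum ComputePath {
    /// Tensor is already f32; use the existing matmul path.
    F32Matmul,
    /// Keep the tensor quantized and hand it to a quantized kernel.
    QuantizedKernel,
    /// No suitable kernel; dequantize at load time, then use the f32 path.
    DequantizeThenF32,
}

/// Choose a compute path from the tensor's dtype instead of assuming f32.
fn select_path(dtype: u8, has_quantized_kernel: bool) -> Result<ComputePath, String> {
    match dtype {
        DTYPE_F32 => Ok(ComputePath::F32Matmul),
        DTYPE_Q4 | DTYPE_Q8 if has_quantized_kernel => Ok(ComputePath::QuantizedKernel),
        DTYPE_Q4 | DTYPE_Q8 => Ok(ComputePath::DequantizeThenF32),
        other => Err(format!("unsupported dtype {}", other)),
    }
}
```

Returning an error for unknown dtypes, rather than falling back to f32, would surface a missing code path at load time instead of as a shape or density failure later.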
Affected
ALL 3 models × ALL quantized layers × both Int4 and Int8 = every check that gets past embedding loading.
Reproduction
cd tiny-model-ground-truth
make clean && make convert # With GH-237 apr
make check # 0/59 — all fail on layers.0.qkv_weight
Environment
- apr v0.2.16 with #231/232/233/234/235/236/237 fixes (#231: Int8 quantization corrupts embedding tensors (NaN/Inf + shape mismatch))
- realizar v0.6.13