Int8/Int4 quantization corrupts ALL tensors, not just embeddings/lm_head #237

@noahgift

Bug Report

Source: tiny-model-ground-truth parity checker (0/59 passing)
Severity: Critical — quantization is fundamentally broken for all tensors
Related: GH-231, GH-232, GH-234. Those fixes unmasked this bug: the same corruption pattern propagates through every layer.

Description

After applying GH-231/232 (embedding skip-quant) and GH-234 (lm_head skip-quant), the exact same corruption now appears in layers.0.qkv_weight, the first attention QKV weight. This shows the quantization bug is not specific to embeddings or lm_head; it affects ALL quantized tensors:

  • Int8: Loaded element count is about 1/4 of the expected count (quantized bytes are stored, then counted as if they were 4-byte f32 values)
  • Int4: Element count is correct but the data is 100% zeros (written at the wrong offset, or not written at all)

The should_skip_quant() approach of exempting individual tensors is whack-a-mole. The quantization pipeline itself is broken.

Error Output (SmolLM-135M Int8)

[APR-LOAD] Embedding loaded: 28311552 elements  ← FIXED (skipped quant)
[APR-LOAD] LM head loaded: 28311552 elements    ← FIXED (skipped quant)

error: [F-LAYOUT-CONTRACT-001] Tensor 'layers.0.qkv_weight': Shape mismatch:
  got 138243 elements, expected 552960 (960x576)

138,243 ≈ 552,960 / 4 (which is exactly 138,240): the same 4:1 ratio as the original embedding bug.
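
The ratio can be checked with a few lines of arithmetic. This is a minimal sketch, assuming the loader derives its element count by dividing the payload byte length by the size of an f32. Both function names are hypothetical and are not taken from the apr codebase.

```rust
/// Number of elements implied by a tensor shape.
fn expected_elements(shape: &[usize]) -> usize {
    shape.iter().product()
}

/// Element count a reader would report if it treated a payload of
/// `byte_len` bytes as 4-byte f32 values (assumed loader behavior).
fn count_if_read_as_f32(byte_len: usize) -> usize {
    byte_len / std::mem::size_of::<f32>()
}
```

For the 960x576 tensor stored at one byte per int8 value, the payload is 552,960 bytes, and `count_if_read_as_f32(552_960)` gives 138,240. The log reports 138,243, three more than that; this sketch does not account for the difference.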

Error Output (SmolLM-135M Int4)

error: [F-DATA-QUALITY-001] Tensor 'layers.0.qkv_weight': DENSITY FAILURE:
  100.0% zeros (max 80%)

Same all-zeros pattern as the original Int4 embedding bug.
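
For reference, a density check of this kind can be sketched as below. It assumes the check counts exactly-zero values in the dequantized tensor and compares the fraction against a fixed ceiling. The names `zero_fraction` and `check_density` are hypothetical; the actual F-DATA-QUALITY-001 implementation may differ.

```rust
/// Fraction of exactly-zero values in a tensor, in the range 0.0..=1.0.
/// An empty tensor is reported as 0.0.
fn zero_fraction(values: &[f32]) -> f32 {
    if values.is_empty() {
        return 0.0;
    }
    let zeros = values.iter().filter(|v| **v == 0.0).count();
    zeros as f32 / values.len() as f32
}

/// Hypothetical density gate: fail when more than `max_zero_fraction`
/// of the values are zero (the log above reports a maximum of 80%).
fn check_density(values: &[f32], max_zero_fraction: f32) -> Result<(), String> {
    let frac = zero_fraction(values);
    if frac > max_zero_fraction {
        Err(format!(
            "DENSITY FAILURE: {:.1}% zeros (max {:.0}%)",
            frac * 100.0,
            max_zero_fraction * 100.0
        ))
    } else {
        Ok(())
    }
}
```

A tensor region that was never written reads back as all zeros, so `check_density(&region, 0.8)` returns an error for it regardless of what the writer intended to store.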

Affected: ALL 3 Models, ALL Quantized Tensors

Model        | Int8 Error                        | Int4 Error
SmolLM-135M  | qkv_weight: 138K vs 553K expected | qkv_weight: 100% zeros
Qwen2-0.5B   | qkv_weight: shape mismatch        | qkv_weight: 100% zeros
GPT-2 124M   | qkv_weight: shape mismatch        | qkv_weight: 100% zeros

Root Cause

The quantization pipeline in converter/write.rs and converter/mod.rs has a fundamental data serialization bug:

Int8: When writing quantized int8 data, the writer stores one raw byte per value, but the payload is then treated as if it held f32 elements. Since each f32 is 4 bytes and each int8 is 1 byte, a payload of N int8 values is counted as N/4 elements, 1/4 of the expected count.
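
One way to catch this class of mismatch at write time is to derive the payload size from the dtype and shape together, and to reject any payload whose length disagrees. The sketch below is illustrative only: the `DType` enum and both functions are hypothetical, not the types in converter/write.rs, and it ignores any per-block scale or zero-point metadata a real quantized format would carry.

```rust
/// Storage types used in this sketch (hypothetical; not the apr enum).
#[derive(Clone, Copy, Debug, PartialEq)]
enum DType {
    F32,
    I8,
    I4,
}

/// Bytes needed to store `elements` values of `dtype`.
/// Int4 is assumed to pack two values per byte, rounding up.
fn payload_bytes(dtype: DType, elements: usize) -> usize {
    match dtype {
        DType::F32 => elements * 4,
        DType::I8 => elements,
        DType::I4 => (elements + 1) / 2,
    }
}

/// Check that a payload has the byte length its shape and dtype require.
fn check_payload(dtype: DType, shape: &[usize], payload_len: usize) -> Result<(), String> {
    let elements: usize = shape.iter().product();
    let expected = payload_bytes(dtype, elements);
    if payload_len == expected {
        Ok(())
    } else {
        Err(format!(
            "{:?} tensor {:?}: got {} bytes, expected {}",
            dtype, shape, payload_len, expected
        ))
    }
}
```

With this check, a 552,960-byte payload for a 960x576 tensor passes as `DType::I8` and fails as `DType::F32`, so the mismatch surfaces at write time instead of at load time.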

Int4: The writer computes the correct element count (accounting for int4 packing), but the data is either written at the wrong file offset or not written at all, leaving the tensor region as zeros.
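
The Int4 path has two steps that can be tested in isolation: packing the values and placing the packed bytes at the recorded offset. The sketch below assumes signed 4-bit values packed two per byte, low nibble first, and a zero-initialized output buffer. The issue does not state the actual APR nibble order or layout, and all three function names are hypothetical.

```rust
/// Pack signed 4-bit values (range -8..=7) two per byte, low nibble first.
fn pack_int4(values: &[i8]) -> Vec<u8> {
    values
        .chunks(2)
        .map(|pair| {
            let lo = (pair[0] as u8) & 0x0F;
            // An odd trailing value leaves the high nibble as zero.
            let hi = pair.get(1).map_or(0, |v| (*v as u8) & 0x0F);
            lo | (hi << 4)
        })
        .collect()
}

/// Unpack `count` signed 4-bit values from bytes produced by `pack_int4`.
fn unpack_int4(bytes: &[u8], count: usize) -> Vec<i8> {
    (0..count)
        .map(|i| {
            let byte = bytes[i / 2];
            let nibble = if i % 2 == 0 { byte & 0x0F } else { byte >> 4 };
            // Sign-extend: move the nibble to the top of the byte, then
            // arithmetic-shift it back down.
            ((nibble << 4) as i8) >> 4
        })
        .collect()
}

/// Copy a packed payload into `file` at `offset` and return the region end.
fn write_at(file: &mut [u8], offset: usize, payload: &[u8]) -> usize {
    let end = offset + payload.len();
    file[offset..end].copy_from_slice(payload);
    end
}
```

A round-trip test (pack, write at the recorded offset, read the same region back, unpack) would fail if the reader's offset differs from the writer's, because the reader would then see only the zero-initialized bytes.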

The should_skip_quant() approach only works as a workaround for embeddings/lm_head. The fix must be in the quantization serialization logic itself.

Reproduction

cd tiny-model-ground-truth
make clean && make convert
apr run models/smollm-135m-int8.apr -p "Hello" -n 32 --json
# Embedding and lm_head load fine, crashes on layers.0.qkv_weight

Environment
