REGRESSION: Format conversion still produces large diffs after #177 fix #181

@noahgift

REGRESSION: Format Conversion Still Failing After #177 Fix

Status: REGRESSION from closed #177
Severity: P0 (CRITICAL - Data Corruption)
Component: apr-rosetta / realizar
Discovered By: apr-model-qa-playbook requalification (2026-01-30)
Blocking: Model qualification certification


Executive Summary

Issue #177 was closed, but requalification testing on 2026-01-30 shows format conversion still fails with large output differences. The Jidoka detection is working (diffs are flagged), but the root cause fix is incomplete.


Regression Evidence

Test Environment

Date: 2026-01-30T14:59:00Z
Host: noah-Lambda-Vector
Model: Qwen/Qwen2.5-Coder-1.5B-Instruct (GGUF Q4_K_M)
Path: /home/noah/.cache/huggingface/hub/models--Qwen--Qwen2.5-Coder-1.5B-Instruct-GGUF/snapshots/.../qwen2.5-coder-1.5b-instruct-q4_k_m.gguf
Playbook: qwen2.5-coder-1.5b-ci.playbook.yaml

Test Results

Total scenarios: 57
Passed: 50
Failed: 7  ← ALL 7 ARE FORMAT CONVERSION
Pass rate: 89.3%  ← Should be 100%

Detailed Failures

| Gate | Conversion | Diff | Tolerance | Verdict |
|---|---|---|---|---|
| F-CONV-001 | GGUF → APR | 6.77e-1 | 1.00e-6 | ❌ FAIL (677,000× over tolerance) |
| F-CONV-002 | APR → GGUF | 4.16e-1 | 1.00e-6 | ❌ FAIL (416,000× over tolerance) |
| F-CONV-003 | GGUF → SafeTensors | Infrastructure error | - | ❌ FAIL (see below) |
| F-CONV-004 | SafeTensors → GGUF | 4.16e-1 | 1.00e-6 | ❌ FAIL |
| F-CONV-005 | APR → SafeTensors | Infrastructure error | - | ❌ FAIL (see below) |
| F-CONV-006 | SafeTensors → APR | 6.77e-1 | 1.00e-6 | ❌ FAIL |
| F-CONV-RT-001 | Round-trip | Blocked | - | ❌ FAIL |

Raw Evidence from evidence.json

{
  "gate_id": "F-CONV-G-A",
  "outcome": "Falsified",
  "reason": "Conversion Gguf → Apr produced different output (diff: 6.77e-1, ε: 1.00e-6)",
  "output": "6de63189564fc936",
  "timestamp": "2026-01-30T14:07:23.xxx"
}

{
  "gate_id": "F-CONV-A-G", 
  "outcome": "Falsified",
  "reason": "Conversion Apr → Gguf produced different output (diff: 4.16e-1, ε: 1.00e-6)",
  "output": "0356a3e657672e25",
  "timestamp": "2026-01-30T14:07:35.xxx"
}

Comparison: Before vs After #177 Fix

| Metric | Before #177 | After #177 | Status |
|---|---|---|---|
| NaN detection | ❌ Silent | ✅ Detected | FIXED |
| Inf detection | ❌ Silent | ✅ Detected | FIXED |
| Output diff (GGUF→APR) | 8.46e-1 | 6.77e-1 | IMPROVED (20% lower) |
| Output diff (APR→GGUF) | 6.34e-1 | 4.16e-1 | IMPROVED (34% lower) |
| Within tolerance (ε=1e-6) | ❌ No | ❌ No | STILL FAILING |
| Round-trip lossless | ❌ No | ❌ No | STILL FAILING |

Conclusion: The #177 fix improved detection and reduced diff magnitude, but the remaining diffs (4.16e-1 and 6.77e-1) are still roughly 416,000× to 677,000× above the 1e-6 tolerance.
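
The over-tolerance factors and the before/after reductions quoted here are simple ratios. The sketch below shows the arithmetic; the function names are hypothetical helpers for illustration and are not part of apr-qa.

```rust
/// How many times `diff` exceeds `tolerance`.
/// Hypothetical helper, not part of any existing tool.
fn times_over_tolerance(diff: f64, tolerance: f64) -> f64 {
    diff / tolerance
}

/// Relative reduction going from `before` to `after`, as a fraction
/// (0.25 means the value dropped by 25%).
fn relative_reduction(before: f64, after: f64) -> f64 {
    (before - after) / before
}
```

With the numbers from this report, `times_over_tolerance(6.77e-1, 1.0e-6)` is about 677,000 and `relative_reduction(6.34e-1, 4.16e-1)` is about 0.34.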


Root Cause Hypothesis

The #177 fix addressed:

  1. ✅ NaN/Inf detection (Jidoka working)
  2. ✅ Some quantization parameter handling

But did NOT address:

  1. ❌ Quantization scale/offset precision loss
  2. ❌ Block-wise quantization metadata transfer
  3. ❌ Q4_K_M super-block structure preservation

Technical Detail

Q4_K_M uses a two-level quantization structure:

Super-block (256 elements):
  - Scale d (fp16)
  - Min dmin (fp16)
  - 8× Sub-blocks of 32 elements each
    - Sub-scale (6-bit)
    - Sub-min (6-bit)
    - 4-bit quantized weights

The 8 × (6 + 6) bits of sub-block scales and mins pack into 12 bytes, and the 256 × 4-bit weights pack into 128 bytes. If the super-block scale or min is truncated or misaligned during conversion, every weight in that block is scaled or shifted by the wrong amount, which would explain the large cumulative diffs we observe.
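
To make the dependence on the super-block scale concrete, here is a minimal dequantization sketch for one super-block. It assumes the ggml-style Q4_K layout (8 sub-blocks of 32 weights, 6-bit scales and mins packed into 12 bytes, two 4-bit weights per byte) and takes `d` and `dmin` already widened to f32. The function names are hypothetical, and this is not the apr-rosetta implementation.

```rust
/// Unpack the 6-bit scale and 6-bit min of sub-block `j` (0..8) from the
/// 12 packed bytes. Assumes the ggml-style Q4_K packing.
fn scale_min(j: usize, s: &[u8; 12]) -> (u8, u8) {
    if j < 4 {
        (s[j] & 63, s[j + 4] & 63)
    } else {
        (
            (s[j + 4] & 0x0F) | ((s[j - 4] >> 6) << 4),
            (s[j + 4] >> 4) | ((s[j] >> 6) << 4),
        )
    }
}

/// Dequantize one 256-element super-block.
/// `d` and `dmin` are the super-block scale and min as f32.
fn dequantize_super_block(d: f32, dmin: f32, scales: &[u8; 12], qs: &[u8; 128]) -> [f32; 256] {
    let mut out = [0.0f32; 256];
    // Each 32-byte chunk of `qs` holds two sub-blocks:
    // the low nibbles first, then the high nibbles.
    for chunk in 0..4 {
        let q = &qs[chunk * 32..(chunk + 1) * 32];
        let (sc_lo, m_lo) = scale_min(2 * chunk, scales);
        let (sc_hi, m_hi) = scale_min(2 * chunk + 1, scales);
        let base = chunk * 64;
        for l in 0..32 {
            out[base + l] = d * sc_lo as f32 * (q[l] & 0x0F) as f32 - dmin * m_lo as f32;
            out[base + 32 + l] = d * sc_hi as f32 * (q[l] >> 4) as f32 - dmin * m_hi as f32;
        }
    }
    out
}
```

Every output value is multiplied by `d`, so an error in that one field propagates to all 256 weights of the super-block; an error in `dmin` shifts every weight whose sub-block min is non-zero.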


Suggested Additional Fixes

1. Preserve Full Quantization Metadata

use half::f16; // f16 type from the `half` crate

#[repr(C)] // Fixed field order so the layout matches the on-disk block
struct Q4KMSuperBlock {
    d: f16,            // Super-block scale - MUST be preserved bit-exact
    dmin: f16,         // Super-block min - MUST be preserved bit-exact
    scales: [u8; 12],  // Packed 6-bit sub-block scales and mins - MUST be preserved bit-exact
    qs: [u8; 128],     // 4-bit quantized values, two per byte
}

// During conversion, ensure:
// 1. d and dmin are copied as raw f16 bits, NOT converted through another float type or recomputed
// 2. scales array is copied bit-exact, not recomputed
// 3. Block alignment matches source format
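
One way to guarantee the bit-exact copy described above is to never interpret the fields as floats during conversion. The sketch below is a hypothetical raw view of a super-block that stores `d` and `dmin` as `u16` bit patterns. It assumes a 144-byte little-endian block (2 + 2 + 12 + 128 bytes); `RawQ4KBlock` and its methods are illustrative names, not an existing API.

```rust
/// Size of one super-block in bytes: 2 + 2 + 12 + 128 (assumed layout).
const Q4K_BLOCK_BYTES: usize = 144;

/// Hypothetical raw view of a super-block. `d` and `dmin` are kept as raw
/// f16 bit patterns, so no float conversion can alter them.
#[derive(Clone, Debug, PartialEq, Eq)]
struct RawQ4KBlock {
    d_bits: u16,
    dmin_bits: u16,
    scales: [u8; 12],
    qs: [u8; 128],
}

impl RawQ4KBlock {
    /// Parse a block from its serialized bytes (little-endian assumed).
    fn from_bytes(b: &[u8; Q4K_BLOCK_BYTES]) -> Self {
        let mut scales = [0u8; 12];
        scales.copy_from_slice(&b[4..16]);
        let mut qs = [0u8; 128];
        qs.copy_from_slice(&b[16..144]);
        Self {
            d_bits: u16::from_le_bytes([b[0], b[1]]),
            dmin_bits: u16::from_le_bytes([b[2], b[3]]),
            scales,
            qs,
        }
    }

    /// Serialize the block back to bytes without touching any field value.
    fn to_bytes(&self) -> [u8; Q4K_BLOCK_BYTES] {
        let mut out = [0u8; Q4K_BLOCK_BYTES];
        out[0..2].copy_from_slice(&self.d_bits.to_le_bytes());
        out[2..4].copy_from_slice(&self.dmin_bits.to_le_bytes());
        out[4..16].copy_from_slice(&self.scales);
        out[16..144].copy_from_slice(&self.qs);
        out
    }
}
```

A round trip through `from_bytes` and `to_bytes` should reproduce the input bytes exactly, which makes a cheap unit test for any container-to-container copy path.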

2. Add Tensor-Level Validation

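const EPSILON: f32 = 1e-6; // Tolerance from the spec (Section 4)

// `Tensor` and `ConversionError` are project types; the tensor methods used here are illustrative.
fn validate_conversion(source: &Tensor, converted: &Tensor) -> Result<()> {
    // Largest absolute element-wise difference after dequantizing both sides to f32
    let diff = (source.to_f32() - converted.to_f32()).abs().max();
    // Written as `!(diff <= EPSILON)` so that a NaN diff is rejected too
    if !(diff <= EPSILON) {
        return Err(ConversionError::LossyConversion {
            diff,
            tolerance: EPSILON,
            tensor_name: source.name.clone(),
        });
    }
    Ok(())
}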
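
For reference, the same check can be written without any tensor library. The sketch below works on plain `f32` slices of already-dequantized values; the function names and the `String` error type are placeholders chosen for illustration, not the project's API.

```rust
/// Largest absolute element-wise difference, or an error if the lengths
/// differ or a non-finite value (NaN/Inf) is present on either side.
fn max_abs_diff(source: &[f32], converted: &[f32]) -> Result<f32, String> {
    if source.len() != converted.len() {
        return Err(format!(
            "length mismatch: {} vs {}",
            source.len(),
            converted.len()
        ));
    }
    let mut max = 0.0f32;
    for (i, (a, b)) in source.iter().zip(converted).enumerate() {
        if !a.is_finite() || !b.is_finite() {
            return Err(format!("non-finite value at index {i}"));
        }
        max = max.max((a - b).abs());
    }
    Ok(max)
}

/// Fail when the largest difference exceeds `epsilon`.
fn check_within_tolerance(source: &[f32], converted: &[f32], epsilon: f32) -> Result<(), String> {
    let diff = max_abs_diff(source, converted)?;
    if diff > epsilon {
        return Err(format!(
            "lossy conversion: diff {diff:e} exceeds tolerance {epsilon:e}"
        ));
    }
    Ok(())
}
```

Rejecting non-finite values before comparing keeps a NaN from slipping through, since any comparison against NaN evaluates to false.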

3. Test Each Quantization Type Separately

# Test suite should cover:
apr rosetta convert model_q4_k_m.gguf test.apr && apr rosetta convert test.apr model_back.gguf
apr rosetta convert model_q5_k_m.gguf test.apr && apr rosetta convert test.apr model_back.gguf
apr rosetta convert model_q8_0.gguf test.apr && apr rosetta convert test.apr model_back.gguf
apr rosetta convert model_f16.gguf test.apr && apr rosetta convert test.apr model_back.gguf
# All should produce diff < 1e-6
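
If the suite is driven from Rust rather than a shell script, the command pairs can be generated per quantization type. This sketch only builds the argument lists for the `apr rosetta convert <input> <output>` form shown above. The per-type intermediate and output file names are hypothetical, chosen so that runs do not overwrite each other.

```rust
/// Build the two `apr rosetta convert` invocations for a
/// GGUF -> APR -> GGUF round trip of one quantization type.
/// File names are hypothetical.
fn round_trip_commands(quant: &str) -> [Vec<String>; 2] {
    let src = format!("model_{quant}.gguf");
    let apr = format!("test_{quant}.apr");
    let back = format!("model_{quant}_back.gguf");
    let cmd = |input: &str, output: &str| -> Vec<String> {
        ["apr", "rosetta", "convert", input, output]
            .iter()
            .map(|s| s.to_string())
            .collect()
    };
    [cmd(&src, &apr), cmd(&apr, &back)]
}

/// Command pairs for every quantization type listed in the test suite.
fn all_round_trips() -> Vec<[Vec<String>; 2]> {
    ["q4_k_m", "q5_k_m", "q8_0", "f16"]
        .iter()
        .map(|q| round_trip_commands(q))
        .collect()
}
```

Each argument list can be passed to `std::process::Command` (first element as the program, the rest as arguments), followed by whatever diff check the suite uses.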

MQS Impact

| Metric | Current | Required |
|---|---|---|
| Score | 41.1/100 | 87+/100 |
| Grade | F | B or higher |
| Conversion gates | 0/7 | 7/7 |
| Lost points | ~45 | 0 |

Verification Criteria

Issue is resolved when:

cd ../apr-model-qa-playbook
cargo run --bin apr-qa -- run playbooks/models/qwen2.5-coder-1.5b-ci.playbook.yaml \
  --subprocess --model-path <model.gguf> --no-gpu --output output/verify

# Required:
# - F-CONV-001 through F-CONV-006: ALL PASS (diff < 1e-6)
# - F-CONV-RT-001: PASS (round-trip lossless)
# - MQS Score: 87+/100
# - Pass rate: 100%

References

  • Original issue: P0 CRITICAL: Format conversion introduces NaN/Inf corruption in tensor weights #177 (CLOSED - but regression detected)
  • Evidence file: ../apr-model-qa-playbook/output/qwen-requalify/evidence.json
  • MQS report: ../apr-model-qa-playbook/output/qwen-requalify/mqs.json
  • Verification playbook: ../apr-model-qa-playbook/playbooks/verify/TICKET-177.yaml
  • Spec: Section 4 (Format Conversion Testing), tolerance = 1e-6

Filed by: apr-model-qa-playbook requalification (automated)
Related: #177 (regression), #172 (original P0)
