P0 CRITICAL: Format Conversion Introduces NaN/Inf Corruption
Status: OPEN
Severity: P0 (CRITICAL - Data Corruption)
Component: apr-rosetta / realizear
Discovered By: apr-model-qa-playbook (Popperian Falsification)
Date: 2026-01-30
Blocking: Model qualification certification
Executive Summary
Format conversion via apr rosetta convert introduces catastrophic numerical corruption: NaN values, Inf values, and tensor weight explosions (means of roughly 10^34 to 10^35 in the logged tensors, against an expected range of [-0.1, 0.1]). This affects ALL conversion paths and renders converted models unusable. This is a data integrity violation that blocks model certification.
{
"gate_id": "F-CONV-G-S",
"outcome": "Falsified",
"reason": "Conversion infrastructure error: No tokenizer found for converted.safetensors",
"error": "[PMAT-172] ERROR: No tokenizer found... config.json not found (required for SafeTensors inference)",
"timestamp": "2026-01-30T13:03:59.608739501Z"
}
Test 4: F-CONV-RT-001 (Round-Trip) - CATASTROPHIC
Round-trip failed: Validation failed (75 errors)
SAMPLE OF 75 TENSOR CORRUPTION ERRORS:
Layer blk.0.attn_k.weight:
- mean = 342663942034581145229192610889859072.0000 (expected: [-0.1, 0.1])
- contains 322 NaN values
Layer blk.0.attn_output.weight:
- mean = 600104090742549741985281036230066176.0000
- contains 1880 NaN values
Layer blk.0.attn_q.weight:
- mean = 358825298765053356133211673599148032.0000
- contains 2194 NaN values
Layer blk.0.ffn_gate.weight:
- mean = 248127115888904366664980114610061312.0000
- contains 12864 NaN values
Layer blk.1.ffn_down.weight:
- mean = 165087563774647571714992342227222528.0000
- contains 7387 NaN values
- contains 1 Inf values <-- INFINITY INTRODUCED
[... 65 more tensor corruption errors across ALL 28 layers ...]
Five Whys Root Cause Analysis
| Why # | Question | Answer |
| --- | --- | --- |
| Why 1 | Why do converted models produce different output? | Because tensor weights are corrupted during conversion |
| Why 2 | Why are tensor weights corrupted? | Because NaN and Inf values are introduced in dequantization/requantization |
| Why 3 | Why are NaN/Inf values introduced? | Likely integer overflow or division by zero in quantization scaling factors |
| Why 4 | Why does scaling overflow? | Q4_K_M uses block-wise scaling; conversion may not preserve scale bounds |
| Why 5 | Why aren't scale bounds preserved? | ROOT CAUSE: Quantization metadata (scales, mins, block structure) not correctly transferred between formats |
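To make the "division by zero in quantization scaling factors" suspicion from Why 3 concrete, here is a minimal sketch of a generic symmetric block quantizer. It is not the Q4_K_M scheme and not code from apr-rosetta; `block_scale`, `quantize_block`, and `dequantize_block` are hypothetical names. In IEEE-754 float arithmetic (as used for f32 in Rust), `0.0 / 0.0` silently yields NaN and `x / 0.0` yields ±Inf, whereas Python raises `ZeroDivisionError`, so the explicit zero-scale guard is the point of the example.

```python
def block_scale(block, qmax=7):
    """Per-block scale for a symmetric quantizer: largest magnitude maps to qmax."""
    return max(abs(v) for v in block) / qmax


def quantize_block(block, qmax=7):
    """Quantize one block to integers in [-qmax, qmax], guarding a zero scale."""
    scale = block_scale(block, qmax)
    if scale == 0.0:
        # All-zero block: dividing by the scale would be 0/0 (NaN in f32 code).
        return 0.0, [0] * len(block)
    return scale, [round(v / scale) for v in block]


def dequantize_block(scale, quants):
    """Reverse the mapping; a zero scale simply reproduces zeros."""
    return [scale * q for q in quants]


scale, quants = quantize_block([7.0, -3.0, 0.0])
print(scale, quants, dequantize_block(scale, quants))
```

If the converter computes an inverse scale without such a guard, an all-zero block is one plausible way to get NaN into an otherwise healthy tensor.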
Hypothesis
The GGUF Q4_K_M format stores quantization parameters (scales, minimums) in a specific block structure. When converting to APR format, these parameters are either:
- Lost entirely (causing dequantization to fail)
- Misinterpreted (causing incorrect scaling)
- Truncated (causing overflow on large values)
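The "misinterpreted" case can be illustrated with a deliberately simplified sketch. It assumes a block header holding only two little-endian float16 fields (`d`, `dmin`) and the Q4_K-style formula `w = d * sc * q - dmin * m`. A real Q4_K super-block also carries packed 6-bit sub-block scales and minimums plus the 4-bit quants, and the APR layout is not described in this ticket. The float32 misread below is a hypothetical bug chosen for illustration, not a confirmed defect in apr rosetta.

```python
import struct


def dequant(d, dmin, sc, m, q):
    """Simplified Q4_K-style dequantization of a single 4-bit value q."""
    return d * sc * q - dmin * m


# Hypothetical header: super-block scale d and minimum dmin as float16.
header = struct.pack("<ee", 0.5, 0.25)

# Correct interpretation: two float16 fields.
d, dmin = struct.unpack("<ee", header)
good = dequant(d, dmin, sc=3, m=2, q=5)

# Hypothetical bug: the same four bytes read as one float32 scale.
(bad_d,) = struct.unpack("<f", header)
bad = dequant(bad_d, dmin, sc=3, m=2, q=5)

print("correct weight:", good)
print("misread scale:", bad_d, "-> weight:", bad)
```

Here the misread scale happens to come out tiny, so the weight is merely wrong. With other byte patterns the same reinterpretation can land on very large magnitudes, Inf, or NaN, which is why comparing raw scale fields before and after conversion would be a useful diagnostic.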
Impact Assessment
Severity: P0 CRITICAL
| Impact | Description |
| --- | --- |
| Data Integrity | Converted models produce corrupted output |
| Silent Corruption | Users may not realize output is wrong without validation |
| Certification Blocked | Models cannot pass MQS qualification (89.3% → should be 100%) |
blk.0.attn_k.weight: mean=342663942034581145229192610889859072.0000, 322 NaN
blk.0.attn_output.weight: mean=600104090742549741985281036230066176.0000, 1880 NaN
blk.0.attn_q.weight: mean=358825298765053356133211673599148032.0000, 2194 NaN
blk.0.attn_v.weight: mean=104987109912991073240180736066060288.0000, 173 NaN
blk.0.ffn_down.weight: mean=182500428367235730661723423562006528.0000, 7219 NaN
blk.0.ffn_gate.weight: mean=248127115888904366664980114610061312.0000, 12864 NaN
blk.0.ffn_up.weight: mean=227714970169135588388223435105370112.0000, 13612 NaN
blk.1.attn_k.weight: mean=817014914460021388316716866687991808.0000, 291 NaN
blk.1.attn_output.weight: mean=232121838638585256506020333482934272.0000, 2414 NaN
blk.1.attn_q.weight: mean=658610800790589613328103111479263232.0000, 1917 NaN
blk.1.attn_v.weight: mean=93900614097166921650294139165605888.0000, 224 NaN
blk.1.ffn_down.weight: mean=165087563774647571714992342227222528.0000, 7387 NaN, 1 Inf
blk.1.ffn_gate.weight: mean=350491169507934119065240395147378688.0000, 11249 NaN
blk.1.ffn_up.weight: mean=138885721043945784138165446621790208.0000, 14399 NaN
blk.10.attn_k.weight: mean=546944489382597570786647242189045760.0000, 332 NaN
blk.10.attn_output.weight: mean=428728821534445123910189859016278016.0000, 2136 NaN
blk.10.attn_q.weight: mean=384735443208418250985103538765430784.0000, 2094 NaN
blk.10.attn_v.weight: mean=56025798981149455007586636760350720.0000, 235 NaN
blk.10.ffn_down.weight: mean=173893259055051710120320394271391744.0000, 7033 NaN
blk.10.ffn_gate.weight: mean=291361052663150758782063519324962816.0000, 12879 NaN
blk.10.ffn_up.weight: mean=159597864201074824329220080294428672.0000, 14240 NaN
blk.11.attn_k.weight: mean=512004711257481969379219173002969088.0000, 321 NaN
blk.11.attn_output.weight: mean=347848197234620775027457362508120064.0000, 2195 NaN
blk.11.attn_q.weight: mean=375089850177200396336946327303749632.0000, 2151 NaN
blk.11.attn_v.weight: mean=94796645001121994176308324471930880.0000, 377 NaN
blk.11.ffn_down.weight: mean=208739567755441108175472479982059520.0000, 13508 NaN
blk.11.ffn_gate.weight: mean=308981455427445033161120889037651968.0000, 12461 NaN
blk.11.ffn_up.weight: mean=174119356423826791973727970319663104.0000, 13798 NaN
blk.12.attn_k.weight: mean=493066407430884793442546394834403328.0000, 293 NaN
blk.12.attn_output.weight: mean=346576466384023061012574591789301760.0000, 1977 NaN
blk.12.attn_q.weight: mean=325108011187532853454851987566755840.0000, 2189 NaN
blk.12.attn_v.weight: mean=272807144074032164845310218017439744.0000, 437 NaN
blk.12.ffn_down.weight: mean=252127306200168315937909351890550784.0000, 12964 NaN
blk.12.ffn_gate.weight: mean=304284512847990014737309627871920128.0000, 12356 NaN
blk.12.ffn_up.weight: mean=164185155003610100909801876632895488.0000, 13808 NaN
blk.13.attn_k.weight: mean=379946378082999771702755384371445760.0000, 304 NaN
blk.13.attn_output.weight: mean=170910596034958457735105041945067520.0000, 2280 NaN
[... additional layers truncated for brevity ...]
Filed by: apr-model-qa-playbook automated falsification system
Ticket Template Version: 1.1.0
Reproduction
Environment
Minimal Reproduction Commands
Expected Behavior
Actual Behavior
Detailed Evidence
Test 1: F-CONV-G-A (GGUF → APR)
{
"gate_id": "F-CONV-G-A",
"outcome": "Falsified",
"reason": "Conversion Gguf → Apr produced different output (diff: 8.46e-1, ε: 1.00e-6)",
"output_hash": "951de74e85c8f75d",
"timestamp": "2026-01-30T13:03:35.131895019Z"
}
Test 2: F-CONV-A-G (APR → GGUF)
{
"gate_id": "F-CONV-A-G",
"outcome": "Falsified",
"reason": "Conversion Apr → Gguf produced different output (diff: 6.34e-1, ε: 1.00e-6)",
"output_hash": "95f05bc3d19b1b6b",
"timestamp": "2026-01-30T13:03:47.643921413Z"
}
Test 3: F-CONV-G-S (GGUF → SafeTensors)
{
"gate_id": "F-CONV-G-S",
"outcome": "Falsified",
"reason": "Conversion infrastructure error: No tokenizer found for converted.safetensors",
"error": "[PMAT-172] ERROR: No tokenizer found... config.json not found (required for SafeTensors inference)",
"timestamp": "2026-01-30T13:03:59.608739501Z"
}
Test 4: F-CONV-RT-001 (Round-Trip) - CATASTROPHIC
Affected Gates (All P0)
MQS Impact
Suggested Fix
Immediate (P0 Hotfix)
Add tensor validation after every conversion step:
Fail fast on corruption - do not write corrupted files
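The two hotfix steps above could look roughly like the following sketch. It is not the ticket's original snippet, and `validate_tensor` and `convert_and_check` are hypothetical names. The `[-0.1, 0.1]` mean range is taken from the round-trip validation output in this ticket. Tensors are modelled as plain lists of floats, and `write` stands in for whatever function persists the converted file.

```python
import math


def validate_tensor(name, values, mean_range=(-0.1, 0.1)):
    """Return a list of human-readable errors for one tensor (empty if clean)."""
    errors = []
    nan_count = sum(1 for v in values if math.isnan(v))
    inf_count = sum(1 for v in values if math.isinf(v))
    if nan_count:
        errors.append(f"{name}: contains {nan_count} NaN values")
    if inf_count:
        errors.append(f"{name}: contains {inf_count} Inf values")
    finite = [v for v in values if math.isfinite(v)]
    if finite:
        mean = math.fsum(finite) / len(finite)
        lo, hi = mean_range
        if not lo <= mean <= hi:
            errors.append(f"{name}: mean = {mean:.4f} (expected: [{lo}, {hi}])")
    return errors


def convert_and_check(tensors, write):
    """Validate every tensor and call write() only if all of them pass."""
    errors = []
    for name, values in tensors.items():
        errors.extend(validate_tensor(name, values))
    if errors:
        # Fail fast: nothing is written when any tensor is corrupted.
        raise ValueError("Validation failed (%d errors)\n%s"
                         % (len(errors), "\n".join(errors)))
    write(tensors)
```

Running the check before the write, rather than after, means a corrupted conversion never produces a file that a downstream user could mistake for a valid model.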
Short-term
realizear/src/convert/
Long-term
--verify flag that runs inference comparison automatically
Verification
Once fixed, verify with:
References
playbooks/models/qwen2.5-coder-1.5b-ci.playbook.yaml
output/qwen-full/evidence.json
docs/specifications/apr-playbook-spec.md Section 4 (Format Conversion Testing)
Appendix: Full Tensor Corruption Log