Summary
SafeTensors inference on Qwen2.5-Coder-0.5B-Instruct produces garbage output, while GGUF inference on the same model produces correct output. The root cause appears to be incorrect layer-count detection in the SafeTensors loader, which reports 14 layers instead of 24.
apr version: 0.2.12 (commit f7f7ca8)
Model: Qwen/Qwen2.5-Coder-0.5B-Instruct
QA Gate: F-QUAL-001 (Garbage Output Detection)
Reproduction
# SafeTensors - GARBAGE OUTPUT
apr run --prompt "What is 2+2?" --max-tokens 10 \
/path/to/safetensors/model.safetensors --verbose
# Output:
# Architecture: SafeTensors (14 layers, vocab_size=151936) # <-- WRONG: should be 24
# Output: çī¹åĪ«æĺ¯åľ¨âĢĶevenâĢĶevenâĢĶevenâĢĶallthoughâĢĶeveninders-associated/dis
# GGUF - CORRECT OUTPUT
apr run --prompt "What is 2+2?" --max-tokens 10 \
/path/to/gguf/model.gguf --verbose
# Output:
# Architecture: Qwen2 [GGUF: qwen2] (24 layers, vocab_size=151936) # <-- CORRECT
# Output: 2 + 2 equals 4.
Root Cause Analysis
The SafeTensors loader detects 14 layers but the model actually has 24 layers.
Evidence: Verbose Output Comparison
| Format | Detected Layers | Actual Layers | Output Quality |
|---|---|---|---|
| SafeTensors | 14 | 24 | Garbage |
| GGUF | 24 | 24 | Correct |
SafeTensors Verbose Output
Source: /home/noah/.cache/pacha/models/qwen2-5-coder-0-5b-instruct/safetensors/model.safetensors
Using mmap for 942MB model
Loading SafeTensors model: ...
Architecture: SafeTensors (14 layers, vocab_size=151936) # BUG: Wrong layer count
Config: hidden_size=896, context_length=32768, quant=F16/BF16, threads=1 (GPU)
Model loaded in 2029.7ms
Backend: GPU (NVIDIA GeForce RTX 4090, 24045 MB VRAM)
GGUF Verbose Output
Source: /home/noah/.cache/pacha/models/qwen2-5-coder-0-5b-instruct/gguf/model.gguf
Using mmap for 468MB model
Loading model: ...
Architecture: Qwen2 [GGUF: qwen2] (24 layers, vocab_size=151936) # Correct
Config: hidden_size=896, context_length=32768, quant=Q8_0, threads=48
Model loaded in 545.4ms
Backend: CPU (Q4_0 format - GPU Q4_K kernels incompatible)
Tensor Verification
Both files have the same layer structure:
# SafeTensors layers (verified via apr tensors | grep model.layers | unique)
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 # 24 layers
# GGUF layers (from rosetta inspect metadata)
n_layers: 24
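The SafeTensors layer count can also be checked independently of apr by reading the file header directly. The following is a minimal sketch, assuming the standard SafeTensors layout (an 8-byte little-endian header length followed by a JSON header keyed by tensor name) and the `model.layers.<N>.` naming shown above. The function names are illustrative and are not part of apr.

```python
import json
import re
import struct
import sys

# Matches the layer index in names such as "model.layers.23.mlp.up_proj.weight".
LAYER_RE = re.compile(r"(?:^|\.)layers\.(\d+)\.")


def read_safetensors_header(path):
    """Return the JSON header of a SafeTensors file as a dict."""
    with open(path, "rb") as f:
        (header_len,) = struct.unpack("<Q", f.read(8))  # little-endian u64
        return json.loads(f.read(header_len))


def layer_indices(tensor_names):
    """Collect the distinct layer indices found in tensor names, sorted numerically."""
    found = set()
    for name in tensor_names:
        match = LAYER_RE.search(name)
        if match:
            found.add(int(match.group(1)))
    return sorted(found)


def count_layers(path):
    """Layer count = highest layer index + 1 (0 if no layer tensors are present)."""
    header = read_safetensors_header(path)
    names = [key for key in header if key != "__metadata__"]
    indices = layer_indices(names)
    return indices[-1] + 1 if indices else 0


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python count_layers.py /path/to/safetensors/model.safetensors
    print(count_layers(sys.argv[1]))
```

For the file in this report, a count other than 24 would point at the file itself, while a count of 24 would confirm that the discrepancy is introduced by the loader.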
Hypothesis
The SafeTensors loader is incorrectly calculating the layer count. Possible causes:
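- Early termination in the layer-counting loop - stops at layer 13 instead of 23 (a plain off-by-one would report 23 or 25, not 14)
- Hardcoded or heuristic layer count - using a guessed value instead of one read from the model files
- Metadata parsing bug - misreading the num_hidden_layers field in config.json (for example, picking up num_attention_heads, which is 14 for this model, instead)
- Tensor name pattern mismatch - failing to match the naming convention of layers 14-23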
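One way a loader could guard against all of the causes listed above is to derive the layer count from the tensor names and cross-check it against config.json, failing loudly on any disagreement. This is a hypothetical sketch, not apr's actual loader code: `resolve_layer_count` and `layers_from_names` are made-up names, and the `model.layers.<N>.` pattern is assumed from the tensor listing in this report.

```python
import re

# Assumed naming convention: "model.layers.<N>.<rest>".
LAYER_RE = re.compile(r"(?:^|\.)layers\.(\d+)\.")


def layers_from_names(tensor_names):
    """Derive the layer count from tensor names, rejecting gaps in the indices."""
    indices = set()
    for name in tensor_names:
        match = LAYER_RE.search(name)
        if match:
            indices.add(int(match.group(1)))  # numeric, not lexicographic
    if not indices:
        return 0
    missing = set(range(max(indices) + 1)) - indices
    if missing:
        raise ValueError(f"gaps in layer indices: {sorted(missing)}")
    return max(indices) + 1


def resolve_layer_count(config, tensor_names):
    """Return the layer count, raising if config.json and the tensors disagree."""
    from_names = layers_from_names(tensor_names)
    from_config = config.get("num_hidden_layers")
    if from_config is not None and from_config != from_names:
        raise ValueError(
            f"config.json says {from_config} layers, tensors imply {from_names}"
        )
    return from_names


if __name__ == "__main__":
    names = [f"model.layers.{i}.mlp.up_proj.weight" for i in range(24)]
    print(resolve_layer_count({"num_hidden_layers": 24}, names))
```

With a check like this, a loader that arrived at 14 layers for a file containing layers 0 through 23 would stop with an error at load time instead of silently producing garbage output.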
Impact
- Qwen2.5-Coder-0.5B MVP certification BLOCKED at MQS 270 (was targeting 800+)
- All 6 SafeTensors inference tests fail (3 modalities × 2 backends)
- GGUF and APR inference pass correctly
Model File Details
SafeTensors:
File Size: 988097824 bytes (943 MB)
Total Parameters: 494032768
Tensors: 290
Data Type: BF16
GGUF:
File Size: 491400064 bytes (468 MB)
Total Parameters: 630167424
Tensors: 291
Quantization: Q8_0
Architecture: qwen2
n_layers: 24
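The SafeTensors figures above can be sanity-checked with simple arithmetic, which suggests the file itself is complete and the fault lies in the loader. The sketch below uses only the numbers reported in this section, plus one assumption that is not stated in the report: that each Qwen2 layer contributes 12 tensors (q/k/v weights and biases, the output projection, three MLP weights, and two norms), with the token embedding and final norm as the only non-layer tensors.

```python
# Figures reported above for the SafeTensors file.
FILE_SIZE_BYTES = 988_097_824
TOTAL_PARAMETERS = 494_032_768
REPORTED_TENSORS = 290
BYTES_PER_BF16 = 2

# ASSUMPTION (not from the report): standard Qwen2 layout with tied embeddings.
TENSORS_PER_LAYER = 12
NON_LAYER_TENSORS = 2  # token embedding + final norm

# All parameters stored as BF16 account for nearly the entire file.
tensor_bytes = TOTAL_PARAMETERS * BYTES_PER_BF16
header_bytes = FILE_SIZE_BYTES - tensor_bytes  # length prefix + JSON header


def expected_tensor_count(num_layers):
    """Tensor count implied by a given layer count under the assumed layout."""
    return num_layers * TENSORS_PER_LAYER + NON_LAYER_TENSORS


if __name__ == "__main__":
    print("tensor data bytes:", tensor_bytes)
    print("remaining header bytes:", header_bytes)
    print("tensors implied by 24 layers:", expected_tensor_count(24))
    print("tensors implied by 14 layers:", expected_tensor_count(14))
    print("matches report at 24 layers:", expected_tensor_count(24) == REPORTED_TENSORS)
```

Under that assumption, 24 layers account for exactly the 290 reported tensors, whereas 14 layers would account for only 170, and the roughly 32 KB left after the BF16 data is a plausible size for the header.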
Environment
- GPU: NVIDIA GeForce RTX 4090 (24GB VRAM)
- OS: Linux 6.8.0-90-generic
- apr version: 0.2.12
Related