SafeTensors inference produces garbage: layer count misdetection (14 vs 24) #197

@noahgift

Summary

SafeTensors inference on Qwen2.5-Coder-0.5B-Instruct produces garbage output, while GGUF inference on the same model produces correct output. The root cause appears to be incorrect layer count detection in the SafeTensors loader: it reports 14 layers where the model has 24.

apr version: 0.2.12 (commit f7f7ca8)
Model: Qwen/Qwen2.5-Coder-0.5B-Instruct
QA Gate: F-QUAL-001 (Garbage Output Detection)

Reproduction

# SafeTensors - GARBAGE OUTPUT
apr run --prompt "What is 2+2?" --max-tokens 10 \
  /path/to/safetensors/model.safetensors --verbose

# Output:
# Architecture: SafeTensors (14 layers, vocab_size=151936)  # <-- WRONG: should be 24
# Output: çī¹åĪ«æĺ¯åľ¨âĢĶevenâĢĶevenâĢĶevenâĢĶallthoughâĢĶeveninders-associated/dis

# GGUF - CORRECT OUTPUT
apr run --prompt "What is 2+2?" --max-tokens 10 \
  /path/to/gguf/model.gguf --verbose

# Output:
# Architecture: Qwen2 [GGUF: qwen2] (24 layers, vocab_size=151936)  # <-- CORRECT
# Output: 2 + 2 equals 4.
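
The garbage string above resembles token text written in the byte-to-character alphabet used by GPT-2 style byte-level BPE tokenizers. Assuming that is what it is (not confirmed for apr), the sketch below maps such text back to raw bytes so it can be read during triage. The function names are hypothetical and are not part of apr.

```python
def bytes_to_unicode():
    """Build the GPT-2 style table: byte value -> printable character."""
    # Bytes that are already printable map to themselves:
    # '!'..'~', then 0xA1..0xAC, then 0xAE..0xFF.
    bs = list(range(0x21, 0x7F)) + list(range(0xA1, 0xAD)) + list(range(0xAE, 0x100))
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:
            # Remaining bytes are shifted into code points 256 and up.
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, map(chr, cs)))


UNICODE_TO_BYTE = {ch: b for b, ch in bytes_to_unicode().items()}


def decode_token_text(text):
    """Map byte-level BPE token text back to UTF-8 (assumes that alphabet)."""
    raw = bytes(UNICODE_TO_BYTE[ch] for ch in text)
    return raw.decode("utf-8", errors="replace")
```

Applied to the first twelve characters of the SafeTensors output, `decode_token_text("çī¹åĪ«æĺ¯åľ¨")` returns `特别是在`. If the assumption holds, the SafeTensors path is printing raw token strings in addition to picking the wrong tokens, which may be worth checking separately from the layer count.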

Root Cause Analysis

The SafeTensors loader detects 14 layers but the model actually has 24 layers.

Evidence: Verbose Output Comparison

Format      | Detected Layers | Actual Layers | Output Quality
SafeTensors | 14              | 24            | Garbage
GGUF        | 24              | 24            | Correct

SafeTensors Verbose Output

Source: /home/noah/.cache/pacha/models/qwen2-5-coder-0-5b-instruct/safetensors/model.safetensors
Using mmap for 942MB model
Loading SafeTensors model: ...
Architecture: SafeTensors (14 layers, vocab_size=151936)   # BUG: Wrong layer count
Config: hidden_size=896, context_length=32768, quant=F16/BF16, threads=1 (GPU)
Model loaded in 2029.7ms
Backend: GPU (NVIDIA GeForce RTX 4090, 24045 MB VRAM)

GGUF Verbose Output

Source: /home/noah/.cache/pacha/models/qwen2-5-coder-0-5b-instruct/gguf/model.gguf
Using mmap for 468MB model
Loading model: ...
Architecture: Qwen2 [GGUF: qwen2] (24 layers, vocab_size=151936)  # Correct
Config: hidden_size=896, context_length=32768, quant=Q8_0, threads=48
Model loaded in 545.4ms
Backend: CPU (Q4_0 format - GPU Q4_K kernels incompatible)

Tensor Verification

Both files have the same layer structure:

# SafeTensors layers (verified via apr tensors | grep model.layers | unique)
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23  # 24 layers

# GGUF layers (from rosetta inspect metadata)
n_layers: 24
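
The tensor listing can also be checked without apr. A minimal sketch using only the Python standard library: a SafeTensors file starts with an 8-byte little-endian header length followed by a JSON header keyed by tensor name. The `model.layers.N.` prefix comes from the grep above; the function names are illustrative.

```python
import json
import re
import struct

# Matches names such as "model.layers.17.mlp.up_proj.weight".
LAYER_RE = re.compile(r"^model\.layers\.(\d+)\.")


def read_tensor_names(path):
    """Return tensor names from the JSON header of a SafeTensors file."""
    with open(path, "rb") as f:
        (header_len,) = struct.unpack("<Q", f.read(8))
        header = json.loads(f.read(header_len))
    # "__metadata__" is an optional free-form entry, not a tensor.
    return [name for name in header if name != "__metadata__"]


def layer_indices(names):
    """Return the distinct layer indices, sorted numerically."""
    found = set()
    for name in names:
        match = LAYER_RE.match(name)
        if match:
            found.add(int(match.group(1)))
    return sorted(found)
```

If the listing above is accurate, `layer_indices(read_tensor_names(path))` on the reported file should return the integers 0 through 23. Parsing the index as an integer avoids any dependence on string ordering of names such as `model.layers.10` and `model.layers.2`.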

Hypothesis

The SafeTensors loader is incorrectly calculating the layer count. Possible causes:

  1. Early exit in layer counting loop - stops at layer 13 instead of 23 (10 layers short, so more than an off-by-one)
  2. Hardcoded assumption - assuming half the layers based on some heuristic
  3. Metadata parsing bug - misreading config.json num_hidden_layers field
  4. Tensor name pattern mismatch - not recognizing layers 14-23 naming convention
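
One way to narrow these hypotheses is to compare the loader's reported count against two independent sources: the distinct layer indices in the tensor names and the `num_hidden_layers` field in config.json. The sketch below is a hypothetical helper, not apr code, and the config.json value used in the usage note is assumed, since the report does not show that file.

```python
def diagnose_layer_count(detected, tensor_layer_count, config):
    """Compare the loader's layer count with two independent sources.

    detected: layer count reported by the loader (14 in this report).
    tensor_layer_count: number of distinct model.layers.N indices in the file.
    config: parsed config.json as a dict; num_hidden_layers may be absent.
    Returns a list of mismatch descriptions; an empty list means consistent.
    """
    problems = []
    config_layers = config.get("num_hidden_layers")

    # The two sources of truth disagree: the model files are inconsistent.
    if config_layers is not None and config_layers != tensor_layer_count:
        problems.append(
            f"config.json says {config_layers}, tensor names say {tensor_layer_count}"
        )
    # The loader disagrees with the tensors actually present in the file.
    if detected != tensor_layer_count:
        problems.append(
            f"loader detected {detected}, tensor names say {tensor_layer_count}"
        )
    # The loader disagrees with the declared configuration.
    if config_layers is not None and detected != config_layers:
        problems.append(
            f"loader detected {detected}, config.json says {config_layers}"
        )
    return problems
```

With the numbers from this report and an assumed config value, `diagnose_layer_count(14, 24, {"num_hidden_layers": 24})` returns two mismatches, both involving the loader. That outcome would point at the loader's own counting or parsing logic rather than at the model files.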

Impact

  • Qwen2.5-Coder-0.5B MVP certification BLOCKED at MQS 270 (was targeting 800+)
  • All 6 SafeTensors inference tests fail (3 modalities × 2 backends)
  • GGUF and APR inference pass correctly

Model File Details

SafeTensors:
  File Size: 988097824 bytes (943 MB)
  Total Parameters: 494032768
  Tensors: 290
  Data Type: BF16

GGUF:
  File Size: 491400064 bytes (468 MB)
  Total Parameters: 630167424
  Tensors: 291
  Quantization: Q8_0
  Architecture: qwen2
  n_layers: 24
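
The differing parameter and tensor counts need not mean the two files hold different models. A quick arithmetic check on the figures above shows the parameter gap is exactly vocab_size × hidden_size. One possible explanation, which is an assumption here and not verified against the files, is that the GGUF file stores a separate output projection tensor while the SafeTensors file shares it with the token embedding.

```python
# Figures copied from the report above.
VOCAB_SIZE = 151936
HIDDEN_SIZE = 896
SAFETENSORS_PARAMS = 494032768
GGUF_PARAMS = 630167424
SAFETENSORS_BYTES = 988097824

# Size of one vocab_size x hidden_size matrix (embedding or output projection).
embedding_params = VOCAB_SIZE * HIDDEN_SIZE

# Difference between the two reported parameter totals.
param_gap = GGUF_PARAMS - SAFETENSORS_PARAMS

# BF16 stores 2 bytes per parameter; the remainder is the length prefix and JSON header.
bf16_payload = SAFETENSORS_PARAMS * 2
header_overhead = SAFETENSORS_BYTES - bf16_payload

print(embedding_params)   # 136134656
print(param_gap)          # 136134656
print(header_overhead)    # 32288
```

The SafeTensors file size is also consistent with 494032768 BF16 parameters plus a header of roughly 32 KB, so the file itself does not appear to be truncated or missing layers.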

Environment

  • GPU: NVIDIA GeForce RTX 4090 (24GB VRAM)
  • OS: Linux 6.8.0-90-generic
  • apr version: 0.2.12
