apr-exported GGUFs use pre-tokenizer type 'llama' which llama.cpp rejects #277

@noahgift

Bug

apr-exported GGUFs cannot be loaded by llama.cpp (llama-completion). Each model fails
for a different reason, indicating multiple GGUF metadata/vocabulary issues in the
export path.

Failure Matrix (3 models, 3 distinct errors)

| Model | tokenizer.ggml.pre | llama.cpp Error | Severity |
|---|---|---|---|
| SmolLM-135M | llama (should be default) | unknown pre-tokenizer type: 'llama' | Rejected |
| Qwen2-0.5B | (crashes before reading) | GGML_ASSERT(id_to_token.size() == token_to_id.size()) failed → segfault in llama_vocab::impl::load | Crash |
| GPT-2 | gpt2 (correct!) | key not found in model: gpt2.attention.layer_norm_epsilon | Rejected |

Root Cause Analysis

SmolLM: apr hardcodes tokenizer.ggml.pre = "llama" for LLaMA-architecture models.
llama.cpp's convert_hf_to_gguf.py sets this to "default" for SmolLM. The
pre-tokenizer type must match what llama.cpp expects for each tokenizer class.

Qwen2: The vocabulary export produces a token table where id_to_token.size() != token_to_id.size(),
which indicates duplicate or missing token entries. llama.cpp hits a hard assertion failure
and crashes rather than reporting a graceful error.

GPT-2: apr writes gpt2.attention.layer_norm_rms_epsilon (RMS norm key, from LLaMA)
but GPT-2 uses standard LayerNorm, so llama.cpp expects gpt2.attention.layer_norm_epsilon.
The architecture-specific hyperparameter key is wrong.

Reproduction

# apr version
apr --version
# apr 0.2.18 (940ef71e)

# llama.cpp version
llama-completion --version
# build: 7746 (39173bcac)

# Generate apr GGUFs (these already exist if you've run make convert)
apr export models/smollm-135m-int4.apr --format gguf --output models/smollm-135m-int4.gguf
apr export models/qwen2-0.5b-int4.apr --format gguf --output models/qwen2-0.5b-int4.gguf
apr export models/gpt2-124m-int4.apr --format gguf --output models/gpt2-124m-int4.gguf

# Attempt to load each in llama-completion
llama-completion -m models/smollm-135m-int4.gguf -p "Hello" -n 4 --temp 0 --top-k 1 -s 42
# ERROR: unknown pre-tokenizer type: 'llama'

llama-completion -m models/qwen2-0.5b-int4.gguf -p "Hello" -n 4 --temp 0 --top-k 1 -s 42
# CRASH: GGML_ASSERT(id_to_token.size() == token_to_id.size()) failed → segfault

llama-completion -m models/gpt2-124m-int4.gguf -p "Hello" -n 4 --temp 0 --top-k 1 -s 42
# ERROR: key not found in model: gpt2.attention.layer_norm_epsilon

Full llama.cpp Error Output

SmolLM — GGUF metadata dump shows the problem at kv 15:

llama_model_loader: - kv  14:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  15:                         tokenizer.ggml.pre str              = llama   ← WRONG (should be 'default')
...
llama_model_load: error loading model: error loading model vocabulary: unknown pre-tokenizer type: 'llama'

Qwen2 — Segfault in vocab loading:

/home/noah/src/llama.cpp/src/llama-vocab.cpp:2126: GGML_ASSERT(id_to_token.size() == token_to_id.size()) failed
#3  llama_vocab::impl::load(llama_model_loader&, LLM_KV const&)
#4  llama_model::load_vocab(llama_model_loader&)

GPT-2 — Wrong hyperparameter key:

llama_model_loader: - kv  10:      gpt2.attention.layer_norm_rms_epsilon f32              = 0.000001  ← WRONG KEY
...
llama_model_load: error loading model: error loading model hyperparameters: key not found in model: gpt2.attention.layer_norm_epsilon

Comparison: apr-exported vs llama.cpp-native GGUF metadata

llama.cpp-native GGUFs (from convert_hf_to_gguf.py + llama-quantize) load fine in
llama-completion. Comparing GGUF key-value metadata:

| Key | apr-exported (SmolLM) | llama.cpp-native (SmolLM) |
|---|---|---|
| tokenizer.ggml.pre | llama | default |
| tokenizer.ggml.model | gpt2 | gpt2 |
| general.architecture | llama | llama |

| Key | apr-exported (GPT-2) | llama.cpp-native (GPT-2) |
|---|---|---|
| gpt2.attention.layer_norm_rms_epsilon | 0.000001 | (not present) |
| gpt2.attention.layer_norm_epsilon | (not present) | 0.00001 |
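
A comparison like the one above can be reproduced without llama.cpp by reading the key-value header of both files and diffing the results. The following is a minimal stdlib-only sketch, not the tooling used for this report. It assumes little-endian GGUF version 2 or 3 (64-bit counts and string lengths), and the helper names `read_gguf_metadata` and `diff_metadata` are made up for illustration.

```python
import struct

# GGUF scalar value-type IDs mapped to little-endian struct formats.
_SCALARS = {0: "<B", 1: "<b", 2: "<H", 3: "<h", 4: "<I", 5: "<i",
            6: "<f", 7: "<?", 10: "<Q", 11: "<q", 12: "<d"}
_STRING, _ARRAY = 8, 9


def _read(fmt, buf, pos):
    """Unpack one value at pos and return (value, new_pos)."""
    (value,) = struct.unpack_from(fmt, buf, pos)
    return value, pos + struct.calcsize(fmt)


def _read_string(buf, pos):
    """GGUF string: uint64 byte length followed by UTF-8 bytes."""
    length, pos = _read("<Q", buf, pos)
    text = buf[pos:pos + length].decode("utf-8", errors="replace")
    return text, pos + length


def _read_value(vtype, buf, pos):
    if vtype == _STRING:
        return _read_string(buf, pos)
    if vtype == _ARRAY:
        # Array: uint32 element type, uint64 count, then the elements.
        elem_type, pos = _read("<I", buf, pos)
        count, pos = _read("<Q", buf, pos)
        items = []
        for _ in range(count):
            item, pos = _read_value(elem_type, buf, pos)
            items.append(item)
        return items, pos
    return _read(_SCALARS[vtype], buf, pos)


def read_gguf_metadata(buf):
    """Return the key-value metadata of a GGUF image as a dict."""
    if buf[:4] != b"GGUF":
        raise ValueError("not a GGUF file")
    _version, pos = _read("<I", buf, 4)
    _tensor_count, pos = _read("<Q", buf, pos)
    kv_count, pos = _read("<Q", buf, pos)
    meta = {}
    for _ in range(kv_count):
        key, pos = _read_string(buf, pos)
        vtype, pos = _read("<I", buf, pos)
        meta[key], pos = _read_value(vtype, buf, pos)
    return meta


def diff_metadata(a, b, missing="(not present)"):
    """Map each differing key to its (value_in_a, value_in_b) pair."""
    out = {}
    for key in sorted(set(a) | set(b)):
        va, vb = a.get(key, missing), b.get(key, missing)
        if va != vb:
            out[key] = (va, vb)
    return out
```

`buf` can be any bytes-like object that supports slicing. For real model files, a read-only `mmap.mmap` over the file avoids loading the tensor data into memory.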

Five-Whys Analysis

SmolLM pre-tokenizer

  1. Why does llama.cpp reject apr SmolLM GGUF? → unknown pre-tokenizer type: 'llama'
  2. Why is the pre-tokenizer set to 'llama'? → apr GGUF export hardcodes tokenizer.ggml.pre based on general.architecture
  3. Why doesn't that work? → The pre-tokenizer type is a tokenizer property, not an architecture property. SmolLM uses GPT-2 BPE tokenizer, not LLaMA SentencePiece.
  4. Why does llama.cpp care? → llama.cpp uses tokenizer.ggml.pre to select regex-based pre-tokenization patterns (whitespace splitting, etc.)
  5. Why is this hard to get right? → The mapping from HF tokenizer class → GGUF pre-tokenizer type is a lookup table in convert_hf_to_gguf.py (~50 entries). apr must replicate this table.
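
The chain above suggests that the exporter should derive tokenizer.ggml.pre from the tokenizer rather than from general.architecture. A minimal sketch of that idea follows. The table, its keys, and the function name are hypothetical and are not apr's actual code. Only the two values confirmed in this issue are listed, and anything else is rejected instead of silently falling back to 'llama'.

```python
# Hypothetical lookup: tokenizer identifier -> tokenizer.ggml.pre value.
# The identifiers on the left are illustrative placeholders. The values on
# the right are the ones this issue reports llama.cpp-native GGUFs using.
PRE_TOKENIZER_BY_TOKENIZER = {
    "smollm": "default",
    "gpt2": "gpt2",
}


def select_pre_tokenizer(tokenizer_id: str) -> str:
    """Return the pre-tokenizer type for a tokenizer, or fail loudly."""
    try:
        return PRE_TOKENIZER_BY_TOKENIZER[tokenizer_id]
    except KeyError:
        # Refusing to guess surfaces the gap at export time rather than
        # producing a file that llama.cpp later rejects.
        raise ValueError(
            f"no tokenizer.ggml.pre mapping for tokenizer {tokenizer_id!r}"
        ) from None
```

Failing on unknown tokenizers is a deliberate choice here: a missing table entry becomes an export error, not a load error in a downstream runtime.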

Qwen2 vocab crash

  1. Why does llama.cpp crash on apr Qwen2 GGUF? → id_to_token.size() != token_to_id.size()
  2. Why are the sizes different? → The token table has duplicate entries (same string mapped to multiple IDs, or vice versa)
  3. Why are there duplicates? → apr's vocabulary export likely doesn't handle Qwen2's added tokens or special tokens correctly
  4. Why is Qwen2 different? → Qwen2 has 151,936 vocab entries with many special tokens (<|im_start|>, etc.) that overlap base vocabulary
  5. Why doesn't apr catch this? → No post-export validation that token table is bijective
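
The missing validation from step 5 could look roughly like the sketch below. It assumes the exported vocabulary is available as a list of token strings indexed by token ID, and that the failing assertion is triggered by duplicate strings collapsing on the string-to-ID side. Both are assumptions, and the function names are hypothetical.

```python
from collections import defaultdict


def find_duplicate_tokens(tokens):
    """Return {token_string: [ids...]} for every string used by more than one ID."""
    ids_by_token = defaultdict(list)
    for token_id, token in enumerate(tokens):
        ids_by_token[token].append(token_id)
    return {tok: ids for tok, ids in ids_by_token.items() if len(ids) > 1}


def check_token_table(tokens):
    """Raise ValueError if the ID <-> string mapping is not one-to-one."""
    duplicates = find_duplicate_tokens(tokens)
    if duplicates:
        # Show a few offenders so the report stays readable for large vocabularies.
        sample = dict(list(duplicates.items())[:5])
        raise ValueError(
            f"{len(duplicates)} token strings map to multiple IDs, e.g. {sample}"
        )
```

Running `check_token_table` on the token list just before it is written would turn the llama.cpp crash into an export-time error that names the offending tokens.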

GPT-2 missing hyperparameter

  1. Why does llama.cpp reject apr GPT-2 GGUF? → key not found: gpt2.attention.layer_norm_epsilon
  2. Why is the key missing? → apr writes gpt2.attention.layer_norm_rms_epsilon instead
  3. Why the wrong key? → apr's GGUF export uses LLaMA-style layer_norm_rms_epsilon for all architectures
  4. Why is that wrong for GPT-2? → GPT-2 uses standard LayerNorm (not RMSNorm), so the GGUF key is different
  5. Why does llama.cpp require the exact key? → GGUF is a typed key-value format with architecture-prefixed keys; gpt2.attention.layer_norm_epsilon is the spec-defined key
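
One way to avoid writing the LLaMA-style key for every architecture is to pick the key suffix per architecture. The sketch below is illustrative only: the table and function name are hypothetical, and it lists only the two architectures whose keys appear in this issue.

```python
# Hypothetical table: general.architecture -> normalization-epsilon key suffix.
# "llama" uses the RMSNorm key and "gpt2" the standard LayerNorm key,
# as described in this issue.
NORM_EPSILON_SUFFIX = {
    "llama": "attention.layer_norm_rms_epsilon",
    "gpt2": "attention.layer_norm_epsilon",
}


def norm_epsilon_key(architecture: str) -> str:
    """Build the architecture-prefixed GGUF key for the norm epsilon."""
    try:
        suffix = NORM_EPSILON_SUFFIX[architecture]
    except KeyError:
        raise ValueError(
            f"no norm-epsilon key known for architecture {architecture!r}"
        ) from None
    return f"{architecture}.{suffix}"
```

With this shape, adding an architecture means adding a table entry, and an unmapped architecture fails at export time.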

Popperian Falsification

Claim: "apr-exported GGUFs are valid GGUF files loadable by any GGUF-compatible runtime."

Test: Load apr-exported GGUFs in llama.cpp (llama-completion), the reference GGUF implementation.

Result: FALSIFIED — all 3 models fail to load, each for a different reason.

Falsification evidence:

  • SmolLM: vocabulary metadata error (pre-tokenizer type)
  • Qwen2: vocabulary data corruption (token table assertion failure → crash)
  • GPT-2: hyperparameter metadata error (wrong key name)

Severity: This is not a single bug but a pattern of GGUF export producing files that
don't conform to the llama.cpp GGUF specification. The three failures suggest the export
path was not tested against any external GGUF consumer.

Context

  • apr version: 0.2.18 (940ef71)
  • llama.cpp version: build 7746 (39173bcac), Feb 2026
  • Test repo: tiny-model-ground-truth, Layer 4b tests
  • Test file: tests/test_llamacpp_parity.py::test_apr_gguf_loads_in_llamacpp
  • All 3 models tested: SmolLM-135M, Qwen2-0.5B, GPT-2
  • Status: xfail in CI (does not block, but documents the failure)

Acceptance Criteria

  • apr-exported SmolLM GGUF loads in llama-completion (fix tokenizer.ggml.pre)
  • apr-exported Qwen2 GGUF loads in llama-completion (fix token table bijection)
  • apr-exported GPT-2 GGUF loads in llama-completion (fix layer_norm_epsilon key)
  • Add post-export validation: round-trip load test using llama.cpp C API or gguf-py
  • test_apr_gguf_loads_in_llamacpp xfail removed, tests pass green
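
A round-trip load check could be as small as the sketch below, which shells out to llama-completion with the same flags used in the Reproduction section. It is not the existing test in tiny-model-ground-truth. The function names are hypothetical, and it assumes llama-completion exits with a non-zero status when loading fails.

```python
import subprocess


def build_load_command(gguf_path, binary="llama-completion"):
    """Command line copied from the Reproduction section of this issue."""
    return [binary, "-m", gguf_path, "-p", "Hello", "-n", "4",
            "--temp", "0", "--top-k", "1", "-s", "42"]


def loads_in_llamacpp(gguf_path, binary="llama-completion", timeout=120):
    """Return (ok, detail); ok is True only if the process exits with status 0."""
    try:
        proc = subprocess.run(
            build_load_command(gguf_path, binary),
            stdin=subprocess.DEVNULL,  # never wait for interactive input
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except (FileNotFoundError, subprocess.TimeoutExpired) as exc:
        return False, str(exc)
    # On POSIX, a process killed by a signal (such as an abort from a failed
    # assertion) reports a negative return code, which also counts as failure.
    return proc.returncode == 0, proc.stderr[-2000:]
```

A test can then assert `loads_in_llamacpp(path)[0]` for each exported model and print the returned stderr tail on failure.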
