Bug
apr-exported GGUFs cannot be loaded by llama.cpp (llama-completion). Each model fails
for a different reason, indicating multiple GGUF metadata/vocabulary issues in the
export path.
Failure Matrix (3 models, 3 distinct errors)
| Model | tokenizer.ggml.pre | llama.cpp Error | Severity |
|---|---|---|---|
| SmolLM-135M | llama (should be default) | unknown pre-tokenizer type: 'llama' | Rejected |
| Qwen2-0.5B | (crashes before reading) | GGML_ASSERT(id_to_token.size() == token_to_id.size()) failed → segfault in llama_vocab::impl::load | Crash |
| GPT-2 | gpt2 (correct!) | key not found in model: gpt2.attention.layer_norm_epsilon | Rejected |
Root Cause Analysis
SmolLM: apr hardcodes tokenizer.ggml.pre = "llama" for LLaMA-architecture models.
llama.cpp's convert_hf_to_gguf.py sets this to "default" for SmolLM. The
pre-tokenizer type must match what llama.cpp expects for each tokenizer class.
Qwen2: The vocabulary export produces a token table where id_to_token.size() != token_to_id.size().
Because token_to_id is keyed by token text, a size mismatch most likely means duplicate token
strings (two IDs carrying the same text). llama.cpp hits a hard assertion failure and crashes
instead of returning a graceful error.
GPT-2: apr writes gpt2.attention.layer_norm_rms_epsilon (RMS norm key, from LLaMA)
but GPT-2 uses standard LayerNorm, so llama.cpp expects gpt2.attention.layer_norm_epsilon.
The architecture-specific hyperparameter key is wrong.
Reproduction
# apr version
apr --version
# apr 0.2.18 (940ef71e)
# llama.cpp version
llama-completion --version
# build: 7746 (39173bcac)
# Generate apr GGUFs (these already exist if you've run make convert)
apr export models/smollm-135m-int4.apr --format gguf --output models/smollm-135m-int4.gguf
apr export models/qwen2-0.5b-int4.apr --format gguf --output models/qwen2-0.5b-int4.gguf
apr export models/gpt2-124m-int4.apr --format gguf --output models/gpt2-124m-int4.gguf
# Attempt to load each in llama-completion
llama-completion -m models/smollm-135m-int4.gguf -p "Hello" -n 4 --temp 0 --top-k 1 -s 42
# ERROR: unknown pre-tokenizer type: 'llama'
llama-completion -m models/qwen2-0.5b-int4.gguf -p "Hello" -n 4 --temp 0 --top-k 1 -s 42
# CRASH: GGML_ASSERT(id_to_token.size() == token_to_id.size()) failed → segfault
llama-completion -m models/gpt2-124m-int4.gguf -p "Hello" -n 4 --temp 0 --top-k 1 -s 42
# ERROR: key not found in model: gpt2.attention.layer_norm_epsilon
Full llama.cpp Error Output
SmolLM — GGUF metadata dump shows the problem at kv 15:
llama_model_loader: - kv 14: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 15: tokenizer.ggml.pre str = llama ← WRONG (should be 'default')
...
llama_model_load: error loading model: error loading model vocabulary: unknown pre-tokenizer type: 'llama'
Qwen2 — Segfault in vocab loading:
/home/noah/src/llama.cpp/src/llama-vocab.cpp:2126: GGML_ASSERT(id_to_token.size() == token_to_id.size()) failed
#3 llama_vocab::impl::load(llama_model_loader&, LLM_KV const&)
#4 llama_model::load_vocab(llama_model_loader&)
GPT-2 — Wrong hyperparameter key:
llama_model_loader: - kv 10: gpt2.attention.layer_norm_rms_epsilon f32 = 0.000001 ← WRONG KEY
...
llama_model_load: error loading model: error loading model hyperparameters: key not found in model: gpt2.attention.layer_norm_epsilon
Comparison: apr-exported vs llama.cpp-native GGUF metadata
llama.cpp-native GGUFs (from convert_hf_to_gguf.py + llama-quantize) load fine in
llama-completion. Comparing GGUF key-value metadata:
| Key | apr-exported (SmolLM) | llama.cpp-native (SmolLM) |
|---|---|---|
| tokenizer.ggml.pre | llama | default |
| tokenizer.ggml.model | gpt2 | gpt2 |
| general.architecture | llama | llama |
| Key | apr-exported (GPT-2) | llama.cpp-native (GPT-2) |
|---|---|---|
| gpt2.attention.layer_norm_rms_epsilon | 0.000001 | (not present) |
| gpt2.attention.layer_norm_epsilon | (not present) | 0.00001 |
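The comparison above can be automated once both files' metadata are available as plain dictionaries. The following is a minimal sketch under that assumption; how the metadata is read from the GGUF files is left out, and the function name diff_metadata is hypothetical rather than part of apr or llama.cpp.

```python
def diff_metadata(apr_meta, native_meta):
    """Return {key: (apr_value, native_value)} for every key that differs.

    A key that is missing on one side is reported with None on that side.
    """
    diffs = {}
    for key in sorted(set(apr_meta) | set(native_meta)):
        apr_value = apr_meta.get(key)
        native_value = native_meta.get(key)
        if apr_value != native_value:
            diffs[key] = (apr_value, native_value)
    return diffs


# Values copied from the GPT-2 comparison table above.
apr_gpt2 = {"gpt2.attention.layer_norm_rms_epsilon": 0.000001}
native_gpt2 = {"gpt2.attention.layer_norm_epsilon": 0.00001}

for key, (apr_value, native_value) in diff_metadata(apr_gpt2, native_gpt2).items():
    print(f"{key}: apr={apr_value!r} native={native_value!r}")
```

Reporting missing keys as None keeps "wrong key" and "wrong value" visible in the same diff, which matters here because the GPT-2 export differs in both.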
Five-Whys Analysis
SmolLM pre-tokenizer
- Why does llama.cpp reject apr SmolLM GGUF? → unknown pre-tokenizer type: 'llama'
- Why is the pre-tokenizer set to 'llama'? → apr GGUF export hardcodes tokenizer.ggml.pre based on general.architecture
- Why doesn't that work? → The pre-tokenizer type is a tokenizer property, not an architecture property. SmolLM uses GPT-2 BPE tokenizer, not LLaMA SentencePiece.
- Why does llama.cpp care? → llama.cpp uses tokenizer.ggml.pre to select regex-based pre-tokenization patterns (whitespace splitting, etc.)
- Why is this hard to get right? → The mapping from HF tokenizer class → GGUF pre-tokenizer type is a lookup table in convert_hf_to_gguf.py (~50 entries). apr must replicate this table.
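One way to act on the last point is to key the lookup on the tokenizer rather than on general.architecture, and to fail loudly instead of guessing. This is a sketch only: the table name, the tokenizer identifier strings, and resolve_pre_tokenizer are hypothetical, and the single entry shown is the SmolLM value reported in this issue. The remaining entries would have to be taken from convert_hf_to_gguf.py.

```python
# Hypothetical table keyed by a tokenizer identifier, NOT by general.architecture.
PRE_TOKENIZER_BY_TOKENIZER = {
    "smollm": "default",  # value carried by llama.cpp-native SmolLM GGUFs, per this report
}


class UnknownPreTokenizer(ValueError):
    """Raised instead of silently falling back to an architecture-based guess."""


def resolve_pre_tokenizer(tokenizer_id):
    """Return the tokenizer.ggml.pre value for a known tokenizer, or raise."""
    try:
        return PRE_TOKENIZER_BY_TOKENIZER[tokenizer_id]
    except KeyError:
        raise UnknownPreTokenizer(
            f"no tokenizer.ggml.pre mapping for tokenizer {tokenizer_id!r}"
        ) from None


print(resolve_pre_tokenizer("smollm"))
```

Raising on an unknown tokenizer turns a load-time rejection in llama.cpp into an export-time error in apr, which is easier to diagnose.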
Qwen2 vocab crash
- Why does llama.cpp crash on apr Qwen2 GGUF? → id_to_token.size() != token_to_id.size()
- Why are the sizes different? → The token table has duplicate entries (same string mapped to multiple IDs, or vice versa)
- Why are there duplicates? → apr's vocabulary export likely doesn't handle Qwen2's added tokens or special tokens correctly
- Why is Qwen2 different? → Qwen2 has 151,936 vocab entries with many special tokens (<|im_start|>, etc.) that overlap base vocabulary
- Why doesn't apr catch this? → No post-export validation that token table is bijective
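The missing post-export validation could look roughly like the sketch below. It assumes the exporter has the ordered token list in memory, with the list index serving as the token ID. The function names are hypothetical, and the toy vocabulary only illustrates the suspected failure mode; it is not taken from the real Qwen2 vocabulary.

```python
from collections import defaultdict


def find_duplicate_tokens(tokens):
    """Return {token_text: [ids...]} for every text that appears under more than one ID.

    `tokens` is the ordered token list as it would be written to the GGUF,
    where the list index is the token ID.
    """
    ids_by_text = defaultdict(list)
    for token_id, text in enumerate(tokens):
        ids_by_text[text].append(token_id)
    return {text: ids for text, ids in ids_by_text.items() if len(ids) > 1}


def assert_vocab_bijective(tokens):
    """Raise ValueError if any token text is shared by two or more IDs."""
    duplicates = find_duplicate_tokens(tokens)
    if duplicates:
        raise ValueError(
            f"{len(duplicates)} token text(s) map to multiple IDs: {duplicates}"
        )


# Toy vocabulary: a special token appears twice, once as a base entry and once as an added token.
toy_vocab = ["Hello", "<|im_start|>", "world", "<|im_start|>"]
print(find_duplicate_tokens(toy_vocab))
```

Running a check like this before writing the file would surface the conflicting IDs by name, rather than leaving llama.cpp to abort on the size assertion.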
GPT-2 missing hyperparameter
- Why does llama.cpp reject apr GPT-2 GGUF? → key not found: gpt2.attention.layer_norm_epsilon
- Why is the key missing? → apr writes gpt2.attention.layer_norm_rms_epsilon instead
- Why the wrong key? → apr's GGUF export uses LLaMA-style layer_norm_rms_epsilon for all architectures
- Why is that wrong for GPT-2? → GPT-2 uses standard LayerNorm (not RMSNorm), so the GGUF key is different
- Why does llama.cpp require the exact key? → GGUF is a typed key-value format with architecture-prefixed keys; gpt2.attention.layer_norm_epsilon is the spec-defined key
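A small sketch of selecting the epsilon key from the normalization type instead of using one key everywhere. The table NORM_KIND_BY_ARCH and the function epsilon_key are hypothetical names, and the two entries cover only the architectures discussed in this report.

```python
# Hypothetical table: which normalization each architecture uses.
# The two entries reflect what this report states (LLaMA: RMSNorm, GPT-2: LayerNorm).
NORM_KIND_BY_ARCH = {
    "llama": "rms",
    "gpt2": "layer",
}


def epsilon_key(arch):
    """Return the architecture-prefixed GGUF key for the norm epsilon."""
    kind = NORM_KIND_BY_ARCH[arch]  # KeyError for unlisted architectures is deliberate
    suffix = "layer_norm_rms_epsilon" if kind == "rms" else "layer_norm_epsilon"
    return f"{arch}.attention.{suffix}"


print(epsilon_key("gpt2"))
print(epsilon_key("llama"))
```

Note that the comparison table also shows different values under the two keys (0.000001 vs 0.00001), so renaming the key alone may not be enough to match the native file.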
Popperian Falsification
Claim: "apr-exported GGUFs are valid GGUF files loadable by any GGUF-compatible runtime."
Test: Load apr-exported GGUFs in llama.cpp (llama-completion), the reference GGUF implementation.
Result: FALSIFIED — all 3 models fail to load, each for a different reason.
Falsification evidence:
- SmolLM: vocabulary metadata error (pre-tokenizer type)
- Qwen2: vocabulary data corruption (token table assertion failure → crash)
- GPT-2: hyperparameter metadata error (wrong key name)
Severity: This is not a single bug but a pattern of GGUF export producing files that
don't conform to the llama.cpp GGUF specification. The three failures suggest the export
path was not tested against any external GGUF consumer.
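A load check against an external consumer can be fairly small. The sketch below reuses the llama-completion invocation from the Reproduction section; the function names are hypothetical, and it assumes that a failed load produces a non-zero exit status, which should be confirmed against the llama.cpp build in use.

```python
import subprocess


def llama_completion_command(gguf_path):
    """Build the same invocation used in the Reproduction section."""
    return [
        "llama-completion", "-m", gguf_path,
        "-p", "Hello", "-n", "4",
        "--temp", "0", "--top-k", "1", "-s", "42",
    ]


def loads_in_llama_cpp(gguf_path, timeout_s=120):
    """Return (ok, stderr). Assumes a failed load yields a non-zero exit status."""
    result = subprocess.run(
        llama_completion_command(gguf_path),
        stdin=subprocess.DEVNULL,  # never wait for interactive input
        capture_output=True,
        text=True,
        timeout=timeout_s,
    )
    return result.returncode == 0, result.stderr
```

Returning stderr alongside the boolean lets a test report which of the three failure messages was hit instead of a bare pass or fail.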
Context
- apr version: 0.2.18 (940ef71)
- llama.cpp version: build 7746 (39173bcac), Feb 2026
- Test repo: tiny-model-ground-truth, Layer 4b tests
- Test file: tests/test_llamacpp_parity.py::test_apr_gguf_loads_in_llamacpp
- All 3 models tested: SmolLM-135M, Qwen2-0.5B, GPT-2
- Status: xfail in CI (does not block, but documents the failure)
Acceptance Criteria
- SmolLM GGUF loads in llama.cpp (correct tokenizer.ggml.pre)
- GPT-2 GGUF loads in llama.cpp (correct layer_norm_epsilon key)
- test_apr_gguf_loads_in_llamacpp xfail removed, tests pass green