apr-exported GGUFs use pre-tokenizer type 'llama' which llama.cpp rejects #277

## Bug

apr-exported GGUFs cannot be loaded by llama.cpp (`llama-completion`). Each model fails
for a **different reason**, indicating multiple GGUF metadata/vocabulary issues in the
export path.

### Failure Matrix (3 models, 3 distinct errors)

| Model | `tokenizer.ggml.pre` | llama.cpp Error | Severity |
|-------|---------------------|-----------------|----------|
| SmolLM-135M | `llama` (should be `default`) | `unknown pre-tokenizer type: 'llama'` | Rejected |
| Qwen2-0.5B | (crashes before reading) | `GGML_ASSERT(id_to_token.size() == token_to_id.size()) failed` → segfault in `llama_vocab::impl::load` | **Crash** |
| GPT-2 | `gpt2` (correct!) | `key not found in model: gpt2.attention.layer_norm_epsilon` | Rejected |

### Root Cause Analysis

**SmolLM**: apr hardcodes `tokenizer.ggml.pre = "llama"` for LLaMA-architecture models.
llama.cpp's `convert_hf_to_gguf.py` sets this to `"default"` for SmolLM. The
pre-tokenizer type must match what llama.cpp expects for each tokenizer class.

**Qwen2**: The vocabulary export produces a token table where
`id_to_token.size() != token_to_id.size()`, which points to duplicate or missing token
entries. llama.cpp hits a hard assertion failure and crashes instead of returning a
graceful error.

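**GPT-2**: apr writes `gpt2.attention.layer_norm_rms_epsilon` (RMS norm key, from LLaMA)
but GPT-2 uses standard LayerNorm, so llama.cpp expects `gpt2.attention.layer_norm_epsilon`.
The architecture-specific hyperparameter key is wrong. The comparison table further down
also shows a different value (`0.000001` from apr vs `0.00001` in the native file), so the
exported epsilon value may need checking as well as the key name.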

## Reproduction

```bash
# apr version
apr --version
# apr 0.2.18 (940ef71e)

# llama.cpp version
llama-completion --version
# build: 7746 (39173bcac)

# Generate apr GGUFs (these already exist if you've run make convert)
apr export models/smollm-135m-int4.apr --format gguf --output models/smollm-135m-int4.gguf
apr export models/qwen2-0.5b-int4.apr --format gguf --output models/qwen2-0.5b-int4.gguf
apr export models/gpt2-124m-int4.apr --format gguf --output models/gpt2-124m-int4.gguf

# Attempt to load each in llama-completion
llama-completion -m models/smollm-135m-int4.gguf -p "Hello" -n 4 --temp 0 --top-k 1 -s 42
# ERROR: unknown pre-tokenizer type: 'llama'

llama-completion -m models/qwen2-0.5b-int4.gguf -p "Hello" -n 4 --temp 0 --top-k 1 -s 42
# CRASH: GGML_ASSERT(id_to_token.size() == token_to_id.size()) failed → segfault

llama-completion -m models/gpt2-124m-int4.gguf -p "Hello" -n 4 --temp 0 --top-k 1 -s 42
# ERROR: key not found in model: gpt2.attention.layer_norm_epsilon
```
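
The manual steps above could be wrapped in a small helper for automated checks. This is a minimal sketch, not the code in `tests/test_llamacpp_parity.py`: the function name `loads_in_llamacpp` is hypothetical, and it assumes `llama-completion` exits with a non-zero status when a model fails to load. The flags are the ones used in the reproduction commands.

```python
import subprocess


def loads_in_llamacpp(gguf_path: str, binary: str = "llama-completion") -> bool:
    """Return True if `binary` exits cleanly after loading `gguf_path`.

    Hypothetical helper: assumes a failed model load yields a non-zero exit status.
    """
    cmd = [
        binary, "-m", gguf_path,
        "-p", "Hello", "-n", "4",
        "--temp", "0", "--top-k", "1", "-s", "42",
    ]
    try:
        result = subprocess.run(
            cmd,
            stdin=subprocess.DEVNULL,  # never wait for interactive input
            capture_output=True,
            text=True,
            timeout=120,
        )
    except (OSError, subprocess.TimeoutExpired):
        # Binary not found / not executable, or the run hung: treat as "did not load".
        return False
    # A rejected model exits non-zero; on POSIX a crash surfaces as a negative return code.
    return result.returncode == 0
```

With this helper, each of the three exported files would be expected to return `False` until the export path is fixed.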

### Full llama.cpp Error Output

**SmolLM** — GGUF metadata dump shows the problem at kv 15:
```
llama_model_loader: - kv  14:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  15:                         tokenizer.ggml.pre str              = llama   ← WRONG (should be 'default')
...
llama_model_load: error loading model: error loading model vocabulary: unknown pre-tokenizer type: 'llama'
```

**Qwen2** — Segfault in vocab loading:
```
/home/noah/src/llama.cpp/src/llama-vocab.cpp:2126: GGML_ASSERT(id_to_token.size() == token_to_id.size()) failed
#3  llama_vocab::impl::load(llama_model_loader&, LLM_KV const&)
#4  llama_model::load_vocab(llama_model_loader&)
```

**GPT-2** — Wrong hyperparameter key:
```
llama_model_loader: - kv  10:      gpt2.attention.layer_norm_rms_epsilon f32              = 0.000001  ← WRONG KEY
...
llama_model_load: error loading model: error loading model hyperparameters: key not found in model: gpt2.attention.layer_norm_epsilon
```

### Comparison: apr-exported vs llama.cpp-native GGUF metadata

llama.cpp-native GGUFs (from `convert_hf_to_gguf.py` + `llama-quantize`) load fine in
llama-completion. Comparing GGUF key-value metadata:

| Key | apr-exported (SmolLM) | llama.cpp-native (SmolLM) |
|-----|----------------------|--------------------------|
| `tokenizer.ggml.pre` | `llama` | `default` |
| `tokenizer.ggml.model` | `gpt2` | `gpt2` |
| `general.architecture` | `llama` | `llama` |

| Key | apr-exported (GPT-2) | llama.cpp-native (GPT-2) |
|-----|---------------------|--------------------------|
| `gpt2.attention.layer_norm_rms_epsilon` | `0.000001` | *(not present)* |
| `gpt2.attention.layer_norm_epsilon` | *(not present)* | `0.00001` |
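
A comparison like the tables above can be produced mechanically once both files' key-value metadata are available as dictionaries. The sketch below assumes that step has already happened (for example via gguf-py, not shown here); `diff_metadata` is a hypothetical name, and the sample values are copied from the SmolLM table.

```python
def diff_metadata(exported: dict, native: dict) -> list[tuple[str, object, object]]:
    """List (key, exported_value, native_value) for every key whose value differs.

    A key that is absent on one side is reported with None on that side.
    """
    rows = []
    for key in sorted(exported.keys() | native.keys()):
        a = exported.get(key)
        b = native.get(key)
        if a != b:
            rows.append((key, a, b))
    return rows


# Values copied from the SmolLM comparison table.
apr_smollm = {
    "tokenizer.ggml.pre": "llama",
    "tokenizer.ggml.model": "gpt2",
    "general.architecture": "llama",
}
native_smollm = {
    "tokenizer.ggml.pre": "default",
    "tokenizer.ggml.model": "gpt2",
    "general.architecture": "llama",
}

for key, a, b in diff_metadata(apr_smollm, native_smollm):
    print(f"{key}: apr={a!r} native={b!r}")
# prints: tokenizer.ggml.pre: apr='llama' native='default'
```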

## Five-Whys Analysis

### SmolLM pre-tokenizer
1. **Why** does llama.cpp reject apr SmolLM GGUF? → `unknown pre-tokenizer type: 'llama'`
2. **Why** is the pre-tokenizer set to `'llama'`? → apr GGUF export hardcodes `tokenizer.ggml.pre` based on `general.architecture`
3. **Why** doesn't that work? → The pre-tokenizer type is a **tokenizer property**, not an architecture property. SmolLM uses GPT-2 BPE tokenizer, not LLaMA SentencePiece.
4. **Why** does llama.cpp care? → llama.cpp uses `tokenizer.ggml.pre` to select regex-based pre-tokenization patterns (whitespace splitting, etc.)
5. **Why** is this hard to get right? → The mapping from HF tokenizer class → GGUF pre-tokenizer type is a lookup table in `convert_hf_to_gguf.py` (~50 entries). apr must replicate this table.
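
One way to express the point from steps 3 and 5 is a lookup keyed by tokenizer rather than by `general.architecture`. The sketch below is illustrative only: the dictionary keys and the function name are hypothetical, and it lists just the two values confirmed in this issue (`default` for SmolLM, `gpt2` for GPT-2). A real table would need to follow `convert_hf_to_gguf.py`.

```python
# Hypothetical lookup keyed by tokenizer identity, not by general.architecture.
# Only the two values confirmed in this issue are listed; the keys are
# illustrative names, not identifiers defined by apr or llama.cpp.
PRE_TOKENIZER_BY_TOKENIZER = {
    "smollm-135m": "default",
    "gpt2-124m": "gpt2",
}


def select_pre_tokenizer(tokenizer_name: str) -> str:
    """Return the tokenizer.ggml.pre value, failing loudly for unknown tokenizers."""
    try:
        return PRE_TOKENIZER_BY_TOKENIZER[tokenizer_name]
    except KeyError:
        raise ValueError(
            f"no pre-tokenizer mapping for {tokenizer_name!r}; "
            "refusing to guess from the model architecture"
        ) from None
```

Raising on an unknown tokenizer, instead of falling back to the architecture name, would turn a silent bad export into an export-time error.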

### Qwen2 vocab crash
1. **Why** does llama.cpp crash on apr Qwen2 GGUF? → `id_to_token.size() != token_to_id.size()`
2. **Why** are the sizes different? → The token table likely has duplicate entries (the same token string assigned to more than one ID), which would leave the string-to-ID map smaller than the ID-to-string table
3. **Why** are there duplicates? → apr's vocabulary export likely doesn't handle Qwen2's added tokens or special tokens correctly
4. **Why** is Qwen2 different? → Qwen2 has 151,936 vocab entries with many special tokens (`<|im_start|>`, etc.) that overlap base vocabulary
5. **Why** doesn't apr catch this? → No post-export validation that token table is bijective
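
The missing validation from step 5 could look roughly like the following. This is a sketch under the assumption that the exported vocabulary is available as a list of token strings indexed by ID; the function names are hypothetical and not part of apr.

```python
from collections import defaultdict


def find_duplicate_tokens(tokens: list[str]) -> dict[str, list[int]]:
    """Map each token string that occurs more than once to the IDs that use it."""
    ids_by_token = defaultdict(list)
    for token_id, token in enumerate(tokens):
        ids_by_token[token].append(token_id)
    return {tok: ids for tok, ids in ids_by_token.items() if len(ids) > 1}


def assert_bijective(tokens: list[str]) -> None:
    """Raise ValueError if any token string is assigned to more than one ID."""
    duplicates = find_duplicate_tokens(tokens)
    if duplicates:
        raise ValueError(
            f"{len(duplicates)} token string(s) map to multiple IDs: {duplicates}"
        )


# Toy vocabulary in which a special token was appended a second time.
vocab = ["Hello", "world", "<|im_start|>", "<|im_start|>"]
print(find_duplicate_tokens(vocab))
# prints: {'<|im_start|>': [2, 3]}
```

Running such a check before writing the file would report which token strings collide, which llama.cpp's assertion does not.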

### GPT-2 missing hyperparameter
1. **Why** does llama.cpp reject apr GPT-2 GGUF? → `key not found: gpt2.attention.layer_norm_epsilon`
2. **Why** is the key missing? → apr writes `gpt2.attention.layer_norm_rms_epsilon` instead
3. **Why** the wrong key? → apr's GGUF export uses LLaMA-style `layer_norm_rms_epsilon` for all architectures
4. **Why** is that wrong for GPT-2? → GPT-2 uses standard LayerNorm (not RMSNorm), so the GGUF key is different
5. **Why** does llama.cpp require the exact key? → GGUF is a typed key-value format with architecture-prefixed keys; `gpt2.attention.layer_norm_epsilon` is the spec-defined key
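
The key choice described in steps 3 and 4 comes down to picking the suffix by normalization type and the prefix by architecture. A minimal sketch, assuming the exporter knows whether the model uses RMSNorm (the `uses_rms_norm` flag and the function name are hypothetical):

```python
def norm_epsilon_key(architecture: str, uses_rms_norm: bool) -> str:
    """Build the architecture-prefixed GGUF key for the normalization epsilon."""
    suffix = "layer_norm_rms_epsilon" if uses_rms_norm else "layer_norm_epsilon"
    return f"{architecture}.attention.{suffix}"


print(norm_epsilon_key("gpt2", uses_rms_norm=False))
# prints: gpt2.attention.layer_norm_epsilon
print(norm_epsilon_key("llama", uses_rms_norm=True))
# prints: llama.attention.layer_norm_rms_epsilon
```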

## Popperian Falsification

**Claim**: "apr-exported GGUFs are valid GGUF files loadable by any GGUF-compatible runtime."

**Test**: Load apr-exported GGUFs in llama.cpp (`llama-completion`), the reference GGUF implementation.

**Result**: **FALSIFIED** — all 3 models fail to load, each for a different reason.

**Falsification evidence**:
- SmolLM: vocabulary metadata error (pre-tokenizer type)
- Qwen2: vocabulary data corruption (token table assertion failure → crash)
- GPT-2: hyperparameter metadata error (wrong key name)

**Severity**: This is not a single bug but a pattern of GGUF export producing files that
don't meet what llama.cpp's GGUF loader requires. The three failures suggest the export
path was not tested against any external GGUF consumer.

## Context

- apr version: 0.2.18 (940ef71e)
- llama.cpp version: build 7746 (39173bcac), Feb 2026
- Test repo: `tiny-model-ground-truth`, Layer 4b tests
- Test file: `tests/test_llamacpp_parity.py::test_apr_gguf_loads_in_llamacpp`
- All 3 models tested: SmolLM-135M, Qwen2-0.5B, GPT-2
- Status: xfail in CI (does not block, but documents the failure)

### Acceptance Criteria

- [ ] apr-exported SmolLM GGUF loads in llama-completion (fix `tokenizer.ggml.pre`)
- [ ] apr-exported Qwen2 GGUF loads in llama-completion (fix token table bijection)
- [ ] apr-exported GPT-2 GGUF loads in llama-completion (fix `layer_norm_epsilon` key)
- [ ] Add post-export validation: round-trip load test using llama.cpp C API or gguf-py
- [ ] `test_apr_gguf_loads_in_llamacpp` xfail removed, tests pass green
