QA: Qwen2.5-Coder-0.5B MVP qualification — 29/45 pass (64%), 3 root causes #218

## Summary

First automated MVP qualification run for **Qwen2.5-Coder-0.5B-Instruct** using the `apr-model-qa-playbook` framework. The model passes all gateway checks (G0-G4) and core inference gates, but fails 16/45 gates across format conversion and contract invariant tests.

**Result: 29/45 Corroborated, 16 Falsified (64% pass rate)**

## Environment

- **Playbook:** `qwen2.5-coder-0.5b-mvp.playbook.yaml`
- **Model:** `Qwen/Qwen2.5-Coder-0.5B-Instruct` (SafeTensors, 942 MB)
- **apr version:** 0.2.12
- **Backend:** CPU only (`--no-gpu`)
- **Duration:** ~17 minutes (1043s)

## What Passes (29 gates)

All gateways and core inference pass:

| Gate | Description |
|------|-------------|
| G0-PULL-001 | Model download/cache |
| G0-FORMAT-APR-001 | APR format available |
| G0-FORMAT-GGUF-001 | GGUF format available |
| G0-VALIDATE-001 (x4) | Model validation |
| G0-INTEGRITY-CONFIG | Config integrity |
| G0-LAYOUT-001 | Tensor layout correct |
| F-A1 through F-A6 (x18) | Core inference (run/chat/serve x cpu) |
| F-CONV-SafeTensors-Apr | ST→APR conversion |
| F-GOLDEN-RULE-001 | Golden rule (converted model = original) |

## What Fails (16 gates) — 3 Root Causes

### Root Cause 1: GGUF weights-only files lack embedded tokenizer (7 gates)

**Refs:** #216, #185

Any conversion chain that reads from GGUF fails with PMAT-232:

```
ERROR: GGUF file '...model.gguf' has no embedded tokenizer vocabulary.
This is a 'weights-only' GGUF that cannot produce a working APR file.
```

**Affected gates:**

| Gate | Chain |
|------|-------|
| F-CONV-G-A | GGUF → APR |
| F-CONV-G-S | GGUF → SafeTensors |
| F-CONV-RT-001 | GGUF → APR → ST → GGUF |
| F-CONV-IDEM-001 | Idempotency (GGUF path) |
| F-CONV-COM-001 | Commutativity (GGUF path) |

Plus cascading failures in RT-001, which starts from GGUF.

**Fix:** `apr rosetta convert` should accept `--tokenizer <path>` to supply an external tokenizer when the GGUF is weights-only. The QA playbook workspace already has `tokenizer.json` alongside the GGUF — rosetta just needs to look for it or accept it as an argument.
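
The lookup order proposed above can be sketched as follows. This is only an illustration of the intended behavior, not rosetta's implementation: `resolve_tokenizer` is a hypothetical helper, and `explicit` stands in for the value of the proposed `--tokenizer` flag.

```python
from pathlib import Path
from typing import Optional


def resolve_tokenizer(gguf_path: Path, explicit: Optional[Path] = None) -> Optional[Path]:
    """Return the tokenizer file to use for a weights-only GGUF, or None.

    Hypothetical helper: an explicit path (the proposed --tokenizer value)
    wins; otherwise look for tokenizer.json next to the GGUF file.
    """
    if explicit is not None:
        # Fail loudly on a bad explicit path instead of silently falling back.
        if not explicit.is_file():
            raise FileNotFoundError(f"tokenizer not found: {explicit}")
        return explicit
    sibling = gguf_path.parent / "tokenizer.json"
    return sibling if sibling.is_file() else None
```

Returning `None` leaves the caller free to keep emitting the existing PMAT-232 error when neither source is available.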

### Root Cause 2: Inference diff too high after format conversion (5 gates)

**Refs:** #215

Pairwise conversion produces inference output that differs significantly from the source format. The diffs are ~0.77-0.81 (on a 0-1 scale) against an epsilon of 1e-6:

| Gate | Conversion | Diff |
|------|-----------|------|
| F-CONV-A-G | APR → GGUF | 8.00e-1 |
| F-CONV-S-G | ST → GGUF | 8.00e-1 |
| F-CONV-A-S | APR → ST | 8.12e-1 |
| F-CONV-S-A | ST → APR | 7.67e-1 |

Round-trip chains also fail (F-CONV-RT-002, RT-003, RT-004) as the errors compound.

**Note:** The golden rule test (F-GOLDEN-RULE-001) PASSES, meaning `inference(convert(ST→APR)) == inference(ST)`, yet F-CONV-S-A reports a diff of 7.67e-1 for the same ST → APR conversion. The two gates disagree, which suggests the pairwise comparison may be using fingerprint hashes rather than semantic output similarity. Needs investigation: are these truly different inference outputs, or is the comparison method too strict?
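
To make the hash-versus-tolerance distinction concrete, here is a minimal sketch of how the two comparison styles can disagree on the same pair of outputs. It is not the playbook's actual metric: the function names and sample values are made up for illustration, and only the epsilon of 1e-6 comes from this report.

```python
import hashlib
import struct
from typing import Sequence


def fingerprint(values: Sequence[float]) -> str:
    """SHA-256 over the raw float64 bytes: any bit-level change alters it."""
    raw = struct.pack(f"<{len(values)}d", *values)
    return hashlib.sha256(raw).hexdigest()


def max_abs_diff(a: Sequence[float], b: Sequence[float]) -> float:
    """Largest element-wise absolute difference between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return max((abs(x - y) for x, y in zip(a, b)), default=0.0)


EPSILON = 1e-6  # tolerance quoted in this report

reference = [0.125, -1.5, 3.25]          # made-up "source format" output
converted = [0.125, -1.5, 3.25 + 1e-9]   # same output with tiny numeric drift

hashes_match = fingerprint(reference) == fingerprint(converted)
within_tolerance = max_abs_diff(reference, converted) <= EPSILON
print(hashes_match, within_tolerance)
```

With these sample values the hashes differ while the numeric difference stays well inside the tolerance. If the pairwise gates compare fingerprints, that kind of drift alone would be reported as a failure.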

### Root Cause 3: Contract test workspace path construction bug (4 gates)

The contract invariant tests (I-2 through I-5) fail due to incorrect file path construction in the QA runner:

| Gate | Error |
|------|-------|
| F-CONTRACT-I2-001 | `File not found: output/workspace/Qwen/Qwen2.5-Coder-0.apr` (path truncated — should be `Qwen2.5-Coder-0.5B-Instruct.apr`) |
| F-CONTRACT-I3-001 | `Unknown format extension: .5b-instruct` (directory path passed where file path expected) |
| F-CONTRACT-I4-001 | `unexpected argument 'output/workspace/Qwen/Qwen2.5-Coder-0.apr'` (wrong arg to `rosetta validate-stats`) |
| F-CONTRACT-I5-001 | `Unknown format extension: .5b-instruct` (same as I3) |

**This is a bug in apr-model-qa-playbook**, not aprender. The contract test runner is constructing workspace paths incorrectly when the model name contains dots (e.g., `Qwen2.5-Coder-0.5B`). Filing separately in the playbook repo.
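
One mechanism consistent with these error messages is an extension-replacing path call applied to a name that already contains dots. The sketch below reproduces the truncated path with Python's `pathlib`. It is a guess at the failure mode, not the runner's actual code, and `append_extension` is a hypothetical helper.

```python
from pathlib import PurePosixPath

model_dir = PurePosixPath("output/workspace/Qwen/Qwen2.5-Coder-0.5B-Instruct")

# with_suffix() treats everything after the last dot as the extension,
# so ".5B-Instruct" is replaced instead of ".apr" being appended.
truncated = model_dir.with_suffix(".apr")
print(truncated)         # output/workspace/Qwen/Qwen2.5-Coder-0.apr
print(model_dir.suffix)  # .5B-Instruct


def append_extension(path: PurePosixPath, ext: str) -> PurePosixPath:
    """Append ext to the final path component, keeping any dots in the name."""
    return path.with_name(path.name + ext)


print(append_extension(model_dir, ".apr"))
# output/workspace/Qwen/Qwen2.5-Coder-0.5B-Instruct.apr
```

The `.5B-Instruct` suffix also lines up with the `Unknown format extension: .5b-instruct` errors in I3 and I5, assuming the extension is lowercased before it is checked. Rust's `Path::with_extension` replaces the text after the last dot in the same way, so a Rust runner could hit the same truncation.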

## Actionable Items for Aprender

1. **P0** — GGUF tokenizer passthrough (#216): `rosetta convert` should auto-discover sibling `tokenizer.json` or accept `--tokenizer` flag when converting weights-only GGUF files
2. **P1** — Conversion diff investigation (#215): Determine whether the 0.77-0.81 pairwise diffs represent actual inference divergence or a measurement artifact (hash-based vs semantic comparison)
3. **P2** — `rosetta validate-stats` CLI: verify the subcommand accepts the argument format the QA runner is passing

## Reproduction

```bash
cd ../apr-model-qa-playbook

# Install apr from aprender
cargo install --path ../aprender/crates/apr-cli

# Run the qualification
cargo run --release --bin apr-qa -- run \
    playbooks/models/qwen2.5-coder-0.5b-mvp.playbook.yaml \
    --model-path ~/.cache/pacha/models/d71534cb948e32eb.safetensors \
    --no-gpu \
    -o certifications/qwen2.5-coder-0.5b-mvp

# Analyze results
python3 -c "
import json
with open('certifications/qwen2.5-coder-0.5b-mvp/evidence.json') as f:
    data = json.load(f)
for e in data:
    status = 'PASS' if e['outcome'] == 'Corroborated' else 'FAIL'
    print(f'[{status}] {e[\"gate_id\"]}: {e[\"reason\"][:100]}')
"
```
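
For a quick tally instead of a per-gate listing, a small script along these lines may help. It assumes the same evidence schema the one-liner above relies on (a JSON list of objects with an `outcome` field) and counts anything other than `Corroborated` as a failure; `summarize` and `load_evidence` are illustrative names, not part of the playbook.

```python
import json
from collections import Counter
from pathlib import Path


def summarize(evidence: list[dict]) -> dict:
    """Count outcomes and compute the pass rate over all gates."""
    counts = Counter(e["outcome"] for e in evidence)
    total = len(evidence)
    passed = counts.get("Corroborated", 0)
    return {
        "total": total,
        "passed": passed,
        "failed": total - passed,
        "pass_rate": passed / total if total else 0.0,
    }


def load_evidence(path: Path) -> list[dict]:
    """Read the evidence JSON written by the qualification run."""
    with path.open() as f:
        return json.load(f)
```

Calling `summarize(load_evidence(Path("certifications/qwen2.5-coder-0.5b-mvp/evidence.json")))` on this run should give 29 passed out of 45, if the file matches the figures reported above.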

## Raw Evidence

Full evidence JSON: `certifications/qwen2.5-coder-0.5b-mvp/evidence.json` in `apr-model-qa-playbook` repo.
