QA: Qwen2.5-Coder-0.5B MVP qualification — 29/45 pass (64%), 3 root causes #218

@noahgift

Summary

First automated MVP qualification run for Qwen2.5-Coder-0.5B-Instruct using the apr-model-qa-playbook framework. The model passes all gateway checks (G0-G4) and core inference gates, but fails 16/45 gates across format conversion and contract invariant tests.

Result: 29/45 Corroborated, 16 Falsified (64% pass rate)

Environment

  • Playbook: qwen2.5-coder-0.5b-mvp.playbook.yaml
  • Model: Qwen/Qwen2.5-Coder-0.5B-Instruct (SafeTensors, 942 MB)
  • apr version: 0.2.12
  • Backend: CPU only (--no-gpu)
  • Duration: ~17 minutes (1043s)

What Passes (29 gates)

All gateways and core inference pass:

| Gate | Description |
| --- | --- |
| G0-PULL-001 | Model download/cache |
| G0-FORMAT-APR-001 | APR format available |
| G0-FORMAT-GGUF-001 | GGUF format available |
| G0-VALIDATE-001 (x4) | Model validation |
| G0-INTEGRITY-CONFIG | Config integrity |
| G0-LAYOUT-001 | Tensor layout correct |
| F-A1 through F-A6 (x18) | Core inference (run/chat/serve x cpu) |
| F-CONV-SafeTensors-Apr | ST→APR conversion |
| F-GOLDEN-RULE-001 | Golden rule (converted model = original) |

What Fails (16 gates) — 3 Root Causes

Root Cause 1: GGUF weights-only files lack embedded tokenizer (7 gates)

Refs: #216, #185

Any conversion chain that reads from GGUF fails with PMAT-232:

ERROR: GGUF file '...model.gguf' has no embedded tokenizer vocabulary.
This is a 'weights-only' GGUF that cannot produce a working APR file.

Affected gates:

| Gate | Chain |
| --- | --- |
| F-CONV-G-A | GGUF → APR |
| F-CONV-G-S | GGUF → SafeTensors |
| F-CONV-RT-001 | GGUF → APR → ST → GGUF |
| F-CONV-IDEM-001 | Idempotency (GGUF path) |
| F-CONV-COM-001 | Commutativity (GGUF path) |

Plus cascading failures in RT-001 which starts from GGUF.

Fix: apr rosetta convert should accept --tokenizer <path> to supply an external tokenizer when the GGUF is weights-only. The QA playbook workspace already has tokenizer.json alongside the GGUF — rosetta just needs to look for it or accept it as an argument.
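The proposed lookup order (explicit flag first, sibling file second) could be sketched as follows. This is only an illustration in Python, not rosetta's code: `resolve_tokenizer` and its parameters are hypothetical names, and the `--tokenizer` flag it models does not exist yet.

```python
from pathlib import Path
from typing import Optional


def resolve_tokenizer(gguf_path: Path, explicit: Optional[Path] = None) -> Optional[Path]:
    """Pick a tokenizer file for a weights-only GGUF (hypothetical helper)."""
    # An explicitly supplied path (the proposed --tokenizer flag) wins.
    if explicit is not None:
        if not explicit.is_file():
            raise FileNotFoundError(f"tokenizer not found: {explicit}")
        return explicit
    # Otherwise fall back to a tokenizer.json sitting next to the GGUF file.
    sibling = gguf_path.parent / "tokenizer.json"
    return sibling if sibling.is_file() else None
```

Returning `None` rather than raising in the fallback case would let the caller keep emitting the existing PMAT-232 error when no tokenizer can be found.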

Root Cause 2: Inference diff too high after format conversion (5 gates)

Refs: #215

After pairwise conversion, inference output differs significantly from the output of the source format. The diffs are ~0.77-0.81 (on a 0-1 scale), against an epsilon of 1e-6:

| Gate | Conversion | Diff |
| --- | --- | --- |
| F-CONV-A-G | APR → GGUF | 8.00e-1 |
| F-CONV-S-G | ST → GGUF | 8.00e-1 |
| F-CONV-A-S | APR → ST | 8.12e-1 |
| F-CONV-S-A | ST → APR | 7.67e-1 |

Round-trip chains also fail (F-CONV-RT-002, RT-003, RT-004) as the errors compound.

Note: The golden rule test (F-GOLDEN-RULE-001) PASSES, meaning inference(convert(ST→APR)) == inference(ST). The high diffs in pairwise tests may indicate the comparison metric is using fingerprint hashes rather than semantic output similarity. Needs investigation — are these truly different inference outputs, or is the comparison method too strict?
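To illustrate why a fingerprint-style metric could report large diffs for near-identical outputs, here is a small Python sketch. Both metrics are assumptions made up for this example; how the QA runner actually computes its diff is not known from this run, and the sketch does not attempt to reproduce the 0.77-0.81 values.

```python
import hashlib
from difflib import SequenceMatcher


def hash_diff(a: str, b: str) -> float:
    """Fraction of differing hex digits between the SHA-256 digests of two outputs."""
    ha = hashlib.sha256(a.encode()).hexdigest()
    hb = hashlib.sha256(b.encode()).hexdigest()
    return sum(x != y for x, y in zip(ha, hb)) / len(ha)


def text_diff(a: str, b: str) -> float:
    """1 minus the similarity ratio over whitespace-split tokens."""
    return 1.0 - SequenceMatcher(None, a.split(), b.split()).ratio()


source = "def add(a, b):\n    return a + b"
converted = source + "\n"  # differs only by a trailing newline

print("hash diff:", hash_diff(source, converted))
print("text diff:", text_diff(source, converted))
```

Any byte-level change scrambles the digest, so the hash-based number is large no matter how small the change is, while the token-level number stays at zero here. If the pairwise gates behave like the first metric, an epsilon of 1e-6 can only ever pass on byte-identical output.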

Root Cause 3: Contract test workspace path construction bug (4 gates)

The contract invariant tests (I-2 through I-5) fail due to incorrect file path construction in the QA runner:

| Gate | Error |
| --- | --- |
| F-CONTRACT-I2-001 | File not found: output/workspace/Qwen/Qwen2.5-Coder-0.apr (path truncated; should be Qwen2.5-Coder-0.5B-Instruct.apr) |
| F-CONTRACT-I3-001 | Unknown format extension: .5b-instruct (directory path passed where file path expected) |
| F-CONTRACT-I4-001 | unexpected argument 'output/workspace/Qwen/Qwen2.5-Coder-0.apr' (wrong arg to rosetta validate-stats) |
| F-CONTRACT-I5-001 | Unknown format extension: .5b-instruct (same as I3) |

This is a bug in apr-model-qa-playbook, not aprender. The contract test runner is constructing workspace paths incorrectly when the model name contains dots (e.g., Qwen2.5-Coder-0.5B). Filing separately in the playbook repo.
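The truncated path is consistent with a "replace the extension" call being applied to a name that contains dots. The runner's code was not inspected, so this is a guess at the mechanism, shown here with Python's pathlib (Rust's `Path::with_extension` follows the same last-dot rule). `with_appended_ext` is a hypothetical helper illustrating one possible fix.

```python
from pathlib import PurePosixPath

model_dir = PurePosixPath("output/workspace/Qwen/Qwen2.5-Coder-0.5B-Instruct")

# Everything after the last dot is treated as the "extension".
print(model_dir.suffix)               # .5B-Instruct
# Replacing that "extension" chops off part of the model name.
print(model_dir.with_suffix(".apr"))  # output/workspace/Qwen/Qwen2.5-Coder-0.apr


def with_appended_ext(path: PurePosixPath, ext: str) -> PurePosixPath:
    """Append ext to the final path component instead of replacing a suffix."""
    return path.with_name(path.name + ext)


print(with_appended_ext(model_dir, ".apr"))
# output/workspace/Qwen/Qwen2.5-Coder-0.5B-Instruct.apr
```

The same last-dot rule would also explain the `.5b-instruct` "extension" reported by I3 and I5 if the format detector lowercases the suffix before matching.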

Actionable Items for Aprender

  1. P0: GGUF tokenizer passthrough (#216, "PMAT-232: GGUF weights-only files fail conversion to APR"). rosetta convert should auto-discover a sibling tokenizer.json or accept a --tokenizer flag when converting weights-only GGUF files.
  2. P1: Conversion diff investigation (#215, "Conversion output differences exceed acceptable tolerance"). Determine whether the 0.77-0.81 pairwise diffs represent actual inference divergence or a measurement artifact (hash-based vs. semantic comparison).
  3. P2: rosetta validate-stats CLI. Verify that the subcommand accepts the argument format the QA runner is passing.

Reproduction

cd ../apr-model-qa-playbook

# Install apr from aprender
cargo install --path ../aprender/crates/apr-cli

# Run the qualification
cargo run --release --bin apr-qa -- run \
    playbooks/models/qwen2.5-coder-0.5b-mvp.playbook.yaml \
    --model-path ~/.cache/pacha/models/d71534cb948e32eb.safetensors \
    --no-gpu \
    -o certifications/qwen2.5-coder-0.5b-mvp

# Analyze results
python3 -c "
import json
with open('certifications/qwen2.5-coder-0.5b-mvp/evidence.json') as f:
    data = json.load(f)
for e in data:
    status = 'PASS' if e['outcome'] == 'Corroborated' else 'FAIL'
    print(f'[{status}] {e[\"gate_id\"]}: {e[\"reason\"][:100]}')
"
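The headline numbers can be recomputed from the same file with a short tally. This sketch assumes, as the script above does, that evidence.json is a list of records with an `outcome` field; `summarize` is a hypothetical helper, and the sample data below is synthetic, built only to mirror the 29/16 split reported in this issue.

```python
from collections import Counter


def summarize(evidence):
    """Return (passed, total, pass_rate) for a list of evidence records."""
    counts = Counter(e["outcome"] for e in evidence)
    total = sum(counts.values())
    passed = counts.get("Corroborated", 0)
    rate = passed / total if total else 0.0
    return passed, total, rate


# Synthetic stand-in for json.load(...) on evidence.json.
evidence = [{"outcome": "Corroborated"}] * 29 + [{"outcome": "Falsified"}] * 16
passed, total, rate = summarize(evidence)
print(f"{passed}/{total} ({rate:.0%})")  # 29/45 (64%)
```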

Raw Evidence

Full evidence JSON: certifications/qwen2.5-coder-0.5b-mvp/evidence.json in the apr-model-qa-playbook repo.

Labels: bug (Something isn't working)