First automated MVP qualification run for Qwen2.5-Coder-0.5B-Instruct using the apr-model-qa-playbook framework. The model passes all gateway checks (G0-G4) and core inference gates, but fails 16/45 gates across format conversion and contract invariant tests.
Root Cause 1: GGUF weights-only files lack embedded tokenizer (7 gates)
Any conversion chain that reads from GGUF fails with PMAT-232:
ERROR: GGUF file '...model.gguf' has no embedded tokenizer vocabulary.
This is a 'weights-only' GGUF that cannot produce a working APR file.
Affected gates:
| Gate | Chain |
| --- | --- |
| F-CONV-G-A | GGUF → APR |
| F-CONV-G-S | GGUF → SafeTensors |
| F-CONV-RT-001 | GGUF → APR → ST → GGUF |
| F-CONV-IDEM-001 | Idempotency (GGUF path) |
| F-CONV-COM-001 | Commutativity (GGUF path) |
Plus cascading failures in RT-001 which starts from GGUF.
Fix: apr rosetta convert should accept --tokenizer <path> to supply an external tokenizer when the GGUF is weights-only. The QA playbook workspace already has tokenizer.json alongside the GGUF, so rosetta only needs to look for it there or accept it as an argument.
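The lookup order proposed above can be sketched as follows. This is only an illustration of the idea, not rosetta's code: `resolve_tokenizer` and its `explicit` parameter are hypothetical names, and `--tokenizer` is the flag this issue proposes, not an existing option.

```python
from pathlib import Path
from typing import Optional


def resolve_tokenizer(gguf_path: str, explicit: Optional[str] = None) -> Optional[Path]:
    """Pick a tokenizer file for a weights-only GGUF (hypothetical helper).

    An explicitly supplied path wins; otherwise fall back to a
    tokenizer.json sitting in the same directory as the GGUF file.
    """
    if explicit is not None:
        candidate = Path(explicit)
        if not candidate.is_file():
            raise FileNotFoundError(f"tokenizer not found: {candidate}")
        return candidate

    # Sibling lookup: same directory as the GGUF, fixed file name.
    sibling = Path(gguf_path).parent / "tokenizer.json"
    return sibling if sibling.is_file() else None
```

Returning `None` when nothing is found leaves the caller free to raise the existing PMAT-232 error unchanged.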
Root Cause 2: Inference diff too high after format conversion (5 gates)
Pairwise conversion produces inference output that differs significantly from the source format. The diffs are ~0.77-0.81 (on a 0-1 scale) against an epsilon of 1e-6:
| Gate | Conversion | Diff |
| --- | --- | --- |
| F-CONV-A-G | APR → GGUF | 8.00e-1 |
| F-CONV-S-G | ST → GGUF | 8.00e-1 |
| F-CONV-A-S | APR → ST | 8.12e-1 |
| F-CONV-S-A | ST → APR | 7.67e-1 |
Round-trip chains also fail (F-CONV-RT-002, RT-003, RT-004) as the errors compound.
Note: The golden rule test (F-GOLDEN-RULE-001) PASSES, meaning inference(convert(ST→APR)) == inference(ST). The high diffs in pairwise tests may indicate the comparison metric is using fingerprint hashes rather than semantic output similarity. Needs investigation — are these truly different inference outputs, or is the comparison method too strict?
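To illustrate why the choice of metric matters, here is a minimal sketch contrasting a fingerprint-hash comparison with a numeric tolerance check. It does not reproduce the playbook's actual comparison, which this report has not yet examined. The helper names and sample values are made up for the example, and only the 1e-6 epsilon comes from the report.

```python
import hashlib
import struct

EPSILON = 1e-6  # tolerance quoted in the report


def fingerprint(values):
    """SHA-256 over the packed float64 values; any bit change alters it."""
    return hashlib.sha256(struct.pack(f"{len(values)}d", *values)).hexdigest()


def max_abs_diff(a, b):
    """Largest element-wise absolute difference between two outputs."""
    return max(abs(x - y) for x, y in zip(a, b))


# Illustrative outputs that differ only by a tiny rounding-sized amount.
source = [0.125, -1.5, 3.0]
converted = [0.125, -1.5, 3.0 + 1e-9]

hashes_match = fingerprint(source) == fingerprint(converted)
within_epsilon = max_abs_diff(source, converted) <= EPSILON

print(f"hashes match: {hashes_match}, within epsilon: {within_epsilon}")
```

A hash comparison reports a mismatch for any bit-level difference, while the tolerance check accepts the same pair. If the pairwise gates rely on something hash-like, large diffs would not by themselves show that the inference outputs really differ.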
Root Cause 3: Contract test workspace path construction bug (4 gates)
The contract invariant tests (I-2 through I-5) fail due to incorrect file path construction in the QA runner:
| Gate | Error |
| --- | --- |
| F-CONTRACT-I2-001 | File not found: output/workspace/Qwen/Qwen2.5-Coder-0.apr (path truncated; should be Qwen2.5-Coder-0.5B-Instruct.apr) |
| F-CONTRACT-I3-001 | Unknown format extension: .5b-instruct (directory path passed where file path expected) |
| F-CONTRACT-I4-001 | unexpected argument 'output/workspace/Qwen/Qwen2.5-Coder-0.apr' (wrong arg to rosetta validate-stats) |
| F-CONTRACT-I5-001 | Unknown format extension: .5b-instruct (same as I3) |
This is a bug in apr-model-qa-playbook, not aprender. The contract test runner is constructing workspace paths incorrectly when the model name contains dots (e.g., Qwen2.5-Coder-0.5B). Filing separately in the playbook repo.
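One plausible mechanism for the truncated path is an extension-replacing helper applied to a name that already contains dots. This is an assumption, since the runner's code is not shown here. The sketch below uses Python's pathlib purely for illustration (the variable names are made up) and reproduces both the truncated file name and the `.5b-instruct` extension seen in the errors.

```python
from pathlib import PurePosixPath

workspace = PurePosixPath("output/workspace/Qwen/Qwen2.5-Coder-0.5B-Instruct")

# Everything after the last dot is treated as the "extension".
apparent_ext = workspace.suffix.lower()
print(apparent_ext)  # .5b-instruct

# with_suffix() swaps out that apparent extension, truncating the name.
truncated = workspace.with_suffix(".apr")
print(truncated)  # output/workspace/Qwen/Qwen2.5-Coder-0.apr

# Appending to the full name keeps the dots in the model name intact.
intended = workspace.with_name(workspace.name + ".apr")
print(intended)  # output/workspace/Qwen/Qwen2.5-Coder-0.5B-Instruct.apr
```

If the runner does something equivalent, appending the extension to the full name instead of replacing the suffix would avoid the truncation.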
P2 — rosetta validate-stats CLI: verify the subcommand accepts the argument format the QA runner is passing
Reproduction
cd ../apr-model-qa-playbook
# Install apr from aprender
cargo install --path ../aprender/crates/apr-cli
# Run the qualification
cargo run --release --bin apr-qa -- run \
playbooks/models/qwen2.5-coder-0.5b-mvp.playbook.yaml \
--model-path ~/.cache/pacha/models/d71534cb948e32eb.safetensors \
--no-gpu \
-o certifications/qwen2.5-coder-0.5b-mvp
# Analyze results
python3 -c "
import json
with open('certifications/qwen2.5-coder-0.5b-mvp/evidence.json') as f:
    data = json.load(f)
for e in data:
    status = 'PASS' if e['outcome'] == 'Corroborated' else 'FAIL'
    print(f'[{status}] {e[\"gate_id\"]}: {e[\"reason\"][:100]}')
"
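To get the headline pass rate rather than a per-gate listing, the same evidence entries can be tallied. This is a small sketch that assumes evidence.json is a list of objects with the `outcome` and `gate_id` fields used in the one-liner above. The `summarize` helper and the sample entries are illustrative, not real evidence.

```python
from collections import Counter


def summarize(entries):
    """Tally gate outcomes; field names follow the one-liner above."""
    counts = Counter(e["outcome"] for e in entries)
    total = len(entries)
    passed = counts.get("Corroborated", 0)
    failed_ids = [e["gate_id"] for e in entries if e["outcome"] != "Corroborated"]
    rate = 100.0 * passed / total if total else 0.0
    return passed, total, rate, failed_ids


# Illustrative entries only; load the real list with json.load() instead.
sample = [
    {"gate_id": "F-GOLDEN-RULE-001", "outcome": "Corroborated"},
    {"gate_id": "F-CONV-G-A", "outcome": "Falsified"},
]
passed, total, rate, failed_ids = summarize(sample)
print(f"{passed}/{total} Corroborated ({rate:.0f}% pass rate); failed: {failed_ids}")
```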
Raw Evidence
Full evidence JSON: certifications/qwen2.5-coder-0.5b-mvp/evidence.json in apr-model-qa-playbook repo.
Summary
Result: 29/45 Corroborated, 16 Falsified (64% pass rate).
Environment
Playbook: qwen2.5-coder-0.5b-mvp.playbook.yaml
Model: Qwen/Qwen2.5-Coder-0.5B-Instruct (SafeTensors, 942 MB)
GPU: disabled (--no-gpu)
What Passes (29 gates)
All gateways and core inference gates pass.
What Fails (16 gates): 3 Root Causes
Root Cause 1: GGUF weights-only files lack embedded tokenizer (7 gates). Refs: #216, #185
Root Cause 2: Inference diff too high after format conversion (5 gates). Refs: #215
Root Cause 3: Contract test workspace path construction bug (4 gates)
Actionable Items for Aprender
rosetta convert should auto-discover a sibling tokenizer.json or accept a --tokenizer flag when converting weights-only GGUF files.