Bug
apr serve startup-readiness check in aprender-orchestrate/src/agent/driver/apr_serve.rs:143 uses a hardcoded Duration::from_secs(30). For large MoE GGUFs (e.g. Qwen3-Coder-30B-A3B-Instruct-Q4_K_M.gguf, 18.5 GB), cold-cache load + tokenizer setup + tensor validation routinely exceeds 30s — causing every subprocess driver invocation to fail with:
apr serve did not become ready within 30s
subprocess stderr:
[BOS-FALLBACK] No tokenizer.ggml.bos_token_id in GGUF — using architecture default for 'qwen3moe'
When apr-serve startup fails this way, the SubprocessDriver falls back to embedded inference which then trips a SECOND bug on the same model:
Error: driver error: inference failed: Invalid shape: Tensor 'blk.0.ffn_up.weight' not found
(Qwen3-MoE uses ffn_up_exps per-expert, not dense ffn_up; the embedded fallback path doesn't know about MoE tensor naming.)
Empirical evidence
CCPA M260 dispatch (paiml/claude-code-parity-apr#238) ran bash scripts/phase-5-calibration-bench.sh against the M242 n=15 calibration-and-scale corpus. Every single one of the 15 student dispatches (apr code + Qwen3-Coder-30B-A3B-Instruct-Q4_K_M.gguf) failed with the above stacked errors → student-side score 0/15. The teacher (claude-opus-4-7) scored 15/15 in parallel, so the corpus is real and the failure is harness-side.
Root cause (5-whys)
- Why did all 15 student dispatches fail? →
apr serve did not become ready within 30s
- Why? → Cold-cache load of 18.5 GB GGUF + tokenizer setup + tensor validation exceeds 30s
- Why is the timeout 30s? → Hardcoded
Duration::from_secs(30) at apr_serve.rs:143
- Why hardcoded? → No env-var or config knob; no model-size-aware scaling
- Why? → Originally designed for sub-2GB models that load in <5s; large MoE models weren't a use case
Companion-side workaround (already shipped)
scripts/phase-5-{arena,calibration}-bench.sh now pre-warms the GGUF into OS page cache via cat \$APR_MODEL > /dev/null before fixture dispatch. Empirically verified: cold-cache load >30s; warm-cache load ~1s. Workaround unblocks measurement on hosts with enough RAM to keep the model in page cache.
This issue tracks the upstream fix needed for hosts without enough RAM for full pre-warm + for general robustness.
Proposed upstream fix
Make the startup-readiness timeout env-var-configurable:
let timeout_secs = std::env::var(\"APR_SERVE_READY_TIMEOUT_S\")
.ok()
.and_then(|s| s.parse::<u64>().ok())
.unwrap_or(30);
let timeout = std::time::Duration::from_secs(timeout_secs);
Optional enhancement: scale default automatically by model file size (e.g. 30s + 1s per GB).
Also recommend fixing the embedded-fallback Qwen3-MoE tensor-name resolution as a separate concern (so that fallback works on MoE models when it does fire).
Cross-references
Bug
apr servestartup-readiness check inaprender-orchestrate/src/agent/driver/apr_serve.rs:143uses a hardcodedDuration::from_secs(30). For large MoE GGUFs (e.g. Qwen3-Coder-30B-A3B-Instruct-Q4_K_M.gguf, 18.5 GB), cold-cache load + tokenizer setup + tensor validation routinely exceeds 30s — causing every subprocess driver invocation to fail with:When apr-serve startup fails this way, the SubprocessDriver falls back to embedded inference which then trips a SECOND bug on the same model:
(Qwen3-MoE uses
ffn_up_expsper-expert, not denseffn_up; the embedded fallback path doesn't know about MoE tensor naming.)Empirical evidence
CCPA M260 dispatch (paiml/claude-code-parity-apr#238) ran
bash scripts/phase-5-calibration-bench.shagainst the M242 n=15 calibration-and-scale corpus. Every single one of the 15 student dispatches (apr code+Qwen3-Coder-30B-A3B-Instruct-Q4_K_M.gguf) failed with the above stacked errors → student-side score 0/15. The teacher (claude-opus-4-7) scored 15/15 in parallel, so the corpus is real and the failure is harness-side.Root cause (5-whys)
apr serve did not become ready within 30sDuration::from_secs(30)atapr_serve.rs:143Companion-side workaround (already shipped)
scripts/phase-5-{arena,calibration}-bench.shnow pre-warms the GGUF into OS page cache viacat \$APR_MODEL > /dev/nullbefore fixture dispatch. Empirically verified: cold-cache load >30s; warm-cache load ~1s. Workaround unblocks measurement on hosts with enough RAM to keep the model in page cache.This issue tracks the upstream fix needed for hosts without enough RAM for full pre-warm + for general robustness.
Proposed upstream fix
Make the startup-readiness timeout env-var-configurable:
Optional enhancement: scale default automatically by model file size (e.g. 30s + 1s per GB).
Also recommend fixing the embedded-fallback Qwen3-MoE tensor-name resolution as a separate concern (so that fallback works on MoE models when it does fire).
Cross-references
evidence/calibration-and-scale/