apr serve: 30s startup-readiness timeout is too short for large MoE GGUFs #1781

@noahgift

Bug

The apr serve startup-readiness check at aprender-orchestrate/src/agent/driver/apr_serve.rs:143 uses a hardcoded Duration::from_secs(30). For large MoE GGUFs (e.g. Qwen3-Coder-30B-A3B-Instruct-Q4_K_M.gguf, 18.5 GB), cold-cache load plus tokenizer setup plus tensor validation routinely exceeds 30s, so every subprocess driver invocation fails with:

apr serve did not become ready within 30s
subprocess stderr:
[BOS-FALLBACK] No tokenizer.ggml.bos_token_id in GGUF — using architecture default for 'qwen3moe'

When apr serve startup fails this way, the SubprocessDriver falls back to embedded inference, which then trips a second bug on the same model:

Error: driver error: inference failed: Invalid shape: Tensor 'blk.0.ffn_up.weight' not found

(Qwen3-MoE uses per-expert ffn_up_exps tensors, not a dense ffn_up; the embedded fallback path doesn't know about MoE tensor naming.)

Empirical evidence

CCPA M260 dispatch (paiml/claude-code-parity-apr#238) ran bash scripts/phase-5-calibration-bench.sh against the M242 n=15 calibration-and-scale corpus. Every single one of the 15 student dispatches (apr code + Qwen3-Coder-30B-A3B-Instruct-Q4_K_M.gguf) failed with the above stacked errors → student-side score 0/15. The teacher (claude-opus-4-7) scored 15/15 in parallel, so the corpus is real and the failure is harness-side.

Root cause (5-whys)

  1. Why did all 15 student dispatches fail? → apr serve did not become ready within 30s
  2. Why? → Cold-cache load of 18.5 GB GGUF + tokenizer setup + tensor validation exceeds 30s
  3. Why is the timeout 30s? → Hardcoded Duration::from_secs(30) at apr_serve.rs:143
  4. Why hardcoded? → No env-var or config knob; no model-size-aware scaling
  5. Why? → Originally designed for sub-2GB models that load in <5s; large MoE models weren't a use case

Companion-side workaround (already shipped)

scripts/phase-5-{arena,calibration}-bench.sh now pre-warms the GGUF into the OS page cache via cat $APR_MODEL > /dev/null before fixture dispatch. Empirically verified: cold-cache load >30s; warm-cache load ~1s. The workaround unblocks measurement on hosts with enough RAM to keep the model in page cache.
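
For callers that cannot shell out, the same pre-warm can be sketched in Rust. This is a minimal illustration, not code from apr: the function name prewarm_page_cache is hypothetical, and whether the kernel keeps the pages resident still depends on available RAM, exactly as with the cat workaround.

```rust
use std::fs::File;
use std::io;
use std::path::Path;

/// Read a file from start to finish and discard the bytes, so that the OS
/// has a chance to keep its contents in the page cache.
/// Returns the number of bytes read.
/// Hypothetical helper for illustration; not part of apr.
pub fn prewarm_page_cache(path: &Path) -> io::Result<u64> {
    let mut file = File::open(path)?;
    // io::copy streams through an internal buffer; io::sink drops every byte,
    // which mirrors `cat FILE > /dev/null`.
    io::copy(&mut file, &mut io::sink())
}
```

The returned byte count can be logged next to the model path to confirm that the whole file was actually read before dispatch.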

This issue tracks the upstream fix, which is needed both for hosts without enough RAM for a full pre-warm and for general robustness.

Proposed upstream fix

Make the startup-readiness timeout env-var-configurable:

let timeout_secs = std::env::var("APR_SERVE_READY_TIMEOUT_S")
    .ok()
    .and_then(|s| s.parse::<u64>().ok())
    .unwrap_or(30);
let timeout = std::time::Duration::from_secs(timeout_secs);
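
For context, here is a rough sketch of how such a timeout might be consumed by a deadline-driven readiness poll. The actual loop in apr_serve.rs is not shown in this issue, so wait_until_ready, its probe closure, and the poll interval are all assumptions made for illustration; only the error message wording is taken from the log above.

```rust
use std::thread;
use std::time::{Duration, Instant};

/// Call `probe` repeatedly until it reports ready or `timeout` elapses.
/// On success, returns how long startup took.
/// Hypothetical helper; the real readiness check may look different.
pub fn wait_until_ready<F>(
    mut probe: F,
    timeout: Duration,
    interval: Duration,
) -> Result<Duration, String>
where
    F: FnMut() -> bool,
{
    let start = Instant::now();
    loop {
        if probe() {
            return Ok(start.elapsed());
        }
        if start.elapsed() >= timeout {
            return Err(format!(
                "apr serve did not become ready within {}s",
                timeout.as_secs()
            ));
        }
        // Sleep between probes so the loop does not spin.
        thread::sleep(interval);
    }
}
```

Returning the elapsed time on success would also make it easy to log real cold-cache load times and pick a sensible default.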

Optional enhancement: scale the default automatically by model file size (e.g. 30s + 1s per GB).
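
A minimal sketch of that scaling, assuming decimal gigabytes, rounding partial gigabytes up, and letting an explicit env-var value win over the computed default. The function and constant names are hypothetical, and the 30s + 1s per GB figures are only the example values above, not calibrated numbers.

```rust
use std::time::Duration;

const BASE_SECS: u64 = 30; // current hardcoded default
const SECS_PER_GB: u64 = 1; // example slope from the proposal
const BYTES_PER_GB: u64 = 1_000_000_000; // decimal GB (assumption)

/// Size-aware default: base timeout plus a per-GB allowance.
pub fn default_ready_timeout(model_bytes: u64) -> Duration {
    // Round partial gigabytes up, so an 18.5 GB file counts as 19 GB.
    let gb = model_bytes.div_ceil(BYTES_PER_GB);
    Duration::from_secs(BASE_SECS + SECS_PER_GB * gb)
}

/// An explicit override (e.g. the raw APR_SERVE_READY_TIMEOUT_S value)
/// takes precedence; otherwise fall back to the size-aware default.
pub fn resolve_ready_timeout(env_value: Option<&str>, model_bytes: u64) -> Duration {
    env_value
        .and_then(|s| s.parse::<u64>().ok())
        .map(Duration::from_secs)
        .unwrap_or_else(|| default_ready_timeout(model_bytes))
}
```

With these example constants the 18.5 GB model gets 49s. This issue only establishes that its cold-cache load exceeds 30s, so the slope would need to be checked against measured load times before it is relied on.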

Also recommend fixing the embedded-fallback Qwen3-MoE tensor-name resolution as a separate concern (so that fallback works on MoE models when it does fire).
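
One possible shape for that tensor-name resolution is sketched below. The dense name comes from the error message above; the per-expert name blk.N.ffn_up_exps.weight is an assumption formed by analogy with it, and FfnUp and resolve_ffn_up are hypothetical names, not existing aprender APIs.

```rust
use std::collections::HashSet;

/// Which feed-forward "up" projection layout a layer uses.
#[derive(Debug, PartialEq, Eq)]
pub enum FfnUp {
    /// Single dense tensor, e.g. blk.0.ffn_up.weight.
    Dense(String),
    /// Stacked per-expert tensor (assumed name: blk.0.ffn_up_exps.weight).
    Experts(String),
}

/// Look for the dense name first, then the per-expert name, and return
/// None when neither is present so the caller can report both candidates.
pub fn resolve_ffn_up(tensors: &HashSet<String>, layer: usize) -> Option<FfnUp> {
    let dense = format!("blk.{layer}.ffn_up.weight");
    let experts = format!("blk.{layer}.ffn_up_exps.weight");
    if tensors.contains(&dense) {
        Some(FfnUp::Dense(dense))
    } else if tensors.contains(&experts) {
        Some(FfnUp::Experts(experts))
    } else {
        None
    }
}
```

Resolving the name is only the first step: the fallback would still need an MoE-aware forward pass (expert routing), which this sketch does not attempt.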

Cross-references
