apr serve: 30s startup-readiness timeout is too short for large MoE GGUFs #1781

## Bug

The `apr serve` startup-readiness check in `aprender-orchestrate/src/agent/driver/apr_serve.rs:143` uses a hardcoded `Duration::from_secs(30)`. For large MoE GGUFs (e.g. Qwen3-Coder-30B-A3B-Instruct-Q4_K_M.gguf, 18.5 GB), cold-cache load + tokenizer setup + tensor validation routinely exceeds 30s, so every subprocess driver invocation fails with:

```
apr serve did not become ready within 30s
subprocess stderr:
[BOS-FALLBACK] No tokenizer.ggml.bos_token_id in GGUF — using architecture default for 'qwen3moe'
```

When `apr serve` startup fails this way, the SubprocessDriver falls back to embedded inference, which then trips a second bug on the same model:

```
Error: driver error: inference failed: Invalid shape: Tensor 'blk.0.ffn_up.weight' not found
```

(Qwen3-MoE uses `ffn_up_exps` per-expert, not dense `ffn_up`; the embedded fallback path doesn't know about MoE tensor naming.)

## Empirical evidence

CCPA M260 dispatch (paiml/claude-code-parity-apr#238) ran `bash scripts/phase-5-calibration-bench.sh` against the M242 n=15 calibration-and-scale corpus. All 15 student dispatches (`apr code` + `Qwen3-Coder-30B-A3B-Instruct-Q4_K_M.gguf`) failed with the stacked errors above → student-side score 0/15. The teacher (claude-opus-4-7) scored 15/15 in parallel, so the corpus is real and the failure is harness-side.

## Root cause (5-whys)

1. Why did all 15 student dispatches fail? → `apr serve did not become ready within 30s`
2. Why? → Cold-cache load of 18.5 GB GGUF + tokenizer setup + tensor validation exceeds 30s
3. Why is the timeout 30s? → Hardcoded `Duration::from_secs(30)` at `apr_serve.rs:143`
4. Why hardcoded? → No env-var or config knob; no model-size-aware scaling
5. Why? → Originally designed for sub-2GB models that load in <5s; large MoE models weren't a use case

## Companion-side workaround (already shipped)

`scripts/phase-5-{arena,calibration}-bench.sh` now pre-warms the GGUF into the OS page cache via `cat $APR_MODEL > /dev/null` before fixture dispatch. Empirically verified: cold-cache load >30s; warm-cache load ~1s. The workaround unblocks measurement on hosts with enough RAM to keep the model in page cache.
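For harnesses that would rather pre-warm from Rust than shell out to `cat`, the same idea can be sketched as below. `prewarm` is a hypothetical helper, not part of `apr`; it reads the file sequentially and discards the bytes so the kernel pulls the pages into cache, and like the shell version it only helps when the host has enough free RAM to hold the model.

```rust
use std::fs::File;
use std::io::{self, Read};
use std::path::Path;

/// Hypothetical helper: read the whole file and throw the bytes away,
/// the Rust equivalent of `cat FILE > /dev/null`.
/// Returns the number of bytes read.
fn prewarm(path: &Path) -> io::Result<u64> {
    let mut file = File::open(path)?;
    let mut buf = vec![0u8; 1 << 20]; // 1 MiB scratch buffer
    let mut total = 0u64;
    loop {
        match file.read(&mut buf) {
            Ok(0) => break, // end of file
            Ok(n) => total += n as u64,
            // Retry if a signal interrupted the read.
            Err(e) if e.kind() == io::ErrorKind::Interrupted => continue,
            Err(e) => return Err(e),
        }
    }
    Ok(total)
}
```

A caller would invoke `prewarm(Path::new(&model_path))` once before the first dispatch and can compare the returned byte count against the file size as a sanity check.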

This issue tracks the upstream fix needed for hosts without enough RAM for a full pre-warm, and for general robustness.

## Proposed upstream fix

Make the startup-readiness timeout env-var-configurable:

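```rust
// Read the timeout (in seconds) from the environment; fall back to the
// current 30s default when the variable is unset or not a valid u64.
let timeout_secs = std::env::var("APR_SERVE_READY_TIMEOUT_S")
    .ok()
    .and_then(|s| s.parse::<u64>().ok())
    .unwrap_or(30);
let timeout = std::time::Duration::from_secs(timeout_secs);
```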

Optional enhancement: scale default automatically by model file size (e.g. 30s + 1s per GB).
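A minimal sketch of how the size-aware default could combine with the env-var override. `ready_timeout` and its parameters are hypothetical names, and "GB" is taken here as 10^9 bytes with whole-GB truncation; both are assumptions, not something `apr` defines.

```rust
use std::time::Duration;

const BASE_TIMEOUT_SECS: u64 = 30;
const BYTES_PER_GB: u64 = 1_000_000_000; // assumption: decimal GB

/// Hypothetical helper: an explicit override wins when it parses as u64;
/// otherwise the default is 30s plus 1s per whole GB of model file.
fn ready_timeout(env_override: Option<&str>, model_bytes: u64) -> Duration {
    let secs = env_override
        .and_then(|s| s.trim().parse::<u64>().ok())
        .unwrap_or_else(|| BASE_TIMEOUT_SECS + model_bytes / BYTES_PER_GB);
    Duration::from_secs(secs)
}
```

At the call site the inputs could come from `std::env::var("APR_SERVE_READY_TIMEOUT_S").ok().as_deref()` and `std::fs::metadata(path)?.len()`. Under these assumptions the 18.5 GB model above gets a 48s default, which may still be too short on a cold cache, so the per-GB factor would need tuning against real load times.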

Also recommend fixing the embedded-fallback Qwen3-MoE tensor-name resolution as a separate concern (so that fallback works on MoE models when it does fire).
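One possible shape for that resolution, as a sketch only: try the dense name first, then the per-expert name. `resolve_ffn_up` is a hypothetical helper, the tensor index is modelled as a plain set of names, and the full MoE tensor name `blk.N.ffn_up_exps.weight` is assumed by analogy with the dense name in the error above.

```rust
use std::collections::HashSet;

/// Hypothetical helper: return whichever FFN up-projection tensor name
/// exists for `layer`, preferring the dense layout over the MoE layout.
fn resolve_ffn_up(tensors: &HashSet<String>, layer: usize) -> Option<String> {
    let dense = format!("blk.{layer}.ffn_up.weight");
    // Assumption: per-expert weights are stored under `ffn_up_exps`.
    let moe = format!("blk.{layer}.ffn_up_exps.weight");
    [dense, moe].into_iter().find(|name| tensors.contains(name))
}
```

Name resolution alone would not make the fallback work: the embedded path would also need MoE-aware routing over the expert tensors once it finds them.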

## Cross-references

- Companion-side workaround: paiml/claude-code-parity-apr commit M262 (pre-warm in bench scripts)
- M260 evidence: paiml/claude-code-parity-apr `evidence/calibration-and-scale/`
- M238 Branch B Phase 2 baseline-confounder discipline: paiml/claude-code-parity-apr#225-ish (similar class of issue, different fixture set)
