fix(#1789): make AprServeDriver max_tokens cap env-configurable by noahgift · Pull Request #1814 · paiml/aprender

noahgift · 2026-05-19T13:08:57Z

Summary

Third in the aprender#1789 follow-up series. PMAT-170's hardcoded 1024-token cap in AprServeDriver::build_openai_body is reasonable for small/fast models but excessive for large MoE models without KV cache.

Empirical context

Phase 6 bench against Qwen3-Coder-30B-A3B-Instruct-Q4_K_M (post-#1812):

Smoke test: ~0.5 tok/s sustained
1024 tokens at 0.5 tok/s = 2048 s (~34 min) wall per turn
Bench's per-turn-timeout budget is 900s default, 2000s configured
Result: every fixture hits driver_error (exit 124, timeout SIGTERM)

The HTTP timeout knob from #1812 isn't enough — the bottleneck is generation length, not HTTP wait.
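The arithmetic behind that conclusion can be sketched as follows. This is an illustration only, not code from the PR: `turn_wall_secs` and `fits_budget` are hypothetical helper names, and the estimate assumes generation runs to the full cap and ignores prompt prefill and HTTP overhead.

```rust
/// Estimated wall-clock seconds for one turn that generates `max_tokens`
/// tokens at a sustained `tok_per_s` (hypothetical helper, not from the PR).
/// Assumes generation always runs to the cap; ignores prefill and HTTP overhead.
fn turn_wall_secs(max_tokens: u32, tok_per_s: f64) -> f64 {
    f64::from(max_tokens) / tok_per_s
}

/// True when the estimated turn time fits inside the per-turn timeout budget.
fn fits_budget(max_tokens: u32, tok_per_s: f64, budget_secs: f64) -> bool {
    turn_wall_secs(max_tokens, tok_per_s) <= budget_secs
}
```

With the figures above, `turn_wall_secs(1024, 0.5)` is 2048 s, which exceeds both the 900 s default and the 2000 s configured budget. A cap of 128 (the value the Phase 6 bench uses) gives 256 s, which fits under either budget.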

Fix

Same env-var pattern as APR_AGENT_HTTP_TIMEOUT_S (#1812). New APR_AGENT_MAX_TOKENS_CAP env var; default unchanged at 1024 (no regression for existing flows). Phase 6 bench uses APR_AGENT_MAX_TOKENS_CAP=128 to fit per-turn budget.
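A minimal sketch of what such an env-configurable cap might look like. The actual change in `AprServeDriver::build_openai_body` is not shown on this page, so everything here apart from the variable name and the 1024 default is an assumption: the helper names are hypothetical, and falling back to the default on unparsable or zero values is one plausible policy, not necessarily the one the PR implements.

```rust
use std::env;

/// Default cap, per the PR description (unchanged at 1024).
const DEFAULT_MAX_TOKENS_CAP: u32 = 1024;

/// Env var name, per the PR description.
const MAX_TOKENS_CAP_ENV: &str = "APR_AGENT_MAX_TOKENS_CAP";

/// Hypothetical helper: interpret the raw env value as a cap.
/// Assumption: missing, unparsable, or zero values fall back to the default.
fn parse_max_tokens_cap(raw: Option<&str>) -> u32 {
    raw.and_then(|s| s.trim().parse::<u32>().ok())
        .filter(|&n| n > 0)
        .unwrap_or(DEFAULT_MAX_TOKENS_CAP)
}

/// Hypothetical helper: read the cap from the process environment.
fn max_tokens_cap_from_env() -> u32 {
    parse_max_tokens_cap(env::var(MAX_TOKENS_CAP_ENV).ok().as_deref())
}

/// Hypothetical helper: clamp a requested max_tokens value to the cap
/// before it is written into the request body.
fn capped_max_tokens(requested: u32, cap: u32) -> u32 {
    requested.min(cap)
}
```

Keeping the parsing in a pure function that takes `Option<&str>` makes the fallback behaviour testable without mutating the process environment. With `APR_AGENT_MAX_TOKENS_CAP=128` set, a request asking for 4096 tokens would be clamped to 128 under this sketch; with the variable unset, the cap stays at 1024.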

Test plan

cargo check -p aprender-orchestrate — clean
CI (sovereign-ci full workflow)
Phase 6 30B-MoE bench: non-driver_error student outcomes (oracle_passed, oracle_failed_after_max_turns, or oracle_failed)

🤖 Generated with Claude Code

PMAT-170's hardcoded 1024-token cap is reasonable for small/fast models but excessive for large MoE models without KV cache. At empirically observed ~0.5 tok/s on Qwen3-Coder-30B-A3B-Instruct-Q4_K_M (no KV cache, full-prefill-per-token), a 1024-token cap produces ~34 min wall per turn, exceeding the bench harness's per-turn budgets.

Same env-var pattern as APR_AGENT_HTTP_TIMEOUT_S (aprender#1812): allow the operator (or bench harness) to dial the cap down for slow models. Default unchanged at 1024, which covers all existing dense + small-MoE use cases. Override via APR_AGENT_MAX_TOKENS_CAP. Phase 6 30B-MoE bench uses APR_AGENT_MAX_TOKENS_CAP=128 to fit per-turn budget.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

noahgift force-pushed the fix/1789-max-tokens-cap-env branch from 6f89331 to 1b7f1e7 May 19, 2026 13:32

noahgift merged commit e589250 into main May 19, 2026
18 of 20 checks passed

noahgift deleted the fix/1789-max-tokens-cap-env branch May 19, 2026 14:23

This was referenced May 19, 2026

spec(M32d): KV cache for qwen3_moe inference path — scope + operator decision doc #1826

Merged

M32d: KV cache for qwen3_moe inference path (engineer-driven, 1-2 week) #1830

Closed
