fix(#1789): make AprServeDriver max_tokens cap env-configurable #1814

Merged: noahgift merged 1 commit into main from fix/1789-max-tokens-cap-env on May 19, 2026
Conversation

@noahgift

Contributor

Summary

Third in the aprender#1789 follow-up series. PMAT-170's hardcoded 1024-token cap in AprServeDriver::build_openai_body is reasonable for small/fast models but excessive for large MoE models without KV cache.

Empirical context

Phase 6 bench against Qwen3-Coder-30B-A3B-Instruct-Q4_K_M (post-#1812):

  • Smoke test: ~0.5 tok/s sustained
  • 1024 tokens at 0.5 tok/s = 2048 s (~34 min) of wall time per turn
  • The bench's per-turn timeout budget is 900 s by default and 2000 s as configured, so a full-length turn exceeds both
  • Result: every fixture hits driver_error (exit 124, timeout SIGTERM)

The HTTP timeout knob from #1812 isn't enough — the bottleneck is generation length, not HTTP wait.
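The arithmetic behind that conclusion can be sketched as follows. This is only an illustration: `est_turn_seconds` and `fits_budget` are hypothetical helper names, not functions from the aprender codebase, and the estimate assumes a constant decode rate while ignoring prompt prefill and HTTP overhead.

```rust
/// Estimated wall-clock seconds to generate `max_tokens` tokens at a
/// sustained decode rate of `tok_per_s` (assumed constant for this sketch).
fn est_turn_seconds(max_tokens: u32, tok_per_s: f64) -> f64 {
    f64::from(max_tokens) / tok_per_s
}

/// True when the estimated generation time fits inside the per-turn budget.
fn fits_budget(max_tokens: u32, tok_per_s: f64, budget_s: f64) -> bool {
    est_turn_seconds(max_tokens, tok_per_s) <= budget_s
}
```

With the figures quoted above, `fits_budget(1024, 0.5, 2000.0)` is false (2048 s needed), while a cap of 128 tokens would need about 256 s and fits either budget.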

Fix

Same env-var pattern as APR_AGENT_HTTP_TIMEOUT_S (#1812). New APR_AGENT_MAX_TOKENS_CAP env var; the default is unchanged at 1024, so existing flows do not regress. The Phase 6 bench uses APR_AGENT_MAX_TOKENS_CAP=128 to fit the per-turn budget (128 tokens at ~0.5 tok/s is roughly 256 s).
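A minimal sketch of how such an override might be read, assuming a parse-with-fallback approach. The diff itself is not shown in this conversation, so the helper names (`parse_cap`, `max_tokens_cap`, `effective_max_tokens`) and the handling of unparsable or zero values are assumptions for illustration, not the merged implementation; only the variable name and the 1024 default come from the PR text.

```rust
use std::env;

/// Default cap, matching the previously hardcoded value described in the PR.
const DEFAULT_MAX_TOKENS_CAP: u32 = 1024;

/// Parse an optional raw value. Falls back to the default when the value is
/// missing, not a positive integer, or zero (fallback policy is an assumption).
fn parse_cap(raw: Option<&str>) -> u32 {
    raw.and_then(|s| s.trim().parse::<u32>().ok())
        .filter(|&n| n > 0)
        .unwrap_or(DEFAULT_MAX_TOKENS_CAP)
}

/// Read the cap from the process environment.
fn max_tokens_cap() -> u32 {
    parse_cap(env::var("APR_AGENT_MAX_TOKENS_CAP").ok().as_deref())
}

/// Clamp a requested `max_tokens` so it never exceeds the cap.
fn effective_max_tokens(requested: u32, cap: u32) -> u32 {
    requested.min(cap)
}
```

Keeping the parsing in a pure function separate from the environment lookup makes the fallback behavior testable without mutating process-wide state.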

Test plan

  • cargo check -p aprender-orchestrate — clean
  • CI (sovereign-ci full workflow)
  • Phase 6 30B-MoE bench: expect non-driver_error student outcomes (oracle_passed, oracle_failed_after_max_turns, or oracle_failed)

🤖 Generated with Claude Code

PMAT-170's hardcoded 1024-token cap is reasonable for small/fast
models but excessive for large MoE models without KV cache. At
empirically observed ~0.5 tok/s on Qwen3-Coder-30B-A3B-Instruct-Q4_K_M
(no KV cache, full-prefill-per-token), a 1024-token cap produces
~34 min wall per turn — exceeding the bench harness's per-turn
budgets.

Same env-var pattern as APR_AGENT_HTTP_TIMEOUT_S (aprender#1812):
allow the operator (or bench harness) to dial the cap down for
slow models. Default unchanged at 1024 — covers all existing dense
+ small-MoE use cases.

Override via APR_AGENT_MAX_TOKENS_CAP. Phase 6 30B-MoE bench uses
APR_AGENT_MAX_TOKENS_CAP=128 to fit per-turn budget.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@noahgift force-pushed the fix/1789-max-tokens-cap-env branch from 6f89331 to 1b7f1e7 on May 19, 2026 13:32
@noahgift merged commit e589250 into main on May 19, 2026
18 of 20 checks passed
@noahgift deleted the fix/1789-max-tokens-cap-env branch on May 19, 2026 14:23