Summary
buildParams() in packages/ai/src/providers/openai-completions.ts passes options.maxTokens straight into the request without checking that the prompt plus maxTokens fits in model.contextWindow. When the configured maxTokens is half the context window (a common default for self-hosted vLLM models), the agent hits a hard ceiling at exactly contextWindow - maxTokens input tokens, and every subsequent request is rejected. The rejection surfaces only as the generic Stream ended without finish_reason, with no indication of what actually broke.
The result is catastrophic and unrecoverable: every retry sends the same oversized request, the context can't shrink itself, and the session is permanently wedged.
Reproduction
- Self-hosted vLLM with --max-model-len 131072 (or similar)
- Model entry in ~/.pi/agent/models.json:
  {
    "id": "Huihui-Qwen3.6-27B-abliterated-NVFP4-MTP-vllm",
    "contextWindow": 131072,
    "maxTokens": 65536
  }
- Long-running session (lots of tool output) that grows the prompt past ~65k tokens.
vLLM enforces prompt_tokens + max_tokens <= max_model_len. With max_tokens=65536 baked into every request, the break-even input is 131072 - 65536 = 65536 tokens. The moment one tool result crosses that line, vLLM rejects the request, sends an error-shaped chunk that the OpenAI SDK stream parser silently drops, the stream closes, and pi-ai throws Stream ended without finish_reason. Three retries — same request, same rejection — session dead.
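The break-even arithmetic can be checked in a few lines. This is only an illustrative sketch: maxPromptTokens and fitsServerLimit are hypothetical helper names (not part of pi-ai or vLLM), and they assume the server enforces exactly the inequality stated above.

```typescript
// Hypothetical helpers (not part of pi-ai or vLLM) illustrating the
// constraint prompt_tokens + max_tokens <= max_model_len.

/** Largest prompt that still fits when max_tokens is fixed per request. */
function maxPromptTokens(maxModelLen: number, maxTokens: number): number {
  return Math.max(0, maxModelLen - maxTokens);
}

/** True when the request satisfies the server-side length check. */
function fitsServerLimit(
  promptTokens: number,
  maxTokens: number,
  maxModelLen: number,
): boolean {
  return promptTokens + maxTokens <= maxModelLen;
}

console.log(maxPromptTokens(131072, 65536));        // 65536
console.log(fitsServerLimit(65478, 65536, 131072)); // true  (session B's last good turn)
console.log(fitsServerLimit(65537, 65536, 131072)); // false (one token past the cap)
```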
Evidence
Three independent pi sessions, same model, same wall:
| Session | Last successful prompt_tokens | Distance to 65,536 cap |
| --- | --- | --- |
| session A | 65,073 | −463 |
| session B | 65,478 | −58 |
| session C | 65,030 | −506 |
Every one fails on the very next turn (after one tool result pushes the prompt past 65,536) with Stream ended without finish_reason × 4 (initial + 3 retries) → Retry failed after 3 attempts. User reports the symptom as "happens immediately at 50% context" — which is exactly contextWindow / 2 when maxTokens == contextWindow / 2.
Proposed fix
Two small changes in packages/ai/src/providers/openai-completions.ts:
1. Clamp max_tokens in buildParams()
// CURRENT
if (options?.maxTokens) {
  if (compat.maxTokensField === "max_tokens") {
    params.max_tokens = options.maxTokens;
  } else {
    params.max_completion_tokens = options.maxTokens;
  }
}

// PROPOSED
if (options?.maxTokens) {
  let effective = options.maxTokens;
  if (typeof model.contextWindow === "number" && model.contextWindow > 0) {
    const promptChars = JSON.stringify(messages).length
      + (params.tools ? JSON.stringify(params.tools).length : 0);
    const promptEst = Math.ceil(promptChars / 4); // matches faux.js heuristic
    const SAFETY = 256;
    const headroom = model.contextWindow - promptEst - SAFETY;
    effective = Math.max(256, Math.min(effective, headroom));
  }
  if (compat.maxTokensField === "max_tokens") {
    params.max_tokens = effective;
  } else {
    params.max_completion_tokens = effective;
  }
}
As input grows, the per-request output budget shrinks instead of vLLM rejecting the whole request.
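To see how the output budget shrinks, here is a standalone sketch of the same clamp. clampMaxTokens is a hypothetical name used only for illustration, and it takes the prompt estimate as an argument rather than deriving it from the chars/4 heuristic, so it is not a drop-in for buildParams().

```typescript
// Hypothetical standalone version of the proposed clamp, for illustration only.
// The prompt estimate is passed in directly instead of being computed from
// JSON.stringify(...).length / 4 as in the proposal.
const SAFETY = 256;     // margin for estimation error, as in the proposal
const MIN_OUTPUT = 256; // floor used by the proposal

function clampMaxTokens(
  requested: number,
  contextWindow: number,
  promptEst: number,
): number {
  const headroom = contextWindow - promptEst - SAFETY;
  return Math.max(MIN_OUTPUT, Math.min(requested, headroom));
}

// Output budget for a 131072-token window with maxTokens = 65536.
for (const prompt of [10000, 65073, 70000, 120000, 131000]) {
  console.log(prompt, clampMaxTokens(65536, 131072, prompt));
}
```

With these inputs the budget stays at 65536 for the 10000- and 65073-token prompts, then drops to 60816 at 70000 and 10816 at 120000. At 131000 the floor of 256 applies even though headroom is negative, so a prompt that nearly fills the window would still exceed the server limit; the clamp delays the failure rather than replacing context compaction.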
2. More diagnostic error when the stream closes without finish_reason
// CURRENT
if (!hasFinishReason) {
  throw new Error("Stream ended without finish_reason");
}

// PROPOSED
if (!hasFinishReason) {
  const promptEst = Math.ceil(JSON.stringify(context.messages ?? []).length / 4);
  const ctx = model.contextWindow ?? 0;
  const max = params.max_tokens ?? params.max_completion_tokens ?? 0;
  throw new Error(
    `Stream ended without finish_reason ` +
    `(model=${model.id}, ~prompt_tokens=${promptEst}, ` +
    `max_tokens=${max}, contextWindow=${ctx}). ` +
    `Common cause: server rejected the request — check that ` +
    `prompt_tokens + max_tokens <= server's max_model_len.`
  );
}
This alone would have made today's investigation a one-look diagnosis.
Mirrored in
The same unclamped pattern exists in:
- packages/ai/src/providers/openai-responses.ts (uses max_output_tokens)
- packages/ai/src/providers/azure-openai-responses.ts (uses max_output_tokens)
openai-codex-responses.ts is structured differently (Codex Responses transport, response.completed / response.incomplete event-driven) and doesn't have the same failure mode at the same location.
Workaround for affected users
Lower maxTokens per model in ~/.pi/agent/models.json to a realistic agent output budget (e.g. 8192 instead of contextWindow / 2). For a 128K-context model, this raises the usable prompt cap from 50% → ~94%.
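The percentages follow directly from the reserved output budget. A minimal sketch of the calculation, where usablePromptFraction is a hypothetical helper name and the server is assumed to reserve exactly maxTokens of the window:

```typescript
// Hypothetical helper: fraction of the context window left for the prompt
// once maxTokens is reserved for output on every request.
function usablePromptFraction(contextWindow: number, maxTokens: number): number {
  return (contextWindow - maxTokens) / contextWindow;
}

console.log(usablePromptFraction(131072, 65536)); // 0.5    (maxTokens = contextWindow / 2)
console.log(usablePromptFraction(131072, 8192));  // 0.9375 (maxTokens = 8192)
```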
Environment
- @earendil-works/pi-coding-agent 0.75.1
- @earendil-works/pi-ai 0.75.1
- Provider: self-hosted vLLM via openai-completions API, max_model_len=131072
- Model: Huihui-Qwen3.6-27B-abliterated-NVFP4-MTP-vllm