openai-completions: unclamped max_tokens silently wedges sessions on servers enforcing max_model_len (vLLM) #4675

@secinto

Summary

buildParams() in packages/ai/src/providers/openai-completions.ts passes options.maxTokens straight into the request without checking that the prompt plus maxTokens fits in model.contextWindow. When the configured maxTokens is half the context window (a common default for self-hosted vLLM models), the agent hits a hard ceiling at exactly contextWindow - maxTokens input tokens, and every subsequent request is rejected. The rejection surfaces only as the generic Stream ended without finish_reason error, with no indication of what actually broke.

The result is catastrophic and unrecoverable: every retry sends the same oversized request, the context can't shrink itself, and the session is permanently wedged.

Reproduction

  • Self-hosted vLLM with --max-model-len 131072 (or similar)
  • Model entry in ~/.pi/agent/models.json:
    {
      "id": "Huihui-Qwen3.6-27B-abliterated-NVFP4-MTP-vllm",
      "contextWindow": 131072,
      "maxTokens": 65536
    }
  • Long-running session (lots of tool output) that grows the prompt past ~65k tokens.

vLLM enforces prompt_tokens + max_tokens <= max_model_len. With max_tokens=65536 baked into every request, the break-even input is 131072 - 65536 = 65536 tokens. The moment one tool result crosses that line, vLLM rejects the request, sends an error-shaped chunk that the OpenAI SDK stream parser silently drops, the stream closes, and pi-ai throws Stream ended without finish_reason. Three retries — same request, same rejection — session dead.
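
The arithmetic above can be checked with a minimal sketch. The helper names below are hypothetical (they are not part of pi-ai or vLLM); they only encode the inequality prompt_tokens + max_tokens <= max_model_len as stated in this report.

```typescript
// Hypothetical helpers; names are illustrative and not part of pi-ai or vLLM.

/** Largest prompt (in tokens) the server accepts for a fixed max_tokens. */
function maxPromptTokens(maxModelLen: number, maxTokens: number): number {
    return Math.max(0, maxModelLen - maxTokens);
}

/** True when prompt_tokens + max_tokens <= max_model_len. */
function fitsServerLimit(
    promptTokens: number,
    maxTokens: number,
    maxModelLen: number,
): boolean {
    return promptTokens + maxTokens <= maxModelLen;
}

console.log(maxPromptTokens(131072, 65536));        // 65536
console.log(fitsServerLimit(65478, 65536, 131072)); // true  (session B's last good turn)
console.log(fitsServerLimit(65537, 65536, 131072)); // false (one token past the wall)
```

With max_tokens fixed at 65536, any prompt above 65536 tokens fails the check regardless of how little output the model would actually produce.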

Evidence

Three independent pi sessions, same model, same wall:

Engagement   Last successful prompt_tokens   Distance to 65,536 cap
session A    65,073                          −463
session B    65,478                          −58
session C    65,030                          −506

Every one fails on the very next turn (after one tool result pushes the prompt past 65,536) with Stream ended without finish_reason × 4 (initial + 3 retries) → Retry failed after 3 attempts. User reports the symptom as "happens immediately at 50% context" — which is exactly contextWindow / 2 when maxTokens == contextWindow / 2.

Proposed fix

Two small changes in packages/ai/src/providers/openai-completions.ts:

1. Clamp max_tokens in buildParams()

// CURRENT
if (options?.maxTokens) {
    if (compat.maxTokensField === "max_tokens") {
        params.max_tokens = options.maxTokens;
    } else {
        params.max_completion_tokens = options.maxTokens;
    }
}

// PROPOSED
if (options?.maxTokens) {
    let effective = options.maxTokens;
    if (typeof model.contextWindow === "number" && model.contextWindow > 0) {
        const promptChars = JSON.stringify(messages).length
            + (params.tools ? JSON.stringify(params.tools).length : 0);
        const promptEst = Math.ceil(promptChars / 4); // matches faux.js heuristic
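        const SAFETY = 256; // margin for error in the chars/4 estimate
        const MIN_OUTPUT = 256; // floor so a short reply is still possible
        const headroom = model.contextWindow - promptEst - SAFETY;
        // Clamp to headroom, but never raise the budget above the configured maxTokens.
        effective = Math.min(effective, Math.max(MIN_OUTPUT, headroom));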
    }
    if (compat.maxTokensField === "max_tokens") {
        params.max_tokens = effective;
    } else {
        params.max_completion_tokens = effective;
    }
}

As input grows, the per-request output budget shrinks instead of vLLM rejecting the whole request.
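
To illustrate how the output budget shrinks, here is a standalone sketch of the clamp, separated from buildParams(). The function name, the options object, and the 256-token defaults are assumptions taken from the proposal above, not existing pi-ai code, and the prompt token counts are passed in directly rather than estimated.

```typescript
// Standalone sketch of the clamp; function and option names are hypothetical.
interface ClampOptions {
    safety?: number;    // margin for estimation error, in tokens
    minOutput?: number; // floor so a short reply is still possible
}

function clampMaxTokens(
    requested: number,
    contextWindow: number,
    promptTokens: number,
    { safety = 256, minOutput = 256 }: ClampOptions = {},
): number {
    // Unknown or invalid window: leave the requested value untouched.
    if (!(contextWindow > 0)) return requested;
    const headroom = contextWindow - promptTokens - safety;
    // Never exceed the requested budget, even when it is below the floor.
    return Math.min(requested, Math.max(minOutput, headroom));
}

for (const prompt of [10000, 65000, 100000, 131000]) {
    console.log(prompt, clampMaxTokens(65536, 131072, prompt));
}
// 10000 65536
// 65000 65536
// 100000 30816
// 131000 256
```

Note the last row: once the prompt itself is within a few hundred tokens of the window, the 256-token floor still overshoots (131000 + 256 > 131072), so the clamp moves the wall much further out but does not replace context compaction.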

2. More diagnostic error when the stream closes without finish_reason

// CURRENT
if (!hasFinishReason) {
    throw new Error("Stream ended without finish_reason");
}

// PROPOSED
if (!hasFinishReason) {
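    // Same chars/4 estimate as the clamp in buildParams(), including tool schemas.
    const promptChars = JSON.stringify(context.messages ?? []).length
        + (params.tools ? JSON.stringify(params.tools).length : 0);
    const promptEst = Math.ceil(promptChars / 4);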
    const ctx = model.contextWindow ?? 0;
    const max = params.max_tokens ?? params.max_completion_tokens ?? 0;
    throw new Error(
        `Stream ended without finish_reason ` +
        `(model=${model.id}, ~prompt_tokens=${promptEst}, ` +
        `max_tokens=${max}, contextWindow=${ctx}). ` +
        `Common cause: server rejected the request — check that ` +
        `prompt_tokens + max_tokens <= server's max_model_len.`
    );
}

This alone would have made today's investigation a one-look diagnosis.
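
As a rough illustration of what such a message could look like in isolation, the sketch below builds the diagnostic string from plain values and appends a hint when the estimated request exceeds the context window. The helper and its field names are hypothetical; the extra "exceeds by" hint is an addition for illustration and is not part of the proposed change.

```typescript
// Hypothetical helper; field names mirror the proposed message but are not pi-ai API.
interface StreamFailureInfo {
    modelId: string;
    promptEstimate: number; // approximate prompt tokens
    maxTokens: number;      // value sent as max_tokens / max_completion_tokens
    contextWindow: number;  // 0 when unknown
}

function describeStreamFailure(info: StreamFailureInfo): string {
    const { modelId, promptEstimate, maxTokens, contextWindow } = info;
    const overBy = promptEstimate + maxTokens - contextWindow;
    // Only add the hint when the window is known and the estimate overshoots it.
    const hint =
        contextWindow > 0 && overBy > 0
            ? ` Estimated request exceeds contextWindow by ~${overBy} tokens.`
            : "";
    return (
        `Stream ended without finish_reason ` +
        `(model=${modelId}, ~prompt_tokens=${promptEstimate}, ` +
        `max_tokens=${maxTokens}, contextWindow=${contextWindow}).` +
        hint
    );
}

console.log(describeStreamFailure({
    modelId: "example-model",
    promptEstimate: 66000,
    maxTokens: 65536,
    contextWindow: 131072,
}));
```

Because the prompt figure is only a character-based estimate, the hint should be read as a pointer for investigation rather than proof that the server rejected the request.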

Mirrored in

The same unclamped pattern exists in:

  • packages/ai/src/providers/openai-responses.ts (uses max_output_tokens)
  • packages/ai/src/providers/azure-openai-responses.ts (uses max_output_tokens)

openai-codex-responses.ts is structured differently (it uses the Codex Responses transport, driven by response.completed / response.incomplete events) and does not have the same failure mode at the same location.

Workaround for affected users

Lower maxTokens per model in ~/.pi/agent/models.json to a realistic agent output budget (e.g. 8192 instead of contextWindow / 2). For a 128K-context model, this raises the usable prompt cap from 50% to ~94% of the context window.
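
The 50% and ~94% figures follow directly from (contextWindow - maxTokens) / contextWindow. A small sketch, using a hypothetical helper name, makes the trade-off easy to check for other values:

```typescript
// Hypothetical helper: fraction of the context window left for the prompt
// when max_tokens is reserved on every request.
function usablePromptFraction(contextWindow: number, maxTokens: number): number {
    return Math.max(0, contextWindow - maxTokens) / contextWindow;
}

console.log(usablePromptFraction(131072, 65536)); // 0.5
console.log(usablePromptFraction(131072, 8192));  // 0.9375
```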

Environment

  • @earendil-works/pi-coding-agent 0.75.1
  • @earendil-works/pi-ai 0.75.1
  • Provider: self-hosted vLLM via openai-completions API, max_model_len=131072
  • Model: Huihui-Qwen3.6-27B-abliterated-NVFP4-MTP-vllm

Labels: inprogress (issue is being worked on)
