openai-completions: unclamped max_tokens silently wedges sessions on servers enforcing max_model_len (vLLM)

## Summary

`buildParams()` in `packages/ai/src/providers/openai-completions.ts` passes `options.maxTokens` straight into the request without checking that it still fits in `model.contextWindow` once the prompt is accounted for. When the configured `maxTokens` is half the context window (a common default for self-hosted vLLM models), the agent hits a hard ceiling at exactly `contextWindow - maxTokens` input tokens, and every subsequent request is rejected. The rejection surfaces only as the generic `Stream ended without finish_reason`, with no indication of what actually broke.

The result is **catastrophic and unrecoverable**: every retry sends the same oversized request, the context can't shrink itself, and the session is permanently wedged.

## Reproduction

- Self-hosted vLLM with `--max-model-len 131072` (or similar)
- Model entry in `~/.pi/agent/models.json`:
  ```json
  {
    "id": "Huihui-Qwen3.6-27B-abliterated-NVFP4-MTP-vllm",
    "contextWindow": 131072,
    "maxTokens": 65536
  }
  ```
- Long-running session (lots of tool output) that grows the prompt past ~65k tokens.

vLLM enforces `prompt_tokens + max_tokens <= max_model_len`. With `max_tokens=65536` baked into every request, the break-even input is `131072 - 65536 = 65536` tokens. The moment one tool result crosses that line, vLLM rejects the request, sends an error-shaped chunk that the OpenAI SDK stream parser silently drops, the stream closes, and pi-ai throws `Stream ended without finish_reason`. Three retries — same request, same rejection — session dead.
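The arithmetic above can be checked with a small sketch. The helper names `maxAcceptedPromptTokens` and `isAccepted` are hypothetical and exist only for illustration; they are not part of pi-ai or vLLM, and they model only the inequality stated above.

```ts
// Hypothetical helpers (not part of pi-ai or vLLM) that model the server-side
// rule described above: prompt_tokens + max_tokens <= max_model_len.

// Largest prompt the server accepts for a fixed max_tokens.
function maxAcceptedPromptTokens(maxModelLen: number, maxTokens: number): number {
    return Math.max(0, maxModelLen - maxTokens);
}

// True when a request with this prompt size passes the length check.
function isAccepted(promptTokens: number, maxTokens: number, maxModelLen: number): boolean {
    return promptTokens + maxTokens <= maxModelLen;
}

const MAX_MODEL_LEN = 131072; // from --max-model-len in the reproduction
const MAX_TOKENS = 65536;     // from models.json in the reproduction

console.log(maxAcceptedPromptTokens(MAX_MODEL_LEN, MAX_TOKENS));
console.log(isAccepted(65478, MAX_TOKENS, MAX_MODEL_LEN)); // a prompt just under the cap
console.log(isAccepted(65537, MAX_TOKENS, MAX_MODEL_LEN)); // one token past the cap
```

With these numbers the cap is 65,536 prompt tokens: a 65,478-token prompt passes the check and a 65,537-token prompt does not.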

## Evidence

Three independent pi sessions, same model, same wall:

| Session | Last successful `prompt_tokens` | Distance to 65,536 cap |
|---|---|---|
| session A | 65,073 | −463 |
| session B | 65,478 | −58 |
| session C | 65,030 | −506 |

Every one fails on the very next turn (after one tool result pushes the prompt past 65,536) with `Stream ended without finish_reason` × 4 (initial + 3 retries) → `Retry failed after 3 attempts`. Users report the symptom as "happens immediately at 50% context", which is exactly `contextWindow / 2` when `maxTokens == contextWindow / 2`.

## Proposed fix

Two small changes in `packages/ai/src/providers/openai-completions.ts`:

### 1. Clamp `max_tokens` in `buildParams()`

```ts
// CURRENT
if (options?.maxTokens) {
    if (compat.maxTokensField === "max_tokens") {
        params.max_tokens = options.maxTokens;
    } else {
        params.max_completion_tokens = options.maxTokens;
    }
}

// PROPOSED
if (options?.maxTokens) {
    let effective = options.maxTokens;
    if (typeof model.contextWindow === "number" && model.contextWindow > 0) {
        const promptChars = JSON.stringify(messages).length
            + (params.tools ? JSON.stringify(params.tools).length : 0);
        const promptEst = Math.ceil(promptChars / 4); // matches faux.js heuristic
        const SAFETY = 256;
        const headroom = model.contextWindow - promptEst - SAFETY;
        effective = Math.max(256, Math.min(effective, headroom));
    }
    if (compat.maxTokensField === "max_tokens") {
        params.max_tokens = effective;
    } else {
        params.max_completion_tokens = effective;
    }
}
```

As input grows, the per-request output budget shrinks instead of vLLM rejecting the whole request.
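To make the shrinking budget concrete, here is a minimal sketch that isolates the clamp arithmetic from the proposal as a pure function. The name `clampMaxTokens` and its signature are assumptions made for illustration; the proposal itself inlines this logic in `buildParams()`.

```ts
// Illustrative pure function; name and signature are hypothetical.
// It mirrors the clamp arithmetic from the proposed patch above.
function clampMaxTokens(
    requested: number,     // configured maxTokens
    contextWindow: number, // model.contextWindow
    promptEst: number,     // estimated prompt tokens (chars / 4 in the proposal)
    safety = 256,          // margin for estimation error
    floor = 256,           // never request fewer output tokens than this
): number {
    const headroom = contextWindow - promptEst - safety;
    return Math.max(floor, Math.min(requested, headroom));
}

const ctx = 131072;
const requested = 65536;
for (const promptEst of [65073, 100000, 131000]) {
    console.log(promptEst, clampMaxTokens(requested, ctx, promptEst));
}
```

For these three prompt sizes the function returns 65536, 30816, and 256. The last case shows a limit of the approach: once the prompt estimate is within `safety + floor` tokens of the window, the floor keeps `max_tokens` at 256 and the server may still reject the request. The chars/4 estimate is also only a heuristic, so the real tokenizer count can differ from `promptEst`.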

### 2. More diagnostic error when the stream closes without `finish_reason`

```ts
// CURRENT
if (!hasFinishReason) {
    throw new Error("Stream ended without finish_reason");
}

// PROPOSED
if (!hasFinishReason) {
    const promptEst = Math.ceil(JSON.stringify(context.messages ?? []).length / 4);
    const ctx = model.contextWindow ?? 0;
    const max = params.max_tokens ?? params.max_completion_tokens ?? 0;
    throw new Error(
        `Stream ended without finish_reason ` +
        `(model=${model.id}, ~prompt_tokens=${promptEst}, ` +
        `max_tokens=${max}, contextWindow=${ctx}). ` +
        `Common cause: server rejected the request — check that ` +
        `prompt_tokens + max_tokens <= server's max_model_len.`
    );
}
```

This alone would have made today's investigation a one-look diagnosis.

### Mirrored in

The same unclamped pattern exists in:
- `packages/ai/src/providers/openai-responses.ts` (uses `max_output_tokens`)
- `packages/ai/src/providers/azure-openai-responses.ts` (uses `max_output_tokens`)

`openai-codex-responses.ts` is structured differently (Codex Responses transport, `response.completed` / `response.incomplete` event-driven) and doesn't have the same failure mode at the same location.

## Workaround for affected users

Lower `maxTokens` per model in `~/.pi/agent/models.json` to a realistic agent output budget (e.g. `8192` instead of `contextWindow / 2`). For a 128K-context model, this raises the usable prompt cap from 50% to ~94% of the context window.
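The percentages follow directly from the server's length check. A short sketch, using a hypothetical helper `usablePromptFraction` that is not part of pi-ai:

```ts
// Hypothetical helper: share of the context window left for the prompt when
// the server enforces prompt_tokens + max_tokens <= max_model_len and
// max_model_len equals contextWindow.
function usablePromptFraction(contextWindow: number, maxTokens: number): number {
    return (contextWindow - maxTokens) / contextWindow;
}

console.log(usablePromptFraction(131072, 65536)); // 0.5
console.log(usablePromptFraction(131072, 8192));  // 0.9375
```

With `maxTokens` at 8192, prompts of up to 122,880 tokens pass the check, compared with 65,536 under the original setting.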

## Environment

- `@earendil-works/pi-coding-agent` `0.75.1`
- `@earendil-works/pi-ai` `0.75.1`
- Provider: self-hosted vLLM via `openai-completions` API, `max_model_len=131072`
- Model: `Huihui-Qwen3.6-27B-abliterated-NVFP4-MTP-vllm`
