
fix(streaming): preserve all tool_calls in OpenAI batch responses #1686

Merged: Hmbown merged 1 commit into Hmbown:main from h3c-hexin:fix/openai-streaming-batch-tool-calls on May 15, 2026.

@h3c-hexin

Contributor

Summary

Fixes a bug where, when streaming from any OpenAI-compatible backend (vLLM / Ollama / LM Studio / Together AI / etc.), an assistant message containing multiple tool_calls emits ToolCallStarted only for the last tool call; the earlier calls never announce their start.

Symptom: when a model emits N batch tool_calls in one round, listeners see 1 × chat:tool_start and N × chat:tool_end. The TUI / runtime API / embedder bridges end up with N orphan tool_result blocks and no matching tool_use blocks in assistant history. Session persistence drops the tool calls; subsequent recall_archive / cycle restart cannot reconstruct the turn.

Reproduction (vLLM 0.7 + Qwen3.6-35B-A3B, prompt asks for 7-file Tauri scaffold):

```text
backend dispatches:   7 × write_file + 1 × exec_shell
ApprovalRequired log: 9 events ✓
listeners receive:    1 × chat:tool_start, 7 × chat:tool_end
```

Root cause: run_turn (core/engine/turn_loop.rs:354) tracked the streaming tool block with a single current_tool_index: Option<usize>. The Anthropic-style adapter emits Start/Stop in lockstep, so the slot is never overwritten while occupied. But the OpenAI streaming parser (client/chat.rs:1954-2064) emits every ContentBlockStart::ToolUse as soon as its delta lands, then batches every ContentBlockStop at finish_reason. Each new Start overwrites the slot, so the first Stop's take() returns the last tool's index (the wrong tool) and every later take() returns None.

Fix: replace the single slot with HashMap<u32 block_index, usize tool_uses_idx>. Start inserts, InputJsonDelta looks up by the outer ContentBlockDelta index, Stop removes by its own index. Routing no longer depends on the equally-overwritten current_block_kind; a successful remove(&index) already proves the Stop belongs to a tool block.

Five inline edits in turn_loop.rs; no public API changes.

Testing

  • cargo test --all-features: all related tests pass. Three pre-existing flaky tools::web_run::tests::* failures are unrelated; they reproduce on main under high concurrency and pass when run individually.
  • cargo fmt --all -- --check
  • cargo clippy --all-targets --all-features — zero warnings
  • Added regression test batch_tool_calls_preserve_all_tool_use_indices in core/engine/turn_loop.rs::tests
  • Manual end-to-end verification: vLLM + Qwen3.6-35B + 7-file Tauri scaffold prompt now shows all 7 write_file tool_use blocks paired with their tool_result blocks in the assistant history

Checklist

  • Updated docs or comments as needed (inline comments at all 5 change sites explain the bug and the fix)
  • Added or updated tests where relevant
  • Verified TUI behavior manually

When an OpenAI-compatible backend (vLLM, Ollama, LM Studio, Together AI,
self-hosted vLLM/SGLang, etc.) streams an assistant message containing
multiple tool_calls in a single round, only the **last** tool's
`Event::ToolCallStarted` was firing. The preceding N-1 tool calls
executed and produced tool_result events, but never announced their
start to consumers (TUI / runtime API / embedder bridges), leaving them
with N orphan tool_result blocks and no matching tool_use blocks in the
assistant history.

## Reproduction

```text
backend dispatches:   7 × write_file + 1 × exec_shell
log shows:            7 × ApprovalRequired events ✓
listeners receive:    1 × chat:tool_start, 7 × chat:tool_end
session history:      1 tool_use + 7 tool_result (6 orphans)
```

Tested against vLLM 0.7 + Qwen3.6-35B-A3B with a "scaffold 7-file Tauri
template" prompt. Any model/backend combination that emits batched
tool_calls triggers the bug, which is typical when a single LLM round asks
for multiple parallel file writes or edits.
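To make the "orphan" symptom concrete, here is a minimal Rust sketch that scans an assistant history for tool_result blocks whose `tool_use_id` has no matching tool_use. The `HistoryBlock` enum and the `call_N` ids are hypothetical simplifications, not the crate's real history types.

```rust
use std::collections::HashSet;

/// Hypothetical, simplified assistant-history block (not the crate's real type).
#[derive(Debug, Clone)]
pub enum HistoryBlock {
    ToolUse { id: String },
    ToolResult { tool_use_id: String },
}

/// Returns the `tool_use_id` of every ToolResult that has no ToolUse with
/// the same id anywhere in `history`.
pub fn orphan_tool_results(history: &[HistoryBlock]) -> Vec<String> {
    // Collect the ids of every announced tool call.
    let announced: HashSet<&str> = history
        .iter()
        .filter_map(|block| match block {
            HistoryBlock::ToolUse { id } => Some(id.as_str()),
            HistoryBlock::ToolResult { .. } => None,
        })
        .collect();

    // Keep only results whose id was never announced.
    history
        .iter()
        .filter_map(|block| match block {
            HistoryBlock::ToolResult { tool_use_id }
                if !announced.contains(tool_use_id.as_str()) =>
            {
                Some(tool_use_id.clone())
            }
            _ => None,
        })
        .collect()
}
```

Fed a history shaped like the reproduction above (one surviving tool_use plus seven tool_results), this check reports six orphans.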

## Root cause

`run_turn` tracked the currently-streaming tool block with a single
`current_tool_index: Option<usize>`. The Anthropic-style adapter
(non-streaming response → events at `chat.rs::L1807`) emits
Start/Stop pairs in lockstep so the slot never overlaps. But the
OpenAI streaming parser (`chat.rs::L1954-2064`) emits every
`ContentBlockStart::ToolUse` as soon as a tool_call delta lands, then
batches every `ContentBlockStop` at `finish_reason`:

```text
Start { index: 0 }       // tool #1
Delta { index: 0, .. }
Start { index: 1 }       // tool #2 — overwrites current_tool_index
Delta { index: 1, .. }
…
Start { index: 6 }       // current_tool_index = Some(6)
Delta { index: 6, .. }
Stop  { index: 0 }       // take() returns Some(6)  ← wrong tool!
Stop  { index: 1 }       // take() returns None
Stop  { index: 2 }       // take() returns None
…
```

The first `Stop` consumes the last index and emits `ToolCallStarted`
for the wrong `tool_uses` entry; every subsequent `Stop` finds the
slot already `None` and skips the entire `if let Some(index) = …`
branch, dropping the announcement.
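The single-slot failure can be reproduced in isolation. This is a simplified Rust model, not the real `run_turn` code: it assumes a hypothetical `StreamEvent` enum with only Start/Stop (deltas are omitted because they do not change which index a Stop resolves to) and a counter standing in for `tool_uses.len()`.

```rust
/// Hypothetical, pared-down stream event (deltas omitted).
#[derive(Debug, Clone, Copy)]
pub enum StreamEvent {
    Start { index: u32 },
    Stop { index: u32 },
}

/// Models the old single-slot tracking: returns, per Stop, the tool_uses
/// position that Stop resolved to (`None` = ToolCallStarted never fires).
pub fn route_single_slot(events: &[StreamEvent]) -> Vec<Option<usize>> {
    let mut tool_uses_len = 0usize; // stand-in for tool_uses.len()
    let mut current_tool_index: Option<usize> = None;
    let mut resolved = Vec::new();
    for event in events {
        match *event {
            StreamEvent::Start { .. } => {
                // Each Start overwrites whatever the slot held before.
                current_tool_index = Some(tool_uses_len);
                tool_uses_len += 1;
            }
            StreamEvent::Stop { .. } => {
                // take() hands out the most recent Start once, then None.
                resolved.push(current_tool_index.take());
            }
        }
    }
    resolved
}

/// OpenAI-style batch ordering: every Start first, every Stop at finish_reason.
pub fn openai_batch(n: u32) -> Vec<StreamEvent> {
    (0..n)
        .map(|index| StreamEvent::Start { index })
        .chain((0..n).map(|index| StreamEvent::Stop { index }))
        .collect()
}
```

With `openai_batch(7)` the first Stop resolves to `Some(6)` and the remaining six to `None`, matching the trace; a lockstep Start/Stop sequence resolves every Stop to its own tool.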

## Fix

Replace the single slot with `HashMap<u32 block_index, usize
tool_uses_idx>`:

- `ContentBlockStart::ToolUse` and `::ServerToolUse` insert the
  `(event.index → tool_uses.len())` mapping.
- `InputJsonDelta` looks up by the `ContentBlockDelta` outer index.
- `ContentBlockStop` removes by the stop's index, so each Stop routes
  to its own `tool_uses` entry regardless of arrival order.

Routing no longer depends on `current_block_kind` (which has the same
single-slot overwrite problem); `current_tool_indices.remove(&index)`
returning `Some(_)` already proves the Stop belongs to a tool block.
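A minimal sketch of the index-keyed routing, assuming hypothetical `StreamEvent` and `ToolUse` types (the real event and tool types in `turn_loop.rs` differ). The map stores `block_index → position in tool_uses`: Start inserts, InputJsonDelta looks up, Stop removes.

```rust
use std::collections::HashMap;

/// Hypothetical, pared-down stream events (not the real types in turn_loop.rs).
#[derive(Debug, Clone)]
pub enum StreamEvent {
    ToolStart { index: u32, name: String },
    InputJsonDelta { index: u32, partial_json: String },
    Stop { index: u32 },
}

/// Hypothetical accumulated tool call.
#[derive(Debug, Default, Clone, PartialEq)]
pub struct ToolUse {
    pub name: String,
    pub input_json: String,
}

/// Routes every event by its block index. Returns the accumulated tool_uses
/// and, in Stop order, the tool_uses position each Stop announced.
pub fn route_by_index(events: &[StreamEvent]) -> (Vec<ToolUse>, Vec<usize>) {
    let mut tool_uses: Vec<ToolUse> = Vec::new();
    // block_index -> position in tool_uses
    let mut current_tool_indices: HashMap<u32, usize> = HashMap::new();
    let mut announced = Vec::new();
    for event in events {
        match event {
            StreamEvent::ToolStart { index, name } => {
                // Record where this block's entry will live, then push it.
                current_tool_indices.insert(*index, tool_uses.len());
                tool_uses.push(ToolUse { name: name.clone(), ..Default::default() });
            }
            StreamEvent::InputJsonDelta { index, partial_json } => {
                // Append argument JSON to the tool owning this block index.
                if let Some(&pos) = current_tool_indices.get(index) {
                    tool_uses[pos].input_json.push_str(partial_json);
                }
            }
            StreamEvent::Stop { index } => {
                // A successful remove proves this Stop closes a tool block.
                if let Some(pos) = current_tool_indices.remove(index) {
                    announced.push(pos);
                }
            }
        }
    }
    (tool_uses, announced)
}
```

Because each Stop removes by its own index, a batch whose Stops arrive in any order still announces every tool exactly once, and a Stop for a block that is not in the map is simply ignored.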

## Tests

Added `batch_tool_calls_preserve_all_tool_use_indices` in
`core/engine/turn_loop.rs::tests` — feeds 7 Starts and 7 Stops through
the same `HashMap` API used by `run_turn`, asserts every index round-trips.
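The round-trip property can be expressed as a small helper plus a `#[cfg(test)]` module. This is an illustrative sketch only: `round_trip` is a hypothetical helper, not the test body that landed in the PR, and it assumes the tool blocks are the only blocks in the message so block `i` maps to `tool_uses[i]`.

```rust
use std::collections::HashMap;

/// Hypothetical helper: registers `n` tool blocks in a
/// block_index -> tool_uses_idx map, then resolves each Stop in `stop_order`.
pub fn round_trip(n: u32, stop_order: &[u32]) -> Vec<(u32, Option<usize>)> {
    let mut current_tool_indices: HashMap<u32, usize> = HashMap::new();
    for index in 0..n {
        // Assumption: only tool blocks, so block i maps to tool_uses[i].
        current_tool_indices.insert(index, index as usize);
    }
    stop_order
        .iter()
        .map(|&index| (index, current_tool_indices.remove(&index)))
        .collect()
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn batch_stops_each_resolve_to_their_own_tool() {
        let stops: Vec<u32> = (0..7).collect();
        for (index, resolved) in round_trip(7, &stops) {
            assert_eq!(resolved, Some(index as usize));
        }
    }

    #[test]
    fn duplicate_stop_resolves_to_none() {
        assert_eq!(round_trip(2, &[1, 1]), vec![(1, Some(1)), (1, None)]);
    }
}
```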

Manual end-to-end verification: vLLM + Qwen3.6-35B + 7-file Tauri
template prompt → frontend `messages` history now contains all 7
`write_file` tool_use blocks paired with their tool_result blocks.

@gemini-code-assist (bot) left a comment


Code Review

This pull request addresses a bug in handling OpenAI-compatible streaming responses where multiple tool calls are batched. It replaces the single current_tool_index with a HashMap to track multiple concurrent tool blocks by their index, ensuring that ContentBlockStop events are correctly routed. A regression test was also added to verify the fix. Feedback points out that while the tool use logic is corrected, other block types like Text and Thinking still rely on a single current_block_kind slot, which may lead to similar issues in batched responses; it is recommended to apply a similar HashMap approach to those block types to ensure full consistency.

Comment on lines +678 to +684
```rust
// Route the Stop using event.index (via
// `current_tool_indices`) rather than the single
// `current_block_kind` slot. In an OpenAI batch
// tool-call stream every Stop after the first sees
// `stopped_kind = None` because `take()` cleared the
// slot, so the original `matches!(stopped_kind, …)`
// check would skip every tool except the last.
```


Priority: high

While this fix correctly addresses the issue for ToolUse blocks by using the current_tool_indices map, the logic for Text and Thinking blocks remains broken for batched responses.

As noted in the comment at lines 681-684, stopped_kind is derived from current_block_kind.take() (at line 662), which clears the single slot. In an OpenAI-compatible stream where multiple ContentBlockStop events are batched at the end, only the first Stop event will have a non-None stopped_kind. If a Text block (usually index 0) is not the first to stop in the batch, its completion logic (e.g., setting pending_message_complete at line 665) will be skipped because stopped_kind will be None or ToolUse (if a tool stopped first).

To fully resolve this for all block types, current_block_kind should also be replaced with a map (e.g., HashMap<u32, ContentBlockKind>) to track the kind of each active block by its index.
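The reviewer's suggestion (not implemented in this PR) could look roughly like the sketch below: a per-index `HashMap<u32, ContentBlockKind>` replaces the single `current_block_kind` slot, so each Stop recovers the kind of its own block. The `ContentBlockKind` variants and the `StreamEvent` shape are hypothetical simplifications.

```rust
use std::collections::HashMap;

/// Hypothetical block kinds named in the review comment.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ContentBlockKind {
    Text,
    Thinking,
    ToolUse,
}

/// Hypothetical, pared-down stream events.
#[derive(Debug, Clone, Copy)]
pub enum StreamEvent {
    Start { index: u32, kind: ContentBlockKind },
    Stop { index: u32 },
}

/// Resolves each Stop to the kind of the block it closes, keyed by index
/// instead of a single `current_block_kind` slot.
pub fn stopped_kinds(events: &[StreamEvent]) -> Vec<Option<ContentBlockKind>> {
    let mut block_kinds: HashMap<u32, ContentBlockKind> = HashMap::new();
    let mut stopped = Vec::new();
    for event in events {
        match *event {
            StreamEvent::Start { index, kind } => {
                block_kinds.insert(index, kind);
            }
            // remove() gives each Stop its own block's kind, whatever the order.
            StreamEvent::Stop { index } => stopped.push(block_kinds.remove(&index)),
        }
    }
    stopped
}
```

In a batch where tool blocks stop before the text block, the text block's Stop still resolves to `Text`, so completion logic such as setting `pending_message_complete` would no longer depend on stop order.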

@Hmbown merged commit a528ea9 into Hmbown:main on May 15, 2026
8 checks passed
@h3c-hexin deleted the fix/openai-streaming-batch-tool-calls branch on May 25, 2026 05:44