fix(api-server): emit tool.completed lifecycle SSE for chat completions (#16588)#16666
fix(api-server): emit tool.completed lifecycle SSE for chat completions (#16588)#16666briandevans wants to merge 2 commits into
Conversation
…ns (NousResearch#16588) `/v1/chat/completions` streaming previously emitted only a ``tool.started``-style ``hermes.tool.progress`` SSE event for tool calls. Clients rendering tool lifecycle UI (Open WebUI, LobeChat, custom dashboards) had no matching ``status: completed`` event with a stable ``toolCallId``, so tool cards stayed in a "running" state until the full assistant response ended or the client guessed completion locally. Pass ``tool_start_callback`` / ``tool_complete_callback`` (the same pair already used by the ``/v1/responses`` path for exact call-id correlation) into ``_run_agent`` from chat completions, and emit each half on the existing ``event: hermes.tool.progress`` SSE line with ``toolCallId`` and ``status``. Internal tools (names starting with ``_``) and orphan ``completed`` events without a matching ``running`` are filtered out so clients never see uncorrelatable lifecycle updates. The pre-existing emoji/label ``__tool_progress__`` event continues unchanged for backward compatibility — frontends that already render the running marker keep working, and frontends that need lifecycle pairs now get them. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Pull request overview
This PR fixes /v1/chat/completions streaming so SSE consumers can reliably track tool lifecycles by emitting a matching status: completed event correlated via a stable toolCallId, aligning chat-completions behavior with the existing structured callbacks used elsewhere in the API server.
Changes:
- Wire
tool_start_callback/tool_complete_callbackinto the chat-completions streaming agent invocation. - Extend the SSE writer to emit tagged tool lifecycle payloads on
event: hermes.tool.progress. - Add regression tests asserting lifecycle emission and filtering of internal/orphan lifecycle events.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
gateway/platforms/api_server.py |
Adds lifecycle callbacks for chat-completions streaming and teaches SSE writer to emit lifecycle payloads on hermes.tool.progress. |
tests/gateway/test_api_server.py |
Adds regression tests for lifecycle running/completed emission, and for filtering internal tools/orphan completes. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| "toolCallId": tool_call_id, | ||
| "tool": function_name, | ||
| "status": "completed", | ||
| })) |
There was a problem hiding this comment.
/v1/chat/completions streaming will now emit two event: hermes.tool.progress messages for every real tool start: one from tool_progress_callback ("tool_progress") and another from tool_start_callback ("tool_lifecycle"). This happens because run_agent.py fires both callbacks for each tool call. This duplication is likely to confuse SSE consumers (and it doesn’t match the PR description’s “payload now carries toolCallId + status alongside the legacy emoji/label event”). Consider emitting a single “running” payload that includes the legacy emoji/label fields plus toolCallId/status, or suppressing the legacy __tool_progress__ emission when lifecycle callbacks are enabled.
| statuses = [] | ||
| seen_call_ids = set() | ||
| lines = body.splitlines() | ||
| for i, line in enumerate(lines): | ||
| if line.strip() != "event: hermes.tool.progress": | ||
| continue | ||
| # Find the next non-blank ``data:`` line for this event. | ||
| for follow in lines[i + 1: i + 4]: | ||
| if follow.startswith("data: "): | ||
| try: | ||
| payload = _json.loads(follow[len("data: "):]) | ||
| except _json.JSONDecodeError: | ||
| break | ||
| status = payload.get("status") | ||
| if status: | ||
| statuses.append(status) | ||
| if "toolCallId" in payload: | ||
| seen_call_ids.add(payload["toolCallId"]) | ||
| break | ||
|
|
||
| # Both halves of the lifecycle pair must be emitted, in order, | ||
| # and tied to the same toolCallId — that's the contract the | ||
| # issue asks for. | ||
| assert "running" in statuses, f"missing running status; got {statuses}" | ||
| assert "completed" in statuses, f"missing completed status; got {statuses}" | ||
| assert statuses.index("running") < statuses.index("completed") | ||
| assert seen_call_ids == {"call_terminal_1"}, seen_call_ids |
There was a problem hiding this comment.
This test intends to assert that both lifecycle halves are correlated by the same toolCallId, but it only checks seen_call_ids == {"call_terminal_1"} overall. That would still pass if (for example) the running payload omitted toolCallId while only completed included it. Consider collecting (status, toolCallId) pairs and asserting both running and completed events each carry toolCallId == "call_terminal_1" (and in the right order).
Address Copilot review on PR NousResearch#16666: 1. **Duplicate event on every tool start** — both ``tool_progress_callback`` and ``tool_start_callback`` fire side-by-side in ``run_agent.py``, so wiring both into chat completions emitted *two* ``hermes.tool.progress`` events per real tool call. Drop the legacy ``_on_tool_progress`` emit entirely; ``_on_tool_start`` now produces a single unified event that carries the legacy ``tool``/``emoji``/``label`` fields plus the new ``toolCallId``/``status`` correlation fields. Label is computed inline via ``build_tool_preview`` so callers do not need to pre-format it. 2. **Weak per-event correlation in the regression test** — the previous assertion checked that a ``toolCallId`` appeared *somewhere* in the aggregate, which would have passed even if ``running`` lacked the id. Collect ``(status, toolCallId)`` per event and assert each event carries the correct pair, plus exactly two events on the wire (no silent duplication regression). The two existing chat-completions tool-progress tests are updated to fire ``tool_start_callback`` instead of ``tool_progress_callback``, matching production reality where ``run_agent`` always pairs them. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
@copilot All findings addressed in commit 3dd69f9:
|
|
CI audit — all 16 `test` job failures are pre-existing baselines on clean `origin/main` (6d24880). Zero failures are in touched code.
Touched code (`tests/gateway/test_api_server.py::TestChatCompletionsEndpoint`) — all 17 tests pass locally, including the 4 progress-event tests both before and after the fix-up commit. The `test` job is failing on baseline; my regression test `test_stream_emits_tool_lifecycle_with_call_id` passes. The `nix (ubuntu-latest)` failure tracks the same failed pytest run. |
Address Copilot review on PR #16666: 1. **Duplicate event on every tool start** — both ``tool_progress_callback`` and ``tool_start_callback`` fire side-by-side in ``run_agent.py``, so wiring both into chat completions emitted *two* ``hermes.tool.progress`` events per real tool call. Drop the legacy ``_on_tool_progress`` emit entirely; ``_on_tool_start`` now produces a single unified event that carries the legacy ``tool``/``emoji``/``label`` fields plus the new ``toolCallId``/``status`` correlation fields. Label is computed inline via ``build_tool_preview`` so callers do not need to pre-format it. 2. **Weak per-event correlation in the regression test** — the previous assertion checked that a ``toolCallId`` appeared *somewhere* in the aggregate, which would have passed even if ``running`` lacked the id. Collect ``(status, toolCallId)`` per event and assert each event carries the correct pair, plus exactly two events on the wire (no silent duplication regression). The two existing chat-completions tool-progress tests are updated to fire ``tool_start_callback`` instead of ``tool_progress_callback``, matching production reality where ``run_agent`` always pairs them. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Address Copilot review on PR NousResearch#16666: 1. **Duplicate event on every tool start** — both ``tool_progress_callback`` and ``tool_start_callback`` fire side-by-side in ``run_agent.py``, so wiring both into chat completions emitted *two* ``hermes.tool.progress`` events per real tool call. Drop the legacy ``_on_tool_progress`` emit entirely; ``_on_tool_start`` now produces a single unified event that carries the legacy ``tool``/``emoji``/``label`` fields plus the new ``toolCallId``/``status`` correlation fields. Label is computed inline via ``build_tool_preview`` so callers do not need to pre-format it. 2. **Weak per-event correlation in the regression test** — the previous assertion checked that a ``toolCallId`` appeared *somewhere* in the aggregate, which would have passed even if ``running`` lacked the id. Collect ``(status, toolCallId)`` per event and assert each event carries the correct pair, plus exactly two events on the wire (no silent duplication regression). The two existing chat-completions tool-progress tests are updated to fire ``tool_start_callback`` instead of ``tool_progress_callback``, matching production reality where ``run_agent`` always pairs them. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Address Copilot review on PR NousResearch#16666: 1. **Duplicate event on every tool start** — both ``tool_progress_callback`` and ``tool_start_callback`` fire side-by-side in ``run_agent.py``, so wiring both into chat completions emitted *two* ``hermes.tool.progress`` events per real tool call. Drop the legacy ``_on_tool_progress`` emit entirely; ``_on_tool_start`` now produces a single unified event that carries the legacy ``tool``/``emoji``/``label`` fields plus the new ``toolCallId``/``status`` correlation fields. Label is computed inline via ``build_tool_preview`` so callers do not need to pre-format it. 2. **Weak per-event correlation in the regression test** — the previous assertion checked that a ``toolCallId`` appeared *somewhere* in the aggregate, which would have passed even if ``running`` lacked the id. Collect ``(status, toolCallId)`` per event and assert each event carries the correct pair, plus exactly two events on the wire (no silent duplication regression). The two existing chat-completions tool-progress tests are updated to fire ``tool_start_callback`` instead of ``tool_progress_callback``, matching production reality where ``run_agent`` always pairs them. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Address Copilot review on PR NousResearch#16666: 1. **Duplicate event on every tool start** — both ``tool_progress_callback`` and ``tool_start_callback`` fire side-by-side in ``run_agent.py``, so wiring both into chat completions emitted *two* ``hermes.tool.progress`` events per real tool call. Drop the legacy ``_on_tool_progress`` emit entirely; ``_on_tool_start`` now produces a single unified event that carries the legacy ``tool``/``emoji``/``label`` fields plus the new ``toolCallId``/``status`` correlation fields. Label is computed inline via ``build_tool_preview`` so callers do not need to pre-format it. 2. **Weak per-event correlation in the regression test** — the previous assertion checked that a ``toolCallId`` appeared *somewhere* in the aggregate, which would have passed even if ``running`` lacked the id. Collect ``(status, toolCallId)`` per event and assert each event carries the correct pair, plus exactly two events on the wire (no silent duplication regression). The two existing chat-completions tool-progress tests are updated to fire ``tool_start_callback`` instead of ``tool_progress_callback``, matching production reality where ``run_agent`` always pairs them. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Address Copilot review on PR NousResearch#16666: 1. **Duplicate event on every tool start** — both ``tool_progress_callback`` and ``tool_start_callback`` fire side-by-side in ``run_agent.py``, so wiring both into chat completions emitted *two* ``hermes.tool.progress`` events per real tool call. Drop the legacy ``_on_tool_progress`` emit entirely; ``_on_tool_start`` now produces a single unified event that carries the legacy ``tool``/``emoji``/``label`` fields plus the new ``toolCallId``/``status`` correlation fields. Label is computed inline via ``build_tool_preview`` so callers do not need to pre-format it. 2. **Weak per-event correlation in the regression test** — the previous assertion checked that a ``toolCallId`` appeared *somewhere* in the aggregate, which would have passed even if ``running`` lacked the id. Collect ``(status, toolCallId)`` per event and assert each event carries the correct pair, plus exactly two events on the wire (no silent duplication regression). The two existing chat-completions tool-progress tests are updated to fire ``tool_start_callback`` instead of ``tool_progress_callback``, matching production reality where ``run_agent`` always pairs them. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Address Copilot review on PR NousResearch#16666: 1. **Duplicate event on every tool start** — both ``tool_progress_callback`` and ``tool_start_callback`` fire side-by-side in ``run_agent.py``, so wiring both into chat completions emitted *two* ``hermes.tool.progress`` events per real tool call. Drop the legacy ``_on_tool_progress`` emit entirely; ``_on_tool_start`` now produces a single unified event that carries the legacy ``tool``/``emoji``/``label`` fields plus the new ``toolCallId``/``status`` correlation fields. Label is computed inline via ``build_tool_preview`` so callers do not need to pre-format it. 2. **Weak per-event correlation in the regression test** — the previous assertion checked that a ``toolCallId`` appeared *somewhere* in the aggregate, which would have passed even if ``running`` lacked the id. Collect ``(status, toolCallId)`` per event and assert each event carries the correct pair, plus exactly two events on the wire (no silent duplication regression). The two existing chat-completions tool-progress tests are updated to fire ``tool_start_callback`` instead of ``tool_progress_callback``, matching production reality where ``run_agent`` always pairs them. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Address Copilot review on PR NousResearch#16666: 1. **Duplicate event on every tool start** — both ``tool_progress_callback`` and ``tool_start_callback`` fire side-by-side in ``run_agent.py``, so wiring both into chat completions emitted *two* ``hermes.tool.progress`` events per real tool call. Drop the legacy ``_on_tool_progress`` emit entirely; ``_on_tool_start`` now produces a single unified event that carries the legacy ``tool``/``emoji``/``label`` fields plus the new ``toolCallId``/``status`` correlation fields. Label is computed inline via ``build_tool_preview`` so callers do not need to pre-format it. 2. **Weak per-event correlation in the regression test** — the previous assertion checked that a ``toolCallId`` appeared *somewhere* in the aggregate, which would have passed even if ``running`` lacked the id. Collect ``(status, toolCallId)`` per event and assert each event carries the correct pair, plus exactly two events on the wire (no silent duplication regression). The two existing chat-completions tool-progress tests are updated to fire ``tool_start_callback`` instead of ``tool_progress_callback``, matching production reality where ``run_agent`` always pairs them. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
tool_start_callback/tool_complete_callbackto the chat completions agent invocation so SSE consumers receive a matchingstatus: completedevent tied to the sametoolCallIdas the running event.event: hermes.tool.progressline; payload now carriestoolCallId+status(running/completed) alongside the legacy emoji/label event._) and drop orphancompletedcallbacks that have no priorrunningso clients never see uncorrelatable lifecycle updates.tests/gateway/test_api_server.py: one asserts both halves of the lifecycle pair land on the wire with matchingtoolCallId; the other proves filtering handles internal tools and orphan completes.The bug
/v1/chat/completionsstreaming emitsevent: hermes.tool.progressfor tool start, but never a matchingcompletedevent with the exacttoolCallId. Frontends rendering tool cards stay stuck in a "running" state until the full assistant response ends or they guess completion locally. The/v1/responsespath already does this correctly viatool_start_callback/tool_complete_callback(api_server.py:1797-1829) — only chat completions was missing it.The fix
Define
_on_tool_startand_on_tool_completein the chat completions handler that push tagged("__tool_lifecycle__", payload)items to the existing stream queue. The SSE writer's_emitis extended to recognize the new tag on the sameevent: hermes.tool.progressSSE line, so no new event type is introduced — clients consuming the legacy emoji/label event keep working unchanged. A small_started_tool_call_idsset tracks running calls so acompletedcallback for a filtered-out internal tool, or one without a prior matching start, is silently dropped.Test plan
test_stream_emits_tool_lifecycle_with_call_id: parseshermes.tool.progresspayloads from the SSE body and assertsrunning+completedarrive in order with the sametoolCallId.test_stream_tool_lifecycle_skips_internal_and_orphan_completes: internal tool starts and orphan completes (no matching start) produce no lifecycle events on the wire.tests/gateway/test_api_server.py,tests/gateway/test_api_server_runs.py,tests/gateway/test_api_server_multimodal.py— 159/159 pass.gateway/platforms/api_server.pyreverted toorigin/main, the new lifecycle test fails withmissing running status; got []; with the fix applied it passes.Related
/v1/responsespath (gateway/platforms/api_server.py:1797-1829).__tool_progress__event introduced in [Bug]: Tool progress markers in SSE content corrupt model behavior over time (Open WebUI / OpenAI-compatible API) #6972.