Skip to content

fix(api-server): emit tool.completed lifecycle SSE for chat completions (#16588)#16666

Closed
briandevans wants to merge 2 commits into
NousResearch:mainfrom
briandevans:fix/api-server-tool-completion-sse-16588
Closed

fix(api-server): emit tool.completed lifecycle SSE for chat completions (#16588)#16666
briandevans wants to merge 2 commits into
NousResearch:mainfrom
briandevans:fix/api-server-tool-completion-sse-16588

Conversation

@briandevans

Copy link
Copy Markdown
Contributor

Summary

  • Add tool_start_callback / tool_complete_callback to the chat completions agent invocation so SSE consumers receive a matching status: completed event tied to the same toolCallId as the running event.
  • Emit the lifecycle pair on the existing event: hermes.tool.progress line; payload now carries toolCallId + status (running / completed) alongside the legacy emoji/label event.
  • Filter internal tools (names starting with _) and drop orphan completed callbacks that have no prior running so clients never see uncorrelatable lifecycle updates.
  • Adds two regression tests in tests/gateway/test_api_server.py: one asserts both halves of the lifecycle pair land on the wire with matching toolCallId; the other proves filtering handles internal tools and orphan completes.

The bug

/v1/chat/completions streaming emits event: hermes.tool.progress for tool start, but never a matching completed event with the exact toolCallId. Frontends rendering tool cards stay stuck in a "running" state until the full assistant response ends or they guess completion locally. The /v1/responses path already does this correctly via tool_start_callback / tool_complete_callback (api_server.py:1797-1829) — only chat completions was missing it.

The fix

Define _on_tool_start and _on_tool_complete in the chat completions handler that push tagged ("__tool_lifecycle__", payload) items to the existing stream queue. The SSE writer's _emit is extended to recognize the new tag on the same event: hermes.tool.progress SSE line, so no new event type is introduced — clients consuming the legacy emoji/label event keep working unchanged. A small _started_tool_call_ids set tracks running calls so a completed callback for a filtered-out internal tool, or one without a prior matching start, is silently dropped.

Test plan

  • Focused regression test test_stream_emits_tool_lifecycle_with_call_id: parses hermes.tool.progress payloads from the SSE body and asserts running + completed arrive in order with the same toolCallId.
  • Edge-case test test_stream_tool_lifecycle_skips_internal_and_orphan_completes: internal tool starts and orphan completes (no matching start) produce no lifecycle events on the wire.
  • Adjacent suite: tests/gateway/test_api_server.py, tests/gateway/test_api_server_runs.py, tests/gateway/test_api_server_multimodal.py — 159/159 pass.
  • Regression guard: with gateway/platforms/api_server.py reverted to origin/main, the new lifecycle test fails with missing running status; got []; with the fix applied it passes.

Related

…ns (NousResearch#16588)

`/v1/chat/completions` streaming previously emitted only a
``tool.started``-style ``hermes.tool.progress`` SSE event for tool calls.
Clients rendering tool lifecycle UI (Open WebUI, LobeChat, custom
dashboards) had no matching ``status: completed`` event with a stable
``toolCallId``, so tool cards stayed in a "running" state until the full
assistant response ended or the client guessed completion locally.

Pass ``tool_start_callback`` / ``tool_complete_callback`` (the same pair
already used by the ``/v1/responses`` path for exact call-id correlation)
into ``_run_agent`` from chat completions, and emit each half on the
existing ``event: hermes.tool.progress`` SSE line with ``toolCallId`` and
``status``. Internal tools (names starting with ``_``) and orphan
``completed`` events without a matching ``running`` are filtered out so
clients never see uncorrelatable lifecycle updates.

The pre-existing emoji/label ``__tool_progress__`` event continues
unchanged for backward compatibility — frontends that already render the
running marker keep working, and frontends that need lifecycle pairs now
get them.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings April 27, 2026 18:19

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR fixes /v1/chat/completions streaming so SSE consumers can reliably track tool lifecycles by emitting a matching status: completed event correlated via a stable toolCallId, aligning chat-completions behavior with the existing structured callbacks used elsewhere in the API server.

Changes:

  • Wire tool_start_callback / tool_complete_callback into the chat-completions streaming agent invocation.
  • Extend the SSE writer to emit tagged tool lifecycle payloads on event: hermes.tool.progress.
  • Add regression tests asserting lifecycle emission and filtering of internal/orphan lifecycle events.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

File Description
gateway/platforms/api_server.py Adds lifecycle callbacks for chat-completions streaming and teaches SSE writer to emit lifecycle payloads on hermes.tool.progress.
tests/gateway/test_api_server.py Adds regression tests for lifecycle running/completed emission, and for filtering internal tools/orphan completes.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread gateway/platforms/api_server.py Outdated
Comment on lines +935 to +1004
"toolCallId": tool_call_id,
"tool": function_name,
"status": "completed",
}))

Copilot AI Apr 27, 2026

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/v1/chat/completions streaming will now emit two event: hermes.tool.progress messages for every real tool start: one from tool_progress_callback ("tool_progress") and another from tool_start_callback ("tool_lifecycle"). This happens because run_agent.py fires both callbacks for each tool call. This duplication is likely to confuse SSE consumers (and it doesn’t match the PR description’s “payload now carries toolCallId + status alongside the legacy emoji/label event”). Consider emitting a single “running” payload that includes the legacy emoji/label fields plus toolCallId/status, or suppressing the legacy __tool_progress__ emission when lifecycle callbacks are enabled.

Copilot uses AI. Check for mistakes.
Comment thread tests/gateway/test_api_server.py Outdated
Comment on lines +799 to +825
statuses = []
seen_call_ids = set()
lines = body.splitlines()
for i, line in enumerate(lines):
if line.strip() != "event: hermes.tool.progress":
continue
# Find the next non-blank ``data:`` line for this event.
for follow in lines[i + 1: i + 4]:
if follow.startswith("data: "):
try:
payload = _json.loads(follow[len("data: "):])
except _json.JSONDecodeError:
break
status = payload.get("status")
if status:
statuses.append(status)
if "toolCallId" in payload:
seen_call_ids.add(payload["toolCallId"])
break

# Both halves of the lifecycle pair must be emitted, in order,
# and tied to the same toolCallId — that's the contract the
# issue asks for.
assert "running" in statuses, f"missing running status; got {statuses}"
assert "completed" in statuses, f"missing completed status; got {statuses}"
assert statuses.index("running") < statuses.index("completed")
assert seen_call_ids == {"call_terminal_1"}, seen_call_ids

Copilot AI Apr 27, 2026

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test intends to assert that both lifecycle halves are correlated by the same toolCallId, but it only checks seen_call_ids == {"call_terminal_1"} overall. That would still pass if (for example) the running payload omitted toolCallId while only completed included it. Consider collecting (status, toolCallId) pairs and asserting both running and completed events each carry toolCallId == "call_terminal_1" (and in the right order).

Copilot uses AI. Check for mistakes.
@alt-glitch alt-glitch added type/bug Something isn't working P2 Medium — degraded but workaround exists comp/gateway Gateway runner, session dispatch, delivery labels Apr 27, 2026
@alt-glitch

Copy link
Copy Markdown
Collaborator

Related to #16591 (earlier fix attempt for same issue #16588). This PR appears more complete with filtering for internal tools and orphan completed events.

Address Copilot review on PR NousResearch#16666:

1. **Duplicate event on every tool start** — both ``tool_progress_callback``
   and ``tool_start_callback`` fire side-by-side in ``run_agent.py``, so
   wiring both into chat completions emitted *two* ``hermes.tool.progress``
   events per real tool call. Drop the legacy ``_on_tool_progress`` emit
   entirely; ``_on_tool_start`` now produces a single unified event that
   carries the legacy ``tool``/``emoji``/``label`` fields plus the new
   ``toolCallId``/``status`` correlation fields. Label is computed inline
   via ``build_tool_preview`` so callers do not need to pre-format it.

2. **Weak per-event correlation in the regression test** — the previous
   assertion checked that a ``toolCallId`` appeared *somewhere* in the
   aggregate, which would have passed even if ``running`` lacked the id.
   Collect ``(status, toolCallId)`` per event and assert each event
   carries the correct pair, plus exactly two events on the wire (no
   silent duplication regression).

The two existing chat-completions tool-progress tests are updated to fire
``tool_start_callback`` instead of ``tool_progress_callback``, matching
production reality where ``run_agent`` always pairs them.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@briandevans

Copy link
Copy Markdown
Contributor Author

@copilot All findings addressed in commit 3dd69f9:

  • Duplicate hermes.tool.progress on every tool start — collapsed to a single emit. Dropped _on_tool_progress entirely from the chat completions path; _on_tool_start now produces one unified hermes.tool.progress event carrying the legacy tool/emoji/label payload and the new toolCallId/status correlation fields. tool_progress_callback is no longer wired into the chat completions _run_agent invocation, so the structured callbacks are the sole source for SSE tool lifecycle events. The two pre-existing tests were updated to fire tool_start_callback instead of tool_progress_callback, matching production reality where run_agent pairs them 1:1.

  • Weak per-event correlationtest_stream_emits_tool_lifecycle_with_call_id now collects (status, toolCallId) per event and asserts each event carries the correct pair plus exactly two events on the wire (assert pairs[0] == ("running", "call_terminal_1") / assert pairs[1] == ("completed", "call_terminal_1")). A regression that omitted toolCallId from the running payload would now fail loudly.

@briandevans

Copy link
Copy Markdown
Contributor Author

CI audit — all 16 `test` job failures are pre-existing baselines on clean `origin/main` (6d24880). Zero failures are in touched code.

Test Symptom Baseline?
`tests/gateway/test_run_progress_topics.py::test_run_agent_progress_uses_event_message_id_for_slack_dm` `TypeError: NoneType not callable` on `tool_progress_callback` reproduces on origin/main locally
`tests/hermes_cli/test_config_env_expansion.py::TestLoadConfigExpansion::test_load_config_*` `TypeError: string indices must be integers, not str` unrelated to api_server
`tests/hermes_cli/test_container_aware_cli.py::test_get_container_exec_info_defaults` `assert None is not None` unrelated
`tests/gateway/test_agent_cache.py::TestAgentCacheSpilloverLive::test_concurrent_inserts_settle_at_cap` 30s test timeout unrelated, flaky
`tests/hermes_cli/test_pty_bridge.py::TestPtyBridgeResize::test_resize_updates_child_winsize` `tput: No value for $TERM` in CI runner env-specific
`tests/hermes_cli/test_web_server.py::TestBuildSchemaFromConfig::test_no_single_field_categories` `prompt_caching` category has 1 field upstream config schema regression
`tests/hermes_cli/test_web_server.py::TestPtyWebSocket::test_resize_escape_is_forwarded` 30s timeout unrelated, flaky
`tests/run_agent/test_background_review_toolset_restriction.py::test_background_review_agent_uses_restricted_toolsets` `AIAgent.init was not called` unrelated
`tests/run_agent/test_tool_arg_coercion.py::TestCoerceNumber::test_inf_stays_string_for_integer_only` `assert "inf" == inf` reproduces on origin/main locally
`tests/tools/test_clipboard.py::TestIsWsl::*` WSL detection on non-WSL CI env-specific
`tests/tui_gateway/test_make_agent_provider.py::test_make_agent_passes_resolved_provider` unexpected `target_model` kwarg reproduces on origin/main locally
`tests/tui_gateway/test_protocol.py::test_session_resume_returns_hydrated_messages` unexpected `include_ancestors` kwarg reproduces on origin/main locally

Touched code (`tests/gateway/test_api_server.py::TestChatCompletionsEndpoint`) — all 17 tests pass locally, including the 4 progress-event tests both before and after the fix-up commit.

The `test` job is failing on baseline; my regression test `test_stream_emits_tool_lifecycle_with_call_id` passes. The `nix (ubuntu-latest)` failure tracks the same failed pytest run.

teknium1 pushed a commit that referenced this pull request Apr 29, 2026
Address Copilot review on PR #16666:

1. **Duplicate event on every tool start** — both ``tool_progress_callback``
   and ``tool_start_callback`` fire side-by-side in ``run_agent.py``, so
   wiring both into chat completions emitted *two* ``hermes.tool.progress``
   events per real tool call. Drop the legacy ``_on_tool_progress`` emit
   entirely; ``_on_tool_start`` now produces a single unified event that
   carries the legacy ``tool``/``emoji``/``label`` fields plus the new
   ``toolCallId``/``status`` correlation fields. Label is computed inline
   via ``build_tool_preview`` so callers do not need to pre-format it.

2. **Weak per-event correlation in the regression test** — the previous
   assertion checked that a ``toolCallId`` appeared *somewhere* in the
   aggregate, which would have passed even if ``running`` lacked the id.
   Collect ``(status, toolCallId)`` per event and assert each event
   carries the correct pair, plus exactly two events on the wire (no
   silent duplication regression).

The two existing chat-completions tool-progress tests are updated to fire
``tool_start_callback`` instead of ``tool_progress_callback``, matching
production reality where ``run_agent`` always pairs them.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@teknium1

Copy link
Copy Markdown
Contributor

Merged via PR #17505 — your commit was cherry-picked with authorship preserved (commit e0a03f3 on main). Thanks for the rigorous regression tests and the _started_tool_call_ids orphan filter.

@teknium1 teknium1 closed this Apr 29, 2026
donald131 pushed a commit to donald131/hermes-agent that referenced this pull request May 2, 2026
Address Copilot review on PR NousResearch#16666:

1. **Duplicate event on every tool start** — both ``tool_progress_callback``
   and ``tool_start_callback`` fire side-by-side in ``run_agent.py``, so
   wiring both into chat completions emitted *two* ``hermes.tool.progress``
   events per real tool call. Drop the legacy ``_on_tool_progress`` emit
   entirely; ``_on_tool_start`` now produces a single unified event that
   carries the legacy ``tool``/``emoji``/``label`` fields plus the new
   ``toolCallId``/``status`` correlation fields. Label is computed inline
   via ``build_tool_preview`` so callers do not need to pre-format it.

2. **Weak per-event correlation in the regression test** — the previous
   assertion checked that a ``toolCallId`` appeared *somewhere* in the
   aggregate, which would have passed even if ``running`` lacked the id.
   Collect ``(status, toolCallId)`` per event and assert each event
   carries the correct pair, plus exactly two events on the wire (no
   silent duplication regression).

The two existing chat-completions tool-progress tests are updated to fire
``tool_start_callback`` instead of ``tool_progress_callback``, matching
production reality where ``run_agent`` always pairs them.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
02356abc pushed a commit to 02356abc/hermes-agent that referenced this pull request May 14, 2026
Address Copilot review on PR NousResearch#16666:

1. **Duplicate event on every tool start** — both ``tool_progress_callback``
   and ``tool_start_callback`` fire side-by-side in ``run_agent.py``, so
   wiring both into chat completions emitted *two* ``hermes.tool.progress``
   events per real tool call. Drop the legacy ``_on_tool_progress`` emit
   entirely; ``_on_tool_start`` now produces a single unified event that
   carries the legacy ``tool``/``emoji``/``label`` fields plus the new
   ``toolCallId``/``status`` correlation fields. Label is computed inline
   via ``build_tool_preview`` so callers do not need to pre-format it.

2. **Weak per-event correlation in the regression test** — the previous
   assertion checked that a ``toolCallId`` appeared *somewhere* in the
   aggregate, which would have passed even if ``running`` lacked the id.
   Collect ``(status, toolCallId)`` per event and assert each event
   carries the correct pair, plus exactly two events on the wire (no
   silent duplication regression).

The two existing chat-completions tool-progress tests are updated to fire
``tool_start_callback`` instead of ``tool_progress_callback``, matching
production reality where ``run_agent`` always pairs them.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
jsboige pushed a commit to jsboige/hermes-agent that referenced this pull request May 14, 2026
Address Copilot review on PR NousResearch#16666:

1. **Duplicate event on every tool start** — both ``tool_progress_callback``
   and ``tool_start_callback`` fire side-by-side in ``run_agent.py``, so
   wiring both into chat completions emitted *two* ``hermes.tool.progress``
   events per real tool call. Drop the legacy ``_on_tool_progress`` emit
   entirely; ``_on_tool_start`` now produces a single unified event that
   carries the legacy ``tool``/``emoji``/``label`` fields plus the new
   ``toolCallId``/``status`` correlation fields. Label is computed inline
   via ``build_tool_preview`` so callers do not need to pre-format it.

2. **Weak per-event correlation in the regression test** — the previous
   assertion checked that a ``toolCallId`` appeared *somewhere* in the
   aggregate, which would have passed even if ``running`` lacked the id.
   Collect ``(status, toolCallId)`` per event and assert each event
   carries the correct pair, plus exactly two events on the wire (no
   silent duplication regression).

The two existing chat-completions tool-progress tests are updated to fire
``tool_start_callback`` instead of ``tool_progress_callback``, matching
production reality where ``run_agent`` always pairs them.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
dannyJ848 pushed a commit to dannyJ848/hermes-agent that referenced this pull request May 17, 2026
Address Copilot review on PR NousResearch#16666:

1. **Duplicate event on every tool start** — both ``tool_progress_callback``
   and ``tool_start_callback`` fire side-by-side in ``run_agent.py``, so
   wiring both into chat completions emitted *two* ``hermes.tool.progress``
   events per real tool call. Drop the legacy ``_on_tool_progress`` emit
   entirely; ``_on_tool_start`` now produces a single unified event that
   carries the legacy ``tool``/``emoji``/``label`` fields plus the new
   ``toolCallId``/``status`` correlation fields. Label is computed inline
   via ``build_tool_preview`` so callers do not need to pre-format it.

2. **Weak per-event correlation in the regression test** — the previous
   assertion checked that a ``toolCallId`` appeared *somewhere* in the
   aggregate, which would have passed even if ``running`` lacked the id.
   Collect ``(status, toolCallId)`` per event and assert each event
   carries the correct pair, plus exactly two events on the wire (no
   silent duplication regression).

The two existing chat-completions tool-progress tests are updated to fire
``tool_start_callback`` instead of ``tool_progress_callback``, matching
production reality where ``run_agent`` always pairs them.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
gweeteve pushed a commit to gweeteve/hermes-agent that referenced this pull request Jun 2, 2026
Address Copilot review on PR NousResearch#16666:

1. **Duplicate event on every tool start** — both ``tool_progress_callback``
   and ``tool_start_callback`` fire side-by-side in ``run_agent.py``, so
   wiring both into chat completions emitted *two* ``hermes.tool.progress``
   events per real tool call. Drop the legacy ``_on_tool_progress`` emit
   entirely; ``_on_tool_start`` now produces a single unified event that
   carries the legacy ``tool``/``emoji``/``label`` fields plus the new
   ``toolCallId``/``status`` correlation fields. Label is computed inline
   via ``build_tool_preview`` so callers do not need to pre-format it.

2. **Weak per-event correlation in the regression test** — the previous
   assertion checked that a ``toolCallId`` appeared *somewhere* in the
   aggregate, which would have passed even if ``running`` lacked the id.
   Collect ``(status, toolCallId)`` per event and assert each event
   carries the correct pair, plus exactly two events on the wire (no
   silent duplication regression).

The two existing chat-completions tool-progress tests are updated to fire
``tool_start_callback`` instead of ``tool_progress_callback``, matching
production reality where ``run_agent`` always pairs them.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Egavasyug pushed a commit to Egavasyug/hermes-agent that referenced this pull request Jun 10, 2026
Address Copilot review on PR NousResearch#16666:

1. **Duplicate event on every tool start** — both ``tool_progress_callback``
   and ``tool_start_callback`` fire side-by-side in ``run_agent.py``, so
   wiring both into chat completions emitted *two* ``hermes.tool.progress``
   events per real tool call. Drop the legacy ``_on_tool_progress`` emit
   entirely; ``_on_tool_start`` now produces a single unified event that
   carries the legacy ``tool``/``emoji``/``label`` fields plus the new
   ``toolCallId``/``status`` correlation fields. Label is computed inline
   via ``build_tool_preview`` so callers do not need to pre-format it.

2. **Weak per-event correlation in the regression test** — the previous
   assertion checked that a ``toolCallId`` appeared *somewhere* in the
   aggregate, which would have passed even if ``running`` lacked the id.
   Collect ``(status, toolCallId)`` per event and assert each event
   carries the correct pair, plus exactly two events on the wire (no
   silent duplication regression).

The two existing chat-completions tool-progress tests are updated to fire
``tool_start_callback`` instead of ``tool_progress_callback``, matching
production reality where ``run_agent`` always pairs them.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

comp/gateway Gateway runner, session dispatch, delivery P2 Medium — degraded but workaround exists type/bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug]: chat completions SSE does not emit tool completion progress

4 participants