fix(gateway): broadcast agent-run error payloads#85355
Conversation
|
Codex review: needs maintainer review before merge. Latest ClawSweeper review: 2026-05-22 12:47 UTC / May 22, 2026, 8:47 AM ET. Workflow note: Future ClawSweeper reviews update this same comment in place. How this review workflow works
Summary Reproducibility: yes. Source-level reproduction is high confidence: current main stores block/final replies in PR rating Rank-up moves:
What the crustacean ranks mean
Shiny media proof means a screenshot, video, or linked artifact directly shows the changed behavior. Runtime, network, CSP, and security claims still need visible diagnostics. Real behavior proof Risk before merge
Maintainer options:
Next step before merge Security Review detailsBest possible solution: Land one canonical Gateway fix that broadcasts returned agent-run error payloads, preserves the transcript ownership boundary, and then close the linked issue while superseding duplicate PRs. Do we have a high-confidence way to reproduce the issue? Yes. Source-level reproduction is high confidence: current main stores block/final replies in Is this the best way to solve the issue? Yes. The PR is the narrow maintainable fix: it inspects delivered error payloads in the existing post-dispatch path, reuses Label changes:
Label justifications:
What I checked:
Likely related people:
Codex review notes: model gpt-5.5, reasoning high; reviewed against 111bad106544. |
|
ClawSweeper PR egg ✨ Hatched: 💎 rare Mossy Merge Sprite Hatch commandComment Hatchability rules:
Rarity: 💎 rare. What is this egg doing here?
|
…UI error (#4437) ## Release target Refs #4434. This PR targets `v0.0.55`; #4434 should remain open until this OpenClaw upgrade is merged, tagged, and verified in the shipped `.55` release. ## Why this resolves #4434 NemoClaw #4434 reports that `openclaw tui` keeps an active spinner and `connected` status with no visible terminal error when the NVIDIA inference endpoint is unreachable. This branch moves the sandbox OpenClaw pin from `2026.5.22` to `2026.5.27` with npm integrity: `sha512-2N93zhdAo88KAbHt6T7KvYXf4s7XIkYXBgv1npYpn7e1Y9FvrtgtpsA38my9rtFW+70uXEojRPX5/OqnuDqJPw==` Upstream proof: - openclaw/openclaw#85815 and openclaw/openclaw@a668982 fix the missing `broadcastChatError()` call for synchronous `chat.send` failures. - openclaw/openclaw#84945 and openclaw/openclaw#85355 show the broader real class of gateway errors not being broadcast to clients. ## Changes - Bumps `Dockerfile`, `Dockerfile.base`, `agents/openclaw/manifest.yaml`, and package metadata to OpenClaw `2026.5.27`. - Updates OpenClaw pin/integrity tests, deployment/version tests, and the existing TUI chat-correlation E2E assertion. - Updates `scripts/patch-openclaw-chat-send.js` so NemoClaw's chat-send run-id preservation shim still recognizes the compiled OpenClaw `2026.5.27` followup-runner admission shape. - Adds a CI-safe Vitest contract harness for the #4434 TUI failure signature and expected visible-error behavior. - Adds the privileged live repro: `test/e2e/test-issue-4434-tui-unreachable-inference.sh`. - Wires that live repro into `nightly-e2e.yaml` as `issue-4434-tui-unreachable-inference-e2e`, including selective dispatch, public-install target-ref handling, failure artifacts, aggregate reporting coverage, and trusted workflow-script checkout for the secret/sudo firewall job. ## Local validation - `npm ci` - `npm ci --include=dev` - `npm run build:cli` - `npm run typecheck:cli` - `npm test -- test/fetch-guard-patch-regression.test.ts test/openclaw-chat-send-patch.test.ts test/openclaw-tui-chat-correlation.test.ts test/issue-4434-tui-unreachable-inference.test.ts` - `npm test -- src/lib/sandbox/version.test.ts src/lib/verify-deployment.test.ts` - `npm test -- test/validate-e2e-coverage.test.ts test/e2e-advisor-dispatch.test.ts test/e2e-script-workflow.test.ts test/issue-4434-tui-unreachable-inference.test.ts nemoclaw/src/package-metadata.test.ts` - `shellcheck test/e2e/test-issue-4434-tui-unreachable-inference.sh` - `bash -n test/e2e/test-issue-4434-tui-unreachable-inference.sh` - `bash -n test/e2e/test-openclaw-tui-chat-correlation.sh` - `NEMOCLAW_ISSUE_4434_LIVE=0 bash test/e2e/test-issue-4434-tui-unreachable-inference.sh` - `git diff --check` - Fresh `npm pack openclaw@2026.5.27` dist smoke with `node scripts/patch-openclaw-chat-send.js "$tmp/package/dist"` - Runtime Docker smoke: `docker build -f Dockerfile --build-arg BASE_IMAGE=ghcr.io/nvidia/nemoclaw/sandbox-base:latest -t nemoclaw-issue4434-openclaw-runtime-smoke:2026-5-27 .` - Runtime image version smoke: `docker run --rm --entrypoint openclaw nemoclaw-issue4434-openclaw-runtime-smoke:2026-5-27 --version` -> `OpenClaw 2026.5.27 (27ae826)` - Base-style OpenClaw install smoke in Docker for the `2026.5.27` npm integrity and install path. - Pre-commit suite on `98e0a763efe0925f26cf89129cd4ab63cb0b05f3`: passed, including CLI/plugin coverage hooks. - Pre-push suite reran CLI/plugin coverage; one unrelated `test/nemoclaw-start.test.ts` case timed out during the full concurrent run, then passed directly with `npx vitest run --project cli test/nemoclaw-start.test.ts -t "captures baseline snapshot when openclaw.json is valid and no baseline exists"`. ## Nightly proof Targeted nightly E2E passed on the final PR head: - Run: https://github.com/NVIDIA/NemoClaw/actions/runs/26586935610 - Job: https://github.com/NVIDIA/NemoClaw/actions/runs/26586935610/job/78335355241 - Head: `5f549f661fe81b485f75903146512af4225d4698` - Job: `issue-4434-tui-unreachable-inference-e2e` - Duration: 8m27s The live job runs the requested end-to-end flow on Linux with the repository `NVIDIA_API_KEY` secret: public install from this PR ref, cloud onboard with NVIDIA Endpoints and `nvidia/nemotron-3-super-120b-a12b`, pre-block `nemoclaw <sandbox> status`, pre-block `nemoclaw <sandbox> connect --probe-only`, exact `DOCKER-USER` `DROP` rules for `75.2.113.119` and `99.83.136.103`, in-sandbox endpoint-block verification, `openclaw tui`, `hello`, and final TUI assertion. The passing assertion was: `PASS: openclaw tui surfaced a visible unreachable-inference error and stopped the spinner` The dispatch command for reruns while this job only exists on the PR branch is: ```bash gh workflow run nightly-e2e.yaml --repo NVIDIA/NemoClaw \ --ref issue-4434-openclaw-2026-5-27-proof \ -f target_ref=5f549f661fe81b485f75903146512af4225d4698 \ -f pr_number=4437 \ -f jobs=issue-4434-tui-unreachable-inference-e2e ``` ## Remaining release note - Baseline: #4434 already captures the `v0.0.53` / OpenClaw `2026.5.22` spinner/no-error behavior after the exact firewall block. I did not rerun the mutating baseline repro from this macOS host. - Exact `Dockerfile.base` build was blocked locally because this Docker install does not provide `docker buildx`, while `Dockerfile.base` uses BuildKit `RUN --mount`. The runtime Docker path and a base-style OpenClaw install smoke both passed. <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **Tests** * Added an opt-in live E2E repro and new unit/integration tests for TUI behavior when inference endpoints are unreachable, validating visible error reporting, spinner shutdown, and compatibility with updated runtime/followup-runner shapes. * **Chores** * Bumped OpenClaw/runtime to 2026.5.27 across builds, manifests, docs, and test expectations. * **Chores / CI** * Added a selective/nightly E2E job to run the repro, include its results in aggregated reports, and upload sanitized logs with sensitive tokens redacted. <!-- end of auto-generated comment: release notes by coderabbit.ai --> --------- Co-authored-by: cjagwani <cjagwani@nvidia.com>
Summary
isErroragent-run payloads as terminalchaterror events after an agent has startedchat:<runId>dedupe entry failed for those returned error payloadsFixes #84945.
Verification
node scripts/run-vitest.mjs src/gateway/server-methods/chat.directive-tags.test.ts(2 files, 152 tests passed)node_modules/.bin/oxfmt --check src/gateway/server-methods/chat.ts src/gateway/server-methods/chat.directive-tags.test.ts CHANGELOG.mdgit diff --check origin/main...HEADReal behavior proof
Behavior addressed: returned
isErrorpayloads afteragentRunStarted=truewere persisted/logged but not broadcast as terminal chat errors.Real environment tested: local Gateway server-method unit harness in a Codex worktree, with
dispatchInboundMessagetriggeringonAgentRunStartand returning a final error payload.Exact steps or command run after this patch:
node scripts/run-vitest.mjs src/gateway/server-methods/chat.directive-tags.test.ts.Evidence after fix: the new regression observes a
chatevent withstate: "error",errorMessage: "LLM idle timeout (120s): no response from model", and a failedchat:<runId>dedupe snapshot.Observed result after fix: connected Gateway clients receive the terminal error event while normal agent-run final text still is not mirrored into assistant transcript entries.
What was not tested: a live ACP/Ki-Agents 120s idle timeout run; CI and the focused Gateway harness cover the source-level failure path.