Add gateway health connection telemetry#74783
Add gateway health connection telemetry#74783trialanderrorstudios wants to merge 4 commits intoopenclaw:mainfrom
Conversation
|
Codex review: needs maintainer review before merge. Summary Reproducibility: yes. Source inspection on current main verifies the feature gap: health types, handler responses, docs, and the existing ping loop do not expose per-client Next step before merge Security Review detailsBest possible solution: Land the telemetry after maintainer review, preserving one authenticated WebSocket ping owner and keeping client-specific connection data as a per-response overlay outside the shared health cache. Do we have a high-confidence way to reproduce the issue? Yes. Source inspection on current main verifies the feature gap: health types, handler responses, docs, and the existing ping loop do not expose per-client Is this the best way to solve the issue? Yes. Reusing the authenticated 25s WebSocket ping path from #76111 and overlaying client-specific state per health response is the narrowest maintainable implementation I found. Acceptance criteria:
What I checked:
Likely related people:
Codex review notes: model gpt-5.5, reasoning high; reviewed against 1d34564de9ba. |
f9383f6 to
1b3d3cc
Compare
1b3d3cc to
f264e95
Compare
|
Acknowledging the updated ClawSweeper pass on the same head SHA. The prior "needs maintainer review" state is now stale because the latest bot edit identifies a concrete blocker around delayed pong matching. I will update this PR so valid delayed pong samples within the stale window still refresh connection health, add the requested regression coverage, and then re-run the focused validation before re-requesting review. |
|
Updated in What changed:
Validation:
|
|
Follow-up pushed in The previous push exposed two stale gateway-method test expectations in CI: those tests now correctly expect the per-client Additional local validation:
|
6a01bb5 to
e1520cf
Compare
|
Updated in This rebases the PR onto current What changed:
Validation:
|
e1520cf to
1cd9292
Compare
|
Rebased this PR onto current main at 68918ab and pushed head 1cd9292. Resolved the conflict with main lazy websocket message-handler loading by waiting for vi.dynamicImportSettled() in ws-connection tests while preserving the authenticated websocket ping-owner assertions. Kept current main deferred startup/maintenance lifecycle and retained this PR connection-health overlay plus websocket ping ownership. The prior GitHub failure in checks-node-agentic-control-plane-runtime (server.reload.test.ts cron stop expectation) passes locally on the rebased branch. Local validation: focused gateway health tests passed; full test/vitest/vitest.gateway-server.config.ts passed with 99 files and 870 tests; pnpm tsgo:test:src passed; oxfmt check passed; git diff --check origin/main...HEAD passed. |
Summary
connectiontelemetry to authenticated WebSockethealthresponses.health.connectionand add tests for stale pongs, socket close, no-first-pong staleness, and multi-client isolation.Fixes #70230.
Details
health.connectionis overlaid per response and includes:The cached service-level
HealthSummarystays shared/global. The connection block is derived from the current request client, so one client cannot see another client's RTT or heartbeat timestamp.RTT uses WebSocket protocol ping/pong with a 5s ping cadence and a 20s stale threshold:
Before the first successful pong,
connectedcan betruewhilerttMsandlastHeartbeatAtarenull, but only during the initial stale window. If no pong is accepted withinCONNECTION_STALE_MSfromconnectedAtMs,connectedbecomesfalse.Coordination with #56668
Related #56668 adds a standalone 25s WebSocket protocol ping keepalive. This PR also sends protocol pings, but uses one gateway maintenance loop so the same ping/pong path powers both keepalive and
health.connectiontelemetry.If this PR lands first, it should be the single WebSocket ping owner and #56668's standalone per-socket interval is covered. If #56668 lands first, this branch should be adjusted to reuse the existing ping owner instead of merging duplicate protocol ping loops.
Tests
pnpm test src/gateway/server/connection-health.test.ts src/gateway/server-maintenance.test.ts src/gateway/server-close.test.ts src/gateway/server-startup-early.test.tspnpm test src/gateway/server.health.test.ts src/gateway/server/health-state.test.ts src/gateway/client.watchdog.test.ts src/commands/health.snapshot.test.tspnpm tsgo:test:srcpnpm tsgo:corepnpm exec oxfmt --check --threads=1 src/commands/health.ts src/commands/health.types.ts src/gateway/gateway-misc.test.ts src/gateway/server-close.test.ts src/gateway/server-close.ts src/gateway/server-maintenance.test.ts src/gateway/server-maintenance.ts src/gateway/server-methods/health.ts src/gateway/server-methods/shared-types.ts src/gateway/server-runtime-handles.ts src/gateway/server-startup-early.test.ts src/gateway/server-startup-early.ts src/gateway/server.canvas-auth.test.ts src/gateway/server.health.test.ts src/gateway/server.impl.ts src/gateway/server/connection-health.ts src/gateway/server/connection-health.test.ts src/gateway/server/ws-connection.ts src/gateway/server/ws-connection/message-handler.ts src/gateway/server/ws-types.ts docs/gateway/health.md docs/gateway/protocol.md docs/cli/health.mdpnpm check:changedNote: localhost-binding tests were run outside the sandbox because sandboxed execution rejects
listen 127.0.0.1withEPERM.