Skip to content

Add gateway health connection telemetry#74783

Open
trialanderrorstudios wants to merge 4 commits intoopenclaw:mainfrom
trialanderrorstudios:fix-health-connection-telemetry
Open

Add gateway health connection telemetry#74783
trialanderrorstudios wants to merge 4 commits intoopenclaw:mainfrom
trialanderrorstudios:fix-health-connection-telemetry

Conversation

@trialanderrorstudios
Copy link
Copy Markdown

@trialanderrorstudios trialanderrorstudios commented Apr 30, 2026

Summary

  • Add per-client connection telemetry to authenticated WebSocket health responses.
  • Track WebSocket protocol ping/pong RTT on each gateway client without mutating the cached global health snapshot.
  • Document health.connection and add tests for stale pongs, socket close, no-first-pong staleness, and multi-client isolation.

Fixes #70230.

Details

health.connection is overlaid per response and includes:

{
  connected: boolean;
  rttMs: number | null;
  lastHeartbeatAt: number | null;
}

The cached service-level HealthSummary stays shared/global. The connection block is derived from the current request client, so one client cannot see another client's RTT or heartbeat timestamp.

RTT uses WebSocket protocol ping/pong with a 5s ping cadence and a 20s stale threshold:

const CONNECTION_PING_INTERVAL_MS = 5_000;
const CONNECTION_STALE_MS = CONNECTION_PING_INTERVAL_MS * 4;

Before the first successful pong, connected can be true while rttMs and lastHeartbeatAt are null, but only during the initial stale window. If no pong is accepted within CONNECTION_STALE_MS from connectedAtMs, connected becomes false.

Coordination with #56668

Related #56668 adds a standalone 25s WebSocket protocol ping keepalive. This PR also sends protocol pings, but uses one gateway maintenance loop so the same ping/pong path powers both keepalive and health.connection telemetry.

If this PR lands first, it should be the single WebSocket ping owner and #56668's standalone per-socket interval is covered. If #56668 lands first, this branch should be adjusted to reuse the existing ping owner instead of merging duplicate protocol ping loops.

Tests

  • pnpm test src/gateway/server/connection-health.test.ts src/gateway/server-maintenance.test.ts src/gateway/server-close.test.ts src/gateway/server-startup-early.test.ts
  • pnpm test src/gateway/server.health.test.ts src/gateway/server/health-state.test.ts src/gateway/client.watchdog.test.ts src/commands/health.snapshot.test.ts
  • pnpm tsgo:test:src
  • pnpm tsgo:core
  • pnpm exec oxfmt --check --threads=1 src/commands/health.ts src/commands/health.types.ts src/gateway/gateway-misc.test.ts src/gateway/server-close.test.ts src/gateway/server-close.ts src/gateway/server-maintenance.test.ts src/gateway/server-maintenance.ts src/gateway/server-methods/health.ts src/gateway/server-methods/shared-types.ts src/gateway/server-runtime-handles.ts src/gateway/server-startup-early.test.ts src/gateway/server-startup-early.ts src/gateway/server.canvas-auth.test.ts src/gateway/server.health.test.ts src/gateway/server.impl.ts src/gateway/server/connection-health.ts src/gateway/server/connection-health.test.ts src/gateway/server/ws-connection.ts src/gateway/server/ws-connection/message-handler.ts src/gateway/server/ws-types.ts docs/gateway/health.md docs/gateway/protocol.md docs/cli/health.md
  • pnpm check:changed

Note: localhost-binding tests were run outside the sandbox because sandboxed execution rejects listen 127.0.0.1 with EPERM.

@trialanderrorstudios trialanderrorstudios requested a review from a team as a code owner April 30, 2026 03:19
@openclaw-barnacle openclaw-barnacle Bot added docs Improvements or additions to documentation gateway Gateway runtime commands Command implementations size: M labels Apr 30, 2026
@clawsweeper
Copy link
Copy Markdown
Contributor

clawsweeper Bot commented Apr 30, 2026

Codex review: needs maintainer review before merge.

Summary
The branch adds an optional per-client Gateway health.connection block, samples RTT from authenticated WebSocket ping/pong, and updates health docs, changelog, and gateway tests.

Reproducibility: yes. Source inspection on current main verifies the feature gap: health types, handler responses, docs, and the existing ping loop do not expose per-client connected, rttMs, or lastHeartbeatAt.

Next step before merge
No repair job is needed because this active source PR has no discrete code defect from review; the remaining action is maintainer review and merge decision for the new public health field.

Security
Cleared: No concrete security or supply-chain concern found; the diff changes authenticated Gateway telemetry, docs, tests, and changelog text without dependency, workflow, secret, install, or permission changes.

Review details

Best possible solution:

Land the telemetry after maintainer review, preserving one authenticated WebSocket ping owner and keeping client-specific connection data as a per-response overlay outside the shared health cache.

Do we have a high-confidence way to reproduce the issue?

Yes. Source inspection on current main verifies the feature gap: health types, handler responses, docs, and the existing ping loop do not expose per-client connected, rttMs, or lastHeartbeatAt.

Is this the best way to solve the issue?

Yes. Reusing the authenticated 25s WebSocket ping path from #76111 and overlaying client-specific state per health response is the narrowest maintainable implementation I found.

Acceptance criteria:

  • pnpm test src/gateway/server/ws-connection.test.ts src/gateway/server/connection-health.test.ts src/gateway/server-methods/server-methods.test.ts src/gateway/server.health.test.ts src/commands/health.snapshot.test.ts
  • pnpm tsgo:test:src
  • pnpm tsgo:core
  • pnpm check:changed

What I checked:

Likely related people:

  • BunsDev: Merged fix(control-ui): preserve Stop after reconnect #76111 added the current authenticated Gateway WebSocket protocol ping timer that this PR reuses for connection telemetry. (role: recent ping-path owner; confidence: high; commits: 4532e5d85803; files: src/gateway/server/ws-connection.ts, src/gateway/server/ws-connection.test.ts)
  • steipete: Recent current-main history shows gateway WebSocket lazy-loading work and health cache maintenance on the affected paths. (role: recent gateway health and WebSocket maintainer; confidence: medium; commits: 72072d8b2eba, f5fde074bd67; files: src/gateway/server/ws-connection.ts, src/gateway/server-methods/health.ts)
  • vincentkoc: Recent merged work hardened and preserved runtime-backed health snapshot behavior around the same handler surface. (role: adjacent health snapshot maintainer; confidence: medium; commits: be6263da4f51; files: src/gateway/server-methods/health.ts, src/gateway/server/ws-connection.ts)

Codex review notes: model gpt-5.5, reasoning high; reviewed against 1d34564de9ba.

@trialanderrorstudios
Copy link
Copy Markdown
Author

Acknowledging the updated ClawSweeper pass on the same head SHA. The prior "needs maintainer review" state is now stale because the latest bot edit identifies a concrete blocker around delayed pong matching.

I will update this PR so valid delayed pong samples within the stale window still refresh connection health, add the requested regression coverage, and then re-run the focused validation before re-requesting review.

@trialanderrorstudios
Copy link
Copy Markdown
Author

Updated in 09a3f4578f02affbec7eb8c2ec2134ffc01b2896.

What changed:

  • Connection health now tracks a bounded set of pending ping timestamps within the stale window.
  • recordConnectionPong accepts valid delayed pong samples that echo any pending timestamp, not only the latest ping.
  • Accepted pong samples update RTT/lastHeartbeatAtMs and remove the acknowledged timestamp from the pending set.
  • Added regressions for delayed pongs within and outside the stale window.

Validation:

  • pnpm test src/gateway/server/connection-health.test.ts src/gateway/server-maintenance.test.ts
  • pnpm exec oxfmt --check --threads=1 src/gateway/server/connection-health.ts src/gateway/server/connection-health.test.ts src/gateway/server-maintenance.ts
  • pnpm tsgo:test:src
  • pnpm check:changed

@trialanderrorstudios
Copy link
Copy Markdown
Author

Follow-up pushed in 6a01bb51955125a9ea089b326af3d8b9883300b5.

The previous push exposed two stale gateway-method test expectations in CI: those tests now correctly expect the per-client connection overlay on refreshed health responses.

Additional local validation:

  • pnpm test src/gateway/server-methods/server-methods.test.ts src/gateway/server/connection-health.test.ts src/gateway/server-maintenance.test.ts
  • pnpm exec oxfmt --check --threads=1 src/gateway/server-methods/server-methods.test.ts src/gateway/server/connection-health.ts src/gateway/server/connection-health.test.ts src/gateway/server-maintenance.ts
  • pnpm tsgo:test:src
  • pnpm check:changed

@trialanderrorstudios trialanderrorstudios force-pushed the fix-health-connection-telemetry branch from 6a01bb5 to e1520cf Compare May 3, 2026 12:39
@trialanderrorstudios
Copy link
Copy Markdown
Author

Updated in e1520cfe79038e8e43ad6ba59001a2118d172a08.

This rebases the PR onto current main and addresses the ClawSweeper blocker by making connection-health RTT sampling use the authenticated WebSocket ping owner added by merged #76111 (fix(control-ui): preserve Stop after reconnect, merge commit 4532e5d85803) instead of the separate maintenance-loop pinger.

What changed:

  • Removed the server-maintenance connection ping interval and the related startup/shutdown runtime handle plumbing.
  • Wired ws-connection to call pingGatewayClient from the authenticated socket ping path introduced by fix(control-ui): preserve Stop after reconnect #76111.
  • Kept the landed 25s keepalive cadence, with one immediate authenticated ping after client registration so long-lived clients can get their first RTT sample without waiting for a second loop.
  • Preserved the per-client health.connection overlay without mutating the shared cached health snapshot.
  • Updated docs, changelog wording, and the socket/maintenance/startup/shutdown tests for the single-owner path.

Validation:

  • pnpm test src/gateway/server/ws-connection.test.ts src/gateway/server/connection-health.test.ts src/gateway/server-maintenance.test.ts src/gateway/server-methods/server-methods.test.ts src/gateway/server.health.test.ts src/commands/health.snapshot.test.ts src/gateway/server-close.test.ts src/gateway/server-startup-early.test.ts
  • pnpm tsgo:test:src
  • pnpm tsgo:core
  • pnpm exec oxfmt --check --threads=1 CHANGELOG.md docs/cli/health.md docs/gateway/health.md docs/gateway/protocol.md src/gateway/server-close.test.ts src/gateway/server-close.ts src/gateway/server-maintenance.test.ts src/gateway/server-maintenance.ts src/gateway/server-runtime-handles.ts src/gateway/server-startup-early.test.ts src/gateway/server-startup-early.ts src/gateway/server.impl.ts src/gateway/server/connection-health.ts src/gateway/server/ws-connection.test.ts src/gateway/server/ws-connection.ts src/gateway/server-methods/health.ts
  • pnpm check:changed

@trialanderrorstudios trialanderrorstudios force-pushed the fix-health-connection-telemetry branch from e1520cf to 1cd9292 Compare May 3, 2026 17:59
@trialanderrorstudios
Copy link
Copy Markdown
Author

Rebased this PR onto current main at 68918ab and pushed head 1cd9292. Resolved the conflict with main lazy websocket message-handler loading by waiting for vi.dynamicImportSettled() in ws-connection tests while preserving the authenticated websocket ping-owner assertions. Kept current main deferred startup/maintenance lifecycle and retained this PR connection-health overlay plus websocket ping ownership. The prior GitHub failure in checks-node-agentic-control-plane-runtime (server.reload.test.ts cron stop expectation) passes locally on the rebased branch. Local validation: focused gateway health tests passed; full test/vitest/vitest.gateway-server.config.ts passed with 99 files and 870 tests; pnpm tsgo:test:src passed; oxfmt check passed; git diff --check origin/main...HEAD passed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

commands Command implementations docs Improvements or additions to documentation gateway Gateway runtime size: M

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Connection-level fields on health (rttMs, lastHeartbeatAt, connected)

1 participant