feat(api): SSE streaming endpoint and /snapshot JSON #203

Merged
inureyes merged 3 commits into main from feat/193-sse-events
Apr 20, 2026
@inureyes

Member

Summary

Adds two HTTP endpoints to all-smi api mode that share the
schema: 1 Snapshot shape used by the snapshot CLI and the record
NDJSON format — one transport, one serializer, three delivery channels
(#193).

  • GET /snapshot — one-shot JSON payload.
  • GET /events — Server-Sent Events stream, one JSON frame per
    collection cycle.

Both work over the existing TCP and Unix Domain Socket transports.

Implementation

  • api/frame_bus.rs — FrameBus wrapping tokio::sync::broadcast::Sender<Arc<Snapshot>> (16-frame buffer) plus a RwLock over the latest published frame. publish() never blocks on receivers, so slow SSE clients cannot stall the collection loop.
  • api/collection_loop.rs — extracted the background reader loop from server.rs, added single-shot publish onto FrameBus after each cycle.
  • api/server_state.rs — composite ApiState with FromRef impls so /metrics keeps its SharedState extractor while /events and /snapshot extract FrameBus directly.
  • api/handlers/events.rs — SSE handler. Applies ?include= filter, ?throttle=N (clamped to ≥ collection interval), and ?heartbeat=N (default 30 s). Emits event: snapshot, falls back to event: lag\ndata: {"dropped": N} when the receiver falls behind the broadcast buffer. Last-Event-ID is accepted but never replays history (all-smi has no persisted history). Responds with X-Accel-Buffering: no and Cache-Control: no-store.
  • api/handlers/snapshot.rs — one-shot JSON. Reads the last broadcast frame; falls back to a fresh DefaultSnapshotCollector run when the cached frame is older than 2 × interval or no cycle has published yet. Filters the Snapshot::serde_json output by ?include= and supports ?pretty=1.
  • api/handlers/metrics_render.rs — unchanged Prometheus handler moved into the handlers/ directory.
  • api/server.rs — wired the new routes onto the existing Router; no change to the TCP/UDS listener logic.
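
The /events wire format described above can be sketched with plain string formatting. This is only an illustration: the helper names `snapshot_frame` and `lag_frame` are hypothetical and not taken from the PR, and the JSON body is assumed to be already serialized onto a single line.

```rust
/// Formats one `event: snapshot` frame.
/// Assumption: `json` is a single-line JSON document, since an SSE `data:`
/// field cannot carry a raw newline.
fn snapshot_frame(json: &str) -> String {
    format!("event: snapshot\ndata: {json}\n\n")
}

/// Formats the `event: lag` frame sent when a receiver fell behind the
/// broadcast buffer and `dropped` frames were skipped.
fn lag_frame(dropped: u64) -> String {
    format!("event: lag\ndata: {{\"dropped\": {dropped}}}\n\n")
}
```

Each frame ends with a blank line, which is how an EventSource client knows the event is complete.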

Docs:

  • README: new "Streaming (SSE)" subsection under the API block.
  • API.md: new "JSON Endpoints (Streaming + One-Shot)" section with the /snapshot / /events surface.
  • examples/sse_client.html: minimal browser EventSource demo.

Testing

cargo test --test sse_events_test (9 integration tests, all pass):

  • events_emits_at_least_three_frames_within_five_seconds — spec's ≥3 frames in 5 s requirement.
  • events_include_filter_drops_other_sections — ?include=gpu yields only gpus.
  • events_throttle_reduces_emission_rate — ?throttle=1 caps output rate over a 2 s window.
  • events_lag_event_emitted_for_slow_receiver — overruns the broadcast buffer with 24 frames before the client polls, asserts the event: lag frame appears.
  • fifty_concurrent_clients_do_not_stall_the_publisher — 50 subscribers, measured publisher tick jitter stays within interval + 40 ms.
  • snapshot_returns_latest_frame, snapshot_include_filter_drops_sections, snapshot_pretty_flag_produces_multiline_body, snapshot_falls_back_to_fresh_collect_when_stale.

Plus new unit tests in api/frame_bus.rs (publish/subscribe/drop), api/handlers/events.rs (throttle/heartbeat clamp), api/handlers/snapshot.rs (include parser).

Test commands run locally:

  • cargo test --lib --features cli
  • cargo test --bin all-smi --features cli
  • cargo test --test sse_events_test --features cli
  • cargo clippy --all-targets --features cli -- -D warnings
  • cargo fmt --all -- --check

Closes #193

Extends `all-smi api` mode with two new HTTP endpoints that share the
`schema: 1` Snapshot shape used by the `snapshot` CLI and the `record`
NDJSON format — one transport, one serializer, three delivery channels
(#193).

* `GET /snapshot` — one-shot JSON. Serves the last published frame;
  falls back to a fresh collection when stale > 2x interval.
  `?include=gpu,cpu,memory,chassis,process,storage` and `?pretty=1`
  supported. Content-Type `application/json`; emits `Cache-Control:
  no-store` and `X-Accel-Buffering: no`.
* `GET /events` — Server-Sent Events. Emits `event: snapshot` per
  collection cycle with the same JSON body. Supports `?include=`,
  `?throttle=N` (clamped to >= collection interval) and `?heartbeat=N`
  (default 30 s). Lagging receivers get `event: lag\ndata: {"dropped":N}`
  and resume with the next live frame. `Last-Event-ID` is accepted but
  never replays history.
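
The stale-frame rule for `/snapshot` above (fresh collection when the cached frame is older than 2x the interval, or when nothing has been published yet) can be written as a small predicate. This is a sketch under assumptions: the function name `needs_fresh_collect` and its signature are hypothetical, and the real handler may track frame age differently.

```rust
use std::time::Duration;

/// Returns true when `/snapshot` should run a fresh collection:
/// either no frame has been published yet (`age` is `None`), or the cached
/// frame is strictly older than twice the collection interval.
fn needs_fresh_collect(age: Option<Duration>, interval: Duration) -> bool {
    match age {
        None => true,
        // saturating_mul avoids an overflow panic for very large intervals.
        Some(age) => age > interval.saturating_mul(2),
    }
}
```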

Broadcast architecture: a single `tokio::sync::broadcast::Sender<Arc<
Snapshot>>` is fanned out to every SSE client. `FrameBus::publish` is
non-blocking wrt receivers — slow clients cannot stall the publisher;
the small buffer (16 frames) caps memory growth and surfaces gaps as
`lag` events. `bus.latest()` gives `/snapshot` lock-free access to the
most recent frame.
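
To make the lag behaviour concrete, here is a single-threaded, stdlib-only model of a bounded broadcast buffer. It is not the PR's `FrameBus` and does not use `tokio::sync::broadcast`; the names `ToyBus` and `Recv` are invented for illustration. It only shows why a receiver that falls more than 16 frames behind sees a gap instead of blocking the publisher.

```rust
use std::collections::VecDeque;
use std::sync::Arc;

/// Outcome of one receive attempt in this toy model.
#[derive(Debug, PartialEq)]
enum Recv<T> {
    Frame(Arc<T>),
    Lagged(u64), // number of frames the receiver missed
    Empty,
}

/// Bounded bus: keeps at most `cap` frames, and publishing never waits.
struct ToyBus<T> {
    cap: usize,
    frames: VecDeque<Arc<T>>,
    next_seq: u64, // sequence number the next published frame will get
}

impl<T> ToyBus<T> {
    fn new(cap: usize) -> Self {
        Self { cap, frames: VecDeque::with_capacity(cap), next_seq: 0 }
    }

    /// Overwrites the oldest frame when full instead of blocking.
    fn publish(&mut self, frame: T) {
        if self.frames.len() == self.cap {
            self.frames.pop_front();
        }
        self.frames.push_back(Arc::new(frame));
        self.next_seq += 1;
    }

    /// `cursor` is the sequence number the receiver expects next.
    fn recv(&self, cursor: &mut u64) -> Recv<T> {
        let oldest = self.next_seq - self.frames.len() as u64;
        if *cursor < oldest {
            // The frames the receiver wanted are gone: report the gap and
            // jump forward to the oldest frame still retained.
            let dropped = oldest - *cursor;
            *cursor = oldest;
            return Recv::Lagged(dropped);
        }
        match self.frames.get((*cursor - oldest) as usize) {
            Some(frame) => {
                *cursor += 1;
                Recv::Frame(Arc::clone(frame))
            }
            None => Recv::Empty,
        }
    }

    /// Most recent frame, as a one-shot reader would use it.
    fn latest(&self) -> Option<Arc<T>> {
        self.frames.back().cloned()
    }
}
```

Publishing 24 frames into a 16-slot bus before the receiver polls yields one `Lagged(8)` followed by frame 8, which mirrors the scenario the lag integration test describes.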

Wire-up: new `api/frame_bus.rs`, `api/server_state.rs` (composite
`ApiState` using `FromRef`), `api/collection_loop.rs` (extracted from
`server.rs`), `api/handlers/{events,snapshot,metrics_render}.rs`. Both
endpoints also work over the existing Unix domain socket transport.

Docs: README gains a "Streaming (SSE)" subsection; API.md documents
both endpoints. `examples/sse_client.html` ships a minimal browser
`EventSource` demo.

Tests: 9 integration tests covering happy-path streaming (≥3 frames in
5 s), include filter, throttle, lag event on slow receivers, /snapshot
pretty/include/stale-fallback branches, and 50 concurrent clients
holding their broadcast slots without stalling the publisher (tick
jitter <= interval + 40 ms).

Closes #193
@inureyes inureyes added type:enhancement New feature or request priority:medium Medium priority issue status:review Under review labels Apr 20, 2026
Security review findings addressed on the /events and /snapshot
endpoints (issue #193):

CORS (CRITICAL):
- Replace wildcard Allow-Origin/Methods/Headers with a deny-by-default
  posture. Set ALL_SMI_API_CORS_ALLOWED_ORIGINS to a comma-separated
  allowlist for opt-in cross-origin access; `*` restores the legacy
  wildcard with a loud warning. Methods are now restricted to GET and
  OPTIONS, headers to the minimum needed for text/event-stream.

Process label truncation (CRITICAL/privacy):
- The Prometheus exporter already caps command/process_name/user label
  values at 256/128/128 bytes to mitigate scrape-response amplification
  and argv-embedded-secret exposure. The JSON /snapshot and SSE
  /events paths bypassed this cap. `filter_snapshot_value` now applies
  the same truncation via a shared `ProcessMetricExporter::
  truncate_for_label` helper so every wire-format surface inherits the
  guarantee.

/snapshot amplification DoS (HIGH):
- When the cached frame is stale or absent, every /snapshot request
  used to spawn its own DefaultSnapshotCollector. A burst of requests
  against a freshly-started server or a stalled collector could
  saturate the Tokio blocking pool. Added a fresh-collect mutex on
  FrameBus so concurrent callers serialize and share the winning
  collector's output.

SSE subscriber cap (HIGH):
- Unbounded /events subscriptions could exhaust file descriptors and
  broadcast-channel slots. Cap at 256 concurrent subscribers by
  default (ALL_SMI_API_MAX_SSE_SUBSCRIBERS env var); over-cap clients
  get 503 Service Unavailable with Retry-After: 5.

Misc hardening:
- `resolve_throttle` clamp(lo, hi) panicked when interval exceeded
  MAX_INTERVAL_SECS; saturate the floor so the handler never panics.
- Truncate Last-Event-ID before logging so a 1 MiB header value cannot
  inflate log lines.
- Fix XSS in examples/sse_client.html: render GPU fields via
  textContent rather than innerHTML so crafted GPU names cannot
  inject HTML into the demo page.

Docs:
- API.md documents the new ALL_SMI_API_CORS_ALLOWED_ORIGINS and
  ALL_SMI_API_MAX_SSE_SUBSCRIBERS env vars and the process-label cap
  guarantee.

Tests:
- 7 new regression tests covering process field truncation in the
  JSON path, clamp-panic guard, and single-flight lock. Full suite:
  cargo test --lib --features cli (928 passed), cargo test --test
  sse_events_test --features cli (11 passed), cargo clippy
  --all-targets --features cli (no warnings), cargo fmt --check.
@inureyes

Member Author

Security + performance review (PR #203)

Reviewed the SSE streaming + /snapshot endpoints against the issue-specified attack surface. Findings addressed in 43e4a50:

CRITICAL

Wildcard CORS leaked telemetry cross-origin.
src/api/server.rs configured CorsLayer::new().allow_origin(Any).allow_methods(Any).allow_headers(Any) on all routes. Any browser-facing page could fetch /metrics, /snapshot, and /events from any origin and read GPU utilization, process command lines (when --processes is on), usernames, hostnames, and power data. That's especially dangerous for the SSE stream because a single cross-origin EventSource can tail telemetry indefinitely.

  • Fix: deny-by-default CORS. ALL_SMI_API_CORS_ALLOWED_ORIGINS opts in to a comma-separated allowlist; * restores the old wildcard with a loud warning. Methods are now GET+OPTIONS, headers are the minimum for text/event-stream.
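
A deny-by-default policy of this kind can be sketched as a small parser over the environment value. The type `CorsPolicy` and the function `parse_cors_origins` are hypothetical names, and treating any `*` entry as the wildcard is an assumption; the PR's actual wiring into the CORS layer is not shown here.

```rust
#[derive(Debug, PartialEq)]
enum CorsPolicy {
    DenyAll,
    AllowAny,
    AllowList(Vec<String>),
}

/// Parses the raw value of ALL_SMI_API_CORS_ALLOWED_ORIGINS.
/// An unset or empty value denies all cross-origin access.
fn parse_cors_origins(raw: Option<&str>) -> CorsPolicy {
    let origins: Vec<String> = raw
        .unwrap_or("")
        .split(',')
        .map(str::trim)
        .filter(|s| !s.is_empty())
        .map(str::to_owned)
        .collect();

    if origins.is_empty() {
        CorsPolicy::DenyAll
    } else if origins.iter().any(|o| o == "*") {
        // Assumption: the warning goes to stderr; the real server may log it.
        eprintln!("warning: wildcard CORS enabled, telemetry readable from any origin");
        CorsPolicy::AllowAny
    } else {
        CorsPolicy::AllowList(origins)
    }
}

impl CorsPolicy {
    /// Exact-match check of a request's Origin header value.
    fn allows(&self, origin: &str) -> bool {
        match self {
            CorsPolicy::DenyAll => false,
            CorsPolicy::AllowAny => true,
            CorsPolicy::AllowList(list) => list.iter().any(|o| o == origin),
        }
    }
}
```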

Process label caps (from #189 hardening) did not propagate to the JSON/SSE surface.
The Prometheus exporter in src/api/metrics/process.rs already caps command/process_name/user at 256/128/128 bytes to mitigate scrape-response amplification and argv-embedded-secret exposure. The JSON path in /snapshot and /events serialized these fields raw, so the same hardening did not apply. A long argv containing ?password=... / API tokens would be broadcast to every SSE subscriber verbatim.

  • Fix: share the truncation helper. filter_snapshot_value now truncates process fields through ProcessMetricExporter::truncate_for_label so every wire-format path inherits the cap and the ...(N bytes truncated) marker.
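
For illustration, a byte cap with a truncation marker might look like the sketch below. `truncate_label` is a hypothetical stand-in, not the PR's `ProcessMetricExporter::truncate_for_label`; in particular, whether the real helper counts the marker against the cap is not stated, and this version does not.

```rust
/// Caps `value` at `max_bytes`, cutting on a UTF-8 character boundary and
/// appending a marker that reports how many bytes were removed.
fn truncate_label(value: &str, max_bytes: usize) -> String {
    if value.len() <= max_bytes {
        return value.to_owned();
    }
    // Step back until the cut does not split a multi-byte character.
    // Index 0 is always a boundary, so the loop terminates.
    let mut cut = max_bytes;
    while !value.is_char_boundary(cut) {
        cut -= 1;
    }
    format!("{}...({} bytes truncated)", &value[..cut], value.len() - cut)
}
```

Cutting on a character boundary matters because slicing a `&str` in the middle of a multi-byte character panics.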

HIGH

/snapshot amplification DoS.
When the last broadcast frame is older than 2 * interval (server just started, collector stalled, etc.), each incoming /snapshot request spawned its own DefaultSnapshotCollector and blocking reader set. A burst of requests could saturate Tokio's blocking pool (default 512) and stall /metrics, /events, and other handlers.

  • Fix: single-flight mutex on FrameBus::lock_fresh_collect(). Concurrent callers queue, re-check latest() after acquiring, and share the winning collector's output. The critical section is bounded by FRESH_COLLECT_TIMEOUT (5 s).
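
The lock, re-check, then collect pattern can be shown with blocking stdlib primitives. This is a simplified sketch: `Cache` and `get_or_collect` are hypothetical names, staleness is reduced to "no frame cached", and the real code is async with a timeout on the critical section.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{Arc, Mutex, RwLock};

struct Cache {
    latest: RwLock<Option<Arc<String>>>,
    fresh_collect: Mutex<()>, // single-flight lock
    collects: AtomicUsize,    // how many times the expensive path ran
}

impl Cache {
    fn new() -> Self {
        Self {
            latest: RwLock::new(None),
            fresh_collect: Mutex::new(()),
            collects: AtomicUsize::new(0),
        }
    }

    fn get_or_collect(&self, collect: impl FnOnce() -> String) -> Arc<String> {
        // Fast path: a frame is already cached.
        let cached = self.latest.read().unwrap().clone();
        if let Some(frame) = cached {
            return frame;
        }
        // Slow path: only one caller at a time may collect.
        let _flight = self.fresh_collect.lock().unwrap();
        // Re-check after acquiring: an earlier caller may have filled it.
        let cached = self.latest.read().unwrap().clone();
        if let Some(frame) = cached {
            return frame;
        }
        let frame = Arc::new(collect());
        self.collects.fetch_add(1, Ordering::SeqCst);
        *self.latest.write().unwrap() = Some(Arc::clone(&frame));
        frame
    }
}
```

With eight threads calling `get_or_collect` at once, the collector closure runs once and the other seven callers reuse its result.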

Unbounded SSE subscribers.
/events had no cap on concurrent subscribers, so 10 000 leaked EventSource objects from a misbehaving or hostile page would hold 10 000 broadcast-channel slots, FDs, and per-task memory.

  • Fix: ALL_SMI_API_MAX_SSE_SUBSCRIBERS (default 256) caps concurrent /events subscriptions; over-cap clients get 503 Service Unavailable with Retry-After: 5. Set to 0 to disable.
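
One way to enforce such a cap is an atomic counter with an RAII permit, sketched below. `SubscriberLimit` and `Permit` are hypothetical names, and reading `0` as "no cap" is an assumption about what "disable" means here.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;

struct SubscriberLimit {
    active: AtomicUsize,
    max: usize, // assumption: 0 means the cap is disabled
}

/// Held for the lifetime of one /events stream; releases its slot on drop.
struct Permit {
    limit: Arc<SubscriberLimit>,
}

impl SubscriberLimit {
    fn new(max: usize) -> Arc<Self> {
        Arc::new(Self { active: AtomicUsize::new(0), max })
    }

    /// Returns `None` when the cap is reached; a handler would then answer
    /// 503 Service Unavailable with Retry-After: 5.
    fn try_acquire(self: &Arc<Self>) -> Option<Permit> {
        let mut current = self.active.load(Ordering::Acquire);
        loop {
            if self.max != 0 && current >= self.max {
                return None;
            }
            match self.active.compare_exchange_weak(
                current,
                current + 1,
                Ordering::AcqRel,
                Ordering::Acquire,
            ) {
                Ok(_) => return Some(Permit { limit: Arc::clone(self) }),
                Err(actual) => current = actual, // lost a race, retry
            }
        }
    }
}

impl Drop for Permit {
    fn drop(&mut self) {
        self.limit.active.fetch_sub(1, Ordering::AcqRel);
    }
}
```

Tying the slot to a value that is dropped when the stream ends means a disconnected client frees its slot without any explicit cleanup call.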

MEDIUM

resolve_throttle clamp-panic.
u64::clamp(lo, hi) panics when lo > hi. If an operator sets --interval above MAX_INTERVAL_SECS (24 h), every /events request panics the handler task. Not a full crash (axum isolates task panics) but returns 500 and spams the log.

  • Fix: saturate the floor at MAX_INTERVAL_SECS before the clamp call.
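
The guard can be sketched as follows. Only the 24 h value of `MAX_INTERVAL_SECS` and the clamp behaviour come from the description above; the signature of `resolve_throttle` here is a guess for illustration.

```rust
const MAX_INTERVAL_SECS: u64 = 24 * 60 * 60; // 24 h, per the review note

/// Resolves the effective throttle in seconds.
/// `requested` is the optional ?throttle=N value; `interval_secs` is the
/// collection interval, which acts as the lower bound.
fn resolve_throttle(requested: Option<u64>, interval_secs: u64) -> u64 {
    // Saturate the floor first so that floor <= MAX_INTERVAL_SECS always
    // holds; u64::clamp panics when its lower bound exceeds its upper bound.
    let floor = interval_secs.min(MAX_INTERVAL_SECS);
    requested.unwrap_or(floor).clamp(floor, MAX_INTERVAL_SECS)
}
```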

Last-Event-ID log amplification.
The handler logged the full Last-Event-ID header to the debug trace. HTTP header size is bounded by hyper but attackers can still inflate log lines with a header near the default limit.

  • Fix: truncate at 256 bytes before logging.

XSS in examples/sse_client.html.
tr.innerHTML = `${g.name ?? "—"}...` interpolated GPU fields directly into HTML. A mock/hostile server returning a GPU name containing <script> could execute JS when operators opened the demo client.

  • Fix: build every cell via document.createElement("td") + textContent.

Verified safe

  • Query-param parsing (?throttle=-1, ?throttle=NaN, etc.) — serde_urlencoded rejects with 400 at Query<T> extraction; no panic path.
  • Last-Event-ID parsing — handler only calls v.to_str() and logs; no replay path, no state-shared data.
  • Broadcast lag timing side channel — attacker can only observe their own receiver lagging, which requires them to hold a connection and not consume; no novel info leak beyond what normal timestamps reveal.
  • X-Accel-Buffering: no — set on every /events stream response and every /snapshot response (including errors).
  • UDS transport — socket file permissions are 0o600, restrictive enough for a local-IPC surface.
  • ?include=admin-style filter bypass — parse_include matches against an explicit allowlist (gpu,cpu,memory,chassis,process,storage and aliases); unknown names are silently dropped. No path-traversal or command-injection surface.
  • Float sanitization — filter_snapshot_value calls sanitize_json_floats before serialization, so NaN/Inf from flaky drivers cannot fail the response.
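
As an illustration of the allowlist approach to ?include=, the sketch below keeps only known section names and drops everything else. It is not the PR's `parse_include`: the signature is assumed, and the aliases mentioned above are omitted because they are not listed.

```rust
use std::collections::BTreeSet;

/// Section names accepted by ?include=, as listed in the PR description.
const SECTIONS: [&str; 6] = ["gpu", "cpu", "memory", "chassis", "process", "storage"];

/// Splits a comma-separated ?include= value and keeps only allowlisted
/// names. Unknown entries are silently dropped, never echoed or executed.
fn parse_include(raw: &str) -> BTreeSet<&'static str> {
    raw.split(',')
        .map(str::trim)
        .filter_map(|name| SECTIONS.iter().copied().find(|s| *s == name))
        .collect()
}
```

Because the result only ever contains the static strings from `SECTIONS`, no part of the request value flows into later lookups.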

Tests + verification

  • cargo test --lib --features cli — 928 passed (+7 new regression tests)
  • cargo test --test sse_events_test --features cli — 11 passed
  • cargo test --bin all-smi --features cli — 1059 passed
  • cargo clippy --all-targets --features cli -- -D warnings — clean
  • cargo fmt --all -- --check — clean

All changes stay within this PR's branch; no new endpoints or breaking JSON schema changes.

Add a "Security notes for SSE/snapshot endpoints" subsection under the
existing "Streaming (SSE)" heading. Covers CORS opt-in
(ALL_SMI_API_CORS_ALLOWED_ORIGINS), SSE subscriber cap
(ALL_SMI_API_MAX_SSE_SUBSCRIBERS), process label truncation, and the
single-flight stale fallback in /snapshot.
@inureyes

Member Author

PR Finalization

Verification results

tests: 928 lib + 1059 bin + 11 integration — all pass, unchanged.

fmt/clippy: clean.

examples/sse_client.html: DOM is safe — all cell content written via textContent through the addCell helper (lines 186-194); gpuBody.innerHTML = "" only clears the table body (no user-controlled data injected).

API.md: both /snapshot (lines 59-90) and /events (lines 91-171) are fully documented, including query params, keep-alive, Last-Event-ID semantics, X-Accel-Buffering: no reverse-proxy guidance, and the security knobs table.

README "Streaming (SSE)" subsection: already covered /events + /snapshot with query params, keep-alive, Last-Event-ID, and X-Accel-Buffering: no.

README security notes: were missing. Added a "Security notes for SSE/snapshot endpoints" subsection (####) immediately after the demo-client paragraph, covering:

  • ALL_SMI_API_CORS_ALLOWED_ORIGINS CORS opt-in
  • ALL_SMI_API_MAX_SSE_SUBSCRIBERS subscriber cap + 503 behaviour
  • Process label truncation caps (256/128/128 bytes)
  • Single-flight stale fallback in /snapshot

Commit

646e1fa — docs(api): add SSE/snapshot security notes to README Streaming section

Ready for merge.

@inureyes inureyes added status:done Completed and removed status:review Under review labels Apr 20, 2026
@inureyes inureyes merged commit 845d24a into main Apr 20, 2026
4 checks passed
@inureyes inureyes deleted the feat/193-sse-events branch April 20, 2026 20:42
@inureyes inureyes self-assigned this May 24, 2026

Development

Successfully merging this pull request may close these issues.

feat(api): Server-Sent Events (SSE) streaming endpoint '/events' and '/snapshot' JSON

1 participant