feat(api): SSE streaming endpoint and /snapshot JSON #203
Extends `all-smi api` mode with two new HTTP endpoints that share the `schema: 1` Snapshot shape used by the `snapshot` CLI and the `record` NDJSON format: one transport, one serializer, three delivery channels (#193).

* `GET /snapshot`: one-shot JSON. Serves the last published frame; falls back to a fresh collection when the frame is older than 2x the collection interval. `?include=gpu,cpu,memory,chassis,process,storage` and `?pretty=1` are supported. Content-Type is `application/json`; the response carries `Cache-Control: no-store` and `X-Accel-Buffering: no`.
* `GET /events`: Server-Sent Events. Emits `event: snapshot` per collection cycle with the same JSON body. Supports `?include=`, `?throttle=N` (clamped to >= collection interval) and `?heartbeat=N` (default 30 s). Lagging receivers get `event: lag\ndata: {"dropped":N}` and resume with the next live frame. `Last-Event-ID` is accepted but never replays history.

Broadcast architecture: a single `tokio::sync::broadcast::Sender<Arc<Snapshot>>` is fanned out to every SSE client. `FrameBus::publish` never blocks on receivers, so slow clients cannot stall the publisher; the small buffer (16 frames) caps memory growth and surfaces gaps as `lag` events. `bus.latest()` gives `/snapshot` lock-free access to the most recent frame.

Wire-up: new `api/frame_bus.rs`, `api/server_state.rs` (composite `ApiState` using `FromRef`), `api/collection_loop.rs` (extracted from `server.rs`), `api/handlers/{events,snapshot,metrics_render}.rs`. Both endpoints also work over the existing Unix domain socket transport.

Docs: README gains a "Streaming (SSE)" subsection; API.md documents both endpoints. `examples/sse_client.html` ships a minimal browser `EventSource` demo.

Tests: 9 integration tests covering happy-path streaming (≥3 frames in 5 s), include filter, throttle, lag event on slow receivers, the `/snapshot` pretty/include/stale-fallback branches, and 50 concurrent clients holding their broadcast slots without stalling the publisher (tick jitter <= interval + 40 ms).

Closes #193
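To make the wire format of `/events` concrete, here is a minimal sketch of how an `event: snapshot` frame and an `event: lag` frame could be serialized. The helper names `sse_frame` and `lag_frame` and the sample JSON body are illustrative assumptions, not code from this PR, which may rely on a framework's SSE types instead.

```rust
/// Formats one Server-Sent Events frame: an `event:` line, one `data:` line
/// per line of payload, and a blank line that terminates the frame.
/// (Hypothetical helper for illustration only.)
fn sse_frame(event: &str, data: &str) -> String {
    let mut out = format!("event: {event}\n");
    for line in data.lines() {
        out.push_str("data: ");
        out.push_str(line);
        out.push('\n');
    }
    out.push('\n');
    out
}

/// Frame sent to a receiver that fell behind the broadcast buffer.
fn lag_frame(dropped: u64) -> String {
    sse_frame("lag", &format!("{{\"dropped\":{dropped}}}"))
}

fn main() {
    // The body below is a made-up, heavily abbreviated snapshot.
    print!("{}", sse_frame("snapshot", r#"{"schema":1,"gpus":[]}"#));
    print!("{}", lag_frame(3));
}
```

A browser `EventSource` would dispatch these as `snapshot` and `lag` events; splitting the payload on newlines keeps pretty-printed JSON valid inside a single frame.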
Security review findings addressed on the /events and /snapshot endpoints (issue #193):

CORS (CRITICAL):
- Replace wildcard Allow-Origin/Methods/Headers with a deny-by-default posture. Set ALL_SMI_API_CORS_ALLOWED_ORIGINS to a comma-separated allowlist for opt-in cross-origin access; `*` restores the legacy wildcard with a loud warning. Methods are now restricted to GET and OPTIONS, headers to the minimum needed for text/event-stream.

Process label truncation (CRITICAL/privacy):
- The Prometheus exporter already caps command/process_name/user label values at 256/128/128 bytes to mitigate scrape-response amplification and argv-embedded-secret exposure. The JSON /snapshot and SSE /events paths bypassed this cap. `filter_snapshot_value` now applies the same truncation via a shared `ProcessMetricExporter::truncate_for_label` helper so every wire-format surface inherits the guarantee.

/snapshot amplification DoS (HIGH):
- When the cached frame is stale or absent, every /snapshot request used to spawn its own DefaultSnapshotCollector. A burst of requests against a freshly-started server or a stalled collector could saturate the Tokio blocking pool. Added a fresh-collect mutex on FrameBus so concurrent callers serialize and share the winning collector's output.

SSE subscriber cap (HIGH):
- Unbounded /events subscriptions could exhaust file descriptors and broadcast-channel slots. Cap at 256 concurrent subscribers by default (ALL_SMI_API_MAX_SSE_SUBSCRIBERS env var); over-cap clients get 503 Service Unavailable with Retry-After: 5.

Misc hardening:
- `resolve_throttle` called clamp(lo, hi), which panicked when the interval exceeded MAX_INTERVAL_SECS; saturate the floor so the handler never panics.
- Truncate Last-Event-ID before logging so a 1 MiB header value cannot inflate log lines.
- Fix XSS in examples/sse_client.html: render GPU fields via textContent rather than innerHTML so crafted GPU names cannot inject HTML into the demo page.
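The label cap described above amounts to a byte-bounded truncation that must not split a multi-byte UTF-8 character. The sketch below shows one way to do that; the free function and its signature are assumptions standing in for `ProcessMetricExporter::truncate_for_label`, whose real signature is not shown in this PR text.

```rust
/// Truncates `value` to at most `max_bytes` bytes without splitting a
/// UTF-8 character. (Assumed signature; illustrative stand-in only.)
fn truncate_for_label(value: &str, max_bytes: usize) -> &str {
    if value.len() <= max_bytes {
        return value;
    }
    // Walk back to the nearest character boundary; index 0 is always one,
    // so the loop terminates.
    let mut end = max_bytes;
    while !value.is_char_boundary(end) {
        end -= 1;
    }
    &value[..end]
}

fn main() {
    // A long argv with an embedded secret is cut at the 256-byte command cap.
    let command = format!("python train.py --token={}", "x".repeat(400));
    let capped = truncate_for_label(&command, 256);
    println!("{} bytes", capped.len());
}
```

Cutting on a character boundary keeps the result a valid `&str`, so it can be serialized to JSON without a lossy conversion.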
Docs:
- API.md documents the new ALL_SMI_API_CORS_ALLOWED_ORIGINS and ALL_SMI_API_MAX_SSE_SUBSCRIBERS env vars and the process-label cap guarantee.

Tests:
- 7 new regression tests covering process field truncation in the JSON path, the clamp-panic guard, and the single-flight lock.

Full suite: `cargo test --lib --features cli` (928 passed), `cargo test --test sse_events_test --features cli` (11 passed), `cargo clippy --all-targets --features cli` (no warnings), `cargo fmt --check`.
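The clamp-panic guard covered by the regression tests can be illustrated as follows. `Ord::clamp` panics when `min > max`, so the floor has to be saturated to the ceiling first. The value of `MAX_INTERVAL_SECS` and the signature of `resolve_throttle` below are assumptions for illustration; only the two names come from this PR.

```rust
/// Upper bound for the throttle, in seconds (assumed value).
const MAX_INTERVAL_SECS: u64 = 3600;

/// Resolves the effective throttle in seconds. The requested value is raised
/// to at least the collection interval and capped at MAX_INTERVAL_SECS.
/// (Assumed signature; a sketch, not the handler's actual code.)
fn resolve_throttle(requested: Option<u64>, interval_secs: u64) -> u64 {
    // Saturate the floor so `lo <= hi` always holds; without this,
    // `clamp(lo, hi)` panics once the interval exceeds the maximum.
    let lo = interval_secs.min(MAX_INTERVAL_SECS);
    requested.unwrap_or(lo).clamp(lo, MAX_INTERVAL_SECS)
}

fn main() {
    println!("{}", resolve_throttle(Some(1), 3)); // raised to the interval
    println!("{}", resolve_throttle(Some(1), 7200)); // saturated, no panic
}
```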
Security + performance review (PR #203)
Reviewed the SSE streaming +
CRITICAL
Wildcard CORS leaked telemetry cross-origin.
Process label caps (from #189 hardening) did not propagate to the JSON/SSE surface.
HIGH
Unbounded SSE subscribers.
MEDIUM
XSS in
Verified safe
Tests + verification
All changes stay within this PR's branch; no new endpoints or breaking JSON schema changes.
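For the "Unbounded SSE subscribers" finding, a cap of this kind is commonly enforced with an atomic counter and a guard that releases the slot when the connection ends. The sketch below shows that pattern with the standard library only; `SubscriberLimit` and `SubscriberSlot` are hypothetical names, and the PR's actual mechanism may differ.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;

/// Counts active subscribers against a fixed maximum (hypothetical type).
struct SubscriberLimit {
    active: AtomicUsize,
    max: usize,
}

/// RAII guard: holding it occupies one slot, dropping it frees the slot.
struct SubscriberSlot {
    limit: Arc<SubscriberLimit>,
}

impl SubscriberLimit {
    fn new(max: usize) -> Arc<Self> {
        Arc::new(Self { active: AtomicUsize::new(0), max })
    }

    /// Returns a slot, or `None` when the cap is reached. A handler would
    /// answer `None` with 503 Service Unavailable and `Retry-After: 5`.
    fn try_acquire(self: &Arc<Self>) -> Option<SubscriberSlot> {
        let max = self.max;
        self.active
            .fetch_update(Ordering::AcqRel, Ordering::Acquire, |n| {
                (n < max).then_some(n + 1)
            })
            .ok()
            .map(|_| SubscriberSlot { limit: Arc::clone(self) })
    }

    fn active(&self) -> usize {
        self.active.load(Ordering::Acquire)
    }
}

impl Drop for SubscriberSlot {
    fn drop(&mut self) {
        self.limit.active.fetch_sub(1, Ordering::AcqRel);
    }
}

fn main() {
    let limit = SubscriberLimit::new(256); // default cap from the PR
    let _slot = limit.try_acquire().expect("below the cap");
    println!("active subscribers: {}", limit.active());
}
```

Tying the slot to a guard means a disconnected client frees its slot even when the stream ends through an error path.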
Add a "Security notes for SSE/snapshot endpoints" subsection under the existing "Streaming (SSE)" heading. Covers CORS opt-in (ALL_SMI_API_CORS_ALLOWED_ORIGINS), SSE subscriber cap (ALL_SMI_API_MAX_SSE_SUBSCRIBERS), process label truncation, and the single-flight stale fallback in /snapshot.
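As an illustration of the CORS opt-in these notes describe, the sketch below parses a comma-separated allowlist into a deny-by-default policy. The `CorsPolicy` enum and both functions are hypothetical; only the `ALL_SMI_API_CORS_ALLOWED_ORIGINS` variable name and the `*` wildcard behaviour come from this PR.

```rust
/// Cross-origin policy derived from the env var (hypothetical type).
#[derive(Debug, PartialEq)]
enum CorsPolicy {
    /// Default: no cross-origin access.
    Deny,
    /// Legacy wildcard, opted into with `*`.
    Any,
    /// Explicit allowlist of origins.
    Allow(Vec<String>),
}

/// Parses the raw value of ALL_SMI_API_CORS_ALLOWED_ORIGINS.
fn parse_cors_origins(raw: Option<&str>) -> CorsPolicy {
    let origins: Vec<String> = raw
        .unwrap_or("")
        .split(',')
        .map(str::trim)
        .filter(|o| !o.is_empty())
        .map(String::from)
        .collect();
    if origins.is_empty() {
        CorsPolicy::Deny
    } else if origins.iter().any(|o| o == "*") {
        CorsPolicy::Any
    } else {
        CorsPolicy::Allow(origins)
    }
}

/// Decides whether a request's `Origin` header value is permitted.
fn is_allowed(policy: &CorsPolicy, origin: &str) -> bool {
    match policy {
        CorsPolicy::Deny => false,
        CorsPolicy::Any => true,
        CorsPolicy::Allow(list) => list.iter().any(|o| o == origin),
    }
}

fn main() {
    let raw = std::env::var("ALL_SMI_API_CORS_ALLOWED_ORIGINS").ok();
    let policy = parse_cors_origins(raw.as_deref());
    println!("{policy:?}");
    println!("{}", is_allowed(&policy, "https://dashboard.example"));
}
```

An unset or empty variable yields `Deny`, so a deployment that never touches the setting exposes nothing cross-origin.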
PR Finalization
Verification results
tests: 928 lib + 1059 bin + 11 integration, all pass, unchanged.
fmt/clippy: clean.
API.md: both
README "Streaming (SSE)" subsection: already covered
README security notes: were missing. Added a "Security notes for SSE/snapshot endpoints" subsection (
Commit
Ready for merge.
Summary
Adds two HTTP endpoints to `all-smi api` mode that share the `schema: 1` Snapshot shape used by the `snapshot` CLI and the `record` NDJSON format: one transport, one serializer, three delivery channels (#193).

* `GET /snapshot`: one-shot JSON payload.
* `GET /events`: Server-Sent Events stream, one JSON frame per collection cycle.

Both work over the existing TCP and Unix Domain Socket transports.
Implementation
* `api/frame_bus.rs`: `FrameBus` wrapping `tokio::sync::broadcast::Sender<Arc<Snapshot>>` (16-frame buffer) plus a `RwLock` over the latest published frame. `publish()` never blocks on receivers, so slow SSE clients cannot stall the collection loop.
* `api/collection_loop.rs`: extracted the background reader loop from `server.rs`, added single-shot publish onto `FrameBus` after each cycle.
* `api/server_state.rs`: composite `ApiState` with `FromRef` impls so `/metrics` keeps its `SharedState` extractor while `/events` and `/snapshot` extract `FrameBus` directly.
* `api/handlers/events.rs`: SSE handler. Applies the `?include=` filter, `?throttle=N` (clamped to ≥ collection interval), and `?heartbeat=N` (default 30 s). Emits `event: snapshot`, falls back to `event: lag\ndata: {"dropped": N}` when the receiver falls behind the broadcast buffer. `Last-Event-ID` is accepted but never replays history (all-smi has no persisted history). Responds with `X-Accel-Buffering: no` and `Cache-Control: no-store`.
* `api/handlers/snapshot.rs`: one-shot JSON. Reads the last broadcast frame; falls back to a fresh `DefaultSnapshotCollector` run when the cached frame is older than `2 × interval` or no cycle has published yet. Filters the `Snapshot::serde_json` output by `?include=` and supports `?pretty=1`.
* `api/handlers/metrics_render.rs`: unchanged Prometheus handler moved into the `handlers/` directory.
* `api/server.rs`: wired the new routes onto the existing Router; no change to the TCP/UDS listener logic.

Docs:

* README gains a "Streaming (SSE)" subsection; API.md documents the `/snapshot` / `/events` surface.
* `examples/sse_client.html`: minimal browser `EventSource` demo.

Testing
`cargo test --test sse_events_test` (9 integration tests, all pass):

* `events_emits_at_least_three_frames_within_five_seconds`: the spec's ≥3 frames in 5 s requirement.
* `events_include_filter_drops_other_sections`: `?include=gpu` yields only `gpus`.
* `events_throttle_reduces_emission_rate`: `?throttle=1` caps output rate over a 2 s window.
* `events_lag_event_emitted_for_slow_receiver`: overruns the broadcast buffer with 24 frames before the client polls, asserts the `event: lag` frame appears.
* `fifty_concurrent_clients_do_not_stall_the_publisher`: 50 subscribers, measured publisher tick jitter stays within `interval + 40 ms`.
* `snapshot_returns_latest_frame`, `snapshot_include_filter_drops_sections`, `snapshot_pretty_flag_produces_multiline_body`, `snapshot_falls_back_to_fresh_collect_when_stale`.

Plus new unit tests in `api/frame_bus.rs` (publish/subscribe/drop), `api/handlers/events.rs` (throttle/heartbeat clamp), `api/handlers/snapshot.rs` (include parser).

Test commands run locally:

* `cargo test --lib --features cli`
* `cargo test --bin all-smi --features cli`
* `cargo test --test sse_events_test --features cli`
* `cargo clippy --all-targets --features cli -- -D warnings`
* `cargo fmt --all -- --check`

Closes #193