feat: add frontend metrics to disconnect#2929
Conversation
|
👋 Hi michaelfeil! Thank you for contributing to ai-dynamo/dynamo. Just a reminder: The 🚀 |
WalkthroughAdds optional Metrics to disconnect monitoring: updates create_connection_monitor to accept Option<Arc>, propagates it internally, and increments a client-disconnect gauge on disconnect events. Wires metrics through handlers in openai.rs. Introduces a new client_disconnect_gauge in Metrics with increment and getter APIs, and registers it. Changes
Sequence Diagram(s)sequenceDiagram
autonumber
participant C as Client
participant H as HTTP Handler
participant M as Connection Monitor
participant X as Metrics
C->>H: Initiate request (completions/chat/responses)
H->>M: create_connection_monitor(ctx, Some(metrics))
Note over H,M: Monitor spawned with optional metrics
rect rgba(230,250,230,0.6)
M-->>H: Provide stream_handle, connection_handle
H->>C: Stream response
end
alt Disconnect detected
M->>X: inc_client_disconnect()
Note right of X: client_disconnect_gauge += 1
else Graceful/disabled paths
M-->>M: No metrics update
end
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Poem
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
lib/llm/src/http/service/disconnect.rs (1)
100-116: Add missingmetricsargument to allcreate_connection_monitorcalls
- In lib/llm/src/grpc/service/openai.rs (line 55), the call
create_connection_monitor(context.clone()).await
omits the newmetricsparameter. Update to pass an appropriateOption<Arc<Metrics>>(e.g.,Some(state.metrics_clone())).
🧹 Nitpick comments (5)
lib/llm/src/http/service/openai.rs (2)
426-431: Chat completions: metrics plumbed correctly; also hit by rustfmt check.Same note: formatting check failed around this hunk;
cargo fmtshould fix.
662-663: Responses: intentional unused stream handle; format to pass CI.Using
_stream_handleis fine for unary. The rustfmt check flagged this area; please runcargo fmt.lib/llm/src/http/service/metrics.rs (2)
21-21: Prefer a Counter (and “_total” suffix) over a Gauge for disconnect events; consider labels.Client disconnects are monotonically increasing events; Prometheus best practice is an IntCounter (optionally with labels like endpoint/request_type) named
{prefix}_client_disconnects_total. Current IntGauge works but is semantically off.Proposed change (minimal, no labels):
- client_disconnect_gauge: prometheus::IntGauge, + client_disconnect_counter: prometheus::IntCounter,- let client_disconnect_gauge = prometheus::IntGauge::new( - frontend_metric_name("client_disconnects"), - "Number of connections dropped by clients", - ) - .unwrap(); + let client_disconnect_counter = prometheus::IntCounter::new( + frontend_metric_name("client_disconnects_total"), + "Total number of client-initiated disconnects", + ) + .unwrap();- client_disconnect_gauge, + client_disconnect_counter,- /// Increment the gauge for client disconnections - pub fn inc_client_disconnect(&self) { - self.client_disconnect_gauge.inc(); - } + /// Increment the counter for client disconnections + pub fn inc_client_disconnect(&self) { + self.client_disconnect_counter.inc(); + }- /// Get the count of client disconnections - pub fn get_client_disconnect_count(&self) -> i64 { - self.client_disconnect_gauge.get() - } + /// Get the total count of client disconnections + pub fn get_client_disconnect_count(&self) -> u64 { + self.client_disconnect_counter.get() + }- registry.register(Box::new(self.client_disconnect_gauge.clone()))?; + registry.register(Box::new(self.client_disconnect_counter.clone()))?;Optional follow-up: switch to
IntCounterVecwith labelsendpoint,request_typefor better slice-and-dice.Also applies to: 137-142, 206-209, 274-283, 287-287
93-105: Docstring: add the new metric to the list for completeness.Include the disconnect metric in the documented set so operators discover it.
/// The following metrics will be created with the configured prefix: /// - `{prefix}_requests_total` - IntCounterVec for the total number of requests processed /// - `{prefix}_inflight_requests` - IntGaugeVec for the number of inflight requests +/// - `{prefix}_client_disconnects_total` - IntCounter for client-initiated disconnects /// - `{prefix}_request_duration_seconds` - HistogramVec for the duration of requestslib/llm/src/http/service/disconnect.rs (1)
136-139: Counting policy: increments on both connection and stream phases may double-count per request.If you want “one disconnect per request,” dedupe increments (e.g., track a single AtomicBool and only inc once). Current behavior counts each unexpectedly dropped handle.
I can sketch a minimal dedupe using an
Arc<AtomicBool>shared across both awaits if you want it.Also applies to: 150-153
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
lib/llm/src/http/service/disconnect.rs(5 hunks)lib/llm/src/http/service/metrics.rs(4 hunks)lib/llm/src/http/service/openai.rs(3 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
lib/llm/src/http/service/metrics.rs (1)
lib/llm/src/http/service/service_v2.rs (1)
new(74-86)
lib/llm/src/http/service/openai.rs (2)
lib/llm/src/http/service/disconnect.rs (1)
create_connection_monitor(100-123)lib/llm/src/http/client.rs (2)
context(316-318)context(352-354)
🪛 GitHub Actions: Rust pre-merge checks
lib/llm/src/http/service/openai.rs
[error] 224-230: Rustfmt: formatting check failed (cargo fmt -- --check). Differences detected vs rustfmt output. Run 'cargo fmt' to format the code.
[error] 423-430: Rustfmt: formatting check failed (cargo fmt -- --check). Differences detected vs rustfmt output. Run 'cargo fmt' to format the code.
[error] 659-663: Rustfmt: formatting check failed (cargo fmt -- --check). Differences detected vs rustfmt output. Run 'cargo fmt' to format the code.
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Build and Test - dynamo
🔇 Additional comments (1)
lib/llm/src/http/service/metrics.rs (1)
284-295: Remove outdated warning: metrics are registered on the correct Registry
The code callsstate.metrics_clone().register(®istry)on the sameregistryinstance that’s passed intometrics::router(registry, …), so all metrics will be scrape‐visible.Likely an incorrect or invalid review comment.
|
____ ERROR collecting benchmarks/prefix_data_generator/tests/test_hasher.py ____ |
|
I squashed all commits in 1 to fix ci, #2929 and sign off commit. |
Overview:
Details:
Where should the reviewer start?
Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)
Summary by CodeRabbit
New Features
Refactor