feat: frontend, disconnect metrics (#1)#2953
Conversation
|
👋 Hi michaelfeil! Thank you for contributing to ai-dynamo/dynamo. Just a reminder: The 🚀 |
* add metrics to disconnect * fmt * fmt Signed-off-by: michaelfeil <me@michaelfeil.eu>
5f0a7c4 to
0ac83d5
Compare
WalkthroughAdds optional metrics propagation to connection monitoring: OpenAI HTTP and gRPC handlers now pass a metrics handle into create_connection_monitor; create_connection_monitor and its internal monitor accept an optional Metrics and increment a new client_disconnect gauge on client/stream disconnects. Metrics struct adds the gauge and accessors. Changes
Sequence Diagram(s)sequenceDiagram
autonumber
participant C as Client
participant H as OpenAI Handler (HTTP/gRPC)
participant D as disconnect::create_connection_monitor
participant M as Monitor Task
participant E as Engine Context
participant X as Metrics (optional)
C->>H: Start request / stream
H->>D: create_connection_monitor(context, Some(metrics))
D->>M: spawn monitor(context, metrics?)
Note over M: Observe client and stream lifecycles
alt Client disconnect or stream error
M->>X: inc_client_disconnect() [if metrics provided]
M->>E: kill()
else Graceful/normal completion
M-->>E: no-op (graceful shutdown)
end
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Pre-merge checks (1 passed, 2 warnings)❌ Failed checks (2 warnings)
✅ Passed checks (1 passed)
Poem
Tip 👮 Agentic pre-merge checks are now available in preview!Pro plan users can now enable pre-merge checks in their settings to enforce checklists before merging PRs.
Example: reviews:
pre_merge_checks:
custom_checks:
- name: "Undocumented Breaking Changes"
mode: "warning"
instructions: |
Flag potential breaking changes that are not documented:
1. Identify changes to public APIs/exports, CLI flags, environment variables, configuration keys, database schemas, or HTTP/GraphQL endpoints (including removed/renamed items and changes to types, required params, return values, defaults, or behavior).
2. Ignore purely internal/private changes (e.g., code not exported from package entry points or marked internal).
3. Verify documentation exists: a "Breaking Change" section in the PR description and updates to CHANGELOG.md.Please share your feedback with us on this Discord post. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
|
@coderabbitai help |
ChatThere are 3 ways to chat with CodeRabbit:
CodeRabbit commands
Other keywords and placeholders
Status, support, documentation and community
|
|
@coderabbitai review |
✅ Actions performedReview triggered.
|
|
@coderabbitai full review |
✅ Actions performedFull review triggered. |
There was a problem hiding this comment.
Actionable comments posted: 0
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
lib/llm/src/http/service/metrics.rs (1)
21-27: Use a Counter (monotonic) instead of a Gauge for disconnect eventsDisconnects are events, not a level. Prometheus best practice: model as a counter with
_totalsuffix. CurrentIntGaugerisks accidental decrements and breaks standard rate queries.Apply the following minimal change:
--- a/lib/llm/src/http/service/metrics.rs +++ b/lib/llm/src/http/service/metrics.rs @@ pub struct Metrics { request_counter: IntCounterVec, inflight_gauge: IntGaugeVec, - client_disconnect_gauge: prometheus::IntGauge, + client_disconnect_counter: prometheus::IntCounter, request_duration: HistogramVec, @@ - let client_disconnect_gauge = prometheus::IntGauge::new( - frontend_metric_name("client_disconnects"), - "Number of connections dropped by clients", - ) + let client_disconnect_counter = prometheus::IntCounter::new( + frontend_metric_name("client_disconnects_total"), + "Total number of client disconnect events observed", + ) .unwrap(); @@ Metrics { request_counter, inflight_gauge, - client_disconnect_gauge, + client_disconnect_counter, request_duration, @@ - /// Increment the gauge for client disconnections - pub fn inc_client_disconnect(&self) { - self.client_disconnect_gauge.inc(); - } + /// Increment the counter for client disconnections + pub fn inc_client_disconnect(&self) { + self.client_disconnect_counter.inc(); + } @@ - /// Get the count of client disconnections - pub fn get_client_disconnect_count(&self) -> i64 { - self.client_disconnect_gauge.get() - } + /// Get the total count of client disconnections + pub fn get_client_disconnect_count(&self) -> u64 { + self.client_disconnect_counter.get() + } @@ - registry.register(Box::new(self.client_disconnect_gauge.clone()))?; + registry.register(Box::new(self.client_disconnect_counter.clone()))?;Optional follow-up (future PR): prefer
IntCounterVecwith labels likephase(connection|stream) and/orendpointfor richer diagnostics.Also applies to: 137-142, 205-209, 274-283, 287-287
🧹 Nitpick comments (1)
lib/llm/src/http/service/disconnect.rs (1)
100-103: Disconnect accounting hook — LGTM; consider labeling by phaseThe optional metrics handle is cleanly threaded and increments on unexpected closes in both phases.
Nice-to-have: capture a label for where the drop occurred (“connection” vs “stream”) to aid debugging. This would mean switching to an
IntCounterVecand callinginc_client_disconnect("connection")or("stream"). If you want, I can draft that change.Also applies to: 111-116, 126-131, 136-139, 150-153
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
lib/llm/src/grpc/service/openai.rs(1 hunks)lib/llm/src/http/service/disconnect.rs(5 hunks)lib/llm/src/http/service/metrics.rs(4 hunks)lib/llm/src/http/service/openai.rs(3 hunks)
🧰 Additional context used
🧬 Code graph analysis (3)
lib/llm/src/grpc/service/openai.rs (1)
lib/llm/src/http/service/disconnect.rs (1)
create_connection_monitor(100-123)
lib/llm/src/http/service/metrics.rs (1)
lib/llm/src/http/service/service_v2.rs (1)
new(74-86)
lib/llm/src/http/service/openai.rs (2)
lib/llm/src/http/service/disconnect.rs (1)
create_connection_monitor(100-123)lib/llm/src/http/client.rs (2)
context(316-318)context(352-354)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
- GitHub Check: Build and Test - dynamo
- GitHub Check: pre-merge-rust (.)
- GitHub Check: pre-merge-rust (lib/runtime/examples)
- GitHub Check: pre-merge-rust (lib/bindings/python)
🔇 Additional comments (4)
lib/llm/src/grpc/service/openai.rs (1)
55-56: Metrics propagation into connection monitor — LGTMPassing
Some(state.metrics_clone())ensures gRPC disconnects are counted the same way as HTTP. No issues spotted.lib/llm/src/http/service/openai.rs (3)
423-425: Chat completions wired for disconnect metrics — LGTMMatches the completions path.
656-658: Responses path: metrics handle passed — LGTM
_stream_handleis intentionally unused for unary; fine.
227-229: Approve: All create_connection_monitor call sites updated
Verified every invocation—in both HTTP and gRPC services—now uses the two-argument signature (context.clone(), Some(state.metrics_clone())). No further changes needed.
There was a problem hiding this comment.
Actionable comments posted: 0
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
lib/llm/src/http/service/metrics.rs (1)
21-27: Use a Counter (not Gauge) for “client disconnects”.This is a monotonically increasing count; Prometheus best practice is an IntCounter (ideally with “_total” suffix). Using a Gauge risks accidental decrements and confuses rate calculations.
Apply:
- use prometheus::{Encoder, HistogramOpts, HistogramVec, IntCounterVec, IntGaugeVec, Opts}; + use prometheus::{Encoder, HistogramOpts, HistogramVec, IntCounter, IntCounterVec, IntGaugeVec, Opts}; @@ - client_disconnect_gauge: prometheus::IntGauge, + client_disconnect_counter: IntCounter, @@ - let client_disconnect_gauge = prometheus::IntGauge::new( - frontend_metric_name("client_disconnects"), - "Number of connections dropped by clients", - ) + let client_disconnect_counter = prometheus::IntCounter::new( + frontend_metric_name("client_disconnects_total"), + "Total number of connections closed by clients", + ) .unwrap(); @@ - client_disconnect_gauge, + client_disconnect_counter, @@ - /// Increment the gauge for client disconnections - pub fn inc_client_disconnect(&self) { - self.client_disconnect_gauge.inc(); - } + /// Increment the counter for client disconnections + pub fn inc_client_disconnect(&self) { + self.client_disconnect_counter.inc(); + } @@ - /// Get the count of client disconnections - pub fn get_client_disconnect_count(&self) -> i64 { - self.client_disconnect_gauge.get() - } + /// Get the count of client disconnections + pub fn get_client_disconnect_count(&self) -> i64 { + self.client_disconnect_counter.get() as i64 + } @@ - registry.register(Box::new(self.client_disconnect_gauge.clone()))?; + registry.register(Box::new(self.client_disconnect_counter.clone()))?;Optional follow-up: expose as a labeled counter (e.g., by endpoint/model/phase), if you want per-slice visibility.
Also applies to: 137-142, 208-214, 274-283, 287-287
🧹 Nitpick comments (4)
lib/llm/src/http/service/openai.rs (2)
226-244: Pass-through metrics to connection monitor — good; fix error text.The new metrics handle wiring looks correct. Minor nit: the error string still says “chat completions” in the completions handler.
Apply:
- ErrorMessage::internal_server_error(&format!( - "Failed to await chat completions task: {:?}", - e, - )) + ErrorMessage::internal_server_error(&format!( + "Failed to await completions task: {:?}", + e, + ))
656-666: Metrics wiring is good; fix response handler error text.The metrics handle is correctly passed. Nit on the error message: s/“chat completions”/“responses”.
Apply:
- ErrorMessage::internal_server_error(&format!( - "Failed to await chat completions task: {:?}", - e, - )) + ErrorMessage::internal_server_error(&format!( + "Failed to await responses task: {:?}", + e, + ))lib/llm/src/http/service/metrics.rs (1)
137-142: Consider labels for richer attribution.If feasible, prefer an IntCounterVec with labels like phase={connection|stream}, transport={http|grpc}, and optionally model/endpoint to localize drops.
Example:
- let client_disconnect_counter = IntCounter::new( - frontend_metric_name("client_disconnects_total"), ...); + let client_disconnect_counter = IntCounterVec::new( + Opts::new(frontend_metric_name("client_disconnects_total"), + "Total client-initiated disconnects"), + &["phase","transport"] + )?;You can then increment with concrete labels where you know the context.
lib/llm/src/http/service/disconnect.rs (1)
100-103: Minor: make create_connection_monitor a sync function (future cleanup).The function doesn’t await; consider making it
fn(notasync) in a later PR to remove unnecessary.awaitat call sites.
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
lib/llm/src/grpc/service/openai.rs(1 hunks)lib/llm/src/http/service/disconnect.rs(5 hunks)lib/llm/src/http/service/metrics.rs(4 hunks)lib/llm/src/http/service/openai.rs(3 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (7)
- GitHub Check: Build and Test - dynamo
- GitHub Check: pre-merge-rust (lib/bindings/python)
- GitHub Check: pre-merge-rust (lib/runtime/examples)
- GitHub Check: pre-merge-rust (.)
- GitHub Check: pre-merge-rust (.)
- GitHub Check: pre-merge-rust (lib/runtime/examples)
- GitHub Check: pre-merge-rust (lib/bindings/python)
🔇 Additional comments (4)
lib/llm/src/http/service/openai.rs (1)
423-425: LGTM: metrics handle plumbed into connection monitor.This correctly enables disconnect tracking for chat completions. No further action.
lib/llm/src/grpc/service/openai.rs (1)
55-57: LGTM: gRPC path now feeds metrics into the disconnect monitor.This mirrors the HTTP path and ensures client drops on streams are recorded.
lib/llm/src/http/service/disconnect.rs (2)
100-116: Disconnect counting and cancellation path look correct.Incrementing on ClosedUnexpectedly for both primary and stream handles matches the intent and avoids false positives when gracefully disarmed.
Also applies to: 130-153
100-116: All create_connection_monitor call sites include the metrics argument – no instances remain using the old single-parameter signature.
nnshah1
left a comment
There was a problem hiding this comment.
@keivenchang @kthui for viz
|
/ok to test 0ac83d5 |
|
@grahamking Not sure what is going on with the CI here. |
|
/ok to test de9098b |
Signed-off-by: michaelfeil <me@michaelfeil.eu>
dropping user requests can lead to various exploits and should be monitored. Additionally, any issues here are hard to debug, and could e.g. help debug upstream components.
Continuation of feat: add frontend metrics to disconnect #2929
Summary by CodeRabbit
New Features
Refactor
Chores