feat: add frontend metrics to disconnect by michaelfeil · Pull Request #2929 · ai-dynamo/dynamo

michaelfeil · 2025-09-08T16:57:17Z

Overview:

Details:

adds metrics for client side dropped requests in the frontend.
dropping user requests can lead to various exploits and should be monitored. Additionally, any issues here are hard to debug, and could e.g. help debug upstream components.

Where should the reviewer start?

Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)

closes GitHub issue: #xxx

Summary by CodeRabbit

New Features
- Added a Prometheus metric to track client disconnects, enabling monitoring and alerting on disconnect events.
- Introduced an optional metrics hook in connection monitoring to record disconnect counts when provided.
- Exposed methods to increment and read the client disconnect count.
Refactor
- Updated request handling paths to pass the metrics handle into the connection monitor. Behavior remains unchanged when metrics are not supplied.

copy-pr-bot · 2025-09-08T16:57:22Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

github-actions · 2025-09-08T16:57:26Z

👋 Hi michaelfeil! Thank you for contributing to ai-dynamo/dynamo.

Just a reminder: The NVIDIA Test Github Validation CI runs an essential subset of the testing framework to quickly catch errors.Your PR reviewers may elect to test the changes comprehensively before approving your changes.

🚀

coderabbitai · 2025-09-08T16:59:47Z

Walkthrough

Adds optional Metrics to disconnect monitoring: updates create_connection_monitor to accept Option<Arc>, propagates it internally, and increments a client-disconnect gauge on disconnect events. Wires metrics through handlers in openai.rs. Introduces a new client_disconnect_gauge in Metrics with increment and getter APIs, and registers it.

Changes

Cohort / File(s)	Summary
Disconnect monitoring API and wiring `lib/llm/src/http/service/disconnect.rs`	create_connection_monitor signature extended with metrics: Option<Arc>; forwarded to monitoring task; on connection/stream disconnect, conditionally calls metrics.inc_client_disconnect(). Existing graceful/disabled handling unchanged.
Metrics system updates `lib/llm/src/http/service/metrics.rs`	Adds client_disconnect_gauge (IntGauge) to Metrics; initializes and registers it; adds inc_client_disconnect() and get_client_disconnect_count(). No changes to existing metrics.
Handler call sites `lib/llm/src/http/service/openai.rs`	Updates three create_connection_monitor invocations to pass Some(state.metrics_clone()); control flow otherwise unchanged.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant C as Client
  participant H as HTTP Handler
  participant M as Connection Monitor
  participant X as Metrics

  C->>H: Initiate request (completions/chat/responses)
  H->>M: create_connection_monitor(ctx, Some(metrics))
  Note over H,M: Monitor spawned with optional metrics

  rect rgba(230,250,230,0.6)
    M-->>H: Provide stream_handle, connection_handle
    H->>C: Stream response
  end

  alt Disconnect detected
    M->>X: inc_client_disconnect()
    Note right of X: client_disconnect_gauge += 1
  else Graceful/disabled paths
    M-->>M: No metrics update
  end

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

Thump-thump goes my fluffy heart, hooray!
I count goodbyes in gauges as they stray—
A nibble of metrics, a byte of grace,
When clients hop off, I note the place.
Streams may end, but logs delight,
Carrots tallied, counters right. 🥕📈

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

lib/llm/src/http/service/disconnect.rs (1)

100-116: Add missing metrics argument to all create_connection_monitor calls

In lib/llm/src/grpc/service/openai.rs (line 55), the call
create_connection_monitor(context.clone()).await
omits the new metrics parameter. Update to pass an appropriate Option<Arc<Metrics>> (e.g., Some(state.metrics_clone())).

🧹 Nitpick comments (5)

lib/llm/src/http/service/openai.rs (2)

426-431: Chat completions: metrics plumbed correctly; also hit by rustfmt check.

Same note: formatting check failed around this hunk; cargo fmt should fix.

662-663: Responses: intentional unused stream handle; format to pass CI.

Using _stream_handle is fine for unary. The rustfmt check flagged this area; please run cargo fmt.
lib/llm/src/http/service/metrics.rs (2)
21-21: Prefer a Counter (and “_total” suffix) over a Gauge for disconnect events; consider labels.

Client disconnects are monotonically increasing events; Prometheus best practice is an IntCounter (optionally with labels like endpoint/request_type) named {prefix}_client_disconnects_total. Current IntGauge works but is semantically off.

Proposed change (minimal, no labels):
-    client_disconnect_gauge: prometheus::IntGauge,
+    client_disconnect_counter: prometheus::IntCounter,
-        let client_disconnect_gauge = prometheus::IntGauge::new(
-            frontend_metric_name("client_disconnects"),
-            "Number of connections dropped by clients",
-        )
-        .unwrap();
+        let client_disconnect_counter = prometheus::IntCounter::new(
+            frontend_metric_name("client_disconnects_total"),
+            "Total number of client-initiated disconnects",
+        )
+        .unwrap();
-            client_disconnect_gauge,
+            client_disconnect_counter,
-    /// Increment the gauge for client disconnections
-    pub fn inc_client_disconnect(&self) {
-        self.client_disconnect_gauge.inc();
-    }
+    /// Increment the counter for client disconnections
+    pub fn inc_client_disconnect(&self) {
+        self.client_disconnect_counter.inc();
+    }
-    /// Get the count of client disconnections
-    pub fn get_client_disconnect_count(&self) -> i64 {
-        self.client_disconnect_gauge.get()
-    }
+    /// Get the total count of client disconnections
+    pub fn get_client_disconnect_count(&self) -> u64 {
+        self.client_disconnect_counter.get()
+    }
-        registry.register(Box::new(self.client_disconnect_gauge.clone()))?;
+        registry.register(Box::new(self.client_disconnect_counter.clone()))?;
Optional follow-up: switch to IntCounterVec with labels endpoint, request_type for better slice-and-dice.

Also applies to: 137-142, 206-209, 274-283, 287-287

93-105: Docstring: add the new metric to the list for completeness.

Include the disconnect metric in the documented set so operators discover it.
 /// The following metrics will be created with the configured prefix:
 /// - `{prefix}_requests_total` - IntCounterVec for the total number of requests processed
 /// - `{prefix}_inflight_requests` - IntGaugeVec for the number of inflight requests
+/// - `{prefix}_client_disconnects_total` - IntCounter for client-initiated disconnects
 /// - `{prefix}_request_duration_seconds` - HistogramVec for the duration of requests
lib/llm/src/http/service/disconnect.rs (1)

136-139: Counting policy: increments on both connection and stream phases may double-count per request.

If you want “one disconnect per request,” dedupe increments (e.g., track a single AtomicBool and only inc once). Current behavior counts each unexpectedly dropped handle.

I can sketch a minimal dedupe using an Arc<AtomicBool> shared across both awaits if you want it.

Also applies to: 150-153

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 1477f6e and ded51ac.

📒 Files selected for processing (3)

lib/llm/src/http/service/disconnect.rs (5 hunks)
lib/llm/src/http/service/metrics.rs (4 hunks)
lib/llm/src/http/service/openai.rs (3 hunks)

🧰 Additional context used

🧬 Code graph analysis (2)

lib/llm/src/http/service/metrics.rs (1)

lib/llm/src/http/service/service_v2.rs (1)

new (74-86)

lib/llm/src/http/service/openai.rs (2)

lib/llm/src/http/service/disconnect.rs (1)

create_connection_monitor (100-123)

lib/llm/src/http/client.rs (2)

context (316-318)

context (352-354)

🪛 GitHub Actions: Rust pre-merge checks

lib/llm/src/http/service/openai.rs

[error] 224-230: Rustfmt: formatting check failed (cargo fmt -- --check). Differences detected vs rustfmt output. Run 'cargo fmt' to format the code.

[error] 423-430: Rustfmt: formatting check failed (cargo fmt -- --check). Differences detected vs rustfmt output. Run 'cargo fmt' to format the code.

[error] 659-663: Rustfmt: formatting check failed (cargo fmt -- --check). Differences detected vs rustfmt output. Run 'cargo fmt' to format the code.

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)

GitHub Check: Build and Test - dynamo

🔇 Additional comments (1)

lib/llm/src/http/service/metrics.rs (1)

284-295: Remove outdated warning: metrics are registered on the correct Registry
The code calls state.metrics_clone().register(&registry) on the same registry instance that’s passed into metrics::router(registry, …), so all metrics will be scrape‐visible.

Likely an incorrect or invalid review comment.

michaelfeil · 2025-09-08T18:04:22Z

____ ERROR collecting benchmarks/prefix_data_generator/tests/test_hasher.py ____
seems unrelated?

michaelfeil · 2025-09-09T04:10:16Z

I squashed all commits in 1 to fix ci, #2929 and sign off commit.

add metrics to disconnect

0a17322

michaelfeil requested a review from a team as a code owner September 8, 2025 16:57

pull-request-size Bot added the size/M label Sep 8, 2025

github-actions Bot added the external-contribution Pull request is from an external contributor label Sep 8, 2025

Merge branch 'main' into mf/disconnect-metrics

ded51ac

fmt

ba17e00

michaelfeil changed the title ~~add frontend metrics to disconnect~~ feat: add frontend metrics to disconnect Sep 8, 2025

github-actions Bot added the feat label Sep 8, 2025

fmt

373e89c

coderabbitai Bot reviewed Sep 8, 2025

View reviewed changes

Comment thread lib/llm/src/http/service/openai.rs Outdated

Merge branch 'main' into mf/disconnect-metrics

8b8b4dc

michaelfeil mentioned this pull request Sep 9, 2025

feat: frontend, disconnect metrics (#1) #2953

Merged

michaelfeil closed this Sep 9, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: add frontend metrics to disconnect#2929

feat: add frontend metrics to disconnect#2929
michaelfeil wants to merge 5 commits into
ai-dynamo:mainfrom
michaelfeil:mf/disconnect-metrics

michaelfeil commented Sep 8, 2025 •

edited

Loading

Uh oh!

copy-pr-bot Bot commented Sep 8, 2025

Uh oh!

github-actions Bot commented Sep 8, 2025

Uh oh!

coderabbitai Bot commented Sep 8, 2025 •

edited

Loading

Uh oh!

coderabbitai Bot left a comment

Uh oh!

Uh oh!

michaelfeil commented Sep 8, 2025

Uh oh!

michaelfeil commented Sep 9, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

michaelfeil commented Sep 8, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Overview:

Details:

Where should the reviewer start?

Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)

Summary by CodeRabbit

Uh oh!

copy-pr-bot Bot commented Sep 8, 2025

Uh oh!

github-actions Bot commented Sep 8, 2025

Uh oh!

coderabbitai Bot commented Sep 8, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Poem

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

michaelfeil commented Sep 8, 2025

Uh oh!

michaelfeil commented Sep 9, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

michaelfeil commented Sep 8, 2025 •

edited

Loading

coderabbitai Bot commented Sep 8, 2025 •

edited

Loading