Skip to content

feat: add frontend metrics to disconnect#2929

Closed
michaelfeil wants to merge 5 commits into
ai-dynamo:mainfrom
michaelfeil:mf/disconnect-metrics
Closed

feat: add frontend metrics to disconnect#2929
michaelfeil wants to merge 5 commits into
ai-dynamo:mainfrom
michaelfeil:mf/disconnect-metrics

Conversation

@michaelfeil

@michaelfeil michaelfeil commented Sep 8, 2025

Copy link
Copy Markdown
Contributor

Overview:

Details:

  • adds metrics for client side dropped requests in the frontend.
  • dropping user requests can lead to various exploits and should be monitored. Additionally, any issues here are hard to debug, and could e.g. help debug upstream components.

Where should the reviewer start?

Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)

  • closes GitHub issue: #xxx

Summary by CodeRabbit

  • New Features

    • Added a Prometheus metric to track client disconnects, enabling monitoring and alerting on disconnect events.
    • Introduced an optional metrics hook in connection monitoring to record disconnect counts when provided.
    • Exposed methods to increment and read the client disconnect count.
  • Refactor

    • Updated request handling paths to pass the metrics handle into the connection monitor. Behavior remains unchanged when metrics are not supplied.

@michaelfeil michaelfeil requested a review from a team as a code owner September 8, 2025 16:57
@copy-pr-bot

copy-pr-bot Bot commented Sep 8, 2025

Copy link
Copy Markdown

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@github-actions

github-actions Bot commented Sep 8, 2025

Copy link
Copy Markdown
Contributor

👋 Hi michaelfeil! Thank you for contributing to ai-dynamo/dynamo.

Just a reminder: The NVIDIA Test Github Validation CI runs an essential subset of the testing framework to quickly catch errors.Your PR reviewers may elect to test the changes comprehensively before approving your changes.

🚀

@github-actions github-actions Bot added the external-contribution Pull request is from an external contributor label Sep 8, 2025
@coderabbitai

coderabbitai Bot commented Sep 8, 2025

Copy link
Copy Markdown
Contributor

Walkthrough

Adds optional Metrics to disconnect monitoring: updates create_connection_monitor to accept Option<Arc>, propagates it internally, and increments a client-disconnect gauge on disconnect events. Wires metrics through handlers in openai.rs. Introduces a new client_disconnect_gauge in Metrics with increment and getter APIs, and registers it.

Changes

Cohort / File(s) Summary
Disconnect monitoring API and wiring
lib/llm/src/http/service/disconnect.rs
create_connection_monitor signature extended with metrics: Option<Arc>; forwarded to monitoring task; on connection/stream disconnect, conditionally calls metrics.inc_client_disconnect(). Existing graceful/disabled handling unchanged.
Metrics system updates
lib/llm/src/http/service/metrics.rs
Adds client_disconnect_gauge (IntGauge) to Metrics; initializes and registers it; adds inc_client_disconnect() and get_client_disconnect_count(). No changes to existing metrics.
Handler call sites
lib/llm/src/http/service/openai.rs
Updates three create_connection_monitor invocations to pass Some(state.metrics_clone()); control flow otherwise unchanged.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant C as Client
  participant H as HTTP Handler
  participant M as Connection Monitor
  participant X as Metrics

  C->>H: Initiate request (completions/chat/responses)
  H->>M: create_connection_monitor(ctx, Some(metrics))
  Note over H,M: Monitor spawned with optional metrics

  rect rgba(230,250,230,0.6)
    M-->>H: Provide stream_handle, connection_handle
    H->>C: Stream response
  end

  alt Disconnect detected
    M->>X: inc_client_disconnect()
    Note right of X: client_disconnect_gauge += 1
  else Graceful/disabled paths
    M-->>M: No metrics update
  end
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

Thump-thump goes my fluffy heart, hooray!
I count goodbyes in gauges as they stray—
A nibble of metrics, a byte of grace,
When clients hop off, I note the place.
Streams may end, but logs delight,
Carrots tallied, counters right. 🥕📈


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@michaelfeil michaelfeil changed the title add frontend metrics to disconnect feat: add frontend metrics to disconnect Sep 8, 2025
@github-actions github-actions Bot added the feat label Sep 8, 2025

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
lib/llm/src/http/service/disconnect.rs (1)

100-116: Add missing metrics argument to all create_connection_monitor calls

  • In lib/llm/src/grpc/service/openai.rs (line 55), the call
    create_connection_monitor(context.clone()).await
    omits the new metrics parameter. Update to pass an appropriate Option<Arc<Metrics>> (e.g., Some(state.metrics_clone())).
🧹 Nitpick comments (5)
lib/llm/src/http/service/openai.rs (2)

426-431: Chat completions: metrics plumbed correctly; also hit by rustfmt check.

Same note: formatting check failed around this hunk; cargo fmt should fix.


662-663: Responses: intentional unused stream handle; format to pass CI.

Using _stream_handle is fine for unary. The rustfmt check flagged this area; please run cargo fmt.

lib/llm/src/http/service/metrics.rs (2)

21-21: Prefer a Counter (and “_total” suffix) over a Gauge for disconnect events; consider labels.

Client disconnects are monotonically increasing events; Prometheus best practice is an IntCounter (optionally with labels like endpoint/request_type) named {prefix}_client_disconnects_total. Current IntGauge works but is semantically off.

Proposed change (minimal, no labels):

-    client_disconnect_gauge: prometheus::IntGauge,
+    client_disconnect_counter: prometheus::IntCounter,
-        let client_disconnect_gauge = prometheus::IntGauge::new(
-            frontend_metric_name("client_disconnects"),
-            "Number of connections dropped by clients",
-        )
-        .unwrap();
+        let client_disconnect_counter = prometheus::IntCounter::new(
+            frontend_metric_name("client_disconnects_total"),
+            "Total number of client-initiated disconnects",
+        )
+        .unwrap();
-            client_disconnect_gauge,
+            client_disconnect_counter,
-    /// Increment the gauge for client disconnections
-    pub fn inc_client_disconnect(&self) {
-        self.client_disconnect_gauge.inc();
-    }
+    /// Increment the counter for client disconnections
+    pub fn inc_client_disconnect(&self) {
+        self.client_disconnect_counter.inc();
+    }
-    /// Get the count of client disconnections
-    pub fn get_client_disconnect_count(&self) -> i64 {
-        self.client_disconnect_gauge.get()
-    }
+    /// Get the total count of client disconnections
+    pub fn get_client_disconnect_count(&self) -> u64 {
+        self.client_disconnect_counter.get()
+    }
-        registry.register(Box::new(self.client_disconnect_gauge.clone()))?;
+        registry.register(Box::new(self.client_disconnect_counter.clone()))?;

Optional follow-up: switch to IntCounterVec with labels endpoint, request_type for better slice-and-dice.

Also applies to: 137-142, 206-209, 274-283, 287-287


93-105: Docstring: add the new metric to the list for completeness.

Include the disconnect metric in the documented set so operators discover it.

 /// The following metrics will be created with the configured prefix:
 /// - `{prefix}_requests_total` - IntCounterVec for the total number of requests processed
 /// - `{prefix}_inflight_requests` - IntGaugeVec for the number of inflight requests
+/// - `{prefix}_client_disconnects_total` - IntCounter for client-initiated disconnects
 /// - `{prefix}_request_duration_seconds` - HistogramVec for the duration of requests
lib/llm/src/http/service/disconnect.rs (1)

136-139: Counting policy: increments on both connection and stream phases may double-count per request.

If you want “one disconnect per request,” dedupe increments (e.g., track a single AtomicBool and only inc once). Current behavior counts each unexpectedly dropped handle.

I can sketch a minimal dedupe using an Arc<AtomicBool> shared across both awaits if you want it.

Also applies to: 150-153

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 1477f6e and ded51ac.

📒 Files selected for processing (3)
  • lib/llm/src/http/service/disconnect.rs (5 hunks)
  • lib/llm/src/http/service/metrics.rs (4 hunks)
  • lib/llm/src/http/service/openai.rs (3 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
lib/llm/src/http/service/metrics.rs (1)
lib/llm/src/http/service/service_v2.rs (1)
  • new (74-86)
lib/llm/src/http/service/openai.rs (2)
lib/llm/src/http/service/disconnect.rs (1)
  • create_connection_monitor (100-123)
lib/llm/src/http/client.rs (2)
  • context (316-318)
  • context (352-354)
🪛 GitHub Actions: Rust pre-merge checks
lib/llm/src/http/service/openai.rs

[error] 224-230: Rustfmt: formatting check failed (cargo fmt -- --check). Differences detected vs rustfmt output. Run 'cargo fmt' to format the code.


[error] 423-430: Rustfmt: formatting check failed (cargo fmt -- --check). Differences detected vs rustfmt output. Run 'cargo fmt' to format the code.


[error] 659-663: Rustfmt: formatting check failed (cargo fmt -- --check). Differences detected vs rustfmt output. Run 'cargo fmt' to format the code.

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Build and Test - dynamo
🔇 Additional comments (1)
lib/llm/src/http/service/metrics.rs (1)

284-295: Remove outdated warning: metrics are registered on the correct Registry
The code calls state.metrics_clone().register(&registry) on the same registry instance that’s passed into metrics::router(registry, …), so all metrics will be scrape‐visible.

Likely an incorrect or invalid review comment.

Comment thread lib/llm/src/http/service/openai.rs Outdated
@michaelfeil

Copy link
Copy Markdown
Contributor Author

____ ERROR collecting benchmarks/prefix_data_generator/tests/test_hasher.py ____
seems unrelated?

@michaelfeil

Copy link
Copy Markdown
Contributor Author

I squashed all commits in 1 to fix ci, #2929 and sign off commit.

@michaelfeil michaelfeil closed this Sep 9, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

external-contribution Pull request is from an external contributor feat size/M

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant