feat: frontend, disconnect metrics (#1) by michaelfeil · Pull Request #2953 · ai-dynamo/dynamo

michaelfeil · 2025-09-09T04:09:49Z

adds metrics for client side dropped requests in the frontend.
dropping user requests can lead to various exploits and should be monitored. Additionally, any issues here are hard to debug, and could e.g. help debug upstream components.
Continuation of feat: add frontend metrics to disconnect #2929

Summary by CodeRabbit

New Features
- Added a Prometheus metric for client disconnects to surface dropped connections.
- Streaming requests across OpenAI-compatible endpoints now report per-connection metrics when enabled.
Refactor
- Connection monitoring is instrumented to accept optional metrics for per-connection observability without changing runtime behavior.
Chores
- Metrics wiring added to relevant request handlers; no user-facing behavior changes unless monitoring is enabled.

copy-pr-bot · 2025-09-09T04:09:54Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

github-actions · 2025-09-09T04:09:57Z

👋 Hi michaelfeil! Thank you for contributing to ai-dynamo/dynamo.

Just a reminder: The NVIDIA Test Github Validation CI runs an essential subset of the testing framework to quickly catch errors.Your PR reviewers may elect to test the changes comprehensively before approving your changes.

🚀

* add metrics to disconnect * fmt * fmt Signed-off-by: michaelfeil <me@michaelfeil.eu>

coderabbitai · 2025-09-09T04:12:13Z

Walkthrough

Adds optional metrics propagation to connection monitoring: OpenAI HTTP and gRPC handlers now pass a metrics handle into create_connection_monitor; create_connection_monitor and its internal monitor accept an optional Metrics and increment a new client_disconnect gauge on client/stream disconnects. Metrics struct adds the gauge and accessors.

Changes

Cohort / File(s)	Summary
Metrics instrumentation `lib/llm/src/http/service/metrics.rs`	Adds `client_disconnect_gauge: prometheus::IntGauge`; initializes and registers it; adds `inc_client_disconnect()` and `get_client_disconnect_count()` methods.
Disconnect monitor API and logic `lib/llm/src/http/service/disconnect.rs`	`create_connection_monitor` signature extended to accept `metrics: Option<Arc<Metrics>>`; internal `connection_monitor` updated to accept metrics and call `inc_client_disconnect()` on client or stream disconnect paths before killing engine context.
HTTP OpenAI handlers wiring `lib/llm/src/http/service/openai.rs`	Handlers updated to call `create_connection_monitor(context.clone(), Some(state.metrics_clone())).await`, passing metrics into the monitor.
gRPC OpenAI handler wiring `lib/llm/src/grpc/service/openai.rs`	Streaming handler updated to call `create_connection_monitor(context.clone(), Some(state.metrics_clone())).await`, passing metrics into the monitor.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant C as Client
  participant H as OpenAI Handler (HTTP/gRPC)
  participant D as disconnect::create_connection_monitor
  participant M as Monitor Task
  participant E as Engine Context
  participant X as Metrics (optional)

  C->>H: Start request / stream
  H->>D: create_connection_monitor(context, Some(metrics))
  D->>M: spawn monitor(context, metrics?)
  Note over M: Observe client and stream lifecycles

  alt Client disconnect or stream error
    M->>X: inc_client_disconnect() [if metrics provided]
    M->>E: kill()
  else Graceful/normal completion
    M-->>E: no-op (graceful shutdown)
  end

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Pre-merge checks (1 passed, 2 warnings)

❌ Failed checks (2 warnings)

Check name	Status	Explanation	Resolution
Title Check	⚠️ Warning	The title “feat: frontend, disconnect metrics (#1)” includes an irrelevant “frontend” scope and does not clearly focus on the primary change of adding client disconnect metrics to the LLM service, making it misleading and less concise.	Rename the title to clearly reflect the main change, for example “feat(llm): add metrics for client disconnects,” removing unrelated scope qualifiers.
Description Check	⚠️ Warning	The description does not follow the repository’s template: it lacks the required Overview, Details, Where should the reviewer start, and Related Issues sections and does not clearly describe the specific code changes.	Update the pull request description to include the template headings—Overview, Details, Where should the reviewer start, and Related Issues—and fill in each section with a concise summary of changes, key files, and any linked issue numbers.

✅ Passed checks (1 passed)

Check name	Status	Explanation
Docstring Coverage	✅ Passed	Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.

Poem

A whisk of wires, a metric hops,
Counting drops when a connection stops.
Gauges twitch as streams unwind,
Numbers tick where clients signed.
I thump approval—tidy clicks! 🥕📈

Tip

👮 Agentic pre-merge checks are now available in preview!

Pro plan users can now enable pre-merge checks in their settings to enforce checklists before merging PRs.

Built-in checks – Quickly apply ready-made checks to enforce title conventions, require pull request descriptions that follow templates, validate linked issues for compliance, and more.
Custom agentic checks – Define your own rules using CodeRabbit’s advanced agentic capabilities to enforce organization-specific policies and workflows. For example, you can instruct CodeRabbit’s agent to verify that API documentation is updated whenever API schema files are modified in a PR. Note: Upto 5 custom checks are currently allowed during the preview period. Pricing for this feature will be announced in a few weeks.

Example:

reviews:
  pre_merge_checks:
    custom_checks:
		  - name: "Undocumented Breaking Changes"
			  mode: "warning"
			  instructions: |
				  Flag potential breaking changes that are not documented:
				  1. Identify changes to public APIs/exports, CLI flags, environment variables, configuration keys, database schemas, or HTTP/GraphQL endpoints (including removed/renamed items and changes to types, required params, return values, defaults, or behavior).
				  2. Ignore purely internal/private changes (e.g., code not exported from package entry points or marked internal).
				  3. Verify documentation exists: a "Breaking Change" section in the PR description and updates to CHANGELOG.md.

Please share your feedback with us on this Discord post.

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

michaelfeil · 2025-09-09T04:14:24Z

@coderabbitai help

coderabbitai · 2025-09-09T04:14:30Z

Chat

There are 3 ways to chat with CodeRabbit:

Review comments: Directly reply to a review comment made by CodeRabbit. Example:
- I pushed a fix in commit <commit_id>, please review it.
- Open a follow-up GitHub issue for this discussion.
Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query.
PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
- @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
- @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.

CodeRabbit commands

These commands are invoked using PR/Issue comments.

@coderabbitai pause to pause the reviews on a PR.
@coderabbitai resume to resume the paused reviews.
@coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
@coderabbitai full review to do a full review from scratch and review all the files again.
@coderabbitai summary to regenerate the summary of the PR.
@coderabbitai evaluate custom pre-merge check --instructions <custom-checks-instructions> --name <custom-checks-title> [--mode <error|warning>] to validate the custom pre-merge checks instructions. Defaults to error when --mode is omitted.
- @coderabbitai ignore pre-merge checks to override pre-merge checks and get an approval on PR.
- @coderabbitai run pre-merge checks to run pre-merge checks on the pull request.
@coderabbitai generate sequence diagram to generate a sequence diagram of the changes in this PR.
@coderabbitai resolve to resolve all the CodeRabbit review comments.
@coderabbitai configuration to show the current CodeRabbit configuration for the repository.
@coderabbitai help to get help.

Other keywords and placeholders

Add @coderabbitai ignore or @coderabbit ignore anywhere in the PR description to prevent this PR from being reviewed.
Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
Add @coderabbitai anywhere in the PR title to generate the title automatically.

Status, support, documentation and community

Visit our status page to check the current availability of CodeRabbit.
Create a ticket on our support page for assistance with any issues or questions.
Visit our documentation site for detailed information on how to use CodeRabbit.
Join our Discord community to connect with other users and get help from the community.
Follow us on X/Twitter for updates and announcements.

michaelfeil · 2025-09-09T04:14:48Z

@coderabbitai review

coderabbitai · 2025-09-09T04:14:53Z

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

michaelfeil · 2025-09-09T04:15:16Z

@coderabbitai full review

coderabbitai · 2025-09-09T04:15:22Z

✅ Actions performed

Full review triggered.

coderabbitai

Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

lib/llm/src/http/service/metrics.rs (1)

21-27: Use a Counter (monotonic) instead of a Gauge for disconnect events

Disconnects are events, not a level. Prometheus best practice: model as a counter with _total suffix. Current IntGauge risks accidental decrements and breaks standard rate queries.

Apply the following minimal change:

--- a/lib/llm/src/http/service/metrics.rs
+++ b/lib/llm/src/http/service/metrics.rs
@@
 pub struct Metrics {
     request_counter: IntCounterVec,
     inflight_gauge: IntGaugeVec,
-    client_disconnect_gauge: prometheus::IntGauge,
+    client_disconnect_counter: prometheus::IntCounter,
     request_duration: HistogramVec,
@@
-        let client_disconnect_gauge = prometheus::IntGauge::new(
-            frontend_metric_name("client_disconnects"),
-            "Number of connections dropped by clients",
-        )
+        let client_disconnect_counter = prometheus::IntCounter::new(
+            frontend_metric_name("client_disconnects_total"),
+            "Total number of client disconnect events observed",
+        )
         .unwrap();
@@
         Metrics {
             request_counter,
             inflight_gauge,
-            client_disconnect_gauge,
+            client_disconnect_counter,
             request_duration,
@@
-    /// Increment the gauge for client disconnections
-    pub fn inc_client_disconnect(&self) {
-        self.client_disconnect_gauge.inc();
-    }
+    /// Increment the counter for client disconnections
+    pub fn inc_client_disconnect(&self) {
+        self.client_disconnect_counter.inc();
+    }
@@
-    /// Get the count of client disconnections
-    pub fn get_client_disconnect_count(&self) -> i64 {
-        self.client_disconnect_gauge.get()
-    }
+    /// Get the total count of client disconnections
+    pub fn get_client_disconnect_count(&self) -> u64 {
+        self.client_disconnect_counter.get()
+    }
@@
-        registry.register(Box::new(self.client_disconnect_gauge.clone()))?;
+        registry.register(Box::new(self.client_disconnect_counter.clone()))?;

Optional follow-up (future PR): prefer IntCounterVec with labels like phase (connection|stream) and/or endpoint for richer diagnostics.

Also applies to: 137-142, 205-209, 274-283, 287-287

🧹 Nitpick comments (1)

lib/llm/src/http/service/disconnect.rs (1)

100-103: Disconnect accounting hook — LGTM; consider labeling by phase

The optional metrics handle is cleanly threaded and increments on unexpected closes in both phases.

Nice-to-have: capture a label for where the drop occurred (“connection” vs “stream”) to aid debugging. This would mean switching to an IntCounterVec and calling inc_client_disconnect("connection") or ("stream"). If you want, I can draft that change.

Also applies to: 111-116, 126-131, 136-139, 150-153

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 09c7b73 and 0ac83d5.

📒 Files selected for processing (4)

lib/llm/src/grpc/service/openai.rs (1 hunks)
lib/llm/src/http/service/disconnect.rs (5 hunks)
lib/llm/src/http/service/metrics.rs (4 hunks)
lib/llm/src/http/service/openai.rs (3 hunks)

🧰 Additional context used

🧬 Code graph analysis (3)

lib/llm/src/grpc/service/openai.rs (1)

lib/llm/src/http/service/disconnect.rs (1)

create_connection_monitor (100-123)

lib/llm/src/http/service/metrics.rs (1)

lib/llm/src/http/service/service_v2.rs (1)

new (74-86)

lib/llm/src/http/service/openai.rs (2)

lib/llm/src/http/service/disconnect.rs (1)

create_connection_monitor (100-123)

lib/llm/src/http/client.rs (2)

context (316-318)

context (352-354)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)

GitHub Check: Build and Test - dynamo
GitHub Check: pre-merge-rust (.)
GitHub Check: pre-merge-rust (lib/runtime/examples)
GitHub Check: pre-merge-rust (lib/bindings/python)

🔇 Additional comments (4)

lib/llm/src/grpc/service/openai.rs (1)

55-56: Metrics propagation into connection monitor — LGTM

Passing Some(state.metrics_clone()) ensures gRPC disconnects are counted the same way as HTTP. No issues spotted.

lib/llm/src/http/service/openai.rs (3)

423-425: Chat completions wired for disconnect metrics — LGTM

Matches the completions path.

656-658: Responses path: metrics handle passed — LGTM

_stream_handle is intentionally unused for unary; fine.

227-229: Approve: All create_connection_monitor call sites updated
Verified every invocation—in both HTTP and gRPC services—now uses the two-argument signature (context.clone(), Some(state.metrics_clone())). No further changes needed.

coderabbitai

Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

lib/llm/src/http/service/metrics.rs (1)

21-27: Use a Counter (not Gauge) for “client disconnects”.

This is a monotonically increasing count; Prometheus best practice is an IntCounter (ideally with “_total” suffix). Using a Gauge risks accidental decrements and confuses rate calculations.

Apply:

- use prometheus::{Encoder, HistogramOpts, HistogramVec, IntCounterVec, IntGaugeVec, Opts};
+ use prometheus::{Encoder, HistogramOpts, HistogramVec, IntCounter, IntCounterVec, IntGaugeVec, Opts};
@@
-    client_disconnect_gauge: prometheus::IntGauge,
+    client_disconnect_counter: IntCounter,
@@
-        let client_disconnect_gauge = prometheus::IntGauge::new(
-            frontend_metric_name("client_disconnects"),
-            "Number of connections dropped by clients",
-        )
+        let client_disconnect_counter = prometheus::IntCounter::new(
+            frontend_metric_name("client_disconnects_total"),
+            "Total number of connections closed by clients",
+        )
         .unwrap();
@@
-            client_disconnect_gauge,
+            client_disconnect_counter,
@@
-    /// Increment the gauge for client disconnections
-    pub fn inc_client_disconnect(&self) {
-        self.client_disconnect_gauge.inc();
-    }
+    /// Increment the counter for client disconnections
+    pub fn inc_client_disconnect(&self) {
+        self.client_disconnect_counter.inc();
+    }
@@
-    /// Get the count of client disconnections
-    pub fn get_client_disconnect_count(&self) -> i64 {
-        self.client_disconnect_gauge.get()
-    }
+    /// Get the count of client disconnections
+    pub fn get_client_disconnect_count(&self) -> i64 {
+        self.client_disconnect_counter.get() as i64
+    }
@@
-        registry.register(Box::new(self.client_disconnect_gauge.clone()))?;
+        registry.register(Box::new(self.client_disconnect_counter.clone()))?;

Optional follow-up: expose as a labeled counter (e.g., by endpoint/model/phase), if you want per-slice visibility.

Also applies to: 137-142, 208-214, 274-283, 287-287

🧹 Nitpick comments (4)

lib/llm/src/http/service/openai.rs (2)
226-244: Pass-through metrics to connection monitor — good; fix error text.

The new metrics handle wiring looks correct. Minor nit: the error string still says “chat completions” in the completions handler.

Apply:
-            ErrorMessage::internal_server_error(&format!(
-                "Failed to await chat completions task: {:?}",
-                e,
-            ))
+            ErrorMessage::internal_server_error(&format!(
+                "Failed to await completions task: {:?}",
+                e,
+            ))
656-666: Metrics wiring is good; fix response handler error text.

The metrics handle is correctly passed. Nit on the error message: s/“chat completions”/“responses”.

Apply:
-            ErrorMessage::internal_server_error(&format!(
-                "Failed to await chat completions task: {:?}",
-                e,
-            ))
+            ErrorMessage::internal_server_error(&format!(
+                "Failed to await responses task: {:?}",
+                e,
+            ))
lib/llm/src/http/service/metrics.rs (1)
137-142: Consider labels for richer attribution.

If feasible, prefer an IntCounterVec with labels like phase={connection|stream}, transport={http|grpc}, and optionally model/endpoint to localize drops.

Example:
- let client_disconnect_counter = IntCounter::new(
-   frontend_metric_name("client_disconnects_total"), ...);
+ let client_disconnect_counter = IntCounterVec::new(
+   Opts::new(frontend_metric_name("client_disconnects_total"),
+             "Total client-initiated disconnects"),
+   &["phase","transport"]
+ )?;
You can then increment with concrete labels where you know the context.
lib/llm/src/http/service/disconnect.rs (1)

100-103: Minor: make create_connection_monitor a sync function (future cleanup).

The function doesn’t await; consider making it fn (not async) in a later PR to remove unnecessary .await at call sites.

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 09c7b73 and 0ac83d5.

📒 Files selected for processing (4)

lib/llm/src/grpc/service/openai.rs (1 hunks)
lib/llm/src/http/service/disconnect.rs (5 hunks)
lib/llm/src/http/service/metrics.rs (4 hunks)
lib/llm/src/http/service/openai.rs (3 hunks)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (7)

GitHub Check: Build and Test - dynamo
GitHub Check: pre-merge-rust (lib/bindings/python)
GitHub Check: pre-merge-rust (lib/runtime/examples)
GitHub Check: pre-merge-rust (.)
GitHub Check: pre-merge-rust (.)
GitHub Check: pre-merge-rust (lib/runtime/examples)
GitHub Check: pre-merge-rust (lib/bindings/python)

🔇 Additional comments (4)

lib/llm/src/http/service/openai.rs (1)

423-425: LGTM: metrics handle plumbed into connection monitor.

This correctly enables disconnect tracking for chat completions. No further action.

lib/llm/src/grpc/service/openai.rs (1)

55-57: LGTM: gRPC path now feeds metrics into the disconnect monitor.

This mirrors the HTTP path and ensures client drops on streams are recorded.

lib/llm/src/http/service/disconnect.rs (2)

100-116: Disconnect counting and cancellation path look correct.

Incrementing on ClosedUnexpectedly for both primary and stream handles matches the intent and avoids false positives when gracefully disarmed.

Also applies to: 130-153

100-116: All create_connection_monitor call sites include the metrics argument – no instances remain using the old single-parameter signature.

nnshah1

@keivenchang @kthui for viz

grahamking · 2025-09-09T15:47:45Z

/ok to test 0ac83d5

michaelfeil · 2025-09-09T22:35:54Z

@grahamking Not sure what is going on with the CI here.

grahamking · 2025-09-10T18:21:49Z

/ok to test de9098b

Signed-off-by: michaelfeil <me@michaelfeil.eu> Signed-off-by: ayushag <ayushag@nvidia.com>

Signed-off-by: michaelfeil <me@michaelfeil.eu> Signed-off-by: zhongdaor <zhongdaor@nvidia.com>

Signed-off-by: michaelfeil <me@michaelfeil.eu>

michaelfeil requested a review from a team as a code owner September 9, 2025 04:09

pull-request-size Bot added the size/M label Sep 9, 2025

github-actions Bot added external-contribution Pull request is from an external contributor feat labels Sep 9, 2025

Mf/disconnect metrics (#1)

0ac83d5

* add metrics to disconnect * fmt * fmt Signed-off-by: michaelfeil <me@michaelfeil.eu>

michaelfeil force-pushed the mf/track-cancellations branch from 5f0a7c4 to 0ac83d5 Compare September 9, 2025 04:12

michaelfeil closed this Sep 9, 2025

michaelfeil reopened this Sep 9, 2025

coderabbitai Bot reviewed Sep 9, 2025

View reviewed changes

nnshah1 approved these changes Sep 9, 2025

View reviewed changes

copy-pr-bot Bot temporarily deployed to GITLAB September 9, 2025 15:47 Inactive

copy-pr-bot Bot temporarily deployed to GITLAB September 9, 2025 15:48 Inactive

Merge branch 'main' into mf/track-cancellations

de9098b

copy-pr-bot Bot temporarily deployed to GITLAB September 10, 2025 18:21 Inactive

copy-pr-bot Bot temporarily deployed to GITLAB September 10, 2025 18:24 Inactive

grahamking enabled auto-merge (squash) September 10, 2025 18:31

grahamking approved these changes Sep 10, 2025

View reviewed changes

grahamking merged commit e42746a into ai-dynamo:main Sep 10, 2025
15 of 16 checks passed

ayushag-nv pushed a commit that referenced this pull request Sep 15, 2025

feat: frontend, disconnect metrics (#1) (#2953)

031f95c

Signed-off-by: michaelfeil <me@michaelfeil.eu> Signed-off-by: ayushag <ayushag@nvidia.com>

zhongdaor-nv pushed a commit that referenced this pull request Sep 15, 2025

feat: frontend, disconnect metrics (#1) (#2953)

aba1dd1

Signed-off-by: michaelfeil <me@michaelfeil.eu> Signed-off-by: zhongdaor <zhongdaor@nvidia.com>

yao531441 pushed a commit to yao531441/dynamo that referenced this pull request May 13, 2026

feat: frontend, disconnect metrics (#1) (ai-dynamo#2953)

4d5c088

Signed-off-by: michaelfeil <me@michaelfeil.eu>

Conversation

michaelfeil commented Sep 9, 2025 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary by CodeRabbit

Uh oh!

copy-pr-bot Bot commented Sep 9, 2025

Uh oh!

github-actions Bot commented Sep 9, 2025

Uh oh!

coderabbitai Bot commented Sep 9, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Pre-merge checks (1 passed, 2 warnings)

Poem

Uh oh!

michaelfeil commented Sep 9, 2025

Uh oh!

coderabbitai Bot commented Sep 9, 2025

Chat

CodeRabbit commands

Other keywords and placeholders

Status, support, documentation and community

Uh oh!

michaelfeil commented Sep 9, 2025

Uh oh!

coderabbitai Bot commented Sep 9, 2025

Uh oh!

michaelfeil commented Sep 9, 2025

Uh oh!

coderabbitai Bot commented Sep 9, 2025

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

nnshah1 left a comment

Choose a reason for hiding this comment

Uh oh!

grahamking commented Sep 9, 2025

Uh oh!

michaelfeil commented Sep 9, 2025

Uh oh!

grahamking commented Sep 10, 2025

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

michaelfeil commented Sep 9, 2025 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented Sep 9, 2025 •

edited

Loading