Skip to content

feat: frontend, disconnect metrics (#1)#2953

Merged
grahamking merged 2 commits into
ai-dynamo:mainfrom
michaelfeil:mf/track-cancellations
Sep 10, 2025
Merged

feat: frontend, disconnect metrics (#1)#2953
grahamking merged 2 commits into
ai-dynamo:mainfrom
michaelfeil:mf/track-cancellations

Conversation

@michaelfeil

@michaelfeil michaelfeil commented Sep 9, 2025

Copy link
Copy Markdown
Contributor
  • adds metrics for client side dropped requests in the frontend.
    dropping user requests can lead to various exploits and should be monitored. Additionally, any issues here are hard to debug, and could e.g. help debug upstream components.
    Continuation of feat: add frontend metrics to disconnect #2929

Summary by CodeRabbit

  • New Features

    • Added a Prometheus metric for client disconnects to surface dropped connections.
    • Streaming requests across OpenAI-compatible endpoints now report per-connection metrics when enabled.
  • Refactor

    • Connection monitoring is instrumented to accept optional metrics for per-connection observability without changing runtime behavior.
  • Chores

    • Metrics wiring added to relevant request handlers; no user-facing behavior changes unless monitoring is enabled.

@michaelfeil michaelfeil requested a review from a team as a code owner September 9, 2025 04:09
@copy-pr-bot

copy-pr-bot Bot commented Sep 9, 2025

Copy link
Copy Markdown

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@github-actions

github-actions Bot commented Sep 9, 2025

Copy link
Copy Markdown
Contributor

👋 Hi michaelfeil! Thank you for contributing to ai-dynamo/dynamo.

Just a reminder: The NVIDIA Test Github Validation CI runs an essential subset of the testing framework to quickly catch errors.Your PR reviewers may elect to test the changes comprehensively before approving your changes.

🚀

@github-actions github-actions Bot added external-contribution Pull request is from an external contributor feat labels Sep 9, 2025
* add metrics to disconnect

* fmt

* fmt

Signed-off-by: michaelfeil <me@michaelfeil.eu>
@michaelfeil michaelfeil force-pushed the mf/track-cancellations branch from 5f0a7c4 to 0ac83d5 Compare September 9, 2025 04:12
@coderabbitai

coderabbitai Bot commented Sep 9, 2025

Copy link
Copy Markdown
Contributor

Walkthrough

Adds optional metrics propagation to connection monitoring: OpenAI HTTP and gRPC handlers now pass a metrics handle into create_connection_monitor; create_connection_monitor and its internal monitor accept an optional Metrics and increment a new client_disconnect gauge on client/stream disconnects. Metrics struct adds the gauge and accessors.

Changes

Cohort / File(s) Summary
Metrics instrumentation
lib/llm/src/http/service/metrics.rs
Adds client_disconnect_gauge: prometheus::IntGauge; initializes and registers it; adds inc_client_disconnect() and get_client_disconnect_count() methods.
Disconnect monitor API and logic
lib/llm/src/http/service/disconnect.rs
create_connection_monitor signature extended to accept metrics: Option<Arc<Metrics>>; internal connection_monitor updated to accept metrics and call inc_client_disconnect() on client or stream disconnect paths before killing engine context.
HTTP OpenAI handlers wiring
lib/llm/src/http/service/openai.rs
Handlers updated to call create_connection_monitor(context.clone(), Some(state.metrics_clone())).await, passing metrics into the monitor.
gRPC OpenAI handler wiring
lib/llm/src/grpc/service/openai.rs
Streaming handler updated to call create_connection_monitor(context.clone(), Some(state.metrics_clone())).await, passing metrics into the monitor.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant C as Client
  participant H as OpenAI Handler (HTTP/gRPC)
  participant D as disconnect::create_connection_monitor
  participant M as Monitor Task
  participant E as Engine Context
  participant X as Metrics (optional)

  C->>H: Start request / stream
  H->>D: create_connection_monitor(context, Some(metrics))
  D->>M: spawn monitor(context, metrics?)
  Note over M: Observe client and stream lifecycles

  alt Client disconnect or stream error
    M->>X: inc_client_disconnect() [if metrics provided]
    M->>E: kill()
  else Graceful/normal completion
    M-->>E: no-op (graceful shutdown)
  end
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Pre-merge checks (1 passed, 2 warnings)

❌ Failed checks (2 warnings)
Check name Status Explanation Resolution
Title Check ⚠️ Warning The title “feat: frontend, disconnect metrics (#1)” includes an irrelevant “frontend” scope and does not clearly focus on the primary change of adding client disconnect metrics to the LLM service, making it misleading and less concise. Rename the title to clearly reflect the main change, for example “feat(llm): add metrics for client disconnects,” removing unrelated scope qualifiers.
Description Check ⚠️ Warning The description does not follow the repository’s template: it lacks the required Overview, Details, Where should the reviewer start, and Related Issues sections and does not clearly describe the specific code changes. Update the pull request description to include the template headings—Overview, Details, Where should the reviewer start, and Related Issues—and fill in each section with a concise summary of changes, key files, and any linked issue numbers.
✅ Passed checks (1 passed)
Check name Status Explanation
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.

Poem

A whisk of wires, a metric hops,
Counting drops when a connection stops.
Gauges twitch as streams unwind,
Numbers tick where clients signed.
I thump approval—tidy clicks! 🥕📈

Tip

👮 Agentic pre-merge checks are now available in preview!

Pro plan users can now enable pre-merge checks in their settings to enforce checklists before merging PRs.

  • Built-in checks – Quickly apply ready-made checks to enforce title conventions, require pull request descriptions that follow templates, validate linked issues for compliance, and more.
  • Custom agentic checks – Define your own rules using CodeRabbit’s advanced agentic capabilities to enforce organization-specific policies and workflows. For example, you can instruct CodeRabbit’s agent to verify that API documentation is updated whenever API schema files are modified in a PR. Note: Upto 5 custom checks are currently allowed during the preview period. Pricing for this feature will be announced in a few weeks.

Example:

reviews:
  pre_merge_checks:
    custom_checks:
		  - name: "Undocumented Breaking Changes"
			  mode: "warning"
			  instructions: |
				  Flag potential breaking changes that are not documented:
				  1. Identify changes to public APIs/exports, CLI flags, environment variables, configuration keys, database schemas, or HTTP/GraphQL endpoints (including removed/renamed items and changes to types, required params, return values, defaults, or behavior).
				  2. Ignore purely internal/private changes (e.g., code not exported from package entry points or marked internal).
				  3. Verify documentation exists: a "Breaking Change" section in the PR description and updates to CHANGELOG.md.

Please share your feedback with us on this Discord post.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@michaelfeil michaelfeil closed this Sep 9, 2025
@michaelfeil michaelfeil reopened this Sep 9, 2025
@michaelfeil

Copy link
Copy Markdown
Contributor Author

@coderabbitai help

@coderabbitai

coderabbitai Bot commented Sep 9, 2025

Copy link
Copy Markdown
Contributor

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.

CodeRabbit commands

These commands are invoked using PR/Issue comments.

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai evaluate custom pre-merge check --instructions <custom-checks-instructions> --name <custom-checks-title> [--mode <error|warning>] to validate the custom pre-merge checks instructions. Defaults to error when --mode is omitted.
    • @coderabbitai ignore pre-merge checks to override pre-merge checks and get an approval on PR.
    • @coderabbitai run pre-merge checks to run pre-merge checks on the pull request.
  • @coderabbitai generate sequence diagram to generate a sequence diagram of the changes in this PR.
  • @coderabbitai resolve to resolve all the CodeRabbit review comments.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai help to get help.

Other keywords and placeholders

  • Add @coderabbitai ignore or @coderabbit ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

Status, support, documentation and community

  • Visit our status page to check the current availability of CodeRabbit.
  • Create a ticket on our support page for assistance with any issues or questions.
  • Visit our documentation site for detailed information on how to use CodeRabbit.
  • Join our Discord community to connect with other users and get help from the community.
  • Follow us on X/Twitter for updates and announcements.

@michaelfeil

Copy link
Copy Markdown
Contributor Author

@coderabbitai review

@coderabbitai

coderabbitai Bot commented Sep 9, 2025

Copy link
Copy Markdown
Contributor
✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@michaelfeil

Copy link
Copy Markdown
Contributor Author

@coderabbitai full review

@coderabbitai

coderabbitai Bot commented Sep 9, 2025

Copy link
Copy Markdown
Contributor
✅ Actions performed

Full review triggered.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
lib/llm/src/http/service/metrics.rs (1)

21-27: Use a Counter (monotonic) instead of a Gauge for disconnect events

Disconnects are events, not a level. Prometheus best practice: model as a counter with _total suffix. Current IntGauge risks accidental decrements and breaks standard rate queries.

Apply the following minimal change:

--- a/lib/llm/src/http/service/metrics.rs
+++ b/lib/llm/src/http/service/metrics.rs
@@
 pub struct Metrics {
     request_counter: IntCounterVec,
     inflight_gauge: IntGaugeVec,
-    client_disconnect_gauge: prometheus::IntGauge,
+    client_disconnect_counter: prometheus::IntCounter,
     request_duration: HistogramVec,
@@
-        let client_disconnect_gauge = prometheus::IntGauge::new(
-            frontend_metric_name("client_disconnects"),
-            "Number of connections dropped by clients",
-        )
+        let client_disconnect_counter = prometheus::IntCounter::new(
+            frontend_metric_name("client_disconnects_total"),
+            "Total number of client disconnect events observed",
+        )
         .unwrap();
@@
         Metrics {
             request_counter,
             inflight_gauge,
-            client_disconnect_gauge,
+            client_disconnect_counter,
             request_duration,
@@
-    /// Increment the gauge for client disconnections
-    pub fn inc_client_disconnect(&self) {
-        self.client_disconnect_gauge.inc();
-    }
+    /// Increment the counter for client disconnections
+    pub fn inc_client_disconnect(&self) {
+        self.client_disconnect_counter.inc();
+    }
@@
-    /// Get the count of client disconnections
-    pub fn get_client_disconnect_count(&self) -> i64 {
-        self.client_disconnect_gauge.get()
-    }
+    /// Get the total count of client disconnections
+    pub fn get_client_disconnect_count(&self) -> u64 {
+        self.client_disconnect_counter.get()
+    }
@@
-        registry.register(Box::new(self.client_disconnect_gauge.clone()))?;
+        registry.register(Box::new(self.client_disconnect_counter.clone()))?;

Optional follow-up (future PR): prefer IntCounterVec with labels like phase (connection|stream) and/or endpoint for richer diagnostics.

Also applies to: 137-142, 205-209, 274-283, 287-287

🧹 Nitpick comments (1)
lib/llm/src/http/service/disconnect.rs (1)

100-103: Disconnect accounting hook — LGTM; consider labeling by phase

The optional metrics handle is cleanly threaded and increments on unexpected closes in both phases.

Nice-to-have: capture a label for where the drop occurred (“connection” vs “stream”) to aid debugging. This would mean switching to an IntCounterVec and calling inc_client_disconnect("connection") or ("stream"). If you want, I can draft that change.

Also applies to: 111-116, 126-131, 136-139, 150-153

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 09c7b73 and 0ac83d5.

📒 Files selected for processing (4)
  • lib/llm/src/grpc/service/openai.rs (1 hunks)
  • lib/llm/src/http/service/disconnect.rs (5 hunks)
  • lib/llm/src/http/service/metrics.rs (4 hunks)
  • lib/llm/src/http/service/openai.rs (3 hunks)
🧰 Additional context used
🧬 Code graph analysis (3)
lib/llm/src/grpc/service/openai.rs (1)
lib/llm/src/http/service/disconnect.rs (1)
  • create_connection_monitor (100-123)
lib/llm/src/http/service/metrics.rs (1)
lib/llm/src/http/service/service_v2.rs (1)
  • new (74-86)
lib/llm/src/http/service/openai.rs (2)
lib/llm/src/http/service/disconnect.rs (1)
  • create_connection_monitor (100-123)
lib/llm/src/http/client.rs (2)
  • context (316-318)
  • context (352-354)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
  • GitHub Check: Build and Test - dynamo
  • GitHub Check: pre-merge-rust (.)
  • GitHub Check: pre-merge-rust (lib/runtime/examples)
  • GitHub Check: pre-merge-rust (lib/bindings/python)
🔇 Additional comments (4)
lib/llm/src/grpc/service/openai.rs (1)

55-56: Metrics propagation into connection monitor — LGTM

Passing Some(state.metrics_clone()) ensures gRPC disconnects are counted the same way as HTTP. No issues spotted.

lib/llm/src/http/service/openai.rs (3)

423-425: Chat completions wired for disconnect metrics — LGTM

Matches the completions path.


656-658: Responses path: metrics handle passed — LGTM

_stream_handle is intentionally unused for unary; fine.


227-229: Approve: All create_connection_monitor call sites updated
Verified every invocation—in both HTTP and gRPC services—now uses the two-argument signature (context.clone(), Some(state.metrics_clone())). No further changes needed.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
lib/llm/src/http/service/metrics.rs (1)

21-27: Use a Counter (not Gauge) for “client disconnects”.

This is a monotonically increasing count; Prometheus best practice is an IntCounter (ideally with “_total” suffix). Using a Gauge risks accidental decrements and confuses rate calculations.

Apply:

- use prometheus::{Encoder, HistogramOpts, HistogramVec, IntCounterVec, IntGaugeVec, Opts};
+ use prometheus::{Encoder, HistogramOpts, HistogramVec, IntCounter, IntCounterVec, IntGaugeVec, Opts};
@@
-    client_disconnect_gauge: prometheus::IntGauge,
+    client_disconnect_counter: IntCounter,
@@
-        let client_disconnect_gauge = prometheus::IntGauge::new(
-            frontend_metric_name("client_disconnects"),
-            "Number of connections dropped by clients",
-        )
+        let client_disconnect_counter = prometheus::IntCounter::new(
+            frontend_metric_name("client_disconnects_total"),
+            "Total number of connections closed by clients",
+        )
         .unwrap();
@@
-            client_disconnect_gauge,
+            client_disconnect_counter,
@@
-    /// Increment the gauge for client disconnections
-    pub fn inc_client_disconnect(&self) {
-        self.client_disconnect_gauge.inc();
-    }
+    /// Increment the counter for client disconnections
+    pub fn inc_client_disconnect(&self) {
+        self.client_disconnect_counter.inc();
+    }
@@
-    /// Get the count of client disconnections
-    pub fn get_client_disconnect_count(&self) -> i64 {
-        self.client_disconnect_gauge.get()
-    }
+    /// Get the count of client disconnections
+    pub fn get_client_disconnect_count(&self) -> i64 {
+        self.client_disconnect_counter.get() as i64
+    }
@@
-        registry.register(Box::new(self.client_disconnect_gauge.clone()))?;
+        registry.register(Box::new(self.client_disconnect_counter.clone()))?;

Optional follow-up: expose as a labeled counter (e.g., by endpoint/model/phase), if you want per-slice visibility.

Also applies to: 137-142, 208-214, 274-283, 287-287

🧹 Nitpick comments (4)
lib/llm/src/http/service/openai.rs (2)

226-244: Pass-through metrics to connection monitor — good; fix error text.

The new metrics handle wiring looks correct. Minor nit: the error string still says “chat completions” in the completions handler.

Apply:

-            ErrorMessage::internal_server_error(&format!(
-                "Failed to await chat completions task: {:?}",
-                e,
-            ))
+            ErrorMessage::internal_server_error(&format!(
+                "Failed to await completions task: {:?}",
+                e,
+            ))

656-666: Metrics wiring is good; fix response handler error text.

The metrics handle is correctly passed. Nit on the error message: s/“chat completions”/“responses”.

Apply:

-            ErrorMessage::internal_server_error(&format!(
-                "Failed to await chat completions task: {:?}",
-                e,
-            ))
+            ErrorMessage::internal_server_error(&format!(
+                "Failed to await responses task: {:?}",
+                e,
+            ))
lib/llm/src/http/service/metrics.rs (1)

137-142: Consider labels for richer attribution.

If feasible, prefer an IntCounterVec with labels like phase={connection|stream}, transport={http|grpc}, and optionally model/endpoint to localize drops.

Example:

- let client_disconnect_counter = IntCounter::new(
-   frontend_metric_name("client_disconnects_total"), ...);
+ let client_disconnect_counter = IntCounterVec::new(
+   Opts::new(frontend_metric_name("client_disconnects_total"),
+             "Total client-initiated disconnects"),
+   &["phase","transport"]
+ )?;

You can then increment with concrete labels where you know the context.

lib/llm/src/http/service/disconnect.rs (1)

100-103: Minor: make create_connection_monitor a sync function (future cleanup).

The function doesn’t await; consider making it fn (not async) in a later PR to remove unnecessary .await at call sites.

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 09c7b73 and 0ac83d5.

📒 Files selected for processing (4)
  • lib/llm/src/grpc/service/openai.rs (1 hunks)
  • lib/llm/src/http/service/disconnect.rs (5 hunks)
  • lib/llm/src/http/service/metrics.rs (4 hunks)
  • lib/llm/src/http/service/openai.rs (3 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (7)
  • GitHub Check: Build and Test - dynamo
  • GitHub Check: pre-merge-rust (lib/bindings/python)
  • GitHub Check: pre-merge-rust (lib/runtime/examples)
  • GitHub Check: pre-merge-rust (.)
  • GitHub Check: pre-merge-rust (.)
  • GitHub Check: pre-merge-rust (lib/runtime/examples)
  • GitHub Check: pre-merge-rust (lib/bindings/python)
🔇 Additional comments (4)
lib/llm/src/http/service/openai.rs (1)

423-425: LGTM: metrics handle plumbed into connection monitor.

This correctly enables disconnect tracking for chat completions. No further action.

lib/llm/src/grpc/service/openai.rs (1)

55-57: LGTM: gRPC path now feeds metrics into the disconnect monitor.

This mirrors the HTTP path and ensures client drops on streams are recorded.

lib/llm/src/http/service/disconnect.rs (2)

100-116: Disconnect counting and cancellation path look correct.

Incrementing on ClosedUnexpectedly for both primary and stream handles matches the intent and avoids false positives when gracefully disarmed.

Also applies to: 130-153


100-116: All create_connection_monitor call sites include the metrics argument – no instances remain using the old single-parameter signature.

@nnshah1 nnshah1 left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@grahamking

Copy link
Copy Markdown
Contributor

/ok to test 0ac83d5

@michaelfeil

Copy link
Copy Markdown
Contributor Author

@grahamking Not sure what is going on with the CI here.

@grahamking

Copy link
Copy Markdown
Contributor

/ok to test de9098b

@grahamking grahamking merged commit e42746a into ai-dynamo:main Sep 10, 2025
15 of 16 checks passed
ayushag-nv pushed a commit that referenced this pull request Sep 15, 2025
Signed-off-by: michaelfeil <me@michaelfeil.eu>
Signed-off-by: ayushag <ayushag@nvidia.com>
zhongdaor-nv pushed a commit that referenced this pull request Sep 15, 2025
Signed-off-by: michaelfeil <me@michaelfeil.eu>
Signed-off-by: zhongdaor <zhongdaor@nvidia.com>
yao531441 pushed a commit to yao531441/dynamo that referenced this pull request May 13, 2026
Signed-off-by: michaelfeil <me@michaelfeil.eu>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

external-contribution Pull request is from an external contributor feat size/M

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants