Skip to content

Fix num running requests (load) wrong cleared for ongoing requests#15116

Merged
fzyzcjy merged 38 commits intosgl-project:mainfrom
fzyzcjy:feat/rm_reset_load
Dec 15, 2025
Merged

Fix num running requests (load) wrong cleared for ongoing requests#15116
fzyzcjy merged 38 commits intosgl-project:mainfrom
fzyzcjy:feat/rm_reset_load

Conversation

@fzyzcjy
Copy link
Copy Markdown
Collaborator

@fzyzcjy fzyzcjy commented Dec 14, 2025

Motivation

Modifications

Accuracy Tests

Benchmarking and Profiling

Checklist

@fzyzcjy
Copy link
Copy Markdown
Collaborator Author

fzyzcjy commented Dec 14, 2025

/tag-and-rerun-ci

@gemini-code-assist
Copy link
Copy Markdown
Contributor

Summary of Changes

Hello @fzyzcjy, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request addresses a bug where the count of running requests (worker load) was not being correctly managed, particularly for ongoing requests. It introduces a robust RAII-based mechanism, WorkerLoadGuardV2, to automatically track worker load throughout the lifecycle of a request, ensuring that the load count is always accurate. Concurrently, it centralizes all circuit breaker metric reporting within the CircuitBreaker component, enhancing observability and maintainability. These changes lead to more reliable load balancing and circuit breaking behavior within the system.

Highlights

  • Refactored Worker Load Management: Introduced WorkerLoadGuardV2, an RAII (Resource Acquisition Is Initialization) guard, to automatically increment and decrement worker load, ensuring accurate tracking of ongoing requests throughout their lifecycle.
  • Centralized Circuit Breaker Metrics: All RouterMetrics calls related to circuit breaker state, outcomes, and transitions have been moved into the CircuitBreaker struct itself, improving consistency, reducing redundancy, and enabling per-worker metric labeling.
  • Removed Periodic Load Reset: The previous periodic reset mechanism for worker loads in the WorkerRegistry has been eliminated, as the new RAII guard handles load tracking on a granular, per-request basis, making the system more reactive and accurate.
  • Improved Streaming Load Handling: The logic for decrementing load for streaming requests has been simplified by leveraging the WorkerLoadGuardV2 and setting it to None upon stream completion, ensuring load is correctly released even for long-lived connections.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature Command Description
Code Review /gemini review Performs a code review for the current pull request in its current state.
Pull Request Summary /gemini summary Provides a summary of the current pull request in its current state.
Comment @gemini-code-assist Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help /gemini help Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request provides an excellent fix for a critical bug in tracking the number of running requests, especially for ongoing streaming requests. The introduction of the WorkerLoadGuardV2 RAII guard is a significant improvement, making load management much more robust and less error-prone by ensuring increment_load and decrement_load are always correctly paired, even across error paths and different response types. This change has led to cleaner, more maintainable code in router.rs and has correctly allowed for the removal of the periodic load reset workaround in worker_registry.rs. Additionally, the refactoring of circuit breaker metrics into the CircuitBreaker struct itself is a great design choice that improves encapsulation. Overall, these changes substantially enhance the correctness and maintainability of the codebase.

I am having trouble creating individual review comments. Click here to see my feedback.

sgl-model-gateway/src/core/circuit_breaker.rs (316-320)

medium

Creating the publish_gauge_metrics helper function is a great refactoring. It centralizes the logic for publishing gauge metrics, reduces code duplication in record_outcome, transition_to, and reset, and makes it easier to manage which metrics are published and when. The TODO comment about potentially publishing on every change is noted, but the current approach of publishing on key events seems like a reasonable and efficient balance.

sgl-model-gateway/src/core/worker.rs (1037-1053)

medium

The introduction of WorkerLoadGuardV2 is an excellent application of the RAII pattern. Using a guard whose Drop implementation decrements the load counter is a much more robust and idiomatic Rust approach than manually tracking and decrementing the load. This will prevent entire classes of bugs related to incorrect load counting, especially in complex scenarios involving streaming and error handling. The TODO comment about migrating fully to V2 is also noted and appreciated.

sgl-model-gateway/src/core/worker_registry.rs (462-469)

medium

Removing the periodic load reset logic is a fantastic improvement. This kind of reset is often a workaround for incorrect state management. Its removal, along with the associated counter variables, demonstrates confidence that the root cause of the load counting issue has been properly fixed with the new WorkerLoadGuardV2, leading to a more correct and reliable system.

sgl-model-gateway/src/routers/http/router.rs (237-238)

medium

Replacing the manual load_incremented boolean flag with the WorkerLoadGuardV2 RAII guard greatly improves the code's correctness and readability. This ensures that the load is always decremented when the guard goes out of scope, simplifying error handling and logic flow.

@fzyzcjy fzyzcjy marked this pull request as ready for review December 14, 2025 14:39
@fzyzcjy fzyzcjy merged commit 89ad390 into sgl-project:main Dec 15, 2025
59 checks passed
Liwansi added a commit to iforgetmyname/sglang that referenced this pull request Dec 15, 2025
…n_eagle3_npu

* 'main' of https://github.com/sgl-project/sglang: (89 commits)
  [model-gateway] Remove legacy RouterMetrics and Rename SmgMetrics to Metrics and smg_labels to metrics_labels (sgl-project#15160)
  [diffusion] fix: fix video model sp when resolution is not specified (sgl-project#15047)
  [diffusion] fix: fix pytorch non-writable array warning (sgl-project#15017)
  [diffusion] fix: cache dit with parallel (sgl-project#15163)
  chore: change npu pr-test a2 runner (sgl-project#15152)
  [Feature] Fuse mrope all in 1 kernel (sgl-project#14906)
  Fix num running requests (load) wrong cleared for ongoing requests (sgl-project#15116)
  Fused two elementwise kernels for k_nope and k_pe concat (sgl-project#14862)
  fix: adding date and fixing release name issue (sgl-project#15174)
  [CPU] Add Gemma3RMSNorm kernel in sgl-kernel and add ut (sgl-project#9324)
  feature: PR wheel (sgl-project#15170)
  [diffusion] model: support mutli-image input and qwen-image-edit-2509 (sgl-project#15005)
  fix CompressedTensorsW8A8Int8 min_capability (sgl-project#13914)
  Tiny improve summary text in `bench_one_batch_server.py` (sgl-project#15158)
  [model-gateway] add mcp and discovery metrics (sgl-project#15156)
  fix: move ci-bot (sgl-project#15154)
  Fix import warnings (sgl-project#15144)
  ci: adding errors to Github summary (sgl-project#14778)
  [model-gateway] Add streaming metrics for harmony gRPC router (sgl-project#15147)
  [model-gateway] upgrade axum and axum server (sgl-project#15146)
  ...

# Conflicts:
#	python/sglang/srt/server_args.py
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants