
[Bug Fix] Fix non-streaming request abort failure when --enable-metrics is enabled#20625

Merged
hnyls2002 merged 9 commits into sgl-project:main from fanghao566:fix/non-streaming-abort-with-metrics
Mar 23, 2026

Conversation

@fanghao566
Contributor

Motivation

When --enable-metrics is enabled, non-streaming requests to /v1/chat/completions are
not aborted when the client disconnects (e.g., curl cancelled with Ctrl+C). Without
--enable-metrics, abort works correctly.

Root Cause

add_prometheus_track_response_middleware() uses @app.middleware("http"), which is
backed by Starlette's BaseHTTPMiddleware. Its call_next() replaces the ASGI receive
callable with an internal channel, breaking request.is_disconnected() in downstream
route handlers. The disconnect detection in tokenizer_manager.py _wait_one_response()
relies on request.is_disconnected(), which always returns False when
BaseHTTPMiddleware is in the middleware stack.

This is a known Starlette issue.
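To see why the wrapped `receive` hides disconnects, here is a toy model in pure asyncio (no Starlette; all names are illustrative, not Starlette's actual internals): the wrapper hands the downstream handler a substitute `receive` that only replays buffered request-body messages and never forwards `http.disconnect`.

```python
import asyncio

# Toy model of the failure mode. `make_replaced_receive` plays the role of
# the internal channel a BaseHTTPMiddleware-style call_next() substitutes
# for the real ASGI `receive`; it never yields http.disconnect.

async def client_receive():
    # What the real ASGI server delivers after the client hits Ctrl+C.
    return {"type": "http.disconnect"}

def make_replaced_receive():
    # Stand-in for the middleware's internal channel: it replays the
    # (already-read) request body and nothing else.
    messages = [{"type": "http.request", "body": b"", "more_body": False}]

    async def replaced_receive():
        if messages:
            return messages.pop(0)
        return {"type": "http.request", "body": b"", "more_body": False}

    return replaced_receive

async def is_disconnected(receive):
    # Simplified sketch of Request.is_disconnected(): peek at one message.
    msg = await receive()
    return msg["type"] == "http.disconnect"

async def main():
    through_wrapper = await is_disconnected(make_replaced_receive())
    direct = await is_disconnected(client_receive)
    return through_wrapper, direct

wrapped, direct = asyncio.run(main())
print(wrapped, direct)  # False True
```

With the substituted channel the handler can never observe the disconnect, while the same check against the original `receive` sees it immediately.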

Steps to Reproduce

  1. Launch server with --enable-metrics
  2. Send a non-streaming request: curl -X POST http://localhost:30000/v1/chat/completions -d '...'
  3. Cancel curl with Ctrl+C while the request is processing
  4. Observe: server does NOT abort the request (no "Abort" in logs)
  5. Repeat without --enable-metrics: server correctly aborts the request

Fixes #20623

Solution

Patch @app.middleware("http") to use a pure ASGI middleware adapter (_PureASGIDispatch)
that provides a fixed call_next. The fixed call_next passes ASGI receive through
untouched instead of replacing it, so request.is_disconnected() keeps working.

The upstream dispatch function (track_http_status_code) is not modified — only the
call_next it receives is replaced. Any future changes to the dispatch function body
(e.g., new counters, routing keys in v0.5.9+) work automatically without updating the patch.

Changes

File Change
python/sglang/srt/utils/http_middleware_patch.py New file — Pure ASGI call_next adapter + patch_app_http_middleware()
python/sglang/srt/utils/common.py +3 lines — comment + import + call before @app.middleware("http")
test/registered/scheduler/test_abort_with_metrics.py New file — Unit tests for abort + metrics with --enable-metrics

Test Plan

  • test_abort_non_streaming_with_metrics — Launch server with --enable-metrics,
    send 16 concurrent non-streaming requests in a subprocess, terminate the subprocess
    to simulate client disconnect, verify server logs contain "Abort"
  • test_metrics_still_work — Launch server with --enable-metrics, send a normal
    request, verify /metrics endpoint returns sglang:http_requests_total and
    sglang:http_responses_total counters
  • Existing test_abort.py tests still pass

Run tests:

python test/registered/scheduler/test_abort_with_metrics.py -v

@gemini-code-assist
Contributor

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request resolves a significant issue preventing non-streaming requests from being properly aborted by the server when Prometheus metrics were active. The solution involves a targeted patch to the HTTP middleware, which re-establishes the correct propagation of client disconnect signals, thereby restoring the expected abort behavior without negatively impacting the existing metrics collection system.

Highlights

  • Bug Fix for Request Abort: Addressed a critical bug where non-streaming requests to /v1/chat/completions were not aborted upon client disconnection when the --enable-metrics flag was enabled.
  • Middleware Patch Implementation: Implemented a pure ASGI middleware adapter (_PureASGIDispatch) to patch Starlette's @app.middleware("http"). This ensures the ASGI receive callable is passed through untouched, restoring request.is_disconnected() functionality.
  • Comprehensive Test Coverage: Introduced new unit tests (test_abort_with_metrics.py) to specifically validate that non-streaming requests are correctly aborted with metrics enabled and to confirm that Prometheus metrics collection remains functional after the patch.


Changelog
  • python/sglang/srt/utils/common.py
    • Added an import for patch_app_http_middleware and called it within add_prometheus_track_response_middleware to apply the fix.
  • python/sglang/srt/utils/http_middleware_patch.py
    • Introduced a new module containing _PureASGIDispatch and patch_app_http_middleware to fix the middleware behavior.
  • test/registered/scheduler/test_abort_with_metrics.py
    • Added new tests to verify non-streaming abort with metrics and confirm metrics functionality.
Activity
  • Identified and reproduced a bug where non-streaming requests were not aborted when --enable-metrics was enabled, linking to issue [Bug] Non-streaming requests are not aborted on client disconnect when --enable-metrics is set #20623.
  • Developed a solution involving a pure ASGI middleware adapter to fix the request.is_disconnected() issue caused by BaseHTTPMiddleware.
  • Implemented comprehensive unit tests, including test_abort_non_streaming_with_metrics and test_metrics_still_work, to validate the fix and ensure no regressions.
  • Confirmed that existing test_abort.py tests continue to pass.


@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request provides a well-designed fix for an issue where non-streaming requests were not being aborted correctly when metrics are enabled. The approach of patching Starlette's HTTP middleware to prevent it from replacing the ASGI receive callable is clever and effectively resolves the problem. The new tests are thorough, covering both the abort functionality and ensuring that metrics collection remains operational. My review includes a couple of minor suggestions for improving code quality in the new files.

Comment thread python/sglang/srt/utils/http_middleware_patch.py Outdated
Comment thread test/registered/scheduler/test_abort_with_metrics.py Outdated
@fanghao566 fanghao566 force-pushed the fix/non-streaming-abort-with-metrics branch from bb8fdb4 to ce4f55e on March 15, 2026 at 12:02
…cs is enabled

@app.middleware("http") uses Starlette's BaseHTTPMiddleware whose call_next()
replaces the ASGI `receive` callable. This breaks request.is_disconnected()
in downstream handlers, preventing non-streaming request abort on client
disconnect when --enable-metrics is set.
Add a pure ASGI middleware adapter (http_middleware_patch.py) that provides a
fixed call_next passing `receive` through untouched. The upstream dispatch
function body is not modified — only the call_next it receives is replaced.
Fixes sgl-project#20623
@fanghao566 fanghao566 force-pushed the fix/non-streaming-abort-with-metrics branch from 6534e2a to 938ef42 on March 15, 2026 at 12:58
@sufeng-buaa
Collaborator

LGTM. I have verified it locally, and it’s OK.

@sufeng-buaa
Collaborator

/tag-and-rerun-ci

@fanghao566
Contributor Author

Hi, the CI failures in the latest run are all unrelated to this PR's changes (HTTP middleware patch for non-streaming abort). Here's a breakdown:

  • stage-b-test-small-1-gpu (1): pip install OSError, cuda_bindings-12.9.6.dist-info/METADATA not found on runner
  • stage-b-test-large-1-gpu (8): flaky registered/lora/test_lora_update.py
  • per-commit-1-npu-a2 (1): Ascend NPU test test_ascend_autoround_moe.py failure
  • multimodal-gen-test-8-npu-a3: Ascend NPU test_diffusion_generation performance assertion, max retry exceeded
  • wait-for-stage-b: cascading from stage-b-test-small-1-gpu (1)
  • pr-test-finish: cascading from wait-for-stage-b

None of the failed tests are related to the files changed in this PR. Could a maintainer please re-run the failed checks? Thanks!

@sufeng-buaa
Collaborator

> Hi, the CI failures in the latest run are all unrelated to this PR's changes […]

stage-b-test-small-1-gpu (1) was caused by other commits on the main branch. Try waiting for half a day and then rebasing onto the latest code to see if that helps. Other failed test cases can be retried multiple times.

@fanghao566
Contributor Author

@sufeng-buaa Hi, I've rebased and re-run CI a few times now, but the failures keep coming from unrelated tests. The main branch CI itself seems unstable recently. Could you help re-run the failed checks or merge this PR directly if the failures are confirmed unrelated? Thanks!

@sufeng-buaa
Collaborator

sufeng-buaa commented Mar 22, 2026

> @sufeng-buaa Hi, I've rebased and re-run CI a few times now, but the failures keep coming from unrelated tests […]

I don’t have permission to merge directly unless all CI checks have fully passed. You can ask @hnyls2002 to see whether it can be merged directly. I can help rerun the failed cases a few more times.

@hnyls2002
Collaborator

@sufeng-buaa @fanghao566 We were doing CI maintenance yesterday. I believe all CUDA CI is good at this moment.

See #21065


register_cuda_ci(est_time=180, suite="stage-b-test-small-1-gpu")
register_amd_ci(est_time=300, suite="stage-b-test-small-1-gpu-amd")
Collaborator


Common features like this do not need to be registered to AMD as well.

Collaborator

@hnyls2002 hnyls2002 left a comment


Test: prefer mock over launching a real server

The test in test_abort_with_metrics.py launches a full server with model loading (est ~180s) to verify a middleware-level ASGI receive passthrough fix. This can be tested much faster and more precisely with unittest.mock:

  • Construct a fake ASGI app that checks request.is_disconnected() inside the handler
  • Wrap it with _PureASGIDispatch
  • Simulate disconnect by making receive return {"type": "http.disconnect"}
  • Assert is_disconnected() returns True

This directly tests the fix (ASGI receive passthrough) without needing inference, model loading, or GPU time. The existing test_abort.py already covers the full e2e abort flow.

Also:

  • Remove register_amd_ci — this is a backend-independent test (middleware behavior), no need to duplicate across backends
  • Use setUpClass / tearDownClass to share the server if an integration test is still desired

Ref: .claude/skills/write-sglang-test/SKILL.md — rules 4 and 7 (Model & Backend Selection table)
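A mock-style test along these lines could look roughly as follows (pure asyncio; `PassthroughMiddleware` is a trivial stand-in for `_PureASGIDispatch`, and all names here are illustrative rather than the test file's actual contents):

```python
import asyncio
import unittest

# Illustrative unit-test sketch in the spirit of the suggestion above.
# The property under test: `receive` reaches the handler untouched, so a
# simulated http.disconnect is visible downstream of the middleware.

class PassthroughMiddleware:
    def __init__(self, app):
        self.app = app

    async def __call__(self, scope, receive, send):
        await self.app(scope, receive, send)  # forward `receive` untouched

class TestReceivePassthrough(unittest.TestCase):
    def test_disconnect_visible_downstream(self):
        observed = []

        async def handler(scope, receive, send):
            observed.append(await receive())

        async def fake_receive():
            return {"type": "http.disconnect"}  # simulated client abort

        async def fake_send(message):
            pass

        mw = PassthroughMiddleware(handler)
        asyncio.run(mw({"type": "http"}, fake_receive, fake_send))
        self.assertEqual(observed, [{"type": "http.disconnect"}])

result = unittest.TestResult()
TestReceivePassthrough("test_disconnect_visible_downstream").run(result)
print(result.wasSuccessful())  # True
```

This runs in milliseconds with no GPU, model, or server process, which is the point of the review suggestion.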

Replace the integration test that launches a real inference server
(~180s) with a lightweight unit test that directly exercises
_PureASGIDispatch using fake ASGI receive/send callables.

Changes:
- Test _PureASGIDispatch receive passthrough via mock ASGI pipeline
- Remove register_amd_ci (backend-independent middleware test)
- Reduce est_time from 180s to 10s
@fanghao566
Contributor Author

> Test: prefer mock over launching a real server […]

@hnyls2002 Hi, I've updated the test per your feedback:

  • Replaced the full server integration test with a mock ASGI unit test
  • Removed register_amd_ci
  • Reduced est_time from 180s to 10s

Could you please re-review? Thanks!

@hnyls2002
Collaborator

Only the middleware patch was updated. Basic GPU tests and newly added CPU tests passed.

@hnyls2002 hnyls2002 merged commit 2b47bd3 into sgl-project:main Mar 23, 2026
36 of 76 checks passed
0-693 pushed a commit to 0-693/sglang that referenced this pull request Mar 25, 2026
dutsc pushed a commit to dutsc/sglang that referenced this pull request Mar 30, 2026
JustinTong0323 pushed a commit to JustinTong0323/sglang that referenced this pull request Apr 7, 2026
yhyang201 pushed a commit to yhyang201/sglang that referenced this pull request Apr 22, 2026


Successfully merging this pull request may close these issues.

[Bug] Non-streaming requests are not aborted on client disconnect when --enable-metrics is set

3 participants