[Bug Fix] Fix non-streaming request abort failure when --enable-metrics is enabled#20625
Conversation
Summary of Changes (Gemini Code Assist): This pull request resolves a significant issue preventing non-streaming requests from being properly aborted by the server when Prometheus metrics were active. The solution involves a targeted patch to the HTTP middleware, which re-establishes the correct propagation of client disconnect signals, thereby restoring the expected abort behavior without negatively impacting the existing metrics collection system.
Activity
Code Review
This pull request provides a well-designed fix for an issue where non-streaming requests were not being aborted correctly when metrics were enabled. The approach of patching Starlette's HTTP middleware to prevent it from replacing the ASGI receive callable is clever and effectively resolves the problem. The new tests are thorough, covering both the abort functionality and ensuring that metrics collection remains operational. My review includes a couple of minor suggestions for improving code quality in the new files.
Force-pushed from bb8fdb4 to ce4f55e
…cs is enabled
@app.middleware("http") uses Starlette's BaseHTTPMiddleware whose call_next()
replaces the ASGI `receive` callable. This breaks request.is_disconnected()
in downstream handlers, preventing non-streaming request abort on client
disconnect when --enable-metrics is set.
Add a pure ASGI middleware adapter (http_middleware_patch.py) that provides a
fixed call_next passing `receive` through untouched. The upstream dispatch
function body is not modified — only the call_next it receives is replaced.
Fixes sgl-project#20623
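The idea in the commit message can be illustrated with a minimal, self-contained sketch. The class name echoes the adapter described above, but the signatures here are simplified assumptions, not the actual `http_middleware_patch.py` code: a pure ASGI wrapper whose `call_next` forwards the original `receive` callable to the inner app untouched.

```python
import asyncio

# Simplified sketch (assumed signatures) of a pure ASGI middleware adapter:
# unlike Starlette's BaseHTTPMiddleware, call_next forwards the ORIGINAL
# `receive` callable, so disconnect events stay visible downstream.

class PureASGIDispatch:
    def __init__(self, app, dispatch):
        self.app = app            # inner ASGI app
        self.dispatch = dispatch  # async dispatch(scope, receive, send, call_next)

    async def __call__(self, scope, receive, send):
        async def call_next():
            # `receive` is passed through unchanged.
            await self.app(scope, receive, send)

        await self.dispatch(scope, receive, send, call_next)


async def demo():
    seen = {}

    async def inner_app(scope, receive, send):
        # Record which receive callable the downstream handler sees.
        seen["receive"] = receive

    async def dispatch(scope, receive, send, call_next):
        await call_next()

    async def server_receive():
        return {"type": "http.request", "body": b""}

    mw = PureASGIDispatch(inner_app, dispatch)
    await mw({"type": "http"}, server_receive, None)
    # Identity is preserved: the handler got the server's own receive.
    return seen["receive"] is server_receive

print(asyncio.run(demo()))  # prints True
```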
Force-pushed from 6534e2a to 938ef42
LGTM. I have verified it locally, and it’s OK.
/tag-and-rerun-ci
Hi, the CI failures in the latest run are all unrelated to this PR's changes (HTTP middleware patch for non-streaming abort). Here's a breakdown:
None of the failed tests are related to the files changed in this PR. Could a maintainer please re-run the failed checks? Thanks!
@sufeng-buaa Hi, I've rebased and re-run CI a few times now, but the failures keep coming from unrelated tests. The main branch CI itself seems unstable recently. Could you help re-run the failed checks or merge this PR directly if the failures are confirmed unrelated? Thanks!
I don’t have permission to merge directly unless all CI checks have fully passed. You can ask @hnyls2002 to see whether it can be merged directly. I can help rerun the failed cases a few more times.
@sufeng-buaa @fanghao566 We were doing CI maintenance yesterday, and I believe that at this moment all CUDA CI is good. See #21065
register_cuda_ci(est_time=180, suite="stage-b-test-small-1-gpu")
register_amd_ci(est_time=300, suite="stage-b-test-small-1-gpu-amd")
For common features, you do not need to register to AMD.
hnyls2002
left a comment
Test: prefer mock over launching a real server
The test in test_abort_with_metrics.py launches a full server with model loading (est ~180s) to verify a middleware-level ASGI receive passthrough fix. This can be tested much faster and more precisely with unittest.mock:

- Construct a fake ASGI app that checks `request.is_disconnected()` inside the handler
- Wrap it with `_PureASGIDispatch`
- Simulate disconnect by making `receive` return `{"type": "http.disconnect"}`
- Assert `is_disconnected()` returns `True`

This directly tests the fix (ASGI receive passthrough) without needing inference, model loading, or GPU time. The existing test_abort.py already covers the full e2e abort flow.
Also:

- Remove `register_amd_ci` — this is a backend-independent test (middleware behavior), no need to duplicate across backends
- Use `setUpClass`/`tearDownClass` to share the server if an integration test is still desired

Ref: .claude/skills/write-sglang-test/SKILL.md — rules 4 and 7 (Model & Backend Selection table)
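The mock-based approach suggested above can be sketched without a server. The signatures here are simplified assumptions: the real test would wrap the actual `_PureASGIDispatch` around a Starlette `Request`, whereas this stand-in handler reads the raw `http.disconnect` event that `request.is_disconnected()` polls `receive` for.

```python
import asyncio

# Sketch of the suggested unit test: drive a passthrough middleware with fake
# ASGI receive/send callables and simulate a client disconnect. No inference,
# model loading, or GPU time needed. All names here are illustrative.

async def run_disconnect_check():
    events = [{"type": "http.disconnect"}]

    async def fake_receive():
        # Simulated client disconnect, delivered on the ORIGINAL channel.
        return events.pop(0)

    saw_disconnect = {}

    async def handler_app(scope, receive, send):
        # Stand-in for request.is_disconnected(): Starlette implements it by
        # polling `receive` for an http.disconnect message.
        msg = await receive()
        saw_disconnect["value"] = msg["type"] == "http.disconnect"

    async def dispatch(scope, receive, send, call_next):
        await call_next()

    class PureASGIDispatch:  # same passthrough idea as the fix under test
        def __init__(self, app, dispatch):
            self.app, self.dispatch = app, dispatch

        async def __call__(self, scope, receive, send):
            async def call_next():
                await self.app(scope, receive, send)  # receive untouched
            await self.dispatch(scope, receive, send, call_next)

    mw = PureASGIDispatch(handler_app, dispatch)
    await mw({"type": "http"}, fake_receive, None)
    return saw_disconnect["value"]

print(asyncio.run(run_disconnect_check()))  # prints True
```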
Replace the integration test that launches a real inference server (~180s) with a lightweight unit test that directly exercises _PureASGIDispatch using fake ASGI receive/send callables.

Changes:
- Test _PureASGIDispatch receive passthrough via mock ASGI pipeline
- Remove register_amd_ci (backend-independent middleware test)
- Reduce est_time from 180s to 10s
@hnyls2002 Hi, I've updated the test per your feedback:
Could you please re-review? Thanks!
Only the middleware patch was updated. Basic GPU tests and newly added CPU tests passed.
Motivation

When `--enable-metrics` is enabled, non-streaming requests to `/v1/chat/completions` are not aborted when the client disconnects (e.g., curl cancelled with Ctrl+C). Without `--enable-metrics`, abort works correctly.

Root Cause

`add_prometheus_track_response_middleware()` uses `@app.middleware("http")`, which is backed by Starlette's `BaseHTTPMiddleware`. Its `call_next()` replaces the ASGI `receive` callable with an internal channel, breaking `request.is_disconnected()` in downstream route handlers. The disconnect detection in `tokenizer_manager.py` `_wait_one_response()` relies on `request.is_disconnected()`, which always returns `False` when `BaseHTTPMiddleware` is in the middleware stack. This is a known Starlette issue:
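The breakage can be illustrated without Starlette itself. The sketch below is a simplified stand-in for `BaseHTTPMiddleware`'s behavior, not its actual code: `call_next` substitutes an internal receive channel, so the downstream handler never observes the real `http.disconnect` event.

```python
import asyncio

# Illustrative contrast (simplified, not Starlette's actual implementation):
# a BaseHTTPMiddleware-style call_next builds its own internal receive
# channel, so the handler no longer sees the server's receive callable and
# disconnect polling that depends on it stops working.

async def demo():
    async def server_receive():
        return {"type": "http.disconnect"}  # the client went away

    seen = {}

    async def handler(scope, receive, send):
        seen["receive"] = receive
        seen["msg"] = await receive()

    async def broken_call_next(scope, receive, send, app):
        queue = asyncio.Queue()
        await queue.put({"type": "http.request", "body": b""})

        async def internal_receive():
            # Replacement channel: never surfaces http.disconnect.
            return await queue.get()

        await app(scope, internal_receive, send)

    await broken_call_next({"type": "http"}, server_receive, None, handler)
    # The handler saw a substitute receive, and the disconnect was swallowed.
    return seen["receive"] is server_receive, seen["msg"]["type"]

print(asyncio.run(demo()))  # prints (False, 'http.request')
```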
Steps to Reproduce

1. Launch the server with `--enable-metrics`
2. `curl -X POST http://localhost:30000/v1/chat/completions -d '...'`
3. Cancel the client (Ctrl+C); the request is not aborted on the server

Without `--enable-metrics`: server correctly aborts the request

Fixes #20623
Solution

Patch `@app.middleware("http")` to use a pure ASGI middleware adapter (`_PureASGIDispatch`) that provides a fixed `call_next`. The fixed `call_next` passes ASGI `receive` through untouched instead of replacing it, so `request.is_disconnected()` keeps working.

The upstream dispatch function (`track_http_status_code`) is not modified — only the `call_next` it receives is replaced. Any future changes to the dispatch function body (e.g., new counters, routing keys in v0.5.9+) work automatically without updating the patch.
Changes

- `python/sglang/srt/utils/http_middleware_patch.py`: fixed `call_next` adapter + `patch_app_http_middleware()`
- `python/sglang/srt/utils/common.py`: `@app.middleware("http")` usage patched
- `test/registered/scheduler/test_abort_with_metrics.py`: test covering abort with `--enable-metrics`

Test Plan
- `test_abort_non_streaming_with_metrics` — Launch server with `--enable-metrics`, send 16 concurrent non-streaming requests in a subprocess, terminate the subprocess to simulate client disconnect, verify server logs contain "Abort"
- `test_metrics_still_work` — Launch server with `--enable-metrics`, send a normal request, verify the `/metrics` endpoint returns `sglang:http_requests_total` and `sglang:http_responses_total` counters
- Existing `test_abort.py` tests still pass

Run tests:
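The counter check in `test_metrics_still_work` can be sketched as a simple scan of the Prometheus text exposition. The metric names come from the test plan above; the canned sample body and its label sets are hypothetical (the real test would use the body of `GET /metrics`).

```python
# Hypothetical helper mirroring the /metrics check: verify that both SGLang
# HTTP counters appear in a Prometheus text exposition. SAMPLE is a made-up
# response body; labels shown are illustrative only.

SAMPLE = """\
# TYPE sglang:http_requests_total counter
sglang:http_requests_total{path="/v1/chat/completions"} 1.0
# TYPE sglang:http_responses_total counter
sglang:http_responses_total{path="/v1/chat/completions",status="200"} 1.0
"""

def has_counters(metrics_text):
    wanted = ("sglang:http_requests_total", "sglang:http_responses_total")
    return all(name in metrics_text for name in wanted)

print(has_counters(SAMPLE))  # prints True
```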