[RL] support abort all and fix abort on waiting queue #6855
zhuzilin wants to merge 3 commits into sgl-project:main
Conversation
Hello @zhuzilin, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
Summary of Changes
Hello! Gemini here, providing a summary of this pull request. This PR aims to enhance the request abortion functionality. Specifically, it introduces the ability to abort all pending and running requests by sending an empty request ID (rid) to the /abort_request endpoint. Additionally, it refines the handling of requests that are aborted while still in the waiting queue, ensuring they are properly marked as finished and return an empty result with an 'abort' finish reason, preventing unnecessary prefill operations.
Highlights
- **Abort All Requests**: Adds support for aborting all requests (both in the waiting queue and currently running/processing) when the `/abort_request` endpoint receives a request with an empty `rid`.
- **Improved Waiting Queue Abort Handling**: Modifies the internal handling of requests aborted while in the waiting queue. Instead of simply removing them, they are now explicitly marked as finished, and a specific response indicating an 'abort' finish reason is generated and returned, avoiding subsequent processing like prefill.
- **API Protocol Update**: Updates the OpenAI API protocol definitions to include 'abort' as a valid `finish_reason` for both completion and chat completion stream responses.
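The abort-all behavior described in the highlights above can be sketched with a minimal in-memory model. The class and attribute names here are illustrative stand-ins, not sglang's actual scheduler implementation:

```python
from dataclasses import dataclass, field

@dataclass
class Req:
    rid: str
    finished: bool = False
    to_abort: bool = False

@dataclass
class MiniScheduler:
    waiting_queue: list = field(default_factory=list)
    running: list = field(default_factory=list)

    def abort_request(self, rid: str) -> None:
        # An empty rid means "abort everything": drop the whole waiting
        # queue and flag every unfinished in-flight request for abortion.
        if rid == "":
            self.waiting_queue.clear()
            for req in self.running:
                if not req.finished:
                    req.to_abort = True
            return
        # Otherwise abort only the matching request.
        self.waiting_queue = [r for r in self.waiting_queue if r.rid != rid]
        for req in self.running:
            if req.rid == rid and not req.finished:
                req.to_abort = True
```

Requests still in the waiting queue can simply be removed, while running requests are only flagged, since the batch loop must unwind their in-flight state.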
Changelog
- python/sglang/srt/managers/scheduler.py
  - Added logic to the `abort_request` method (lines 2036-2038) to identify all requests in the waiting queue for deletion if the received `rid` is empty.
  - Added logic to the `abort_request` method (lines 2055-2057) to mark all non-finished requests in the running and current batches for abortion if the received `rid` is empty.
- python/sglang/srt/managers/tokenizer_manager.py
  - Modified the `abort_request` method (line 788) to allow an empty `rid` to bypass the check that the `rid` exists in `rid_to_state`, enabling the 'abort all' signal to be sent to the scheduler.
  - Rewrote the `_handle_abort_req` method (lines 1420-1436) to explicitly set the request state to finished, append a result dictionary with an empty text and an 'abort' finish reason, and set the event, instead of just popping the state. This ensures proper cleanup and client notification for requests aborted before prefill.
- python/sglang/srt/openai_api/protocol.py
  - Added 'abort' to the `Literal` type for `finish_reason` in the `CompletionResponseStreamChoice` model (line 213).
  - Added 'abort' to the `Literal` type for `finish_reason` in the `ChatCompletionResponseStreamChoice` model (line 448).
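The protocol change amounts to widening the `finish_reason` literal to accept 'abort'. A minimal sketch of such a check (the set of other literal values here is illustrative; sglang's actual models declare more reasons):

```python
from typing import Literal, Optional, get_args

# Illustrative literal type; only "abort" is the value added by this PR.
FinishReason = Literal["stop", "length", "abort"]

def validate_finish_reason(value: Optional[str]) -> Optional[str]:
    # None is allowed for still-streaming chunks; otherwise the value
    # must be one of the declared literals.
    if value is not None and value not in get_args(FinishReason):
        raise ValueError(f"invalid finish_reason: {value!r}")
    return value
```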
Using Gemini Code Assist
The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.
Invoking Gemini
You can request assistance from Gemini at any point in your pull request via creating an issue comment (i.e. comment on the pull request page) using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands.
| Feature | Command | Description |
|---|---|---|
| Code Review | `/gemini review` | Performs a code review for the current pull request in its current state. |
| Pull Request Summary | `/gemini summary` | Provides a summary of the current pull request in its current state. |
| Comment | `@gemini-code-assist` | Responds in comments when explicitly tagged, both in issue comments and review comments. |
| Help | `/gemini help` | Displays a list of available commands. |
Customization
To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.
Limitations & Feedback
Gemini Code Assist is currently in preview and may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments to provide feedback.
You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.
Requests line up in queue,
Some run, some wait for you.
Send empty ID,
All set free,
Aborted, fresh and new.
Footnotes
1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double-check it and use code with caution. ↩
Code Review
This pull request introduces two valuable features: the ability to abort all requests using an empty rid and ensuring that requests aborted from the waiting queue return an empty result without prefill. The changes in scheduler.py and openai_api/protocol.py look good and directly address these goals.
I've identified a couple of areas in tokenizer_manager.py that could be improved, particularly regarding state cleanup and the accuracy of reported prompt_tokens for aborted requests. Additionally, adding unit tests for the new abort functionalities would be beneficial for long-term maintainability.
Overall, good work on implementing these features!
Summary of Findings
- Potential Memory Leak in Abort Handling: In
tokenizer_manager.py, the_handle_abort_reqmethod marks aborted requests as finished but doesn't remove their state fromself.rid_to_state. This could potentially lead to a memory leak over time as aborted request states accumulate. (Commented withhighseverity) - Accuracy of
prompt_tokensfor Aborted Requests: For requests aborted before prefill,_handle_abort_reqintokenizer_manager.pysetsprompt_tokensto 1. This might not accurately reflect the number of tokens processed by the model, potentially impacting metrics. Suggesting to set it to 0 or the actual (unprocessed) prompt length. (Commented withmediumseverity) - Unit Testing: The PR introduces new functionality for aborting requests (abort all, abort from waiting queue). It would be beneficial to add unit tests to cover these new behaviors to ensure correctness and prevent regressions. (Not commented directly as per severity settings, but important for code quality)
- API Documentation: The
/abort_requestendpoint now supports a new behavior (emptyridfor aborting all requests). This change should be documented for API users. (Not commented directly as per severity settings, but important for usability)
Merge Readiness
This pull request makes good progress on enhancing request abortion capabilities. However, there are a couple of issues identified (a potential memory leak and metrics accuracy for prompt_tokens) that should be addressed before merging. The memory leak is a high-severity concern. Additionally, consider adding unit tests for the new abort functionalities and updating any relevant API documentation.
I am unable to approve pull requests, but I recommend addressing the high and medium severity comments before this PR is merged by other reviewers.
```python
state = self.rid_to_state[recv_obj.rid]
state.finished = True
state.out_list.append(
    {
        "text": "",
        "meta_info": {
            "id": recv_obj.rid,
            "finish_reason": {
                "type": "abort",
                "message": "Abort before prefill",
            },
            "prompt_tokens": 1,
            "completion_tokens": 0,
        },
    }
)
state.event.set()
```
It appears that when a request is handled by _handle_abort_req, its state is marked as finished, but it's not removed from self.rid_to_state. In contrast, _handle_batch_output does remove finished requests (e.g., del self.rid_to_state[rid] around line 1223 in the full file).
Could this lead to a memory leak if aborted request states accumulate in self.rid_to_state? If so, should del self.rid_to_state[recv_obj.rid] be added at the end of this handler, similar to how normally completed requests are handled?
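One way to address this, sketched against a simplified stand-in for the tokenizer manager (the names mirror the snippet above, but this is illustrative code, not sglang's actual implementation), is to delete the state entry after signaling the waiter:

```python
import threading

class MiniTokenizerManager:
    def __init__(self):
        self.rid_to_state = {}

    def _handle_abort_req(self, rid: str) -> None:
        state = self.rid_to_state[rid]
        state["finished"] = True
        state["out_list"].append(
            {
                "text": "",
                "meta_info": {
                    "id": rid,
                    "finish_reason": {"type": "abort", "message": "Abort before prefill"},
                    "prompt_tokens": 0,
                    "completion_tokens": 0,
                },
            }
        )
        # Wake up the coroutine/thread waiting on this request first...
        state["event"].set()
        # ...then mirror the normal completion path: drop the state so
        # aborted requests do not accumulate in rid_to_state.
        del self.rid_to_state[rid]
```

The waiter still holds its own reference to the state object, so deleting the dict entry after `event.set()` does not lose the appended result.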
```python
                "type": "abort",
                "message": "Abort before prefill",
            },
            "prompt_tokens": 1,
```
The meta_info for an aborted request sets "prompt_tokens": 1. If a request is aborted "before prefill", it's likely that its prompt tokens haven't been processed by the core inference engine.
Would it be more accurate to set "prompt_tokens": 0 here, or perhaps use the actual length of the input prompt if it's readily available (e.g., from state.obj.input_ids if tokenized, or 0 if not yet tokenized/processed)? Using 1 might be misleading for metrics or accounting if no tokens were actually processed by the model.
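A sketch of the suggested accounting (the helper name is hypothetical): report the tokenized prompt length when it is already known, and zero when the request never got that far:

```python
from typing import Optional, Sequence

def aborted_prompt_tokens(input_ids: Optional[Sequence[int]]) -> int:
    # If the request was tokenized before being aborted, report the real
    # prompt length; if it was never tokenized/processed, report 0 rather
    # than a placeholder like 1, so metrics reflect actual work done.
    return len(input_ids) if input_ids is not None else 0
```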
```diff
-            "prompt_tokens": 1,
+            "prompt_tokens": 0,
```
please rebase
```python
# Delete requests in the waiting queue
to_del = []
for i, req in enumerate(self.waiting_queue):
    if recv_req.rid == "":
```
Use a constant `ABORT_ALL_RID` instead of `""`.
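The reviewer's suggestion, sketched as a module-level sentinel (the constant name comes from the comment; the surrounding helper is illustrative, not sglang's actual code):

```python
ABORT_ALL_RID = ""  # sentinel rid meaning "abort every request"

def should_abort(recv_rid: str, req_rid: str) -> bool:
    # Comparing against a named constant documents the intent better
    # than a bare "" literal scattered through the scheduler.
    return recv_rid == ABORT_ALL_RID or recv_rid == req_rid
```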
```python
        },
    }
)
state.event.set()
```
Will this make sure the state is deleted in `self.rid_to_state`?
This PR will be discarded in favor of #6698. Thanks for the contribution.
Motivation
This PR is trying to achieve 2 goals:
- Support aborting all requests by sending an empty `rid` to `/abort_request`.
- Let requests aborted while in the waiting queue return from `/generate` without making sglang do prefill on the requests.

Thanks @yitianlian for his contribution on this design.
Modifications
Note that we always return `"text": ""` for requests aborted in the waiting queue; maybe we should support the case where the user hopes for `"output_ids": []`?

Thank you for your time reviewing this PR :)
Checklist