[AsyncRL][1/N] Add abort_generation to vllm engine and pause/continue generation to client (#537)
Conversation
Code Review
This pull request introduces functionality to pause, resume, and abort generation requests, primarily for the vLLM engine, as a foundational step for in-flight weight updates in Async RL. The changes include adding an abort_generation method to the inference engine interface, implementing it for vLLM, and adding pause_generation and resume_generation logic in InferenceEngineClient. A comprehensive test suite is also added to validate the new abort functionality. My review identifies a critical blocking call within an async function and inconsistencies in handling the paused state across different methods. I've provided specific code suggestions to address these points. Overall, the changes are well-structured and move effectively towards the desired feature.
Three resolved review threads on skyrl-train/skyrl_train/inference_engines/inference_engine_client.py (outdated).
/gemini review
Code Review
This pull request introduces important functionality for managing generation requests by adding abort_generation to the vLLM engine and implementing pause/resume logic in the InferenceEngineClient. These changes are a solid foundation for supporting in-flight weight updates. The code is well-structured, and the new unit tests effectively validate the abort mechanism for both running and waiting requests. My review includes suggestions to improve the efficiency of the polling mechanism for resuming generation, clarify some documentation for better developer understanding, and refine error handling to make the client more robust.
Three resolved review threads on skyrl-train/skyrl_train/inference_engines/inference_engine_client.py (outdated).
/gemini review
Code Review
This pull request introduces an important capability for asynchronous reinforcement learning by adding support for pausing, resuming, and aborting generation requests. The implementation is well-structured, with changes spanning the base interface, the client, and the vLLM engine implementation. The addition of abort_generation to the vLLM engine and the pause/resume logic in the InferenceEngineClient are the core of this change. The new unit tests are also a great addition to validate the abort functionality.
My review focuses on improving the robustness of the pausing mechanism, enhancing API consistency, and expanding test coverage to ensure the new features are fully reliable. Overall, this is a solid first step towards supporting in-flight weight updates.
```python
if self.generation_paused_event.is_set():
    raise RuntimeError("Generation is already paused, cannot pause again.")
self.generation_paused_event.set()
await asyncio.sleep(ABORT_GENERATION_GRACE_PERIOD_SECONDS)
```
The use of asyncio.sleep with a fixed grace period to wait for in-flight requests to reach the engine's scheduler creates a potential race condition. If a request takes longer than ABORT_GENERATION_GRACE_PERIOD_SECONDS to be scheduled (e.g., due to high system load or network latency in a distributed Ray environment), it will not be aborted by pause_generation. This could lead to an inconsistent state where some requests continue running after a pause has been initiated.
A more robust solution would involve a deterministic way to ensure all requests are accounted for. While that might add complexity, a pragmatic improvement would be to make this grace period configurable in the application's settings rather than a hardcoded constant. This would allow operators to tune the value based on their specific deployment's characteristics.
Good point, we could add a running request count in InferenceEngineClient and make sure we abort that many requests. Will leave it as a future TODO. I believe this sleep() should account for most scenarios.
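The running-request-count idea mentioned above could be sketched roughly as follows. This is a minimal, hypothetical illustration, not the PR's actual code: the class name, counters, and the `abort_fn` hook are assumptions.

```python
import asyncio


class PauseTracker:
    """Sketch: track in-flight requests so pause can wait until every
    request has returned, instead of sleeping a fixed grace period."""

    def __init__(self):
        self.in_flight = 0
        self.paused = asyncio.Event()
        self._drained = asyncio.Event()
        self._drained.set()  # no requests outstanding initially

    def request_started(self):
        self.in_flight += 1
        self._drained.clear()

    def request_finished(self):
        self.in_flight -= 1
        if self.in_flight == 0:
            self._drained.set()

    async def pause(self, abort_fn):
        if self.paused.is_set():
            raise RuntimeError("Generation is already paused, cannot pause again.")
        self.paused.set()
        await abort_fn()            # ask engines to abort in-flight requests
        await self._drained.wait()  # deterministic: wait for every request to return
```

Here `abort_fn` stands in for broadcasting `abort_generation` to the engines; each aborted request would call `request_finished` when its (possibly partial) result comes back.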
GPU CI running here: https://github.com/NovaSky-AI/SkyRL/actions/runs/18673845719
Ah, GPU CI failed due to trying to pickle the thread.Event.
/gemini review
Code Review
This pull request introduces the ability to pause, resume, and abort inference generation, primarily for the vLLM engine, as a foundational step for in-flight weight updates. The changes include adding abort_generation to the base interface and implementing it for the async vLLM engine, and adding pause_generation and resume_generation logic to the InferenceEngineClient using a threading.Event for state management. The implementation is well-structured and includes a comprehensive test case to validate the new functionality for both running and waiting requests.
My review focuses on improving configurability and documentation clarity. I've suggested making the hardcoded grace period and polling intervals configurable for better flexibility and performance tuning. I also pointed out a minor inaccuracy in a docstring to ensure the behavior of different endpoints is clearly documented.
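The configurable-polling suggestion could look something like the sketch below. The function and parameter names are hypothetical (the PR's actual helper is `_wait_for_generation_to_resume`; its internals are not shown here), and the default interval is an assumption.

```python
import asyncio

DEFAULT_RESUME_POLL_INTERVAL_SECONDS = 0.1  # hypothetical, tunable default


async def wait_for_generation_to_resume(
    paused_event,
    poll_interval: float = DEFAULT_RESUME_POLL_INTERVAL_SECONDS,
) -> None:
    """Poll until generation is no longer paused.

    A fixed-interval poll is simple and works with a threading.Event,
    but each waiter can see up to `poll_interval` of extra latency after
    resume; exposing the interval lets operators trade wakeup frequency
    against resume latency.
    """
    while paused_event.is_set():
        await asyncio.sleep(poll_interval)
```

If the paused flag were an `asyncio.Event` cleared on resume, the loop could instead await a companion "resumed" event directly and avoid polling entirely.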
Resolved review thread on skyrl-train/skyrl_train/inference_engines/inference_engine_client.py (outdated).
The new gemini comments are redundant with the rounds above.
Clarify steps for aborting requests in documentation.
```python
        return await resp.json()

    async def abort_generation(self) -> None:
        raise NotImplementedError("Abort generation is not supported for remote inference engines.")
```
Once you have support on Ray-wrapped inference engines, supporting and testing the remote engine code path would be a good task to create an issue for and hand off to others!
Good point! Added to the tracker.
```python
if self.generation_paused_event.is_set():
    raise RuntimeError("Generation is already paused, cannot pause again.")
self.generation_paused_event.set()
await asyncio.sleep(ABORT_GENERATION_GRACE_PERIOD_SECONDS)
```
I'm pretty suspicious of using a constant here, which can be brittle, but I get why it's used :) A possible alternative is to query vLLM engines until the total number of outstanding requests in the scheduler matches the number of outstanding requests that the inference engine client knows of. But happy to let that come later
Yep, briefly discussed it in the gemini comment. Added to the tracker as well.
```python
tokenizer = AutoTokenizer.from_pretrained(MODEL)

for api in ["chat_completion", "completion"]:
```
To double check: the reason you test both chat_completion and completion here is that we are calling them directly on the engine object (rather than through the inference engine client, which doesn't yet support pause/resume for completion), is that right?
That is correct, I won't support completion retry yet. But for pause and continue, we can test both since we call the engine's function directly.
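Exercising both entry points directly on the engine, as described above, might look like this sketch. `FakeEngine` and its signatures are stand-ins for illustration, not the real vLLM engine API:

```python
import asyncio


class FakeEngine:
    """Stand-in engine exposing both entry points (illustration only;
    the real engine's signatures and return shapes differ)."""

    async def chat_completion(self, messages):
        return {"api": "chat_completion"}

    async def completion(self, prompt):
        return {"api": "completion"}


async def exercise_both_apis(engine):
    # Mirror the test's loop: call each entry point directly on the
    # engine object rather than going through the client.
    results = []
    for api in ["chat_completion", "completion"]:
        if api == "chat_completion":
            out = await engine.chat_completion([{"role": "user", "content": "hi"}])
        else:
            out = await engine.completion("hi")
        results.append(out["api"])
    return results


apis_called = asyncio.run(exercise_both_apis(FakeEngine()))
```

Both paths go straight to the engine, which is why pause/continue can be tested for the completion API even before the client supports it.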
```python
@pytest.mark.vllm
def test_abort_generation_vllm_engine(ray_init_fixture):
```
Can we add a todo to also test the inference engine client? Namely, this test doesn't cover logic in chat_completion, such as calling _wait_for_generation_to_resume
Assuming the 2/N PR will be merged soon, I will skip the TODO comment here :)
This PR is the first step toward supporting in-flight weight updates for Async RL, as described in literature like AReal and PipelineRL.
We implement the following:
- `abort_generation` for the vLLM engine, which calls `self.engine.abort(...)`, making all in-flight requests return their already completed tokens
- `pause_generation()`, `resume_generation()`, and `_wait_for_generation_to_resume()`, revolving around the control event `generation_paused_event`

Note this PR does not affect any existing codepath.
We also add unit tests that abort 2 running and 2 waiting requests -- all should return, with the latter returning zero completion tokens.
Tracked in #536
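Put together, the intended in-flight weight-update flow this PR enables could be sketched as below. `StubClient` is a minimal stand-in: the method names `pause_generation`/`resume_generation` and the `generation_paused_event` attribute come from the PR description, but the bodies here are illustrative stubs, not the real `InferenceEngineClient`.

```python
import asyncio


class StubClient:
    """Illustrative stand-in for InferenceEngineClient."""

    def __init__(self):
        self.generation_paused_event = asyncio.Event()
        self.log = []

    async def pause_generation(self):
        self.generation_paused_event.set()
        # In the real client this aborts in-flight requests, which
        # return their already completed tokens.
        self.log.append("abort_generation")

    async def resume_generation(self):
        self.generation_paused_event.clear()
        self.log.append("resumed")


async def update_weights_in_flight(client):
    # Pause (aborting in-flight requests) -> sync weights -> resume.
    await client.pause_generation()
    client.log.append("weights_synced")  # stand-in for the weight update
    await client.resume_generation()


client = StubClient()
asyncio.run(update_weights_in_flight(client))
```

New requests arriving while paused would block in `_wait_for_generation_to_resume()` until the event is cleared again.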