[AsyncRL][1/N] Add abort_generation to vllm engine and pause/continue generation to client (#537)
Conversation
Code Review
This pull request introduces functionality to pause, resume, and abort generation requests, primarily for the vLLM engine, as a foundational step for in-flight weight updates in Async RL. The changes include adding an abort_generation method to the inference engine interface, implementing it for vLLM, and adding pause_generation and resume_generation logic in InferenceEngineClient. A comprehensive test suite is also added to validate the new abort functionality. My review identifies a critical blocking call within an async function and inconsistencies in handling the paused state across different methods. I've provided specific code suggestions to address these points. Overall, the changes are well-structured and move effectively towards the desired feature.
Three resolved review threads on skyrl-train/skyrl_train/inference_engines/inference_engine_client.py (outdated).
/gemini review
Code Review
This pull request introduces important functionality for managing generation requests by adding abort_generation to the vLLM engine and implementing pause/resume logic in the InferenceEngineClient. These changes are a solid foundation for supporting in-flight weight updates. The code is well-structured, and the new unit tests effectively validate the abort mechanism for both running and waiting requests. My review includes suggestions to improve the efficiency of the polling mechanism for resuming generation, clarify some documentation for better developer understanding, and refine error handling to make the client more robust.
Three resolved review threads on skyrl-train/skyrl_train/inference_engines/inference_engine_client.py (outdated).
/gemini review
Code Review
This pull request introduces an important capability for asynchronous reinforcement learning by adding support for pausing, resuming, and aborting generation requests. The implementation is well-structured, with changes spanning the base interface, the client, and the vLLM engine implementation. The addition of abort_generation to the vLLM engine and the pause/resume logic in the InferenceEngineClient are the core of this change. The new unit tests are also a great addition to validate the abort functionality.
My review focuses on improving the robustness of the pausing mechanism, enhancing API consistency, and expanding test coverage to ensure the new features are fully reliable. Overall, this is a solid first step towards supporting in-flight weight updates.
```python
if self.generation_paused_event.is_set():
    raise RuntimeError("Generation is already paused, cannot pause again.")
self.generation_paused_event.set()
await asyncio.sleep(ABORT_GENERATION_GRACE_PERIOD_SECONDS)
```
The use of asyncio.sleep with a fixed grace period to wait for in-flight requests to reach the engine's scheduler creates a potential race condition. If a request takes longer than ABORT_GENERATION_GRACE_PERIOD_SECONDS to be scheduled (e.g., due to high system load or network latency in a distributed Ray environment), it will not be aborted by pause_generation. This could lead to an inconsistent state where some requests continue running after a pause has been initiated.
A more robust solution would involve a deterministic way to ensure all requests are accounted for. While that might add complexity, a pragmatic improvement would be to make this grace period configurable in the application's settings rather than a hardcoded constant. This would allow operators to tune the value based on their specific deployment's characteristics.
Good point, we could add a running request count in InferenceEngineClient and make sure we abort that many requests. Will leave it as a future TODO. I believe this sleep() should account for most scenarios.
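The running-request-count idea mentioned above could be sketched roughly as follows. This is a minimal, hypothetical illustration, not the PR's actual code: the class name, counters, and the `abort_fn` hook are assumptions.

```python
import asyncio


class PauseTracker:
    """Sketch: track in-flight requests so pause can wait until every
    request has returned, instead of sleeping a fixed grace period."""

    def __init__(self):
        self.in_flight = 0
        self.paused = asyncio.Event()
        self._drained = asyncio.Event()
        self._drained.set()  # no requests outstanding initially

    def request_started(self):
        self.in_flight += 1
        self._drained.clear()

    def request_finished(self):
        self.in_flight -= 1
        if self.in_flight == 0:
            self._drained.set()

    async def pause(self, abort_fn):
        if self.paused.is_set():
            raise RuntimeError("Generation is already paused, cannot pause again.")
        self.paused.set()
        await abort_fn()            # ask engines to abort in-flight requests
        await self._drained.wait()  # deterministic: wait for every request to return
```

Here `abort_fn` stands in for broadcasting `abort_generation` to the engines; each aborted request would call `request_finished` when its (possibly partial) result comes back.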
GPU CI running here: https://github.com/NovaSky-AI/SkyRL/actions/runs/18673845719
Ah, GPU CI failed due to trying to pickle the thread.Event.
/gemini review
Code Review
This pull request introduces the ability to pause, resume, and abort inference generation, primarily for the vLLM engine, as a foundational step for in-flight weight updates. The changes include adding abort_generation to the base interface and implementing it for the async vLLM engine, and adding pause_generation and resume_generation logic to the InferenceEngineClient using a threading.Event for state management. The implementation is well-structured and includes a comprehensive test case to validate the new functionality for both running and waiting requests.
My review focuses on improving configurability and documentation clarity. I've suggested making the hardcoded grace period and polling intervals configurable for better flexibility and performance tuning. I also pointed out a minor inaccuracy in a docstring to ensure the behavior of different endpoints is clearly documented.
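The configurable-polling suggestion could look something like the sketch below. The function and parameter names are hypothetical (the PR's actual helper is `_wait_for_generation_to_resume`; its internals are not shown here), and the default interval is an assumption.

```python
import asyncio

DEFAULT_RESUME_POLL_INTERVAL_SECONDS = 0.1  # hypothetical, tunable default


async def wait_for_generation_to_resume(
    paused_event,
    poll_interval: float = DEFAULT_RESUME_POLL_INTERVAL_SECONDS,
) -> None:
    """Poll until generation is no longer paused.

    A fixed-interval poll is simple and works with a threading.Event,
    but each waiter can see up to `poll_interval` of extra latency after
    resume; exposing the interval lets operators trade wakeup frequency
    against resume latency.
    """
    while paused_event.is_set():
        await asyncio.sleep(poll_interval)
```

If the paused flag were an `asyncio.Event` cleared on resume, the loop could instead await a companion "resumed" event directly and avoid polling entirely.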
Resolved review thread on skyrl-train/skyrl_train/inference_engines/inference_engine_client.py (outdated).
The new gemini comments are redundant with the rounds above.
Clarify steps for aborting requests in documentation.
```python
        return await resp.json()

    async def abort_generation(self) -> None:
        raise NotImplementedError("Abort generation is not supported for remote inference engines.")
```
Once you have support on Ray-wrapped inference engines, supporting and testing the remote engine code path would be a good task to create an issue for and hand off to others!
Good point! Added to the tracker.
```python
if self.generation_paused_event.is_set():
    raise RuntimeError("Generation is already paused, cannot pause again.")
self.generation_paused_event.set()
await asyncio.sleep(ABORT_GENERATION_GRACE_PERIOD_SECONDS)
```
I'm pretty suspicious of using a constant here, which can be brittle, but I get why it's used :) A possible alternative is to query vLLM engines until the total number of outstanding requests in the scheduler matches the number of outstanding requests that the inference engine client knows of. But happy to let that come later
Yep, briefly discussed it in the gemini comment. Added to the tracker as well.
```python
tokenizer = AutoTokenizer.from_pretrained(MODEL)

for api in ["chat_completion", "completion"]:
```
To double check: the reason you test both chat_completion and completion here is that we are calling them directly on the engine object (rather than through the inference engine client, which doesn't yet support pause/resume for completion), is that right?
That is correct, I won't support completion retry yet. But for pause and continue, we can test both since we call the engine's function directly.
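Exercising both entry points directly on the engine, as described above, might look like this sketch. `FakeEngine` and its signatures are stand-ins for illustration, not the real vLLM engine API:

```python
import asyncio


class FakeEngine:
    """Stand-in engine exposing both entry points (illustration only;
    the real engine's signatures and return shapes differ)."""

    async def chat_completion(self, messages):
        return {"api": "chat_completion"}

    async def completion(self, prompt):
        return {"api": "completion"}


async def exercise_both_apis(engine):
    # Mirror the test's loop: call each entry point directly on the
    # engine object rather than going through the client.
    results = []
    for api in ["chat_completion", "completion"]:
        if api == "chat_completion":
            out = await engine.chat_completion([{"role": "user", "content": "hi"}])
        else:
            out = await engine.completion("hi")
        results.append(out["api"])
    return results


apis_called = asyncio.run(exercise_both_apis(FakeEngine()))
```

Both paths go straight to the engine, which is why pause/continue can be tested for the completion API even before the client supports it.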
```python
@pytest.mark.vllm
def test_abort_generation_vllm_engine(ray_init_fixture):
```
Can we add a todo to also test the inference engine client? Namely, this test doesn't cover logic in chat_completion, such as calling _wait_for_generation_to_resume
Assuming the 2/N PR will be merged soon, I will skip the TODO comment here :)
This PR is the first step toward supporting in-flight weight updates for Async RL, as described in literature like AReal and PipelineRL.
We implement the following:
- `abort_generation` for the vLLM engine, which calls `self.engine.abort(...)`, making all in-flight requests return their already completed tokens
- `pause_generation()`, `resume_generation()`, and `_wait_for_generation_to_resume()`, revolving around the control event `generation_paused_event`

Note this PR does not affect any existing codepath.
We also add unit tests that abort 2 running and 2 waiting requests -- all should return, with the latter returning zero completion tokens.
Tracked in #536
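Put together, the intended in-flight weight-update flow this PR enables could be sketched as below. `StubClient` is a minimal stand-in: the method names `pause_generation`/`resume_generation` and the `generation_paused_event` attribute come from the PR description, but the bodies here are illustrative stubs, not the real `InferenceEngineClient`.

```python
import asyncio


class StubClient:
    """Illustrative stand-in for InferenceEngineClient."""

    def __init__(self):
        self.generation_paused_event = asyncio.Event()
        self.log = []

    async def pause_generation(self):
        self.generation_paused_event.set()
        # In the real client this aborts in-flight requests, which
        # return their already completed tokens.
        self.log.append("abort_generation")

    async def resume_generation(self):
        self.generation_paused_event.clear()
        self.log.append("resumed")


async def update_weights_in_flight(client):
    # Pause (aborting in-flight requests) -> sync weights -> resume.
    await client.pause_generation()
    client.log.append("weights_synced")  # stand-in for the weight update
    await client.resume_generation()


client = StubClient()
asyncio.run(update_weights_in_flight(client))
```

New requests arriving while paused would block in `_wait_for_generation_to_resume()` until the event is cleared again.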