
[AsyncRL][1/N] Add abort_generation to vllm engine and pause/continue generation to client #537

Merged
CharlieFRuan merged 6 commits into main from charlie-1020-abort-gen on Oct 25, 2025

Conversation

@CharlieFRuan (Member) commented Oct 21, 2025

This PR is the first step toward supporting in-flight weight updates for Async RL, as described in work such as AReal and PipelineRL.

We implement the following:

  • abort_generation for the vLLM engine, which calls self.engine.abort(...), making all in-flight requests return the tokens they have completed so far
  • pause_generation(), resume_generation(), and _wait_for_generation_to_resume(), built around the control event generation_paused_event (a sketch of the intended flow follows below)
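
As a minimal sketch of how these pieces are meant to interact, assuming an asyncio-based client — the class shape, engine interface, grace-period value, and poll interval are illustrative assumptions; only the method names mirror the PR:

```python
# Illustrative sketch only, not the PR's exact implementation.
import asyncio

ABORT_GENERATION_GRACE_PERIOD_SECONDS = 1.0  # hypothetical value


class InferenceEngineClientSketch:
    def __init__(self, engines):
        self.engines = engines
        self.generation_paused_event = asyncio.Event()

    async def pause_generation(self):
        if self.generation_paused_event.is_set():
            raise RuntimeError("Generation is already paused, cannot pause again.")
        self.generation_paused_event.set()
        # Grace period so already-dispatched requests reach the scheduler
        # before we abort them (see the review discussion below).
        await asyncio.sleep(ABORT_GENERATION_GRACE_PERIOD_SECONDS)
        # Abort in-flight requests; each returns the tokens completed so far.
        await asyncio.gather(*(e.abort_generation() for e in self.engines))

    async def resume_generation(self):
        self.generation_paused_event.clear()

    async def _wait_for_generation_to_resume(self):
        # Request paths (e.g. chat_completion) block here while paused.
        while self.generation_paused_event.is_set():
            await asyncio.sleep(0.1)  # hypothetical poll interval
```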

Note this PR does not affect any existing codepaths.

We also add unit tests that abort 2 running and 2 waiting requests -- all four should return, with the waiting ones returning zero completion tokens.

Tracked in #536

@CharlieFRuan changed the title from "[AsyncRL][1/N] Add abort_generation to vllm engine and pause/continue generation" to "[AsyncRL][1/N] Add abort_generation to vllm engine and pause/continue generation to client" Oct 21, 2025

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces functionality to pause, resume, and abort generation requests, primarily for the vLLM engine, as a foundational step for in-flight weight updates in Async RL. The changes include adding an abort_generation method to the inference engine interface, implementing it for vLLM, and adding pause_generation and resume_generation logic in InferenceEngineClient. A comprehensive test suite is also added to validate the new abort functionality. My review identifies a critical blocking call within an async function and inconsistencies in handling the paused state across different methods. I've provided specific code suggestions to address these points. Overall, the changes are well-structured and move effectively towards the desired feature.

@CharlieFRuan (Member Author)

/gemini review


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces important functionality for managing generation requests by adding abort_generation to the vLLM engine and implementing pause/resume logic in the InferenceEngineClient. These changes are a solid foundation for supporting in-flight weight updates. The code is well-structured, and the new unit tests effectively validate the abort mechanism for both running and waiting requests. My review includes suggestions to improve the efficiency of the polling mechanism for resuming generation, clarify some documentation for better developer understanding, and refine error handling to make the client more robust.

@CharlieFRuan (Member Author)

/gemini review


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces an important capability for asynchronous reinforcement learning by adding support for pausing, resuming, and aborting generation requests. The implementation is well-structured, with changes spanning the base interface, the client, and the vLLM engine implementation. The addition of abort_generation to the vLLM engine and the pause/resume logic in the InferenceEngineClient are the core of this change. The new unit tests are also a great addition to validate the abort functionality.

My review focuses on improving the robustness of the pausing mechanism, enhancing API consistency, and expanding test coverage to ensure the new features are fully reliable. Overall, this is a solid first step towards supporting in-flight weight updates.

if self.generation_paused_event.is_set():
    raise RuntimeError("Generation is already paused, cannot pause again.")
self.generation_paused_event.set()
await asyncio.sleep(ABORT_GENERATION_GRACE_PERIOD_SECONDS)


Severity: high

The use of asyncio.sleep with a fixed grace period to wait for in-flight requests to reach the engine's scheduler creates a potential race condition. If a request takes longer than ABORT_GENERATION_GRACE_PERIOD_SECONDS to be scheduled (e.g., due to high system load or network latency in a distributed Ray environment), it will not be aborted by pause_generation. This could lead to an inconsistent state where some requests continue running after a pause has been initiated.

A more robust solution would involve a deterministic way to ensure all requests are accounted for. While that might add complexity, a pragmatic improvement would be to make this grace period configurable in the application's settings rather than a hardcoded constant. This would allow operators to tune the value based on their specific deployment's characteristics.

@CharlieFRuan (Member Author)


Good point, we could add a running-request count in InferenceEngineClient and make sure we have aborted that many requests. Will leave it as a future TODO; I believe this sleep() should cover most scenarios.
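
A hedged sketch of that future TODO — a client-side counter that replaces the fixed grace-period sleep with a bounded poll. None of these names come from the PR:

```python
# Hypothetical client-side request counter, per the TODO above.
import asyncio


class RequestCounter:
    def __init__(self):
        self.in_flight = 0

    def on_submit(self):   # call when a request is dispatched to an engine
        self.in_flight += 1

    def on_finish(self):   # call when a request completes or is aborted
        self.in_flight -= 1


async def wait_until_drained(counter, poll_interval=0.05, timeout=30.0):
    # Poll until every dispatched request has finished or been aborted,
    # instead of sleeping for a fixed grace period.
    loop = asyncio.get_running_loop()
    deadline = loop.time() + timeout
    while counter.in_flight > 0:
        if loop.time() > deadline:
            raise TimeoutError("in-flight requests did not drain before pause")
        await asyncio.sleep(poll_interval)
```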


@CharlieFRuan (Member Author) commented Oct 21, 2025

ah, GPU CI failed due to trying to pickle the threading.Event; fixed, new run here: https://github.com/NovaSky-AI/SkyRL/actions/runs/18701237223
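
For reference, this failure mode is easy to reproduce: threading.Event wraps a lock, and locks cannot be serialized across process boundaries the way pickling (e.g. for Ray) requires:

```python
# Reproduces the class of failure mentioned above: threading.Event holds a
# _thread.lock, which pickle refuses to serialize.
import pickle
import threading

try:
    pickle.dumps(threading.Event())
except TypeError as e:
    print(e)  # e.g. "cannot pickle '_thread.lock' object"
```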

@CharlieFRuan (Member Author)

/gemini review


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces the ability to pause, resume, and abort inference generation, primarily for the vLLM engine, as a foundational step for in-flight weight updates. The changes include adding abort_generation to the base interface and implementing it for the async vLLM engine, and adding pause_generation and resume_generation logic to the InferenceEngineClient using a threading.Event for state management. The implementation is well-structured and includes a comprehensive test case to validate the new functionality for both running and waiting requests.

My review focuses on improving configurability and documentation clarity. I've suggested making the hardcoded grace period and polling intervals configurable for better flexibility and performance tuning. I also pointed out a minor inaccuracy in a docstring to ensure the behavior of different endpoints is clearly documented.

@CharlieFRuan (Member Author)

The new gemini comments are redundant with the rounds above.

Clarify steps for aborting requests in documentation.
        return await resp.json()

    async def abort_generation(self) -> None:
        raise NotImplementedError("Abort generation is not supported for remote inference engines.")
Member


Once you have support on Ray-wrapped inference engines, tasks like supporting and testing the remote engine code path are good ones to create an issue for and hand off to others!

@CharlieFRuan (Member Author)


good point! Added to the tracker

if self.generation_paused_event.is_set():
    raise RuntimeError("Generation is already paused, cannot pause again.")
self.generation_paused_event.set()
await asyncio.sleep(ABORT_GENERATION_GRACE_PERIOD_SECONDS)
Member


I'm pretty suspicious of using a constant here, which can be brittle, but I get why it's used :) A possible alternative is to query vLLM engines until the total number of outstanding requests in the scheduler matches the number of outstanding requests that the inference engine client knows of. But happy to let that come later

@CharlieFRuan (Member Author) commented Oct 25, 2025


Yep, briefly discussed it in the gemini comment above. Added to the tracker as well
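
A hedged sketch of the engine-side alternative raised here — poll until the engines report no outstanding work. The get_num_unfinished_requests() accessor is an assumption about the engine interface (vLLM's LLMEngine has historically exposed a method by that name), not a confirmed SkyRL API:

```python
# Illustrative engine-side drain check, per the discussion above.
import asyncio


async def wait_for_engines_to_drain(engines, poll_interval=0.05):
    while True:
        counts = await asyncio.gather(
            *(e.get_num_unfinished_requests() for e in engines)
        )
        if sum(counts) == 0:
            return
        await asyncio.sleep(poll_interval)
```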


tokenizer = AutoTokenizer.from_pretrained(MODEL)

for api in ["chat_completion", "completion"]:
Member


To double check, the reason why you test both chat_completion and completion here is because we are calling it directly on the engine object (rather than through the inference engine client, which doesn't yet support pause/resume for completion), is that right?

@CharlieFRuan (Member Author)


That is correct, I won't support completion retry yet. But for pause and continue, we can test both since we call the engine's functions directly.



@pytest.mark.vllm
def test_abort_generation_vllm_engine(ray_init_fixture):
Member


Can we add a todo to also test the inference engine client? Namely, this test doesn't cover logic in chat_completion, such as calling _wait_for_generation_to_resume

@CharlieFRuan (Member Author)


Assuming the 2/N PR will be merged soon, I will skip the TODO comment here :)
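
For when that 2/N PR lands, one hypothetical shape for the client-level test (the `client` fixture, payload, and exact call signatures are assumptions; requires pytest-asyncio):

```python
# Hypothetical client-level test shape for the follow-up discussed above.
import asyncio
import pytest


@pytest.mark.asyncio
async def test_chat_completion_waits_for_resume(client):
    await client.pause_generation()
    task = asyncio.create_task(
        client.chat_completion(messages=[{"role": "user", "content": "hi"}])
    )
    await asyncio.sleep(0.5)
    # Request should block in _wait_for_generation_to_resume while paused.
    assert not task.done()
    await client.resume_generation()
    result = await asyncio.wait_for(task, timeout=60)
    assert result is not None
```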

@CharlieFRuan CharlieFRuan merged commit 94651ac into main Oct 25, 2025
3 checks passed
@CharlieFRuan CharlieFRuan deleted the charlie-1020-abort-gen branch October 27, 2025 04:23
li-boxuan pushed a commit to li-boxuan/SkyRL that referenced this pull request Nov 23, 2025
… generation to client (NovaSky-AI#537)

dzorlu pushed a commit to fleet-ai/SkyRL that referenced this pull request Feb 4, 2026
… generation to client (NovaSky-AI#537)
