[Continuous Batching] Snapshot generation outputs without mutating request state#46670
Conversation
|
|
||
| def to_generation_output(self): | ||
| """Convert the request state to a GenerationOutput object.""" | ||
| prompt_ids = self.initial_tokens |
There was a problem hiding this comment.
Could we move this definition to an else block to improve readability? Same with generated_tokens I think
| """Convert the request state to a GenerationOutput object.""" | ||
| prompt_ids = self.initial_tokens | ||
| generated_tokens = self.generated_tokens[:] | ||
| logprobs = self.logprobs[:] |
There was a problem hiding this comment.
No need to define intermediate variables here, no? We can just inline them when creating GenerationOutput
2d74078 to
268744f
Compare
|
Thanks, addressed both nits. I kept the list copies for the mutable output fields so the snapshot behavior stays intact. |
|
CI Dashboard: View test results in Grafana |
remi-or
left a comment
There was a problem hiding this comment.
LGTM! Thanks for the contribution.
|
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update. |
|
Thanks @remi-or! Those nits were good calls, and the code reads cleaner now. Appreciate the careful review and merge. |
What does this PR do?
This fixes
RequestState.to_generation_output()in continuous batching so thatbuilding a
GenerationOutputdoes not mutate the active request state or returnlive aliases for growing output buffers.
Streaming continuous batching can call
to_generation_output()after everygenerated token. On current
main, that method returns the same list objectsheld by the active
RequestStateforgenerated_tokens,logprobs, andtimestamps. It also mutatesself.generated_tokensandself.initial_tokensin the
_true_initial_tokenssoft-reset path.That creates two observable issues:
generated
prompt-suffix tokens back into
state.generated_tokens, so the request canterminate with fewer generated tokens than requested
This PR makes output conversion a snapshot operation. It keeps
prompt_idsbyreference on the normal path to avoid copying the prompt for every streamed
token, copies only the growing buffers, and computes the soft-reset
prompt/generated split locally without changing
RequestState.I added two CPU-only regression tests:
test_generation_outputs_are_snapshotstest_streaming_output_after_soft_reset_does_not_shorten_generationThese fail on the previous implementation and pass with this change.
As an additional check, I ran a CUDA end-to-end run with real continuous
batching streaming requests,
use_cuda_graph=True, forced soft reset(
offload_calls=7,soft_reset_calls=7),do_sample=False, andmax_new_tokens=30. On the unpatched source, several requests stopped at 21-23generated tokens and some previously returned first chunks later had length
20/22/8. With this patch, all 12 requests reached 30 generated tokens and all
first chunks remained length 1. I can provide a small repro harness if useful.
Code Agent Policy
Before submitting
Pull Request section?
to it if that's the case.
documentation guidelines, and
here are tips on formatting docstrings.