Skip to content

[Continuous Batching] Snapshot generation outputs without mutating request state#46670

Merged
remi-or merged 1 commit into
huggingface:mainfrom
Incheonkirin:fix-cb-streaming-output-snapshot
Jun 16, 2026
Merged

[Continuous Batching] Snapshot generation outputs without mutating request state#46670
remi-or merged 1 commit into
huggingface:mainfrom
Incheonkirin:fix-cb-streaming-output-snapshot

Conversation

@Incheonkirin

Copy link
Copy Markdown
Contributor

What does this PR do?

This fixes RequestState.to_generation_output() in continuous batching so that
building a GenerationOutput does not mutate the active request state or return
live aliases for growing output buffers.

Streaming continuous batching can call to_generation_output() after every
generated token. On current main, that method returns the same list objects
held by the active RequestState for generated_tokens, logprobs, and
timestamps. It also mutates self.generated_tokens and self.initial_tokens
in the _true_initial_tokens soft-reset path.

That creates two observable issues:

  • a previously returned streaming output can change after later tokens are
    generated
  • after soft reset, repeated output conversion can fold already-generated
    prompt-suffix tokens back into state.generated_tokens, so the request can
    terminate with fewer generated tokens than requested

This PR makes output conversion a snapshot operation. It keeps prompt_ids by
reference on the normal path to avoid copying the prompt for every streamed
token, copies only the growing buffers, and computes the soft-reset
prompt/generated split locally without changing RequestState.

I added two CPU-only regression tests:

  • test_generation_outputs_are_snapshots
  • test_streaming_output_after_soft_reset_does_not_shorten_generation

These fail on the previous implementation and pass with this change.

As an additional check, I ran a CUDA end-to-end run with real continuous
batching streaming requests, use_cuda_graph=True, forced soft reset
(offload_calls=7, soft_reset_calls=7), do_sample=False, and
max_new_tokens=30. On the unpatched source, several requests stopped at 21-23
generated tokens and some previously returned first chunks later had length
20/22/8. With this patch, all 12 requests reached 30 generated tokens and all
first chunks remained length 1. I can provide a small repro harness if useful.

Code Agent Policy

  • I confirm that this is not a pure code agent PR.

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

@Rocketknight1

Copy link
Copy Markdown
Member

cc @remi-or @McPatate for CB

@remi-or remi-or left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch! Thanks for the contribution, I just have some style-related nits :)


def to_generation_output(self):
"""Convert the request state to a GenerationOutput object."""
prompt_ids = self.initial_tokens

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we move this definition to an else block to improve readability? Same with generated_tokens I think

"""Convert the request state to a GenerationOutput object."""
prompt_ids = self.initial_tokens
generated_tokens = self.generated_tokens[:]
logprobs = self.logprobs[:]

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No need to define intermediate variables here, no? We can just inline them when creating GenerationOutput

@Incheonkirin Incheonkirin force-pushed the fix-cb-streaming-output-snapshot branch from 2d74078 to 268744f Compare June 15, 2026 23:15
@Incheonkirin

Copy link
Copy Markdown
Contributor Author

Thanks, addressed both nits. I kept the list copies for the mutable output fields so the snapshot behavior stays intact.

@github-actions

Copy link
Copy Markdown
Contributor

CI Dashboard: View test results in Grafana

@remi-or remi-or left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Thanks for the contribution.

@HuggingFaceDocBuilderDev

Copy link
Copy Markdown

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@remi-or remi-or added this pull request to the merge queue Jun 16, 2026
Merged via the queue into huggingface:main with commit 7ff490a Jun 16, 2026
121 checks passed
@Incheonkirin

Copy link
Copy Markdown
Contributor Author

Thanks @remi-or! Those nits were good calls, and the code reads cleaner now. Appreciate the careful review and merge.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants