Support sliding window in CB#40688

Merged
ArthurZucker merged 40 commits into huggingface:main from remi-or:conbat-hybrid
Sep 9, 2025

Conversation

@remi-or
Collaborator

@remi-or remi-or commented Sep 4, 2025

This PR introduces support for sliding window attention in CB, using an allocator that can serve both full attention and sliding window attention. The code runs with nearly identical performance (665 -> 655 tok/s in the example), but I am putting the PR in draft status until:

  • Code is tested on Nvidia with and without sliding window
  • TODO on memory computation is addressed
  • Code is refactored in some places to get smaller files / more code reuse
  • Way more documentation is added to explain the allocation mechanism
  • TODO addressed for very small window sizes
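To illustrate the allocation idea described above, here is a toy sketch of a cache manager that serves both full-attention and sliding-window layers (hypothetical class and names for illustration only, not the PR's actual implementation): a full-attention layer needs blocks for the whole sequence, while a sliding-window layer never needs to back more than the last `sliding_window` tokens, so its block count is capped.

```python
# Toy sketch of a hybrid cache allocator (hypothetical, illustrative only).
import math

BLOCK_SIZE = 16  # tokens per physical cache block


class ToyCacheManager:
    def __init__(self, sliding_window=None):
        # sliding_window=None means full attention for this layer.
        self.sliding_window = sliding_window
        self.blocks = []  # physical block ids backing the sequence

    def blocks_needed(self, seq_len):
        # A sliding-window layer only ever attends to the last
        # `sliding_window` tokens, so its block count is bounded.
        if self.sliding_window is not None:
            seq_len = min(seq_len, self.sliding_window)
        return math.ceil(seq_len / BLOCK_SIZE)

    def extend_to(self, seq_len):
        # Allocate blocks lazily as the sequence grows.
        needed = self.blocks_needed(seq_len)
        while len(self.blocks) < needed:
            self.blocks.append(len(self.blocks))  # pretend-allocate a block
        return self.blocks


full = ToyCacheManager()
sliding = ToyCacheManager(sliding_window=32)
full.extend_to(100)     # grows with the sequence: ceil(100/16) = 7 blocks
sliding.extend_to(100)  # capped at ceil(32/16) = 2 blocks, reused ring-style
```

The memory saving is exactly this cap: however long the sequence gets, a sliding-window layer's cache footprint stays constant.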

@remi-or remi-or self-assigned this Sep 4, 2025
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@remi-or remi-or force-pushed the conbat-hybrid branch 7 times, most recently from c3f6e76 to dbf7b83 Compare September 8, 2025 09:54
@remi-or remi-or marked this pull request as ready for review September 8, 2025 11:25
@remi-or
Collaborator Author

remi-or commented Sep 8, 2025

Removing draft status. Benchmarks with and without this PR:

| Hardware | Implem | Throughput on main | Throughput after PR |
|----------|--------|--------------------|---------------------|
| MI325    | Eager  | 1200.55 tok/s      | 1294.30 tok/s       |
| MI325    | SDPA   | 666.04 tok/s       | 687.0 tok/s         |
| H100     | Eager  | 680.58 tok/s       | 728.02 tok/s        |
| H100     | SDPA   | 902.37 tok/s       | 974.84 tok/s        |
| H100     | Flash  | 1412.85 tok/s      | 1564.68 tok/s       |

with `python3 examples/pytorch/continuous_batching.py --attn $attn -mp none --slice-inputs --samples 100`
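For a quick read of the table, the relative speedups can be computed directly from the reported numbers (a small helper, not part of the PR):

```python
# Relative throughput after the PR, computed from the benchmark table above.
results = {
    ("MI325", "Eager"): (1200.55, 1294.30),
    ("MI325", "SDPA"): (666.04, 687.0),
    ("H100", "Eager"): (680.58, 728.02),
    ("H100", "SDPA"): (902.37, 974.84),
    ("H100", "Flash"): (1412.85, 1564.68),
}

# Ratio of post-PR throughput to main-branch throughput (>1.0 is a speedup).
speedups = {k: after / before for k, (before, after) in results.items()}

for (hw, impl), s in speedups.items():
    print(f"{hw} {impl}: {s:.1%} of baseline throughput")
```

Every configuration improves, with the largest relative gain on H100 Flash.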

Collaborator

@ArthurZucker ArthurZucker left a comment

Very very nice! 🚀

Collaborator

@ArthurZucker ArthurZucker left a comment

Let's remove global var SLIDING_WINDOW

Member

@McPatate McPatate left a comment

🔥

input_ids = torch.tensor([input_ids]).to("cuda")
attention_mask = torch.ones_like(input_ids)
outputs = model.generate(input_ids, attention_mask=attention_mask, generation_config=generation_config)
# attention_mask = torch.ones_like(input_ids)
Member

Suggested change
# attention_mask = torch.ones_like(input_ids)

?

Collaborator Author

It throws a warning otherwise.

self._generation_thread = threading.Thread(target=self._run_generation_loop)
self._generation_thread.start()
logger.info("Continuous batching manager started.")
logger.info(f"Continuous batching manager started at {time.time() - self.creation_time} seconds")
Member

Suggested change
logger.info(f"Continuous batching manager started at {time.time() - self.creation_time} seconds")
logger.info(f"Continuous batching manager started at {time.time() - self.creation_time}")

This isn't really useful imo; usually you configure your logger to include a timestamp. I'd still keep the start time in case we need it for metrics and whatnot.

Collaborator Author

Kept it as debug with a better message.

Co-authored-by: Luc Georges <McPatate@users.noreply.github.com>
@ArthurZucker ArthurZucker merged commit 1cdbbb3 into huggingface:main Sep 9, 2025
21 of 23 checks passed
@qgallouedec qgallouedec mentioned this pull request Sep 11, 2025
4 tasks
vijayabhaskar-ev pushed a commit to vijayabhaskar-ev/transformers that referenced this pull request Oct 2, 2025
* CB example: better compare feature

* Cache managers, still issue w/ effective length

* WIP -- fix for effective length

* Renames

* Working, need better parity checks, we might be missing 1 token

* Small fixes

* Fixed wrong attn mask and broke cache into pieces

* Warmup is slowing down things, disabling it

* Cache was too big, fixed

* Simplified index objects

* Added a profile option to the example

* Avoid calls to memory reporting tools

* Restore full attention read indices for better latency

* Addressed some TODOs and style

* Docstrings for cache managers

* Docstrings for Schedulers

* Refactor schedulers

* [Important] Cache fix for sliding window, check with small sw size

* Updated doc for cache memory compute and cache as a whole

* Moved a todo

* Nits and style

* Fix for when sliding window is smaller than max batch per token

* Paged interface update

* Support for Flash in new API

* Fix example CB

* Fix bug in CB for paged

* Revert example

* Style

* Review compliance

* Style

* Styleeeee

* Removed NO_SLIDING_WINDOW

* Review huggingface#2 compliance

* Better art

* Turn cum_seqlens_k in a dict

* Attn mask is now a dict

* Update examples/pytorch/continuous_batching.py

Co-authored-by: Luc Georges <McPatate@users.noreply.github.com>

* Addressed McPatate pro review

* Style and fix

---------

Co-authored-by: Luc Georges <McPatate@users.noreply.github.com>
