Support sliding window in CB (#40688)
Removing draft status; benchmarking with / without this PR gives:
ArthurZucker left a comment:
Let's remove global var SLIDING_WINDOW
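A minimal sketch of one way to address this request (hypothetical names, not the PR's actual code): instead of a module-level `SLIDING_WINDOW` global, carry the window size as an attribute of the cache object, with `None` meaning full attention.

```python
class PagedCacheSketch:
    """Hypothetical cache object that owns its sliding-window setting."""

    def __init__(self, sliding_window=None):
        # None means every layer uses full attention
        self.sliding_window = sliding_window

    def effective_window(self, seq_len):
        # Number of past tokens a query can actually attend to
        if self.sliding_window is None:
            return seq_len
        return min(self.sliding_window, seq_len)
```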
The code under discussion:

```python
input_ids = torch.tensor([input_ids]).to("cuda")
attention_mask = torch.ones_like(input_ids)
outputs = model.generate(input_ids, attention_mask=attention_mask, generation_config=generation_config)
```

Reviewer, on the line `# attention_mask = torch.ones_like(input_ids)`:

?

Reply:

Throws a warning otherwise.
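For context, a hedged sketch of why the explicit mask matters (pure-Python illustration, not the PR's code): when padding cannot be inferred from the ids alone (e.g. the pad token doubles as another token), `generate()` warns unless an `attention_mask` is supplied. Building the mask from the token ids directly:

```python
def build_attention_mask(input_ids, pad_token_id):
    # 1 for real tokens, 0 for padding; passing this mask explicitly
    # avoids the "attention mask is not set" warning, since the model
    # cannot reliably infer padding from the ids alone.
    return [[0 if tok == pad_token_id else 1 for tok in seq] for seq in input_ids]
```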
```diff
  self._generation_thread = threading.Thread(target=self._run_generation_loop)
  self._generation_thread.start()
- logger.info("Continuous batching manager started.")
+ logger.info(f"Continuous batching manager started at {time.time() - self.creation_time} seconds")
```
Suggested change:

```diff
- logger.info(f"Continuous batching manager started at {time.time() - self.creation_time} seconds")
+ logger.info(f"Continuous batching manager started at {time.time() - self.creation_time}")
```

This isn't really useful IMO; usually you configure your logger to include a timestamp. I'd still keep the start_time in case we need it for metrics and whatnot.
Kept it as debug with better message
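A sketch of the resulting pattern (hypothetical class name, standard `logging`/`time` usage, not the PR's actual code): keep `creation_time` around for metrics, and log the elapsed time at debug level rather than info.

```python
import logging
import time

logger = logging.getLogger(__name__)


class ManagerSketch:
    """Hypothetical manager that records its creation time for metrics."""

    def __init__(self):
        self.creation_time = time.monotonic()

    def start(self):
        # Elapsed time stays available for metrics; the log line itself
        # is debug-level, since the logger's formatter already timestamps.
        elapsed = time.monotonic() - self.creation_time
        logger.debug("Continuous batching manager started %.3fs after creation", elapsed)
        return elapsed
```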
* CB example: better compare feature
* Cache managers, still issue w/ effective length
* WIP -- fix for effective length
* Renames
* Working, need better parity checks, we might be missing 1 token
* Small fixes
* Fixed wrong attn mask and broke cache into pieces
* Warmup is slowing down things, disabling it
* Cache was too big, fixed
* Simplified index objects
* Added a profile option to the example
* Avoid calls to memory reporting tools
* Restore full attention read indices for better latency
* Addressed some TODOs and style
* Docstrings for cache managers
* Docstrings for Schedulers
* Refactor schedulers
* [Important] Cache fix for sliding window, check with small sw size
* Updated doc for cache memory compute and cache as a whole
* Moved a todo
* Nits and style
* Fix for when sliding window is smaller than max batch per token
* Paged interface update
* Support for Flash in new API
* Fix example CB
* Fix bug in CB for paged
* Revert example
* Style
* Review compliance
* Style
* Styleeeee
* Removed NO_SLIDING_WINDOW
* Review huggingface#2 compliance
* Better art
* Turn cum_seqlens_k into a dict
* Attn mask is now a dict
* Update examples/pytorch/continuous_batching.py (Co-authored-by: Luc Georges <McPatate@users.noreply.github.com>)
* Addressed McPatate pro review
* Style and fix

---------

Co-authored-by: Luc Georges <McPatate@users.noreply.github.com>
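The "Turn cum_seqlens_k into a dict" and "Attn mask is now a dict" commits suggest per-attention-type bookkeeping. A hypothetical sketch of the idea (not the PR's actual code): full-attention layers and sliding-window layers each read their own cumulative-sequence-length entry.

```python
def build_cu_seqlens(seq_lens, sliding_window):
    # One cumulative-seqlen array per attention type: sliding-window
    # layers never attend past `sliding_window` keys per sequence.
    full, sliding = [0], [0]
    for n in seq_lens:
        full.append(full[-1] + n)
        sliding.append(sliding[-1] + min(n, sliding_window))
    return {"full_attention": full, "sliding_attention": sliding}
```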
Original PR #40688 by remi-or: huggingface/transformers#40688
Merged from original PR #40688.
This PR introduces support for sliding window attention in CB (continuous batching), using an allocator that can serve both full attention and sliding window attention. The code works with nearly identical performance (665 -> 655 tok/s in the example), but I am putting the PR in draft status until:
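As a rough illustration of the allocator idea (a hypothetical sketch under assumed semantics, not the PR's implementation): a paged allocator can serve both layer types by capping sliding-window layers at `ceil(window / block_size)` blocks, since older keys fall out of the window and their blocks can be reused in place.

```python
import math


class BlockAllocatorSketch:
    """Hypothetical per-sequence block allocator for a paged KV cache."""

    def __init__(self, block_size, sliding_window=None):
        self.block_size = block_size
        self.sliding_window = sliding_window  # None => full attention
        self.blocks = []  # physical block ids held by this sequence

    def max_blocks(self):
        if self.sliding_window is None:
            return None  # full attention: cache grows without bound
        return math.ceil(self.sliding_window / self.block_size)

    def extend_to(self, seq_len):
        needed = math.ceil(seq_len / self.block_size)
        cap = self.max_blocks()
        if cap is not None:
            # Sliding window: the oldest blocks leave the window and are
            # overwritten in place, ring-buffer style, so we never hold
            # more than `cap` physical blocks per sequence.
            needed = min(needed, cap)
        while len(self.blocks) < needed:
            self.blocks.append(len(self.blocks))
        return self.blocks
```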