Support sliding window in CB (#40688)
Removing draft status; benchmarking with / without this PR gives:
ArthurZucker left a comment:
Let's remove global var SLIDING_WINDOW
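A minimal sketch of one way to address this request (hypothetical names, not the PR's actual code): instead of a module-level `SLIDING_WINDOW` global, carry the window size as an attribute of the cache object, with `None` meaning full attention.

```python
class PagedCacheSketch:
    """Hypothetical cache object that owns its sliding-window setting."""

    def __init__(self, sliding_window=None):
        # None means every layer uses full attention
        self.sliding_window = sliding_window

    def effective_window(self, seq_len):
        # Number of past tokens a query can actually attend to
        if self.sliding_window is None:
            return seq_len
        return min(self.sliding_window, seq_len)
```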
The code under discussion:

```python
input_ids = torch.tensor([input_ids]).to("cuda")
attention_mask = torch.ones_like(input_ids)
outputs = model.generate(input_ids, attention_mask=attention_mask, generation_config=generation_config)
```

Reviewer, on the line `# attention_mask = torch.ones_like(input_ids)`:

?

Reply:

Throws a warning otherwise.
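For context, a hedged sketch of why the explicit mask matters (pure-Python illustration, not the PR's code): when padding cannot be inferred from the ids alone (e.g. the pad token doubles as another token), `generate()` warns unless an `attention_mask` is supplied. Building the mask from the token ids directly:

```python
def build_attention_mask(input_ids, pad_token_id):
    # 1 for real tokens, 0 for padding; passing this mask explicitly
    # avoids the "attention mask is not set" warning, since the model
    # cannot reliably infer padding from the ids alone.
    return [[0 if tok == pad_token_id else 1 for tok in seq] for seq in input_ids]
```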
```diff
  self._generation_thread = threading.Thread(target=self._run_generation_loop)
  self._generation_thread.start()
- logger.info("Continuous batching manager started.")
+ logger.info(f"Continuous batching manager started at {time.time() - self.creation_time} seconds")
```
Suggested change:

```diff
- logger.info(f"Continuous batching manager started at {time.time() - self.creation_time} seconds")
+ logger.info(f"Continuous batching manager started at {time.time() - self.creation_time}")
```

This isn't really useful IMO; usually you configure your logger to include a timestamp. I'd still keep the start_time in case we need it for metrics and whatnot.
Kept it as debug with better message
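A sketch of the resulting pattern (hypothetical class name, standard `logging`/`time` usage, not the PR's actual code): keep `creation_time` around for metrics, and log the elapsed time at debug level rather than info.

```python
import logging
import time

logger = logging.getLogger(__name__)


class ManagerSketch:
    """Hypothetical manager that records its creation time for metrics."""

    def __init__(self):
        self.creation_time = time.monotonic()

    def start(self):
        # Elapsed time stays available for metrics; the log line itself
        # is debug-level, since the logger's formatter already timestamps.
        elapsed = time.monotonic() - self.creation_time
        logger.debug("Continuous batching manager started %.3fs after creation", elapsed)
        return elapsed
```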
* CB example: better compare feature
* Cache managers, still issue w/ effective length
* WIP -- fix for effective length
* Renames
* Working, need better parity checks, we might be missing 1 token
* Small fixes
* Fixed wrong attn mask and broke cache into pieces
* Warmup is slowing down things, disabling it
* Cache was too big, fixed
* Simplified index objects
* Added a profile option to the example
* Avoid calls to memory reporting tools
* Restore full attention read indices for better latency
* Addressed some TODOs and style
* Docstrings for cache managers
* Docstrings for Schedulers
* Refactor schedulers
* [Important] Cache fix for sliding window, check with small sw size
* Updated doc for cache memory compute and cache as a whole
* Moved a todo
* Nits and style
* Fix for when sliding window is smaller than max batch per token
* Paged interface update
* Support for Flash in new API
* Fix example CB
* Fix bug in CB for paged
* Revert example
* Style
* Review compliance
* Style
* Styleeeee
* Removed NO_SLIDING_WINDOW
* Review huggingface#2 compliance
* Better art
* Turn cum_seqlens_k into a dict
* Attn mask is now a dict
* Update examples/pytorch/continuous_batching.py (Co-authored-by: Luc Georges <McPatate@users.noreply.github.com>)
* Addressed McPatate pro review
* Style and fix

---------

Co-authored-by: Luc Georges <McPatate@users.noreply.github.com>
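The "Turn cum_seqlens_k into a dict" and "Attn mask is now a dict" commits suggest per-attention-type bookkeeping. A hypothetical sketch of the idea (not the PR's actual code): full-attention layers and sliding-window layers each read their own cumulative-sequence-length entry.

```python
def build_cu_seqlens(seq_lens, sliding_window):
    # One cumulative-seqlen array per attention type: sliding-window
    # layers never attend past `sliding_window` keys per sequence.
    full, sliding = [0], [0]
    for n in seq_lens:
        full.append(full[-1] + n)
        sliding.append(sliding[-1] + min(n, sliding_window))
    return {"full_attention": full, "sliding_attention": sliding}
```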
Original PR #40688 by remi-or: huggingface/transformers#40688
Merged from original PR #40688.
This PR introduces support for sliding window attention in CB (continuous batching), using an allocator that can serve both full attention and sliding window attention. The code works with nearly identical performance (665 -> 655 tok/s in the example), but I am putting the PR in draft status until:
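As a rough illustration of the allocator idea (a hypothetical sketch under assumed semantics, not the PR's implementation): a paged allocator can serve both layer types by capping sliding-window layers at `ceil(window / block_size)` blocks, since older keys fall out of the window and their blocks can be reused in place.

```python
import math


class BlockAllocatorSketch:
    """Hypothetical per-sequence block allocator for a paged KV cache."""

    def __init__(self, block_size, sliding_window=None):
        self.block_size = block_size
        self.sliding_window = sliding_window  # None => full attention
        self.blocks = []  # physical block ids held by this sequence

    def max_blocks(self):
        if self.sliding_window is None:
            return None  # full attention: cache grows without bound
        return math.ceil(self.sliding_window / self.block_size)

    def extend_to(self, seq_len):
        needed = math.ceil(seq_len / self.block_size)
        cap = self.max_blocks()
        if cap is not None:
            # Sliding window: the oldest blocks leave the window and are
            # overwritten in place, ring-buffer style, so we never hold
            # more than `cap` physical blocks per sequence.
            needed = min(needed, cap)
        while len(self.blocks) < needed:
            self.blocks.append(len(self.blocks))
        return self.blocks
```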