Add heuristic default block sizes for different cases in ragged attention kernel #8922
yaochengji merged 8 commits into master
Conversation
@@ -856,6 +867,22 @@ def test_ragged_paged_attention_wrapper_without_dynamo(
        use_dynamo=False,
    )
Remove the old _test_ragged_paged_attention?
We'd better also test the non-None block size parameter; that's why it's still there.
@@ -817,6 +812,22 @@ def test_ragged_paged_attention_wrapper_with_dynamo(
        use_dynamo=True,
Remove the old _test_ragged_paged_attention?
We'd better also test the non-None block size parameter; that's why it's still there.
    raise NotImplementedError("TPU version must be 4 or higher.")
  # NOTE: the TPU v4's vmem capacity is 16MB
  if tpu_version == 4:
    vmem_limit_bytes = 16 * 1024 * 1024
It is fine; even if we set more than 16 MB, it will still use 16 MB.
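As a rough illustration of the version check in the hunk above, here is a minimal sketch of how the vmem limit could be picked per TPU generation. Only the 16 MB value for TPU v4 comes from this diff; the value for newer generations is a placeholder assumption.

```python
def get_vmem_limit_bytes(tpu_version: int) -> int:
  # Sketch only: mirrors the branch shown in the diff above.
  if tpu_version < 4:
    raise NotImplementedError("TPU version must be 4 or higher.")
  if tpu_version == 4:
    # NOTE: the TPU v4's vmem capacity is 16MB; asking for more is still
    # capped at 16MB, per the review comment above.
    return 16 * 1024 * 1024
  # Placeholder assumption for newer TPU generations with larger vmem.
  return 64 * 1024 * 1024
```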
            soft_cap=soft_cap,
            pad_tokens_and_seqs=pad_tokens_and_seqs,
            use_dynamo=True,
            num_kv_pages_per_block=None,
nit: consider making the block sizes a parameter, e.g. num_kv_pages_per_block=[16, None]; similarly for num_queries_per_block.
Thanks for the suggestion, done.
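A hypothetical sketch of the suggested parameterization; the real test helper and its full argument list are not reproduced here, and the non-None values are illustrative. A None value exercises the heuristic defaults added by this PR, a non-None value the explicit path.

```python
import itertools

for num_kv_pages_per_block, num_queries_per_block in itertools.product(
    [16, None], [128, None]):
  print(f"case: num_kv_pages_per_block={num_kv_pages_per_block}, "
        f"num_queries_per_block={num_queries_per_block}")
  # In the actual test these values would be forwarded to
  # _test_ragged_paged_attention(..., num_kv_pages_per_block=...,
  # num_queries_per_block=...).
```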
  # This heuristic is based on the initial kernel micro-benchmarking:
  # when token_num is small, there is no long prefill request;
  # when it is larger, the block size is adjusted for it.
  if token_num <= 128:
I wonder if we should choose the block sizes in vLLM. If we do it in torch_xla, we need to change it there and wait for the next wheel tomorrow; if we do it in vLLM, it would be more convenient. wdyt?
If vLLM passes a non-None block size, the default value will not be used.
I wonder why we couldn't do _get_default_ragged_paged_attention_block_size in vLLM...
Usually it's a good idea to put the tuned-parameter table in the kernel lib, not in the app lib.
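To make that concrete, here is a hedged sketch of what a kernel-side default helper of this shape could look like. The branch on token_num comes from the hunk above, but the returned block sizes are illustrative placeholders rather than the tuned values this PR actually adds.

```python
def _get_default_ragged_paged_attention_block_size(token_num: int):
  """Return (num_kv_pages_per_block, num_queries_per_block).

  Sketch only: the token_num branch mirrors the diff above; the returned
  values are placeholders, not the micro-benchmarked defaults from this PR.
  """
  if token_num <= 128:
    # Small token counts mean no long prefill request, so small blocks
    # keep vmem pressure low.
    return 16, 32  # placeholder values
  # Larger token counts benefit from bigger blocks that amortize
  # per-block overhead.
  return 128, 64  # placeholder values
```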
64d9bad to 8286d56
No description provided.