feat: support bmm fp8 by zhyncs · Pull Request #469 · flashinfer-ai/flashinfer

zhyncs · 2024-08-26T18:04:57Z

torch.bmm doesn't support fp8 and torch._scaled_mm doesn't support 3D inputs, so I wrote this one. @yzh119 cc @merrymercy @Ying1123 @ispobock

Thanks @yzh119 for helping with debugging.
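For readers unfamiliar with the operation, here is a minimal pure-Python sketch of what a scaled batched matmul computes. It is not the kernel added in this PR (which runs on fp8 tensors on the GPU). It assumes per-tensor scalar scales that multiply the product, and the name `bmm_scaled_reference` and its parameters are hypothetical, chosen only for illustration.

```python
def bmm_scaled_reference(a, b, a_scale=1.0, b_scale=1.0):
    """Reference for D[i] = (a_scale * A[i]) @ (b_scale * B[i]) on nested lists.

    a has shape (batch, m, k) and b has shape (batch, k, n).
    The scales are assumed to be per-tensor scalars (an assumption of this sketch).
    """
    if len(a) != len(b):
        raise ValueError("batch sizes differ")
    out = []
    for a_mat, b_mat in zip(a, b):
        k = len(b_mat)
        if any(len(row) != k for row in a_mat):
            raise ValueError("inner dimensions differ")
        n = len(b_mat[0])
        # Accumulate the unscaled dot product, then apply both scales once.
        out.append([
            [a_scale * b_scale * sum(row[p] * b_mat[p][j] for p in range(k))
             for j in range(n)]
            for row in a_mat
        ])
    return out


# Two batches, each a (1 x 2) @ (2 x 1) product.
A = [[[1.0, 2.0]], [[3.0, 4.0]]]
B = [[[5.0], [6.0]], [[7.0], [8.0]]]
D = bmm_scaled_reference(A, B, a_scale=0.5, b_scale=2.0)
```

The batch dimension is what a 2D-only scaled matmul lacks: each batch entry is an independent matrix product that shares the same pair of scales.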

AType: fp8 e4m3, fp8 e5m2
BType: fp8 e4m3, fp8 e5m2
DType: bf16, fp16

The combination where both AType and BType are fp8 e5m2 is not supported. See https://docs.nvidia.com/cuda/cublas/#cublasltmatmul
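The type constraints listed above can be sketched as a small check. This is an illustration only: the string names and the helper `is_supported` are hypothetical and are not part of the FlashInfer API.

```python
from itertools import product

# Dtype names as plain strings; hypothetical labels for this sketch only.
FP8_INPUT_DTYPES = ("e4m3", "e5m2")
OUTPUT_DTYPES = ("bf16", "fp16")


def is_supported(a_dtype, b_dtype, d_dtype):
    """Return True if the (A, B, D) dtype combination matches the list above."""
    if a_dtype not in FP8_INPUT_DTYPES or b_dtype not in FP8_INPUT_DTYPES:
        return False
    if d_dtype not in OUTPUT_DTYPES:
        return False
    # Both inputs being e5m2 is the one excluded input pairing.
    return not (a_dtype == "e5m2" and b_dtype == "e5m2")


supported = [
    combo
    for combo in product(FP8_INPUT_DTYPES, FP8_INPUT_DTYPES, OUTPUT_DTYPES)
    if is_supported(*combo)
]
```

Under this rule 6 of the 8 combinations are valid, which would be consistent with the "6 passed, 2 skipped" test result reported in this PR, assuming the skipped cases are the two e5m2 x e5m2 ones.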

pytest python/tests/test_bmm_fp8.py

works on H100

=================================================================================== test session starts ===================================================================================
platform linux -- Python 3.12.4, pytest-8.3.2, pluggy-1.5.0
rootdir: /flashinfer
collected 8 items

python/tests/test_bmm_fp8.py ...s...s                                                                                                                                                                       [100%]

============================================================================== 6 passed, 2 skipped in 2.16s ===============================================================================

yzh119 · 2024-08-26T18:24:47Z

Another suggestion: move group gemm and bmm fp8 into a common gemm.py. We should also update group_gemm.rst (renaming it to gemm.rst).

zhyncs · 2024-08-26T18:25:47Z

Makes sense.

yzh119

LGTM, thanks for your contribution @zhyncs !

🤖 I have created a release *beep* *boop*

---

## [0.1.6](v0.1.5...v0.1.6) (2024-08-27)

### SM75 Support

Starting from [0.1.6](v0.1.5...v0.1.6), our pre-built wheels include experimental support for sm75 (Turing architecture GPUs such as Tesla T4, Quadro RTX 6000 and RTX 2080).

### API Changes

#### `plan`/`run`

From [0.1.6](v0.1.5...v0.1.6) on, the `begin_forward`/`forward`/`end_forward` APIs are replaced with the new `plan`/`run` API.

- `forward` is renamed to `run`, which is more precise and consistent with the naming convention of cutlass's python API.
- `begin_forward` is renamed to `plan`, which is consistent with the naming convention of nvmath API.
- `end_forward` is deprecated and has no effect after this PR.

There is a slight difference between the old `forward` and the new `run` API:

- All extra arguments such as `causal` and `logits_soft_cap` are provided in the `plan` (previously `begin_forward`) API and cached until the next `plan` call, so only the query and KV-Cache tensors need to be passed to the `run` API.

The old `begin_forward`/`forward`/`end_forward` APIs are still functional, but we will gradually deprecate them in future releases.

Check [#466](#466) for more details.

#### `MultiLevelCascadeAttentionWrapper`

From [0.1.6](v0.1.5...v0.1.6) on, we introduce a new `MultiLevelCascadeAttentionWrapper` API for cascade inference, which supports multi-level cascade inference where all levels' KV-Cache can be managed in a unified Paged KV-Cache.

See the [documentation](https://docs.flashinfer.ai/api/python/cascade.html#flashinfer.cascade.MultiLevelCascadeAttentionWrapper) and [tutorial](https://docs.flashinfer.ai/tutorials/kv_layout.html#multi-level-cascade-inference-data-layout) for API usage and an explanation of the layout.

The old `BatchDecodeWithSharedPrefixPagedKVCacheWrapper` and `BatchPrefillWithSharedPrefixPagedKVCacheWrapper` will be deprecated in future releases.

### Features

* sm75 support ([#448](#448), [#449](#449))
* add `MultiLevelCascadeAttentionWrapper` API ([#462](#462)) ([1e37989](1e37989))
* add accept num, emit num metric for ChainSpeculativeSampling ([#450](#450)) ([fa38b5e](fa38b5e))
* support bmm fp8 ([#469](#469)) ([f1c0b68](f1c0b68))

### Refactor

* refactor: replace `begin_forward`/`forward`/`end_forward` with `plan`/`run` [#466](#466)

### Misc

* misc: improve error handling of sampling kernels ([#456](#456)) ([0dce178](0dce178))

### Performance Improvements

* slight optimization on f16->f8 fragment layout swizzling ([#453](#453)) ([0d61871](0d61871))
* slight optimization on fragment layout swizzle ([#458](#458)) ([7c397cb](7c397cb))
* use persistent kernel for merging attention states ([#459](#459)) ([be6bf5b](be6bf5b))

### Acknowledgement

We thank [@LiuXiaoxuanPKU](https://github.com/LiuXiaoxuanPKU) for enhancing the speculative sampling operator, [@merrymercy](https://github.com/merrymercy) for the API change suggestion and [@zhyncs](https://github.com/zhyncs) for integrating the fp8 BMM cublas implementation.

---

This PR was generated with [Release Please](https://github.com/googleapis/release-please). See [documentation](https://github.com/googleapis/release-please#release-please).

---------

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Zihao Ye <expye@outlook.com>
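The `plan`/`run` split described in the release notes can be illustrated with a toy class. This is a sketch of the calling pattern only: `ToyAttentionWrapper`, its argument defaults, and its return value are hypothetical and do not reproduce any FlashInfer wrapper's actual signature.

```python
class ToyAttentionWrapper:
    """Hypothetical stand-in showing the plan/run pattern; not a FlashInfer class."""

    def __init__(self):
        self._plan = None

    def plan(self, *, causal=False, logits_soft_cap=0.0):
        # Extra arguments are given once and cached until the next plan() call.
        self._plan = {"causal": causal, "logits_soft_cap": logits_soft_cap}

    def run(self, q, kv_cache):
        # Only the query and KV-Cache are passed per call.
        if self._plan is None:
            raise RuntimeError("call plan() before run()")
        return {"q": q, "kv_cache": kv_cache, **self._plan}

    # Legacy names kept as thin aliases, mirroring the described deprecation path.
    def begin_forward(self, **kwargs):
        self.plan(**kwargs)

    def forward(self, q, kv_cache):
        return self.run(q, kv_cache)

    def end_forward(self):
        # Described as having no effect after the refactor.
        pass


wrapper = ToyAttentionWrapper()
wrapper.plan(causal=True)
first = wrapper.run("q0", "kv")
second = wrapper.run("q1", "kv")  # reuses the cached plan
```

Caching the configuration in `plan` keeps the per-step `run` call minimal, which is the difference from the old `forward` call that the notes highlight.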

zhyncs and others added 13 commits August 26, 2024 20:41

init

86cb8b8

fix

3db2db4

kernel is right

d7799c6

add comment

133eb2e

use torch

a4e4d5c

upd

3d30ad6

enable fast_accum

091fd1c

add workspace

d0f6b28

support template

ca70021

update test

d649f3b

add scale

4f9fccf

upd

032f29c

update test

2a184ad

zhyncs added the feature request New feature or request label Aug 26, 2024

zhyncs requested a review from yzh119 August 26, 2024 18:04

zhyncs self-assigned this Aug 26, 2024

yzh119 reviewed Aug 26, 2024

Comment thread python/flashinfer/bmm_fp8.py Outdated

zhyncs added 2 commits August 27, 2024 04:29

add doc string

7c22602

update

e53d50a

yzh119 reviewed Aug 26, 2024

Comment thread python/flashinfer/gemm.py Outdated

update

125b91a

yzh119 approved these changes Aug 26, 2024

yzh119 merged commit f1c0b68 into main Aug 26, 2024

github-actions Bot mentioned this pull request Aug 26, 2024

chore(main): release 0.1.6 #447

Merged

zhyncs deleted the fp8-bmm-scale branch August 26, 2024 19:32

yzh119 mentioned this pull request Aug 27, 2024

doc: fix fp8 bmm documentation #470

Merged

yzh119 added a commit that referenced this pull request Aug 27, 2024

doc: fix fp8 bmm documentation (#470)

d357a91

The documentation was not indexed properly in #469; this PR fixes the issue.

This was referenced Aug 30, 2024

[Feature] support W8A8(FP8) and KV Cache FP8 for DeepSeek V2 sgl-project/sglang#1156

Closed

feat: fix fp8 for MLA and support bmm fp8 for DeepSeek V2 sgl-project/sglang#1285

Merged

github-actions Bot mentioned this pull request Dec 25, 2024

chore(main): release 0.3.0 #698

Closed

yzh119 merged 16 commits into main from fp8-bmm-scale
