[EVT] Add support for Row/Col broadcast PtrArray #2033
Conversation
@ANIKET-SHIVAM @hwu36 Do you have a timeline to review this? We need this feature enabled in cutlass ASAP to unblock our use cases at Meta, e.g., pytorch/FBGEMM#3560
(Resolved review threads on include/cutlass/epilogue/fusion/sm90_visitor_load_tma_warpspecialized.hpp)
Thanks for taking a look @Skylion007. I've incorporated your feedback; please check again to make sure this all looks good.
Thanks! Can you merge this into cutlass to unblock e.g., pytorch/FBGEMM#3560?
Need an NVIDIA employee to do that. @eqy might know who to ping.
cc @hwu36
We are working on this.
(Force-pushed from bfba683 to dadd7c2, then from dadd7c2 to 2d14744)
@ANIKET-SHIVAM I've refactored this PR so that the functionality is built into the existing Row/Col EVT nodes. Can you take another look?
Thanks for the changes, @jwfromm. Looks fine. I'll add some unit tests for these later so that they keep getting tested. I'm assuming PyTorch integration is still fine after these changes.
Yes, everything compiles, integrates, and runs nicely on top of this PR.
@ANIKET-SHIVAM Feel free to let us know if there is anything needed from our side to merge this PR.
@hwu36 can we merge this please :)
* Handle MNK Sm90{Row, Col}Reduction problem shapes (NVIDIA#1803)
* add is_last_tile
* Improve sm90 mixed dtype kernel (NVIDIA#1883)
* Add GMMA shape m64n40k16 (NVIDIA#1864)
* Add all supported GMMA shapes (NVIDIA#1890)
* add maximum support (NVIDIA#1833)
* fix typo (NVIDIA#1853)
* fix by adding public (NVIDIA#1753)
* added mapping for bf16 to torch::kBFloat16 (NVIDIA#1843)
Co-authored-by: Haicheng Wu <57973641+hwu36@users.noreply.github.com>
* Fix README (NVIDIA#1658)
* Fix README
* Improve README
---------
Co-authored-by: Haicheng Wu <57973641+hwu36@users.noreply.github.com>
* Adjusting code indentation (NVIDIA#1639)
* Include of regular_tile_iterator.h fixed for NVRTC (NVIDIA#1765)
* Include of regular_tile_iterator.h fixed for NVRTC
* More include fixed for NVRTC
* Update gemm_f16n_f16t_f32t_tensor_op_f32_sm80.cu with include "cutlass/gemm/device/gemm_universal.h" (NVIDIA#1569)
fix compile with `cmake .. -DCUTLASS_ENABLE_TESTS=ON -DCUTLASS_TEST_LEVEL=2`
* remove redundant hardcoded packing configs in mixed dtype gemm (NVIDIA#1894)
Co-authored-by: Siyuan Fu <siyuanf@nvidia.com>
* fix wrong A/BLayout in MMA_Traits for binary mma and append other MMA_Traits support (NVIDIA#1856)
* fix wrong A/BLayout in MMA_Traits<SM80_16x8x256_S32U1U1S32_TN_XORPOPC> and append support for m8n8k128, m16n8k128 mma.and.popc in MMA_Traits instantiation
* add "print" template for subbyte_reference<T>
* Add a print for the uint{x}b_t type. (NVIDIA#1871)
* Refactor some GroupedGEMM logic (NVIDIA#1899)
* feat: support kFactor 8 used in mma tensor op tile iterator (NVIDIA#1512)
* Update publications (NVIDIA#1912)
* remove restriction of stride == kernel in nhwc_pooling (NVIDIA#1896)
* fix undefined in device code error (NVIDIA#1880)
* Fix the racing condition of mixed-input gemm when writing the registers (NVIDIA#1931)
* move two warpgroup_wait
* merge main
---------
Co-authored-by: Siyuan Fu <siyuanf@nvidia.com>
* Fix `cutlass` python library with cuda `12.6.2.post1` (NVIDIA#1942)
* Fix `cutlass` python library with cuda `12.6.2.post1`
Previously we had this error:
```
File "/storage/home/cutlass/python/cutlass/backend/operation.py", line 39, in <listcomp>
_version_splits = [int(x) for x in __version__.split("rc")[0].split(".")]
^^^^^^
ValueError: invalid literal for int() with base 10: 'post1'
```
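The failure above comes from version suffixes like `post1` that are not plain integers. As an illustrative sketch (not the actual fix in this commit), a more tolerant parse could keep only the leading numeric components:

```python
import re

def parse_version_components(version: str) -> list[int]:
    # Keep only leading numeric components; stop at the first
    # non-numeric segment such as "post1". Mirrors the original
    # strip of an "rc" suffix before splitting on ".".
    parts = []
    for piece in version.split("rc")[0].split("."):
        m = re.match(r"\d+", piece)
        if m is None:
            break
        parts.append(int(m.group()))
    return parts

# parse_version_components("12.6.2.post1") -> [12, 6, 2]
```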
* Update sm90_utils.py
* Update generator.py
* Update python/cutlass_library/generator.py
Co-authored-by: Jack Kosaian <jackkosaian@gmail.com>
* Update python/cutlass_library/sm90_utils.py
Co-authored-by: Jack Kosaian <jackkosaian@gmail.com>
---------
Co-authored-by: Jack Kosaian <jackkosaian@gmail.com>
* add {uint4, uint2, int2} => {fp16, bf16} conversion (NVIDIA#1966)
* Improve mixed dtype GEMM (NVIDIA#1972)
* update
* fix a typo
* fix a typo that fails the compiling when ElementScale is not the same as MmaType (NVIDIA#1977)
* Fix CuTe README Typo (NVIDIA#1951)
* Fix Typo (NVIDIA#1962)
* 3.6.0 update (NVIDIA#2005)
* 3.6.0 update
* doc and swap stuff
---------
Co-authored-by: yuzhai <yuzhai@nvidia.com>
Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
* Update CHANGELOG.md
* Update 0x_gemm_tutorial.md (NVIDIA#1982)
Shouldn't this be BLK_M, BLK_**K**, k
* fix bug: arch/mma_sm60.h Mma<2,2,1> calculate wrong (NVIDIA#1989)
* fix mem fence (NVIDIA#2030)
Co-authored-by: yuzhai <yuzhai@nvidia.com>
* Add half->int8 saturate conversion to promise valid range (NVIDIA#1983)
* Add half->int8 saturate conversion to promise valid range
* add gpu only macro
---------
Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
* Add vector-types back to platform.h (NVIDIA#2026)
* Fix typo in library_defaults.py (NVIDIA#2024)
* Fix Typos (NVIDIA#2021)
* Fix Typo
* Fix Typo
* Add Line Break (NVIDIA#2020)
* Blockwise Scaling for FP8 (NVIDIA#1932)
* F8 Blockwise Scaling
* two more NumProducerThreadEvents
---------
Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
* fix assertion in integer_subbytes.h (NVIDIA#1961)
* CUTLASS 3.7 (NVIDIA#2045)
* CUTLASS 3.7
* clean up changelog
---------
Co-authored-by: yuzhai <yuzhai@nvidia.com>
Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
* update 3.7 docs (NVIDIA#2051)
* update docs
* update docs
* update docs
* update docs
* update docs
---------
Co-authored-by: yuzhai <yuzhai@nvidia.com>
* CUTLASS 3.8 Release (NVIDIA#2059)
* CUTLASS 3.8 Release
* update
* Update README.md
* Revert "Update README.md"
This reverts commit b353e36.
* update
* update
---------
Co-authored-by: Haicheng Wu <57973641+hwu36@users.noreply.github.com>
Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
* fix cuda 12.6 issues (NVIDIA#2066)
* fix a readme broken link (NVIDIA#2069)
* Update README.md
* Groupwise scaling along M for FP8 gemm (NVIDIA#2037)
* FP8 groupwise scaling along M
* small updates
---------
Co-authored-by: zl <zl@deepseek.com>
Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
* bugfix generic-k code in top-k with softmax (NVIDIA#1993)
* bugfix generic-k code in top-k with softmax
* Update include/cutlass/epilogue/fusion/sm90_visitor_topk_softmax.hpp
Co-authored-by: Ali Hassani <68103095+alihassanijr@users.noreply.github.com>
* Update examples/61_hopper_gemm_with_topk_and_softmax/61_hopper_gemm_with_topk_and_softmax.cu
Co-authored-by: Ali Hassani <68103095+alihassanijr@users.noreply.github.com>
---------
Co-authored-by: Ali Hassani <68103095+alihassanijr@users.noreply.github.com>
* [EVT] Add support for Row/Col broadcast PtrArray (NVIDIA#2033)
* Add group support to EVT row/col broadcast.
* small modifications
---------
Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
* v3.8.0 update (NVIDIA#2082)
* 3.8 update
* fix Markus' name
---------
Co-authored-by: yuzhai <yuzhai@nvidia.com>
* [WA] Fix compiling errors
---------
Co-authored-by: Saagar Jha <saagar@saagarjha.com>
Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
Co-authored-by: Sergey Klevtsov <141879860+sklevtsov-nvidia@users.noreply.github.com>
Co-authored-by: Tri Dao <tridao@users.noreply.github.com>
Co-authored-by: Xinyu Yang <ltyxy@buaa.edu.cn>
Co-authored-by: sijialou <sijia.lou@intel.com>
Co-authored-by: Bogumil Sapinski Mobica <48835513+Bogumil-Sapinski-Mobica@users.noreply.github.com>
Co-authored-by: Haicheng Wu <57973641+hwu36@users.noreply.github.com>
Co-authored-by: Lei Mao <dukeleimao@gmail.com>
Co-authored-by: 103yiran <1039105206@qq.com>
Co-authored-by: MaxAkaAltmer <MaxAkaAltmer@yandex.ru>
Co-authored-by: 侯奇 <houqi1993@gmail.com>
Co-authored-by: Lain <28486541+IwakuraRein@users.noreply.github.com>
Co-authored-by: Siyuan Fu <siyuanf@nvidia.com>
Co-authored-by: Caleb_Du <59528230+CalebDu@users.noreply.github.com>
Co-authored-by: LiYu Lu <luliyucoordinate@outlook.com>
Co-authored-by: azhurkevich <101208641+azhurkevich@users.noreply.github.com>
Co-authored-by: chenwei <15601910741@163.com>
Co-authored-by: Wenlei Bao <142055114+wenlei-bao@users.noreply.github.com>
Co-authored-by: LiuQiang <thorneliu@gmail.com>
Co-authored-by: dan_the_3rd <43445237+danthe3rd@users.noreply.github.com>
Co-authored-by: Jack Kosaian <jackkosaian@gmail.com>
Co-authored-by: Yujia Zhai <yzhai015@ucr.edu>
Co-authored-by: yuzhai <yuzhai@nvidia.com>
Co-authored-by: Andrew O'Neill <foolusion@gmail.com>
Co-authored-by: Dongxu.Wang <wangdongxuking61@gmail.com>
Co-authored-by: ZZK <359521840@qq.com>
Co-authored-by: Driss Guessous <32754868+drisspg@users.noreply.github.com>
Co-authored-by: ZincCat <52513999+zinccat@users.noreply.github.com>
Co-authored-by: Manish Gupta <mgupta.iitr@gmail.com>
Co-authored-by: bobliao <codechaser@163.com>
Co-authored-by: mihir-awatramani <162148077+mihir-awatramani@users.noreply.github.com>
Co-authored-by: Liang <44948473+soundOfDestiny@users.noreply.github.com>
Co-authored-by: zl <zl@deepseek.com>
Co-authored-by: Tadej Ciglarič <tadej.c@gmail.com>
Co-authored-by: Ali Hassani <68103095+alihassanijr@users.noreply.github.com>
Co-authored-by: Josh Fromm <jwfromm@meta.com>
To enable FP8 grouped GEMM with rowwise scaling in the epilogue, we need to be able to provide a list of pointers to the scales for each group. This PR extends Sm90Row/ColBroadcast to support PtrArray to handle this case. Now, if the ElementInput type is specified as a pointer, the corresponding input is treated as an array of pointers (one per group), enabling grouped GEMM. For example, in our case we define EVT nodes like this:
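The original EVT node definitions were not included in this excerpt. As an illustrative sketch only (the alias names, tile shape, and stride arguments below are assumptions, not taken from the PR), passing a pointer type as the input element would select the grouped pointer-array path:

```cpp
// Illustrative sketch -- names and template arguments are assumptions.
// Passing a pointer element type (ElementScale const*) instead of a plain
// ElementScale makes the node read one scale pointer per group.
using ElementScale = float;

// Per-row scale for operand A, broadcast along N (one pointer per group):
using ScaleA = cutlass::epilogue::fusion::Sm90ColBroadcast<
    0, TileShape, ElementScale const*, ElementScale,
    cute::Stride<cute::_1, cute::_0, cute::_0>>;

// Per-column scale for operand B, broadcast along M (one pointer per group):
using ScaleB = cutlass::epilogue::fusion::Sm90RowBroadcast<
    0, TileShape, ElementScale const*, ElementScale,
    cute::Stride<cute::_0, cute::_1, cute::_0>>;
```

With a plain (non-pointer) ElementScale, the same nodes behave as before, broadcasting a single row/column vector for a non-grouped GEMM.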