[EVT] Add support for Row/Col broadcast PtrArray #2033
Conversation
@ANIKET-SHIVAM @hwu36 Do you have a timeline to review this? We need this feature enabled in cutlass ASAP to unblock our use cases at Meta, e.g., pytorch/FBGEMM#3560
(Resolved review threads on include/cutlass/epilogue/fusion/sm90_visitor_load_tma_warpspecialized.hpp)
Thanks for taking a look @Skylion007. I've incorporated your feedback; please check again to make sure this all looks good.
Thanks! Can you merge this into cutlass to unblock e.g., pytorch/FBGEMM#3560?
Need an NVIDIA employee to do that. @eqy might know who to ping.
cc @hwu36
We are working on this.
(Force-pushed from bfba683 to dadd7c2, then from dadd7c2 to 2d14744)
@ANIKET-SHIVAM I've refactored this PR so that the functionality is built into the existing Row/Col EVT nodes. Can you take another look?
Thanks for the changes, @jwfromm. Looks fine. I'll add some unit tests for these later so that they keep getting tested. I'm assuming PyTorch integration is still fine after these changes.
Yes, everything compiles, integrates, and runs nicely on top of this PR.
@ANIKET-SHIVAM Feel free to let us know if there is anything needed from our side to merge this PR.
@hwu36 can we merge this please :)
* Handle MNK Sm90{Row, Col}Reduction problem shapes (NVIDIA#1803)
* add is_last_tile
* Improve sm90 mixed dtype kernel (NVIDIA#1883)
* Add GMMA shape m64n40k16 (NVIDIA#1864)
* Add all supported GMMA shapes (NVIDIA#1890)
* add maximum support (NVIDIA#1833)
* fix typo (NVIDIA#1853)
* fix by adding public (NVIDIA#1753)
* added mapping for bf16 to torch::kBFloat16 (NVIDIA#1843)
Co-authored-by: Haicheng Wu <57973641+hwu36@users.noreply.github.com>
* Fix README (NVIDIA#1658)
* Fix README
* Improve README
---------
Co-authored-by: Haicheng Wu <57973641+hwu36@users.noreply.github.com>
* Adjusting code indentation (NVIDIA#1639)
* Include of regular_tile_iterator.h fixed for NVRTC (NVIDIA#1765)
* Include of regular_tile_iterator.h fixed for NVRTC
* More include fixed for NVRTC
* Update gemm_f16n_f16t_f32t_tensor_op_f32_sm80.cu with include "cutlass/gemm/device/gemm_universal.h" (NVIDIA#1569)
fix compile with `cmake .. -DCUTLASS_ENABLE_TESTS=ON -DCUTLASS_TEST_LEVEL=2`
* remove redundant hardcoded packing configs in mixed dtype gemm (NVIDIA#1894)
Co-authored-by: Siyuan Fu <siyuanf@nvidia.com>
* fix wrong A/BLayout in MMA_Traits for binary mma and append other MMA_Traits support (NVIDIA#1856)
* fix wrong A/BLayout in MMA_Traits<SM80_16x8x256_S32U1U1S32_TN_XORPOPC> and append support for m8n8k128, m16n8k128 mma.and.popc in MMA_Traits instantiation
* add "print" template for subbyte_reference<T>
* Add a print for the uint{x}b_t type. (NVIDIA#1871)
* Refactor some GroupedGEMM logic (NVIDIA#1899)
* feat: support kFactor 8 used in mma tensor op tile iterator (NVIDIA#1512)
* Update publications (NVIDIA#1912)
* remove restriction of stride == kernel in nhwc_pooling (NVIDIA#1896)
* fix undefined in device code error (NVIDIA#1880)
* Fix the racing condition of mixed-input gemm when writing the registers (NVIDIA#1931)
* move two warpgroup_wait
* merge main
---------
Co-authored-by: Siyuan Fu <siyuanf@nvidia.com>
* Fix `cutlass` python library with cuda `12.6.2.post1` (NVIDIA#1942)
* Fix `cutlass` python library with cuda `12.6.2.post1`
Previously we had this error:
```
File "/storage/home/cutlass/python/cutlass/backend/operation.py", line 39, in <listcomp>
_version_splits = [int(x) for x in __version__.split("rc")[0].split(".")]
^^^^^^
ValueError: invalid literal for int() with base 10: 'post1'
```
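The failure above comes from version suffixes like `post1` that are not plain integers. As an illustrative sketch (not the actual fix in this commit), a more tolerant parse could keep only the leading numeric components:

```python
import re

def parse_version_components(version: str) -> list[int]:
    # Keep only leading numeric components; stop at the first
    # non-numeric segment such as "post1". Mirrors the original
    # strip of an "rc" suffix before splitting on ".".
    parts = []
    for piece in version.split("rc")[0].split("."):
        m = re.match(r"\d+", piece)
        if m is None:
            break
        parts.append(int(m.group()))
    return parts

# parse_version_components("12.6.2.post1") -> [12, 6, 2]
```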
* Update sm90_utils.py
* Update generator.py
* Update python/cutlass_library/generator.py
Co-authored-by: Jack Kosaian <jackkosaian@gmail.com>
* Update python/cutlass_library/sm90_utils.py
Co-authored-by: Jack Kosaian <jackkosaian@gmail.com>
---------
Co-authored-by: Jack Kosaian <jackkosaian@gmail.com>
* add {uint4, uint2, int2} => {fp16, bf16} conversion (NVIDIA#1966)
* Improve mixed dtype GEMM (NVIDIA#1972)
* update
* fix a typo
* fix a typo that fails the compiling when ElementScale is not the same as MmaType (NVIDIA#1977)
* Fix CuTe README Typo (NVIDIA#1951)
* Fix Typo (NVIDIA#1962)
* 3.6.0 update (NVIDIA#2005)
* 3.6.0 update
* doc and swap stuff
---------
Co-authored-by: yuzhai <yuzhai@nvidia.com>
Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
* Update CHANGELOG.md
* Update 0x_gemm_tutorial.md (NVIDIA#1982)
Shouldn't this be BLK_M, BLK_**K**, k
* fix bug: arch/mma_sm60.h Mma<2,2,1> calculate wrong (NVIDIA#1989)
* fix mem fence (NVIDIA#2030)
Co-authored-by: yuzhai <yuzhai@nvidia.com>
* Add half->int8 saturate conversion to promise valid range (NVIDIA#1983)
* Add half->int8 saturate conversion to promise valid range
* add gpu only macro
---------
Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
* Add vector-types back to platform.h (NVIDIA#2026)
* Fix typo in library_defaults.py (NVIDIA#2024)
* Fix Typos (NVIDIA#2021)
* Fix Typo
* Fix Typo
* Add Line Break (NVIDIA#2020)
* Blockwise Scaling for FP8 (NVIDIA#1932)
* F8 Blockwise Scaling
* two more NumProducerThreadEvents
---------
Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
* fix assertion in integer_subbytes.h (NVIDIA#1961)
* CUTLASS 3.7 (NVIDIA#2045)
* CUTLASS 3.7
* clean up changelog
---------
Co-authored-by: yuzhai <yuzhai@nvidia.com>
Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
* update 3.7 docs (NVIDIA#2051)
* update docs
* update docs
* update docs
* update docs
* update docs
---------
Co-authored-by: yuzhai <yuzhai@nvidia.com>
* CUTLASS 3.8 Release (NVIDIA#2059)
* CUTLASS 3.8 Release
* update
* Update README.md
* Revert "Update README.md"
This reverts commit b353e36.
* update
* update
---------
Co-authored-by: Haicheng Wu <57973641+hwu36@users.noreply.github.com>
Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
* fix cuda 12.6 issues (NVIDIA#2066)
* fix a readme broken link (NVIDIA#2069)
* Update README.md
* Groupwise scaling along M for FP8 gemm (NVIDIA#2037)
* FP8 groupwise scaling along M
* small updates
---------
Co-authored-by: zl <zl@deepseek.com>
Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
* bugfix generic-k code in top-k with softmax (NVIDIA#1993)
* bugfix generic-k code in top-k with softmax
* Update include/cutlass/epilogue/fusion/sm90_visitor_topk_softmax.hpp
Co-authored-by: Ali Hassani <68103095+alihassanijr@users.noreply.github.com>
* Update examples/61_hopper_gemm_with_topk_and_softmax/61_hopper_gemm_with_topk_and_softmax.cu
Co-authored-by: Ali Hassani <68103095+alihassanijr@users.noreply.github.com>
---------
Co-authored-by: Ali Hassani <68103095+alihassanijr@users.noreply.github.com>
* [EVT] Add support for Row/Col broadcast PtrArray (NVIDIA#2033)
* Add group support to EVT row/col broadcast.
* small modifications
---------
Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
* v3.8.0 update (NVIDIA#2082)
* 3.8 update
* fix Markus' name
---------
Co-authored-by: yuzhai <yuzhai@nvidia.com>
* [WA] Fix compiling errors
---------
Co-authored-by: Saagar Jha <saagar@saagarjha.com>
Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
Co-authored-by: Sergey Klevtsov <141879860+sklevtsov-nvidia@users.noreply.github.com>
Co-authored-by: Tri Dao <tridao@users.noreply.github.com>
Co-authored-by: Xinyu Yang <ltyxy@buaa.edu.cn>
Co-authored-by: sijialou <sijia.lou@intel.com>
Co-authored-by: Bogumil Sapinski Mobica <48835513+Bogumil-Sapinski-Mobica@users.noreply.github.com>
Co-authored-by: Haicheng Wu <57973641+hwu36@users.noreply.github.com>
Co-authored-by: Lei Mao <dukeleimao@gmail.com>
Co-authored-by: 103yiran <1039105206@qq.com>
Co-authored-by: MaxAkaAltmer <MaxAkaAltmer@yandex.ru>
Co-authored-by: 侯奇 <houqi1993@gmail.com>
Co-authored-by: Lain <28486541+IwakuraRein@users.noreply.github.com>
Co-authored-by: Siyuan Fu <siyuanf@nvidia.com>
Co-authored-by: Caleb_Du <59528230+CalebDu@users.noreply.github.com>
Co-authored-by: LiYu Lu <luliyucoordinate@outlook.com>
Co-authored-by: azhurkevich <101208641+azhurkevich@users.noreply.github.com>
Co-authored-by: chenwei <15601910741@163.com>
Co-authored-by: Wenlei Bao <142055114+wenlei-bao@users.noreply.github.com>
Co-authored-by: LiuQiang <thorneliu@gmail.com>
Co-authored-by: dan_the_3rd <43445237+danthe3rd@users.noreply.github.com>
Co-authored-by: Jack Kosaian <jackkosaian@gmail.com>
Co-authored-by: Yujia Zhai <yzhai015@ucr.edu>
Co-authored-by: yuzhai <yuzhai@nvidia.com>
Co-authored-by: Andrew O'Neill <foolusion@gmail.com>
Co-authored-by: Dongxu.Wang <wangdongxuking61@gmail.com>
Co-authored-by: ZZK <359521840@qq.com>
Co-authored-by: Driss Guessous <32754868+drisspg@users.noreply.github.com>
Co-authored-by: ZincCat <52513999+zinccat@users.noreply.github.com>
Co-authored-by: Manish Gupta <mgupta.iitr@gmail.com>
Co-authored-by: bobliao <codechaser@163.com>
Co-authored-by: mihir-awatramani <162148077+mihir-awatramani@users.noreply.github.com>
Co-authored-by: Liang <44948473+soundOfDestiny@users.noreply.github.com>
Co-authored-by: zl <zl@deepseek.com>
Co-authored-by: Tadej Ciglarič <tadej.c@gmail.com>
Co-authored-by: Ali Hassani <68103095+alihassanijr@users.noreply.github.com>
Co-authored-by: Josh Fromm <jwfromm@meta.com>
To enable FP8 grouped GEMM with rowwise scaling in the epilogue, we need to be able to provide a list of pointers to the scales for each group. This PR extends Sm90Row/ColBroadcast to support PtrArray to handle this case. Now, if the ElementInput type is specified as a pointer, the corresponding input is treated as an array of pointers (one per group), enabling grouped GEMM. For example, in our case we define EVT nodes like this:
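The original EVT node definitions were not included in this excerpt. As an illustrative sketch only (the alias names, tile shape, and stride arguments below are assumptions, not taken from the PR), passing a pointer type as the input element would select the grouped pointer-array path:

```cpp
// Illustrative sketch -- names and template arguments are assumptions.
// Passing a pointer element type (ElementScale const*) instead of a plain
// ElementScale makes the node read one scale pointer per group.
using ElementScale = float;

// Per-row scale for operand A, broadcast along N (one pointer per group):
using ScaleA = cutlass::epilogue::fusion::Sm90ColBroadcast<
    0, TileShape, ElementScale const*, ElementScale,
    cute::Stride<cute::_1, cute::_0, cute::_0>>;

// Per-column scale for operand B, broadcast along M (one pointer per group):
using ScaleB = cutlass::epilogue::fusion::Sm90RowBroadcast<
    0, TileShape, ElementScale const*, ElementScale,
    cute::Stride<cute::_0, cute::_1, cute::_0>>;
```

With a plain (non-pointer) ElementScale, the same nodes behave as before, broadcasting a single row/column vector for a non-grouped GEMM.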