Add GMMA shape m64n40k16 by tridao · Pull Request #1864 · NVIDIA/cutlass

tridao · 2024-10-11T18:52:59Z

This GMMA shape is used in the FA3 backward pass for headdim 256 (tile size 64 x 80, split across 2 warpgroups, so each warpgroup computes a 64 x 40 tile).
cc @thakkarV
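
The tile arithmetic behind the shape can be sketched as follows. This is only an illustration, assuming the 64 x 80 tile is split evenly along N across the 2 warpgroups; `TileShape` and `split_along_n` are hypothetical helpers, not CUTLASS or FA3 names.

```cpp
#include <cstddef>

// Hypothetical helper type (not part of CUTLASS): an MMA tile as (M, N).
struct TileShape {
  std::size_t m;
  std::size_t n;
};

// Assumption: the tile is split evenly along N, one slice per warpgroup.
constexpr TileShape split_along_n(TileShape tile, std::size_t num_warpgroups) {
  return TileShape{tile.m, tile.n / num_warpgroups};
}

// Tile size quoted in the PR description for the headdim 256 backward pass.
constexpr TileShape kFullTile{64, 80};
constexpr std::size_t kNumWarpgroups = 2;
constexpr TileShape kPerWarpgroup = split_along_n(kFullTile, kNumWarpgroups);

// The split must be exact, otherwise the per-warpgroup shape is ill-defined.
static_assert(kFullTile.n % kNumWarpgroups == 0,
              "N must divide evenly across warpgroups");
// Each warpgroup ends up with M = 64, N = 40, i.e. the m64n40 part of m64n40k16.
static_assert(kPerWarpgroup.m == 64 && kPerWarpgroup.n == 40,
              "per-warpgroup tile should be 64 x 40");
// GMMA N extents for 16-bit inputs are multiples of 8, which 40 satisfies.
static_assert(kPerWarpgroup.n % 8 == 0, "N must be a multiple of 8");
```

The checks run at compile time, so the snippet either compiles cleanly or reports which assumption fails.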

thakkarV · 2024-10-11T21:09:41Z

sklevtsov-nvidia · 2024-10-22T00:34:33Z

Hi @tridao, we are adding all supported GMMA shapes: #1890. Out of concern for compile times, we moved all "extended" instruction shapes into their own headers, which are conditionally included in mma_sm90_gmma.hpp and mma_traits_sm90_gmma.hpp. Nothing needs to change on your side: compiling with CUTLASS_ENABLE_SM90_EXTENDED_MMA_SHAPES defined is enough.

However, it may not be possible to merge both PRs because of a conflict. Alternatively, @hwu36, if you merge this one first, I'm happy to rebase mine and resolve the conflict.
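
The opt-in mechanism described above might look roughly like the sketch below. It assumes the guard is a plain preprocessor check on `CUTLASS_ENABLE_SM90_EXTENDED_MMA_SHAPES`; the exact guard form and the names of the extended headers are not given in this thread, so the include is left as a comment and `kExtendedShapesEnabled` is a hypothetical flag used only for illustration.

```cpp
// Normally the macro would come from the build, for example
//   nvcc -DCUTLASS_ENABLE_SM90_EXTENDED_MMA_SHAPES ...
// It is defined here only so the sketch is self-contained.
#ifndef CUTLASS_ENABLE_SM90_EXTENDED_MMA_SHAPES
#define CUTLASS_ENABLE_SM90_EXTENDED_MMA_SHAPES
#endif

// Assumption: the main GMMA headers guard the extended-shape headers like this.
#if defined(CUTLASS_ENABLE_SM90_EXTENDED_MMA_SHAPES)
// The header holding the extended instruction shapes would be included here.
constexpr bool kExtendedShapesEnabled = true;   // hypothetical flag
#else
constexpr bool kExtendedShapesEnabled = false;  // hypothetical flag
#endif

static_assert(kExtendedShapesEnabled,
              "extended GMMA shapes are expected to be enabled in this sketch");
```

Keeping the extended shapes behind a macro means translation units that never use them do not pay the compile-time cost of parsing those headers.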

* Handle MNK Sm90{Row, Col}Reduction problem shapes (NVIDIA#1803) * add is_last_tile * Improve sm90 mixed dtype kernel (NVIDIA#1883) * Add GMMA shape m64n40k16 (NVIDIA#1864) * Add all supported GMMA shapes (NVIDIA#1890) * add maximum support (NVIDIA#1833) * fix typo (NVIDIA#1853) * fix by adding public (NVIDIA#1753) * added mapping for bf16 to torch::kBFloat16 (NVIDIA#1843) Co-authored-by: Haicheng Wu <57973641+hwu36@users.noreply.github.com> * Fix README (NVIDIA#1658) * Fix README * Improve README --------- Co-authored-by: Haicheng Wu <57973641+hwu36@users.noreply.github.com> * Adjusting code indentation (NVIDIA#1639) * Include of regular_tile_iterator.h fixed for NVRTC (NVIDIA#1765) * Include of regular_tile_iterator.h fixed for NVRTC * More include fixed for NVRTC * Update gemm_f16n_f16t_f32t_tensor_op_f32_sm80.cu with include "cutlass/gemm/device/gemm_universal.h" (NVIDIA#1569) fix compile with `cmake .. -DCUTLASS_ENABLE_TESTS=ON -DCUTLASS_TEST_LEVEL=2` * remove redundant hardcoded packing configs in mixed dtype gemm (NVIDIA#1894) Co-authored-by: Siyuan Fu <siyuanf@nvidia.com> * fix wrong A/BLayout in MMA_Traits for binary mma and append other MMA_Traits support (NVIDIA#1856) * fix wrong A/BLayout in MMA_Traits<SM80_16x8x256_S32U1U1S32_TN_XORPOPC> and append support for m8n8k128, m16n8k128 mma.and.popc in MMA_Traits instantiation * add "print" template for subbyte_reference<T> * Add a print for the uint{x}b_t type. 
(NVIDIA#1871) * Refactor some GroupedGEMM logic (NVIDIA#1899) * feat: support kFactor 8 used in mma tensor op tile iterator (NVIDIA#1512) * Update publications (NVIDIA#1912) * remove restriction of stride == kernel in nhwc_pooling (NVIDIA#1896) * fix undefined in device code error (NVIDIA#1880) * Fix the racing condition of mixed-input gemm when writing the registers (NVIDIA#1931) * move two warpgroup_wait * merge main --------- Co-authored-by: Siyuan Fu <siyuanf@nvidia.com> * Fix `cutlass` python library with cuda `12.6.2.post1` (NVIDIA#1942) * Fix `cutlass` python library with cuda `12.6.2.post1` Previously we had this error: ``` File "/storage/home/cutlass/python/cutlass/backend/operation.py", line 39, in <listcomp> _version_splits = [int(x) for x in __version__.split("rc")[0].split(".")] ^^^^^^ ValueError: invalid literal for int() with base 10: 'post1' ``` * Update sm90_utils.py * Update generator.py * Update python/cutlass_library/generator.py Co-authored-by: Jack Kosaian <jackkosaian@gmail.com> * Update python/cutlass_library/sm90_utils.py Co-authored-by: Jack Kosaian <jackkosaian@gmail.com> --------- Co-authored-by: Jack Kosaian <jackkosaian@gmail.com> * add {uint4, uint2, int2} => {fp16, bf16} conversion (NVIDIA#1966) * Improve mixed dtype GEMM (NVIDIA#1972) * update * fix a typo * fix a typo that fails the compiling when ElementScale is not the same as MmaType (NVIDIA#1977) * Fix CuTe README Typo (NVIDIA#1951) * Fix Typo (NVIDIA#1962) * 3.6.0 update (NVIDIA#2005) * 3.6.0 update * doc and swap stuff --------- Co-authored-by: yuzhai <yuzhai@nvidia.com> Co-authored-by: Haicheng Wu <haichengw@nvidia.com> * Update CHANGELOG.md * Update 0x_gemm_tutorial.md (NVIDIA#1982) Shouldn't this be BLK_M, BLK_**K**, k * fix bug: arch/mma_sm60.h Mma<2,2,1> calculate wrong (NVIDIA#1989) * fix mem fence (NVIDIA#2030) Co-authored-by: yuzhai <yuzhai@nvidia.com> * Add half->int8 saturate conversion to promise valid range (NVIDIA#1983) * Add half->int8 saturate conversion to 
promise valid range * add gpu only macro --------- Co-authored-by: Haicheng Wu <haichengw@nvidia.com> * Add vector-types back to platform.h (NVIDIA#2026) * Fix typo in library_defaults.py (NVIDIA#2024) * Fix Typos (NVIDIA#2021) * Fix Typo * Fix Typo * Add Line Break (NVIDIA#2020) * Blockwise Scaling for FP8 (NVIDIA#1932) * F8 Blockwise Scaling * two more NumProducerThreadEvents --------- Co-authored-by: Haicheng Wu <haichengw@nvidia.com> * fix assertion in integer_subbytes.h (NVIDIA#1961) * CUTLASS 3.7 (NVIDIA#2045) * CUTLASS 3.7 * clean up changelog --------- Co-authored-by: yuzhai <yuzhai@nvidia.com> Co-authored-by: Haicheng Wu <haichengw@nvidia.com> * update 3.7 docs (NVIDIA#2051) * update docs * update docs * update docs * update docs * update docs --------- Co-authored-by: yuzhai <yuzhai@nvidia.com> * CUTLASS 3.8 Release (NVIDIA#2059) * CUTLASS 3.8 Release * update * Update README.md * Revert "Update README.md" This reverts commit b353e36. * update * update --------- Co-authored-by: Haicheng Wu <57973641+hwu36@users.noreply.github.com> Co-authored-by: Haicheng Wu <haichengw@nvidia.com> * fix cuda 12.6 issues (NVIDIA#2066) * fix a readme broken link (NVIDIA#2069) * Update README.md * Groupwise scaling along M for FP8 gemm (NVIDIA#2037) * FP8 groupwise scaling along M * small updates --------- Co-authored-by: zl <zl@deepseek.com> Co-authored-by: Haicheng Wu <haichengw@nvidia.com> * bugfix generic-k code in top-k with softmax (NVIDIA#1993) * bugfix generic-k code in top-k with softmax * Update include/cutlass/epilogue/fusion/sm90_visitor_topk_softmax.hpp Co-authored-by: Ali Hassani <68103095+alihassanijr@users.noreply.github.com> * Update examples/61_hopper_gemm_with_topk_and_softmax/61_hopper_gemm_with_topk_and_softmax.cu Co-authored-by: Ali Hassani <68103095+alihassanijr@users.noreply.github.com> --------- Co-authored-by: Ali Hassani <68103095+alihassanijr@users.noreply.github.com> * [EVT] Add support for Row/Col broadcast PtrArray (NVIDIA#2033) * Add group 
support to EVT row/col broadcast. * small modifications --------- Co-authored-by: Haicheng Wu <haichengw@nvidia.com> * v3.8.0 update (NVIDIA#2082) * 3.8 update * fix Markus' name --------- Co-authored-by: yuzhai <yuzhai@nvidia.com> * [WA] Fix compiling errors --------- Co-authored-by: Saagar Jha <saagar@saagarjha.com> Co-authored-by: Haicheng Wu <haichengw@nvidia.com> Co-authored-by: Sergey Klevtsov <141879860+sklevtsov-nvidia@users.noreply.github.com> Co-authored-by: Tri Dao <tridao@users.noreply.github.com> Co-authored-by: Xinyu Yang <ltyxy@buaa.edu.cn> Co-authored-by: sijialou <sijia.lou@intel.com> Co-authored-by: Bogumil Sapinski Mobica <48835513+Bogumil-Sapinski-Mobica@users.noreply.github.com> Co-authored-by: Haicheng Wu <57973641+hwu36@users.noreply.github.com> Co-authored-by: Lei Mao <dukeleimao@gmail.com> Co-authored-by: 103yiran <1039105206@qq.com> Co-authored-by: MaxAkaAltmer <MaxAkaAltmer@yandex.ru> Co-authored-by: 侯奇 <houqi1993@gmail.com> Co-authored-by: Lain <28486541+IwakuraRein@users.noreply.github.com> Co-authored-by: Siyuan Fu <siyuanf@nvidia.com> Co-authored-by: Caleb_Du <59528230+CalebDu@users.noreply.github.com> Co-authored-by: LiYu Lu <luliyucoordinate@outlook.com> Co-authored-by: azhurkevich <101208641+azhurkevich@users.noreply.github.com> Co-authored-by: chenwei <15601910741@163.com> Co-authored-by: Wenlei Bao <142055114+wenlei-bao@users.noreply.github.com> Co-authored-by: LiuQiang <thorneliu@gmail.com> Co-authored-by: dan_the_3rd <43445237+danthe3rd@users.noreply.github.com> Co-authored-by: Jack Kosaian <jackkosaian@gmail.com> Co-authored-by: Yujia Zhai <yzhai015@ucr.edu> Co-authored-by: yuzhai <yuzhai@nvidia.com> Co-authored-by: Andrew O'Neill <foolusion@gmail.com> Co-authored-by: Dongxu.Wang <wangdongxuking61@gmail.com> Co-authored-by: ZZK <359521840@qq.com> Co-authored-by: Driss Guessous <32754868+drisspg@users.noreply.github.com> Co-authored-by: ZincCat <52513999+zinccat@users.noreply.github.com> Co-authored-by: Manish Gupta 
<mgupta.iitr@gmail.com> Co-authored-by: bobliao <codechaser@163.com> Co-authored-by: mihir-awatramani <162148077+mihir-awatramani@users.noreply.github.com> Co-authored-by: Liang <44948473+soundOfDestiny@users.noreply.github.com> Co-authored-by: zl <zl@deepseek.com> Co-authored-by: Tadej Ciglarič <tadej.c@gmail.com> Co-authored-by: Ali Hassani <68103095+alihassanijr@users.noreply.github.com> Co-authored-by: Josh Fromm <jwfromm@meta.com>

Add GMMA shape m64n40k16

9892232

tridao force-pushed the tridao/mma40 branch from 50b2a75 to 9892232 Compare October 12, 2024 00:45

hwu36 approved these changes Oct 22, 2024

hwu36 merged commit 5b50a8f into NVIDIA:main Oct 22, 2024

hgl71964 pushed a commit to hgl71964/cutlass that referenced this pull request Feb 21, 2025

Add GMMA shape m64n40k16 (NVIDIA#1864)

31520ed

tridao deleted the tridao/mma40 branch April 4, 2025 16:13

andralex pushed a commit to andralex/cutlass that referenced this pull request Jun 14, 2025

Add GMMA shape m64n40k16 (NVIDIA#1864)

37c2a96

Albresky pushed a commit to Albresky/cutlass that referenced this pull request Oct 11, 2025

Add GMMA shape m64n40k16 (NVIDIA#1864)

240c967

hwu36 merged 1 commit into NVIDIA:main from Dao-AILab:tridao/mma40

4 participants