bugfix generic-k code in top-k with softmax #1993

Merged
hwu36 merged 3 commits into NVIDIA:main from t4c1:bugfix_generic_top_k
Feb 1, 2025
Conversation

@t4c1
Contributor

t4c1 commented Dec 17, 2024

Fixes a bug in the generic top-k softmax EVT implementation that produced wrong results whenever k was neither 2 nor 4.

This allows removal of the static assert requiring k to be either 2 or 4. The comment in example 61 is also updated to reflect that any k value is now supported.
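For context, the operation this epilogue implements can be stated as a host-side reference. The sketch below is a hypothetical Python model (not the CUTLASS code), assuming the epilogue keeps the k largest entries of each output row, applies a softmax over them, and zeroes everything else (equivalent to masking non-top-k entries to -inf before the softmax); `topk_softmax_row` is an illustrative name, and ties at the cutoff are resolved by keeping all tied entries.

```python
import math

def topk_softmax_row(row, k):
    """Reference: softmax over the k largest entries of a row;
    every other entry becomes 0 (as if masked to -inf before
    the softmax). Ties at the cutoff all survive."""
    # The k-th largest value acts as the cutoff.
    cutoff = sorted(row, reverse=True)[k - 1]
    # Subtract the row max for numerical stability.
    m = max(row)
    exps = [math.exp(x - m) if x >= cutoff else 0.0 for x in row]
    s = sum(exps)
    return [e / s for e in exps]
```

A correct implementation must produce this result for any k, which is exactly what this PR fixes for k other than 2 and 4.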

@alihassanijr
Contributor

@t4c1 Thank you for submitting this patch.

Just a note on the assertion: it's there mostly as a warning to users that the generic sort comes with serious performance implications. The control flow it requires introduces fairly heavy branching and register spilling, resulting in a smaller improvement over the baseline.
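The performance concern can be illustrated at a high level with a hypothetical sketch (not the CUTLASS code): for k = 2 the per-element update is a branchless pair of min/max operations, which a GPU compiler can lower to predicated instructions, while a generic k needs a data-dependent insertion loop whose trip count varies per element, i.e. real divergent branches and a live sorted buffer that can spill registers. The function names below are illustrative.

```python
def top2_branchless(buf, x):
    """k == 2: update the two largest values seen so far using
    only min/max -- no data-dependent control flow."""
    hi, lo = buf
    new_hi = max(hi, x)
    new_lo = max(lo, min(hi, x))
    return (new_hi, new_lo)

def topk_generic(buf, x):
    """Generic k: insertion into a sorted buffer. The guard and
    the while-loop are data dependent -- this is the branching
    (and register pressure) the assertion warns about."""
    buf = list(buf)  # kept sorted, descending
    if x > buf[-1]:
        buf[-1] = x
        i = len(buf) - 1
        while i > 0 and buf[i] > buf[i - 1]:
            buf[i], buf[i - 1] = buf[i - 1], buf[i]
            i -= 1
    return tuple(buf)
```

Both compute the same result for k = 2; the difference is purely in how the update maps onto GPU control flow.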

I'll let @hwu36 chime in on whether we want to get rid of the assertion.

@hwu36
Collaborator

hwu36 commented Dec 26, 2024

@alihassanijr , maybe leave it there but make the text clearer about the consequences?

@alihassanijr
Contributor

@hwu36 do you mean leave the assert there with a better message, or remove it and make the docs clearer about the consequences?

@hwu36
Collaborator

hwu36 commented Dec 26, 2024

Leave it there and improve the message, so that people who are willing to try it still can.

t4c1 and others added 2 commits January 3, 2025 13:16
Co-authored-by: Ali Hassani <68103095+alihassanijr@users.noreply.github.com>
…ith_topk_and_softmax.cu

Co-authored-by: Ali Hassani <68103095+alihassanijr@users.noreply.github.com>
@t4c1
Contributor Author

t4c1 commented Jan 15, 2025

I addressed the review comments some time ago. Is something else required to get this merged?

@t4c1
Contributor Author

t4c1 commented Jan 27, 2025

@alihassanijr anything else here to be done from my side?

@alihassanijr
Contributor

@hwu36 to approve the merge.

@hwu36 hwu36 merged commit 6f55278 into NVIDIA:main Feb 1, 2025
sijialouintel added a commit to sijialouintel/cutlass that referenced this pull request Feb 12, 2025
* Handle MNK Sm90{Row, Col}Reduction problem shapes (NVIDIA#1803)

* add is_last_tile

* Improve sm90 mixed dtype kernel (NVIDIA#1883)

* Add GMMA shape m64n40k16 (NVIDIA#1864)

* Add all supported GMMA shapes (NVIDIA#1890)

* add maximum support (NVIDIA#1833)

* fix typo (NVIDIA#1853)

* fix by adding public (NVIDIA#1753)

* added mapping for bf16 to torch::kBFloat16 (NVIDIA#1843)

Co-authored-by: Haicheng Wu <57973641+hwu36@users.noreply.github.com>

* Fix README (NVIDIA#1658)

* Fix README

* Improve README

---------

Co-authored-by: Haicheng Wu <57973641+hwu36@users.noreply.github.com>

* Adjusting code indentation (NVIDIA#1639)

* Include of regular_tile_iterator.h fixed for NVRTC (NVIDIA#1765)

* Include of regular_tile_iterator.h fixed for NVRTC

* More include fixed for NVRTC

* Update gemm_f16n_f16t_f32t_tensor_op_f32_sm80.cu with include "cutlass/gemm/device/gemm_universal.h" (NVIDIA#1569)

fix compile with `cmake .. -DCUTLASS_ENABLE_TESTS=ON -DCUTLASS_TEST_LEVEL=2`

* remove redundant hardcoded packing configs in mixed dtype gemm (NVIDIA#1894)

Co-authored-by: Siyuan Fu <siyuanf@nvidia.com>

* fix wrong A/BLayout in MMA_Traits for binary mma and append other MMA_Traits support  (NVIDIA#1856)

* fix wrong A/BLayout in  MMA_Traits<SM80_16x8x256_S32U1U1S32_TN_XORPOPC> and append support for  m8n8k128, m16n8k128  mma.and.popc in MMA_Traits instantiation

* add "print" template for  subbyte_reference<T>

* Add a print for the uint{x}b_t type. (NVIDIA#1871)

* Refactor some GroupedGEMM logic (NVIDIA#1899)

* feat: support kFactor 8 used in mma tensor op tile iterator (NVIDIA#1512)

* Update publications (NVIDIA#1912)

* remove restriction of stride == kernel in nhwc_pooling (NVIDIA#1896)

* fix undefined in device code error (NVIDIA#1880)

* Fix the racing condition of mixed-input gemm when writing the registers (NVIDIA#1931)

* move two warpgroup_wait

* merge main

---------

Co-authored-by: Siyuan Fu <siyuanf@nvidia.com>

* Fix `cutlass` python library with cuda `12.6.2.post1` (NVIDIA#1942)

* Fix `cutlass` python library with cuda `12.6.2.post1`

Previously we had this error:
```
  File "/storage/home/cutlass/python/cutlass/backend/operation.py", line 39, in <listcomp>
    _version_splits = [int(x) for x in __version__.split("rc")[0].split(".")]
                       ^^^^^^
ValueError: invalid literal for int() with base 10: 'post1'
```
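The failure above is `int()` choking on the local-version suffix in `12.6.2.post1` after the naive `split(".")`. A hedged sketch of the kind of fix involved (illustrative, not necessarily the exact patch that landed): take only the leading numeric run of each dotted component, so pre-release, post-release, and local suffixes are ignored.

```python
import re

def parse_version(version):
    """Split a version string like '12.6.2.post1' or '3.7.0rc1'
    into its leading numeric components, ignoring suffixes that
    int() cannot parse."""
    parts = []
    for piece in version.split(".")[:3]:
        m = re.match(r"\d+", piece)
        if m is None:
            break  # purely non-numeric component, e.g. 'post1'
        parts.append(int(m.group()))
    return tuple(parts)
```

With this, `12.6.2.post1` parses to `(12, 6, 2)` instead of raising `ValueError`.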

* Update sm90_utils.py

* Update generator.py

* Update python/cutlass_library/generator.py

Co-authored-by: Jack Kosaian <jackkosaian@gmail.com>

* Update python/cutlass_library/sm90_utils.py

Co-authored-by: Jack Kosaian <jackkosaian@gmail.com>

---------

Co-authored-by: Jack Kosaian <jackkosaian@gmail.com>

* add {uint4, uint2, int2} => {fp16, bf16} conversion (NVIDIA#1966)

* Improve mixed dtype GEMM (NVIDIA#1972)

* update

* fix a typo

* fix a typo that fails the compiling when ElementScale is not the same as MmaType (NVIDIA#1977)

* Fix CuTe README Typo (NVIDIA#1951)

* Fix Typo (NVIDIA#1962)

* 3.6.0 update (NVIDIA#2005)

* 3.6.0 update

* doc and swap stuff

---------

Co-authored-by: yuzhai <yuzhai@nvidia.com>
Co-authored-by: Haicheng Wu <haichengw@nvidia.com>

* Update CHANGELOG.md

* Update 0x_gemm_tutorial.md (NVIDIA#1982)

Shouldn't this be BLK_M, BLK_**K**, k

* fix bug: arch/mma_sm60.h Mma<2,2,1> calculate wrong (NVIDIA#1989)

* fix mem fence (NVIDIA#2030)

Co-authored-by: yuzhai <yuzhai@nvidia.com>

* Add half->int8 saturate conversion to promise valid range (NVIDIA#1983)

* Add half->int8 saturate conversion to promise valid range

* add gpu only macro

---------

Co-authored-by: Haicheng Wu <haichengw@nvidia.com>

* Add vector-types back to platform.h (NVIDIA#2026)

* Fix typo in library_defaults.py (NVIDIA#2024)

* Fix Typos (NVIDIA#2021)

* Fix Typo

* Fix Typo

* Add Line Break (NVIDIA#2020)

* Blockwise Scaling for FP8 (NVIDIA#1932)

* F8 Blockwise Scaling

* two more NumProducerThreadEvents

---------

Co-authored-by: Haicheng Wu <haichengw@nvidia.com>

* fix assertion in integer_subbytes.h (NVIDIA#1961)

* CUTLASS 3.7 (NVIDIA#2045)

* CUTLASS 3.7

* clean up changelog

---------

Co-authored-by: yuzhai <yuzhai@nvidia.com>
Co-authored-by: Haicheng Wu <haichengw@nvidia.com>

* update 3.7 docs (NVIDIA#2051)

* update docs

* update docs

* update docs

* update docs

* update docs

---------

Co-authored-by: yuzhai <yuzhai@nvidia.com>

* CUTLASS 3.8 Release (NVIDIA#2059)

* CUTLASS 3.8 Release

* update

* Update README.md

* Revert "Update README.md"

This reverts commit b353e36.

* update

* update

---------

Co-authored-by: Haicheng Wu <57973641+hwu36@users.noreply.github.com>
Co-authored-by: Haicheng Wu <haichengw@nvidia.com>

* fix cuda 12.6 issues (NVIDIA#2066)

* fix a readme broken link (NVIDIA#2069)

* Update README.md

* Groupwise scaling along M for FP8 gemm (NVIDIA#2037)

* FP8 groupwise scaling along M

* small updates

---------

Co-authored-by: zl <zl@deepseek.com>
Co-authored-by: Haicheng Wu <haichengw@nvidia.com>

* bugfix generic-k code in top-k with softmax (NVIDIA#1993)

* bugfix generic-k code in top-k with softmax

* Update include/cutlass/epilogue/fusion/sm90_visitor_topk_softmax.hpp

Co-authored-by: Ali Hassani <68103095+alihassanijr@users.noreply.github.com>

* Update examples/61_hopper_gemm_with_topk_and_softmax/61_hopper_gemm_with_topk_and_softmax.cu

Co-authored-by: Ali Hassani <68103095+alihassanijr@users.noreply.github.com>

---------

Co-authored-by: Ali Hassani <68103095+alihassanijr@users.noreply.github.com>

* [EVT] Add support for Row/Col broadcast PtrArray (NVIDIA#2033)

* Add group support to EVT row/col broadcast.

* small modifications

---------

Co-authored-by: Haicheng Wu <haichengw@nvidia.com>

* v3.8.0 update (NVIDIA#2082)

* 3.8 update

* fix Markus' name

---------

Co-authored-by: yuzhai <yuzhai@nvidia.com>

* [WA] Fix compiling errors

---------

Co-authored-by: Saagar Jha <saagar@saagarjha.com>
Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
Co-authored-by: Sergey Klevtsov <141879860+sklevtsov-nvidia@users.noreply.github.com>
Co-authored-by: Tri Dao <tridao@users.noreply.github.com>
Co-authored-by: Xinyu Yang <ltyxy@buaa.edu.cn>
Co-authored-by: sijialou <sijia.lou@intel.com>
Co-authored-by: Bogumil Sapinski Mobica <48835513+Bogumil-Sapinski-Mobica@users.noreply.github.com>
Co-authored-by: Haicheng Wu <57973641+hwu36@users.noreply.github.com>
Co-authored-by: Lei Mao <dukeleimao@gmail.com>
Co-authored-by: 103yiran <1039105206@qq.com>
Co-authored-by: MaxAkaAltmer <MaxAkaAltmer@yandex.ru>
Co-authored-by: 侯奇 <houqi1993@gmail.com>
Co-authored-by: Lain <28486541+IwakuraRein@users.noreply.github.com>
Co-authored-by: Siyuan Fu <siyuanf@nvidia.com>
Co-authored-by: Caleb_Du <59528230+CalebDu@users.noreply.github.com>
Co-authored-by: LiYu Lu <luliyucoordinate@outlook.com>
Co-authored-by: azhurkevich <101208641+azhurkevich@users.noreply.github.com>
Co-authored-by: chenwei <15601910741@163.com>
Co-authored-by: Wenlei Bao <142055114+wenlei-bao@users.noreply.github.com>
Co-authored-by: LiuQiang <thorneliu@gmail.com>
Co-authored-by: dan_the_3rd <43445237+danthe3rd@users.noreply.github.com>
Co-authored-by: Jack Kosaian <jackkosaian@gmail.com>
Co-authored-by: Yujia Zhai <yzhai015@ucr.edu>
Co-authored-by: yuzhai <yuzhai@nvidia.com>
Co-authored-by: Andrew O'Neill <foolusion@gmail.com>
Co-authored-by: Dongxu.Wang <wangdongxuking61@gmail.com>
Co-authored-by: ZZK <359521840@qq.com>
Co-authored-by: Driss Guessous <32754868+drisspg@users.noreply.github.com>
Co-authored-by: ZincCat <52513999+zinccat@users.noreply.github.com>
Co-authored-by: Manish Gupta <mgupta.iitr@gmail.com>
Co-authored-by: bobliao <codechaser@163.com>
Co-authored-by: mihir-awatramani <162148077+mihir-awatramani@users.noreply.github.com>
Co-authored-by: Liang <44948473+soundOfDestiny@users.noreply.github.com>
Co-authored-by: zl <zl@deepseek.com>
Co-authored-by: Tadej Ciglarič <tadej.c@gmail.com>
Co-authored-by: Ali Hassani <68103095+alihassanijr@users.noreply.github.com>
Co-authored-by: Josh Fromm <jwfromm@meta.com>
hgl71964 pushed a commit to hgl71964/cutlass that referenced this pull request Feb 21, 2025
andralex pushed a commit to andralex/cutlass that referenced this pull request Jun 14, 2025
Albresky pushed a commit to Albresky/cutlass that referenced this pull request Oct 11, 2025

3 participants