Fix multicast bug and optimize masked GEMM #193

Merged
LyricZhao merged 2 commits into main from multicast-fixed on Sep 12, 2025

Conversation

@yukuai26 (Collaborator) commented Sep 12, 2025

In the previous code, the variable that decides whether to use multicast was applied incorrectly: the variable indicating A's multicast capability was used to determine B's multicast capability. Fortunately, this error does not affect correctness, and it also does not impact performance for non-masked GEMM.
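A minimal C++ sketch of the corrected selection logic; the identifiers (`MulticastConfig`, `select_multicast`, `is_legal_on_a`, `is_legal_on_b`) are illustrative, not the actual DeepGEMM names:

```cpp
struct MulticastConfig {
    int num_multicast;       // 1 = multicast disabled, 2 = broadcast across a CTA pair
    bool is_multicast_on_a;  // whether the multicast operand is A (otherwise B)
};

MulticastConfig select_multicast(bool is_legal_on_a, bool is_legal_on_b) {
    // Before the fix, this branch tested is_legal_on_a as well, so B's
    // multicast decision was gated by A's capability.
    if (is_legal_on_b)
        return {2, false};   // multicast the B operand
    if (is_legal_on_a)
        return {2, true};    // multicast the A operand
    return {1, false};       // fall back: no multicast
}
```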

Additionally, I added B multicast logic for masked GEMM and tuned the parameters. Masked GEMM achieves a 10%-20% speedup on H800.

Before:

Testing m-grouped masked GEMM:
 > Perf (num_groups=1, expected_m_per_group=1024, n=4096, k=7168, 1D2D):   79 us |  778 TFLOPS |  578 GB/s
 > Perf (num_groups=1, expected_m_per_group=1024, n=7168, k=2048, 1D2D):   45 us |  727 TFLOPS |  732 GB/s
 > Perf (num_groups=2, expected_m_per_group= 512, n=4096, k=7168, 1D2D):   92 us |  744 TFLOPS |  839 GB/s
 > Perf (num_groups=2, expected_m_per_group= 512, n=7168, k=2048, 1D2D):   44 us |  660 TFLOPS | 1041 GB/s
 > Perf (num_groups=4, expected_m_per_group= 256, n=4096, k=7168, 1D2D):   95 us |  688 TFLOPS | 1425 GB/s
 > Perf (num_groups=4, expected_m_per_group= 256, n=7168, k=2048, 1D2D):   46 us |  573 TFLOPS | 1585 GB/s

After:

Testing m-grouped masked GEMM:
 > Perf (num_groups=1, expected_m_per_group=1024, n=4096, k=7168, 1D2D):   67 us |  920 TFLOPS |  683 GB/s
 > Perf (num_groups=1, expected_m_per_group=1024, n=7168, k=2048, 1D2D):   35 us |  931 TFLOPS |  937 GB/s
 > Perf (num_groups=2, expected_m_per_group= 512, n=4096, k=7168, 1D2D):   77 us |  879 TFLOPS |  992 GB/s
 > Perf (num_groups=2, expected_m_per_group= 512, n=7168, k=2048, 1D2D):   38 us |  751 TFLOPS | 1184 GB/s
 > Perf (num_groups=4, expected_m_per_group= 256, n=4096, k=7168, 1D2D):   82 us |  798 TFLOPS | 1652 GB/s
 > Perf (num_groups=4, expected_m_per_group= 256, n=7168, k=2048, 1D2D):   43 us |  613 TFLOPS | 1698 GB/s
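
For orientation, the TFLOPS column can be roughly reproduced as the GEMM FLOP count divided by the measured time. A minimal sketch, assuming the FLOP count is 2 × num_groups × expected_m_per_group × n × k; the printed times are rounded to whole microseconds, and the masked benchmark may account for the actual masked token counts, so the recomputed figure is only approximate:

```cpp
#include <cstdio>

int main() {
    // First "After" line: num_groups=1, expected_m_per_group=1024, n=4096, k=7168, 67 us.
    double flops = 2.0 * 1 * 1024 * 4096 * 7168;        // each multiply-add counts as 2 FLOPs
    double seconds = 67e-6;
    std::printf("%.0f TFLOPS\n", flops / seconds / 1e12);  // prints ~897, near the reported 920
    return 0;
}
```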

@yukuai26 yukuai26 closed this Sep 12, 2025
@yukuai26 yukuai26 deleted the multicast-fixed branch September 12, 2025 08:38
@yukuai26 yukuai26 restored the multicast-fixed branch September 12, 2025 08:44
@yukuai26 yukuai26 changed the title from "Fix multicast bug and profile masked GEMM" to "Fix multicast bug and optimize masked GEMM" Sep 12, 2025
@yukuai26 yukuai26 reopened this Sep 12, 2025
@LyricZhao LyricZhao merged commit 79f48ee into main Sep 12, 2025
@LyricZhao LyricZhao deleted the multicast-fixed branch September 25, 2025 09:25
FlamingoPg pushed a commit to sgl-project/DeepGEMM that referenced this pull request Oct 15, 2025
* Fix multicast bug and profile masked GEMM

* Updates and lint

---------

Co-authored-by: Kuai Yu <yukuai@deepseek.com>
Co-authored-by: Chenggang Zhao <chenggangz@deepseek.com>
LyricZhao added a commit that referenced this pull request Apr 16, 2026
* Minor fix

* Fix workspace API usages

* Minor fix

* Successful compilation

* Align tokens

* Dump profiling traces

* Load token into shared memory

* Store into remote buffers

* Use 2 GiB to debug

* Mega MoE Scheduler Update (#184)

* Update scheduler

* Fix ptx

* Add scheduler init

* Minor update

* Add tma load warp

* Add mma warp

* Update scheduler

* Minor fixes

* Add epilogue warps

* Minor fix

* Minor fix

* Minor fix

* Minor fix

* Epilogue Linear1 Left Phase

* Epilogue Linear1 Right Phase

* Support gran_k = 32

* Add cast

* Minor fix

* Add tma store

* Minor fix

* Finish Linear1 Right Epilogue

* Minor fix

* Minor fix

* Refactor specialized ld/st PTX

* Use scheduler in the kernel

* Minor fix

* Minor fix

* Allocate slot indices together

* SMEM CD FP8/BF16

* Rename scheduler namespace

* Linear2 store smem

* Process the last token block count

* Finish Linear2 Epilogue

* Minor fix

* Add m offset

* Fix bugs

* Fix __fns

* Fix dispatch and scheduler

* Add shared memory

* Add Tensor Check

* Add topk checks

* Add checks and L1/L2 tensormaps

* Finish tensormaps

* Minor fix

* Minor fix

* Minor fix

* Minor fix

* Minor fix

* Share epilogue warps with combine

* Reorder params

* Reorder params

* Many fixes

* Many fixes

* Add grid sync

* Many fixes

* Minor fix

* Add NVLink barriers

* Integrate NVLink barriers

* Add combine implementation and AGENTS.md

* Support -1 top-k indices

* Minor fix

* Early return for debugging

* Dispatch token indices as well

* Fix smem cd

* Deallocate tensor memory

* Some refactors

* Some renaming

* Typo

* Add assertions

* Fix L2 arrival bugs

* Enable combine epilogue

* Some refactors

* Bench legacy code

* Add a TODO

* Finish Linear2 Epilogue

* Add clamp to match the testing scripts

* Fix TMA wait

* Fix offsets

* Add a TODO

* Add a TODO for top-k weights

* Polish L1 code

* Earlier tensor memory empty signaling

* Add L2 tmem/smem read/write

* Write into remote

* Skip oob tokens

* Fix swizzling bugs

* Fix device indices

* Remove useless includes

* Remove useless includes x2

* Code lint

---------

Co-authored-by: Zhean Xu <94977922+zheanxu@users.noreply.github.com>
Co-authored-by: Zhean Xu <xza@deepseek.com>