Fix multicast bug and optimize masked GEMM #193

Merged
LyricZhao merged 2 commits into main from multicast-fixed on Sep 12, 2025

Conversation

@yukuai26 (Collaborator) commented Sep 12, 2025

In the previous code, the variable that decides whether to use multicast was applied incorrectly: the variable indicating A's multicast capability was used to determine B's multicast capability. Fortunately, this error does not affect correctness, and it also does not impact performance for non-masked GEMM.
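A minimal C++ sketch of the corrected selection logic; the identifiers (`MulticastConfig`, `select_multicast`, `is_legal_on_a`, `is_legal_on_b`) are illustrative, not the actual DeepGEMM names:

```cpp
struct MulticastConfig {
    int num_multicast;       // 1 = multicast disabled, 2 = broadcast across a CTA pair
    bool is_multicast_on_a;  // whether the multicast operand is A (otherwise B)
};

MulticastConfig select_multicast(bool is_legal_on_a, bool is_legal_on_b) {
    // Before the fix, this branch tested is_legal_on_a as well, so B's
    // multicast decision was gated by A's capability.
    if (is_legal_on_b)
        return {2, false};   // multicast the B operand
    if (is_legal_on_a)
        return {2, true};    // multicast the A operand
    return {1, false};       // fall back: no multicast
}
```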

Additionally, I added B multicast logic for masked GEMM and tuned the parameters. Masked GEMM achieves a 10%-20% speedup on H800.

Before:

Testing m-grouped masked GEMM:
 > Perf (num_groups=1, expected_m_per_group=1024, n=4096, k=7168, 1D2D):   79 us |  778 TFLOPS |  578 GB/s
 > Perf (num_groups=1, expected_m_per_group=1024, n=7168, k=2048, 1D2D):   45 us |  727 TFLOPS |  732 GB/s
 > Perf (num_groups=2, expected_m_per_group= 512, n=4096, k=7168, 1D2D):   92 us |  744 TFLOPS |  839 GB/s
 > Perf (num_groups=2, expected_m_per_group= 512, n=7168, k=2048, 1D2D):   44 us |  660 TFLOPS | 1041 GB/s
 > Perf (num_groups=4, expected_m_per_group= 256, n=4096, k=7168, 1D2D):   95 us |  688 TFLOPS | 1425 GB/s
 > Perf (num_groups=4, expected_m_per_group= 256, n=7168, k=2048, 1D2D):   46 us |  573 TFLOPS | 1585 GB/s

After:

Testing m-grouped masked GEMM:
 > Perf (num_groups=1, expected_m_per_group=1024, n=4096, k=7168, 1D2D):   67 us |  920 TFLOPS |  683 GB/s
 > Perf (num_groups=1, expected_m_per_group=1024, n=7168, k=2048, 1D2D):   35 us |  931 TFLOPS |  937 GB/s
 > Perf (num_groups=2, expected_m_per_group= 512, n=4096, k=7168, 1D2D):   77 us |  879 TFLOPS |  992 GB/s
 > Perf (num_groups=2, expected_m_per_group= 512, n=7168, k=2048, 1D2D):   38 us |  751 TFLOPS | 1184 GB/s
 > Perf (num_groups=4, expected_m_per_group= 256, n=4096, k=7168, 1D2D):   82 us |  798 TFLOPS | 1652 GB/s
 > Perf (num_groups=4, expected_m_per_group= 256, n=7168, k=2048, 1D2D):   43 us |  613 TFLOPS | 1698 GB/s
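
For orientation, the TFLOPS column can be roughly reproduced as the GEMM FLOP count divided by the measured time. A minimal sketch, assuming the FLOP count is 2 × num_groups × expected_m_per_group × n × k; the printed times are rounded to whole microseconds, and the masked benchmark may account for the actual masked token counts, so the recomputed figure is only approximate:

```cpp
#include <cstdio>

int main() {
    // First "After" line: num_groups=1, expected_m_per_group=1024, n=4096, k=7168, 67 us.
    double flops = 2.0 * 1 * 1024 * 4096 * 7168;        // each multiply-add counts as 2 FLOPs
    double seconds = 67e-6;
    std::printf("%.0f TFLOPS\n", flops / seconds / 1e12);  // prints ~897, near the reported 920
    return 0;
}
```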

@yukuai26 yukuai26 closed this Sep 12, 2025
@yukuai26 yukuai26 deleted the multicast-fixed branch September 12, 2025 08:38
@yukuai26 yukuai26 restored the multicast-fixed branch September 12, 2025 08:44
@yukuai26 yukuai26 changed the title from "Fix multicast bug and profile masked GEMM" to "Fix multicast bug and optimize masked GEMM" Sep 12, 2025
@yukuai26 yukuai26 reopened this Sep 12, 2025
@LyricZhao LyricZhao merged commit 79f48ee into main Sep 12, 2025
@LyricZhao LyricZhao deleted the multicast-fixed branch September 25, 2025 09:25
FlamingoPg pushed a commit to sgl-project/DeepGEMM that referenced this pull request Oct 15, 2025
* Fix multicast bug and profile masked GEMM

* Updates and lint

---------

Co-authored-by: Kuai Yu <yukuai@deepseek.com>
Co-authored-by: Chenggang Zhao <chenggangz@deepseek.com>
LyricZhao added a commit that referenced this pull request Apr 16, 2026
* Minor fix

* Fix workspace API usages

* Minor fix

* Successful compilation

* Align tokens

* Dump profiling traces

* Load token into shared memory

* Store into remote buffers

* Use 2 GiB to debug

* Mega MoE Scheduler Update (#184)

* Update scheduler

* Fix ptx

* Add scheduler init

* Minor update

* Add tma load warp

* Add mma warp

* Update scheduler

* Minor fixes

* Add epilogue warps

* Minor fix

* Minor fix

* Minor fix

* Minor fix

* Epilogue Linear1 Left Phase

* Epilogue Linear1 Right Phase

* Support gran_k = 32

* Add cast

* Minor fix

* Add tma store

* Minor fix

* Finish Linear1 Right Epilogue

* Minor fix

* Minor fix

* Refactor specialized ld/st PTX

* Use scheduler in the kernel

* Minor fix

* Minor fix

* Allocate slot indices together

* SMEM CD FP8/BF16

* Rename scheduler namespace

* Linear2 store smem

* Process the last token block count

* Finish Linear2 Epilogue

* Minor fix

* Add m offset

* Fix bugs

* Fix __fns

* Fix dispatch and scheduler

* Add shared memory

* Add Tensor Check

* Add topk checks

* Add checks and L1/L2 tensormaps

* Finish tensormaps

* Minor fix

* Minor fix

* Minor fix

* Minor fix

* Minor fix

* Share epilogue warps with combine

* Reorder params

* Reorder params

* Many fixes

* Many fixes

* Add grid sync

* Many fixes

* Minor fix

* Add NVLink barriers

* Integrate NVLink barriers

* Add combine implementation and AGENTS.md

* Support -1 top-k indices

* Minor fix

* Early return for debugging

* Dispatch token indices as well

* Fix smem cd

* Deallocate tensor memory

* Some refactors

* Some renaming

* Typo

* Add assertions

* Fix L2 arrival bugs

* Enable combine epilogue

* Some refactors

* Bench legacy code

* Add a TODO

* Finish Linear2 Epilogue

* Add clamp to match the testing scripts

* Fix TMA wait

* Fix offsets

* Add a TODO

* Add a TODO for top-k weights

* Polish L1 code

* Earlier tensor memory empty signaling

* Add L2 tmem/smem read/write

* Write into remote

* Skip oob tokens

* Fix swizzling bugs

* Fix device indices

* Remove useless includes

* Remove useless includes x2

* Code lint

---------

Co-authored-by: Zhean Xu <94977922+zheanxu@users.noreply.github.com>
Co-authored-by: Zhean Xu <xza@deepseek.com>