Fix multicast bug and optimize masked GEMM #193
Merged
Conversation
FlamingoPg pushed a commit to sgl-project/DeepGEMM that referenced this pull request on Oct 15, 2025:
* Fix multicast bug and profile masked GEMM
* Updates and lint
Co-authored-by: Kuai Yu <yukuai@deepseek.com>
Co-authored-by: Chenggang Zhao <chenggangz@deepseek.com>
LyricZhao added a commit that referenced this pull request on Apr 16, 2026: Mega MoE Scheduler Update (#184), a large squashed change covering the scheduler rewrite, the TMA load / MMA / epilogue warps, the Linear1 and Linear2 epilogues, FP8/BF16 shared-memory C/D paths, tensormap and top-k checks, NVLink barriers, the combine implementation, and many small fixes and lint passes.
Co-authored-by: Zhean Xu <94977922+zheanxu@users.noreply.github.com>
Co-authored-by: Zhean Xu <xza@deepseek.com>
In the previous code, the variable used to decide whether to enable multicast was applied incorrectly: the flag that judges whether A can be multicast was used to decide whether B can be multicast. Fortunately, this error does not affect correctness, and it also does not hurt performance for non-masked GEMMs.
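For illustration, here is a minimal Python sketch of the swapped-flag bug. All names (`is_multicast_legal`, `select_b_multicast`, etc.) are hypothetical stand-ins for DeepGEMM's actual config-selection code, not its real API:

```python
def ceil_div(a: int, b: int) -> int:
    return (a + b - 1) // b

def is_multicast_legal(dim: int, block: int, mc: int, num_sms: int) -> bool:
    # Multicasting `mc` ways is legal only if the tiles along `dim` and the
    # SM count both split evenly across clusters of size `mc`.
    return ceil_div(dim, block) % mc == 0 and num_sms % mc == 0

def select_b_multicast(m: int, n: int, block_m: int, block_n: int,
                       num_sms: int, mc: int = 2) -> bool:
    # Multicasting A shares one A tile across CTAs covering different
    # N-tiles; multicasting B shares one B tile across different M-tiles.
    is_legal_on_a = is_multicast_legal(n, block_n, mc, num_sms)
    is_legal_on_b = is_multicast_legal(m, block_m, mc, num_sms)
    # Buggy version reused the A-side flag for the B-side decision:
    #   return is_legal_on_a
    return is_legal_on_b  # fixed: gate B multicast on its own check
```

For typical non-masked shapes the two flags often coincide, which is consistent with the note above that correctness and non-masked performance were unaffected.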
Additionally, I added B-multicast logic for masked GEMM and adjusted the parameters. Masked GEMM achieves a 10%-20% speedup on H800.
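As a rough illustration of this second change, the sketch below (again with hypothetical names, not DeepGEMM's real API) shows how a masked-GEMM config path might gate 2-way B multicast on an expected per-group M, since the true per-group M is only known at launch time from the mask:

```python
def ceil_div(a: int, b: int) -> int:
    return (a + b - 1) // b

def masked_gemm_config(expected_m_per_group: int, block_m: int,
                       num_sms: int, num_groups: int) -> dict:
    # Estimate the total number of M-tiles from the expected tokens per group.
    num_m_blocks = ceil_div(expected_m_per_group, block_m) * num_groups
    # Enable 2-way B multicast only when the M-tiles and SMs split evenly,
    # so every CTA in a cluster has a partner sharing the same B tile.
    mc = 2 if (num_m_blocks % 2 == 0 and num_sms % 2 == 0) else 1
    return {'num_tma_multicast': mc, 'block_m': block_m}

# Example: 4096 expected tokens per group, BLOCK_M = 64, 132 SMs, 8 groups.
print(masked_gemm_config(4096, 64, 132, 8))  # {'num_tma_multicast': 2, ...}
```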
Before: (benchmark screenshot)
After: (benchmark screenshot)