[dev] [DeepSeek-v4] Part 4: Fusion Kernels for DSv4 Hybrid Attention#4894
Conversation
|
/claude strict-review |
There was a problem hiding this comment.
Review Summary
CRITICAL: 1 | IMPORTANT: 2 | SUGGESTION: 1
Critical
- Dense path forward/backward inconsistency (
dsa_kernels.py:800-807): InFusedIndexerSparseAttnFunc.forward, the dense loss branch passes both pre-scaled weights (w_bsh_scaled = w_bsh * indexer_softmax_scale) ANDindexer_softmax_scaleassm_scaleto_compute_dense_indexer_score, double-applying the scale factor. The backward (line 908) correctly uses raww_bsh+sm_scale(single application). This forward/backward mismatch produces incorrect optimizer updates for indexer parameters whensparse_loss=False. The parity test (test_dsa_kernels.py:3399) explicitly documents this as an "apparent bug" and works around it withindexer_softmax_scale ** 2in the reference. Fix: passw_bshinstead ofw_bsh_scaledto_compute_dense_indexer_score.
Important
-
Default change for
force_unfused_dsa(csa.py:607): Changed fromTrue→False, silently switching all existing users to the fused path which requiresflash_mlaandcudnn.DSA. Users without these packages will get runtimeImportErrorinstead of the previously-working unfused path. -
Potential numerical fragility in
_kl_loss_from_dense_scores(dsa_kernels.py:648-676): Missing per-position masking for-infentries at ratio-causal-masked positions. If the cuDNN kernel outputs-infat masked positions, partially-masked rows will produceinfin the KL sum (eps * (log(eps) - (-inf)) = inf). The test reference_ref_dense_indexer_losshandles this explicitly but the production code does not.
Suggestion
- Missing NVTX annotations (
csa.py:779-848): The new fused paths (A, B, C) lack thenvtx_range_push/popcalls present in the unfused path, reducing profiling visibility for what are now the default codepaths.
Overall Assessment
Risk: Medium-High. The core architecture — lazy kernel imports, autograd Functions with fwd/bwd pairing, index helpers, and the three-path integration into csa.py — is well-structured and thoroughly tested. The sparse loss path (the default) appears correct. However, the dense loss path has a forward/backward scale inconsistency that must be fixed before merge. The default-change for force_unfused_dsa also needs either a graceful fallback or explicit documentation of the new runtime dependencies.
|
/ok to test 2b3c815 |
| # weights-scaling trick so the effective weights match | ||
| # the pre-scale-split behaviour. | ||
| weights_for_unfused = weights_indexer * self.indexer.softmax_scale | ||
| weights_for_unfused = weights_indexer.float() * self.indexer.softmax_scale |
There was a problem hiding this comment.
Why don't pass the softmax_scale into kernel directly?
There was a problem hiding this comment.
This is for the unfused path. Since the fused kernel accepts the scale, we only need to multiply the scale in the unfused path.
|
/claude strict-review |
Code Review Summary — PR #4894Scope: Fusion kernels for DSv4 Hybrid Attention (FlashMLA fwd + cuDNN DSA bwd integration) Findings
Risk Assessment
VerdictRequest changes. Findings 1 and 2 should be addressed before merge. Finding 3 is optional but recommended. 🤖 Generated with Claude Code |
|
/ok to test d91242a |
|
/ok to test 8ba9cc0 |
|
/ok to test fca33ac |
|
/ok to test b86e4c5 |
|
/ok to test a864690 |
|
/ok to test 70a6673 |
|
/ok to test a31271e |
|
/ok to test 4e560b7 |
ko3n1g
left a comment
There was a problem hiding this comment.
question regarding pin strictness
| "megatron-energon[av_decode]~=6.0", | ||
| "av", | ||
| "flashinfer-python>=0.5.0,<0.7.0", | ||
| "nvidia-cudnn-frontend==1.24.0", |
There was a problem hiding this comment.
do we really need such a strict pin for all of MLM? Can we have a version guard inside core instead?
|
/ok to test 04bade0 |
|
/ok to test ccc8449 |
|
/ok to test 5a806d1 |
|
🔄 Merge queue validation started! You can track the progress here: https://github.com/NVIDIA/Megatron-LM/actions/runs/26489072473 |
What does this PR do ?
https://github.com/NVIDIA/cudnn-frontend release v1.24.0 added the fusion kernels for DeepSeek-V4 Hybrid Attention.
This PR integrated these kernel into MCore.
Issue tracking
For PRs from open-source community contributors:
Linked issue:
Contribution process
Pre-checks
Code review
Feel free to message or comment @NVIDIA/mcore-oncall to help accelerate your merge into main. The less complex your PR is, the faster it will be approved and merged!
All PRs start as draft. If you open a non-draft PR, it will be automatically converted to draft.
Step 1: Mark PR as "Ready for Review"
.github/CODEOWNERS.Final Review might get declined if these requirements are not fulfilled.
Step 2: Final Review
For PRs that change
megatron/core, once all expert reviewers have approved, theFinal Reviewlabel is applied automatically and final reviewers are assigned.For PRs outside
megatron/core, this step is skipped.Step 3: Approved
Once all required reviewers have approved, the
Approvedlabel is applied automatically.Merge
Any member of mcore-engineers will be able to merge your PR.
For MRs into `dev` branch
The proposed review process for `dev` branch is under active discussion.MRs are mergable after one approval by either
eharper@nvidia.comorzijiey@nvidia.com.