Introduce GDN to Mamba#3535
Conversation
|
/ok to test acc49e7 |
|
/ok to test 2fb3d8d |
|
/ok to test a2cfc0e |
|
Looking good. Please merge after 3377, which is very close to being merged. |
The merge auto-resolved mamba_block.py but kept the old layer_number=i+1 style in the GDN case. Update to use layer_number (which includes pp_layer_offset), add_layer_offset=False, and pp_layer_offset to match the ATTENTION case. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
/ok to test 6b9a8f7 |
There was a problem hiding this comment.
These are the things that stand-out from the review of your latest changes. I get the sense that there are other things that will need to be added to fully support GDN, such as the calculation of flops in training.py
Just realized I had some comments that were pending from yesterday that I forgot to submit. Submitted with this review cycle.
| layer_number=layer_number, | ||
| pg_collection=pg_collection, | ||
| is_mtp_layer=is_mtp_layer, | ||
| add_layer_offset=False, |
There was a problem hiding this comment.
Looking at the current GDN implementation in main, it doesn't seem to add its own pp_layer_offset like the attention layers do. I wonder if this is because it doesn't currently support pipeline parallel via the GPTModel. If it's not adding its own pp_layer_offset, then it doesn't need to be told not to when the HLModel passes a layer_number containing the pp_layer_offset it already calculated.
There was a problem hiding this comment.
pp_layer_offset is not needed for GDN (for now).
There was a problem hiding this comment.
This code should be moved up under mamba.
Sounds like add_layer_offset is not needed, because the GDN layer does not currently add its own pp_layer_offset. So perhaps add_layer_offset should not be included here.
There was a problem hiding this comment.
I believe we should be setting add_layer_offset=False so that the TransformerLayer does not calculate additional layers (see this code).
|
/ok to test bd9317e |
|
Can we add a functional test or two for GDN? Or do we need more to do an end-to-end run? |
|
🔄 Merge queue validation started! You can track the progress here: https://github.com/NVIDIA/Megatron-LM/actions/runs/23365696949 |
|
/ok to test 5a5e206 |
|
🔄 Merge queue validation started! You can track the progress here: https://github.com/NVIDIA/Megatron-LM/actions/runs/23414703144 |
|
/ok to test cb665a7 |
|
🔄 Merge queue validation started! You can track the progress here: https://github.com/NVIDIA/Megatron-LM/actions/runs/23423090252 |
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
What does this PR do ?
Adding Gated Delta Net (GDN) feature to MambaModel. GDN was added to GPTModel in #1989.
Adding unit tests. Also, tested end-to-end model using this Megatron-Bridge PR: NVIDIA-NeMo/Megatron-Bridge#2520.
wandb: Baseline (w/ GPTModel) vs MambaModel
Contribution process
flowchart LR A[Pre-checks] --> B[PR Tests] subgraph Code Review/Approval C1[Expert Review] --> C2[Final Review] end B --> C1 C2 --> D[Merge]Pre-checks
Core 0.8)Code review
The following process is enforced via the CODEOWNERS file for changes into
megatron/core. For changes outside ofmegatron/core, it is up to the PR author whether or not to tag the Final Reviewer team.For MRs into `main` branch
Feel free to message or comment the @mcore-oncall to help accelerate your merge into main. The less complex your PR is, the faster it will be approved and merged!
(Step 1): Add PR label
Expert Review(Step 2): Collect the expert reviewers reviews
Expert Reviewlabel when your PR is ready for review.Final Review might get declined if these requirements are not fulfilled.
(Step 3): Final Review
Final Reviewlabel(Optional Step 4): Cherry-pick into release branch
If this PR also needs to be merged into
core_r*release branches, after this PR has been merged, selectCherry-pickto open a new PR into the release branch.For MRs into `dev` branch
The proposed review process for `dev` branch is under active discussion.MRs are mergable after one approval by either
eharper@nvidia.comorzijiey@nvidia.com.Merging your PR
Any member of core-adlr and
core-nemowill be able to merge your PR.