Fix cutlass moe accuracy drop caused by attention UB from DP padding mode#10414
zhyncs merged 5 commits into sgl-project:main
Conversation
(cherry picked from commit 0f6f939)
Summary of Changes
Hello @fzyzcjy, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request introduces a temporary workaround to address an observed accuracy drop, specifically within the DeepseekV2 model. The changes involve tracking the original number of tokens before padding and then explicitly zeroing out the padded sections of intermediate tensors (hidden_states and residual) during the model's forward pass in 'extend' mode. This is a diagnostic or short-term fix to isolate or mitigate potential issues related to padding effects on model accuracy, indicated by the [DO NOT MERGE] tag and hack_ prefixes.
Highlights
- **Temporary Field Addition:** A new attribute, `hack_num_tokens_before_pad`, has been added to the `ForwardBatch` class in `forward_batch_info.py` to temporarily store the count of actual tokens before any padding is applied.
- **Padding Length Tracking:** The `prepare_mlp_sync_batch` method now records the original length of `input_ids` into the `hack_num_tokens_before_pad` field right before the padding operation.
- **Padded Tensor Zeroing:** A helper function `hack_setpad` was introduced in `deepseek_v2.py` to explicitly zero out the padded portions of the `hidden_states` and `residual` tensors. This function is conditionally applied at various stages within the `forward` method of `DeepseekV2ForCausalLM` when the `forward_mode` is set to 'extend'.
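The zeroing helper described above can be sketched as follows. This is a minimal illustration only: the names `hack_setpad`, `hidden_states`, `residual`, and the token count come from the PR summary, but the exact signature in `deepseek_v2.py` may differ, and NumPy stands in for torch here.

```python
import numpy as np

def hack_setpad(hidden_states, residual, num_tokens_before_pad):
    """Zero out the padded rows in-place: rows at index
    num_tokens_before_pad and beyond were added by DP padding
    and carry no real token data."""
    hidden_states[num_tokens_before_pad:] = 0
    if residual is not None:
        residual[num_tokens_before_pad:] = 0

# Example: 4 real tokens padded up to 6 rows, hidden size 8.
h = np.random.randn(6, 8)
r = np.random.randn(6, 8)
hack_setpad(h, r, num_tokens_before_pad=4)
print(np.all(h[4:] == 0), np.all(r[4:] == 0))  # True True
```

The zeroing is applied in-place so the tensors keep their padded shape (which downstream kernels may require) while the padded tail no longer contributes garbage values to subsequent layers.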
Code Review
This pull request introduces a temporary workaround to address an accuracy drop by zeroing out padding in hidden states and residuals for the Deepseek V2 model during the extend phase. While the fix seems correct, the implementation introduces some code quality issues. Specifically, the use of 'hack' in variable and function names should be avoided, and there is significant code duplication in deepseek_v2.py that should be refactored for better maintainability. My review includes suggestions to address these points.
This reverts commit 4fd6177.
(cherry picked from commit 975beb7)
Motivation
Bug description
Direct cause
Root cause
Potential effects
Accuracy test and issue that this PR fixes
Old text:
I will do more experiments when I have time to see what is going on and provide a better fix. Currently this is just a temporary workaround to check whether the accuracy issue is caused by this.
Ongoing work is at feat/hack_ac6131.
EDIT: A better workaround is as follows. It seems the attention metadata does not account for padding, while we do pass padded input to it.
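The bookkeeping side of the workaround, recording the real token count before the DP padding step, can be sketched as below. The field name `hack_num_tokens_before_pad` and the method name `prepare_mlp_sync_batch` come from the PR; the `Batch` stand-in and the `pad_to_multiple` helper are hypothetical simplifications of the real `ForwardBatch` logic.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Batch:
    """Hypothetical stand-in for the real ForwardBatch."""
    input_ids: List[int]
    hack_num_tokens_before_pad: int = 0

def pad_to_multiple(ids, multiple, pad_id=0):
    """Hypothetical DP-padding helper: pad token ids up to a multiple."""
    rem = (-len(ids)) % multiple
    return ids + [pad_id] * rem

def prepare_mlp_sync_batch(batch: Batch, multiple: int = 4):
    # Record the real token count *before* padding so later stages
    # (e.g. the zeroing helper) know where the padded tail starts.
    batch.hack_num_tokens_before_pad = len(batch.input_ids)
    batch.input_ids = pad_to_multiple(batch.input_ids, multiple)

b = Batch(input_ids=[11, 12, 13, 14, 15])
prepare_mlp_sync_batch(b, multiple=4)
print(b.hack_num_tokens_before_pad, len(b.input_ids))  # 5 8
```

Capturing the length before mutating `input_ids` is the key point: once padding has been applied, the original token count is no longer recoverable from the batch itself.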
EDIT: Maybe this; I will cherry-pick after testing e2e long benchmarks.
Modifications
Accuracy Tests
Benchmarking and Profiling
Checklist