[Fix] fix enable_deterministic_inference failed with enable_dp_attention #11023
GuoweiWangU wants to merge 3 commits into sgl-project:main
Conversation
Code Review
This pull request addresses an issue where deterministic inference was failing when DP attention is enabled. The fix correctly forces the DpPaddingMode to SUM_LEN in deterministic mode. The changes are logical and well-targeted. I've added a couple of suggestions to refactor the new conditional checks for improved conciseness and maintainability.
```diff
         cls, is_extend_in_batch, global_num_tokens: List[int]
     ) -> DpPaddingMode:
-        if is_extend_in_batch:
+        if _DETERMINISTIC_INFERENCE or is_extend_in_batch:
```
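For readers without the full file at hand, here is a minimal, self-contained sketch of the selection logic after this change. Only the changed conditional comes from the diff above; the enum values, the class wrapper, and the `MAX_LEN` fallback are simplified assumptions, not the actual sglang implementation.

```python
from enum import Enum, auto
from typing import List


class DpPaddingMode(Enum):
    # Pad the DP token dimension to the sum of lengths across DP ranks.
    SUM_LEN = auto()
    # Pad the DP token dimension to the maximum length across DP ranks.
    MAX_LEN = auto()


# Assumed to be set at startup from --enable-deterministic-inference.
_DETERMINISTIC_INFERENCE = True


class PaddingModeSelector:
    @classmethod
    def get_dp_padding_mode(
        cls, is_extend_in_batch: bool, global_num_tokens: List[int]
    ) -> DpPaddingMode:
        # After this PR, deterministic inference always takes the SUM_LEN branch,
        # so extend and decode batches are padded the same way.
        if _DETERMINISTIC_INFERENCE or is_extend_in_batch:
            return DpPaddingMode.SUM_LEN
        # Simplified stand-in for the original non-deterministic selection logic.
        return DpPaddingMode.MAX_LEN


# Even a pure decode batch now resolves to SUM_LEN under deterministic inference.
print(PaddingModeSelector.get_dp_padding_mode(False, [8, 16, 4]))
```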
Is there any reason for fixing the padding mode to `SUM_LEN` rather than `MAX_LEN`?
Given that `SUM_LEN` is already used when `is_extend_in_batch` is true, we choose `SUM_LEN` throughout the entire process for consistency. If `get_dp_padding_mode` and `get_default_mode_in_cuda_graph` are instead forced to `MAX_LEN`, the test results are still not deterministic:
```
Prompt 0 with prefix length 1: total samples: 314, Unique samples: 4
Prompt 1 with prefix length 511: total samples: 300, Unique samples: 13
Prompt 2 with prefix length 2048: total samples: 344, Unique samples: 5
Prompt 3 with prefix length 4097: total samples: 317, Unique samples: 25
```
@GuoweiWangU Also, can you provide the result of testing determinism with prefix mode? `python3 -m sglang.test.test_deterministic --test-mode prefix`
Results updated.
```diff
         # TODO(kkhuang-amd): noqa, temporary work-around for rocm 7.0.0 alpha
         # it can be safely removed later, once RCCL fixed
-        if _USE_ROCM700A_WA:
+        if _DETERMINISTIC_INFERENCE or _USE_ROCM700A_WA:
```
SUM_LEN is likely to hurt performance here. Could you double check if the change in this line is needed?
This PR is meant to fix deterministic inference. Here, the change ensures that `SUM_LEN` is used in all computation paths (the existing extend path already uses `SUM_LEN`). Some performance loss is unavoidable.
Yes, we can temporarily apply this strategy, and try to optimize its performance later. @ch-wan
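To make the performance concern concrete: under `SUM_LEN` every DP rank is padded to the total token count across all ranks, while `MAX_LEN` only pads to the largest rank's count. A toy comparison is sketched below; the function name and numbers are illustrative only and not taken from the sglang source.

```python
from typing import List


def padded_tokens_per_rank(global_num_tokens: List[int], mode: str) -> int:
    """Illustrative per-rank padded token count under each DP padding mode."""
    if mode == "SUM_LEN":
        # Every rank is padded to the total token count across all DP ranks.
        return sum(global_num_tokens)
    # MAX_LEN: every rank is padded only up to the largest rank's token count.
    return max(global_num_tokens)


# Hypothetical decode batch sizes on 4 DP ranks.
tokens = [5, 9, 3, 7]
print("SUM_LEN:", padded_tokens_per_rank(tokens, "SUM_LEN"))  # 24 tokens per rank
print("MAX_LEN:", padded_tokens_per_rank(tokens, "MAX_LEN"))  # 9 tokens per rank
```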
Motivation
`--enable-deterministic-inference` failed to get deterministic results when running MoE models with `--enable-dp-attention`.

`python3 -m sglang.test.test_deterministic --test-mode mixed`

Before this PR:
After fix:
`python3 -m sglang.test.test_deterministic --test-mode prefix`

In the current version, testing prefix mode with `enable_dp_attention` triggers a `400 Bad Request` error due to `flush_cache` in `send_prefix`, ultimately leading to a memory leak (`ValueError: token_to_kv_pool_allocator memory leak detected!`). The specific cause is unknown and unrelated to this pull request. The following results were obtained only after commenting out `flush_cache`.

Before this PR:
After fix:
Modifications
Fixed `DpPaddingMode` to `SUM_LEN` when launching with `--enable-deterministic-inference`.
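For context, here is a guess at how a module-level `_DETERMINISTIC_INFERENCE` flag like the one in the diffs above could be populated from the launch flag. The environment-variable name and the wiring below are assumptions for illustration only; this PR does not show how sglang actually propagates `--enable-deterministic-inference`.

```python
import os

# Hypothetical wiring (not from this PR): expose the launch-time
# --enable-deterministic-inference setting as a module-level flag so that
# padding-mode decisions can consult it without threading it through every call.
# The environment-variable name below is made up for illustration.
_DETERMINISTIC_INFERENCE = (
    os.environ.get("SGLANG_ENABLE_DETERMINISTIC_INFERENCE", "0").lower()
    in ("1", "true")
)
```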
Checklist