Fix DeepSeek PD decode DP + MTP failed by vectorized gather kernel in…#15457

Open
llc-kc wants to merge 2 commits into sgl-project:main from llc-kc:fix_mtp
Conversation

@llc-kc
Contributor

@llc-kc llc-kc commented Dec 19, 2025

Motivation

Fix DeepSeek PD with TP + DP + MTP, where decode fails with an index-out-of-bounds error in the vectorized gather kernel, as described in #15143 and #15399.

Modifications

Use `clone` so that output tensors stored in the CUDA graph `output_buffers` are not used elsewhere at the same time.
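
The general failure class can be illustrated with a toy, plain-Python sketch. It is not the SGLang code and does not claim to reproduce the actual root cause: `StaticBufferRunner`, `replay`, and `gather` are hypothetical names, and a Python list stands in for a tensor held in a reused output buffer. In PyTorch the equivalent of the copy would be `tensor.clone()`.

```python
import copy


class StaticBufferRunner:
    """Toy stand-in (hypothetical) for a graph runner that reuses one output buffer."""

    def __init__(self, size):
        self.output_buffer = [0] * size  # allocated once, reused on every replay

    def replay(self, values, clone_output=False):
        # Write results in place, as a replayed graph would.
        self.output_buffer[:] = values
        if clone_output:
            return copy.copy(self.output_buffer)  # caller gets a private snapshot
        return self.output_buffer  # caller aliases the shared buffer


def gather(table, indices):
    # Plain-Python gather; raises IndexError on an out-of-range index.
    return [table[i] for i in indices]


table = ["a", "b", "c"]
runner = StaticBufferRunner(size=3)

# Without a copy, the first result is silently overwritten by the second replay.
aliased = runner.replay([0, 1, 2])
runner.replay([5, 6, 7])
try:
    gather(table, aliased)
    aliased_failed = False
except IndexError:
    aliased_failed = True

# With a copy, the first result survives the second replay.
cloned = runner.replay([0, 1, 2], clone_output=True)
runner.replay([5, 6, 7])
cloned_result = gather(table, cloned)
```

The copy costs one extra allocation per replay, which is the trade-off for decoupling the consumer of the output from the next replay.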

Accuracy Tests

None

Benchmarking and Profiling

None

…dex out of bounds

Signed-off-by: liluchang <liluchang@kingsoft.com>

@yudian0504
Contributor

#16551

Possibly different fixes for the same issue.

@llc-kc
Contributor Author

llc-kc commented Jan 7, 2026

@yudian0504 Thanks. If your PR fixes this problem and gets merged, I will close this PR.

@yudian0504
Contributor

> @yudian0504 Thanks. If your PR fixes this problem and gets merged, I will close this PR.

This is not my PR 😂, but we are also hitting this bug, so we are following it as well.

@yudian0504
Contributor

Additionally, this issue does not seem to be exclusive to PD disaggregation: we have also encountered it in our non-PD setups, although the probability is quite low, which makes it difficult to reproduce.

@wcsjtu
Contributor

wcsjtu commented Jan 9, 2026

It works for me.

@yuan-luo
Collaborator

yuan-luo commented Jan 9, 2026

Agree with @yudian0504: this fix doesn't nail down the root cause. It is more of a workaround. We are still encountering this issue in our online service.

@hnyls2002
Collaborator

/tag-and-rerun-ci

@github-actions github-actions Bot added the run-ci label Jan 9, 2026
@llc-kc
Contributor Author

llc-kc commented Feb 4, 2026

@hnyls2002 Hi, is there any progress on this issue? Have any other PRs already fixed it?

5 participants