Fix MLA dynamic inference decode flag by cuichenx · Pull Request #4902 · NVIDIA/Megatron-LM

cuichenx · 2026-05-20T20:54:21Z

Summary

Fix the dynamic inference MLA path to pass inference_context.is_decode_only() into flash_decode_and_prefill().

The regular attention dynamic path already passes this argument, but MultiLatentAttention omitted it after the flash_decode_and_prefill() signature gained is_decode_only.
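The failure mode can be illustrated with a minimal, self-contained sketch. The names `flash_decode_and_prefill_stub` and `FakeInferenceContext`, and the shortened parameter list, are hypothetical stand-ins and not the real Megatron-LM signature; only the trailing required `is_decode_only` parameter mirrors what this PR describes.

```python
class FakeInferenceContext:
    """Hypothetical stand-in for the dynamic inference context."""

    def __init__(self, decode_only: bool):
        self._decode_only = decode_only

    def is_decode_only(self) -> bool:
        return self._decode_only


def flash_decode_and_prefill_stub(q, k, v, is_decode_only):
    """Stand-in kernel entry point (assumed, simplified parameter list).

    Only the trailing required `is_decode_only` parameter matters here.
    """
    return "decode" if is_decode_only else "prefill"


def mla_call_before_fix(q, k, v, inference_context):
    # Omits the newly required argument, so Python raises TypeError.
    return flash_decode_and_prefill_stub(q, k, v)


def mla_call_after_fix(q, k, v, inference_context):
    # Forwards the decode-only flag, matching the regular attention path.
    return flash_decode_and_prefill_stub(
        q, k, v, inference_context.is_decode_only()
    )


ctx = FakeInferenceContext(decode_only=True)

try:
    mla_call_before_fix(None, None, None, ctx)
    before_error = None
except TypeError as exc:
    before_error = exc

after_result = mla_call_after_fix(None, None, None, ctx)
```

In this sketch the pre-fix call fails at call time rather than at import time, which is consistent with the bug only surfacing once the MLA dynamic inference path is actually exercised.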

Fixes #4901.

Testing

python -m py_compile megatron/core/transformer/multi_latent_attention.py
git diff --check

Runtime validation for MLA dynamic inference with cached MLA latents is pending.
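As a rough illustration of why runtime validation still matters: `py_compile` only checks that a file parses and compiles to bytecode, so it does not catch a call that passes too few arguments. The sketch below uses a throwaway module with hypothetical `kernel` and `caller` functions (not Megatron-LM code) to show a file that compiles cleanly yet fails when called.

```python
import pathlib
import py_compile
import tempfile

# Hypothetical module: `caller` omits a required argument of `kernel`.
SOURCE = '''
def kernel(q, k, v, is_decode_only):
    return is_decode_only

def caller(q, k, v):
    return kernel(q, k, v)  # missing required argument
'''

with tempfile.TemporaryDirectory() as tmp:
    src_path = pathlib.Path(tmp) / "arity_bug.py"
    pyc_path = pathlib.Path(tmp) / "arity_bug.pyc"
    src_path.write_text(SOURCE)
    # doraise=True raises PyCompileError on syntax errors; none occur here.
    compiled = py_compile.compile(str(src_path), cfile=str(pyc_path), doraise=True)
    compiled_ok = compiled is not None

# Executing the same source and calling `caller` exposes the arity bug.
namespace = {}
exec(SOURCE, namespace)
try:
    namespace["caller"](1, 2, 3)
    runtime_error = None
except TypeError as exc:
    runtime_error = exc
```

A static type checker or a test that drives the call site would catch this class of error; a compile check alone does not.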

Signed-off-by: Chen Cui <chcui@nvidia.com>

copy-pr-bot · 2026-05-20T20:54:25Z

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

cuichenx · 2026-05-20T23:06:15Z

/claude strict-review

cuichenx · 2026-05-20T23:06:56Z

/ok to test aa8b233

claude

Strict review passed — no significant issues found. LGTM.

CRITICAL: 0 | IMPORTANT: 0 | SUGGESTION: 0

The fix is correct: flash_decode_and_prefill() (defined at attention.py:829) requires is_decode_only as its last positional argument. The regular Attention path already passes it (attention.py:1304), but the MultiLatentAttention path omitted it — causing a TypeError whenever cache_mla_latents was enabled with dynamic batching. This one-line addition aligns the MLA call site with both the function signature and the existing non-MLA call site.

Minimal risk — only enables a previously broken code path.

santhnm2 · 2026-05-20T23:11:16Z

@cuichenx Can we add a unit test for MLA which will exercise this code path?
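One possible shape for such a test, sketched with stand-ins rather than the real Megatron-LM classes: `unittest.mock.create_autospec` builds a mock that enforces the spec function's signature, so a call site that drops `is_decode_only` fails with a `TypeError` even though the kernel is mocked. `FakeMLA` and `flash_decode_and_prefill_stub` are hypothetical names, and the parameter list is an assumed simplification; a real test would need to construct `MultiLatentAttention` with a dynamic inference context.

```python
from unittest import mock


def flash_decode_and_prefill_stub(q, k, v, is_decode_only):
    """Stand-in with an assumed, simplified signature; never executed."""
    raise NotImplementedError


class FakeMLA:
    """Hypothetical stand-in for the MLA dynamic inference call site."""

    def __init__(self, kernel):
        self.kernel = kernel

    def forward(self, q, k, v, inference_context):
        return self.kernel(q, k, v, inference_context.is_decode_only())


def test_mla_forwards_is_decode_only():
    kernel = mock.create_autospec(
        flash_decode_and_prefill_stub, return_value="out"
    )
    ctx = mock.Mock()
    ctx.is_decode_only.return_value = True

    result = FakeMLA(kernel).forward("q", "k", "v", ctx)

    assert result == "out"
    kernel.assert_called_once_with("q", "k", "v", True)


def autospec_rejects_missing_argument() -> bool:
    # The autospecced mock checks arity, so the pre-fix call shape fails.
    kernel = mock.create_autospec(flash_decode_and_prefill_stub)
    try:
        kernel("q", "k", "v")
    except TypeError:
        return True
    return False
```

The design choice here is to mock the kernel with its signature preserved, so the test stays cheap (no GPU kernels) while still failing if the call site and the function signature drift apart again.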

Fix MLA dynamic inference decode flag

aa8b233

Signed-off-by: Chen Cui <chcui@nvidia.com>

cuichenx marked this pull request as ready for review May 20, 2026 23:05

cuichenx requested review from a team as code owners May 20, 2026 23:05

svcnvidia-nemo-ci requested a review from a team May 20, 2026 23:06

shanmugamr1992 approved these changes May 20, 2026

copy-pr-bot Bot temporarily deployed to public May 20, 2026 23:06 Inactive

cuichenx added the complexity: low label May 20, 2026

svcnvidia-nemo-ci removed the complexity: low label May 20, 2026

copy-pr-bot Bot temporarily deployed to test May 20, 2026 23:06 Inactive

svcnvidia-nemo-ci added the complexity: low label May 20, 2026

claude Bot approved these changes May 20, 2026

copy-pr-bot Bot temporarily deployed to public May 20, 2026 23:09 Inactive

copy-pr-bot Bot temporarily deployed to public May 20, 2026 23:10 Inactive

cuichenx mentioned this pull request May 20, 2026

[Inference] Add MCore inference examples and model wrappers NVIDIA-NeMo/Megatron-Bridge#3897

Merged

copy-pr-bot Bot temporarily deployed to public May 20, 2026 23:17 Inactive

cuichenx wants to merge 1 commit into NVIDIA:main from cuichenx:chcui/fix-mla-dynamic-inference-is-decode-only

4 participants