test: re-enable paged stashing MoE tests by ko3n1g · Pull Request #4978 · NVIDIA/Megatron-LM

ko3n1g · 2026-05-26T07:22:55Z

Claude summary

Removes the @pytest.mark.flaky_in_dev markers from both paged-stashing MoE tests and replaces them with an explicit skipif that requires SM100 (Blackwell), so the tests run on Blackwell CI and skip cleanly on H100/A100 instead of failing with a Transformer Engine (TE) runtime error.

Why the original markers were wrong

flaky_in_dev was added in #4931 to keep CI green, but the failures were not flaky: they were deterministic, caused by a missing hardware gate. Both classes configure fp8_recipe='mxfp8'; TE rejects this on devices below compute capability 10.0 with:

RuntimeError: Device compute capability 10.0 or higher required for MXFP8 execution.
  at .../transformer_engine/pytorch/quantization.py:143

H100 is SM90 (compute capability 9.0), so the tests could never have passed on H100 CI. The previous flaky_in_dev marker hid this fact in the dev environment.
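The hardware gate described above can be sketched without a GPU. This is a minimal illustration, not code from the PR: `mxfp8_supported_for` and `MXFP8_MIN_CAPABILITY` are hypothetical names, and the capability tuples are the published values for each architecture (A100 is SM80, H100 is SM90, Blackwell is SM100). Comparing the full `(major, minor)` tuple against `(10, 0)` gives the same answer as the PR's check on the major version alone.

```python
from typing import Optional, Tuple

# Assumption for illustration: the minimum capability quoted in the TE error message.
MXFP8_MIN_CAPABILITY: Tuple[int, int] = (10, 0)


def mxfp8_supported_for(capability: Optional[Tuple[int, int]]) -> bool:
    """Return True if a device with this (major, minor) capability can run MXFP8.

    `None` stands for "no CUDA device available".
    """
    if capability is None:
        return False
    return tuple(capability) >= MXFP8_MIN_CAPABILITY


# Walk through the devices mentioned in the PR description.
for name, capability in [
    ("A100 (SM80)", (8, 0)),
    ("H100 (SM90)", (9, 0)),
    ("Blackwell (SM100)", (10, 0)),
    ("no GPU", None),
]:
    verdict = "run" if mxfp8_supported_for(capability) else "skip"
    print(f"{name}: {verdict}")
```

Under this check, every device below SM100 is skipped rather than left to fail inside TE at runtime.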

Change

def _is_mxfp8_supported() -> bool:
    """MXFP8 quantization in TE requires compute capability >= 10.0 (Blackwell)."""
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability()[0] >= 10


@pytest.mark.skipif(not _is_mxfp8_supported(), reason=_MXFP8_SKIP_REASON)
@pytest.mark.skipif(not _te_grouped_mlp_op_fuser_environment_supported(), ...)
@pytest.mark.skipif(not is_hybrid_ep_available(), reason="Hybrid EP are not available")
class TestPagedStashing: ...

@pytest.mark.skipif(not _is_mxfp8_supported(), reason=_MXFP8_SKIP_REASON)
@pytest.mark.skipif(not _te_grouped_mlp_op_fuser_environment_supported(), ...)
@pytest.mark.skipif(not is_hybrid_ep_available(), reason="Hybrid EP are not available")
class TestPagedStashingOverBudget: ...

The same pattern is already used in tests/unit_tests/distributed/megatron_fsdp/test_mcore_fully_sharded_data_parallel.py:983 and tests/unit_tests/inference/test_mxfp8_utils.py:349.

Affected tests

tests/unit_tests/transformer/moe/test_paged_stashing.py::TestPagedStashing::test_forward_backward_4_layers
tests/unit_tests/transformer/moe/test_paged_stashing.py::TestPagedStashingOverBudget::test_overload_factor_and_over_budget

Closes #4935.

Remove flaky_in_dev markers from TestPagedStashing.test_forward_backward_4_layers and TestPagedStashingOverBudget.test_overload_factor_and_over_budget. The underlying AttributeError on transformer_layer_spec.submodules.mlp.submodules was fixed in NVIDIA#4925 by routing through get_submodules(); the skip markers added in NVIDIA#4931 are no longer needed.

Closes NVIDIA#4935

Signed-off-by: oliver könig <okoenig@nvidia.com>

copy-pr-bot · 2026-05-26T07:23:00Z

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

ko3n1g · 2026-05-26T07:23:01Z

/ok to test b616990

The tests configure fp8_recipe='mxfp8', which TE rejects on devices below compute capability 10.0 with 'Device compute capability 10.0 or higher required for MXFP8 execution'. The previous flaky_in_dev marker was masking this hardware incompatibility on H100 CI. Replace it with an explicit SM100 skipif so the tests actually run on Blackwell and skip cleanly elsewhere.

Signed-off-by: oliver könig <okoenig@nvidia.com>

ko3n1g · 2026-05-26T07:41:27Z

/ok to test 05ae563

svcnvidia-nemo-ci · 2026-05-26T13:49:49Z

🔄 Merge queue validation started!

You can track the progress here: https://github.com/NVIDIA/Megatron-LM/actions/runs/26452178663

Signed-off-by: oliver könig <okoenig@nvidia.com>

* origin/main: (50 commits)
  Drain predecessor reduce-scatter at dispatch time (NVIDIA#4940)
  ci: Add allow_failure flag to gpt and moe recipes that are failing in nightlies (NVIDIA#4905)
  fix(tests): initialize num_microbatches calculator in vision cudagraph tests (NVIDIA#4986)
  test: re-enable test_pp2_create_cudagraphs_first_stage on TE 2.15+ (NVIDIA#4985)
  ci: Add support for MBridge job gating based on PR labels (NVIDIA#4926)
  test(ci): re-enable 8experts2parallel_multi_dist_optimizer_instances_1node (NVIDIA#4984)
  test: re-enable paged stashing MoE tests (NVIDIA#4978)
  Fix elastification unwrap_model import (NVIDIA#4972)
  Avoid offsetting functional test master port (NVIDIA#4973)
  test: enable NVTE_CUTEDSL_FUSED_GROUPED_MLP via pytest fixture (NVIDIA#4931)
  chore(beep boop 🤖): Bump (main) (2026-05-25)
  test(release): add release goldens for deepseekv3/nemotron3 and set tp2pp2 exit-interval (NVIDIA#4932)
  Fix `get_batch` return order to ignore BlendedDataset provenance fields (NVIDIA#4952)
  ci: restore perf test torchrun logs (NVIDIA#4951)
  Various training utils (NVIDIA#4872)
  ci: Update training script paths in BERT and T5 (NVIDIA#4939)
  [MXFP8/FP4-param-gather] Post processing after forced param AG in eval (NVIDIA#4562)
  Fix mxfp8 param gather numerical issue when DP overlap is off (NVIDIA#4800)
  Add TEFusedDenseMLP for Dense+Grouped GEMM fusion on SM100+ (NVIDIA#4318) (NVIDIA#4786)
  Fix paged stashing test submodules lookup (NVIDIA#4925)
  ...

# Conflicts:
#	megatron/training/training.py

Signed-off-by: oliver könig <okoenig@nvidia.com>

ko3n1g added the Run tests label May 26, 2026

copy-pr-bot Bot temporarily deployed to public May 26, 2026 07:23 Inactive

copy-pr-bot Bot temporarily deployed to test May 26, 2026 07:24 Inactive

copy-pr-bot Bot temporarily deployed to public May 26, 2026 07:26 Inactive

copy-pr-bot Bot temporarily deployed to public May 26, 2026 07:27 Inactive

copy-pr-bot Bot temporarily deployed to public May 26, 2026 07:34 Inactive

copy-pr-bot Bot temporarily deployed to public May 26, 2026 07:42 Inactive

copy-pr-bot Bot temporarily deployed to test May 26, 2026 07:42 Inactive

ko3n1g requested a review from nanz-nv May 26, 2026 07:43

copy-pr-bot Bot temporarily deployed to public May 26, 2026 07:45 Inactive

copy-pr-bot Bot temporarily deployed to public May 26, 2026 07:46 Inactive

copy-pr-bot Bot temporarily deployed to public May 26, 2026 07:53 Inactive

ko3n1g marked this pull request as ready for review May 26, 2026 07:56

svcnvidia-nemo-ci requested a review from a team May 26, 2026 07:57

svcnvidia-nemo-ci added the complexity: low label May 26, 2026

ko3n1g enabled auto-merge May 26, 2026 10:02

This was referenced May 26, 2026

🐛 CI failure: test_paged_stashing.py::TestPagedStashingOverBudget::test_overload_factor_and_over_budget #4345

Open

🐛 CI failure: test_paged_stashing.py::TestPagedStashing::test_forward_backward_4_layers #4339

Open

thomasdhc approved these changes May 26, 2026

ko3n1g added this pull request to the merge queue May 26, 2026

Merged via the queue into NVIDIA:main with commit 432d76b May 26, 2026
240 of 242 checks passed

ko3n1g deleted the ko3n1g/fix/paged-stashing-moe-submodules branch May 26, 2026 14:27

santhnm2 pushed a commit to santhnm2/Megatron-LM that referenced this pull request May 26, 2026

test: re-enable paged stashing MoE tests (NVIDIA#4978)

f4e84c2

Signed-off-by: oliver könig <okoenig@nvidia.com>

janEbert pushed a commit to janEbert/Megatron-LM that referenced this pull request Jun 2, 2026

test: re-enable paged stashing MoE tests (NVIDIA#4978)

1a7f8f7

Signed-off-by: oliver könig <okoenig@nvidia.com>

mathemakitten pushed a commit to mathemakitten/Megatron-LM that referenced this pull request Jun 12, 2026

test: re-enable paged stashing MoE tests (NVIDIA#4978)

2737ffe

Signed-off-by: oliver könig <okoenig@nvidia.com>

ko3n1g merged 2 commits into NVIDIA:main from ko3n1g:ko3n1g/fix/paged-stashing-moe-submodules


3 participants
