
test: re-enable paged stashing MoE tests #4978

Merged
ko3n1g merged 2 commits into NVIDIA:main from ko3n1g:ko3n1g/fix/paged-stashing-moe-submodules
May 26, 2026

Conversation

ko3n1g commented May 26, 2026

Contributor
Claude summary

Removes the @pytest.mark.flaky_in_dev markers from both paged-stashing MoE tests and replaces them with an explicit skipif that requires SM100 (Blackwell) or newer, so the tests run on Blackwell CI and skip cleanly on H100/A100 instead of failing with a TE runtime error.

Why the original markers were wrong

flaky_in_dev was added in #4931 to keep CI green, but the failures were not flaky — they were a missing hardware gate. Both classes configure fp8_recipe='mxfp8'; TE rejects this on devices below compute capability 10.0 with:

RuntimeError: Device compute capability 10.0 or higher required for MXFP8 execution.
  at .../transformer_engine/pytorch/quantization.py:143

H100 CI is SM90, so the tests could never have passed there. The previous flaky_in_dev marker hid this in the dev environment.

Change

def _is_mxfp8_supported() -> bool:
    """MXFP8 quantization in TE requires compute capability >= 10.0 (Blackwell)."""
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability()[0] >= 10


@pytest.mark.skipif(not _is_mxfp8_supported(), reason=_MXFP8_SKIP_REASON)
@pytest.mark.skipif(not _te_grouped_mlp_op_fuser_environment_supported(), ...)
@pytest.mark.skipif(not is_hybrid_ep_available(), reason="Hybrid EP are not available")
class TestPagedStashing: ...

@pytest.mark.skipif(not _is_mxfp8_supported(), reason=_MXFP8_SKIP_REASON)
@pytest.mark.skipif(not _te_grouped_mlp_op_fuser_environment_supported(), ...)
@pytest.mark.skipif(not is_hybrid_ep_available(), reason="Hybrid EP are not available")
class TestPagedStashingOverBudget: ...

The same pattern is already used in tests/unit_tests/distributed/megatron_fsdp/test_mcore_fully_sharded_data_parallel.py:983 and tests/unit_tests/inference/test_mxfp8_utils.py:349.
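To make the gate's behavior concrete, here is a minimal sketch of the same decision with the device capability passed in as a value, so it can be checked on a machine without a GPU. The names mxfp8_supported, DEVICES, and gate_outcomes are hypothetical and do not appear in the PR. The capability tuples are the publicly documented values for each device, and None stands in for "no CUDA device available".

```python
from typing import Dict, Optional, Tuple

Capability = Optional[Tuple[int, int]]


def mxfp8_supported(capability: Capability) -> bool:
    """Mirror the PR's check: the major compute capability must be >= 10.

    `capability` is a (major, minor) tuple, or None when no CUDA device
    is available. This helper is illustrative and is not part of the PR.
    """
    if capability is None:
        return False
    return capability[0] >= 10


# Illustrative device table (hypothetical name, documented capabilities).
DEVICES: Dict[str, Capability] = {
    "A100": (8, 0),
    "H100": (9, 0),
    "B200": (10, 0),
    "no GPU": None,
}


def gate_outcomes() -> Dict[str, str]:
    """Report whether the gated test classes would run or skip per device."""
    return {
        name: "run" if mxfp8_supported(cap) else "skip"
        for name, cap in DEVICES.items()
    }


if __name__ == "__main__":
    for name, outcome in gate_outcomes().items():
        print(f"{name}: {outcome}")
```

Under these assumptions only the SM100 entry resolves to "run"; every other entry skips with a reason instead of reaching the TE RuntimeError.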

Affected tests

  • tests/unit_tests/transformer/moe/test_paged_stashing.py::TestPagedStashing::test_forward_backward_4_layers
  • tests/unit_tests/transformer/moe/test_paged_stashing.py::TestPagedStashingOverBudget::test_overload_factor_and_over_budget

Closes #4935.

Remove flaky_in_dev markers from TestPagedStashing.test_forward_backward_4_layers
and TestPagedStashingOverBudget.test_overload_factor_and_over_budget. The
underlying AttributeError on transformer_layer_spec.submodules.mlp.submodules
was fixed in NVIDIA#4925 by routing through get_submodules(); the skip markers added
in NVIDIA#4931 are no longer needed.

Closes NVIDIA#4935

Signed-off-by: oliver könig <okoenig@nvidia.com>
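
For readers unfamiliar with the AttributeError mentioned in the commit message above, the following is a toy reproduction using only the standard library. It assumes the failure arises when a spec is wrapped in functools.partial and its .submodules attribute is read before the partial is called. ToySpec, build_mlp_spec, and resolve_submodules are hypothetical names for illustration; the actual get_submodules() in Megatron-LM may work differently.

```python
import functools
from dataclasses import dataclass, field
from typing import Callable, Dict, Union


@dataclass
class ToySpec:
    """Hypothetical stand-in for a module spec that carries submodules."""
    submodules: Dict[str, int] = field(default_factory=dict)


def build_mlp_spec(num_experts: int) -> ToySpec:
    """Hypothetical factory that builds the spec only when called."""
    return ToySpec(submodules={"experts": num_experts})


def resolve_submodules(spec: Union[ToySpec, Callable[[], ToySpec]]) -> Dict[str, int]:
    """Hypothetical helper: materialize a deferred spec, then read .submodules."""
    if isinstance(spec, functools.partial):
        spec = spec()  # call the partial to obtain the real spec object
    return spec.submodules


# A deferred spec: nothing is built until the partial is called.
deferred = functools.partial(build_mlp_spec, num_experts=8)

error_message = None
try:
    deferred.submodules  # direct attribute access on the partial fails
except AttributeError as exc:
    error_message = str(exc)

# Going through the helper handles both deferred and concrete specs.
resolved = resolve_submodules(deferred)
resolved_direct = resolve_submodules(ToySpec(submodules={"experts": 4}))
```

The point of the sketch is only that attribute access on a partial object fails, while a single accessor that first materializes the spec works for both deferred and concrete specs.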

copy-pr-bot Bot commented May 26, 2026


Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.


ko3n1g commented May 26, 2026

Contributor Author

/ok to test b616990

The tests configure fp8_recipe='mxfp8', which TE rejects on devices below
compute capability 10.0 with 'Device compute capability 10.0 or higher
required for MXFP8 execution'. The previous flaky_in_dev marker was
masking this hardware-incompatibility on H100 CI. Replace it with an
explicit SM100 skipif so the tests actually run on Blackwell and skip
cleanly elsewhere.

Signed-off-by: oliver könig <okoenig@nvidia.com>

ko3n1g commented May 26, 2026

Contributor Author

/ok to test 05ae563

@ko3n1g ko3n1g added this pull request to the merge queue May 26, 2026
@svcnvidia-nemo-ci


🔄 Merge queue validation started!

You can track the progress here: https://github.com/NVIDIA/Megatron-LM/actions/runs/26452178663

Merged via the queue into NVIDIA:main with commit 432d76b May 26, 2026
240 of 242 checks passed
@ko3n1g ko3n1g deleted the ko3n1g/fix/paged-stashing-moe-submodules branch May 26, 2026 14:27
santhnm2 pushed a commit to santhnm2/Megatron-LM that referenced this pull request May 26, 2026
Signed-off-by: oliver könig <okoenig@nvidia.com>
Victarry added a commit to yanring/Megatron-LM that referenced this pull request May 27, 2026
* origin/main: (50 commits)
  Drain predecessor reduce-scatter at dispatch time (NVIDIA#4940)
  ci: Add allow_failure flag to gpt and moe recipes that are failing in nightlies (NVIDIA#4905)
  fix(tests): initialize num_microbatches calculator in vision cudagraph tests (NVIDIA#4986)
  test: re-enable test_pp2_create_cudagraphs_first_stage on TE 2.15+ (NVIDIA#4985)
  ci: Add support for MBridge job gating based on PR labels  (NVIDIA#4926)
  test(ci): re-enable 8experts2parallel_multi_dist_optimizer_instances_1node (NVIDIA#4984)
  test: re-enable paged stashing MoE tests (NVIDIA#4978)
  Fix elastification unwrap_model import (NVIDIA#4972)
  Avoid offsetting functional test master port (NVIDIA#4973)
  test: enable NVTE_CUTEDSL_FUSED_GROUPED_MLP via pytest fixture (NVIDIA#4931)
  chore(beep boop 🤖): Bump  (main) (2026-05-25)
  test(release): add release goldens for deepseekv3/nemotron3 and set tp2pp2 exit-interval (NVIDIA#4932)
  Fix `get_batch` return order to ignore BlendedDataset provenance fields (NVIDIA#4952)
  ci: restore perf test torchrun logs (NVIDIA#4951)
  Various training utils (NVIDIA#4872)
  ci: Update training script paths in BERT and T5 (NVIDIA#4939)
  [MXFP8/FP4-param-gather] Post processing after forced param AG in eval (NVIDIA#4562)
  Fix mxfp8 param gather numerical issue when DP overlap is off (NVIDIA#4800)
  Add TEFusedDenseMLP for Dense+Grouped GEMM fusion on SM100+ (NVIDIA#4318) (NVIDIA#4786)
  Fix paged stashing test submodules lookup (NVIDIA#4925)
  ...

# Conflicts:
#	megatron/training/training.py
janEbert pushed a commit to janEbert/Megatron-LM that referenced this pull request Jun 2, 2026
Signed-off-by: oliver könig <okoenig@nvidia.com>
mathemakitten pushed a commit to mathemakitten/Megatron-LM that referenced this pull request Jun 12, 2026
Signed-off-by: oliver könig <okoenig@nvidia.com>

Development

Successfully merging this pull request may close these issues.

🐛 CI failure: 'functools.partial' has no attribute 'submodules' in test_paged_stashing

3 participants