fix(tests): initialize num_microbatches calculator in vision cudagraph tests #4986
Merged
ko3n1g merged 1 commit on May 26, 2026
Conversation
…h tests

Closes NVIDIA#3802.

`TestVisionTECudaGraphHelper._make_helper` constructs a `VisionTECudaGraphHelper`, and `create_cudagraphs()` then calls into `cuda_graphs.py`'s `get_make_graphed_callables_kwargs`, which calls `get_num_microbatches()`. With the global calculator never initialized in this unit test, `_GLOBAL_NUM_MICROBATCHES_CALCULATOR` is `None` and the call fails with `AttributeError: 'NoneType' object has no attribute 'get'`.

Initialize the global calculator inside `_make_helper` with the requested `num_microbatches`, and destroy it in `teardown_method` (and again on the next `_make_helper` call) so tests are hermetic. This mirrors the canonical pattern in `tests/unit_tests/transformer/test_cuda_graphs.py`.

With the real bug fixed, drop the `@pytest.mark.flaky` / `@pytest.mark.flaky_in_dev` masking from `test_create_and_delete_cudagraphs` and `test_create_cudagraphs_multi_microbatch` so they run in CI again.

Signed-off-by: oliver könig <okoenig@nvidia.com>
Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.
Contributor Author
/ok to test
🔄 Merge queue validation started! You can track the progress here: https://github.com/NVIDIA/Megatron-LM/actions/runs/26461817940
🔄 Merge queue validation started! You can track the progress here: https://github.com/NVIDIA/Megatron-LM/actions/runs/26464000672
Victarry added a commit to yanring/Megatron-LM that referenced this pull request on May 27, 2026
* origin/main: (50 commits)
  Drain predecessor reduce-scatter at dispatch time (NVIDIA#4940)
  ci: Add allow_failure flag to gpt and moe recipes that are failing in nightlies (NVIDIA#4905)
  fix(tests): initialize num_microbatches calculator in vision cudagraph tests (NVIDIA#4986)
  test: re-enable test_pp2_create_cudagraphs_first_stage on TE 2.15+ (NVIDIA#4985)
  ci: Add support for MBridge job gating based on PR labels (NVIDIA#4926)
  test(ci): re-enable 8experts2parallel_multi_dist_optimizer_instances_1node (NVIDIA#4984)
  test: re-enable paged stashing MoE tests (NVIDIA#4978)
  Fix elastification unwrap_model import (NVIDIA#4972)
  Avoid offsetting functional test master port (NVIDIA#4973)
  test: enable NVTE_CUTEDSL_FUSED_GROUPED_MLP via pytest fixture (NVIDIA#4931)
  chore(beep boop 🤖): Bump (main) (2026-05-25)
  test(release): add release goldens for deepseekv3/nemotron3 and set tp2pp2 exit-interval (NVIDIA#4932)
  Fix `get_batch` return order to ignore BlendedDataset provenance fields (NVIDIA#4952)
  ci: restore perf test torchrun logs (NVIDIA#4951)
  Various training utils (NVIDIA#4872)
  ci: Update training script paths in BERT and T5 (NVIDIA#4939)
  [MXFP8/FP4-param-gather] Post processing after forced param AG in eval (NVIDIA#4562)
  Fix mxfp8 param gather numerical issue when DP overlap is off (NVIDIA#4800)
  Add TEFusedDenseMLP for Dense+Grouped GEMM fusion on SM100+ (NVIDIA#4318) (NVIDIA#4786)
  Fix paged stashing test submodules lookup (NVIDIA#4925)
  ...
# Conflicts:
# megatron/training/training.py
janEbert pushed a commit to janEbert/Megatron-LM that referenced this pull request on Jun 2, 2026
…h tests (NVIDIA#4986) Signed-off-by: oliver könig <okoenig@nvidia.com>
mathemakitten pushed a commit to mathemakitten/Megatron-LM that referenced this pull request on Jun 12, 2026
…h tests (NVIDIA#4986) Signed-off-by: oliver könig <okoenig@nvidia.com>
Claude summary
Fixes #3802 (and closes its duplicate #3803, which was already closed manually).
Bug
After the TE 2.13 bump (#3800), the vision-encoder CUDA graph unit tests started failing with `AttributeError: 'NoneType' object has no attribute 'get'`.
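The failure mode can be reproduced in isolation. The sketch below is a stand-in, not Megatron's code: it assumes a module-level `_GLOBAL_NUM_MICROBATCHES_CALCULATOR` that is left as `None` and a `get_num_microbatches()` that simply calls `.get()` on it. The `reproduce_failure` helper is hypothetical and exists only for illustration.

```python
# Stand-in for the module-level global; deliberately never initialized,
# which is the state the failing unit test left it in.
_GLOBAL_NUM_MICROBATCHES_CALCULATOR = None


def get_num_microbatches():
    # Assumed behavior: delegate to the global calculator's get().
    return _GLOBAL_NUM_MICROBATCHES_CALCULATOR.get()


def reproduce_failure():
    """Return the error message raised when the calculator is uninitialized."""
    try:
        get_num_microbatches()
    except AttributeError as exc:
        return str(exc)
    return None


if __name__ == "__main__":
    print(reproduce_failure())
```

Any code path that reaches `get_num_microbatches()` before the calculator is set up would hit the same error, regardless of how deep in the call stack it sits.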
The call chain is:
TestVisionTECudaGraphHelper.test_create_and_delete_cudagraphs →
VisionTECudaGraphHelper.create_cudagraphs() →
TECudaGraphHelper.create_cudagraphs() →
_get_cuda_graph_input_data() →
get_make_graphed_callables_kwargs() (defined inline) →
get_num_microbatches() (megatron/core/transformer/cuda_graphs.py:2255)

`get_num_microbatches()` does `_GLOBAL_NUM_MICROBATCHES_CALCULATOR.get()`, but the global calculator was never initialized in this unit test (only the model-parallel state is initialized in `setup_method`), so the global is `None`.

The follow-up `ci: Skip more tests in test_vision_cuda_graphs for LTS` (#3860) merely marked the two tests `@pytest.mark.flaky` / `@pytest.mark.flaky_in_dev`, masking the failure rather than fixing it. The PP2 variant tracked under #3804 is a separate hang and is left untouched here.
Fix
In `tests/unit_tests/transformer/test_vision_cuda_graphs.py`:
- Initialize the global `num_microbatches` calculator inside `TestVisionTECudaGraphHelper._make_helper` with the requested `num_microbatches`, matching the canonical pattern used in `tests/unit_tests/transformer/test_cuda_graphs.py`.
- Destroy it in `teardown_method` (and on each new `_make_helper` call) so the tests remain hermetic.
- Drop the `@pytest.mark.flaky` / `@pytest.mark.flaky_in_dev` markers from `test_create_and_delete_cudagraphs` and `test_create_cudagraphs_multi_microbatch` so CI exercises the real code path again.

Minimal example of the fix shape:
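A sketch of that shape, written against simplified stand-ins so it runs without Megatron or a GPU. Everything here is an assumption for illustration: the `_ConstantCalculator` class, the `VisionHelperTestSketch` class, and the one-argument `init_num_microbatches_calculator` are hypothetical, and the real helpers in Megatron likely take more parameters than shown. Only the init-in-`_make_helper`, destroy-in-`teardown_method` pattern is taken from the description above.

```python
class _ConstantCalculator:
    """Hypothetical stand-in that always reports a fixed microbatch count."""

    def __init__(self, num_microbatches):
        self._num_microbatches = num_microbatches

    def get(self):
        return self._num_microbatches


# Stand-in for the module-level global calculator.
_GLOBAL_NUM_MICROBATCHES_CALCULATOR = None


def init_num_microbatches_calculator(num_microbatches):
    # Simplified signature (assumption); refuses to overwrite existing state.
    global _GLOBAL_NUM_MICROBATCHES_CALCULATOR
    assert _GLOBAL_NUM_MICROBATCHES_CALCULATOR is None, "already initialized"
    _GLOBAL_NUM_MICROBATCHES_CALCULATOR = _ConstantCalculator(num_microbatches)


def destroy_num_microbatches_calculator():
    # Safe to call repeatedly, including when nothing was initialized.
    global _GLOBAL_NUM_MICROBATCHES_CALCULATOR
    _GLOBAL_NUM_MICROBATCHES_CALCULATOR = None


def get_num_microbatches():
    return _GLOBAL_NUM_MICROBATCHES_CALCULATOR.get()


class VisionHelperTestSketch:
    """Illustrative test class; not the real TestVisionTECudaGraphHelper."""

    def teardown_method(self, method=None):
        # Leave no global state behind for the next test.
        destroy_num_microbatches_calculator()

    def _make_helper(self, num_microbatches=1):
        # Drop any calculator left by a previous _make_helper call,
        # then initialize one with the requested count.
        destroy_num_microbatches_calculator()
        init_num_microbatches_calculator(num_microbatches)
        # The real helper construction is elided; return what the
        # downstream code would observe.
        return {"num_microbatches": get_num_microbatches()}
```

Destroying before initializing inside `_make_helper` is what lets a single test build several helpers with different microbatch counts without tripping the "already initialized" guard.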
Scope
PR scope is intentionally limited to the PP=1 `TestVisionTECudaGraphHelper`. The `TestVisionTECudaGraphHelperPP2` variant has the same latent uninitialized-calculator issue but also a separate hang tracked in #3804; it will be addressed when #3804 is fixed.