Describe the bug
Multiple ckpt-resume functional tests on dgx_gb200 are failing in nightly CI with the same fingerprint. The first run (train-from-scratch) passes its golden-value comparison, but the post-resume --is-second-run analyzer crashes because no second tensorboard event file is present in the expected directory.
Tag @NVIDIA/mcore-oncall to get oncall's attention to this issue.
Failing test cases observed across two recent nightly runs (all environment: dev, platforms: dgx_gb200):
tests/functional_tests/test_cases/gpt/gpt3_mcore_te_tp2_pp1_resume_torch_dist_multi_dist_optimizer_instances
tests/functional_tests/test_cases/moe/gpt3_mcore_te_tp2_pp1_resume_torch_dist_te_8experts2parallel_multi_dist_optimizer_instances
tests/functional_tests/test_cases/gpt/gpt3_mcore_te_tp4_pp1_resume_torch_dist_dist_optimizer_overlap_grad_reduce_param_gather
tests/functional_tests/test_cases/moe/gpt3_moe_mcore_te_tp4_ep2_etp2_pp2_resume_torch_dist_dist_optimizer
Common pattern: every failing case matches _resume_torch_dist_.*dist_optimizer.* (no underscore is required before dist_optimizer, because in two of the names it directly follows torch_dist_) and uses the distributed-optimizer + --ckpt-format torch_dist + (typically) --dist-ckpt-optim-fully-reshardable / --num-distributed-optimizer-instances 2 resume path. Tests that do not exercise the dist-optimizer reshardable-resume code path on the same hardware are unaffected.
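As a sanity check on the name pattern, the sketch below runs two candidate regexes over the four failing test-case names listed above. The names LOOSE, STRICT, and matching are illustrative only and do not come from the CI tooling. The stricter variant demands an extra underscore before dist_optimizer, so it cannot match names where dist_optimizer immediately follows torch_dist_.

```python
import re

# The four failing test-case names quoted in this report.
FAILING_CASES = [
    "gpt3_mcore_te_tp2_pp1_resume_torch_dist_multi_dist_optimizer_instances",
    "gpt3_mcore_te_tp2_pp1_resume_torch_dist_te_8experts2parallel_multi_dist_optimizer_instances",
    "gpt3_mcore_te_tp4_pp1_resume_torch_dist_dist_optimizer_overlap_grad_reduce_param_gather",
    "gpt3_moe_mcore_te_tp4_ep2_etp2_pp2_resume_torch_dist_dist_optimizer",
]

# Hypothetical pattern names, used only for this comparison.
LOOSE = re.compile(r"_resume_torch_dist_.*dist_optimizer.*")
STRICT = re.compile(r"_resume_torch_dist_.*_dist_optimizer.*")


def matching(pattern, names):
    """Return the names in which the pattern occurs anywhere."""
    return [name for name in names if pattern.search(name)]


if __name__ == "__main__":
    print("loose :", len(matching(LOOSE, FAILING_CASES)))
    print("strict:", len(matching(STRICT, FAILING_CASES)))
```

The looser pattern covers all four names; the stricter one covers only the two multi_dist_optimizer_instances cases, so it would under-select if used as a CI filter.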
Error
test_pretraining_regular_pipeline.py::test_regular_pipeline (1st-run goldens) PASSES in all four jobs:
INFO common.py:263 DETERMINISTIC test for metric lm loss: PASSED
INFO common.py:263 APPROXIMATE test for metric lm loss: PASSED
INFO common.py:263 DETERMINISTIC test for metric num-zeros: PASSED
INFO common.py:263 APPROXIMATE test for metric num-zeros: PASSED
PASSED tests/functional_tests/python_test_utils/test_pretraining_regular_pipeline.py::test_regular_pipeline
The job then fails when run_ci_test.sh re-invokes the analyzer with --is-second-run against the same $TENSORBOARD_PATH:
+ uv run --no-sync python /opt/megatron-lm/tests/functional_tests/python_test_utils/get_test_results_from_tensorboard_logs.py \
--logs-dir /jet/assets/basic/<test-case>/1/tensorboard \
--train-iters 100 \
--output-path /jet/assets/basic/<test-case>/golden_values_dev_dgx_gb200_2nd.json \
--is-second-run --is-normal-test --step-size 1
Traceback (most recent call last):
  File "/opt/megatron-lm/tests/functional_tests/python_test_utils/get_test_results_from_tensorboard_logs.py", line 82, in <module>
    collect_train_test_metrics()
  File "/opt/megatron-lm/tests/functional_tests/python_test_utils/get_test_results_from_tensorboard_logs.py", line 43, in collect_train_test_metrics
    summaries = common.read_tb_logs_as_list(
  File "/opt/megatron-lm/tests/functional_tests/python_test_utils/common.py", line 124, in read_tb_logs_as_list
    event_file = files[index]
                 ~~~~~^^^^^^^
IndexError: list index out of range
INFO:__main__:Pipeline terminated with status FAILED
get_test_results_from_tensorboard_logs.py passes index=1 when --is-second-run is set (line 45), expecting two events.out.tfevents.* files in the supplied --logs-dir. The IndexError therefore means that directory holds fewer than two event files; the artifacts show exactly one.
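The body of read_tb_logs_as_list is not reproduced in this report; only the line event_file = files[index] is known from the traceback. The sketch below is therefore an assumption about its general shape, not the actual implementation: a hypothetical helper pick_event_file that lists event files in a directory, sorts them by name, and raises a descriptive error instead of a bare IndexError when the requested index does not exist.

```python
from pathlib import Path


def pick_event_file(logs_dir, index):
    """Return the index-th event file in logs_dir, sorted by file name.

    Hypothetical helper for illustration; the real analyzer may order
    files differently (for example by modification time).
    `index` is assumed to be non-negative.
    """
    files = sorted(Path(logs_dir).glob("events.out.tfevents.*"))
    if index >= len(files):
        raise FileNotFoundError(
            f"expected at least {index + 1} event file(s) in {logs_dir}, "
            f"found {len(files)}"
        )
    return files[index]
```

With a check of this kind, the second-run analyzer would report how many event files it found and where it looked, which makes a misplaced event file easier to spot in the job log.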
Artifact evidence
Inspecting the failing job artifacts confirms the resume run produced an event file in a different sub-directory than the first run, instead of appending to the same one:
results/iteration=0/restart=0/assets/basic/<test-case>/1/tensorboard/events.out.tfevents.<ts1>... (~181 KB — first run)
results/iteration=0/restart=0/assets/basic/<test-case>/2/tensorboard/events.out.tfevents.<ts2>... (~181 KB — resume run)
run_ci_test.sh exports TENSORBOARD_PATH=$DIR/$i/$FILE once per outer-loop iteration $i and re-uses it for both RUN_NUMBER=1 and RUN_NUMBER=2, so both runs are expected to write into the same <dir>/<i>/tensorboard/. The artifact layout shows the resume run instead wrote into <dir>/<i+1>/tensorboard/, leaving exactly one event file in the directory the analyzer reads — hence files[index=1] fails. Both failing pipelines exhibit the same artifact layout.
The training itself completes (W&B logs from rank 7's stderr.log show iteration counts past 100 in the resume run), so the regression is in how --tensorboard-dir is honored on the resume invocation, not in dist-optimizer correctness.
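To confirm the artifact layout on a downloaded results tree, a small script along these lines may help. It is a sketch under assumptions: count_event_files is a hypothetical helper, and the root passed to it is whatever local directory holds the extracted job artifacts. It reports how many events.out.tfevents.* files sit in each directory, so a split across sibling tensorboard directories shows up as two entries with one file each rather than one entry with two.

```python
from collections import Counter
from pathlib import Path


def count_event_files(root):
    """Map each directory under root (relative, POSIX-style) to the
    number of events.out.tfevents.* files it directly contains."""
    root = Path(root)
    counts = Counter()
    for path in root.rglob("events.out.tfevents.*"):
        if path.is_file():
            counts[path.parent.relative_to(root).as_posix()] += 1
    return dict(counts)


if __name__ == "__main__":
    import sys

    # Usage: python count_events.py <path-to-extracted-artifacts>
    for directory, n in sorted(count_event_files(sys.argv[1]).items()):
        print(f"{n:3d}  {directory}")
```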
Steps/Code to reproduce bug
Inside the dev container, run any of the failing test cases as a ckpt-resume job on a 2-node × 4-GPU GB200 configuration. Minimum repro args (from the moe job's training cmdline):
ft_launcher --nproc_per_node 4 --nnodes 2 --master_addr <addr> --master_port <port> --node_rank 0 \
--max-restarts=3 pretrain_gpt.py \
--num-layers 12 --hidden-size 512 --num-attention-heads 8 \
--tensorboard-dir <SHARED_DIR>/1/tensorboard \
--micro-batch-size 4 --global-batch-size 32 --seq-length 1024 --max-position-embeddings 1024 \
--train-iters 100 --exit-interval 100 \
--save <SHARED_DIR>/checkpoints --load /tmp/checkpoints/ \
--transformer-impl transformer_engine \
--tensor-model-parallel-size 2 --pipeline-model-parallel-size 1 \
--use-distributed-optimizer --num-distributed-optimizer-instances 2 \
--overlap-grad-reduce --overlap-param-gather \
--deterministic-mode --no-gradient-accumulation-fusion --attention-softmax-in-fp32 \
--use-checkpoint-opt_param-scheduler --use-mcore-models \
--ckpt-format torch_dist --dist-ckpt-optim-fully-reshardable --dist-ckpt-strictness log_all \
--bf16 --async-save --use-persistent-ckpt-worker
Then re-run the same command with --load <SHARED_DIR>/checkpoints (i.e. RUN_NUMBER=2 from run_ci_test.sh) and check whether a second events.out.tfevents.* file lands in <SHARED_DIR>/1/tensorboard/ (expected) or in a sibling <SHARED_DIR>/2/tensorboard/ (observed).
Additional context
- All four failing jobs run on GB200, 2 nodes × 4 GPUs.
- The first run passes its test_regular_pipeline check on the recorded goldens, so the dist-optimizer + torch_dist save side is intact; the regression is on the resume + tensorboard-dir reuse path.
- Same failure signature reproduces across two independent recent nightly runs, indicating a deterministic regression rather than infrastructure flakiness.
- Suspect commits to investigate: recent changes to tests/functional_tests/shell_test_utils/run_ci_test.sh and to the megatron training-init path that constructs the tensorboard SummaryWriter on resume.
Triaged automatically via /create-issue.