🐛 CI failure (nightlies and MR runs): ckpt-resume tests on dgx_gb200 fail with IndexError when reading 2nd-run tensorboard events #4903

@balasaajay

Describe the bug

Multiple ckpt-resume functional tests on dgx_gb200 are failing in nightly CI with the same fingerprint. The first run (train-from-scratch) passes its golden-value comparison, but the post-resume --is-second-run analyzer crashes because no second tensorboard event file is present in the expected directory.

Tag @NVIDIA/mcore-oncall to get oncall's attention to this issue.

Failing test cases observed across two recent nightly runs (all environment: dev, platforms: dgx_gb200):

  • tests/functional_tests/test_cases/gpt/gpt3_mcore_te_tp2_pp1_resume_torch_dist_multi_dist_optimizer_instances
  • tests/functional_tests/test_cases/moe/gpt3_mcore_te_tp2_pp1_resume_torch_dist_te_8experts2parallel_multi_dist_optimizer_instances
  • tests/functional_tests/test_cases/gpt/gpt3_mcore_te_tp4_pp1_resume_torch_dist_dist_optimizer_overlap_grad_reduce_param_gather
  • tests/functional_tests/test_cases/moe/gpt3_moe_mcore_te_tp4_ep2_etp2_pp2_resume_torch_dist_dist_optimizer

Common pattern: every failing case matches _resume_torch_dist_(.*_)?dist_optimizer.* and uses the distributed-optimizer + --ckpt-format torch_dist + (typically) --dist-ckpt-optim-fully-reshardable / --num-distributed-optimizer-instances 2 resume path. Tests on the same hardware that do not exercise the dist-optimizer reshardable-resume code path are unaffected.
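To make the name pattern easy to check, here is a minimal sketch that tests the four failing test-case names against a regex. The helper name is_affected is hypothetical. The regex makes the segment between torch_dist_ and dist_optimizer optional, which is an assumption about the intended pattern: two of the names have dist_optimizer directly after torch_dist_, with nothing in between.

```python
import re

# Assumed pattern: the middle segment is optional, so names where
# "dist_optimizer" directly follows "_resume_torch_dist_" also match.
RESUME_DIST_OPT = re.compile(r"_resume_torch_dist_(.*_)?dist_optimizer")

# The four failing test cases listed in this issue (directory basenames).
FAILING = [
    "gpt3_mcore_te_tp2_pp1_resume_torch_dist_multi_dist_optimizer_instances",
    "gpt3_mcore_te_tp2_pp1_resume_torch_dist_te_8experts2parallel_multi_dist_optimizer_instances",
    "gpt3_mcore_te_tp4_pp1_resume_torch_dist_dist_optimizer_overlap_grad_reduce_param_gather",
    "gpt3_moe_mcore_te_tp4_ep2_etp2_pp2_resume_torch_dist_dist_optimizer",
]


def is_affected(test_case: str) -> bool:
    """Return True if the test-case name matches the failing-name pattern."""
    return RESUME_DIST_OPT.search(test_case) is not None


if __name__ == "__main__":
    for name in FAILING:
        print(f"{is_affected(name)!s:5} {name}")
```

The same predicate could be used to filter a nightly's job list down to the cases worth re-running while bisecting.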

Error

test_pretraining_regular_pipeline.py::test_regular_pipeline (1st-run goldens) PASSES in all four jobs:

INFO  common.py:263 DETERMINISTIC test for metric lm loss: PASSED
INFO  common.py:263 APPROXIMATE  test for metric lm loss: PASSED
INFO  common.py:263 DETERMINISTIC test for metric num-zeros: PASSED
INFO  common.py:263 APPROXIMATE  test for metric num-zeros: PASSED
PASSED tests/functional_tests/python_test_utils/test_pretraining_regular_pipeline.py::test_regular_pipeline

The job then fails when run_ci_test.sh re-invokes the analyzer with --is-second-run against the same $TENSORBOARD_PATH:

+ uv run --no-sync python /opt/megatron-lm/tests/functional_tests/python_test_utils/get_test_results_from_tensorboard_logs.py \
    --logs-dir /jet/assets/basic/<test-case>/1/tensorboard \
    --train-iters 100 \
    --output-path /jet/assets/basic/<test-case>/golden_values_dev_dgx_gb200_2nd.json \
    --is-second-run --is-normal-test --step-size 1

Traceback (most recent call last):
  File "/opt/megatron-lm/tests/functional_tests/python_test_utils/get_test_results_from_tensorboard_logs.py", line 82, in <module>
    collect_train_test_metrics()
  File "/opt/megatron-lm/tests/functional_tests/python_test_utils/get_test_results_from_tensorboard_logs.py", line 43, in collect_train_test_metrics
    summaries = common.read_tb_logs_as_list(
  File "/opt/megatron-lm/tests/functional_tests/python_test_utils/common.py", line 124, in read_tb_logs_as_list
    event_file = files[index]
                 ~~~~~^^^^^^^
IndexError: list index out of range
INFO:__main__:Pipeline terminated with status FAILED

get_test_results_from_tensorboard_logs.py passes index=1 when --is-second-run is set (line 45), expecting two events.out.tfevents.* files in the supplied --logs-dir. The IndexError means only one event file is present.
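The failure mode can be illustrated with a small stand-in for the indexing step. This is a sketch only, not the code in common.py: pick_event_file is a hypothetical helper, and ordering event files by modification time is an assumption. It shows how a bounds check would turn the bare IndexError into a message that names the directory and the number of event files found.

```python
import glob
import os


def pick_event_file(logs_dir: str, index: int) -> str:
    """Return the index-th events.out.tfevents.* file in logs_dir, oldest first.

    Hypothetical helper for illustration; not the implementation in common.py.
    """
    files = sorted(
        glob.glob(os.path.join(logs_dir, "events.out.tfevents.*")),
        key=os.path.getmtime,  # assumption: first run is the older file
    )
    if index >= len(files):
        # Same condition that surfaces as "list index out of range",
        # reported with enough context to diagnose from the CI log alone.
        raise FileNotFoundError(
            f"expected at least {index + 1} event file(s) in {logs_dir}, "
            f"found {len(files)}: {files}"
        )
    return files[index]
```

With index=1 and a directory holding a single event file, this raises with "found 1" in the message rather than an unexplained IndexError.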

Artifact evidence

Inspecting the failing job artifacts confirms the resume run produced an event file in a different sub-directory than the first run, instead of appending to the same one:

results/iteration=0/restart=0/assets/basic/<test-case>/1/tensorboard/events.out.tfevents.<ts1>...   (~181 KB — first run)
results/iteration=0/restart=0/assets/basic/<test-case>/2/tensorboard/events.out.tfevents.<ts2>...   (~181 KB — resume run)

run_ci_test.sh exports TENSORBOARD_PATH=$DIR/$i/$FILE once per outer-loop iteration $i and reuses it for both RUN_NUMBER=1 and RUN_NUMBER=2, so both runs are expected to write into the same <dir>/<i>/tensorboard/. The artifact layout shows the resume run instead wrote into <dir>/<i+1>/tensorboard/, leaving exactly one event file in the directory the analyzer reads, so files[index] with index=1 raises the IndexError. Both failing pipelines exhibit the same artifact layout.
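A quick way to confirm this layout on downloaded artifacts is to count event files per directory. The sketch below assumes only that event files are named events.out.tfevents.*; the function name count_event_files and the root path passed to it are placeholders.

```python
import os


def count_event_files(root: str) -> dict:
    """Map each directory under root (relative path) to its event-file count."""
    counts = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        n = sum(1 for name in filenames if name.startswith("events.out.tfevents."))
        if n:
            counts[os.path.relpath(dirpath, root)] = n
    return counts


if __name__ == "__main__":
    import sys

    # Usage: python count_events.py <path-to-assets/basic/test-case>
    for rel_dir, n in sorted(count_event_files(sys.argv[1]).items()):
        print(f"{n:3d}  {rel_dir}")
```

In the expected layout, 1/tensorboard reports two files; in the observed layout, 1/tensorboard and 2/tensorboard report one each.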

The training itself completes (W&B logs from rank 7's stderr.log show iteration counts past 100 in the resume run), so the regression is in how --tensorboard-dir is honored on the resume invocation, not in dist-optimizer correctness.

Steps/Code to reproduce bug

Inside the dev container, run any of the failing test cases as a ckpt-resume job on a 2-node × 4-GPU GB200 configuration. Minimum repro args (from the moe job's training cmdline):

ft_launcher --nproc_per_node 4 --nnodes 2 --master_addr <addr> --master_port <port> --node_rank 0 \
  --max-restarts=3 pretrain_gpt.py \
  --num-layers 12 --hidden-size 512 --num-attention-heads 8 \
  --tensorboard-dir <SHARED_DIR>/1/tensorboard \
  --micro-batch-size 4 --global-batch-size 32 --seq-length 1024 --max-position-embeddings 1024 \
  --train-iters 100 --exit-interval 100 \
  --save <SHARED_DIR>/checkpoints --load /tmp/checkpoints/ \
  --transformer-impl transformer_engine \
  --tensor-model-parallel-size 2 --pipeline-model-parallel-size 1 \
  --use-distributed-optimizer --num-distributed-optimizer-instances 2 \
  --overlap-grad-reduce --overlap-param-gather \
  --deterministic-mode --no-gradient-accumulation-fusion --attention-softmax-in-fp32 \
  --use-checkpoint-opt_param-scheduler --use-mcore-models \
  --ckpt-format torch_dist --dist-ckpt-optim-fully-reshardable --dist-ckpt-strictness log_all \
  --bf16 --async-save --use-persistent-ckpt-worker

Then re-run the same command with --load <SHARED_DIR>/checkpoints (i.e. RUN_NUMBER=2 from run_ci_test.sh) and check whether a second events.out.tfevents.* file lands in <SHARED_DIR>/1/tensorboard/ (expected) or in a sibling <SHARED_DIR>/2/tensorboard/ (observed).

Additional context

  • All four failing jobs run on GB200, 2 nodes × 4 GPUs.
  • The first run passes its test_regular_pipeline check on the recorded goldens, so the dist-optimizer + torch_dist save side is intact; the regression is on the resume + tensorboard-dir reuse path.
  • Same failure signature reproduces across two independent recent nightly runs, indicating a deterministic regression rather than infrastructure flakiness.
  • Suspect commits to investigate: recent changes to tests/functional_tests/shell_test_utils/run_ci_test.sh and to the megatron training-init path that constructs the tensorboard SummaryWriter on resume.

Triaged automatically via /create-issue.

Labels: bug
