🐛 CI failure (nightlies and MR runs): ckpt-resume tests on dgx_gb200 fail with IndexError when reading 2nd-run tensorboard events #4903

**Describe the bug**

Multiple `ckpt-resume` functional tests on `dgx_gb200` are failing in nightly CI with the same fingerprint. The first run (train-from-scratch) passes its golden-value comparison, but the post-resume `--is-second-run` analyzer crashes because no second tensorboard event file is present in the expected directory.

Tag @NVIDIA/mcore-oncall to get oncall's attention to this issue.

Failing test cases observed across two recent nightly runs (all `environment: dev`, `platforms: dgx_gb200`):

- `tests/functional_tests/test_cases/gpt/gpt3_mcore_te_tp2_pp1_resume_torch_dist_multi_dist_optimizer_instances`
- `tests/functional_tests/test_cases/moe/gpt3_mcore_te_tp2_pp1_resume_torch_dist_te_8experts2parallel_multi_dist_optimizer_instances`
- `tests/functional_tests/test_cases/gpt/gpt3_mcore_te_tp4_pp1_resume_torch_dist_dist_optimizer_overlap_grad_reduce_param_gather`
- `tests/functional_tests/test_cases/moe/gpt3_moe_mcore_te_tp4_ep2_etp2_pp2_resume_torch_dist_dist_optimizer`

Common pattern: every failing case name matches `_resume_torch_dist_(.*_)?dist_optimizer.*` and uses the distributed-optimizer + `--ckpt-format torch_dist` + (typically) `--dist-ckpt-optim-fully-reshardable` / `--num-distributed-optimizer-instances 2` resume path. Tests on the same hardware that do not exercise the dist-optimizer reshardable-resume code path are unaffected.
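
As a quick sanity check of the name fingerprint, the four failing test-case names listed above can be matched in Python. This is only a sketch: the regex below is an illustrative assumption, not something taken from the CI configuration. The `(.*_)?` group is optional because two of the names have `dist_optimizer` immediately after `_resume_torch_dist_`, so a pattern that requires an extra `_` in between would miss them.

```python
import re

# Illustrative pattern (an assumption, not taken from the CI config).
# The optional group covers names where "dist_optimizer" directly follows
# "_resume_torch_dist_" with nothing in between.
PATTERN = re.compile(r"_resume_torch_dist_(.*_)?dist_optimizer")

# Test-case names copied from the list of failing cases in this issue.
FAILING_CASES = [
    "gpt3_mcore_te_tp2_pp1_resume_torch_dist_multi_dist_optimizer_instances",
    "gpt3_mcore_te_tp2_pp1_resume_torch_dist_te_8experts2parallel_multi_dist_optimizer_instances",
    "gpt3_mcore_te_tp4_pp1_resume_torch_dist_dist_optimizer_overlap_grad_reduce_param_gather",
    "gpt3_moe_mcore_te_tp4_ep2_etp2_pp2_resume_torch_dist_dist_optimizer",
]


def matches_fingerprint(name: str) -> bool:
    """Return True if the test-case name matches the failure fingerprint."""
    return PATTERN.search(name) is not None


# Collect any failing case the pattern does not cover.
unmatched = [name for name in FAILING_CASES if not matches_fingerprint(name)]
print("unmatched:", unmatched)
```

The same helper could be pointed at the full list of `dgx_gb200` test cases to enumerate other jobs that are likely affected.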

**Error**

`test_pretraining_regular_pipeline.py::test_regular_pipeline` (1st-run goldens) **PASSES** in all four jobs:

```
INFO  common.py:263 DETERMINISTIC test for metric lm loss: PASSED
INFO  common.py:263 APPROXIMATE  test for metric lm loss: PASSED
INFO  common.py:263 DETERMINISTIC test for metric num-zeros: PASSED
INFO  common.py:263 APPROXIMATE  test for metric num-zeros: PASSED
PASSED tests/functional_tests/python_test_utils/test_pretraining_regular_pipeline.py::test_regular_pipeline
```

The job then fails when `run_ci_test.sh` re-invokes the analyzer with `--is-second-run` against the same `$TENSORBOARD_PATH`:

```
+ uv run --no-sync python /opt/megatron-lm/tests/functional_tests/python_test_utils/get_test_results_from_tensorboard_logs.py \
    --logs-dir /jet/assets/basic/<test-case>/1/tensorboard \
    --train-iters 100 \
    --output-path /jet/assets/basic/<test-case>/golden_values_dev_dgx_gb200_2nd.json \
    --is-second-run --is-normal-test --step-size 1

Traceback (most recent call last):
  File "/opt/megatron-lm/tests/functional_tests/python_test_utils/get_test_results_from_tensorboard_logs.py", line 82, in <module>
    collect_train_test_metrics()
  File "/opt/megatron-lm/tests/functional_tests/python_test_utils/get_test_results_from_tensorboard_logs.py", line 43, in collect_train_test_metrics
    summaries = common.read_tb_logs_as_list(
  File "/opt/megatron-lm/tests/functional_tests/python_test_utils/common.py", line 124, in read_tb_logs_as_list
    event_file = files[index]
                 ~~~~~^^^^^^^
IndexError: list index out of range
INFO:__main__:Pipeline terminated with status FAILED
```

`get_test_results_from_tensorboard_logs.py` passes `index=1` when `--is-second-run` is set (line 45), expecting two `events.out.tfevents.*` files in the supplied `--logs-dir`. The IndexError means only one event file is present.
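
The failure mode can be reproduced in isolation. The sketch below is not the body of `common.read_tb_logs_as_list` (only its `event_file = files[index]` line is visible in the traceback); `select_event_file` is a hypothetical helper, and sorting by file name is an assumption. It shows how a bounds check would turn the bare `IndexError` into a message that names the directory and the number of event files found.

```python
import tempfile
from pathlib import Path


def select_event_file(logs_dir: str, index: int) -> Path:
    """Return the index-th event file in logs_dir (hypothetical helper).

    Files are sorted by name, which is an assumption about ordering;
    index is expected to be non-negative.
    """
    files = sorted(Path(logs_dir).glob("events.out.tfevents.*"))
    if index >= len(files):
        # Fail with context instead of a bare "list index out of range".
        raise FileNotFoundError(
            f"expected at least {index + 1} event file(s) in {logs_dir}, "
            f"found {len(files)}"
        )
    return files[index]


# Demo: a directory holding only the first run's event file.
with tempfile.TemporaryDirectory() as logs_dir:
    (Path(logs_dir) / "events.out.tfevents.first_run").touch()

    print(select_event_file(logs_dir, 0).name)  # first run: found
    try:
        select_event_file(logs_dir, 1)  # second run: missing
    except FileNotFoundError as err:
        print("second-run lookup failed:", err)
```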

**Artifact evidence**

Inspecting the failing job artifacts confirms the resume run produced an event file in a *different* sub-directory than the first run, instead of appending to the same one:

```
results/iteration=0/restart=0/assets/basic/<test-case>/1/tensorboard/events.out.tfevents.<ts1>...   (~181 KB — first run)
results/iteration=0/restart=0/assets/basic/<test-case>/2/tensorboard/events.out.tfevents.<ts2>...   (~181 KB — resume run)
```

`run_ci_test.sh` exports `TENSORBOARD_PATH=$DIR/$i/$FILE` once per outer-loop iteration `$i` and re-uses it for both `RUN_NUMBER=1` and `RUN_NUMBER=2`, so both runs are expected to write into the same `<dir>/<i>/tensorboard/`. The artifact layout shows the resume run instead wrote into `<dir>/<i+1>/tensorboard/`, leaving exactly one event file in the directory the analyzer reads — hence `files[index=1]` fails. Both failing pipelines exhibit the same artifact layout.
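
To check downloaded artifacts for this layout, a small scan along the following lines may help. It is a sketch under assumptions: `event_files_by_run_dir` and `runs_missing_second_file` are hypothetical helpers, and the directory shape `<base>/<i>/tensorboard/` is taken from the artifact listing above. The demo rebuilds the observed layout in a temporary directory rather than reading real artifacts.

```python
import tempfile
from pathlib import Path


def event_files_by_run_dir(base_dir: str) -> dict[str, list[str]]:
    """Map each <base_dir>/<i>/tensorboard directory to its event-file names."""
    layout = {}
    for tb_dir in sorted(Path(base_dir).glob("*/tensorboard")):
        run_id = tb_dir.parent.name
        layout[run_id] = sorted(
            p.name for p in tb_dir.glob("events.out.tfevents.*")
        )
    return layout


def runs_missing_second_file(layout: dict[str, list[str]]) -> list[str]:
    """Return run ids whose tensorboard directory holds fewer than two event files."""
    return [run_id for run_id, files in layout.items() if len(files) < 2]


# Demo: recreate the observed layout (one event file each in 1/ and 2/).
with tempfile.TemporaryDirectory() as base:
    for run_id, stamp in (("1", "ts1"), ("2", "ts2")):
        tb_dir = Path(base) / run_id / "tensorboard"
        tb_dir.mkdir(parents=True)
        (tb_dir / f"events.out.tfevents.{stamp}").touch()

    layout = event_files_by_run_dir(base)
    print(layout)
    print("directories with < 2 event files:", runs_missing_second_file(layout))
```

In the expected layout, directory `1` would hold both event files and no sibling `2` directory would contain a stray one.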

The training itself completes (W&B logs from rank 7's `stderr.log` show iteration counts past 100 in the resume run), so the regression is in how `--tensorboard-dir` is honored on the resume invocation, not in dist-optimizer correctness.

**Steps/Code to reproduce bug**

Inside the dev container, run any of the failing test cases as a `ckpt-resume` job on a 2-node × 4-GPU GB200 configuration. Minimum repro args (from the moe job's training cmdline):

```bash
ft_launcher --nproc_per_node 4 --nnodes 2 --master_addr <addr> --master_port <port> --node_rank 0 \
  --max-restarts=3 pretrain_gpt.py \
  --num-layers 12 --hidden-size 512 --num-attention-heads 8 \
  --tensorboard-dir <SHARED_DIR>/1/tensorboard \
  --micro-batch-size 4 --global-batch-size 32 --seq-length 1024 --max-position-embeddings 1024 \
  --train-iters 100 --exit-interval 100 \
  --save <SHARED_DIR>/checkpoints --load /tmp/checkpoints/ \
  --transformer-impl transformer_engine \
  --tensor-model-parallel-size 2 --pipeline-model-parallel-size 1 \
  --use-distributed-optimizer --num-distributed-optimizer-instances 2 \
  --overlap-grad-reduce --overlap-param-gather \
  --deterministic-mode --no-gradient-accumulation-fusion --attention-softmax-in-fp32 \
  --use-checkpoint-opt_param-scheduler --use-mcore-models \
  --ckpt-format torch_dist --dist-ckpt-optim-fully-reshardable --dist-ckpt-strictness log_all \
  --bf16 --async-save --use-persistent-ckpt-worker
```

Then re-run the same command with `--load <SHARED_DIR>/checkpoints` (i.e. `RUN_NUMBER=2` from `run_ci_test.sh`) and check whether a second `events.out.tfevents.*` file lands in `<SHARED_DIR>/1/tensorboard/` (expected) or in a sibling `<SHARED_DIR>/2/tensorboard/` (observed).

**Additional context**

- All four failing jobs run on GB200, 2 nodes × 4 GPUs.
- The first run passes its `test_regular_pipeline` check on the recorded goldens, so the dist-optimizer + torch_dist save side is intact; the regression is on the resume + tensorboard-dir reuse path.
- Same failure signature reproduces across two independent recent nightly runs, indicating a deterministic regression rather than infrastructure flakiness.
- Suspect commits to investigate: recent changes to `tests/functional_tests/shell_test_utils/run_ci_test.sh` and to the megatron training-init path that constructs the tensorboard SummaryWriter on resume.

Triaged automatically via `/create-issue`.
