ci(test): isolate ckpt-resume tensorboard per phase by ko3n1g · Pull Request #5074 · NVIDIA/Megatron-LM

ko3n1g · 2026-05-30T10:07:53Z

Claude summary

What

ckpt-resume and frozen-resume functional tests previously wrote tensorboard events from both training phases into the same $TENSORBOARD_PATH directory and relied on files[1] (mtime-sorted glob index=1) to identify the resume run's data. That assumption is racy:

if only one events file lands in the directory, files[1] raises IndexError (the failure mode in 🐛 CI failure(nightlies and MR runs): ckpt-resume tests on dgx_gb200 fail with IndexError when reading 2nd-run tensorboard events #4903);
if the indexed file has no scalar events flushed yet, or is an unrelated extra file, read_tb_logs_as_list returns {} and the _2nd.json file is written as {}. The resume pytest then fails in fixture setup with a pydantic ValidationError (input_value=PydanticUndefined) raised from read_golden_values_from_json at tests/functional_tests/python_test_utils/common.py:168. The latest h100 occurrence was observed on the gpt3_mcore_te_tp2_pp1_resume_torch_dist_te_8experts2parallel_multi_dist_optimizer_instances test case, the same test listed in #4903.

This PR removes the glob-ordering dependency by giving each phase its own directory.
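
The first failure mode can be reproduced in isolation. The sketch below is a stand-in for the old selection logic, not the analyzer's actual code: `pick_events_file_by_index` and the events file name are hypothetical, and it assumes only what the description states, namely an mtime-sorted glob that is then indexed.

```python
import glob
import os
import tempfile


def pick_events_file_by_index(logs_dir: str, index: int) -> str:
    """Hypothetical stand-in for the old logic: sort events files by mtime, then index."""
    files = glob.glob(os.path.join(logs_dir, "events.out.tfevents.*"))
    files.sort(key=os.path.getmtime)
    return files[index]


def demo_single_file_failure() -> str:
    """Simulate a shared directory in which only one events file was written."""
    with tempfile.TemporaryDirectory() as logs_dir:
        # Only one phase produced an events file (illustrative file name).
        open(os.path.join(logs_dir, "events.out.tfevents.1.host.100"), "w").close()
        try:
            # The old resume analyzer asked for index 1, i.e. "the second file".
            pick_events_file_by_index(logs_dir, index=1)
        except IndexError:
            return "IndexError"
        return "ok"


result = demo_single_file_failure()
print(result)
```

With a single file in the shared directory there is no index 1 to read, which matches the IndexError variant described above.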

How

run_ci_test.sh

Snapshot the per-repeat tensorboard base into _REPEAT_TENSORBOARD_PATH.
For ckpt-resume / frozen-resume, before each training phase set:
```
export TENSORBOARD_PATH="$_REPEAT_TENSORBOARD_PATH/run_${RUN_NUMBER}"
mkdir -p "$TENSORBOARD_PATH"
```
Training picks up the new path via the existing --tensorboard-dir: ${TENSORBOARD_PATH} in the recipe YAML.
First-run analyzer now points at ${_REPEAT_TENSORBOARD_PATH}/run_1 for resume tests (unchanged for everything else).
Second-run analyzer points at ${_REPEAT_TENSORBOARD_PATH}/run_2 and no longer passes --is-second-run.

get_test_results_from_tensorboard_logs.py

The --is-second-run flag and the index=1 branch it controlled are removed. The single caller now passes an explicit per-phase --logs-dir, so the script always reads files[0].
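
A minimal sketch of the selection after this change, assuming the analyzer still globs for events files and sorts them by mtime. `find_events_file` and `parse_args` are hypothetical names rather than the script's real functions; only the `--logs-dir` flag, the `run_N` subdirectories, and the `files[0]` choice come from the description. The explicit `FileNotFoundError` is an illustrative addition, not something the PR states.

```python
import argparse
import glob
import os
import tempfile


def find_events_file(logs_dir: str) -> str:
    """Return files[0] of the mtime-sorted events files in an explicit per-phase dir."""
    files = glob.glob(os.path.join(logs_dir, "events.out.tfevents.*"))
    if not files:
        # Assumption: fail loudly instead of indexing into an empty list.
        raise FileNotFoundError(f"no tensorboard events file under {logs_dir}")
    files.sort(key=os.path.getmtime)
    return files[0]


def parse_args(argv):
    """Hypothetical CLI: a single --logs-dir flag and no --is-second-run."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--logs-dir", required=True)
    return parser.parse_args(argv)


with tempfile.TemporaryDirectory() as base:
    # Build the per-phase layout: <base>/run_1 and <base>/run_2 (file names are illustrative).
    for run_number in (1, 2):
        phase_dir = os.path.join(base, f"run_{run_number}")
        os.makedirs(phase_dir)
        name = f"events.out.tfevents.{run_number}.host.{run_number}"
        open(os.path.join(phase_dir, name), "w").close()

    # The caller selects the phase through the directory, not through an index.
    args = parse_args(["--logs-dir", os.path.join(base, "run_2")])
    selected = os.path.basename(find_events_file(args.logs_dir))

print(selected)
```

Because each phase owns its directory, the phase is chosen by path and the result no longer depends on how many files the other phase wrote or when.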

Layout, before / after

# before (both phases collide; resume analyzer guesses via mtime sort)
$TENSORBOARD_PATH/
  events.out.tfevents.<ts1>.<host>.<pid1>    ← run 1
  events.out.tfevents.<ts2>.<host>.<pid2>    ← run 2 (analyzer: files[1])

# after (per-phase isolation; analyzer reads explicit subdir)
$TENSORBOARD_PATH/
  run_1/events.out.tfevents.<...>
  run_2/events.out.tfevents.<...>

Scope

Behavior unchanged for frozen-start, release, checkpoint-consistency, inference, and RL modes — they don't write to per-run subdirs.
No change to golden values; same scalars, same metric filter list.

Related

Same failure family as the closed 🐛 CI failure(nightlies and MR runs): ckpt-resume tests on dgx_gb200 fail with IndexError when reading 2nd-run tensorboard events #4903 (gb200 IndexError variant). The multi-node sync fix in fix: Fix multi-node functional test phase sync #4924 did not address single-node h100, because the underlying cause is the analyzer indexing into a sorted glob, not multi-node phase ordering.

Each training phase of a ckpt-resume / frozen-resume test now writes its tensorboard events into a dedicated ${TENSORBOARD_PATH}/run_N subdir, and the analyzer is invoked with the explicit subdir instead of indexing into a sorted glob. This removes the racy reliance on "files[1] is the resume run", which can yield an empty _2nd.json when only one events file is written, when the resume file is empty at read time, or when an extra file is picked up. In the test this surfaces as a pydantic ValidationError out of `read_golden_values_from_json`.

The --is-second-run flag on `get_test_results_from_tensorboard_logs.py` is now unused and removed; the single caller in run_ci_test.sh passes the per-phase directory directly.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: oliver könig <okoenig@nvidia.com>

copy-pr-bot · 2026-05-30T10:07:57Z

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

ko3n1g · 2026-05-30T10:08:16Z

/ok to test e38a45c

Signed-off-by: oliver könig <okoenig@nvidia.com>

ko3n1g · 2026-05-30T11:07:30Z

/ok to test a10721c

thomasdhc · 2026-06-01T13:50:01Z

@@ -1,3 +1,4 @@
+# Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.


do we need the full copyright?

No, we only include the header not the full file

svcnvidia-nemo-ci · 2026-06-01T14:09:45Z

🔄 Merge queue validation started!

You can track the progress here: https://github.com/NVIDIA/Megatron-LM/actions/runs/26760172905

svcnvidia-nemo-ci · 2026-06-01T15:53:20Z

🔄 Merge queue validation started!

You can track the progress here: https://github.com/NVIDIA/Megatron-LM/actions/runs/26765973442

Signed-off-by: oliver könig <okoenig@nvidia.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> (cherry picked from commit 46f1af7) Signed-off-by: oliver könig <okoenig@nvidia.com>

Signed-off-by: oliver könig <okoenig@nvidia.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

ko3n1g added the Run functional tests label May 30, 2026

ci(test): add copyright header to tensorboard logs util

a10721c

Signed-off-by: oliver könig <okoenig@nvidia.com>

ko3n1g marked this pull request as ready for review June 1, 2026 10:31

ko3n1g requested a review from a team as a code owner June 1, 2026 10:31

svcnvidia-nemo-ci requested a review from a team June 1, 2026 10:31

svcnvidia-nemo-ci added the complexity: low label Jun 1, 2026

thomasdhc reviewed Jun 1, 2026

thomasdhc approved these changes Jun 1, 2026

svcnvidia-nemo-ci added the Approved All necessary approvals have been made label Jun 1, 2026

ko3n1g enabled auto-merge June 1, 2026 14:09

ko3n1g added this pull request to the merge queue Jun 1, 2026

ko3n1g removed this pull request from the merge queue due to a manual request Jun 1, 2026

ko3n1g merged commit 46f1af7 into NVIDIA:main Jun 1, 2026
209 of 211 checks passed

copy-pr-bot Bot pushed a commit that referenced this pull request Jun 12, 2026

ci(test): isolate ckpt-resume tensorboard per phase (#5074)

9e72aac

Signed-off-by: oliver könig <okoenig@nvidia.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

ko3n1g merged 2 commits into NVIDIA:main from ko3n1g:ko3n1g/fix/ckpt-resume-tb-isolation

3 participants
