Skip to content

ci(test): isolate ckpt-resume tensorboard per phase#5074

Merged
ko3n1g merged 2 commits into
NVIDIA:mainfrom
ko3n1g:ko3n1g/fix/ckpt-resume-tb-isolation
Jun 1, 2026
Merged

ci(test): isolate ckpt-resume tensorboard per phase#5074
ko3n1g merged 2 commits into
NVIDIA:mainfrom
ko3n1g:ko3n1g/fix/ckpt-resume-tb-isolation

Conversation

@ko3n1g

@ko3n1g ko3n1g commented May 30, 2026

Copy link
Copy Markdown
Contributor
Claude summary

What

ckpt-resume and frozen-resume functional tests previously wrote tensorboard events from both training phases into the same $TENSORBOARD_PATH directory and relied on files[1] (mtime-sorted glob index=1) to identify the resume run's data. That assumption is racy:

This PR removes the glob-ordering dependency by giving each phase its own directory.

How

run_ci_test.sh

  • Snapshot the per-repeat tensorboard base into _REPEAT_TENSORBOARD_PATH.
  • For ckpt-resume / frozen-resume, before each training phase set:
    export TENSORBOARD_PATH="$_REPEAT_TENSORBOARD_PATH/run_${RUN_NUMBER}"
    mkdir -p "$TENSORBOARD_PATH"
    Training picks up the new path via the existing --tensorboard-dir: ${TENSORBOARD_PATH} in the recipe YAML.
  • First-run analyzer now points at ${_REPEAT_TENSORBOARD_PATH}/run_1 for resume tests (unchanged for everything else).
  • Second-run analyzer points at ${_REPEAT_TENSORBOARD_PATH}/run_2 and no longer passes --is-second-run.

get_test_results_from_tensorboard_logs.py

  • The --is-second-run flag and the index=1 branch it controlled are removed. The single caller now passes an explicit per-phase --logs-dir, so the script always reads files[0].

Layout, before / after

# before (both phases collide; resume analyzer guesses via mtime sort)
$TENSORBOARD_PATH/
  events.out.tfevents.<ts1>.<host>.<pid1>    ← run 1
  events.out.tfevents.<ts2>.<host>.<pid2>    ← run 2 (analyzer: files[1])

# after (per-phase isolation; analyzer reads explicit subdir)
$TENSORBOARD_PATH/
  run_1/events.out.tfevents.<...>
  run_2/events.out.tfevents.<...>

Scope

  • Behavior unchanged for frozen-start, release, checkpoint-consistency, inference, and RL modes — they don't write to per-run subdirs.
  • No change to golden values; same scalars, same metric filter list.

Related

Each training phase of a ckpt-resume / frozen-resume test now writes its
tensorboard events into a dedicated ${TENSORBOARD_PATH}/run_N subdir, and
the analyzer is invoked with the explicit subdir instead of indexing into
a sorted glob. This removes the racy reliance on "files[1] is the resume
run", which can yield an empty _2nd.json when only one events file is
written, when the resume file is empty at read time, or when an extra
file is picked up — surfacing in the test as a pydantic ValidationError
out of `read_golden_values_from_json`.

The --is-second-run flag on `get_test_results_from_tensorboard_logs.py`
is now unused and removed; the single caller in run_ci_test.sh passes
the per-phase directory directly.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: oliver könig <okoenig@nvidia.com>
@copy-pr-bot

copy-pr-bot Bot commented May 30, 2026

Copy link
Copy Markdown

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@ko3n1g

ko3n1g commented May 30, 2026

Copy link
Copy Markdown
Contributor Author

/ok to test e38a45c

Signed-off-by: oliver könig <okoenig@nvidia.com>
@ko3n1g

ko3n1g commented May 30, 2026

Copy link
Copy Markdown
Contributor Author

/ok to test a10721c

@@ -1,3 +1,4 @@
# Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we need the full copyright?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, we only include the header not the full file

@svcnvidia-nemo-ci svcnvidia-nemo-ci added the Approved All necessary approvals have been made label Jun 1, 2026
@ko3n1g ko3n1g enabled auto-merge June 1, 2026 14:09
@ko3n1g ko3n1g added this pull request to the merge queue Jun 1, 2026
@svcnvidia-nemo-ci

Copy link
Copy Markdown

🔄 Merge queue validation started!

You can track the progress here: https://github.com/NVIDIA/Megatron-LM/actions/runs/26760172905

@ko3n1g ko3n1g removed this pull request from the merge queue due to a manual request Jun 1, 2026
@svcnvidia-nemo-ci

Copy link
Copy Markdown

🔄 Merge queue validation started!

You can track the progress here: https://github.com/NVIDIA/Megatron-LM/actions/runs/26765973442

@ko3n1g ko3n1g merged commit 46f1af7 into NVIDIA:main Jun 1, 2026
209 of 211 checks passed
ko3n1g added a commit that referenced this pull request Jun 3, 2026
Signed-off-by: oliver könig <okoenig@nvidia.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
(cherry picked from commit 46f1af7)
Signed-off-by: oliver könig <okoenig@nvidia.com>
copy-pr-bot Bot pushed a commit that referenced this pull request Jun 12, 2026
Signed-off-by: oliver könig <okoenig@nvidia.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Approved All necessary approvals have been made complexity: low Run functional tests

Projects

None yet

Development

Successfully merging this pull request may close these issues.

🐛 CI failure(nightlies and MR runs): ckpt-resume tests on dgx_gb200 fail with IndexError when reading 2nd-run tensorboard events

3 participants