ci(test): isolate ckpt-resume tensorboard per phase#5074
Merged
Conversation
Each training phase of a ckpt-resume / frozen-resume test now writes its
tensorboard events into a dedicated ${TENSORBOARD_PATH}/run_N subdir, and
the analyzer is invoked with the explicit subdir instead of indexing into
a sorted glob. This removes the racy reliance on "files[1] is the resume
run", which can yield an empty _2nd.json when only one events file is
written, when the resume file is empty at read time, or when an extra
file is picked up — surfacing in the test as a pydantic ValidationError
out of `read_golden_values_from_json`.
The --is-second-run flag on `get_test_results_from_tensorboard_logs.py`
is now unused and removed; the single caller in run_ci_test.sh passes
the per-phase directory directly.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: oliver könig <okoenig@nvidia.com>
|
Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually. Contributors can view more details about this message here. |
Contributor
Author
|
/ok to test e38a45c |
Signed-off-by: oliver könig <okoenig@nvidia.com>
Contributor
Author
|
/ok to test a10721c |
thomasdhc
reviewed
Jun 1, 2026
| @@ -1,3 +1,4 @@ | |||
| # Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved. | |||
Contributor
There was a problem hiding this comment.
do we need the full copyright?
Contributor
Author
There was a problem hiding this comment.
No, we only include the header not the full file
thomasdhc
approved these changes
Jun 1, 2026
|
🔄 Merge queue validation started! You can track the progress here: https://github.com/NVIDIA/Megatron-LM/actions/runs/26760172905 |
|
🔄 Merge queue validation started! You can track the progress here: https://github.com/NVIDIA/Megatron-LM/actions/runs/26765973442 |
ko3n1g
added a commit
that referenced
this pull request
Jun 3, 2026
Signed-off-by: oliver könig <okoenig@nvidia.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> (cherry picked from commit 46f1af7) Signed-off-by: oliver könig <okoenig@nvidia.com>
copy-pr-bot Bot
pushed a commit
that referenced
this pull request
Jun 12, 2026
Signed-off-by: oliver könig <okoenig@nvidia.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Claude summary
What
ckpt-resumeandfrozen-resumefunctional tests previously wrote tensorboard events from both training phases into the same$TENSORBOARD_PATHdirectory and relied onfiles[1](mtime-sorted globindex=1) to identify the resume run's data. That assumption is racy:IndexError(the failure mode in 🐛 CI failure(nightlies and MR runs): ckpt-resume tests on dgx_gb200 fail with IndexError when reading 2nd-run tensorboard events #4903),read_tb_logs_as_listreturns{}, the_2nd.jsonfile is written as{}, and the resume pytest fails in fixture setup with a pydanticValidationError: input_value=PydanticUndefinedout ofread_golden_values_from_jsonattests/functional_tests/python_test_utils/common.py:168— the latest h100 incarnation observed on thegpt3_mcore_te_tp2_pp1_resume_torch_dist_te_8experts2parallel_multi_dist_optimizer_instancestest case (same test enumerated in 🐛 CI failure(nightlies and MR runs): ckpt-resume tests on dgx_gb200 fail with IndexError when reading 2nd-run tensorboard events #4903).This PR removes the glob-ordering dependency by giving each phase its own directory.
How
run_ci_test.sh_REPEAT_TENSORBOARD_PATH.ckpt-resume/frozen-resume, before each training phase set:--tensorboard-dir: ${TENSORBOARD_PATH}in the recipe YAML.${_REPEAT_TENSORBOARD_PATH}/run_1for resume tests (unchanged for everything else).${_REPEAT_TENSORBOARD_PATH}/run_2and no longer passes--is-second-run.get_test_results_from_tensorboard_logs.py--is-second-runflag and theindex=1branch it controlled are removed. The single caller now passes an explicit per-phase--logs-dir, so the script always readsfiles[0].Layout, before / after
Scope
frozen-start,release,checkpoint-consistency, inference, and RL modes — they don't write to per-run subdirs.Related
IndexErrorvariant). The multi-node sync fix in fix: Fix multi-node functional test phase sync #4924 didn't address single-node h100 because the underlying cause is the analyzer's index-into-sorted-glob, not multi-node phase ordering.