Skip to content

Various training utils#4872

Merged
maanug-nv merged 20 commits into
NVIDIA:mainfrom
maanug-nv:utils-dir
May 22, 2026
Merged

Various training utils#4872
maanug-nv merged 20 commits into
NVIDIA:mainfrom
maanug-nv:utils-dir

Conversation

@maanug-nv

Copy link
Copy Markdown
Contributor

What does this PR do ?

Add some slurm utils to core, create a utils dir in training, and add some new utils to the dir.

⚠️ For major changes (either in lines of code or in its impact), please make sure to first share a design doc with the team. If you're unsure what's the best way to do so, contact @NVIDIA/mcore-oncall.

Issue tracking

For PRs from open-source community contributors:

  • New features: a linked issue is required. Please open a feature request and reference it here before submitting the PR.
  • Small updates (bug fixes, minor improvements): a linked issue is recommended and will accelerate the PR review process.

Linked issue:

Contribution process

Pre-checks

  • I have added relevant unit tests
  • I have added relevant functional tests
  • I have added proper typing to my code Typing guidelines
  • I have added relevant documentation
  • I have run the autoformatter.sh on my PR

Code review

Feel free to message or comment @NVIDIA/mcore-oncall to help accelerate your merge into main. The less complex your PR is, the faster it will be approved and merged!

All PRs start as draft. If you open a non-draft PR, it will be automatically converted to draft.

Step 1: Mark PR as "Ready for Review"

  1. When your PR is ready, click Ready for Review.
  2. An oncall reviewer is auto-assigned and expert reviewers are notified based on your changes.
    • Some PRs may jump straight to step 2. This is determined by .github/CODEOWNERS.

⚠️ Only mark as ready once merge-conflicts are resolved and the CI is passing.
Final Review might get declined if these requirements are not fulfilled.

Step 2: Final Review

For PRs that change megatron/core, once all expert reviewers have approved, the Final Review label is applied automatically and final reviewers are assigned.

For PRs outside megatron/core, this step is skipped.

Step 3: Approved

Once all required reviewers have approved, the Approved label is applied automatically.

Merge

Any member of mcore-engineers will be able to merge your PR.

For MRs into `dev` branch The proposed review process for `dev` branch is under active discussion.

MRs are mergable after one approval by either eharper@nvidia.com or zijiey@nvidia.com.

@copy-pr-bot

copy-pr-bot Bot commented May 19, 2026

Copy link
Copy Markdown

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@maanug-nv maanug-nv marked this pull request as ready for review May 19, 2026 17:05
@maanug-nv maanug-nv requested review from a team as code owners May 19, 2026 17:05
@maanug-nv

Copy link
Copy Markdown
Contributor Author

/claude review

Comment thread megatron/training/utils/log_utils.py
Comment thread megatron/training/utils/log_utils.py
Comment thread megatron/training/utils/common_utils.py Outdated
Comment thread megatron/training/utils/log_utils.py Outdated
Signed-off-by: Maanu Grover <maanug@nvidia.com>
Signed-off-by: Maanu Grover <maanug@nvidia.com>
maanug-nv added 3 commits May 21, 2026 16:54
Signed-off-by: Maanu Grover <maanug@nvidia.com>
Signed-off-by: Maanu Grover <maanug@nvidia.com>
Signed-off-by: Maanu Grover <maanug@nvidia.com>
@svcnvidia-nemo-ci

Copy link
Copy Markdown

🔄 Merge queue validation started!

You can track the progress here: https://github.com/NVIDIA/Megatron-LM/actions/runs/26310955621

Signed-off-by: Maanu Grover <maanug@nvidia.com>
@svcnvidia-nemo-ci

Copy link
Copy Markdown

🔄 Merge queue validation started!

You can track the progress here: https://github.com/NVIDIA/Megatron-LM/actions/runs/26315381779

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Approved All necessary approvals have been made complexity: medium

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants