feat: Add support for GLM 5.1 GRPO #2489

Merged
terrykong merged 17 commits into main from sl/glm5.1 on Jun 3, 2026

slikhite-1 commented May 13, 2026

Contributor

What does this PR do?

Adds support and a recipe for training GLM 5.1 in BF16 with GRPO on 64 H100 nodes using the Megatron backend.

  • Upgraded the dependency stack for GLM 5.1: vllm 0.19.0 and transformers>=5.5.0,<=5.6.0; refreshed uv.lock; and bumped the Megatron-Bridge and Megatron-LM SHAs.

  • Added a GLM 5.1 GRPO Megatron recipe for 64 H100 nodes with batch size 64 and max_seq_len 2048.

  • Added a test-suite script (the 64-node run is kept manual-only because it exceeds the nightly GPU-hour budget).
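
As a quick sanity check before launching a run, the version constraints listed above can be verified in a few lines. This is a minimal sketch, not code from this PR: the helper names (`parse_version`, `transformers_ok`, `vllm_ok`) are hypothetical, and the parsing is a deliberate simplification that compares only the numeric major.minor.patch components and ignores pre-release, post-release, and local suffixes.

```python
import re

# Constraints quoted in the PR description; all names below are illustrative.
VLLM_PIN = "0.19.0"
TRANSFORMERS_MIN = "5.5.0"
TRANSFORMERS_MAX = "5.6.0"


def parse_version(version: str, width: int = 3) -> tuple:
    """Turn '5.5.0' into (5, 5, 0), padding short versions such as '5.5' with zeros.

    Simplification: only the leading digits of each component are read, so
    suffixes like 'rc1' or '+cu128' are ignored rather than ordered.
    """
    parts = []
    for piece in version.split(".")[:width]:
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    parts += [0] * (width - len(parts))
    return tuple(parts)


def transformers_ok(version: str) -> bool:
    """True if version falls in the inclusive range >=5.5.0,<=5.6.0."""
    low = parse_version(TRANSFORMERS_MIN)
    high = parse_version(TRANSFORMERS_MAX)
    return low <= parse_version(version) <= high


def vllm_ok(version: str) -> bool:
    """True if version matches the pinned vllm release."""
    return parse_version(version) == parse_version(VLLM_PIN)
```

In practice the installed versions could be read with `importlib.metadata.version("transformers")` and `importlib.metadata.version("vllm")` and passed to these helpers; the refreshed uv.lock remains the authoritative record of what the environment resolves to.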

[Screenshots: 2026-05-14 101016, 2026-05-14 101236]

Issues

List issues that this PR closes (syntax):

Usage

  • You can potentially add a usage example below
# Add a code snippet demonstrating how to use this

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you run the unit tests and functional tests locally? Visit our Testing Guide for how to run tests
  • Did you add or update any necessary documentation? Visit our Document Development Guide for how to write, build and test the docs.

Additional Information

  • ...

slikhite-1 requested review from a team as code owners on May 13, 2026 20:29

copy-pr-bot Bot commented May 13, 2026

Auto-sync is disabled for ready for review pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

github-actions Bot added the Documentation label ("Improvements or additions to documentation") on May 13, 2026
@github-actions

✅ Submodule Fast-Forward Check Results

Check based on commit: 2d87d35 (PR #2489 from sl/glm5.1)

✅ Submodules that are properly updated:

Megatron-Bridge: ✅ PR branch is ahead of main branch (fast-forward)
Megatron-LM: ✅ PR branch is ahead of main branch (fast-forward)

All submodule changes look good! ✨

@github-actions

✅ Submodule Fast-Forward Check Results

Check based on commit: bfc0b80 (PR #2489 from sl/glm5.1)

✅ Submodules that are properly updated:

Megatron-Bridge: ✅ PR branch is ahead of main branch (fast-forward)
Megatron-LM: ✅ PR branch is ahead of main branch (fast-forward)

All submodule changes look good! ✨

@github-actions

✅ Submodule Fast-Forward Check Results

Check based on commit: a70be3d (PR #2489 from sl/glm5.1)

✅ Submodules that are properly updated:

Megatron-Bridge: ✅ PR branch is ahead of main branch (fast-forward)

All submodule changes look good! ✨

copy-pr-bot Bot commented May 20, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@github-actions

✅ Submodule Fast-Forward Check Results

Check based on commit: 7a19c4b (PR #2489 from sl/glm5.1)

✅ Submodules that are properly updated:

Megatron-Bridge: ✅ PR branch is ahead of main branch (fast-forward)

All submodule changes look good! ✨

@github-actions

✅ Submodule Fast-Forward Check Results

Check based on commit: 1ae946f (PR #2489 from sl/glm5.1)

✅ Submodules that are properly updated:

Megatron-Bridge: ✅ PR branch is ahead of main branch (fast-forward)

All submodule changes look good! ✨

@github-actions

✅ Submodule Fast-Forward Check Results

Check based on commit: 595cd38 (PR #2489 from sl/glm5.1)

✅ Submodules that are properly updated:

Megatron-Bridge: ✅ PR branch is ahead of main branch (fast-forward)

All submodule changes look good! ✨

@github-actions

✅ Submodule Fast-Forward Check Results

Check based on commit: 25de6df (PR #2489 from sl/glm5.1)

✅ Submodules that are properly updated:

Megatron-Bridge: ✅ PR branch is ahead of main branch (fast-forward)

All submodule changes look good! ✨

@github-actions

✅ Submodule Fast-Forward Check Results

Check based on commit: 2caf99a (PR #2489 from sl/glm5.1)

✅ Submodules that are properly updated:

Megatron-Bridge: ✅ PR branch is ahead of main branch (fast-forward)

All submodule changes look good! ✨

@github-actions

✅ Submodule Fast-Forward Check Results

Check based on commit: fe84589 (PR #2489 from sl/glm5.1)

✅ Submodules that are properly updated:

Megatron-Bridge: ✅ PR branch is ahead of main branch (fast-forward)

All submodule changes look good! ✨

@github-actions

✅ Submodule Fast-Forward Check Results

Check based on commit: 47f6ef6 (PR #2489 from sl/glm5.1)

✅ Submodules that are properly updated:

Megatron-Bridge: ✅ PR branch is ahead of main branch (fast-forward)

All submodule changes look good! ✨

@kajalj22

Contributor

/ok to test 30efc24

@github-actions

✅ Submodule Fast-Forward Check Results

Check based on commit: 30efc24 (PR #2489 from sl/glm5.1)

✅ Submodules that are properly updated:

Megatron-Bridge: ✅ PR branch is ahead of main branch (fast-forward)

All submodule changes look good! ✨

kajalj22 added the CI:L1 label ("Run doctests, unit tests, and functional tests") on May 29, 2026
kajalj22 previously approved these changes on May 29, 2026
terrykong enabled auto-merge (squash) on June 3, 2026 05:08
slikhite-1 and others added 17 commits June 2, 2026 22:48
Signed-off-by: slikhite-1 <slikhite@nvidia.com>
Signed-off-by: slikhite-1 <slikhite@nvidia.com>
Signed-off-by: slikhite-1 <slikhite@nvidia.com>
Signed-off-by: slikhite-1 <slikhite@nvidia.com>
Signed-off-by: Aurelien Chartier <2567591+achartier@users.noreply.github.com>
This reverts commit 7a19c4b.

Signed-off-by: Aurelien Chartier <2567591+achartier@users.noreply.github.com>
Signed-off-by: slikhite-1 <slikhite@nvidia.com>
Signed-off-by: slikhite-1 <slikhite@nvidia.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Kajal Jain <kajalj@nvidia.com>
Signed-off-by: slikhite-1 <slikhite@nvidia.com>
Signed-off-by: slikhite-1 <slikhite@nvidia.com>
Signed-off-by: slikhite-1 <slikhite@nvidia.com>
Signed-off-by: slikhite-1 <slikhite@nvidia.com>
Signed-off-by: slikhite-1 <slikhite@nvidia.com>
Signed-off-by: slikhite-1 <slikhite@nvidia.com>
Signed-off-by: slikhite-1 <slikhite@nvidia.com>
Signed-off-by: slikhite-1 <slikhite@nvidia.com>

github-actions Bot commented Jun 3, 2026

✅ Submodule Fast-Forward Check Results

Check based on commit: f599f89 (PR #2489 from sl/glm5.1)

✅ Submodules that are properly updated:

Megatron-Bridge: ✅ PR branch is ahead of main branch (fast-forward)

All submodule changes look good! ✨

terrykong added the CI:Lfast label ("Runs a fast test suite and re-use nightly `main` container (but sync dependencies to PRs version)") and removed the CI:L1 label ("Run doctests, unit tests, and functional tests") on Jun 3, 2026
@terrykong

Collaborator

/ok to test f599f89

Labels

  • CI:Lfast: Runs a fast test suite and re-use nightly `main` container (but sync dependencies to PRs version)
  • Documentation: Improvements or additions to documentation

5 participants