ci: Major refactor of release-workflows #4602
Switch the release.yaml orchestration to use NVIDIA-NeMo/FW-CI-templates' new composable workflows:

- _release_bump.yml: multi-target bump for both megatron-core and megatron_fsdp via a JSON `bump-targets` input.
- _release_finalize.yml: GH release + Slack notify, taking release-version as an input from the bump output.

Megatron-LM keeps:

- Its own _build_test_publish_wheel.yml (multi-arch matrix manylinux build for megatron-core arm64+amd64 and megatron-fsdp amd64), wired in as a sibling job between bump and finalize.
- Its own release-docs.yml (custom docs flow), invoked after finalize.

This removes the local _release_library.yml (now superseded) and aligns release orchestration with the rest of the NeMo framework so future upstream improvements (validate-only PR rehearsal, etc.) become available without copying YAML.

Signed-off-by: oliver könig <okoenig@nvidia.com>
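The job chain described above could be wired roughly as follows. This is only a sketch under stated assumptions: the job names, the `release-version` output name, the shape of the `bump-targets` JSON, and the `<pinned-ref>` placeholder are all hypothetical, and the real input schema is whatever FW-CI-templates defines.

```yaml
jobs:
  bump:
    # <pinned-ref> is a placeholder; substitute the actual FW-CI-templates pin.
    uses: NVIDIA-NeMo/FW-CI-templates/.github/workflows/_release_bump.yml@<pinned-ref>
    with:
      # Hypothetical JSON shape, for illustration only.
      bump-targets: '[{"package": "megatron-core"}, {"package": "megatron-fsdp"}]'

  wheels:
    # Local multi-arch wheel build, a sibling job between bump and finalize.
    needs: [bump]
    uses: ./.github/workflows/_build_test_publish_wheel.yml

  finalize:
    needs: [bump, wheels]
    uses: NVIDIA-NeMo/FW-CI-templates/.github/workflows/_release_finalize.yml@<pinned-ref>
    with:
      # Assumes the bump workflow exposes an output with this name.
      release-version: ${{ needs.bump.outputs.release-version }}

  docs:
    # Custom docs flow, invoked after finalize.
    needs: [finalize]
    uses: ./.github/workflows/release-docs.yml
```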
Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually. Contributors can view more details about this message here.
/ok to test
Merge build-test-publish-wheel.yml (push-triggered, wheel-only) into release.yaml (workflow_dispatch-triggered, full release) so a single file with one FW-CI-templates pin governs both per-PR rehearsal and real release. Mirrors the Megatron-Bridge consolidation.

- On push (PR/main/deploy-release/merge_group): validate-only=true; the full pipeline rehearses (bump computes only, wheels build without publish, GH release payload echoed, docs publish skipped, Slack suppressed).
- On workflow_dispatch: validate-only=false; the existing dry-run knob still controls whether wheel publish + GH release POST + docs publish fire or stay inert.

The push trigger now exercises the bump and finalize paths on every PR, not just the wheel build.

Signed-off-by: oliver könig <okoenig@nvidia.com>
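One way the two trigger modes could map onto a single `validate-only` value is sketched below. The branch filter patterns, the separate `merge_group` event, the `dry-run` input definition, and the `<pinned-ref>` placeholder are assumptions for illustration, not the contents of the actual release.yaml.

```yaml
on:
  push:
    branches:
      - main
      - "pull-request/[0-9]+"   # assumed copy-pr-bot mirror branch pattern
      - "deploy-release/*"      # assumed pattern
  merge_group:                  # assumed; the commit only lists merge_group among the rehearsal triggers
  workflow_dispatch:
    inputs:
      dry-run:
        description: Keep wheel publish, GH release POST, and docs publish inert
        type: boolean
        default: true

jobs:
  bump:
    # <pinned-ref> is a placeholder for the single FW-CI-templates pin.
    uses: NVIDIA-NeMo/FW-CI-templates/.github/workflows/_release_bump.yml@<pinned-ref>
    with:
      # true on push-style events (rehearsal), false on manual dispatch (real release path)
      validate-only: ${{ github.event_name != 'workflow_dispatch' }}
```

Deriving the flag from `github.event_name` keeps the rehearsal and release paths in one file, which is the stated goal of the consolidation.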
/ok to test 729bb50
Signed-off-by: oliver könig <okoenig@nvidia.com>
/ok to test ed1d2fe
When dispatching against a copy-pr-bot mirror branch (pull-request/<id>), pre-flight's 'Get PR info' step matches startsWith(github.ref, 'refs/heads/pull-request/') and tries to look up a PR. Because the event_name is workflow_dispatch, not pull_request, the lookup fails. Skip pre-flight entirely on dispatch events; downstream jobs already short-circuit their pre-flight-output checks when github.event_name == 'workflow_dispatch'.

Signed-off-by: oliver könig <okoenig@nvidia.com>
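A minimal sketch of that skip, assuming the job is named `pre-flight` and defined inline; the runner label and the step body are placeholders, not the repository's actual implementation:

```yaml
jobs:
  pre-flight:
    # Skip the whole job on manual dispatch so the PR lookup never runs there.
    if: github.event_name != 'workflow_dispatch'
    runs-on: ubuntu-latest
    steps:
      - name: Get PR info
        # This condition also matches a dispatch against pull-request/<id>,
        # which is why the job-level guard above is needed.
        if: startsWith(github.ref, 'refs/heads/pull-request/')
        run: echo "look up the PR for ${GITHUB_REF_NAME}"  # placeholder body
```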
/ok to test b8919ca
/ok to test ed1d2fe
By default, a job is skipped when a job it `needs` was skipped, so bump/wheels would not run when pre-flight is skipped. Switch their `if:` to !cancelled() && (pre-flight.result is success or skipped), keeping the auto-skip-on-failure semantics intact.

Signed-off-by: oliver könig <okoenig@nvidia.com>
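Written out as a GitHub Actions expression, the condition could look like the sketch below. The job name `bump`, the runner label, and the step are hypothetical stand-ins; only the `if:` line illustrates the change.

```yaml
jobs:
  bump:
    needs: [pre-flight]
    # Run when pre-flight succeeded or was skipped; stay skipped when it failed.
    # The ${{ }} wrapper is required because a bare leading "!" is YAML tag syntax.
    if: ${{ !cancelled() && (needs.pre-flight.result == 'success' || needs.pre-flight.result == 'skipped') }}
    runs-on: ubuntu-latest
    steps:
      - run: echo "bump runs here"  # placeholder body
```

Including a status check function such as `!cancelled()` replaces the implicit `success()` check, which is what otherwise propagates the skip from pre-flight to its dependents.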
/ok to test f273553
Signed-off-by: oliver könig <okoenig@nvidia.com>
/ok to test 4292889
/ok to test 3fbbf8c
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: oliver könig <okoenig@nvidia.com>
/ok to test ee9b3f9
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: oliver könig <okoenig@nvidia.com>
/ok to test d5c90a9
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: oliver könig <okoenig@nvidia.com>
/ok to test be26a40
…sthrough Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: oliver könig <okoenig@nvidia.com>
Lets env-scoped SLACK_WEBHOOK reach the notify job in the called workflow. Signed-off-by: oliver könig <okoenig@nvidia.com>
Signed-off-by: oliver könig <okoenig@nvidia.com>
/ok to test a03fd03
Signed-off-by: oliver könig <okoenig@nvidia.com>
/ok to test f54a6a1
Signed-off-by: oliver könig <okoenig@nvidia.com>
/ok to test 1717260
Signed-off-by: oliver könig <okoenig@nvidia.com>
Signed-off-by: oliver könig <okoenig@nvidia.com>
Signed-off-by: oliver könig <okoenig@nvidia.com>
…!failure) Signed-off-by: oliver könig <okoenig@nvidia.com>
Signed-off-by: oliver könig <okoenig@nvidia.com>

# Conflicts:
#	.github/workflows/build-test-publish-wheel.yml
🔄 Merge queue validation started! You can track the progress here: https://github.com/NVIDIA/Megatron-LM/actions/runs/25669939636
🔄 Merge queue validation started! You can track the progress here: https://github.com/NVIDIA/Megatron-LM/actions/runs/25671734703
🔄 Merge queue validation started! You can track the progress here: https://github.com/NVIDIA/Megatron-LM/actions/runs/25678362086
🔄 Merge queue validation started! You can track the progress here: https://github.com/NVIDIA/Megatron-LM/actions/runs/25685644892
Why
See the design discussion in NVIDIA-NeMo/FW-CI-templates#466.
What
- Remove `.github/workflows/build-test-publish-wheel.yml`.
- Use `.github/workflows/release.yaml` as the single caller for both `push` and `workflow_dispatch`.
Test plan
- `workflow_dispatch dry-run=true` (sha 1717260, 2026-05-07T11:35:50Z, success): https://github.com/NVIDIA/Megatron-LM/actions/runs/25493359313
- `workflow_dispatch dry-run=false` on the next planned RC.
Rollout
v1.0.0.