[training] feat: mark MLflow runs as FAILED on crash and KILLED on SIGTERM by rob-luke · Pull Request #3822 · NVIDIA-NeMo/Megatron-Bridge

rob-luke · 2026-05-14T10:47:16Z

What does this PR do?

Mark MLflow runs as FAILED on uncaught Python exceptions and KILLED when training is preempted via SIGTERM, instead of MLflow's default FINISHED-on-exit behavior.

Changelog

- Add end_active_mlflow_run(status) helper in src/megatron/bridge/training/utils/mlflow_utils.py to end the active run with a given status, suppressing errors so signal-handler reentrancy cannot crash the SIGTERM path.
- Add install_mlflow_failure_hook() in the same module that chains a sys.excepthook to mark the run FAILED before MLflow's atexit fires. Idempotent; preserves the previous excepthook so default traceback printing still happens.
- In src/megatron/bridge/training/state.py, install the failure hook from mlflow_logger only when Megatron-Bridge owns the run (active_run() is None before start_run). Externally-started parent/shared runs are left untouched.
- In src/megatron/bridge/training/train.py, call end_active_mlflow_run("KILLED") in checkpoint_and_decide_exit's SIGTERM-detected branch, between the optional checkpoint save and the exit barrier.
- Add 8 unit tests in tests/unit_tests/training/utils/test_mlflow_utils.py covering both helpers (no-op paths, status pass-through, exception suppression, excepthook idempotency and chaining).
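
For readers skimming the changelog, here is a minimal sketch of what the two helpers might look like. It is not the code from this PR: the module-level `_HOOK_INSTALLED` flag and the inner `_hook` name are assumptions made for illustration. The only MLflow calls used are `mlflow.active_run()` and `mlflow.end_run(status=...)`.

```python
import sys

# Assumed module-level flag that makes hook installation idempotent.
_HOOK_INSTALLED = False


def end_active_mlflow_run(status: str = "FINISHED") -> None:
    """End the active MLflow run with `status`; never raise."""
    try:
        import mlflow  # guarded import: no-op when MLflow is not installed
    except ImportError:
        return
    try:
        if mlflow.active_run() is not None:
            mlflow.end_run(status=status)
    except Exception:
        # Swallow errors so a failure here cannot break the shutdown path.
        pass


def install_mlflow_failure_hook() -> None:
    """Chain a sys.excepthook that marks the run FAILED, then defers to the old hook."""
    global _HOOK_INSTALLED
    if _HOOK_INSTALLED:
        return
    previous_hook = sys.excepthook

    def _hook(exc_type, exc_value, traceback):
        end_active_mlflow_run("FAILED")
        # Call the previous hook so the default traceback is still printed.
        previous_hook(exc_type, exc_value, traceback)

    sys.excepthook = _hook
    _HOOK_INSTALLED = True
```

Because `sys.excepthook` runs before interpreter shutdown, the run is already ended as FAILED by the time any atexit handler executes, so there is no active run left for it to mark FINISHED.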

GitHub Actions CI

See the CI section in the Contributing doc for how to trigger the CI. An NVIDIA developer will need to approve and trigger the CI for external contributors.

Before your PR is "Ready for review"

Pre checks:

Make sure you read and followed Contributor guidelines
Did you write any new necessary tests? (8 unit tests added)
Did you add or update any necessary documentation? (Docstrings on new helpers; no user-facing docs change since the new behavior activates automatically when MLflow is already configured)
Does the PR affect components that are optional to install? (Ex: Numba, Pynini, Apex etc)
- The helpers guard import mlflow and no-op if MLflow isn't installed, so no impact on optional installs.

Additional Information

Related to # (issue) — none filed

…GTERM

MLflow's atexit handler ends the run with the default status FINISHED on process exit, making crashed or preempted runs indistinguishable from successful ones in the UI.

- Add end_active_mlflow_run(status) helper that ends the active run with the given status, suppressing errors so signal-handler reentrancy in mlflow.end_run cannot crash the SIGTERM path.
- Add install_mlflow_failure_hook() that chains a sys.excepthook to mark the run FAILED before MLflow's atexit handler fires. Idempotent. The previous excepthook is preserved so default traceback printing still happens.
- Install the failure hook in state.mlflow_logger when we own the run (not when an externally-started run is detected).
- Call end_active_mlflow_run("KILLED") in the SIGTERM-detected branch of checkpoint_and_decide_exit.
- 8 unit tests covering helper behavior, hook idempotency, and excepthook chaining.

Signed-off-by: Robert Luke <code@robertluke.net>
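
The two call sites described in this commit message can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in state.py or train.py: `start_run_if_unowned` and `exit_on_sigterm` are hypothetical names, the MLflow module and the callbacks are passed in as parameters to keep the sketch self-contained, and whether the hook is installed before or after `start_run` in the PR is not stated in the source.

```python
from typing import Callable, Optional


def start_run_if_unowned(mlflow_module, install_hook: Callable[[], None], **start_kwargs) -> bool:
    """Start a run and install the failure hook only when no run is active yet.

    Returns True when this code owns the run, False when an externally
    started run was detected and left untouched.
    """
    owns_run = mlflow_module.active_run() is None
    if owns_run:
        mlflow_module.start_run(**start_kwargs)
        install_hook()
    return owns_run


def exit_on_sigterm(
    sigterm_received: bool,
    end_run: Callable[[str], None],
    barrier: Callable[[], None],
    save_checkpoint: Optional[Callable[[], None]] = None,
) -> bool:
    """Mirror the ordering described above: save, mark KILLED, then barrier."""
    if not sigterm_received:
        return False
    if save_checkpoint is not None:
        save_checkpoint()  # optional checkpoint save comes first
    end_run("KILLED")      # mark the run before ranks synchronize and exit
    barrier()
    return True
```

Ending the run between the checkpoint save and the barrier means the status is recorded even if the process is torn down shortly after the barrier.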

copy-pr-bot · 2026-05-14T10:47:20Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

The mlflow_utils → checkpoint_utils → state cycle made the top-level import of install_mlflow_failure_hook in state.py raise ImportError ("cannot import name 'install_mlflow_failure_hook' from partially initialized module") whenever mlflow_utils was the entry point.

Move the import inside the only call site (GlobalState.mlflow_logger, inside the active_run-is-None branch) so resolution is deferred until that branch first runs, by which point mlflow_utils is fully initialized, and the cycle is broken.

Repro before this fix: ``python -c "import megatron.bridge.training.utils.mlflow_utils"``.

Signed-off-by: Robert Luke <code@robertluke.net>
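
The failure mode and the fix can be reproduced in miniature. The sketch below is not the Megatron-Bridge code: it writes two throwaway modules, `utils_mod` and `state_mod` (hypothetical stand-ins for mlflow_utils and state, with the checkpoint_utils hop omitted), into a temporary directory and imports them with either a top-level or a function-level import.

```python
import importlib
import sys
import tempfile
import textwrap
from pathlib import Path

MODULES = ("utils_mod", "state_mod")


def write_modules(root: Path, deferred: bool) -> None:
    # utils_mod imports state_mod before it defines install_hook.
    (root / "utils_mod.py").write_text(textwrap.dedent("""
        import state_mod
        def install_hook():
            return "installed"
    """))
    if deferred:
        state_src = """
            def logger():
                from utils_mod import install_hook  # resolved at call time
                return install_hook()
        """
    else:
        state_src = """
            from utils_mod import install_hook  # utils_mod is only half-initialized here
            def logger():
                return install_hook()
        """
    (root / "state_mod.py").write_text(textwrap.dedent(state_src))


def try_import(deferred: bool) -> str:
    """Import utils_mod as the entry point and report what happens."""
    with tempfile.TemporaryDirectory() as tmp:
        write_modules(Path(tmp), deferred)
        sys.path.insert(0, tmp)
        try:
            importlib.invalidate_caches()
            utils_mod = importlib.import_module("utils_mod")
            return utils_mod.state_mod.logger()
        except ImportError as exc:
            return f"ImportError: {exc}"
        finally:
            # Clean up so repeated calls start from a fresh state.
            sys.path.remove(tmp)
            for name in MODULES:
                sys.modules.pop(name, None)
```

Calling `try_import(deferred=False)` returns a string starting with `ImportError`, while `try_import(deferred=True)` returns `"installed"`, because the function-level import only runs once both modules have finished loading.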

kamran-nvidia · 2026-05-20T18:27:12Z

/ok to test ce43b81

kamran-nvidia · 2026-05-20T22:35:19Z

@rob-luke Please have a look at CI failures.

Signed-off-by: Rob Luke <code@robertluke.net>

Signed-off-by: Robert Luke <code@robertluke.net>

rob-luke · 2026-05-21T00:51:08Z

Thanks @kamran-nvidia and @yaoyu-33. The CI caught a test isolation bug which is now fixed. This PR is ready for review again. Thank you

kamran-nvidia · 2026-05-21T12:08:45Z

/ok to test 8966249

kamran-nvidia · 2026-05-21T15:15:05Z

@rob-luke Please address the lint issues. Thanks.

Signed-off-by: Robert Luke <code@robertluke.net>

kamran-nvidia · 2026-05-21T22:43:15Z

/ok to test bb9529b

rob-luke · 2026-05-21T22:52:36Z

Sorry about the lint error @kamran-nvidia, thanks for reviewing.

rob-luke · 2026-05-22T23:31:17Z

Thank you @kamran-nvidia and @yaoyu-33

…GTERM (NVIDIA-NeMo#3822)

Signed-off-by: Robert Luke <code@robertluke.net>
Signed-off-by: Rob Luke <code@robertluke.net>
Co-authored-by: Kamran Jafari <kjafarisadeg@nvidia.com>
Signed-off-by: Vasudevan Rengasamy <vrengasamy@nvidia.com>

github-actions Bot added the community-request label May 14, 2026

rob-luke changed the title ~~[training] feat: mark MLflow runs as FAILED on crash and KILLED on SIGTERM~~ feat(training): mark MLflow runs as FAILED on crash and KILLED on SIGTERM May 14, 2026

yaoyu-33 added the area:training (Training loop, callbacks, and runtime integration), feature (New capabilities, enhancements, or enablement work), and waiting-on-maintainers (Waiting on maintainers to respond) labels May 14, 2026

svcnvidia-nemo-ci removed the waiting-on-maintainers (Waiting on maintainers to respond) label May 14, 2026

rob-luke changed the title ~~feat(training): mark MLflow runs as FAILED on crash and KILLED on SIGTERM~~ [training] feat: mark MLflow runs as FAILED on crash and KILLED on SIGTERM May 14, 2026

yaoyu-33 added the waiting-on-maintainers (Waiting on maintainers to respond) label May 14, 2026

svcnvidia-nemo-ci removed the waiting-on-maintainers (Waiting on maintainers to respond) label May 14, 2026

yaoyu-33 added the needs-review (PR is ready for code review and waiting on a reviewer) label May 14, 2026

svcnvidia-nemo-ci added the waiting-on-maintainers (Waiting on maintainers to respond) label May 16, 2026

yaoyu-33 removed the needs-review (PR is ready for code review and waiting on a reviewer) label May 17, 2026

rob-luke added 2 commits May 19, 2026 08:25

Merge branch 'main' into rob-luke/training/mlflow-status

05514ac

Merge branch 'main' into rob-luke/training/mlflow-status

63fe8d6

yaoyu-33 previously approved these changes May 20, 2026

yaoyu-33 added the ready-to-merge (PR is approved, current, and only waiting for CI to pass before merge) label and removed the waiting-on-maintainers (Waiting on maintainers to respond) label May 20, 2026

Merge branch 'main' into rob-luke/training/mlflow-status

ce43b81

copy-pr-bot Bot temporarily deployed to public May 20, 2026 18:27 Inactive

copy-pr-bot Bot temporarily deployed to test May 20, 2026 18:28 Inactive

copy-pr-bot Bot temporarily deployed to public May 20, 2026 19:12 Inactive

copy-pr-bot Bot temporarily deployed to public May 20, 2026 19:13 Inactive

copy-pr-bot Bot temporarily deployed to public May 20, 2026 19:30 Inactive

kamran-nvidia removed the ready-to-merge (PR is approved, current, and only waiting for CI to pass before merge) label May 20, 2026

kamran-nvidia added the waiting-on-customer (Waiting on the original author to respond) label May 20, 2026

Merge branch 'main' into rob-luke/training/mlflow-status

4d3a9f8

Signed-off-by: Rob Luke <code@robertluke.net>

rob-luke dismissed yaoyu-33’s stale review via 4d3a9f8 May 21, 2026 00:13

[training] test: reset sys.excepthook between mlflow hook tests

8966249

Signed-off-by: Robert Luke <code@robertluke.net>
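
The test-isolation fix named in this commit title presumably restores `sys.excepthook` after each test so a hook installed by one test cannot leak into the next. A minimal sketch of that idea, written as a plain context manager so it runs without pytest; `preserved_excepthook` is a hypothetical name, and the fixture actually used in the PR may differ (for instance, it may also need to reset whatever idempotency state the hook keeps).

```python
import contextlib
import sys


@contextlib.contextmanager
def preserved_excepthook():
    """Restore sys.excepthook on exit, even if the body raises."""
    original = sys.excepthook
    try:
        yield
    finally:
        sys.excepthook = original
```

In a pytest suite the same save-and-restore logic could live in an autouse fixture, with the `yield` separating setup from teardown.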

kamran-nvidia requested a review from yaoyu-33 May 21, 2026 01:11

copy-pr-bot Bot temporarily deployed to public May 21, 2026 12:09 Inactive

copy-pr-bot Bot temporarily deployed to public May 21, 2026 13:02 Inactive

copy-pr-bot Bot temporarily deployed to public May 21, 2026 13:20 Inactive

rob-luke added 2 commits May 22, 2026 08:09

Merge branch 'main' into rob-luke/training/mlflow-status

d6d752d

[training] chore: trim trailing whitespace in state.py

bb9529b

Signed-off-by: Robert Luke <code@robertluke.net>

kamran-nvidia approved these changes May 21, 2026

copy-pr-bot Bot temporarily deployed to public May 21, 2026 22:43 Inactive

copy-pr-bot Bot temporarily deployed to test May 21, 2026 22:43 Inactive

copy-pr-bot Bot temporarily deployed to public May 21, 2026 23:15 Inactive

copy-pr-bot Bot temporarily deployed to public May 21, 2026 23:16 Inactive

copy-pr-bot Bot temporarily deployed to public May 21, 2026 23:33 Inactive

kamran-nvidia merged commit 9281a38 into NVIDIA-NeMo:main May 22, 2026
76 checks passed

cuichenx mentioned this pull request May 26, 2026

[NeMo FW 26.06 Release] MBridge v0.5.0 Roadmap #3754

Open

kamran-nvidia merged 9 commits into NVIDIA-NeMo:main from rob-luke:rob-luke/training/mlflow-status

4 participants
