Skip to content

[training] feat: mark MLflow runs as FAILED on crash and KILLED on SIGTERM#3822

Merged
kamran-nvidia merged 9 commits into
NVIDIA-NeMo:mainfrom
rob-luke:rob-luke/training/mlflow-status
May 22, 2026
Merged

[training] feat: mark MLflow runs as FAILED on crash and KILLED on SIGTERM#3822
kamran-nvidia merged 9 commits into
NVIDIA-NeMo:mainfrom
rob-luke:rob-luke/training/mlflow-status

Conversation

@rob-luke

Copy link
Copy Markdown
Contributor

What does this PR do ?

Mark MLflow runs as FAILED on uncaught Python exceptions and KILLED when training is preempted via SIGTERM, instead of MLflow's default FINISHED-on-exit behavior.

Changelog

  • Add end_active_mlflow_run(status) helper in src/megatron/bridge/training/utils/mlflow_utils.py to end the active run with a given status, suppressing errors so signal-handler reentrancy cannot crash the SIGTERM path.
  • Add install_mlflow_failure_hook() in the same module that chains a sys.excepthook to mark the run FAILED before MLflow's atexit fires. Idempotent; preserves the previous excepthook so default traceback printing still happens.
  • In src/megatron/bridge/training/state.py, install the failure hook from mlflow_logger only when Megatron-Bridge owns the run (active_run() is None before start_run). Externally-started parent/shared runs are left untouched.
  • In src/megatron/bridge/training/train.py, call end_active_mlflow_run("KILLED") in checkpoint_and_decide_exit's SIGTERM-detected branch, between the optional checkpoint save and the exit barrier.
  • Add 8 unit tests in tests/unit_tests/training/utils/test_mlflow_utils.py covering both helpers (no-op paths, status pass-through, exception suppression, excepthook idempotency and chaining).

GitHub Actions CI

See the CI section in the Contributing doc for how to trigger the CI. A Nvidia developer will need to approve and trigger the CI for external contributors.

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests? (8 unit tests added)
  • Did you add or update any necessary documentation? (Docstrings on new helpers; no user-facing docs change since the new behavior activates automatically when MLflow is already configured)
  • Does the PR affect components that are optional to install? (Ex: Numba, Pynini, Apex etc)
    • The helpers guard import mlflow and no-op if MLflow isn't installed, so no impact on optional installs.

Additional Information

  • Related to # (issue) — none filed

…GTERM

MLflow's atexit handler ends the run with the default status FINISHED on
process exit, making crashed or preempted runs indistinguishable from
successful ones in the UI.

- Add end_active_mlflow_run(status) helper that ends the active run with
  the given status, suppressing errors so signal-handler reentrancy in
  mlflow.end_run cannot crash the SIGTERM path.
- Add install_mlflow_failure_hook() that chains a sys.excepthook to mark
  the run FAILED before MLflow's atexit handler fires. Idempotent. The
  previous excepthook is preserved so default traceback printing still
  happens.
- Install the failure hook in state.mlflow_logger when we own the run
  (not when an externally-started run is detected).
- Call end_active_mlflow_run("KILLED") in the SIGTERM-detected branch of
  checkpoint_and_decide_exit.
- 8 unit tests covering helper behavior, hook idempotency, and excepthook
  chaining.

Signed-off-by: Robert Luke <code@robertluke.net>
@copy-pr-bot

copy-pr-bot Bot commented May 14, 2026

Copy link
Copy Markdown

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@rob-luke rob-luke changed the title [training] feat: mark MLflow runs as FAILED on crash and KILLED on SIGTERM feat(training): mark MLflow runs as FAILED on crash and KILLED on SIGTERM May 14, 2026
@yaoyu-33 yaoyu-33 added area:training Training loop, callbacks, and runtime integration feature New capabilities, enhancements, or enablement work waiting-on-maintainers Waiting on maintainers to respond labels May 14, 2026
The mlflow_utils → checkpoint_utils → state cycle made the top-level
import of install_mlflow_failure_hook in state.py raise ImportError
("cannot import name 'install_mlflow_failure_hook' from partially
initialized module") whenever mlflow_utils was the entry point.

Move the import inside the only call site (GlobalState.mlflow_logger,
inside the active_run-is-None branch) so resolution is deferred to
runtime and the cycle is broken.

Repro before this fix: ``python -c "import megatron.bridge.training.utils.mlflow_utils"``.

Signed-off-by: Robert Luke <code@robertluke.net>
@svcnvidia-nemo-ci svcnvidia-nemo-ci removed the waiting-on-maintainers Waiting on maintainers to respond label May 14, 2026
@rob-luke rob-luke changed the title feat(training): mark MLflow runs as FAILED on crash and KILLED on SIGTERM [training] feat: mark MLflow runs as FAILED on crash and KILLED on SIGTERM May 14, 2026
@yaoyu-33 yaoyu-33 added the waiting-on-maintainers Waiting on maintainers to respond label May 14, 2026
@svcnvidia-nemo-ci svcnvidia-nemo-ci removed the waiting-on-maintainers Waiting on maintainers to respond label May 14, 2026
@yaoyu-33 yaoyu-33 added the needs-review PR is ready for code review and waiting on a reviewer label May 14, 2026
@svcnvidia-nemo-ci svcnvidia-nemo-ci added the waiting-on-maintainers Waiting on maintainers to respond label May 16, 2026
@yaoyu-33 yaoyu-33 removed the needs-review PR is ready for code review and waiting on a reviewer label May 17, 2026
yaoyu-33
yaoyu-33 previously approved these changes May 20, 2026
@yaoyu-33 yaoyu-33 added ready-to-merge PR is approved, current, and only waiting for CI to pass before merge and removed waiting-on-maintainers Waiting on maintainers to respond labels May 20, 2026
@kamran-nvidia

Copy link
Copy Markdown
Contributor

/ok to test ce43b81

@kamran-nvidia

Copy link
Copy Markdown
Contributor

@rob-luke Please have a look at CI failures.

@kamran-nvidia kamran-nvidia removed the ready-to-merge PR is approved, current, and only waiting for CI to pass before merge label May 20, 2026
@kamran-nvidia kamran-nvidia added the waiting-on-customer Waiting on the original author to respond label May 20, 2026
Signed-off-by: Rob Luke <code@robertluke.net>
Signed-off-by: Robert Luke <code@robertluke.net>
@rob-luke

Copy link
Copy Markdown
Contributor Author

Thanks @kamran-nvidia and @yaoyu-33. The CI caught a test isolation bug which is now fixed. This PR is ready for review again. Thank you

@kamran-nvidia kamran-nvidia requested a review from yaoyu-33 May 21, 2026 01:11
@kamran-nvidia

Copy link
Copy Markdown
Contributor

/ok to test 8966249

@kamran-nvidia

Copy link
Copy Markdown
Contributor

@rob-luke Please address the Lint issues. thanks.

@kamran-nvidia

Copy link
Copy Markdown
Contributor

/ok to test bb9529b

@rob-luke

Copy link
Copy Markdown
Contributor Author

Sorry about the lint error @kamran-nvidia , thanks for reviewing.

@kamran-nvidia kamran-nvidia merged commit 9281a38 into NVIDIA-NeMo:main May 22, 2026
76 checks passed
@rob-luke

Copy link
Copy Markdown
Contributor Author

Thank you @kamran-nvidia and @yaoyu-33

vasunvidia pushed a commit to vasunvidia/Megatron-Bridge that referenced this pull request Jun 10, 2026
…GTERM (NVIDIA-NeMo#3822)

Signed-off-by: Robert Luke <code@robertluke.net>
Signed-off-by: Rob Luke <code@robertluke.net>
Co-authored-by: Kamran Jafari <kjafarisadeg@nvidia.com>
Signed-off-by: Vasudevan Rengasamy <vrengasamy@nvidia.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area:training Training loop, callbacks, and runtime integration community-request feature New capabilities, enhancements, or enablement work waiting-on-customer Waiting on the original author to respond

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants