[training] feat: mark MLflow runs as FAILED on crash and KILLED on SIGTERM#3822
Merged
kamran-nvidia merged 9 commits intoMay 22, 2026
Merged
Conversation
…GTERM
MLflow's atexit handler ends the run with the default status FINISHED on
process exit, making crashed or preempted runs indistinguishable from
successful ones in the UI.
- Add end_active_mlflow_run(status) helper that ends the active run with
the given status, suppressing errors so signal-handler reentrancy in
mlflow.end_run cannot crash the SIGTERM path.
- Add install_mlflow_failure_hook() that chains a sys.excepthook to mark
the run FAILED before MLflow's atexit handler fires. Idempotent. The
previous excepthook is preserved so default traceback printing still
happens.
- Install the failure hook in state.mlflow_logger when we own the run
(not when an externally-started run is detected).
- Call end_active_mlflow_run("KILLED") in the SIGTERM-detected branch of
checkpoint_and_decide_exit.
- 8 unit tests covering helper behavior, hook idempotency, and excepthook
chaining.
Signed-off-by: Robert Luke <code@robertluke.net>
The mlflow_utils → checkpoint_utils → state cycle made the top-level
import of install_mlflow_failure_hook in state.py raise ImportError
("cannot import name 'install_mlflow_failure_hook' from partially
initialized module") whenever mlflow_utils was the entry point.
Move the import inside the only call site (GlobalState.mlflow_logger,
inside the active_run-is-None branch) so resolution is deferred to
runtime and the cycle is broken.
Repro before this fix: ``python -c "import megatron.bridge.training.utils.mlflow_utils"``.
Signed-off-by: Robert Luke <code@robertluke.net>
yaoyu-33
previously approved these changes
May 20, 2026
Contributor
|
/ok to test ce43b81 |
Contributor
|
@rob-luke Please have a look at CI failures. |
Signed-off-by: Rob Luke <code@robertluke.net>
Signed-off-by: Robert Luke <code@robertluke.net>
Contributor
Author
|
Thanks @kamran-nvidia and @yaoyu-33. The CI caught a test isolation bug which is now fixed. This PR is ready for review again. Thank you |
Contributor
|
/ok to test 8966249 |
Contributor
|
@rob-luke Please address the Lint issues. thanks. |
Signed-off-by: Robert Luke <code@robertluke.net>
Contributor
|
/ok to test bb9529b |
kamran-nvidia
approved these changes
May 21, 2026
Contributor
Author
|
Sorry about the lint error @kamran-nvidia , thanks for reviewing. |
Contributor
Author
|
Thank you @kamran-nvidia and @yaoyu-33 |
vasunvidia
pushed a commit
to vasunvidia/Megatron-Bridge
that referenced
this pull request
Jun 10, 2026
…GTERM (NVIDIA-NeMo#3822) Signed-off-by: Robert Luke <code@robertluke.net> Signed-off-by: Rob Luke <code@robertluke.net> Co-authored-by: Kamran Jafari <kjafarisadeg@nvidia.com> Signed-off-by: Vasudevan Rengasamy <vrengasamy@nvidia.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What does this PR do ?
Mark MLflow runs as FAILED on uncaught Python exceptions and KILLED when training is preempted via SIGTERM, instead of MLflow's default FINISHED-on-exit behavior.
Changelog
end_active_mlflow_run(status)helper insrc/megatron/bridge/training/utils/mlflow_utils.pyto end the active run with a given status, suppressing errors so signal-handler reentrancy cannot crash the SIGTERM path.install_mlflow_failure_hook()in the same module that chains asys.excepthookto mark the runFAILEDbefore MLflow's atexit fires. Idempotent; preserves the previous excepthook so default traceback printing still happens.src/megatron/bridge/training/state.py, install the failure hook frommlflow_loggeronly when Megatron-Bridge owns the run (active_run()isNonebeforestart_run). Externally-started parent/shared runs are left untouched.src/megatron/bridge/training/train.py, callend_active_mlflow_run("KILLED")incheckpoint_and_decide_exit's SIGTERM-detected branch, between the optional checkpoint save and the exit barrier.tests/unit_tests/training/utils/test_mlflow_utils.pycovering both helpers (no-op paths, status pass-through, exception suppression, excepthook idempotency and chaining).GitHub Actions CI
See the CI section in the Contributing doc for how to trigger the CI. A Nvidia developer will need to approve and trigger the CI for external contributors.
Before your PR is "Ready for review"
Pre checks:
import mlflowand no-op if MLflow isn't installed, so no impact on optional installs.Additional Information