fix(callbacks): Defer step/time-triggered ModelCheckpoint saves until validation metrics are available by littlebullGit · Pull Request #21106 · Lightning-AI/pytorch-lightning

littlebullGit · 2025-08-21T22:09:18Z

Defer step/time-triggered ModelCheckpoint saves until validation metrics are available

Fixes #20919

Root cause

With every_n_train_steps (or train_time_interval), checkpoints could save at train-batch end before validation ran. The monitored validation metric was missing/stale, so best_model_score could be incorrect.

Fix

In [src/lightning/pytorch/callbacks/model_checkpoint.py]:
- [ModelCheckpoint.on_train_batch_end]:
  - Defer saves when the monitored key is missing from [trainer.callback_metrics].
  - If at the last train batch and not saving at train-epoch-end, defer only when validation will run next:
    - trainer.enable_validation is True
    - trainer.num_val_batches > 0
    - trainer.check_val_every_n_epoch schedule matches the upcoming epoch
- [ModelCheckpoint.on_validation_end]:
  - Perform deferred saves to use fresh validation metrics.
- Allow zero timedelta for train_time_interval and broadcast the time-trigger decision across ranks via trainer.strategy.broadcast.
- No deferral when monitoring a train metric or when validation won’t run.
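
The deferral rules listed above can be sketched as a pure-Python predicate. This is only an illustration, not the code from `model_checkpoint.py`: the function names `validation_runs_next` and `should_defer_save` are hypothetical, the trainer attributes are passed in as plain arguments, and treating `check_val_every_n_epoch=None` as "validation may still run" is an assumption. The sketch also does not try to distinguish train metrics from validation metrics.

```python
from typing import Mapping, Optional


def validation_runs_next(
    enable_validation: bool,
    num_val_batches: int,
    check_val_every_n_epoch: Optional[int],
    current_epoch: int,
) -> bool:
    """Return True if a validation loop is expected right after this train epoch."""
    if not enable_validation or num_val_batches <= 0:
        return False
    if check_val_every_n_epoch is None:
        # Assumption: None means validation is not tied to the epoch schedule,
        # so it may still run next.
        return True
    # Epochs are zero-indexed, so validation follows epoch e when (e + 1) % n == 0.
    return (current_epoch + 1) % check_val_every_n_epoch == 0


def should_defer_save(
    monitor: Optional[str],
    callback_metrics: Mapping[str, float],
    is_last_batch: bool,
    save_on_train_epoch_end: bool,
    val_next: bool,
) -> bool:
    """Decide whether a step/time-triggered save should wait for validation."""
    if monitor is None:
        return False  # nothing is monitored, so there is nothing to wait for
    if monitor not in callback_metrics:
        return True  # the monitored key has not been logged yet
    if is_last_batch and not save_on_train_epoch_end and val_next:
        return True  # fresh validation metrics are about to arrive
    return False


if __name__ == "__main__":
    val_next = validation_runs_next(True, 10, 1, current_epoch=0)
    print(should_defer_save("val_loss", {}, False, False, val_next))
```

Keeping the schedule check separate from the deferral check makes each rule testable on its own, which mirrors how the bullet list separates "is the metric there" from "will validation run next".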

Tests

Repro (previously failing, now passing):
- [tests/tests_pytorch/callbacks/test_model_checkpoint_step_interval_val_metric.py]
Additional validations:
- [tests/tests_pytorch/callbacks/test_model_checkpoint_additional_cases.py]
- [tests/tests_pytorch/callbacks/test_model_checkpoint_edge_cases.py]

Outcome

best_model_score matches the latest validation metric.
Step/time-interval checkpointing behaves correctly without premature or skipped saves.
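
To make the before/after behavior concrete, here is a toy simulation of a step-triggered checkpoint with and without deferral. `ToyCheckpoint` and `run` are hypothetical names invented for this sketch and are not Lightning's classes; the toy always defers when `defer=True`, whereas the actual fix defers only under the conditions listed in the Fix section.

```python
from typing import Dict, List, Optional, Tuple


class ToyCheckpoint:
    """Hypothetical stand-in for a step-triggered checkpoint callback."""

    def __init__(self, monitor: str, every_n_train_steps: int, defer: bool) -> None:
        self.monitor = monitor
        self.every_n_train_steps = every_n_train_steps
        self.defer = defer
        self.pending = False
        # Each entry records (global step, metric value seen at save time).
        self.saved: List[Tuple[int, Optional[float]]] = []

    def _save(self, step: int, metrics: Dict[str, float]) -> None:
        self.saved.append((step, metrics.get(self.monitor)))

    def on_train_batch_end(self, step: int, metrics: Dict[str, float]) -> None:
        if step % self.every_n_train_steps != 0:
            return
        if self.defer:
            self.pending = True  # wait until validation has logged its metric
        else:
            self._save(step, metrics)  # eager save: metric may be missing or stale

    def on_validation_end(self, step: int, metrics: Dict[str, float]) -> None:
        if self.pending:
            self.pending = False
            self._save(step, metrics)


def run(defer: bool) -> List[Tuple[int, Optional[float]]]:
    ckpt = ToyCheckpoint("val_loss", every_n_train_steps=2, defer=defer)
    metrics: Dict[str, float] = {}
    step = 0
    for val_loss in [0.9, 0.5]:  # one validation result per epoch
        for _ in range(2):  # two train batches per epoch
            step += 1
            ckpt.on_train_batch_end(step, metrics)
        metrics["val_loss"] = val_loss  # validation logs the fresh metric
        ckpt.on_validation_end(step, metrics)
    return ckpt.saved


if __name__ == "__main__":
    print("eager:   ", run(defer=False))
    print("deferred:", run(defer=True))
```

In the eager run the first save sees no metric at all and the second sees the previous epoch's value; in the deferred run each save records the metric from the validation pass that just finished.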

📚 Documentation preview 📚: https://pytorch-lightning--21106.org.readthedocs.build/en/21106/

codecov · 2025-08-22T02:06:47Z

Codecov Report

❌ Patch coverage is 90.32258% with 3 lines in your changes missing coverage. Please review.
✅ Project coverage is 87%. Comparing base (e55650d) to head (6c1554a).

Additional details and impacted files

@@           Coverage Diff           @@
##           master   #21106   +/-   ##
=======================================
  Coverage      87%      87%           
=======================================
  Files         269      269           
  Lines       23520    23545   +25     
=======================================
+ Hits        20508    20542   +34     
+ Misses       3012     3003    -9
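
As a quick sanity check on the figures in this report, the percentages can be recomputed from the table. The hit, miss, and line counts are copied from the diff above; the patch size of 31 lines is an inference from "3 lines missing" at 90.32258% and is not stated by Codecov.

```python
# Figures copied from the Codecov table.
base_hits, base_lines = 20508, 23520
head_hits, head_lines = 20542, 23545

base_cov = 100 * base_hits / base_lines
head_cov = 100 * head_hits / head_lines

# Assumption: 3 uncovered lines at 90.32258% implies 31 patch lines (28 covered).
patch_lines, patch_missing = 31, 3
patch_cov = 100 * (patch_lines - patch_missing) / patch_lines

if __name__ == "__main__":
    print(f"base {base_cov:.2f}%  head {head_cov:.2f}%  patch {patch_cov:.5f}%")
```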

littlebullGit · 2025-08-22T03:49:55Z

@Borda, I cannot see the error in the two failed jobs. Can you help me or point me to what the error is?
View more details on Lit OSS [bot]
You don't have access to this Studio

Borda · 2025-08-22T09:53:12Z

> I cannot see the error in the two failed jobs. Can you help me or point me to what the error is? View more details on Lit OSS [bot] You don't have access to this Studio

These tests are optional for now, as the same ones are also failing on master.



… validation metrics are available (#21106)

* fix(callbacks): defer step/time-triggered ModelCheckpoint saves until validation metrics are available

Root cause:
- With `every_n_train_steps` (or `train_time_interval`), checkpoints could save at train batch end before validation ran, so the monitored val metric was missing/stale and `best_model_score` was incorrect. (Refs #20919)

Fix:
- In [src/lightning/pytorch/callbacks/model_checkpoint.py:ModelCheckpoint.on_train_batch_end]:
  - Defer saves when the monitored key is missing from [trainer.callback_metrics]
  - If on the last train batch and not saving at train-epoch-end, defer only when validation will run next:
    - `trainer.enable_validation` is True
    - `trainer.num_val_batches` > 0
    - `trainer.check_val_every_n_epoch` schedule matches the upcoming epoch
- Perform deferred saves in [on_validation_end], ensuring fresh validation metrics are used.
- Allow zero `timedelta` for `train_time_interval` and broadcast the time-trigger decision across ranks.
- Do not defer when monitoring a train metric or when no validation is scheduled.

Tests:
- Repro (previously failing, now passing):
  - [tests/tests_pytorch/callbacks/test_model_checkpoint_step_interval_val_metric.py]
- Additional validations:
  - [tests/tests_pytorch/callbacks/test_model_checkpoint_additional_cases.py]
  - [tests/tests_pytorch/callbacks/test_model_checkpoint_edge_cases.py]

Outcome:
- `best_model_score` matches the validation metric after the epoch.
- Step/time-interval checkpointing behaves correctly without premature or skipped saves.

* test: disable logger in model checkpoint tests to avoid side effects

* chlog

---------

Co-authored-by: Jirka B <j.borovec+github@gmail.com>
(cherry picked from commit b1cc925)

littlebullGit requested review from Borda, ethanwharris, justusschock, lantiga and tchaton as code owners August 21, 2025 22:09

github-actions Bot added the pl (Generic label for PyTorch Lightning package) and fabric (lightning.fabric.Fabric) labels Aug 21, 2025

Borda approved these changes Aug 22, 2025


github-actions Bot added has conflicts and removed has conflicts labels Aug 22, 2025

littlebullGit added 4 commits August 27, 2025 18:09

test: disable logger in model checkpoint tests to avoid side effects

b88b546

refactor: defer DeepSpeed import and logging configuration until needed

59dda02

test: add mock-based CPU tests for DeepSpeed strategy import paths

6c1554a

littlebullGit force-pushed the fix/20919-checkpoint-step-val-metric branch from 094b278 to 6c1554a Compare August 27, 2025 22:10

justusschock approved these changes Aug 28, 2025


Comment thread src/lightning/fabric/strategies/deepspeed.py Outdated

littlebullGit and others added 4 commits August 28, 2025 19:30

Revert "refactor: defer DeepSpeed import and logging configuration until needed"

ef816b6

This reverts commit 59dda02.

Revert "test: add mock-based CPU tests for DeepSpeed strategy import paths"

836de5a

This reverts commit 6c1554a.

Merge branch 'master' into fix/20919-checkpoint-step-val-metric

ced28da

chlog

a2a5964

Borda merged commit b1cc925 into Lightning-AI:master Aug 29, 2025
88 of 91 checks passed

Borda merged 8 commits into Lightning-AI:master from littlebullGit:fix/20919-checkpoint-step-val-metric

3 participants
