Support DDP in PyTorch-Lightning#4384
Conversation
I made this PR ready for review. Note that it is based on #4322 and should be merged after it.
Codecov Report

```
@@            Coverage Diff             @@
##           master    #4384      +/-   ##
==========================================
- Coverage   90.40%   90.16%   -0.24%
==========================================
  Files         172      181       +9
  Lines       13682    14037     +355
==========================================
+ Hits        12369    12657     +288
- Misses       1313     1380      +67
```

... and 16 files with indirect coverage changes
@HideakiImamura This is a follow-up for #4322. Could you join the review, please?

@Alnusjaponica #4322 was merged into

This pull request has not seen any recent activity.
toshihikoyanase
left a comment
Let me share my early comments.
I guess the change uses system attrs to store intermediate values even if users execute only a single process. I'm thinking of keeping them to simplify the logic. What do you think of it?
Co-authored-by: Toshihiko Yanase <toshihiko.yanase@gmail.com>
@toshihikoyanase Thank you for pointing this out. I fixed the comment and will change the code to use system_attrs only in distributed settings.

@Alnusjaponica Thank you for your update. Since we pair-programmed the code, we may be able to simplify it in the following respects:
toshihikoyanase
left a comment
Thank you for your update. Let me add some small comments.
Co-authored-by: Toshihiko Yanase <toshihiko.yanase@gmail.com>
toshihikoyanase
left a comment
I confirmed that the updated callback worked with the PyTorch Lightning DDP example in optuna-examples. I have a small comment, but the change mostly looks good to me.
Diff of pytorch_lightning_ddp.py

```diff
diff --git a/pytorch/pytorch_lightning_ddp.py b/pytorch/pytorch_lightning_ddp.py
index 030834d..6d609cb 100644
--- a/pytorch/pytorch_lightning_ddp.py
+++ b/pytorch/pytorch_lightning_ddp.py
@@ -78,6 +78,9 @@ class LightningNet(pl.LightningModule):
         self.log("val_acc", accuracy, sync_dist=True)
         self.log("hp_metric", accuracy, on_step=False, on_epoch=True, sync_dist=True)
 
+    def validation_epoch_end(self, output) -> None:
+        return
+
     def configure_optimizers(self) -> optim.Optimizer:
         return optim.Adam(self.model.parameters())
@@ -124,19 +127,22 @@ def objective(trial: optuna.trial.Trial) -> float:
     model = LightningNet(dropout, output_dims)
     datamodule = FashionMNISTDataModule(data_dir=DIR, batch_size=BATCHSIZE)
 
+    callback = PyTorchLightningPruningCallback(trial, monitor="val_acc")
     trainer = pl.Trainer(
         logger=True,
         limit_val_batches=PERCENT_VALID_EXAMPLES,
         enable_checkpointing=False,
         max_epochs=EPOCHS,
         gpus=-1 if torch.cuda.is_available() else None,
-        accelerator="ddp_cpu" if not torch.cuda.is_available() else None,
+        accelerator="cpu" if not torch.cuda.is_available() else None,
+        strategy="ddp_spawn",
         num_processes=os.cpu_count() if not torch.cuda.is_available() else None,
-        callbacks=[PyTorchLightningPruningCallback(trial, monitor="val_acc")],
+        callbacks=[callback],
     )
     hyperparameters = dict(n_layers=n_layers, dropout=dropout, output_dims=output_dims)
     trainer.logger.log_hyperparams(hyperparameters)
     trainer.fit(model, datamodule=datamodule)
+    callback.check_pruned()
     return trainer.callback_metrics["val_acc"].item()
```
toshihikoyanase
left a comment
We have some follow-up tasks as described in TODO comments, but I think we can work on them in a new PR.
LGTM. Thank you!
HideakiImamura
left a comment
There was a problem hiding this comment.
Thanks for the PR, and sorry for the late reply. I checked the overall code and it basically looks good to me. Could you check my comments?
```python
self._trial.storage.set_trial_system_attr(self._trial._trial_id, _PRUNED_KEY, True)
self._trial.storage.set_trial_system_attr(self._trial._trial_id, _EPOCH_KEY, epoch)
```

```python
def check_pruned(self) -> None:
```
This function is intended to be called manually by users after Trainer.fit. How about adding concrete instructions to the documentation about where this function should be called?
I added some explanation and example code to the docstrings.
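The calling pattern discussed above (call `check_pruned()` by hand right after `Trainer.fit` returns) can be sketched with minimal stand-in classes. `FakeTrainer`, `FakePruningCallback`, and `TrialPruned` below are illustrative stubs, not the real PyTorch Lightning or Optuna classes; they only model the control flow.

```python
# Sketch (assumption): under DDP spawn, an exception raised inside a worker
# process does not propagate to the parent, so the callback records the
# decision and check_pruned() re-raises it in the parent after fit().


class TrialPruned(Exception):
    """Stand-in for optuna.TrialPruned."""


class FakePruningCallback:
    def __init__(self):
        self._pruned = False

    def on_validation_end(self):
        # In a spawned worker, raising here would be lost; only record.
        self._pruned = True

    def check_pruned(self):
        # Called in the parent process after fit(); re-raise the decision.
        if self._pruned:
            raise TrialPruned()


class FakeTrainer:
    def __init__(self, callbacks):
        self.callbacks = callbacks

    def fit(self):
        for cb in self.callbacks:
            cb.on_validation_end()


def objective():
    callback = FakePruningCallback()
    trainer = FakeTrainer(callbacks=[callback])
    trainer.fit()
    # Must come after fit() and before returning the metric, so that a
    # pruning decision made in a worker actually prunes the trial.
    callback.check_pruned()
    return 0.0
```

The key design point documented in the PR: keeping the callback in a local variable (rather than constructing it inline in `callbacks=[...]`) is what makes the post-fit `check_pruned()` call possible.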
HideakiImamura
left a comment
Thanks for the update. LGTM.
Motivation

Follow-up to #4322.

DDP is temporarily not supported in `PyTorchLightningPruningCallback` because of the problem described in #4322. This PR makes `PyTorchLightningPruningCallback` support DDP again.

Description of the changes

This PR
- requires users to call `callback.check_pruned()` in objective functions when they use DDP
- adds `test_pytorch_lightning_pruning_callback_ddp_monitor` and `test_pytorch_lightning_pruning_callback_ddp_unsupported_storage`