This repository was archived by the owner on Nov 3, 2023. It is now read-only.
the training results can be pulled to the main process #162
Closed
Description
These are the weights from ddp-spawn:
ic| spawn_output: _SpawnOutput(best_model_path='./lightning_logs/version_53/checkpoints/epoch=0-step=10.ckpt', weights_path='./.temp.ckpt', trainer_state=TrainerState(status=<TrainerStatus.FINISHED: 'finished'>, fn=<TrainerFn.FITTING: 'fit'>, stage=None, _fault_tolerant_mode=<_FaultTolerantMode.DISABLED: 'disabled'>), trainer_results=None, extra=[{'val_loss': array(1., dtype=float32)}])
These are the weights from ray_ddp:
None
This is because ddp-spawn collects the rank-zero results after running the function:
https://github.com/Lightning-AI/lightning/blob/master/src/pytorch_lightning/strategies/launchers/spawn.py#L104-L105
results = function(*args, **kwargs)
if trainer is not None:
    results = self._collect_rank_zero_results(trainer, results)
The output is:
ic| results: None, 'raw'
ic| results: _SpawnOutput(best_model_path='./lightning_logs/version_57/checkpoints/epoch=0-step=10.ckpt', weights_path='./.temp.ckpt', trainer_state=TrainerState(status=<TrainerStatus.FINISHED: 'finished'>, fn=<TrainerFn.FITTING: 'fit'>, stage=None, _fault_tolerant_mode=<_FaultTolerantMode.DISABLED: 'disabled'>), trainer_results=None, extra=[{'val_loss': array(1., dtype=float32)}])
'2nd handed': '2nd handed'
ic| spawn_output: _SpawnOutput(best_model_path='./lightning_logs/version_57/checkpoints/epoch=0-step=10.ckpt', weights_path='./.temp.ckpt', trainer_state=TrainerState(status=<TrainerStatus.FINISHED: 'finished'>, fn=<TrainerFn.FITTING: 'fit'>, stage=None, _fault_tolerant_mode=<_FaultTolerantMode.DISABLED: 'disabled'>), trainer_results=None, extra=[{'val_loss': array(1., dtype=float32)}])
On the other hand, for ray ddp, the output is:
(RayExecutor pid=7048) socket.gethostbyname(socket.gethostname()): '10.0.2.160'
(RayExecutor pid=7048) ic| results: None, '1st import'
(RayExecutor pid=7048) _SpawnOutput(best_model_path='', weights_path=None, trainer_state=TrainerState(status=<TrainerStatus.INITIALIZING: 'initializing'>, fn=None, stage=None, _fault_tolerant_mode=<_FaultTolerantMode.DISABLED: 'disabled'>), trainer_results=None, extra=[{}])
(RayExecutor pid=7048) '2nd import': '2nd import'
This is still because the trainer here is only a copy, so the state set during fitting does not reach the trainer in the main process.
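To illustrate the "only a copy" point, here is a minimal sketch that does not use Ray or PyTorch Lightning at all. `ToyTrainer`, `worker_fit`, and `apply_results` are hypothetical names made up for this example, and `copy.deepcopy` stands in for whatever serialization hands the trainer to a remote worker (an assumption, not the library's actual mechanism). It only shows that state set on the worker's copy stays there unless the worker returns it and the main process applies it.

```python
import copy
from typing import Any, Dict, Optional


class ToyTrainer:
    """Hypothetical stand-in for a trainer; NOT the PyTorch Lightning class."""

    def __init__(self) -> None:
        self.status = "initializing"
        self.best_model_path = ""
        self.metrics: Dict[str, Any] = {}

    def fit(self) -> None:
        # Pretend to train: mutate state on whichever object this is called on.
        self.status = "finished"
        self.best_model_path = "./checkpoints/epoch=0-step=10.ckpt"
        self.metrics = {"val_loss": 1.0}


def worker_fit(trainer: ToyTrainer, collect: bool) -> Optional[Dict[str, Any]]:
    """Simulate a remote worker that only ever sees a copy of the trainer."""
    # Assumption: deepcopy models the trainer being shipped to another process.
    trainer_copy = copy.deepcopy(trainer)
    trainer_copy.fit()
    if not collect:
        # Nothing is sent back, so the caller's trainer never learns the results.
        return None
    # Package the worker-side state so the caller can restore it.
    return {
        "status": trainer_copy.status,
        "best_model_path": trainer_copy.best_model_path,
        "metrics": dict(trainer_copy.metrics),
    }


def apply_results(trainer: ToyTrainer, results: Dict[str, Any]) -> None:
    """Copy the collected worker-side state onto the main-process trainer."""
    trainer.status = results["status"]
    trainer.best_model_path = results["best_model_path"]
    trainer.metrics = results["metrics"]


main_trainer = ToyTrainer()

# Without collection: the copy was fitted, the original is untouched.
no_results = worker_fit(main_trainer, collect=False)
status_without_collect = main_trainer.status

# With collection: the worker returns its state and the caller applies it.
results = worker_fit(main_trainer, collect=True)
apply_results(main_trainer, results)
status_with_collect = main_trainer.status
```

In this sketch the first call leaves `main_trainer` in its initial state, which mirrors the `INITIALIZING` status seen in the ray ddp output, while the second call mirrors what `_collect_rank_zero_results` appears to do for ddp-spawn.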