This repository was archived by the owner on Nov 3, 2023. It is now read-only.

the training results can be pulled to the main process #162

@JiahaoYao


This is the output (including the weights path) from ddp-spawn:

ic| spawn_output: _SpawnOutput(best_model_path='./lightning_logs/version_53/checkpoints/epoch=0-step=10.ckpt', weights_path='./.temp.ckpt', trainer_state=TrainerState(status=<TrainerStatus.FINISHED: 'finished'>, fn=<TrainerFn.FITTING: 'fit'>, stage=None, _fault_tolerant_mode=<_FaultTolerantMode.DISABLED: 'disabled'>), trainer_results=None, extra=[{'val_loss': array(1., dtype=float32)}])

This is the corresponding output from ray_ddp:

None

This is because ddp-spawn collects the rank-zero results right after calling the function:
https://github.com/Lightning-AI/lightning/blob/master/src/pytorch_lightning/strategies/launchers/spawn.py#L104-L105

        results = function(*args, **kwargs)

        if trainer is not None:
            results = self._collect_rank_zero_results(trainer, results)

The output is:

ic| results: None, 'raw'
ic| results: _SpawnOutput(best_model_path='./lightning_logs/version_57/checkpoints/epoch=0-step=10.ckpt', weights_path='./.temp.ckpt', trainer_state=TrainerState(status=<TrainerStatus.FINISHED: 'finished'>, fn=<TrainerFn.FITTING: 'fit'>, stage=None, _fault_tolerant_mode=<_FaultTolerantMode.DISABLED: 'disabled'>), trainer_results=None, extra=[{'val_loss': array(1., dtype=float32)}])
    '2nd handed': '2nd handed'
ic| spawn_output: _SpawnOutput(best_model_path='./lightning_logs/version_57/checkpoints/epoch=0-step=10.ckpt', weights_path='./.temp.ckpt', trainer_state=TrainerState(status=<TrainerStatus.FINISHED: 'finished'>, fn=<TrainerFn.FITTING: 'fit'>, stage=None, _fault_tolerant_mode=<_FaultTolerantMode.DISABLED: 'disabled'>), trainer_results=None, extra=[{'val_loss': array(1., dtype=float32)}])

For ray_ddp, on the other hand, the output is:

https://github.com/JiahaoYao/ray_lightning/blob/2727fd441a62e0e6763fd1f25ed97575dc5a6733/ray_lightning/ray_ddp.py#L252-L255

(RayExecutor pid=7048)     socket.gethostbyname(socket.gethostname()): '10.0.2.160'
(RayExecutor pid=7048) ic| results: None, '1st import'
(RayExecutor pid=7048) _SpawnOutput(best_model_path='', weights_path=None, trainer_state=TrainerState(status=<TrainerStatus.INITIALIZING: 'initializing'>, fn=None, stage=None, _fault_tolerant_mode=<_FaultTolerantMode.DISABLED: 'disabled'>), trainer_results=None, extra=[{}])
(RayExecutor pid=7048)     '2nd import': '2nd import'

This is still because the trainer here is only a copy, so its state (status `INITIALIZING`, empty `best_model_path`) does not reflect the training run.
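A minimal sketch of the copy problem, not the actual ray_lightning or PyTorch Lightning code. `ToyTrainer`, `ToyOutput`, `collect_results`, and `remote_fit` are hypothetical names made up for illustration, and `copy.deepcopy` stands in for the serialization that happens when the trainer is shipped to a worker. It shows that results collected from the original trainer stay in their initial state, while results collected from the worker's copy and returned explicitly carry the training outcome.

```python
import copy
from dataclasses import dataclass


@dataclass
class ToyTrainer:
    """Hypothetical stand-in for a trainer; not the PyTorch Lightning class."""
    status: str = "initializing"
    best_model_path: str = ""


@dataclass
class ToyOutput:
    """Hypothetical stand-in for a result object such as _SpawnOutput."""
    status: str
    best_model_path: str


def collect_results(trainer: ToyTrainer) -> ToyOutput:
    # Snapshot whatever state the given trainer object currently holds.
    return ToyOutput(trainer.status, trainer.best_model_path)


def remote_fit(trainer: ToyTrainer) -> ToyOutput:
    # Assumption: a worker only ever sees a copy of the trainer.
    # deepcopy simulates the serialize/deserialize round trip.
    worker_trainer = copy.deepcopy(trainer)
    worker_trainer.status = "finished"
    worker_trainer.best_model_path = "checkpoints/example.ckpt"  # made-up path
    # Collect from the worker's copy and hand the snapshot back explicitly.
    return collect_results(worker_trainer)


main_trainer = ToyTrainer()
worker_output = remote_fit(main_trainer)

# Collecting from the main-process trainer sees none of the worker's updates.
stale_output = collect_results(main_trainer)

print("from main-process trainer:", stale_output)
print("returned by worker:       ", worker_output)
```

Under these assumptions, one way to get the results into the main process would be to collect them on the worker that actually ran the fit and pass them back as the return value, rather than reading them from the main-process trainer.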

#143
