This repository was archived by the owner on Nov 3, 2023. It is now read-only.
Fractional GPU training #124
Closed
Description
Hi guys, I just stumbled across this package recently and love it!
However, I get an error when trying to train with fractional GPUs. The last PR on this was #121.
It makes perfect sense that this error is thrown, but I do not know enough about ray or ray_lightning to understand local_rank.
Do you have an idea how to fix this?
  File "/home/gugl/miniconda3/envs/mwe_ray/lib/python3.8/site-packages/ray_lightning/ray_ddp.py", line 62, in execute
    return fn(*args, **kwargs)
  File "/home/gugl/miniconda3/envs/mwe_ray/lib/python3.8/site-packages/ray_lightning/ray_ddp.py", line 449, in execute_remote
    self._worker_setup(process_idx=global_rank)
  File "/home/gugl/miniconda3/envs/mwe_ray/lib/python3.8/site-packages/ray_lightning/ray_ddp.py", line 409, in _worker_setup
    self.torch_distributed_backend,
  File "/home/gugl/miniconda3/envs/mwe_ray/lib/python3.8/site-packages/pytorch_lightning/plugins/training_type/parallel.py", line 103, in torch_distributed_backend
    torch_backend = "nccl" if self.on_gpu else "gloo"
  File "/home/gugl/miniconda3/envs/mwe_ray/lib/python3.8/site-packages/pytorch_lightning/plugins/training_type/parallel.py", line 51, in on_gpu
    return self.root_device.type == "cuda" and torch.cuda.is_available()
  File "/home/gugl/miniconda3/envs/mwe_ray/lib/python3.8/site-packages/ray_lightning/ray_ddp.py", line 526, in root_device
    return torch.device("cuda", device_id)
(train_network pid=725884) TypeError: Device(): argument 'index' (position 2) must be int, not float
Possibly relevant pieces of code:
import pytorch_lightning as pl
from ray import tune
from ray_lightning import RayPlugin

def train_network(config):
    ...
    trainer = pl.Trainer(
        ...
        # One Ray worker that requests 2 CPUs and half a GPU.
        strategy=RayPlugin(num_workers=1, find_unused_parameters=False, resources_per_worker={"CPU": 2, "GPU": 0.5}),
    )
    trainer.fit(model, datamodule=mySimDataModule)

analysis = tune.run(
    train_network,
    ...
    # First bundle is for the trial driver, second bundle is for the single worker.
    resources_per_trial=tune.PlacementGroupFactory([{"CPU": 1}, {"CPU": 2, "GPU": 0.5}]),
)
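The traceback ends in `torch.device("cuda", device_id)` being called with a float, so whatever value reaches that call would have to be an integer device index. Below is a minimal sketch of that normalisation step. The helper `to_device_index` is hypothetical and not part of ray_lightning, and treating a non-integer value such as 0.5 as an error (because it is a resource amount, not a device index) is my assumption, not confirmed behaviour of the library.

```python
from typing import Union


def to_device_index(raw_id: Union[int, float, str]) -> int:
    """Convert a GPU identifier to the integer index a CUDA device needs.

    Hypothetical helper for illustration only; not part of ray_lightning.
    """
    if isinstance(raw_id, str):
        # int() raises ValueError for strings such as "0.5".
        index = int(raw_id.strip())
    elif isinstance(raw_id, float):
        # Accept 1.0 but reject 0.5: a fraction is a resource amount,
        # not a device index (assumption made for this sketch).
        if not raw_id.is_integer():
            raise ValueError(f"{raw_id!r} is not a valid device index")
        index = int(raw_id)
    else:
        index = int(raw_id)
    if index < 0:
        raise ValueError(f"device index must be non-negative, got {index}")
    return index


# The result could then be passed on, e.g. torch.device("cuda", index).
print(to_device_index(1.0))
print(to_device_index("0"))
```

Rejecting 0.5 outright, instead of truncating it to 0, keeps a mix-up between the requested GPU fraction and the device index visible rather than silently selecting GPU 0.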