This repository was archived by the owner on Nov 3, 2023. It is now read-only.
unhandled cuda error, NCCL version 2.7.8 #61
Closed
Description
Hi, I'm getting the following error when trying to run with RayPlugin:
E File "python/ray/_raylet.pyx", line 501, in ray._raylet.execute_task
E File "python/ray/_raylet.pyx", line 451, in ray._raylet.execute_task.function_executor
E File "/home/rizhiy/miniconda3/envs/ntf/lib/python3.8/site-packages/ray/_private/function_manager.py", line 563, in actor_method_executor
E return method(__ray_actor, *args, **kwargs)
E File "/home/rizhiy/miniconda3/envs/ntf/lib/python3.8/site-packages/ray_lightning/ray_ddp.py", line 41, in execute
E return fn(*args, **kwargs)
E File "/home/rizhiy/miniconda3/envs/ntf/lib/python3.8/site-packages/ray_lightning/ray_ddp.py", line 265, in execute_remote
E super(RayPlugin, self).new_process(
E File "/home/rizhiy/miniconda3/envs/ntf/lib/python3.8/site-packages/pytorch_lightning/plugins/training_type/ddp_spawn.py", line 168, in new_process
E self.configure_ddp()
E File "/home/rizhiy/miniconda3/envs/ntf/lib/python3.8/site-packages/pytorch_lightning/plugins/training_type/ddp_spawn.py", line 215, in configure_ddp
E self._model = DistributedDataParallel(
E File "/home/rizhiy/miniconda3/envs/ntf/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 496, in __init__
E dist._verify_model_across_ranks(self.process_group, parameters)
E RuntimeError: NCCL error in: /pytorch/torch/lib/c10d/ProcessGroupNCCL.cpp:911, unhandled cuda error, NCCL version 2.7.8
E ncclUnhandledCudaError: Call to CUDA function failed.
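For anyone hitting the same message: "unhandled cuda error" only says that a CUDA call inside NCCL failed, not which one. Below is a minimal diagnostic sketch, not a fix. It assumes that setting the documented `NCCL_DEBUG=INFO` variable in the environment of the worker processes will make NCCL log more detail about the failing call; how that variable reaches the Ray workers depends on your setup. The helper names `nccl_debug_env` and `describe_cuda` are hypothetical, introduced only for illustration.

```python
import os


def nccl_debug_env(base=None):
    """Return a copy of an environment mapping with verbose NCCL logging on.

    `base` defaults to the current process environment. The helper itself is
    hypothetical; NCCL_DEBUG is a documented NCCL environment variable.
    """
    env = dict(os.environ if base is None else base)
    env["NCCL_DEBUG"] = "INFO"
    return env


def describe_cuda():
    """Collect basic CUDA facts as seen by PyTorch, if it is installed."""
    try:
        import torch
    except ImportError:
        # torch is not installed in this interpreter; nothing more to report.
        return {"torch": None}
    return {
        "torch": torch.__version__,
        "cuda_build": torch.version.cuda,
        "cuda_available": torch.cuda.is_available(),
        "device_count": torch.cuda.device_count(),
    }


if __name__ == "__main__":
    print(nccl_debug_env().get("NCCL_DEBUG"))
    print(describe_cuda())
```

Running `describe_cuda()` inside the same conda environment as the failing job shows which CUDA build PyTorch was compiled against and how many devices it can see, which is useful to include in the report alongside the NCCL log.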