Description
I am in the final stages of a project I've been working on for a while in RLlib. When I try to train my model on the GPU (using the Tune API with config["num_gpus"] = 1), I can't get it to run without throwing errors.
Specifically, when I try to train my agent, I get an error thrown from here (Line 157), essentially telling me that len(self.devices) is 0 and that no GPUs are being detected.
Initially I thought it was because my GPU was not set up to work with PyTorch (the framework I am using for this project), but after running a simple test with torch.cuda.is_available(), torch.cuda.device(0), and torch.cuda.get_device_name(0), I can see that my GPU (an RTX 2060 Max-Q, for reference) is recognized by Torch.
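For anyone who wants to reproduce the sanity check I ran, here is a minimal sketch (the diagnostic-style output format is my own; the torch.cuda calls are the standard PyTorch API):

```python
# Minimal CUDA sanity check for PyTorch.
# Prints one "check:" line regardless of environment, so it is safe
# to run on machines without a GPU or without torch installed.
try:
    import torch
except ImportError:
    print("check: torch-missing")
else:
    if torch.cuda.is_available():
        # Device 0 is the first visible GPU (subject to CUDA_VISIBLE_DEVICES).
        print(f"check: cuda-ok ({torch.cuda.get_device_name(0)})")
    else:
        print("check: no-cuda")
```

If this prints the name of your GPU but RLlib still reports zero devices, the problem is likely in how Ray assigns GPU resources to the trainer process rather than in PyTorch itself.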
Has anyone encountered this error before, and are there any workarounds? I saw a suggestion to remove config["num_gpus"] = 1 from the tune.run config (https://www.gitmemory.com/issue/ray-project/ray/16459/862005565), but that just causes the PyTorch policies to run on my CPU (where they do train properly), which is not what I want.
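For context, my setup is essentially the standard Tune/RLlib pattern below (a simplified sketch; "PPO" and "CartPole-v0" stand in for my actual algorithm and environment):

```python
# Sketch of the tune.run call that triggers the error.
# "PPO" and "CartPole-v0" are placeholders for my actual setup.
from ray import tune

tune.run(
    "PPO",
    config={
        "env": "CartPole-v0",
        "framework": "torch",   # PyTorch policies
        "num_gpus": 1,          # removing this line avoids the error,
                                # but training then falls back to CPU
    },
)
```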
Thanks for your help