Closed
Labels
P1: Issue that should be fixed within a few weeks
bug: Something that is supposed to be working, but isn't
Description
What is the problem?
Reported from #10739.
When running the script below, GPU resources are never returned. There are a couple of interesting behaviors:
- We have local references from our internal methods.
- When we move `X_id` and `int_ref` into the task, or when we pass them as arguments, the error disappears.
- When we remove `max_calls`, the error disappears.
It might be an unknown bug in reference counting, but I am not sure.
Reproduction (REQUIRED)
Please provide a script that can be run to reproduce the issue. The script should have no external library dependencies (i.e., use fake or mock data / environments):
import time

import numpy as np
import ray

ray.init(num_cpus=2, num_gpus=1)
sleep_seconds = 5

# Objects created in the driver and captured by the tasks below.
X = np.zeros(int(1e6))
X_id = ray.put(X)
int_ref = ray.put(11)

@ray.remote(num_cpus=1, num_gpus=0.25, max_calls=1)
def do_thing():
    time.sleep(sleep_seconds)
    return len(ray.get(X_id))

@ray.remote(num_cpus=1, num_gpus=0.25, max_calls=1)
def do_sum():
    time.sleep(sleep_seconds)
    return ray.get(int_ref) + 20

for i in range(3):
    print(ray.get(do_thing.remote()))
    print(ray.get(do_sum.remote()))
    # The GPU count is expected to return to 1 after each round.
    print(i, ray.available_resources())
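For comparison, here is a minimal sketch of the variant mentioned above, in which the object refs are passed as task arguments instead of being captured from the driver. The names `gpu_released`, `run_with_refs_as_args`, and `x_ref` are hypothetical and not part of the original report. The claim that this variant avoids the leak comes from the description above and has not been re-verified here.

```python
import time


def gpu_released(available, total_gpus):
    """Return True if every GPU is reported as available again.

    `available` is a dict shaped like the result of ray.available_resources().
    A missing "GPU" key is treated as zero available GPUs.
    """
    return abs(available.get("GPU", 0.0) - total_gpus) < 1e-9


def run_with_refs_as_args(rounds=3, sleep_seconds=5, total_gpus=1):
    """Hypothetical variant of the reproduction: refs are passed as arguments."""
    import numpy as np
    import ray  # imported lazily so gpu_released() is usable without Ray

    ray.init(num_cpus=2, num_gpus=total_gpus)
    x_ref = ray.put(np.zeros(int(1e6)))
    int_ref = ray.put(11)

    @ray.remote(num_cpus=1, num_gpus=0.25, max_calls=1)
    def do_thing(x):
        # Ray resolves an ObjectRef passed as a top-level argument to its value.
        time.sleep(sleep_seconds)
        return len(x)

    @ray.remote(num_cpus=1, num_gpus=0.25, max_calls=1)
    def do_sum(n):
        time.sleep(sleep_seconds)
        return n + 20

    results = []
    for i in range(rounds):
        print(ray.get(do_thing.remote(x_ref)))
        print(ray.get(do_sum.remote(int_ref)))
        available = ray.available_resources()
        released = gpu_released(available, total_gpus)
        print(i, available, "GPU released:", released)
        results.append(released)
    ray.shutdown()
    return results
```

Calling `run_with_refs_as_args()` on a machine with Ray installed returns one boolean per round. The reported resources may lag briefly after a task finishes, so a single `False` is weaker evidence of a leak than a GPU count that never recovers.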
- I have verified my script runs in a clean environment and reproduces the issue.
- I have verified the issue also occurs with the latest wheels.
