[core] Release resources only after tasks have stopped executing #53660
edoakes merged 5 commits into ray-project:master
Conversation
@edoakes Can you please review the test? This is how I have tried to repro the resource oversubscription issue due to early resource release.
Force-pushed from cd69568 to 177840a
python/ray/tests/test_scheduling.py
@edoakes can this API be stale? This queries the GCS for available resources; is it possible that the task has already started executing but the GCS has not been updated yet?
You're right; there's no guarantee that this API returns the updated value immediately. It will be updated once the raylet broadcasts its resource usage to the GCS.
So we need to wrap this in a `wait_for_condition` (or drop the check, since it's non-essential).
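The pattern being suggested looks roughly like this. `wait_for_condition` below is a simplified stand-in for Ray's test utility of the same name (in `ray._private.test_utils`), shown here so the sketch is self-contained:

```python
import time

def wait_for_condition(condition, timeout=10, retry_interval_ms=100):
    """Simplified stand-in for ray._private.test_utils.wait_for_condition:
    poll `condition` until it returns True or `timeout` seconds elapse."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if condition():
            return
        time.sleep(retry_interval_ms / 1000)
    raise TimeoutError("condition not met within timeout")

# In the test, instead of asserting the (possibly stale) GCS view once:
#     assert ray.available_resources().get("CPU", 0) == 1
# poll until the raylet's resource broadcast has reached the GCS:
#     wait_for_condition(lambda: ray.available_resources().get("CPU", 0) == 1)
```

This turns a point-in-time assertion on eventually consistent state into a bounded poll, which is why it removes the flakiness without hiding a real failure (a genuine bug still times out).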
Actually, that's the reason I had added some sleep. But now I've modified the test to assert the actual behavior using SignalActors instead of asserting resource accounting that depends on these APIs.
python/ray/tests/test_scheduling.py
I think this test is getting too specific to Ray's implementation and will therefore be brittle.
What you actually want to enforce is that task2 does not start executing before task1 exits.
This is a good point -- a better way to write the test.
Thanks for the suggestion of using SignalActor. I have modified the test as follows:
- Start task1 and wait for it to signal it's running
- Submit task2 (should be queued)
- Try to wait for task2 to start with 1-second timeout:
- No timeout: Bug! Task2 started immediately --> cleanup and fail
- Timeout occurs: Correct! Task2 is queued --> continue
- Complete task1 and assert expected result
- Wait for task2 to start (should happen immediately now)
- Complete task2 and assert expected result
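The ordering property the steps above enforce can be sketched without Ray at all. Below, `threading.Event` stands in for SignalActor and a semaphore models the single CPU slot; all names are illustrative, not Ray's actual test code:

```python
import threading

cpu_slot = threading.Semaphore(1)   # models the 1-CPU cluster
task1_running = threading.Event()   # SignalActor stand-ins
task1_may_exit = threading.Event()
task2_running = threading.Event()

def task(running, may_exit=None):
    with cpu_slot:                  # resource is held until the task exits
        running.set()
        if may_exit is not None:
            may_exit.wait()

t1 = threading.Thread(target=task, args=(task1_running, task1_may_exit))
t2 = threading.Thread(target=task, args=(task2_running,))
t1.start()
task1_running.wait()                  # 1. task1 signals it's running
t2.start()                            # 2. task2 submitted, should queue
# 3. task2 must NOT start while task1 still holds the slot
assert not task2_running.wait(timeout=1), "oversubscription: task2 started early"
task1_may_exit.set()                  # 4. let task1 finish
t1.join()
assert task2_running.wait(timeout=5)  # 5. task2 starts promptly now
t2.join()                             # 6. task2 completes
```

`Event.wait(timeout)` returns `False` on timeout, so the first assertion passes exactly when task2 stayed queued; the same shape, with SignalActors, is what the real test asserts against Ray's scheduler.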
Force-pushed from 177840a to a25a462
Force-pushed from 21336f7 to 5f48512
## Why are these changes needed?

In `CoreWorker::Exit()`, the code calls:
https://github.com/ray-project/ray/blob/c54437c42fa138580a0367f813b8c4bd9ca0b3e8/src/ray/core_worker/core_worker.cc#L1169

This tells the raylet to immediately release all resources that were allocated to this worker. However, the worker may still have tasks running at this point in the exit sequence.

I have added a test to reproduce the issue; without the fix, it reproduces about 2 out of 5 times locally. In the test:

0. Ray init with only 1 CPU.
1. Start task1 and wait for it to signal it's running.
2. Submit task2 (should be queued).
3. Try to wait for task2 to start with a 1-second timeout:
   - No timeout: Bug! Task2 started immediately --> cleanup and fail.
   - Timeout occurs: Correct! Task2 is queued --> continue.
4. Complete task1 and assert the expected result.
5. Wait for task2 to start (should happen immediately now).
6. Complete task2 and assert the expected result.

With the fix, the test always passes (no oversubscription detected).
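The bug is purely an ordering problem in the exit sequence: resources were returned to the raylet before in-flight tasks had drained. A hypothetical sketch of the two orderings (the method names below are illustrative, not Ray's actual `CoreWorker` API):

```python
# Hypothetical sketch of the CoreWorker::Exit() ordering; not Ray's real code.

def exit_buggy(worker):
    worker.raylet.release_resources(worker)  # BUG: tasks may still be running,
    worker.drain_running_tasks()             # so the scheduler can hand the same
    worker.shutdown()                        # CPU to another task (oversubscription)

def exit_fixed(worker):
    worker.drain_running_tasks()             # wait until tasks have stopped executing
    worker.raylet.release_resources(worker)  # now safe: nothing is using the resources
    worker.shutdown()
```

With the fixed ordering, the raylet only sees the CPU as free once no task on this worker can still be using it, which is exactly what the test's step 3 checks.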