[Serve] Resolve deadlock when replica result is accessed from two event loops #59385
abrarsheikh merged 3 commits into master
Conversation
Code Review
This pull request effectively resolves a critical deadlock issue that occurred when awaiting chained DeploymentResponse objects from different event loops. The root cause was correctly identified as the non-thread-safe nature of asyncio.Lock across event loops. The fix, which replaces it with a non-blocking acquire pattern on a threading.Lock, is a robust and well-implemented solution for cross-loop synchronization. The new logic in to_object_ref_async is sound, and the comprehensive new test test_chained_deployment_response_await_order thoroughly validates the fix by checking various await orders. Overall, this is an excellent change that significantly improves the stability of deployment composition in Ray Serve. I have one suggestion for a minor code cleanup.
In the repro, why does the user event loop being blocked prevent the router loop from making progress to resolve b?
Discussed offline; here is a small toy example to demo the behavior:

```python
import asyncio
import threading
import time


def demo_the_actual_error():
    """
    Show the actual error that occurs.
    """
    print()
    print("=" * 70)
    print("DEMO 3: The actual error when using asyncio.Lock across loops")
    print("=" * 70)
    print()

    lock = asyncio.Lock()
    lock_acquired = threading.Event()

    def thread_a():
        async def hold_lock():
            async with lock:
                print("[Thread A] Acquired lock, holding for 1 second...")
                lock_acquired.set()
                await asyncio.sleep(1)
                print("[Thread A] Releasing lock...")

        loop = asyncio.new_event_loop()
        asyncio.set_event_loop(loop)
        loop.run_until_complete(hold_lock())
        loop.close()

    def thread_b():
        lock_acquired.wait()  # Wait for Thread A to acquire the lock
        time.sleep(0.1)  # Small delay to ensure we try to acquire after

        async def try_acquire():
            print("[Thread B] Trying to acquire lock...")
            async with lock:  # This will fail!
                print("[Thread B] Got the lock!")

        loop = asyncio.new_event_loop()
        asyncio.set_event_loop(loop)
        try:
            loop.run_until_complete(try_acquire())
        except RuntimeError as e:
            print(f"[Thread B] ERROR: {e}")
        finally:
            loop.close()

    t1 = threading.Thread(target=thread_a)
    t2 = threading.Thread(target=thread_b)
    t1.start()
    t2.start()
    t1.join()
    t2.join()


if __name__ == "__main__":
    demo_the_actual_error()
```
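On Python 3.10+, Thread B's `async with lock` raises a `RuntimeError` (the lock "is bound to a different event loop"), because asyncio primitives bind to the first loop that uses them; in the real Serve code path, the same cross-loop misuse manifested as the hang this PR fixes.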
### Summary

Fixes a deadlock that occurs when awaiting intermediate `DeploymentResponse` objects in a chain of deployment calls.

### Problem

The following code would hang indefinitely:

```python
a = self.deployment_a.remote(number)
b = self.deployment_b.remote(a)
c = self.deployment_c.remote(b)

true_b = await b  # HANGS
true_c = await c
```

Interestingly, reversing the await order (`await c` then `await b`) worked fine.

### Root Cause

When chained `DeploymentResponse` objects are used, the same `ReplicaResult` can be accessed from two different event loops concurrently:

1. **User's event loop (replica)**: when user code calls `await b`
2. **Router's event loop (singleton thread)**: when resolving `b` as an argument to `c`

Both code paths call `ReplicaResult.to_object_ref_async()`, which used an `asyncio.Lock` for synchronization. However, `asyncio.Lock` is **not thread-safe** and **not designed for cross-event-loop usage**, causing a deadlock.

### Fix

Replace the `asyncio.Lock` with a non-blocking acquire pattern using the existing thread-safe `threading.Lock`, as sketched below.
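As a minimal sketch of this pattern (illustrative names only, e.g. `ReplicaResultSketch` and `_resolve`; not the actual Ray Serve implementation):

```python
import asyncio
import threading


class ReplicaResultSketch:
    """Illustrative sketch of the cross-loop-safe locking pattern;
    not the actual Ray Serve ReplicaResult."""

    def __init__(self):
        # threading.Lock can be safely shared across threads and event
        # loops, unlike asyncio.Lock.
        self._lock = threading.Lock()
        self._object_ref = None

    async def _resolve(self):
        # Stand-in for the real work of turning the replica result into
        # an ObjectRef.
        await asyncio.sleep(0.1)
        return "object-ref"

    async def to_object_ref_async(self):
        # Acquire without blocking the event loop thread: if another loop
        # (or another task on this loop) holds the lock, yield and retry,
        # so the holder can always make progress and release it.
        while not self._lock.acquire(blocking=False):
            await asyncio.sleep(0.01)
        try:
            if self._object_ref is None:
                self._object_ref = await self._resolve()
            return self._object_ref
        finally:
            self._lock.release()
```

The key property is that no event loop is ever blocked while waiting for the lock, so whichever loop currently holds it can finish resolving and release it, regardless of which thread it runs on.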
### Testing

Added `test_chained_deployment_response_await_order`, which tests both await orders (`b_first` and `c_first`) to ensure they complete without hanging. Ran this test with all combinations of the `RAY_SERVE_RUN_USER_CODE_IN_SEPARATE_THREAD` and `RAY_SERVE_RUN_ROUTER_IN_SEPARATE_LOOP` env vars. Also tested against the repro in ray-project#54201.

### Performance

Used the same repro script as given in the GH issue, except with the await order switched (i.e. `await c` then `await b`) so that it can also run on master.

`locust --headless -u 100 -r 10 --host http://localhost:8000`

| Metric | Branch | Master | Notes |
| --- | --- | --- | --- |
| Requests (#reqs) | 4030 | 4465 | Master handled 10.8% more requests in same test window |
| Requests/sec | 80.39 | 80.83 | Nearly identical throughput |
| Avg latency (ms) | 1109 | 1118 | Essentially the same |
| Median (p50) latency | 650 ms | 700 ms | Branch slightly faster at p50 |
| p90 latency | 2200 ms | 2200 ms | Identical |
| p95 latency | 2200 ms | 2300 ms | Master slightly slower |
| p99 latency | 2400 ms | 2500 ms | Master slightly slower |
| Max latency | 2519 ms | 2760 ms | Master has higher tail |
| Failures | 0 | 0 | Both perfect |

Signed-off-by: abrar <abrar@anyscale.com>