Fix deadlock in PipelineExecutor downscaling logic #86089
I've discovered that a similar problem may occur with preempted threads. For example, a query might have 6 threads, 2 of which are preempted and waiting either for the 1-second preemption timeout to expire (and then downscale) or for a newly granted slot to continue working. In the meantime, the other 4 threads do the rest of the work and finish by going to the idle state. This leaves the query hanging, waiting for its preempted threads. In this PR, I will add the logic to properly wake and shut down preempted threads when the query finishes, and update the pipeline shutdown condition to take preempted threads into account.
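To illustrate the wake-up behavior described in this comment, here is a minimal sketch of a preempted thread's wait. It is not the actual PipelineExecutor code: the class `PreemptionGateSketch`, its methods, and the `Outcome` enum are hypothetical names invented for this example. The point is only that the wait should end on any of three events (timeout, granted slot, query finished), and that finishing the query must notify every waiter.

```cpp
#include <chrono>
#include <condition_variable>
#include <cstddef>
#include <mutex>

// Hypothetical, simplified model; names are not taken from the ClickHouse sources.
class PreemptionGateSketch
{
public:
    enum class Outcome { Resumed, Downscaled, Finished };

    // Called by a preempted thread. It waits until the query finishes,
    // a slot is granted, or the preemption timeout expires.
    Outcome waitPreempted(std::chrono::milliseconds timeout)
    {
        std::unique_lock<std::mutex> lock(mutex);
        bool woken = cv.wait_for(lock, timeout, [this] { return finished || granted_slots > 0; });
        if (finished)
            return Outcome::Finished;   // query is done: shut down, do not keep waiting
        if (woken)
        {
            --granted_slots;            // consume the slot and continue working
            return Outcome::Resumed;
        }
        return Outcome::Downscaled;     // timeout expired: the thread leaves the query
    }

    // A new slot was granted to the query.
    void grantSlot()
    {
        {
            std::lock_guard<std::mutex> lock(mutex);
            ++granted_slots;
        }
        cv.notify_one();
    }

    // The query finished: wake all preempted threads so none is left waiting.
    void finish()
    {
        {
            std::lock_guard<std::mutex> lock(mutex);
            finished = true;
        }
        cv.notify_all();
    }

private:
    std::mutex mutex;
    std::condition_variable cv;
    std::size_t granted_slots = 0;
    bool finished = false;
};
```

In this sketch, a thread that calls `waitPreempted` after `finish` returns `Outcome::Finished` immediately instead of sitting out the full timeout.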
This reverts commit e2e17cd.
There are too many failed tests, but all seem unrelated. Let's rerun one more time.
alesapin left a comment
It would be nice to have a test...
Okay, I think I can add one with a reduced preemption timeout to trigger the issue more readily.
The sanitizer has found a related issue. Investigating... UPD: I was not careful enough and introduced the following data race.
Cherry pick #86089 to 25.8: Fix deadlock in PipelineExecutor downscaling logic
Backport #86089 to 25.8: Fix deadlock in PipelineExecutor downscaling logic
The pipeline shutdown condition is that the number of idle threads equals the total number of threads and there is no more work. This condition was checked only when a thread was transitioning into the idle state (i.e., putting itself into the threads_queue). However, with preemption and downscaling logic, the total number of threads can also decrease dynamically, which may also satisfy the pipeline's shutdown condition. Without this fix, the pipeline hangs.
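The condition above can be modeled with a few counters. The following is a minimal single-threaded sketch, not the real executor: `PipelineStateSketch` and all of its members are hypothetical names, and the `recheck` flag exists only to contrast the old and fixed behavior.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical, simplified model; names are not taken from the ClickHouse sources.
struct PipelineStateSketch
{
    std::size_t total_threads = 0;  // threads currently assigned to the query
    std::size_t idle_threads = 0;   // threads parked in the idle queue
    std::size_t pending_tasks = 0;  // tasks not yet picked up
    bool finished = false;

    // Shutdown condition: every remaining thread is idle and no work is left.
    bool shouldFinish() const
    {
        return pending_tasks == 0 && idle_threads == total_threads;
    }

    // A thread runs out of work and parks itself.
    // Before the fix, this was the only place where the condition was checked.
    void becomeIdle()
    {
        assert(idle_threads < total_threads);
        ++idle_threads;
        if (shouldFinish())
            finished = true;
    }

    // A running (non-idle) thread leaves the query because of downscaling.
    // recheck == false models the old behavior; recheck == true models the fix.
    void downscale(bool recheck)
    {
        assert(idle_threads < total_threads);
        --total_threads;
        if (recheck && shouldFinish())
            finished = true;
    }
};
```

With two threads and no pending tasks, calling `becomeIdle()` and then `downscale(false)` leaves `shouldFinish()` true but `finished` false, which corresponds to the hang. Calling `downscale(true)` instead sets `finished`.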
It is hard to add a test for this change because the issue is rare. It happens only when a thread is downscaled just after it executes the last task of the whole pipeline. Existing tests cover this code path, but downscales are rare in these tests.
Changelog category (leave one):