
mempool: rework lock discipline to mitigate callback deadlocks #9030

Merged
creachadair merged 2 commits into v0.35.x from mjf/outer-lock on Jul 19, 2022

Conversation


@creachadair commented Jul 16, 2022

The priority mempool has a stricter synchronization requirement than the legacy
mempool. Under sufficiently-heavy load, exclusive access can lead to deadlocks
when processing a large batch of transaction rechecks through an out-of-process
application using the socket client.

By design, a socket client stalls when its send buffer fills, during which time
it holds a lock shared with the receive thread. While blocked in this state, a
response read by the receive thread waits for the shared lock so the callback
can be invoked.

If we're lucky, the server will then read the next request and make enough room
in the buffer for the sender to proceed. If not however (e.g., if the next
request is bigger than the one just consumed), the receive thread is blocked:
It is waiting on the lock and cannot read a response. Once the server's output
buffer fills, the system deadlocks.

This can happen with any sufficiently-busy workload, but is more likely during
a large recheck in the v1 mempool, where the callbacks need exclusive access to
mempool state. As a workaround, process rechecks for the priority mempool in
their own goroutines outside the mempool mutex. Responses still head-of-line
block, but will no longer get pushback due to contention on the mempool itself.

General notes

  • It appears that the ordering of responses in the socket client induces head-of-line blocking when multiple ABCI requests are made under the mempool mutex (e.g., flush + recheck). This doesn't seem to happen often enough to be a problem in small networks, but beyond a certain size it is basically inevitable.
  • We can't fix this by changing the socket client's ordering, since the legacy mempool requires a fixed response order to update correctly (specifically, to process rechecks). That means any workaround more or less has to live in the priority mempool (which is easier anyway).
  • Managing concurrency explicitly for rechecks appears to fix the problem.

@creachadair force-pushed the mjf/outer-lock branch 3 times, most recently from 29b4a2f to 2213c80 on July 17, 2022 01:52
creachadair pushed a commit that referenced this pull request Jul 18, 2022
(experimental)

This is a preliminary manual backport of #9030.

- Handle all locking internally (for v1 only).
- Process initial ABCI checks inline (synchronously).

Synchronous processing doesn't change the order of operations, since the local
client processes async calls immediately, and the socket client has to respond
in strict request order anyway.
creachadair pushed a commit that referenced this pull request Jul 18, 2022
@creachadair force-pushed the mjf/outer-lock branch 11 times, most recently from 5cb951a to 1b49aee on July 19, 2022 13:25
creachadair pushed a commit that referenced this pull request Jul 19, 2022
(manual backport of #9030)

creachadair pushed a commit that referenced this pull request Jul 19, 2022
@creachadair creachadair marked this pull request as ready for review July 19, 2022 14:10
@williambanfield (Contributor) left a comment:

This looks like a great solution to the problem of two async requests trying to grab the same locks.

@creachadair creachadair merged commit 22ed610 into v0.35.x Jul 19, 2022
@creachadair creachadair deleted the mjf/outer-lock branch July 19, 2022 20:28
creachadair pushed a commit that referenced this pull request Jul 19, 2022
…ort #9030)

(manual cherry-pick of commit 22ed610)

creachadair pushed a commit that referenced this pull request Jul 19, 2022
cmwaters pushed a commit that referenced this pull request Jul 20, 2022
cmwaters pushed a commit to cmwaters/tendermint that referenced this pull request Dec 12, 2022
