ready-cache: Avoid panic on strange race#420
Merged
Conversation
It's been observed that occasionally tower-ready-cache would panic trying to find an already canceled service in `cancel_pending_txs` (#415). The source of the race is not entirely clear, but extensive debugging demonstrated that occasionally a call to `evict` would send on the `CancelTx` for a service, yet that service would be yielded back from `pending` in `poll_pending` in a non-`Canceled` state. This is equivalent to saying that this code may panic: ```rust async { let (tx, rx) = oneshot::channel(); tx.send(42).unwrap(); yield_once().await; rx.try_recv().unwrap(); // <- may occasionally panic } ``` I have not been able to demonstrate a self-contained example failing in this way, but it's the only explanation I have found for the observed bug. Pinning the entire runtime to one core still produced the bug, which indicates that it is not a memory ordering issue. Replacing oneshot with `mpsc::channel(1)` still produced the bug, which indicates that the bug is not with the implementation of `oneshot`. Logs also indicate that the `ChannelTx` we send on in `evict()` truly is the same one associated with the `ChannelRx` polled in `Pending::poll`, so we're not getting our wires crossed somewhere. It truly is bizarre. This patch resolves the issue by considering a failure to find a ready/errored service's `CancelTx` as another signal that a service has been removed. Specifically, if `poll_pending` finds a service that returns `Ok` or `Err`, but does _not_ find its `CancelTx`, then it assumes that it must be because the service _was_ canceled, but did not observe that cancellation signal. As an explanation, this isn't entirely satisfactory, since we do not fully understand the underlying problem. It _may_ be that a canceled service could remain in the pending state for a very long time if it does not become ready _and_ does not see the cancellation signal (so it returns `Poll::Pending` and is not removed). That, in turn, might cause an issue if the driver of the `ReadyCache` then chooses to re-use a key they believe they have evicted. However, any such case _must_ first hit the panic that exists in the code today, so this is still an improvement over the status quo. Fixes #415.
olix0r
approved these changes
Feb 24, 2020
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
It's been observed that occasionally tower-ready-cache would panic
trying to find an already canceled service in
cancel_pending_txs(#415). The source of the race is not entirely clear, but extensive
debugging demonstrated that occasionally a call to
evictwould send onthe
CancelTxfor a service, yet that service would be yielded backfrom
pendinginpoll_pendingin a non-Canceledstate. Thisis equivalent to saying that this code may panic:
I have not been able to demonstrate a self-contained example failing in
this way, but it's the only explanation I have found for the observed
bug. Pinning the entire runtime to one core still produced the bug,
which indicates that it is not a memory ordering issue. Replacing
oneshot with
mpsc::channel(1)still produced the bug, which indicatesthat the bug is not with the implementation of
oneshot. Logs alsoindicate that the
ChannelTxwe send on inevict()truly is the sameone associated with the
ChannelRxpolled inPending::poll, so we'renot getting our wires crossed somewhere. It truly is bizarre.
This patch resolves the issue by considering a failure to find a
ready/errored service's
CancelTxas another signal that a service hasbeen removed. Specifically, if
poll_pendingfinds a service thatreturns
OkorErr, but does not find itsCancelTx, then itassumes that it must be because the service was canceled, but did not
observe that cancellation signal.
As an explanation, this isn't entirely satisfactory, since we do not
fully understand the underlying problem. It may be that a canceled
service could remain in the pending state for a very long time if it
does not become ready and does not see the cancellation signal (so it
returns
Poll::Pendingand is not removed). That, in turn, might causean issue if the driver of the
ReadyCachethen chooses to re-use a keythey believe they have evicted. However, any such case must first hit
the panic that exists in the code today, so this is still an improvement
over the status quo.
Fixes #415.