Revert "client channel: don't hold mutexes while calling the ConfigSelector or the LB picker"#32324

Merged
ctiller merged 1 commit intomasterfrom
revert-31973-client_channel_picker_mutex
Feb 8, 2023
Merged

Revert "client channel: don't hold mutexes while calling the ConfigSelector or the LB picker"#32324
ctiller merged 1 commit intomasterfrom
revert-31973-client_channel_picker_mutex

Conversation

@markdroth
Member

Reverts #31973 due to test flakiness

@markdroth markdroth added the "release notes: no" label (Indicates if PR should not be in release notes) Feb 8, 2023
@markdroth markdroth requested a review from ctiller February 8, 2023 16:24
@ctiller ctiller merged commit 90524b0 into master Feb 8, 2023
@copybara-service copybara-service bot added the "imported" label (Specifies if the PR has been imported to the internal repository) Feb 8, 2023
markdroth added a commit to markdroth/grpc that referenced this pull request Feb 8, 2023
@markdroth markdroth deleted the revert-31973-client_channel_picker_mutex branch February 8, 2023 17:10
markdroth added a commit that referenced this pull request Feb 16, 2023
…ConfigSelector or the LB picker (#32326)

Original attempt was #31973, reverted in #32324 due to test flakiness.

There were two problems causing test flakiness here.

The first problem was that, upon resolver error, we were dispatching an
async callback to re-process each of the queued picks *before* we
updated the channel's connectivity state. That meant the queued picks
might be re-processed in another thread before the new connectivity
state was set, so tests that expected the channel to report
TRANSIENT_FAILURE once RPCs failed might not see that state.
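
To make the ordering concrete, here is a minimal, self-contained sketch of the fixed sequencing. All names (`FakeChannel`, `OnResolverError`, etc.) are hypothetical and not the actual client channel code; the point is only that the connectivity state is committed before the queued picks are handed to another thread.

```cpp
// Hypothetical sketch; names are illustrative, not the real gRPC internals.
#include <functional>
#include <iostream>
#include <mutex>
#include <thread>
#include <utility>
#include <vector>

enum class State { kIdle, kTransientFailure };

struct FakeChannel {
  std::mutex mu;
  State state = State::kIdle;
  std::vector<std::function<void()>> queued_picks;

  // On resolver error: commit the new connectivity state *first*, then hand
  // the queued picks to another thread. A pick re-processed on that thread
  // therefore always observes TRANSIENT_FAILURE, never the stale state.
  std::thread OnResolverError() {
    std::vector<std::function<void()>> picks;
    {
      std::lock_guard<std::mutex> lock(mu);
      state = State::kTransientFailure;  // 1. update the channel state
      picks.swap(queued_picks);          // 2. grab the queued picks
    }
    // 3. re-process asynchronously, after the new state is already visible
    return std::thread([picks = std::move(picks)] {
      for (const auto& pick : picks) pick();
    });
  }
};

int main() {
  FakeChannel channel;
  channel.queued_picks.push_back([&] {
    std::lock_guard<std::mutex> lock(channel.mu);
    std::cout << "re-processed pick sees TRANSIENT_FAILURE: "
              << (channel.state == State::kTransientFailure) << std::endl;
  });
  channel.OnResolverError().join();
}
```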

The second problem affected the xDS ring hash tests, and it's a bit more
involved to explain.

We have an e2e test that simulates an aggregate cluster failover from a
primary cluster using ring_hash at startup. The primary cluster has two
addresses, both of which are unreachable when the client starts up, so
the client should immediately fail over to the secondary cluster, which
does have reachable endpoints. The test requires that no RPCs are failed
while this failover occurs. The original PR made this test flaky.

The problem here was caused by a combination of two factors:

1. Prior to the original PR, when the picker was updated (which happens
inside the WorkSerializer), we re-processed previously queued picks
synchronously, so it was not possible for another subchannel
connectivity state update (which also happens in the WorkSerializer) to
be processed between the time that we updated the picker and the time
that we re-processed the previously queued picks. The original PR
changed this such that the queued picks are re-processed asynchronously
(outside of the WorkSerializer), so it is now possible for a subchannel
connectivity state update to be processed between when the picker is
updated and when we re-process the previously queued picks.

2. Unlike most LB policies, where the picker does not see updated
subchannel connectivity states until a new picker is created, the
ring_hash picker gets the subchannel connectivity states from the LB
policy via a lock, so it can wind up seeing the new states before it
gets updated. This means that when a subchannel connectivity state
update is processed by the ring_hash policy in the WorkSerializer, it
will immediately be seen by the existing picker, even without a picker
update. (See the sketch after this list.)
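
Here is a minimal sketch of the shared-state pattern described in point 2. All names are hypothetical, not the actual ring_hash code; it only illustrates how a picker that reads the live state owned by the LB policy under a mutex immediately observes state changes processed after the picker was handed out.

```cpp
// Hypothetical sketch; names are illustrative, not the real ring_hash code.
#include <iostream>
#include <map>
#include <memory>
#include <mutex>
#include <string>
#include <utility>

enum class State { kIdle, kConnecting, kTransientFailure };

// State owned by the LB policy and shared with every picker via a mutex.
struct SharedSubchannelStates {
  std::mutex mu;
  std::map<std::string, State> states;  // keyed by subchannel address
};

class RingHashPicker {
 public:
  explicit RingHashPicker(std::shared_ptr<SharedSubchannelStates> shared)
      : shared_(std::move(shared)) {}

  // Reads whatever state the LB policy has *right now*, not the state that
  // existed when this picker was created.
  State StateOf(const std::string& address) {
    std::lock_guard<std::mutex> lock(shared_->mu);
    return shared_->states[address];
  }

 private:
  std::shared_ptr<SharedSubchannelStates> shared_;
};

int main() {
  auto shared = std::make_shared<SharedSubchannelStates>();
  shared->states["backend"] = State::kConnecting;
  RingHashPicker picker(shared);  // picker handed out while CONNECTING
  // A later connectivity update in the LB policy...
  shared->states["backend"] = State::kTransientFailure;
  // ...is immediately visible to the old picker, without any picker update.
  std::cout << (picker.StateOf("backend") == State::kTransientFailure)
            << std::endl;
}
```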

With those two points in mind, the sequence of events in the failing
test was as follows:

1. The pick is attempted in the ring_hash picker for the primary
cluster. This causes the first subchannel to attempt to connect.
2. The subchannel transitions from IDLE to CONNECTING. A new picker is
returned due to the subchannel connectivity state change, and the
channel retries the queued pick. The retried pick is done
asynchronously, but in this case it does not matter: the call will be
re-queued.
3. The connection attempt fails, and the subchannel reports
TRANSIENT_FAILURE. A new picker is again returned, and the channel
retries the queued pick. The retried pick is done asynchronously, but in
this case it does not matter: this causes the picker to trigger a
connection attempt for the second subchannel.
4. The second subchannel transitions from IDLE to CONNECTING. A new
picker is again returned, and the channel retries the queued pick. The
retried pick is done asynchronously, and in this case it *does* matter.
5. The second subchannel now transitions to TRANSIENT_FAILURE. The
ring_hash policy will now report TRANSIENT_FAILURE, but before it can
finish that...
6. ...In another thread, the channel now tries to re-process the queued
pick using the CONNECTING picker from step 4. However, because the
ring_hash policy has already seen the TRANSIENT_FAILURE report from the
second subchannel, that picker will now fail the pick instead of queuing
it.

After discussion with @ejona86 and @dfawley (since this bug actually
exists in Java and Go as well), we agreed that the right solution is to
change the ring_hash picker to contain its own copy of the subchannel
connectivity state information, rather than sharing that information
with the LB policy using synchronization.
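
A corresponding sketch of the agreed fix direction (again with hypothetical names, not the actual implementation): the picker is constructed with its own copy of the subchannel states, so connectivity updates processed later by the LB policy only become visible once a new picker is produced.

```cpp
// Hypothetical sketch; names are illustrative, not the real implementation.
#include <iostream>
#include <map>
#include <string>
#include <utility>

enum class State { kIdle, kConnecting, kTransientFailure };

class RingHashPicker {
 public:
  // The picker takes a snapshot of the subchannel states at construction.
  explicit RingHashPicker(std::map<std::string, State> snapshot)
      : states_(std::move(snapshot)) {}

  State StateOf(const std::string& address) const {
    auto it = states_.find(address);
    return it == states_.end() ? State::kIdle : it->second;
  }

 private:
  const std::map<std::string, State> states_;  // owned copy, never mutated
};

int main() {
  std::map<std::string, State> lb_policy_states = {
      {"backend", State::kConnecting}};
  RingHashPicker picker(lb_policy_states);  // snapshot taken here
  // A later connectivity update in the LB policy...
  lb_policy_states["backend"] = State::kTransientFailure;
  // ...does not affect the existing picker, which still queues the pick.
  std::cout << (picker.StateOf("backend") == State::kConnecting) << std::endl;
}
```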
XuanWang-Amos pushed a commit to XuanWang-Amos/grpc that referenced this pull request May 1, 2023
wanlin31 pushed a commit that referenced this pull request May 18, 2023

Labels

imported (Specifies if the PR has been imported to the internal repository), lang/core, release notes: no (Indicates if PR should not be in release notes)

2 participants