Dynamically check if expected IG is primary#23143
Dynamically check if expected IG is primary#23143ericgribkoff wants to merge 2 commits intogrpc:masterfrom
Conversation
| # primary. An unconditional wait is necessary since, prior to the patch, | ||
| # traffic can only go to primary_instance_group regardless of TD's | ||
| # assignment post-patch. | ||
| time.sleep(_WAIT_FOR_VALID_CONFIG_SEC) |
There was a problem hiding this comment.
Is this really the only way? It seems like this could introduce a flake. What if it takes longer than this for the config to propagate?
There was a problem hiding this comment.
So this check is essentially trying to prove that, when the config propagates, traffic will not go to secondary_instance_group. Since that definitely holds before propagation and should continue to hold afterwards, there's not really anything we can wait for here...I guess we could exit early in the (uncommon) case where traffic does start going to secondary_instance_group and save some of the unconditional 60 second wait. But the case that occurs 99+% of the time, with all traffic continuing to go to the expected primary locality, does as best as I can tell require this unconditional waiting.
Hmm. Actually, maybe it would be preferable just to more-or-less wrap this entire test's asserts into a try-except block - so we begin with the assumption that the expected primary zone really is primary and, only if that fails (which will be very rare), fall back to check if the test would have passed if we swapped our primary/secondary expectation. This avoids the silly unconditional wait and keeps things "speedy" for the common case.
I'll try to rework things and update this PR.
There was a problem hiding this comment.
Thanks for the review and helpful feedback here! Closing this in favor of #23206 (sorry for the churn)
| secondary_instance_names = get_instance_names( | ||
| gcp, secondary_zone_instance_group) | ||
| secondary_instance_group) | ||
| # Ensure the backend patch has propagated before verifying which IG is |
There was a problem hiding this comment.
Nit: 530-541 looks identical to 492-502.
In certain test environments (and even, potentially, in prod) we may not connect to a xds server in the same region as the client. This will skew the primary/secondary calculation, which will then be based off the location of the xds server and not the client itself. Currently we unconditionally expect RPCs to initially be routed only to a MIG in the same zone as the client; when the above scenario occurs, this is no longer valid.
This PR changes the secondary_locality_* tests to first verify that the MIG we expect to be primary is actually primary, based on where traffic is assigned. This allows the test to continue to operate by, when necessary, swapping the roles of primary/secondary MIGs and validating that failover then works as expected.