Skip to content

Dynamically check if expected IG is primary#23143

Closed
ericgribkoff wants to merge 2 commits intogrpc:masterfrom
ericgribkoff:determine_primary
Closed

Dynamically check if expected IG is primary#23143
ericgribkoff wants to merge 2 commits intogrpc:masterfrom
ericgribkoff:determine_primary

Conversation

@ericgribkoff
Copy link
Copy Markdown
Contributor

In certain test environments (and even, potentially, in prod) we may not connect to a xds server in the same region as the client. This will skew the primary/secondary calculation, which will then be based off the location of the xds server and not the client itself. Currently we unconditionally expect RPCs to initially be routed only to a MIG in the same zone as the client; when the above scenario occurs, this is no longer valid.

This PR changes the secondary_locality_* tests to first verify that the MIG we expect to be primary is actually primary, based on where traffic is assigned. This allows the test to continue to operate by, when necessary, swapping the roles of primary/secondary MIGs and validating that failover then works as expected.

@ericgribkoff ericgribkoff added lang/other release notes: no Indicates if PR should not be in release notes labels Jun 5, 2020
@ericgribkoff ericgribkoff requested a review from gnossen June 5, 2020 03:55
# primary. An unconditional wait is necessary since, prior to the patch,
# traffic can only go to primary_instance_group regardless of TD's
# assignment post-patch.
time.sleep(_WAIT_FOR_VALID_CONFIG_SEC)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this really the only way? It seems like this could introduce a flake. What if it takes longer than this for the config to propagate?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So this check is essentially trying to prove that, when the config propagates, traffic will not go to secondary_instance_group. Since that definitely holds before propagation and should continue to hold afterwards, there's not really anything we can wait for here...I guess we could exit early in the (uncommon) case where traffic does start going to secondary_instance_group and save some of the unconditional 60 second wait. But the case that occurs 99+% of the time, with all traffic continuing to go to the expected primary locality, does as best as I can tell require this unconditional waiting.

Hmm. Actually, maybe it would be preferable just to more-or-less wrap this entire test's asserts into a try-except block - so we begin with the assumption that the expected primary zone really is primary and, only if that fails (which will be very rare), fall back to check if the test would have passed if we swapped our primary/secondary expectation. This avoids the silly unconditional wait and keeps things "speedy" for the common case.

I'll try to rework things and update this PR.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the review and helpful feedback here! Closing this in favor of #23206 (sorry for the churn)

secondary_instance_names = get_instance_names(
gcp, secondary_zone_instance_group)
secondary_instance_group)
# Ensure the backend patch has propagated before verifying which IG is
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: 530-541 looks identical to 492-502.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

lang/other release notes: no Indicates if PR should not be in release notes

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants