Skip to content

Conversation

@shivaspeaks
Copy link
Member

@shivaspeaks shivaspeaks commented Oct 27, 2025

Copy link
Contributor

@kannanjgithub kannanjgithub left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sending what I could review today.

Copy link
Contributor

@kannanjgithub kannanjgithub left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review comments.

@shivaspeaks
Copy link
Member Author

Note: The failing test log shown here in github Actions is not failing in local even after clean run/build.

@kannanjgithub
Copy link
Contributor

kannanjgithub commented Nov 24, 2025 via email

@shivaspeaks
Copy link
Member Author

The purpose of gRFC A88 is to introduce onAmbientError to notify watchers of transient errors without them having to drop their cached data. If the underlying ControlPlaneClient hides the error and reports Status.OK, then XdsClientImpl would never know a transient failure occurred and would therefore never be able to call onAmbientError on the watchers. This would completely break this new mechanism.

@shivaspeaks
Copy link
Member Author

Note: I added a TODO in connect_then_mainServerDown_fallbackServerUp in XdsClientFallbackTest. Unable to write this test. Any suggestions on this would be appreciated.

@shivaspeaks
Copy link
Member Author

if (subscriber.hasResult() || !authoritiesForClosedCpc.contains(subscriber.authority)) {
    subscriber.onError(status, null);
    continue;
}

This above code block was the initial fix for the failing fallback test that was mentioned above.
But it introduced the flakiness in testTwoBadUrl() in XdsClientFallbackTest.java. The test became flaky because sometimes it might see the premature garbageUri1 error, and sometimes it might run long enough to also see the final garbageUri2 error, and ldsUpdateCaptor.getValue() would happen to hold the correct final value. The behavior was no longer deterministic.

The solution is to restructure the logic in the for loop to handle different subscriber states more clearly and deterministically. The updated logic is more clearer:

  1. If a subscriber already has data, it should get an ambient error.
  2. If a subscriber does not have data, the client should attempt to fall back.
  3. Only if the subscriber has no data AND all fallback attempts have failed should a definitive error be reported.

@shivaspeaks shivaspeaks changed the title xds: Changes to XdsClient Watcher APIs xds: gRFC A88 - Changes to XdsClient Watcher APIs Nov 30, 2025
@shivaspeaks shivaspeaks merged commit f385add into grpc:master Nov 30, 2025
17 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants