Skip to content

xdsdepmgr: Avoid using actual DNS lookups and fix race during test failure#8805

Merged
eshitachandwani merged 9 commits into
grpc:masterfrom
eshitachandwani:aggregatetest
Jan 11, 2026
Merged

xdsdepmgr: Avoid using actual DNS lookups and fix race during test failure#8805
eshitachandwani merged 9 commits into
grpc:masterfrom
eshitachandwani:aggregatetest

Conversation

@eshitachandwani

@eshitachandwani eshitachandwani commented Jan 6, 2026

Copy link
Copy Markdown
Member

Fixes: #8801

There were 2 problems here :

  1. The actual DNS lookup calls return different resolved addresses for "localhost" on different machines. Changed the code to replace the DNS resolver with a manual resolver in test to mock the DNS resolver.

  2. In the case where the tests send a EDS or cluster error , the xds management server keeps sending resource error continuously which in turn triggers updates from dependency manager which pushes the update to a channel. To mitigate this situation we had a done channel , that we close when the test ends and the update function check for the done channel , if it is close, it returns. In TestAggregateCLusterChildError test , (and other tests too) , the channel was being closed at the end of the test , which meant it will close only when the test passes. And if the tests fails, the done channel was not getting closed , which made the update function to block.
    Changed the close(done channel) to be a defer function , but this should be the first thing that happens after the test is done , before the dependency manager closes, so this is the last defer called in the test.

Also fixes TestRouteConfigResourceError flaking due to a similar error.

RELEASE NOTES: None

@eshitachandwani eshitachandwani added this to the 1.79 Release milestone Jan 6, 2026
@eshitachandwani eshitachandwani added Type: Bug Area: xDS Includes everything xDS related, including LB policies used with xDS. labels Jan 6, 2026
@codecov

codecov Bot commented Jan 6, 2026

Copy link
Copy Markdown

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 83.29%. Comparing base (88ac703) to head (0fb3ee8).
⚠️ Report is 16 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #8805      +/-   ##
==========================================
- Coverage   83.30%   83.29%   -0.01%     
==========================================
  Files         418      417       -1     
  Lines       32897    32920      +23     
==========================================
+ Hits        27404    27420      +16     
- Misses       4093     4095       +2     
- Partials     1400     1405       +5     

see 48 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@easwars

easwars commented Jan 6, 2026

Copy link
Copy Markdown
Contributor

Were you able to reproduce this on your setup (either on a debian linux workstation or on forge) and ensure that the fix actually solves the problem?

Comment thread internal/xds/xdsdepmgr/xds_dependency_manager_test.go Outdated
Comment thread internal/xds/xdsdepmgr/xds_dependency_manager_test.go Outdated
Comment thread internal/xds/xdsdepmgr/xds_dependency_manager_test.go Outdated
@easwars easwars assigned eshitachandwani and unassigned easwars and arjan-bal Jan 6, 2026
@eshitachandwani

Copy link
Copy Markdown
Member Author

Were you able to reproduce this on your setup (either on a debian linux workstation or on forge) and ensure that the fix actually solves the problem?

Yes, I was able to reproduce the hang by making the test fail on purpose , and have verified that it is fixed with this change.

@arjan-bal arjan-bal left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, with some nits.

t.Fatalf("received unexpected error from dependency manager: %v", err)
}
dnsR := replaceDNSResolver(t)
dnsR.InitialState(resolver.State{

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: You can call UpdateState now; after last year's changes, it correctly buffers updates if the resolver hasn't been built yet. The InitialState method is now redundant.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

case err := <-watcher.errorCh:
t.Fatalf("received unexpected error from dependency manager: %v", err)
}
dnsR := replaceDNSResolver(t)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you please move the DNS resolver replacement near the top of the test, before any xDS configuration is sent? This would avoid potential races where the real DNS resolver gets used before replacement.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Comment on lines +1292 to +1301
{
Addresses: []resolver.Address{
{Addr: "127.0.0.1:8081"},
},
},
{
Addresses: []resolver.Address{
{Addr: "[::1]:8081"},
},
},

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Please revert the formatting changes.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@easwars easwars assigned eshitachandwani and unassigned easwars Jan 8, 2026
@eshitachandwani

eshitachandwani commented Jan 10, 2026

Copy link
Copy Markdown
Member Author

Also fixed TestRouteConfigResourceError failing due to a similar reason as mentioned in point 2 but sending error to error channel multiple times.
So , added a select with <-w.done case in Error() function similar to update function and made sure the w.done is closed as the first thing in TestRouteConfigResourceError .

@eshitachandwani eshitachandwani merged commit 629ef39 into grpc:master Jan 11, 2026
14 checks passed
mbissa pushed a commit to mbissa/grpc-go that referenced this pull request Feb 16, 2026
…ilure (grpc#8805)

Fixes: grpc#8801

There were 2 problems here : 
1. The actual DNS lookup calls return different resolved addresses for
"localhost" on different machines. Changed the code to replace the DNS
resolver with a manual resolver in test to mock the DNS resolver.

2. In the case where the tests send a EDS or cluster error , the xds
management server keeps sending resource error continuously which in
turn triggers updates from dependency manager which pushes the [update
to a
channel.](https://github.com/grpc/grpc-go/blob/bccbb10454782f18ecdfd60ca39ef413357edb0b/internal/xds/xdsdepmgr/xds_dependency_manager_test.go#L97)
To mitigate this situation we had a done channel , that we [close when
the test
ends](https://github.com/grpc/grpc-go/blob/bccbb10454782f18ecdfd60ca39ef413357edb0b/internal/xds/xdsdepmgr/xds_dependency_manager_test.go#L1273)
and the update function
[check](https://github.com/grpc/grpc-go/blob/bccbb10454782f18ecdfd60ca39ef413357edb0b/internal/xds/xdsdepmgr/xds_dependency_manager_test.go#L95)
for the done channel , if it is close, it returns. In
[TestAggregateCLusterChildError](https://github.com/grpc/grpc-go/blob/bccbb10454782f18ecdfd60ca39ef413357edb0b/internal/xds/xdsdepmgr/xds_dependency_manager_test.go#L1170)
test , (and other tests too) , the channel was being closed at the end
of the test , which meant it will close only when the test passes. And
if the tests fails, the done channel was not getting closed , which made
the update function to block.
Changed the close(done channel) to be a defer function , but this should
be the first thing that happens after the test is done , before the
dependency manager closes, so this is the last defer called in the test.

Also fixes `TestRouteConfigResourceError` flaking due to a similar
error.

RELEASE NOTES: None
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Area: xDS Includes everything xDS related, including LB policies used with xDS. Type: Bug

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Flaky Test: TestAggregateCluster

3 participants