xdsdepmgr: Avoid using actual DNS lookups and fix race during test failure#8805
Conversation
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## master #8805 +/- ##
==========================================
- Coverage 83.30% 83.29% -0.01%
==========================================
Files 418 417 -1
Lines 32897 32920 +23
==========================================
+ Hits 27404 27420 +16
- Misses 4093 4095 +2
- Partials 1400 1405 +5 🚀 New features to boost your workflow:
|
|
Were you able to reproduce this on your setup (either on a debian linux workstation or on forge) and ensure that the fix actually solves the problem? |
Yes, I was able to reproduce the hang by making the test fail on purpose , and have verified that it is fixed with this change. |
| t.Fatalf("received unexpected error from dependency manager: %v", err) | ||
| } | ||
| dnsR := replaceDNSResolver(t) | ||
| dnsR.InitialState(resolver.State{ |
There was a problem hiding this comment.
nit: You can call UpdateState now; after last year's changes, it correctly buffers updates if the resolver hasn't been built yet. The InitialState method is now redundant.
| case err := <-watcher.errorCh: | ||
| t.Fatalf("received unexpected error from dependency manager: %v", err) | ||
| } | ||
| dnsR := replaceDNSResolver(t) |
There was a problem hiding this comment.
Can you please move the DNS resolver replacement near the top of the test, before any xDS configuration is sent? This would avoid potential races where the real DNS resolver gets used before replacement.
| { | ||
| Addresses: []resolver.Address{ | ||
| {Addr: "127.0.0.1:8081"}, | ||
| }, | ||
| }, | ||
| { | ||
| Addresses: []resolver.Address{ | ||
| {Addr: "[::1]:8081"}, | ||
| }, | ||
| }, |
There was a problem hiding this comment.
nit: Please revert the formatting changes.
|
Also fixed |
…ilure (grpc#8805) Fixes: grpc#8801 There were 2 problems here : 1. The actual DNS lookup calls return different resolved addresses for "localhost" on different machines. Changed the code to replace the DNS resolver with a manual resolver in test to mock the DNS resolver. 2. In the case where the tests send a EDS or cluster error , the xds management server keeps sending resource error continuously which in turn triggers updates from dependency manager which pushes the [update to a channel.](https://github.com/grpc/grpc-go/blob/bccbb10454782f18ecdfd60ca39ef413357edb0b/internal/xds/xdsdepmgr/xds_dependency_manager_test.go#L97) To mitigate this situation we had a done channel , that we [close when the test ends](https://github.com/grpc/grpc-go/blob/bccbb10454782f18ecdfd60ca39ef413357edb0b/internal/xds/xdsdepmgr/xds_dependency_manager_test.go#L1273) and the update function [check](https://github.com/grpc/grpc-go/blob/bccbb10454782f18ecdfd60ca39ef413357edb0b/internal/xds/xdsdepmgr/xds_dependency_manager_test.go#L95) for the done channel , if it is close, it returns. In [TestAggregateCLusterChildError](https://github.com/grpc/grpc-go/blob/bccbb10454782f18ecdfd60ca39ef413357edb0b/internal/xds/xdsdepmgr/xds_dependency_manager_test.go#L1170) test , (and other tests too) , the channel was being closed at the end of the test , which meant it will close only when the test passes. And if the tests fails, the done channel was not getting closed , which made the update function to block. Changed the close(done channel) to be a defer function , but this should be the first thing that happens after the test is done , before the dependency manager closes, so this is the last defer called in the test. Also fixes `TestRouteConfigResourceError` flaking due to a similar error. RELEASE NOTES: None
Fixes: #8801
There were 2 problems here :
The actual DNS lookup calls return different resolved addresses for "localhost" on different machines. Changed the code to replace the DNS resolver with a manual resolver in test to mock the DNS resolver.
In the case where the tests send a EDS or cluster error , the xds management server keeps sending resource error continuously which in turn triggers updates from dependency manager which pushes the update to a channel. To mitigate this situation we had a done channel , that we close when the test ends and the update function check for the done channel , if it is close, it returns. In TestAggregateCLusterChildError test , (and other tests too) , the channel was being closed at the end of the test , which meant it will close only when the test passes. And if the tests fails, the done channel was not getting closed , which made the update function to block.
Changed the close(done channel) to be a defer function , but this should be the first thing that happens after the test is done , before the dependency manager closes, so this is the last defer called in the test.
Also fixes
TestRouteConfigResourceErrorflaking due to a similar error.RELEASE NOTES: None