xds/internal/client: Add async gauge metrics for Connected and Resources (A78)#8807
Conversation
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #8807 +/- ##
==========================================
- Coverage 83.06% 83.01% -0.05%
==========================================
Files 413 413
Lines 33135 33229 +94
==========================================
+ Hits 27523 27586 +63
- Misses 4199 4226 +27
- Partials 1413 1417 +4
🚀 New features to boost your workflow:
|
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request introduces asynchronous gauge metrics for tracking xDS client connectivity and resource counts, which is a valuable addition for observability. The implementation looks mostly solid, but I've found a critical issue in the connectivity metric logic that causes it to report incorrect status. I've also included a couple of medium-severity suggestions to improve test coverage and code clarity. Please see the detailed comments below.
916bd0e to
30b9c73
Compare
sequenceDiagram
participant OS as "OTel/Stats System"
participant Rec as "MetricsRecorder (Core)"
participant MR as "MetricsReporter (Adapter)"
participant XC as "XDSClient"
participant AS as "ADS Stream"
Note over XC, Rec: Phase 1: Registration (One-time)
XC->>MR: RegisterAsyncReporter(reporter)
MR->>Rec: RegisterAsyncReporter(hook)
Rec-->>XC: metricsCleanup func
Note over XC, AS: Phase 2: Synchronous Events & ADS Lifecycle
XC->>AS: WatchResource(lds:foo)
AS->>AS: NewStream() Success
AS->>XC: established = true
AS->>XC: Recv(Response: lds:foo)
XC->>XC: Cache Update (Requested -> ACKed)
AS->>AS: Recv() Error (Stream Breaks)
AS->>XC: established = false
alt if legitimate (gRFC A9)
AS->>MR: (No Counter)
else non-legitimate
AS->>MR: ReportMetric(ServerFailure)
end
MR->>Rec: Record Counter
Note over OS, XC: Phase 3: Asynchronous Scrape (Periodic)
OS->>Rec: Scrape Metrics
Rec->>MR: Trigger Hook
MR->>XC: Report(AsyncMetricsRecorder)
XC->>XC: reportConnectedState() --> 1 if established, else 0
XC->>XC: reportResourceStats() --> counts by (Type, CacheState)
XC-->>OS: Return Gauge Values (via Core)
Note over XC, Rec: Phase 4: Cleanup
XC->>Rec: metricsCleanup()
|
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request introduces asynchronous gauge metrics for tracking the xDS client's connection status and the number of cached xDS resources, as specified in gRFC A78. The implementation is well-structured, introducing new interfaces for asynchronous metric reporting and ensuring proper lifecycle management. The accompanying tests are thorough and cover various states and scenarios for the new metrics. I have a couple of minor suggestions to improve variable naming for better code clarity and maintainability.
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
7582c77 to
720d321
Compare
720d321 to
f397493
Compare
| // Verify state transitions to disconnected (Value: 0) after NewStream attempt fails. | ||
| close(unblock) | ||
|
|
||
| if err := tmr.waitForSpecificMetric(ctx, &metrics.ServerFailure{ServerURI: mgmtServer.Address}); err != nil { |
There was a problem hiding this comment.
think it makes sense to specify a Value: 0 in this case as it makes things explicit.
| if err := tmr.waitForSpecificMetric(ctx, &metrics.XDSClientConnected{ServerURI: mgmtServer.Address, Value: 1}); err != nil { | ||
| t.Fatalf("Expected XDSClientConnected to be 1 after response, got: %v", err) | ||
| } | ||
| } |
There was a problem hiding this comment.
This is a very long test that is testing a whole bunch of transitions. Is it possible to break them into smaller test(s) so that each one is just testing one particular transition. It is always better to have tests that do one particular thing instead of having tests that do N things. It is easier to maintain and easier to debug when something fails.
There was a problem hiding this comment.
to reach to step 4/5 we will need to go through initial steps - because the logic is varies for initial NewStream and then disconnection (semantically) and then reconnection (semantically means different the second time).
| }, | ||
| }) | ||
|
|
||
| // Verify that resources missing from the authoritative update transition to |
There was a problem hiding this comment.
Same here. authoritative update is not a term that is well known or regularly used.
bf0c55d to
32dc1cb
Compare
This PR leverages the async gauge framework and implements the xdsclient metrics to report number of xds resources and whether or not the xDS client currently has a working ADS stream to the xDS server along with the required labels as part of A78
RELEASE NOTES: