Rksrcl/string grpc#34220
Conversation
Go Package Import DifferencesBaseline: 45d43d5
|
Regression DetectorRegression Detector ResultsMetrics dashboard Baseline: 45d43d5 Optimization Goals: ✅ No significant changes detected
|
| perf | experiment | goal | Δ mean % | Δ mean % CI | trials | links |
|---|---|---|---|---|---|---|
| ➖ | quality_gate_logs | % cpu utilization | +2.67 | [-0.36, +5.70] | 1 | Logs |
| ➖ | quality_gate_idle_all_features | memory utilization | +0.75 | [+0.69, +0.80] | 1 | Logs bounds checks dashboard |
| ➖ | file_to_blackhole_500ms_latency | egress throughput | +0.50 | [-0.27, +1.27] | 1 | Logs |
| ➖ | file_tree | memory utilization | +0.23 | [+0.17, +0.30] | 1 | Logs |
| ➖ | file_to_blackhole_1000ms_latency_linear_load | egress throughput | +0.21 | [-0.26, +0.67] | 1 | Logs |
| ➖ | uds_dogstatsd_to_api_cpu | % cpu utilization | +0.18 | [-0.80, +1.16] | 1 | Logs |
| ➖ | quality_gate_idle | memory utilization | +0.15 | [+0.10, +0.20] | 1 | Logs bounds checks dashboard |
| ➖ | file_to_blackhole_100ms_latency | egress throughput | +0.03 | [-0.66, +0.72] | 1 | Logs |
| ➖ | file_to_blackhole_300ms_latency | egress throughput | +0.02 | [-0.61, +0.64] | 1 | Logs |
| ➖ | file_to_blackhole_0ms_latency_http1 | egress throughput | +0.01 | [-0.80, +0.82] | 1 | Logs |
| ➖ | uds_dogstatsd_to_api | ingress throughput | +0.01 | [-0.30, +0.32] | 1 | Logs |
| ➖ | file_to_blackhole_0ms_latency | egress throughput | -0.00 | [-0.84, +0.83] | 1 | Logs |
| ➖ | tcp_dd_logs_filter_exclude | ingress throughput | -0.00 | [-0.03, +0.02] | 1 | Logs |
| ➖ | file_to_blackhole_0ms_latency_http2 | egress throughput | -0.01 | [-0.82, +0.79] | 1 | Logs |
| ➖ | file_to_blackhole_1000ms_latency | egress throughput | -0.04 | [-0.81, +0.73] | 1 | Logs |
| ➖ | tcp_syslog_to_blackhole | ingress throughput | -1.05 | [-1.10, -0.99] | 1 | Logs |
Bounds Checks: ✅ Passed
| perf | experiment | bounds_check_name | replicates_passed | links |
|---|---|---|---|---|
| ✅ | file_to_blackhole_0ms_latency | lost_bytes | 10/10 | |
| ✅ | file_to_blackhole_0ms_latency | memory_usage | 10/10 | |
| ✅ | file_to_blackhole_0ms_latency_http1 | lost_bytes | 10/10 | |
| ✅ | file_to_blackhole_0ms_latency_http1 | memory_usage | 10/10 | |
| ✅ | file_to_blackhole_0ms_latency_http2 | lost_bytes | 10/10 | |
| ✅ | file_to_blackhole_0ms_latency_http2 | memory_usage | 10/10 | |
| ✅ | file_to_blackhole_1000ms_latency | memory_usage | 10/10 | |
| ✅ | file_to_blackhole_1000ms_latency_linear_load | memory_usage | 10/10 | |
| ✅ | file_to_blackhole_100ms_latency | lost_bytes | 10/10 | |
| ✅ | file_to_blackhole_100ms_latency | memory_usage | 10/10 | |
| ✅ | file_to_blackhole_300ms_latency | lost_bytes | 10/10 | |
| ✅ | file_to_blackhole_300ms_latency | memory_usage | 10/10 | |
| ✅ | file_to_blackhole_500ms_latency | lost_bytes | 10/10 | |
| ✅ | file_to_blackhole_500ms_latency | memory_usage | 10/10 | |
| ✅ | quality_gate_idle | intake_connections | 10/10 | bounds checks dashboard |
| ✅ | quality_gate_idle | memory_usage | 10/10 | bounds checks dashboard |
| ✅ | quality_gate_idle_all_features | intake_connections | 10/10 | bounds checks dashboard |
| ✅ | quality_gate_idle_all_features | memory_usage | 10/10 | bounds checks dashboard |
| ✅ | quality_gate_logs | intake_connections | 10/10 | |
| ✅ | quality_gate_logs | lost_bytes | 10/10 | |
| ✅ | quality_gate_logs | memory_usage | 10/10 |
Explanation
Confidence level: 90.00%
Effect size tolerance: |Δ mean %| ≥ 5.00%
Performance changes are noted in the perf column of each table:
- ✅ = significantly better comparison variant performance
- ❌ = significantly worse comparison variant performance
- ➖ = no significant change in performance
A regression test is an A/B test of target performance in a repeatable rig, where "performance" is measured as "comparison variant minus baseline variant" for an optimization goal (e.g., ingress throughput). Due to intrinsic variability in measuring that goal, we can only estimate its mean value for each experiment; we report uncertainty in that value as a 90.00% confidence interval denoted "Δ mean % CI".
For each experiment, we decide whether a change in performance is a "regression" -- a change worth investigating further -- if all of the following criteria are true:
-
Its estimated |Δ mean %| ≥ 5.00%, indicating the change is big enough to merit a closer look.
-
Its 90.00% confidence interval "Δ mean % CI" does not contain zero, indicating that if our statistical model is accurate, there is at least a 90.00% chance there is a difference in performance between baseline and comparison variants.
-
Its configuration does not mark it "erratic".
CI Pass/Fail Decision
✅ Passed. All Quality Gates passed.
- quality_gate_idle, bounds check intake_connections: 10/10 replicas passed. Gate passed.
- quality_gate_idle, bounds check memory_usage: 10/10 replicas passed. Gate passed.
- quality_gate_logs, bounds check lost_bytes: 10/10 replicas passed. Gate passed.
- quality_gate_logs, bounds check intake_connections: 10/10 replicas passed. Gate passed.
- quality_gate_logs, bounds check memory_usage: 10/10 replicas passed. Gate passed.
- quality_gate_idle_all_features, bounds check memory_usage: 10/10 replicas passed. Gate passed.
- quality_gate_idle_all_features, bounds check intake_connections: 10/10 replicas passed. Gate passed.
| return strconv.FormatUint(c, 10) | ||
| } | ||
| strCUpper := strings.ToUpper(strC) | ||
| if strCUpper == "CANCELED" || strCUpper == "CANCELLED" { // the rpc code google api checks for "CANCELLED" but we receive "Canceled" from upstream |
There was a problem hiding this comment.
Can we add a unit test in aggregation_test.go to make sure that the "canceled" status code is being properly set.
| } | ||
| } | ||
|
|
||
| return "" // invalid gRPC code |
There was a problem hiding this comment.
We can take this comment out
It's not always invalid if we reach this point. Only grpc traces have grpc_status_codes. All other traces will not have the tag in the span, in which case there is nothing to return.
| // E.g., `grpc.target` to describe the name of a gRPC peer, or `db.hostname` to describe the name of peer DB | ||
| repeated string peer_tags = 16; | ||
| Trilean is_trace_root = 17; // this field's value is equal to span's ParentID == 0. | ||
| string GRPC_status_code = 18; |
There was a problem hiding this comment.
nit: can this be a number rather than a string as IIRC we're displaying these as the raw number rather than the string associated with the code
There was a problem hiding this comment.
If it's an integer the default if it's not set is 0, so this screws up the end to end tests for dd-go because a gRPC status code tag with 0 will be appended when not intentionally set and 0 is a valid gRPC status code that should get sent downstream.
There was a problem hiding this comment.
Integers have a default of 0 but 0 is also a valid status code. We've switched to a string since a string allows us to handle nil status codes better. Sources like USM won't be submitting grpc status codes.
Though it could be argued that we could keep the string format representation of the grpc code while the variable is of type string until we create the metric tag
There was a problem hiding this comment.
You can regain "presence" checking in protobuf by marking the field optional (check the proto3 section not the proto2 section)
https://protobuf.dev/programming-guides/field_presence/
There was a problem hiding this comment.
Our proto files in dd-go are currently generated with protoc-gen-gogo which doesn't support the optional field. Regenerating them using the normal protoc-gen-go changed the structs' formats. We've scoped updating our protobuf files out of this project and onto a separate M&R jira ticket.
| Synthetics bool | ||
| PeerTagsHash uint64 | ||
| IsTraceRoot pb.Trilean | ||
| GRPCStatusCode string |
There was a problem hiding this comment.
nit: same question as above r.e can this be an int
|
|
||
| func getGRPCStatusCode(meta map[string]string, metrics map[string]float64) string { | ||
| // List of possible keys to check in order | ||
| metaKeys := []string{"rpc.grpc.status_code", "grpc.code", "rpc.grpc.status.code", "grpc.status.code"} |
There was a problem hiding this comment.
Where did this list come from? If possible let's make this configurable and have these tags be the default values
There was a problem hiding this comment.
These are the various ways the tracer libraries will send the tag from upstream depending on the tracer library: https://docs.google.com/spreadsheets/d/1tbn04E-wLv8ozTtDO02PCqkruCNbJ-aK-1G8Ytp7H5o/edit?gid=0#gid=0. The getGRPCStatusCode function just tries to return a value of the integer gRPC code regardless of how it was sent by the tracers.
There was a problem hiding this comment.
I think we'll want this to be configurable so we can easily adapt to changes and new traces, but this is a good default list for sure
| return strconv.FormatUint(uint64(codes.Code(codeNum)), 10) | ||
| } | ||
|
|
||
| log.Debugf("Invalid status code %s. Using empty string", strC) |
There was a problem hiding this comment.
nit: we can probably get rid of this log, we can recover it from looking at the input data as needed
| @@ -0,0 +1,4 @@ | |||
| upgrade: | |||
| - | | |||
| APM: Adds grpc status codes to trace metrics | |||
There was a problem hiding this comment.
nit: it might be more accurate to say something like "aggregate apm stats payloads by grpc code" since this change alone doesn't put them on trace metrics
| "testing" | ||
| "time" | ||
|
|
||
| "github.com/DataDog/datadog-agent/pkg/trace/traceutil" |
There was a problem hiding this comment.
nit: my formatter does this too but since we didn't change anything else here let's undo this change to keep the diff smaller
Static quality checks ❌Please find below the results from static quality gates Error
Gate failure full details
Successful checksInfo
|
ichinaski
left a comment
There was a problem hiding this comment.
Leaving the PR approved not to block it on the response value casing, but I think the spec and possible values from tracers should be clearer and defined somewhere, so we don't have to account for multiple variants.
|
/merge |
|
View all feedbacks in Devflow UI.
The median merge time in
|
What does this PR do?
The agent will send grpc status codes to the stats backend
Motivation
We want grpc status codes to appear in trace metrics
Describe how you validated your changes
Added tests:
Ran system tests to check changes with all the tracer libraries: https://github.com/DataDog/system-tests-dashboard/actions/runs/13530539811
Possible Drawbacks / Trade-offs
Additional Notes