Rksrcl/string grpc by rksrcl · Pull Request #34220 · DataDog/datadog-agent

rksrcl · 2025-02-19T17:51:59Z

What does this PR do?

The agent now sends gRPC status codes to the stats backend.

Motivation

We want gRPC status codes to appear in trace metrics.

Describe how you validated your changes

Added tests:

pkg/trace/stats/aggregation_test.go
pkg/trace/stats/client_stats_aggregator_test.go

Ran system tests to check changes with all the tracer libraries: https://github.com/DataDog/system-tests-dashboard/actions/runs/13530539811

Possible Drawbacks / Trade-offs

Additional Notes

cit-pr-commenter · 2025-02-19T18:28:13Z

Go Package Import Differences

Baseline: 45d43d5
Comparison: 2b40790

binary	os	arch	change
agent	darwin	amd64	+1, -0 +google.golang.org/genproto/googleapis/rpc/code
agent	darwin	arm64	+1, -0 +google.golang.org/genproto/googleapis/rpc/code
iot-agent	linux	amd64	+1, -0 +google.golang.org/genproto/googleapis/rpc/code
iot-agent	linux	arm64	+1, -0 +google.golang.org/genproto/googleapis/rpc/code
heroku-agent	linux	amd64	+1, -0 +google.golang.org/genproto/googleapis/rpc/code
cluster-agent	linux	amd64	+1, -0 +google.golang.org/genproto/googleapis/rpc/code
cluster-agent	linux	arm64	+1, -0 +google.golang.org/genproto/googleapis/rpc/code
serverless	linux	amd64	+1, -0 +google.golang.org/genproto/googleapis/rpc/code
serverless	linux	arm64	+1, -0 +google.golang.org/genproto/googleapis/rpc/code
trace-agent	linux	amd64	+1, -0 +google.golang.org/genproto/googleapis/rpc/code
trace-agent	linux	arm64	+1, -0 +google.golang.org/genproto/googleapis/rpc/code
trace-agent	windows	amd64	+1, -0 +google.golang.org/genproto/googleapis/rpc/code
trace-agent	darwin	amd64	+1, -0 +google.golang.org/genproto/googleapis/rpc/code
trace-agent	darwin	arm64	+1, -0 +google.golang.org/genproto/googleapis/rpc/code
heroku-trace-agent	linux	amd64	+1, -0 +google.golang.org/genproto/googleapis/rpc/code

cit-pr-commenter · 2025-02-19T19:14:53Z

Regression Detector

Regression Detector Results

Metrics dashboard
Target profiles
Run ID: cf7fc08b-1e13-4929-8b70-1de69b54c989

Baseline: 45d43d5
Comparison: 2b40790
Diff

Optimization Goals: ✅ No significant changes detected

Fine details of change detection per experiment

perf	experiment	goal	Δ mean %	Δ mean % CI	trials	links
➖	quality_gate_logs	% cpu utilization	+2.67	[-0.36, +5.70]	1	Logs
➖	quality_gate_idle_all_features	memory utilization	+0.75	[+0.69, +0.80]	1	Logs bounds checks dashboard
➖	file_to_blackhole_500ms_latency	egress throughput	+0.50	[-0.27, +1.27]	1	Logs
➖	file_tree	memory utilization	+0.23	[+0.17, +0.30]	1	Logs
➖	file_to_blackhole_1000ms_latency_linear_load	egress throughput	+0.21	[-0.26, +0.67]	1	Logs
➖	uds_dogstatsd_to_api_cpu	% cpu utilization	+0.18	[-0.80, +1.16]	1	Logs
➖	quality_gate_idle	memory utilization	+0.15	[+0.10, +0.20]	1	Logs bounds checks dashboard
➖	file_to_blackhole_100ms_latency	egress throughput	+0.03	[-0.66, +0.72]	1	Logs
➖	file_to_blackhole_300ms_latency	egress throughput	+0.02	[-0.61, +0.64]	1	Logs
➖	file_to_blackhole_0ms_latency_http1	egress throughput	+0.01	[-0.80, +0.82]	1	Logs
➖	uds_dogstatsd_to_api	ingress throughput	+0.01	[-0.30, +0.32]	1	Logs
➖	file_to_blackhole_0ms_latency	egress throughput	-0.00	[-0.84, +0.83]	1	Logs
➖	tcp_dd_logs_filter_exclude	ingress throughput	-0.00	[-0.03, +0.02]	1	Logs
➖	file_to_blackhole_0ms_latency_http2	egress throughput	-0.01	[-0.82, +0.79]	1	Logs
➖	file_to_blackhole_1000ms_latency	egress throughput	-0.04	[-0.81, +0.73]	1	Logs
➖	tcp_syslog_to_blackhole	ingress throughput	-1.05	[-1.10, -0.99]	1	Logs

Bounds Checks: ✅ Passed

perf	experiment	bounds_check_name	replicates_passed	links
✅	file_to_blackhole_0ms_latency	lost_bytes	10/10
✅	file_to_blackhole_0ms_latency	memory_usage	10/10
✅	file_to_blackhole_0ms_latency_http1	lost_bytes	10/10
✅	file_to_blackhole_0ms_latency_http1	memory_usage	10/10
✅	file_to_blackhole_0ms_latency_http2	lost_bytes	10/10
✅	file_to_blackhole_0ms_latency_http2	memory_usage	10/10
✅	file_to_blackhole_1000ms_latency	memory_usage	10/10
✅	file_to_blackhole_1000ms_latency_linear_load	memory_usage	10/10
✅	file_to_blackhole_100ms_latency	lost_bytes	10/10
✅	file_to_blackhole_100ms_latency	memory_usage	10/10
✅	file_to_blackhole_300ms_latency	lost_bytes	10/10
✅	file_to_blackhole_300ms_latency	memory_usage	10/10
✅	file_to_blackhole_500ms_latency	lost_bytes	10/10
✅	file_to_blackhole_500ms_latency	memory_usage	10/10
✅	quality_gate_idle	intake_connections	10/10	bounds checks dashboard
✅	quality_gate_idle	memory_usage	10/10	bounds checks dashboard
✅	quality_gate_idle_all_features	intake_connections	10/10	bounds checks dashboard
✅	quality_gate_idle_all_features	memory_usage	10/10	bounds checks dashboard
✅	quality_gate_logs	intake_connections	10/10
✅	quality_gate_logs	lost_bytes	10/10
✅	quality_gate_logs	memory_usage	10/10

Explanation

Confidence level: 90.00%
Effect size tolerance: |Δ mean %| ≥ 5.00%

Performance changes are noted in the perf column of each table:

✅ = significantly better comparison variant performance
❌ = significantly worse comparison variant performance
➖ = no significant change in performance

A regression test is an A/B test of target performance in a repeatable rig, where "performance" is measured as "comparison variant minus baseline variant" for an optimization goal (e.g., ingress throughput). Due to intrinsic variability in measuring that goal, we can only estimate its mean value for each experiment; we report uncertainty in that value as a 90.00% confidence interval denoted "Δ mean % CI".

For each experiment, we decide whether a change in performance is a "regression" -- a change worth investigating further -- if all of the following criteria are true:

Its estimated |Δ mean %| ≥ 5.00%, indicating the change is big enough to merit a closer look.
Its 90.00% confidence interval "Δ mean % CI" does not contain zero, indicating that if our statistical model is accurate, there is at least a 90.00% chance there is a difference in performance between baseline and comparison variants.
Its configuration does not mark it "erratic".
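
The three criteria above can be read as a single boolean check. The following is a minimal sketch of that rule, not the detector's actual implementation; the `experiment` type and its field names are assumptions made for illustration, and the sample rows are taken from the table above.

```go
package main

import (
	"fmt"
	"math"
)

// experiment mirrors one row of the results table; field names are illustrative.
type experiment struct {
	name         string
	deltaMeanPct float64 // "Δ mean %"
	ciLow        float64 // lower bound of "Δ mean % CI"
	ciHigh       float64 // upper bound of "Δ mean % CI"
	erratic      bool    // set by the experiment's configuration
}

// effectSizeTolerance is the 5.00% threshold quoted in the explanation.
const effectSizeTolerance = 5.00

// isRegression applies the three criteria: a large enough effect,
// a confidence interval that excludes zero, and not marked erratic.
func isRegression(e experiment) bool {
	bigEnough := math.Abs(e.deltaMeanPct) >= effectSizeTolerance
	excludesZero := e.ciLow > 0 || e.ciHigh < 0
	return bigEnough && excludesZero && !e.erratic
}

func main() {
	rows := []experiment{
		{name: "quality_gate_logs", deltaMeanPct: 2.67, ciLow: -0.36, ciHigh: 5.70},
		{name: "tcp_syslog_to_blackhole", deltaMeanPct: -1.05, ciLow: -1.10, ciHigh: -0.99},
	}
	for _, r := range rows {
		fmt.Printf("%s: regression=%v\n", r.name, isRegression(r))
	}
}
```

Under this reading, tcp_syslog_to_blackhole is not flagged even though its interval excludes zero, because its effect size is below the 5.00% tolerance.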

CI Pass/Fail Decision

✅ Passed. All Quality Gates passed.

quality_gate_idle, bounds check intake_connections: 10/10 replicas passed. Gate passed.
quality_gate_idle, bounds check memory_usage: 10/10 replicas passed. Gate passed.
quality_gate_logs, bounds check lost_bytes: 10/10 replicas passed. Gate passed.
quality_gate_logs, bounds check intake_connections: 10/10 replicas passed. Gate passed.
quality_gate_logs, bounds check memory_usage: 10/10 replicas passed. Gate passed.
quality_gate_idle_all_features, bounds check memory_usage: 10/10 replicas passed. Gate passed.
quality_gate_idle_all_features, bounds check intake_connections: 10/10 replicas passed. Gate passed.

simont1 · 2025-02-21T18:33:36Z

+				return strconv.FormatUint(c, 10)
+			}
+			strCUpper := strings.ToUpper(strC)
+			if strCUpper == "CANCELED" || strCUpper == "CANCELLED" { // the rpc code google api checks for "CANCELLED" but we receive "Canceled" from upstream


Can we add a unit test in aggregation_test.go to make sure that the "canceled" status code is set properly?
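
A table-driven check along these lines might cover both spellings. This is only a sketch: `canceledToCode` is a hypothetical helper that isolates the comparison shown in the diff, not a function from the PR, and it assumes the numeric value 1 that gRPC assigns to CANCELLED.

```go
package main

import (
	"fmt"
	"strings"
)

// cancelledCode is the numeric value of the gRPC CANCELLED status, as a string.
const cancelledCode = "1"

// canceledToCode is a hypothetical helper isolating the spelling check from the diff.
// It returns the numeric code as a string, or "" when the input is not a cancellation.
func canceledToCode(s string) string {
	upper := strings.ToUpper(s)
	if upper == "CANCELED" || upper == "CANCELLED" {
		return cancelledCode
	}
	return ""
}

func main() {
	// Each case pairs an upstream value with the code we expect back.
	cases := []struct{ in, want string }{
		{"Canceled", "1"},
		{"CANCELLED", "1"},
		{"canceled", "1"},
		{"NotFound", ""},
	}
	for _, c := range cases {
		got := canceledToCode(c.in)
		fmt.Printf("%q -> %q (ok=%v)\n", c.in, got, got == c.want)
	}
}
```

In aggregation_test.go the same cases would presumably be fed through the real status-code function using the testing package rather than printed.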

simont1 · 2025-02-21T18:35:02Z

+		}
+	}
+
+	return "" // invalid gRPC code


We can take this comment out.
It's not always invalid if we reach this point. Only gRPC traces have grpc_status_codes; all other traces will not have the tag in the span, in which case there is nothing to return.

kitfre · 2025-02-21T18:29:58Z

 	// E.g., `grpc.target` to describe the name of a gRPC peer, or `db.hostname` to describe the name of peer DB
 	repeated string peer_tags = 16;
 	Trilean is_trace_root = 17; // this field's value is equal to span's ParentID == 0.
+	string GRPC_status_code = 18;


nit: can this be a number rather than a string? IIRC we're displaying these as the raw number rather than the string associated with the code.

If it's an integer, the default when it's not set is 0. This breaks the end-to-end tests for dd-go, because a gRPC status code tag of 0 is appended even when it was not intentionally set, and 0 is a valid gRPC status code that should be sent downstream.

Integers have a default of 0 but 0 is also a valid status code. We've switched to a string since a string allows us to handle nil status codes better. Sources like USM won't be submitting grpc status codes.

Though it could be argued that, while the variable is of type string, we could keep the string representation of the gRPC code until we create the metric tag.

You can regain "presence" checking in protobuf by marking the field optional (check the proto3 section, not the proto2 section):
https://protobuf.dev/programming-guides/field_presence/

Our proto files in dd-go are currently generated with protoc-gen-gogo, which doesn't support the optional field. Regenerating them with the standard protoc-gen-go changed the structs' formats. We've scoped updating our protobuf files out of this project and into a separate M&R Jira ticket.
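
To make the trade-off in this thread concrete, here is a small sketch of the three representations discussed: a plain integer, a string, and a pointer (generated Go code typically represents a proto3 `optional` scalar as a pointer). The struct and function names are hypothetical and heavily simplified; they are not the agent's real types.

```go
package main

import "fmt"

// Hypothetical, trimmed-down keys; each holds only the status code field.
type intKey struct{ GRPCStatusCode uint32 }
type stringKey struct{ GRPCStatusCode string }
type ptrKey struct{ GRPCStatusCode *uint32 }

// renderInt cannot tell "unset" from gRPC OK: both are 0.
func renderInt(k intKey) string {
	return fmt.Sprintf("%d", k.GRPCStatusCode)
}

// renderString treats the empty string as "no gRPC code on this span".
func renderString(k stringKey) string {
	return k.GRPCStatusCode
}

// renderPtr uses nil for "unset", which is what explicit presence buys.
func renderPtr(k ptrKey) string {
	if k.GRPCStatusCode == nil {
		return ""
	}
	return fmt.Sprintf("%d", *k.GRPCStatusCode)
}

func main() {
	ok := uint32(0)
	fmt.Printf("int unset=%q\n", renderInt(intKey{}))
	fmt.Printf("string unset=%q string OK=%q\n", renderString(stringKey{}), renderString(stringKey{GRPCStatusCode: "0"}))
	fmt.Printf("ptr unset=%q ptr OK=%q\n", renderPtr(ptrKey{}), renderPtr(ptrKey{GRPCStatusCode: &ok}))
}
```

With the integer form, a span from a source that never sets the code renders the same as a successful gRPC call; the string and pointer forms keep the two cases apart.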

kitfre · 2025-02-21T18:31:15Z

+	Synthetics     bool
+	PeerTagsHash   uint64
+	IsTraceRoot    pb.Trilean
+	GRPCStatusCode string


nit: same question as above, re: can this be an int?

kitfre · 2025-02-21T18:32:40Z

+
+func getGRPCStatusCode(meta map[string]string, metrics map[string]float64) string {
+	// List of possible keys to check in order
+	metaKeys := []string{"rpc.grpc.status_code", "grpc.code", "rpc.grpc.status.code", "grpc.status.code"}


Where did this list come from? If possible, let's make this configurable and have these tags be the default values.

These are the various keys the tracer libraries use to send the tag from upstream, depending on the library: https://docs.google.com/spreadsheets/d/1tbn04E-wLv8ozTtDO02PCqkruCNbJ-aK-1G8Ytp7H5o/edit?gid=0#gid=0
The getGRPCStatusCode function just tries to return the integer gRPC code regardless of how the tracers sent it.

I think we'll want this to be configurable so we can easily adapt to changes and new traces, but this is a good default list for sure
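
One way the configurable variant could look, as a rough sketch only: the key list is passed in, with the four keys from the diff as the default. `lookupGRPCStatusCode`, `defaultKeys`, `codeByName`, and the example key `my.custom.code` are hypothetical names; the lookup order (meta before metrics) and the rejection of numbers above 16 are assumptions, not behavior confirmed by the PR.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// defaultKeys is the list from the diff, used when nothing is configured.
var defaultKeys = []string{"rpc.grpc.status_code", "grpc.code", "rpc.grpc.status.code", "grpc.status.code"}

// codeByName maps upper-cased status names to their numeric gRPC codes.
// "CANCELED" is included because tracers may send the single-l spelling.
var codeByName = map[string]uint64{
	"OK": 0, "CANCELLED": 1, "CANCELED": 1, "UNKNOWN": 2, "INVALID_ARGUMENT": 3,
	"DEADLINE_EXCEEDED": 4, "NOT_FOUND": 5, "ALREADY_EXISTS": 6, "PERMISSION_DENIED": 7,
	"RESOURCE_EXHAUSTED": 8, "FAILED_PRECONDITION": 9, "ABORTED": 10, "OUT_OF_RANGE": 11,
	"UNIMPLEMENTED": 12, "INTERNAL": 13, "UNAVAILABLE": 14, "DATA_LOSS": 15, "UNAUTHENTICATED": 16,
}

// maxGRPCCode is the highest code defined by gRPC (UNAUTHENTICATED).
const maxGRPCCode = 16

// lookupGRPCStatusCode returns the numeric code as a string, or "" when the
// span carries no recognizable gRPC status under any of the given keys.
func lookupGRPCStatusCode(keys []string, meta map[string]string, metrics map[string]float64) string {
	for _, key := range keys {
		if s, ok := meta[key]; ok {
			// Numeric form, e.g. "14".
			if n, err := strconv.ParseUint(s, 10, 32); err == nil && n <= maxGRPCCode {
				return strconv.FormatUint(n, 10)
			}
			// Name form, e.g. "Canceled".
			if n, ok := codeByName[strings.ToUpper(s)]; ok {
				return strconv.FormatUint(n, 10)
			}
		}
		// Numeric metric form; only whole numbers in range are accepted.
		if v, ok := metrics[key]; ok && v >= 0 && v <= maxGRPCCode && v == float64(uint64(v)) {
			return strconv.FormatUint(uint64(v), 10)
		}
	}
	return ""
}

func main() {
	fmt.Println(lookupGRPCStatusCode(defaultKeys, map[string]string{"grpc.code": "Canceled"}, nil))

	// A configured list could prepend or replace keys without code changes.
	custom := append([]string{"my.custom.code"}, defaultKeys...)
	fmt.Println(lookupGRPCStatusCode(custom, map[string]string{"my.custom.code": "14"}, nil))
}
```

Keeping the defaults in one slice means a new tracer convention would only need a configuration change rather than a new agent release.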

kitfre · 2025-02-21T18:33:37Z

+				return strconv.FormatUint(uint64(codes.Code(codeNum)), 10)
+			}
+
+			log.Debugf("Invalid status code %s. Using empty string", strC)


nit: we can probably get rid of this log; we can recover the information from the input data as needed.

kitfre · 2025-02-21T18:34:50Z

@@ -0,0 +1,4 @@
+upgrade:
+  - |
+    APM: Adds grpc status codes to trace metrics


nit: it might be more accurate to say something like "aggregate apm stats payloads by grpc code", since this change alone doesn't put them on trace metrics.
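
To illustrate what "aggregate by grpc code" means here, a minimal sketch: once the code is part of the aggregation key, spans that differ only in status code land in separate buckets. The `aggKey` and `span` types are hypothetical and far smaller than the agent's real aggregation structures.

```go
package main

import "fmt"

// aggKey is a hypothetical, trimmed-down aggregation key.
type aggKey struct {
	Service        string
	Resource       string
	GRPCStatusCode string // "" when the span carries no gRPC status
}

// span holds only the fields this sketch needs.
type span struct {
	Service, Resource, GRPCStatusCode string
}

// aggregate counts hits per key; adding GRPCStatusCode to the key is what
// splits one bucket into one bucket per status code.
func aggregate(spans []span) map[aggKey]int {
	hits := make(map[aggKey]int)
	for _, s := range spans {
		k := aggKey{Service: s.Service, Resource: s.Resource, GRPCStatusCode: s.GRPCStatusCode}
		hits[k]++
	}
	return hits
}

func main() {
	spans := []span{
		{Service: "svc", Resource: "/pkg.Service/Get", GRPCStatusCode: "0"},
		{Service: "svc", Resource: "/pkg.Service/Get", GRPCStatusCode: "0"},
		{Service: "svc", Resource: "/pkg.Service/Get", GRPCStatusCode: "14"},
		{Service: "svc", Resource: "GET /health"},
	}
	for k, n := range aggregate(spans) {
		fmt.Printf("%+v hits=%d\n", k, n)
	}
}
```

Turning those buckets into a tag on trace metrics is a separate step downstream, which is the distinction the suggested wording draws.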

kitfre · 2025-02-21T18:35:15Z

 	"testing"
 	"time"

+	"github.com/DataDog/datadog-agent/pkg/trace/traceutil"


nit: my formatter does this too, but since we didn't change anything else here, let's undo this change to keep the diff smaller.

agent-platform-auto-pr · 2025-02-24T20:58:10Z

Static quality checks ❌

Please find below the results from static quality gates

Error

Result	Quality gate	On disk size	On disk size limit	On wire size	On wire size limit
❌	static_quality_gate_agent_rpm_arm64	DataNotFound	836.66MiB	DataNotFound	194.24MiB

Gate failure full details

Quality gate	Error type	Error message
static_quality_gate_agent_rpm_arm64	StackTrace	Traceback (most recent call last): File "/go/src/github.com/DataDog/datadog-agent/tasks/quality_gates.py", line 121, in parse_and_trigger_gates gate_mod.entrypoint(**gate_inputs) File "/go/src/github.com/DataDog/datadog-agent/tasks/static_quality_gates/static_quality_gate_agent_rpm_arm64.py", line 5, in entrypoint generic_package_agent_quality_gate( File "/go/src/github.com/DataDog/datadog-agent/tasks/static_quality_gates/lib/package_agent_lib.py", line 69, in generic_package_agent_quality_gate package_on_wire_size, package_on_disk_size = calculate_package_size( ^^^^^^^^^^^^^^^^^^^^^^^ File "/go/src/github.com/DataDog/datadog-agent/tasks/static_quality_gates/lib/package_agent_lib.py", line 10, in calculate_package_size extract_package(ctx=ctx, package_os=package_os, package_path=package_path, extract_dir=extract_dir) File "/go/src/github.com/DataDog/datadog-agent/tasks/libs/package/size.py", line 72, in extract_package return extract_rpm_package(ctx, package_path, extract_dir) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/go/src/github.com/DataDog/datadog-agent/tasks/libs/package/size.py", line 65, in extract_rpm_package ctx.run(f"rpm2cpio {package_path}

Successful checks

Info

Result	Quality gate	On disk size	On disk size limit	On wire size	On wire size limit
✅	static_quality_gate_agent_deb_amd64	831.59MiB	847.49MiB	181.72MiB	212.33MiB
✅	static_quality_gate_agent_deb_arm64	821.41MiB	836.66MiB	167.35MiB	192.5MiB
✅	static_quality_gate_agent_rpm_amd64	831.65MiB	847.82MiB	184.14MiB	215.76MiB
✅	static_quality_gate_agent_suse_amd64	831.58MiB	847.82MiB	184.14MiB	215.76MiB
✅	static_quality_gate_agent_suse_arm64	821.43MiB	836.66MiB	166.49MiB	194.24MiB
✅	static_quality_gate_dogstatsd_deb_amd64	39.57MiB	49.7MiB	10.41MiB	20.6MiB
✅	static_quality_gate_dogstatsd_deb_arm64	37.91MiB	48.1MiB	8.98MiB	19.1MiB
✅	static_quality_gate_dogstatsd_rpm_amd64	39.57MiB	49.7MiB	10.42MiB	20.6MiB
✅	static_quality_gate_dogstatsd_suse_amd64	39.57MiB	49.7MiB	10.42MiB	20.6MiB
✅	static_quality_gate_iot_agent_deb_amd64	59.24MiB	69.0MiB	14.59MiB	24.8MiB
✅	static_quality_gate_iot_agent_deb_arm64	56.61MiB	66.4MiB	12.58MiB	22.8MiB
✅	static_quality_gate_iot_agent_rpm_amd64	59.24MiB	69.0MiB	14.61MiB	24.8MiB
✅	static_quality_gate_iot_agent_rpm_arm64	56.61MiB	66.4MiB	12.6MiB	22.8MiB
✅	static_quality_gate_iot_agent_suse_amd64	59.24MiB	69.0MiB	14.61MiB	24.8MiB
✅	static_quality_gate_docker_agent_amd64	915.91MiB	931.7MiB	307.44MiB	318.67MiB
✅	static_quality_gate_docker_agent_arm64	928.98MiB	944.08MiB	292.44MiB	303.0MiB
✅	static_quality_gate_docker_agent_jmx_amd64	1.09GiB	1.1GiB	382.55MiB	393.75MiB
✅	static_quality_gate_docker_agent_jmx_arm64	1.09GiB	1.1GiB	363.52MiB	373.71MiB
✅	static_quality_gate_docker_dogstatsd_amd64	47.71MiB	57.88MiB	18.26MiB	28.29MiB
✅	static_quality_gate_docker_dogstatsd_arm64	46.09MiB	56.27MiB	17.02MiB	27.06MiB
✅	static_quality_gate_docker_cluster_agent_amd64	264.96MiB	274.78MiB	106.36MiB	116.28MiB
✅	static_quality_gate_docker_cluster_agent_arm64	280.92MiB	290.82MiB	101.17MiB	111.12MiB

ichinaski

Approving the PR so as not to block it on the response value casing, but I think the spec and the possible values from tracers should be clearly defined somewhere, so we don't have to account for multiple variants.

rksrcl · 2025-02-27T14:56:54Z

/merge

dd-devflow · 2025-02-27T14:56:59Z

View all feedbacks in Devflow UI.
2025-02-27 14:56:58 UTC ℹ️ Start processing command /merge

2025-02-27 14:57:06 UTC ℹ️ MergeQueue: pull request added to the queue

The median merge time in main is 29m.

2025-02-27 15:33:19 UTC ℹ️ MergeQueue: This merge request was merged

rksrcl and others added 14 commits February 10, 2025 10:56

Added grpc field and translation layer logic

9b5349a

Cleaned up status code logic

a836405

merge from main

746db87

Removed metrics

a8bd286

Merge branch 'main' into rksrcl/grpc-status-code-stats-payloads

710c577

Update protobufs

f2d51de

linter

c7e2918

Add grpc status code to tests

16dc7c7

Add tests

9c09986

linter

4888052

Merge branch 'main' into rksrcl/grpc-status-code-stats-payloads

1a540f7

Update tests

ddc791f

Merge branch 'main' into rksrcl/grpc-status-code-stats-payloads

a8713a6

string grpc status code

75455b5

rksrcl requested review from a team as code owners February 19, 2025 17:52

rksrcl requested review from dineshg13 and liustanley February 19, 2025 17:52

rksrcl marked this pull request as draft February 19, 2025 17:52

github-actions Bot added the "medium review" (PR review might take time), "team/agent-apm", and "trace-agent" labels Feb 19, 2025

rksrcl added 2 commits February 19, 2025 15:21

canceled and cancelled are the same

451ef56

remove log lines

25e1286

rksrcl marked this pull request as ready for review February 19, 2025 21:21

release note

5f5b874

rksrcl requested a review from a team as a code owner February 19, 2025 21:23

github-actions Bot removed the "medium review" (PR review might take time) label Feb 19, 2025

Merge branch 'main' into rksrcl/string-grpc

f3de0ee

dineshg13 reviewed Feb 20, 2025

Comment thread pkg/trace/stats/aggregation.go Outdated

variable

21946ba

rksrcl requested a review from dineshg13 February 20, 2025 20:22

dineshg13 reviewed Feb 20, 2025

Comment thread pkg/trace/stats/aggregation.go Outdated

variable

9660aab

rksrcl requested a review from dineshg13 February 20, 2025 21:13

rksrcl and others added 2 commits February 21, 2025 07:04

Merge branch 'main' into rksrcl/string-grpc

2c6903e

linter

ffe7e25

simont1 reviewed Feb 21, 2025

kitfre reviewed Feb 21, 2025

dineshg13 approved these changes Feb 21, 2025

liustanley approved these changes Feb 21, 2025

rksrcl and others added 5 commits February 22, 2025 11:20

Comments

9f0a69a

Linter

6b8a86f

Merge branch 'main' into rksrcl/string-grpc

609174d

Merge branch 'main' into rksrcl/string-grpc

1de339c

Improved string processing

2b40790

simont1 approved these changes Feb 26, 2025

ichinaski approved these changes Feb 27, 2025

dd-mergequeue Bot merged commit 0ce7715 into main Feb 27, 2025

dd-mergequeue Bot deleted the rksrcl/string-grpc branch February 27, 2025 15:33

github-actions Bot added this to the 7.65.0 milestone Feb 27, 2025

rksrcl mentioned this pull request Feb 28, 2025

Account for more tracer library values for gRPC codes #34587

Merged

bric3 mentioned this pull request Mar 11, 2026

Report gRPC status code in client-computed stats DataDog/dd-trace-java#10805

Merged

8 participants
