Skip to content

Support adding kafka_cluster_id for Confluent.Kafka#7702

Merged
robcarlan-datadog merged 28 commits into
masterfrom
rob.carlan/kafka-cluster-id-tag-reflection
Mar 20, 2026
Merged

Support adding kafka_cluster_id for Confluent.Kafka#7702
robcarlan-datadog merged 28 commits into
masterfrom
rob.carlan/kafka-cluster-id-tag-reflection

Conversation

@robcarlan-datadog

@robcarlan-datadog robcarlan-datadog commented Oct 23, 2025

Copy link
Copy Markdown
Contributor

Summary of changes

  • Adds kafka_cluster_id as a tag for APM Spans and DSM checkpoints/offsets. (Note that DSM is enabled by default). Without this, we cannot differentiate between reported offsets on the same topic but across different clusters, which leads us to wildly incorrect metric values. (For example, a prod cluster with consume/produce offsets at 1000,1001 should have an offset lag of 1. If there is also a matching dev cluster with offsets at 0 and 1, then we can calculate the offset lag as 1000).
  • In most tracer libraries (python, java, node) we support tracking kafka_cluster_id. Without this, metrics become incorrect when there are the same topics across different clusters (i.e. dev/prod environments).
  • It's available for Confluent.Kafka 2.3.0 and above only - roughly half of orgs that use this library according to telemetry. (For DSM where this is most useful, we can add a recommendation to our docs to use this version or above.)

Note: I drafted an alternative approach calling librdkafka directly here. That gives the cluster id without any external API call or version dependency (It's available and unchanged from librdkafka 1.0.0), but calls into unmanaged code

Reason for change

This functionality exists in other tracers:

  1. Java (doesn't block, intercepts the metadata response to enrich cluster id going forwards, see here and here)

  2. Node (blocks, see here and here)

  3. Python (blocks with a 1 second timeout, see here and here)

Implementation details

Constructs the Kafka Admin API client and queries for the cluster id directly on consumer/producer startup.
We cache this by bootstrap servers, so we expect to make this API call once.

Test coverage

Other details

These DSM metrics will now tag by kafka_cluster_id
Screenshot 2026-02-25 at 12 42 00 pm
Added to spans (I checked this on produce too, I just don't have a screenshot of it)
Screenshot 2026-02-24 at 5 21 59 pm

@datadog-datadog-prod-us1

This comment has been minimized.

@dd-trace-dotnet-ci-bot

dd-trace-dotnet-ci-bot Bot commented Oct 23, 2025

Copy link
Copy Markdown

Execution-Time Benchmarks Report ⏱️

Execution-time results for samples comparing This PR (7702) and master.

✅ No regressions detected - check the details below

Full Metrics Comparison

FakeDbCommand

Metric Master (Mean ± 95% CI) Current (Mean ± 95% CI) Change Status
.NET Framework 4.8 - Baseline
duration69.31 ± (69.54 - 69.99) ms68.91 ± (68.86 - 69.13) ms-0.6%
.NET Framework 4.8 - Bailout
duration73.09 ± (73.02 - 73.28) ms73.13 ± (72.98 - 73.37) ms+0.1%✅⬆️
.NET Framework 4.8 - CallTarget+Inlining+NGEN
duration1044.03 ± (1046.70 - 1054.47) ms1039.94 ± (1043.59 - 1051.21) ms-0.4%
.NET Core 3.1 - Baseline
process.internal_duration_ms21.90 ± (21.87 - 21.94) ms21.92 ± (21.88 - 21.95) ms+0.1%✅⬆️
process.time_to_main_ms79.85 ± (79.68 - 80.03) ms79.88 ± (79.73 - 80.03) ms+0.0%✅⬆️
runtime.dotnet.exceptions.count0 ± (0 - 0)0 ± (0 - 0)+0.0%
runtime.dotnet.mem.committed10.89 ± (10.89 - 10.89) MB10.91 ± (10.91 - 10.91) MB+0.2%✅⬆️
runtime.dotnet.threads.count12 ± (12 - 12)12 ± (12 - 12)+0.0%
.NET Core 3.1 - Bailout
process.internal_duration_ms21.88 ± (21.85 - 21.91) ms21.82 ± (21.80 - 21.85) ms-0.3%
process.time_to_main_ms81.31 ± (81.16 - 81.45) ms81.21 ± (81.06 - 81.35) ms-0.1%
runtime.dotnet.exceptions.count0 ± (0 - 0)0 ± (0 - 0)+0.0%
runtime.dotnet.mem.committed10.94 ± (10.93 - 10.94) MB10.94 ± (10.94 - 10.94) MB+0.0%✅⬆️
runtime.dotnet.threads.count13 ± (13 - 13)13 ± (13 - 13)+0.0%
.NET Core 3.1 - CallTarget+Inlining+NGEN
process.internal_duration_ms220.38 ± (219.28 - 221.48) ms218.23 ± (217.04 - 219.41) ms-1.0%
process.time_to_main_ms472.68 ± (472.05 - 473.32) ms472.80 ± (472.23 - 473.37) ms+0.0%✅⬆️
runtime.dotnet.exceptions.count0 ± (0 - 0)0 ± (0 - 0)+0.0%
runtime.dotnet.mem.committed47.24 ± (47.21 - 47.27) MB47.26 ± (47.23 - 47.28) MB+0.0%✅⬆️
runtime.dotnet.threads.count28 ± (28 - 28)28 ± (28 - 28)-0.3%
.NET 6 - Baseline
process.internal_duration_ms20.78 ± (20.75 - 20.81) ms20.77 ± (20.74 - 20.80) ms-0.0%
process.time_to_main_ms69.91 ± (69.76 - 70.07) ms69.42 ± (69.28 - 69.56) ms-0.7%
runtime.dotnet.exceptions.count0 ± (0 - 0)0 ± (0 - 0)+0.0%
runtime.dotnet.mem.committed10.62 ± (10.62 - 10.62) MB10.62 ± (10.61 - 10.62) MB-0.0%
runtime.dotnet.threads.count10 ± (10 - 10)10 ± (10 - 10)+0.0%
.NET 6 - Bailout
process.internal_duration_ms20.52 ± (20.50 - 20.54) ms20.88 ± (20.85 - 20.91) ms+1.7%✅⬆️
process.time_to_main_ms70.16 ± (70.07 - 70.26) ms70.73 ± (70.60 - 70.85) ms+0.8%✅⬆️
runtime.dotnet.exceptions.count0 ± (0 - 0)0 ± (0 - 0)+0.0%
runtime.dotnet.mem.committed10.73 ± (10.73 - 10.74) MB10.73 ± (10.73 - 10.73) MB-0.0%
runtime.dotnet.threads.count11 ± (11 - 11)11 ± (11 - 11)+0.0%
.NET 6 - CallTarget+Inlining+NGEN
process.internal_duration_ms206.33 ± (204.11 - 208.56) ms201.86 ± (199.77 - 203.95) ms-2.2%
process.time_to_main_ms473.56 ± (472.71 - 474.41) ms472.33 ± (471.58 - 473.08) ms-0.3%
runtime.dotnet.exceptions.count0 ± (0 - 0)0 ± (0 - 0)+0.0%
runtime.dotnet.mem.committed49.03 ± (48.97 - 49.10) MB48.96 ± (48.90 - 49.03) MB-0.1%
runtime.dotnet.threads.count29 ± (29 - 29)29 ± (29 - 29)-0.0%
.NET 8 - Baseline
process.internal_duration_ms18.85 ± (18.82 - 18.88) ms18.86 ± (18.83 - 18.88) ms+0.0%✅⬆️
process.time_to_main_ms68.63 ± (68.48 - 68.77) ms68.57 ± (68.45 - 68.70) ms-0.1%
runtime.dotnet.exceptions.count0 ± (0 - 0)0 ± (0 - 0)+0.0%
runtime.dotnet.mem.committed7.69 ± (7.68 - 7.69) MB7.66 ± (7.65 - 7.66) MB-0.4%
runtime.dotnet.threads.count10 ± (10 - 10)10 ± (10 - 10)+0.0%
.NET 8 - Bailout
process.internal_duration_ms18.88 ± (18.86 - 18.91) ms18.94 ± (18.91 - 18.97) ms+0.3%✅⬆️
process.time_to_main_ms69.67 ± (69.55 - 69.79) ms69.79 ± (69.66 - 69.91) ms+0.2%✅⬆️
runtime.dotnet.exceptions.count0 ± (0 - 0)0 ± (0 - 0)+0.0%
runtime.dotnet.mem.committed7.74 ± (7.73 - 7.75) MB7.72 ± (7.71 - 7.73) MB-0.3%
runtime.dotnet.threads.count11 ± (11 - 11)11 ± (11 - 11)+0.0%
.NET 8 - CallTarget+Inlining+NGEN
process.internal_duration_ms157.63 ± (156.77 - 158.48) ms156.24 ± (155.26 - 157.22) ms-0.9%
process.time_to_main_ms451.64 ± (451.12 - 452.16) ms450.27 ± (449.57 - 450.97) ms-0.3%
runtime.dotnet.exceptions.count0 ± (0 - 0)0 ± (0 - 0)+0.0%
runtime.dotnet.mem.committed36.53 ± (36.51 - 36.55) MB36.51 ± (36.48 - 36.53) MB-0.1%
runtime.dotnet.threads.count28 ± (28 - 28)28 ± (28 - 28)+0.0%✅⬆️

HttpMessageHandler

Metric Master (Mean ± 95% CI) Current (Mean ± 95% CI) Change Status
.NET Framework 4.8 - Baseline
duration195.58 ± (195.53 - 196.34) ms194.06 ± (194.00 - 194.86) ms-0.8%
.NET Framework 4.8 - Bailout
duration197.34 ± (197.14 - 197.92) ms197.48 ± (197.22 - 197.73) ms+0.1%✅⬆️
.NET Framework 4.8 - CallTarget+Inlining+NGEN
duration1143.98 ± (1144.38 - 1151.56) ms1149.86 ± (1150.60 - 1157.67) ms+0.5%✅⬆️
.NET Core 3.1 - Baseline
process.internal_duration_ms188.27 ± (187.94 - 188.59) ms189.22 ± (188.81 - 189.63) ms+0.5%✅⬆️
process.time_to_main_ms80.90 ± (80.73 - 81.08) ms81.58 ± (81.32 - 81.84) ms+0.8%✅⬆️
runtime.dotnet.exceptions.count3 ± (3 - 3)3 ± (3 - 3)+0.0%
runtime.dotnet.mem.committed16.14 ± (16.10 - 16.17) MB16.15 ± (16.13 - 16.18) MB+0.1%✅⬆️
runtime.dotnet.threads.count20 ± (19 - 20)20 ± (19 - 20)-0.0%
.NET Core 3.1 - Bailout
process.internal_duration_ms188.05 ± (187.62 - 188.48) ms188.40 ± (187.96 - 188.84) ms+0.2%✅⬆️
process.time_to_main_ms82.62 ± (82.44 - 82.80) ms82.69 ± (82.52 - 82.86) ms+0.1%✅⬆️
runtime.dotnet.exceptions.count3 ± (3 - 3)3 ± (3 - 3)+0.0%
runtime.dotnet.mem.committed16.09 ± (16.07 - 16.12) MB16.11 ± (16.09 - 16.14) MB+0.1%✅⬆️
runtime.dotnet.threads.count21 ± (20 - 21)21 ± (21 - 21)+0.4%✅⬆️
.NET Core 3.1 - CallTarget+Inlining+NGEN
process.internal_duration_ms397.34 ± (395.60 - 399.08) ms394.74 ± (392.70 - 396.79) ms-0.7%
process.time_to_main_ms474.27 ± (473.58 - 474.95) ms475.90 ± (475.26 - 476.53) ms+0.3%✅⬆️
runtime.dotnet.exceptions.count3 ± (3 - 3)3 ± (3 - 3)+0.0%
runtime.dotnet.mem.committed57.93 ± (57.78 - 58.08) MB57.78 ± (57.61 - 57.96) MB-0.2%
runtime.dotnet.threads.count30 ± (30 - 30)30 ± (30 - 30)-0.1%
.NET 6 - Baseline
process.internal_duration_ms192.60 ± (192.19 - 193.00) ms192.95 ± (192.58 - 193.31) ms+0.2%✅⬆️
process.time_to_main_ms70.45 ± (70.24 - 70.66) ms70.81 ± (70.62 - 71.00) ms+0.5%✅⬆️
runtime.dotnet.exceptions.count4 ± (4 - 4)4 ± (4 - 4)+0.0%
runtime.dotnet.mem.committed16.15 ± (16.02 - 16.28) MB16.17 ± (16.04 - 16.30) MB+0.1%✅⬆️
runtime.dotnet.threads.count18 ± (18 - 19)18 ± (18 - 19)+0.0%✅⬆️
.NET 6 - Bailout
process.internal_duration_ms191.69 ± (191.36 - 192.03) ms192.59 ± (192.26 - 192.92) ms+0.5%✅⬆️
process.time_to_main_ms71.18 ± (71.06 - 71.30) ms71.60 ± (71.48 - 71.73) ms+0.6%✅⬆️
runtime.dotnet.exceptions.count4 ± (4 - 4)4 ± (4 - 4)+0.0%
runtime.dotnet.mem.committed16.17 ± (16.04 - 16.31) MB16.03 ± (15.88 - 16.19) MB-0.9%
runtime.dotnet.threads.count20 ± (20 - 20)19 ± (19 - 20)-1.1%
.NET 6 - CallTarget+Inlining+NGEN
process.internal_duration_ms428.23 ± (426.42 - 430.04) ms428.71 ± (426.88 - 430.55) ms+0.1%✅⬆️
process.time_to_main_ms476.18 ± (475.42 - 476.94) ms478.91 ± (478.05 - 479.77) ms+0.6%✅⬆️
runtime.dotnet.exceptions.count4 ± (4 - 4)4 ± (4 - 4)+0.0%
runtime.dotnet.mem.committed60.22 ± (60.15 - 60.29) MB60.32 ± (60.26 - 60.38) MB+0.2%✅⬆️
runtime.dotnet.threads.count30 ± (30 - 30)30 ± (30 - 30)+0.0%✅⬆️
.NET 8 - Baseline
process.internal_duration_ms190.84 ± (190.41 - 191.27) ms190.84 ± (190.47 - 191.21) ms-0.0%
process.time_to_main_ms70.21 ± (70.00 - 70.43) ms70.31 ± (70.11 - 70.51) ms+0.1%✅⬆️
runtime.dotnet.exceptions.count4 ± (4 - 4)4 ± (4 - 4)+0.0%
runtime.dotnet.mem.committed11.77 ± (11.74 - 11.79) MB11.78 ± (11.76 - 11.80) MB+0.1%✅⬆️
runtime.dotnet.threads.count18 ± (18 - 18)18 ± (18 - 18)-0.4%
.NET 8 - Bailout
process.internal_duration_ms190.29 ± (189.98 - 190.61) ms190.98 ± (190.60 - 191.36) ms+0.4%✅⬆️
process.time_to_main_ms70.94 ± (70.81 - 71.06) ms71.35 ± (71.22 - 71.48) ms+0.6%✅⬆️
runtime.dotnet.exceptions.count4 ± (4 - 4)4 ± (4 - 4)+0.0%
runtime.dotnet.mem.committed11.85 ± (11.82 - 11.88) MB11.84 ± (11.81 - 11.87) MB-0.1%
runtime.dotnet.threads.count19 ± (19 - 19)19 ± (19 - 19)-0.6%
.NET 8 - CallTarget+Inlining+NGEN
process.internal_duration_ms351.91 ± (350.56 - 353.26) ms354.79 ± (353.54 - 356.05) ms+0.8%✅⬆️
process.time_to_main_ms452.69 ± (452.09 - 453.29) ms455.80 ± (455.21 - 456.39) ms+0.7%✅⬆️
runtime.dotnet.exceptions.count4 ± (4 - 4)4 ± (4 - 4)+0.0%
runtime.dotnet.mem.committed48.41 ± (48.36 - 48.45) MB48.38 ± (48.34 - 48.42) MB-0.1%
runtime.dotnet.threads.count30 ± (30 - 30)30 ± (30 - 30)-0.3%
Comparison explanation

Execution-time benchmarks measure the whole time it takes to execute a program, and are intended to measure the one-off costs. Cases where the execution time results for the PR are worse than latest master results are highlighted in **red**. The following thresholds were used for comparing the execution times:

  • Welch test with statistical test for significance of 5%
  • Only results indicating a difference greater than 5% and 5 ms are considered.

Note that these results are based on a single point-in-time result for each branch. For full results, see the dashboard.

Graphs show the p99 interval based on the mean and StdDev of the test run, as well as the mean value of the run (shown as a diamond below the graph).

Duration charts
FakeDbCommand (.NET Framework 4.8)
gantt
    title Execution time (ms) FakeDbCommand (.NET Framework 4.8)
    dateFormat  x
    axisFormat %Q
    todayMarker off
    section Baseline
    This PR (7702) - mean (69ms)  : 67, 71
    master - mean (70ms)  : 66, 73

    section Bailout
    This PR (7702) - mean (73ms)  : 71, 75
    master - mean (73ms)  : 72, 74

    section CallTarget+Inlining+NGEN
    This PR (7702) - mean (1,047ms)  : 993, 1102
    master - mean (1,051ms)  : 995, 1106

Loading
FakeDbCommand (.NET Core 3.1)
gantt
    title Execution time (ms) FakeDbCommand (.NET Core 3.1)
    dateFormat  x
    axisFormat %Q
    todayMarker off
    section Baseline
    This PR (7702) - mean (108ms)  : 104, 111
    master - mean (107ms)  : 104, 111

    section Bailout
    This PR (7702) - mean (108ms)  : 107, 110
    master - mean (109ms)  : 107, 110

    section CallTarget+Inlining+NGEN
    This PR (7702) - mean (727ms)  : 708, 746
    master - mean (730ms)  : 711, 750

Loading
FakeDbCommand (.NET 6)
gantt
    title Execution time (ms) FakeDbCommand (.NET 6)
    dateFormat  x
    axisFormat %Q
    todayMarker off
    section Baseline
    This PR (7702) - mean (95ms)  : 93, 98
    master - mean (96ms)  : 93, 99

    section Bailout
    This PR (7702) - mean (97ms)  : 95, 98
    master - mean (96ms)  : 94, 98

    section CallTarget+Inlining+NGEN
    This PR (7702) - mean (703ms)  : 654, 752
    master - mean (709ms)  : 668, 751

Loading
FakeDbCommand (.NET 8)
gantt
    title Execution time (ms) FakeDbCommand (.NET 8)
    dateFormat  x
    axisFormat %Q
    todayMarker off
    section Baseline
    This PR (7702) - mean (94ms)  : 91, 97
    master - mean (94ms)  : 92, 97

    section Bailout
    This PR (7702) - mean (95ms)  : 93, 98
    master - mean (95ms)  : 93, 97

    section CallTarget+Inlining+NGEN
    This PR (7702) - mean (636ms)  : 620, 651
    master - mean (641ms)  : 615, 667

Loading
HttpMessageHandler (.NET Framework 4.8)
gantt
    title Execution time (ms) HttpMessageHandler (.NET Framework 4.8)
    dateFormat  x
    axisFormat %Q
    todayMarker off
    section Baseline
    This PR (7702) - mean (194ms)  : 190, 199
    master - mean (196ms)  : 191, 201

    section Bailout
    This PR (7702) - mean (197ms)  : 195, 200
    master - mean (198ms)  : 194, 201

    section CallTarget+Inlining+NGEN
    This PR (7702) - mean (1,154ms)  : 1103, 1205
    master - mean (1,148ms)  : 1097, 1199

Loading
HttpMessageHandler (.NET Core 3.1)
gantt
    title Execution time (ms) HttpMessageHandler (.NET Core 3.1)
    dateFormat  x
    axisFormat %Q
    todayMarker off
    section Baseline
    This PR (7702) - mean (279ms)  : 274, 285
    master - mean (277ms)  : 273, 282

    section Bailout
    This PR (7702) - mean (279ms)  : 273, 285
    master - mean (279ms)  : 274, 284

    section CallTarget+Inlining+NGEN
    This PR (7702) - mean (902ms)  : 878, 927
    master - mean (903ms)  : 876, 929

Loading
HttpMessageHandler (.NET 6)
gantt
    title Execution time (ms) HttpMessageHandler (.NET 6)
    dateFormat  x
    axisFormat %Q
    todayMarker off
    section Baseline
    This PR (7702) - mean (272ms)  : 268, 276
    master - mean (272ms)  : 266, 277

    section Bailout
    This PR (7702) - mean (273ms)  : 269, 277
    master - mean (271ms)  : 267, 275

    section CallTarget+Inlining+NGEN
    This PR (7702) - mean (941ms)  : 912, 970
    master - mean (936ms)  : 909, 963

Loading
HttpMessageHandler (.NET 8)
gantt
    title Execution time (ms) HttpMessageHandler (.NET 8)
    dateFormat  x
    axisFormat %Q
    todayMarker off
    section Baseline
    This PR (7702) - mean (271ms)  : 263, 279
    master - mean (271ms)  : 263, 278

    section Bailout
    This PR (7702) - mean (272ms)  : 266, 278
    master - mean (271ms)  : 266, 275

    section CallTarget+Inlining+NGEN
    This PR (7702) - mean (843ms)  : 814, 873
    master - mean (835ms)  : 814, 855

Loading

@pr-commenter

pr-commenter Bot commented Oct 23, 2025

Copy link
Copy Markdown

Benchmarks

Benchmark execution time: 2026-03-20 14:34:16

Comparing candidate commit d527467 in PR branch rob.carlan/kafka-cluster-id-tag-reflection with baseline commit b0dc7b1 in branch master.

Found 5 performance improvements and 10 performance regressions! Performance is the same for 251 metrics, 22 unstable metrics.

Explanation

This is an A/B test comparing a candidate commit's performance against that of a baseline commit. Performance changes are noted in the tables below as:

  • 🟩 = significantly better candidate vs. baseline
  • 🟥 = significantly worse candidate vs. baseline

We compute a confidence interval (CI) over the relative difference of means between metrics from the candidate and baseline commits, considering the baseline as the reference.

If the CI is entirely outside the configured SIGNIFICANT_IMPACT_THRESHOLD (or the deprecated UNCONFIDENCE_THRESHOLD), the change is considered significant.

Feel free to reach out to #apm-benchmarking-platform on Slack if you have any questions.

More details about the CI and significant changes

You can imagine this CI as a range of values that is likely to contain the true difference of means between the candidate and baseline commits.

CIs of the difference of means are often centered around 0%, because often changes are not that big:

---------------------------------(------|---^--------)-------------------------------->
                              -0.6%    0%  0.3%     +1.2%
                                 |          |        |
         lower bound of the CI --'          |        |
sample mean (center of the CI) -------------'        |
         upper bound of the CI ----------------------'

As described above, a change is considered significant if the CI is entirely outside the configured SIGNIFICANT_IMPACT_THRESHOLD (or the deprecated UNCONFIDENCE_THRESHOLD).

For instance, for an execution time metric, this confidence interval indicates a significantly worse performance:

----------------------------------------|---------|---(---------^---------)---------->
                                       0%        1%  1.3%      2.2%      3.1%
                                                  |   |         |         |
       significant impact threshold --------------'   |         |         |
                      lower bound of CI --------------'         |         |
       sample mean (center of the CI) --------------------------'         |
                      upper bound of CI ----------------------------------'

scenario:Benchmarks.Trace.ActivityBenchmark.StartStopWithChild netcoreapp3.1

  • 🟥 execution_time [+111.029ms; +112.823ms] or [+128.054%; +130.123%]

scenario:Benchmarks.Trace.AgentWriterBenchmark.WriteAndFlushEnrichedTraces net6.0

  • 🟩 execution_time [-103.461ms; -103.374ms] or [-50.207%; -50.165%]

scenario:Benchmarks.Trace.Asm.AppSecBodyBenchmark.AllCycleMoreComplexBody netcoreapp3.1

  • 🟥 execution_time [+10.973ms; +15.840ms] or [+5.538%; +7.994%]

scenario:Benchmarks.Trace.Asm.AppSecBodyBenchmark.ObjectExtractorSimpleBody netcoreapp3.1

  • 🟩 execution_time [-18.777ms; -12.801ms] or [-8.712%; -5.939%]
  • 🟩 throughput [+139554.415op/s; +210777.762op/s] or [+6.223%; +9.399%]

scenario:Benchmarks.Trace.Asm.AppSecEncoderBenchmark.EncodeLegacyArgs netcoreapp3.1

  • 🟩 execution_time [-21.797ms; -20.994ms] or [-10.849%; -10.450%]

scenario:Benchmarks.Trace.AspNetCoreBenchmark.SendRequest netcoreapp3.1

  • 🟥 execution_time [+68.466ms; +77.876ms] or [+59.385%; +67.547%]

scenario:Benchmarks.Trace.CIVisibilityProtocolWriterBenchmark.WriteAndFlushEnrichedTraces net472

  • 🟩 throughput [+119.976op/s; +143.472op/s] or [+11.601%; +13.873%]

scenario:Benchmarks.Trace.GraphQLBenchmark.ExecuteAsync netcoreapp3.1

  • 🟥 execution_time [+10.423ms; +13.835ms] or [+5.286%; +7.017%]

scenario:Benchmarks.Trace.ILoggerBenchmark.EnrichedLog net6.0

  • 🟥 execution_time [+15.377ms; +19.336ms] or [+7.829%; +9.844%]

scenario:Benchmarks.Trace.Iast.StringAspectsBenchmark.StringConcatAspectBenchmark net6.0

  • 🟥 allocated_mem [+14.625KB; +14.654KB] or [+5.671%; +5.682%]

scenario:Benchmarks.Trace.Iast.StringAspectsBenchmark.StringConcatAspectBenchmark netcoreapp3.1

  • 🟥 allocated_mem [+20.193KB; +20.220KB] or [+7.855%; +7.866%]

scenario:Benchmarks.Trace.SingleSpanAspNetCoreBenchmark.SingleSpanAspNetCore net6.0

  • 🟥 execution_time [+100.199ms; +101.308ms] or [+100.444%; +101.556%]

scenario:Benchmarks.Trace.SingleSpanAspNetCoreBenchmark.SingleSpanAspNetCore netcoreapp3.1

  • 🟥 throughput [-15715213.516op/s; -14556485.880op/s] or [-6.520%; -6.039%]

scenario:Benchmarks.Trace.SpanBenchmark.StartFinishTwoScopes net6.0

  • 🟥 execution_time [+15.486ms; +20.834ms] or [+7.733%; +10.404%]

@bouwkast bouwkast force-pushed the rob.carlan/kafka-cluster-id-tag-reflection branch from f83a435 to 3d16b25 Compare November 4, 2025 19:34
Extract Kafka cluster_id via AdminClient.DescribeClusterAsync using duck typing
and add it to span tags, DSM edge tags, and backlog tracking tags.

Changes:
- Add KafkaHelper.GetClusterId() using duck typed Confluent.Kafka AdminClient API
  with re-entrancy guard (AdminClient internally creates a Producer)
- Extend ConsumerCache/ProducerCache to store cluster_id alongside existing metadata
- Add ClusterId property to KafkaTags (tag: messaging.kafka.cluster_id)
- Include kafka_cluster_id in DSM edge tags for produce/consume checkpoints
- Include kafka_cluster_id in commit/produce backlog tracking tags
- Add duck type interfaces: IAdminClient, IAdminClientBuilder, IAdminClientConfig,
  IDescribeClusterResult with IDuckTypeTask for async handling
- Add unit tests for cache classes, KafkaTags, and GetClusterId edge cases

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@robcarlan-datadog robcarlan-datadog force-pushed the rob.carlan/kafka-cluster-id-tag-reflection branch from e3c11d3 to 04181fd Compare February 19, 2026 22:20
robcarlan-datadog and others added 4 commits February 20, 2026 13:01
- Fix DSM edge tags alphabetical sort: direction before kafka_cluster_id
- Guard kafka_cluster_id in OffsetsCommittedCallbacks with IsNullOrEmpty
  check, consistent with all other callsites
- Rename misleading test name to HandlesConnectionFailureGracefully

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The span_attribute_schema markdown files had formatting-only changes
(table alignment) that were accidentally included from WIP commits.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Verify the duck typing chain works for IAdminClientConfig, IAdminClientBuilder,
and check DescribeClusterAsync availability (requires Confluent.Kafka 2.0+).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@robcarlan-datadog robcarlan-datadog added area:data-streams-monitoring AI Generated Largely based on code generated by an AI or LLM. This label is the same across all dd-trace-* repos labels Feb 23, 2026
@robcarlan-datadog robcarlan-datadog changed the title Draft: .NET cluster id via reflection Support adding kafka_cluster_id for Confluent.Kafka Feb 24, 2026
robcarlan-datadog and others added 5 commits February 24, 2026 14:35
…st coverage

Pass the producer instance to TryInjectHeaders so it can query
ProducerCache directly for cluster_id, instead of casting span.Tags
back to KafkaTags. This matches how the consumer side uses
ConsumerCache. Also add cluster_id assertions to integration tests
and register the tag as optional in span metadata rules.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…tead of sender

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
{
_isGettingClusterId = true;

var configType = Type.GetType("Confluent.Kafka.AdminClientConfig, Confluent.Kafka");

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

afaik this is the only way to get this object, and the builder below

@robcarlan-datadog robcarlan-datadog marked this pull request as ready for review February 25, 2026 17:54
@robcarlan-datadog robcarlan-datadog requested review from a team as code owners February 25, 2026 17:54
robcarlan-datadog and others added 7 commits February 25, 2026 12:54
…lusterAsync

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…or cluster_id

Instead of creating a standalone AdminClient with a new connection to the broker,
use DependentAdminClientBuilder which reuses the producer/consumer's existing
connection. Move cluster_id resolution from OnMethodBegin to OnMethodEnd so the
client handle is available after construction.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
… fallback

Remove standalone AdminClient fallback and IAdminClientConfig since
DependentAdminClientBuilder is available in all supported Confluent.Kafka
versions (1.4.0+). Clean up constructor OnMethodEnd to use early returns.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Comment thread tracer/src/Datadog.Trace/ClrProfiler/AutoInstrumentation/Kafka/KafkaHelper.cs Outdated
Comment thread tracer/src/Datadog.Trace/ClrProfiler/AutoInstrumentation/Kafka/IClientHandle.cs Outdated
x => x.Tags[Tags.KafkaConsumerGroup] == "Samples.Kafka.AutoCommitConsumer1"
|| x.Tags[Tags.KafkaConsumerGroup] == "Samples.Kafka.ManualCommitConsumer2");

// cluster_id is only available with Confluent.Kafka 2.0+ (DescribeClusterAsync)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right... I forgot this 🤔 I think we should potentially split our integrations to account for this, because otherwise we're going to be doing expensive work which we know is going to fail when we try to describe the cluster.

There's two ways to do that, depending on how reliably we can detect what version we're in from the KafkaHelper:

  • Extract the common code out and create a duplicate of the integration
    • One integration targets 1.x.x only, and does the existing stuff
    • The other integration targets 2.x.x+, and includes the describe cluster work
  • Pro-actively try to grab the DescribeClusterOptions type - if it doesn't exist, there's no point trying to build and duck type an AdminClient, so we can (and should) just bail at that point. What's more, we know that can never work, so we should probably cache a null/"" value for the bootstrap servers (and potentially bypass that cache entirely - just always return null if that lookup of the Type has failed...)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the second approach in terms of simplicity, I'll do that. Flagging an early null/"" result in that case seems like a good thing to do

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(My original comment was wrong and the version requirement is actually > 2.3.0)

Comment thread tracer/src/Datadog.Trace/ClrProfiler/AutoInstrumentation/Kafka/KafkaHelper.cs Outdated
Comment thread tracer/src/Datadog.Trace/ClrProfiler/AutoInstrumentation/Kafka/KafkaHelper.cs Outdated

@zacharycmontoya zacharycmontoya left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've added a couple of comments but I think Andrew already did a very comprehensive review. Looking forward to changes based on PR feedback, but in general I think this is looking pretty promising

robcarlan-datadog and others added 3 commits March 11, 2026 16:27
- Use inline ternary for backlog tags to avoid extra string allocations
- Replace string.IsNullOrEmpty with StringUtil.IsNullOrEmpty throughout
- Simplify GetClusterId with direct DuckCast and using statement
- Accept nullable clusterId in ConsumerCache.SetConsumerGroup
- Use typed Config class instead of string[] in consumer constructor
- Add nullability annotations to duck typing interfaces

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…duck typing

Anchor reflection type lookups to the Confluent.Kafka assembly obtained from
the actual client instance, avoiding fragile global assembly searches. Replace
raw reflection property access on DescribeClusterOptions with a duck type.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@robcarlan-datadog robcarlan-datadog force-pushed the rob.carlan/kafka-cluster-id-tag-reflection branch 4 times, most recently from 79d465e to 3962123 Compare March 13, 2026 14:11
…duck typing

Anchor reflection type lookups to the Confluent.Kafka assembly obtained from
the actual client instance, avoiding fragile global assembly searches. Replace
raw reflection property access on DescribeClusterOptions with a duck type.
Early-return and cache empty result for Confluent.Kafka < 2.3.0 where
DescribeClusterAsync is unavailable.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@robcarlan-datadog robcarlan-datadog force-pushed the rob.carlan/kafka-cluster-id-tag-reflection branch from 3962123 to c6531c5 Compare March 13, 2026 14:14

@andrewlock andrewlock left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM overall - just a couple of final tweaks around logging and caching, that should probably be addressed, but otherwise looking good!

Comment thread tracer/src/Datadog.Trace/ClrProfiler/AutoInstrumentation/Kafka/KafkaHelper.cs Outdated
Comment thread tracer/src/Datadog.Trace/ClrProfiler/AutoInstrumentation/Kafka/KafkaHelper.cs Outdated
Comment thread tracer/src/Datadog.Trace/ClrProfiler/AutoInstrumentation/Kafka/KafkaHelper.cs Outdated
Comment thread tracer/src/Datadog.Trace/ClrProfiler/AutoInstrumentation/Kafka/KafkaHelper.cs Outdated

@zacharycmontoya zacharycmontoya left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Aside from the small comments, LGTM!

@robcarlan-datadog robcarlan-datadog merged commit 6f4edac into master Mar 20, 2026
140 checks passed
@robcarlan-datadog robcarlan-datadog deleted the rob.carlan/kafka-cluster-id-tag-reflection branch March 20, 2026 16:33
@github-actions github-actions Bot added this to the vNext-v3 milestone Mar 20, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

AI Generated Largely based on code generated by an AI or LLM. This label is the same across all dd-trace-* repos area:data-streams-monitoring

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants