Skip OTLP exporter errors from being sent to telemetry#8133
Skip OTLP exporter errors from being sent to telemetry#8133
Conversation
They all seem to be network-related
BenchmarksBenchmark execution time: 2026-02-03 20:38:17 Comparing candidate commit efa17bc in PR branch Found 2 performance improvements and 5 performance regressions! Performance is the same for 169 metrics, 16 unstable metrics. scenario:Benchmarks.Trace.CharSliceBenchmark.OptimizedCharSliceWithPool net6.0
scenario:Benchmarks.Trace.ElasticsearchBenchmark.CallElasticsearch net6.0
scenario:Benchmarks.Trace.GraphQLBenchmark.ExecuteAsync net472
scenario:Benchmarks.Trace.GraphQLBenchmark.ExecuteAsync netcoreapp3.1
scenario:Benchmarks.Trace.SpanBenchmark.StartFinishSpan net6.0
scenario:Benchmarks.Trace.TraceAnnotationsBenchmark.RunOnMethodBegin net6.0
|
andrewlock
left a comment
There was a problem hiding this comment.
Thanks, I think most of these Should stay as errors, but mostly because AFAICT, they never occur. And we're missing some logs in some places we should have them I think
| // Seeing network connectivity errors so skipping telemetry | ||
| Log.ErrorSkipTelemetry(ex, "Error sending OTLP request (attempt {Attempt})", (attempt + 1).ToString()); |
There was a problem hiding this comment.
Arguably, this shouldn't be an error at all IMO. Transient errors are expected, so this almost certainly shouldn't be logging an error at all. In Api for example, these "intermediate" attempts are Debug. Also nit, AI did the ToString() I assume 🙄
| // Seeing network connectivity errors so skipping telemetry | |
| Log.ErrorSkipTelemetry(ex, "Error sending OTLP request (attempt {Attempt})", (attempt + 1).ToString()); | |
| Log.Debug<int>(ex, "Error sending OTLP request (attempt {Attempt})", attempt + 1);``` |
There was a problem hiding this comment.
Ah yes, I ran my /analyze-error on it, forgot to add the label
There was a problem hiding this comment.
I'll swap these over thanks!
Co-authored-by: Andrew Lock <andrew.lock@datadoghq.com>
Execution-Time Benchmarks Report ⏱️Execution-time results for samples comparing This PR (8133) and master. ✅ No regressions detected - check the details below Full Metrics ComparisonFakeDbCommand
HttpMessageHandler
Comparison explanationExecution-time benchmarks measure the whole time it takes to execute a program, and are intended to measure the one-off costs. Cases where the execution time results for the PR are worse than latest master results are highlighted in **red**. The following thresholds were used for comparing the execution times:
Note that these results are based on a single point-in-time result for each branch. For full results, see the dashboard. Graphs show the p99 interval based on the mean and StdDev of the test run, as well as the mean value of the run (shown as a diamond below the graph). Duration chartsFakeDbCommand (.NET Framework 4.8)gantt
title Execution time (ms) FakeDbCommand (.NET Framework 4.8)
dateFormat x
axisFormat %Q
todayMarker off
section Baseline
This PR (8133) - mean (80ms) : 78, 82
master - mean (79ms) : 77, 82
section Bailout
This PR (8133) - mean (85ms) : 79, 91
master - mean (85ms) : 83, 86
section CallTarget+Inlining+NGEN
This PR (8133) - mean (1,124ms) : 1068, 1181
master - mean (1,122ms) : 1067, 1178
FakeDbCommand (.NET Core 3.1)gantt
title Execution time (ms) FakeDbCommand (.NET Core 3.1)
dateFormat x
axisFormat %Q
todayMarker off
section Baseline
This PR (8133) - mean (137ms) : 133, 141
master - mean (126ms) : 123, 129
section Bailout
This PR (8133) - mean (139ms) : crit, 137, 142
master - mean (126ms) : 122, 130
section CallTarget+Inlining+NGEN
This PR (8133) - mean (904ms) : crit, 855, 953
master - mean (808ms) : 755, 860
FakeDbCommand (.NET 6)gantt
title Execution time (ms) FakeDbCommand (.NET 6)
dateFormat x
axisFormat %Q
todayMarker off
section Baseline
This PR (8133) - mean (112ms) : 109, 116
master - mean (111ms) : 108, 114
section Bailout
This PR (8133) - mean (115ms) : 112, 117
master - mean (112ms) : 110, 114
section CallTarget+Inlining+NGEN
This PR (8133) - mean (790ms) : 731, 849
master - mean (784ms) : 717, 851
FakeDbCommand (.NET 8)gantt
title Execution time (ms) FakeDbCommand (.NET 8)
dateFormat x
axisFormat %Q
todayMarker off
section Baseline
This PR (8133) - mean (118ms) : 113, 122
master - mean (107ms) : 103, 111
section Bailout
This PR (8133) - mean (119ms) : crit, 115, 123
master - mean (108ms) : 105, 111
section CallTarget+Inlining+NGEN
This PR (8133) - mean (776ms) : crit, 752, 800
master - mean (723ms) : 698, 749
HttpMessageHandler (.NET Framework 4.8)gantt
title Execution time (ms) HttpMessageHandler (.NET Framework 4.8)
dateFormat x
axisFormat %Q
todayMarker off
section Baseline
This PR (8133) - mean (212ms) : 201, 222
master - mean (204ms) : 196, 212
section Bailout
This PR (8133) - mean (211ms) : 205, 217
master - mean (207ms) : 200, 213
section CallTarget+Inlining+NGEN
This PR (8133) - mean (1,207ms) : 1139, 1275
master - mean (1,191ms) : 1143, 1239
HttpMessageHandler (.NET Core 3.1)gantt
title Execution time (ms) HttpMessageHandler (.NET Core 3.1)
dateFormat x
axisFormat %Q
todayMarker off
section Baseline
This PR (8133) - mean (305ms) : 284, 326
master - mean (296ms) : 283, 309
section Bailout
This PR (8133) - mean (303ms) : 283, 324
master - mean (296ms) : 288, 305
section CallTarget+Inlining+NGEN
This PR (8133) - mean (1,003ms) : 952, 1054
master - mean (972ms) : 910, 1033
HttpMessageHandler (.NET 6)gantt
title Execution time (ms) HttpMessageHandler (.NET 6)
dateFormat x
axisFormat %Q
todayMarker off
section Baseline
This PR (8133) - mean (310ms) : 292, 327
master - mean (290ms) : 276, 304
section Bailout
This PR (8133) - mean (313ms) : crit, 293, 333
master - mean (294ms) : 284, 305
section CallTarget+Inlining+NGEN
This PR (8133) - mean (1,050ms) : crit, 997, 1104
master - mean (979ms) : 906, 1053
HttpMessageHandler (.NET 8)gantt
title Execution time (ms) HttpMessageHandler (.NET 8)
dateFormat x
axisFormat %Q
todayMarker off
section Baseline
This PR (8133) - mean (302ms) : 289, 314
master - mean (291ms) : 281, 300
section Bailout
This PR (8133) - mean (307ms) : 286, 329
master - mean (292ms) : 281, 302
section CallTarget+Inlining+NGEN
This PR (8133) - mean (972ms) : 862, 1083
master - mean (993ms) : 877, 1108
|
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
## Summary of changes Updates `/analyze-error` to not update all `Log.Error` calls in the to `Log.ErrorSkipTelemetry` as it did in #8133 Added some more generic logging considerations to `AGENTS.md` ## Reason for change The `/analyze-error` command was instructing to update all instances of `Log.Error` instead of just the correct / expected one. AGENTS.md had some logging information but I think it lacked some specificity, so updating very basically with some "best practices" ## Implementation details ## Test coverage ## Other details <!-- Fixes #{issue} --> I reran the `/analyze-error` command with these changes again on the same error I did with #8133 and with both of these it changed to this following diff. It isn't _perfect_ but at least it isn't super wrong, it also changed the `ToString` used to a generic call. Note that the initial change was 5ab5b10 ```diff diff --git a/tracer/src/Datadog.Trace/OpenTelemetry/Metrics/OtlpExporter.cs b/tracer/src/Datadog.Trace/OpenTelemetry/Metrics/OtlpExporter.cs index e90bee8..8c2acd3c2 100644 --- a/tracer/src/Datadog.Trace/OpenTelemetry/Metrics/OtlpExporter.cs +++ b/tracer/src/Datadog.Trace/OpenTelemetry/Metrics/OtlpExporter.cs @@ -353,21 +353,27 @@ namespace Datadog.Trace.OpenTelemetry.Metrics } catch (TaskCanceledException) when (attempt < maxRetries) { + Log.Debug<int>("OTLP request timed out (attempt {Attempt}), retrying...", attempt + 1); + retryDelay = TimeSpan.FromMilliseconds((long)(retryDelay.TotalMilliseconds * 2)); + await Task.Delay(retryDelay).ConfigureAwait(false); + } + catch (TaskCanceledException ex) + { + // Final timeout after all retries exhausted + Log.ErrorSkipTelemetry<Uri, int>(ex, "OTLP request to {Endpoint} timed out after {Attempt} attempts. See https://docs.datadoghq.com/tracing/troubleshooting/ for troubleshooting.", _endpoint, attempt + 1); + return false; + } + catch (Exception ex) when (attempt < maxRetries) + { + Log.Debug<int>(ex, "Error sending OTLP request (attempt {Attempt}), retrying...", attempt + 1); retryDelay = TimeSpan.FromMilliseconds((long)(retryDelay.TotalMilliseconds * 2)); await Task.Delay(retryDelay).ConfigureAwait(false); } catch (Exception ex) { - Log.Error(ex, "Error sending OTLP request (attempt {Attempt})", (attempt + 1).ToString()); - if (attempt < maxRetries) - { - retryDelay = TimeSpan.FromMilliseconds((long)(retryDelay.TotalMilliseconds * 2)); - await Task.Delay(retryDelay).ConfigureAwait(false); - } - else - { - return false; - } + // Final failure after all retries exhausted + Log.ErrorSkipTelemetry<Uri, int>(ex, "Failed to send OTLP request to {Endpoint} after {Attempt} attempts. See https://docs.datadoghq.com/tracing/troubleshooting/ for troubleshooting.", _endpoint, attempt + 1); + return false; } } ``` <!--⚠️ Note: Where possible, please obtain 2 approvals prior to merging. Unless CODEOWNERS specifies otherwise, for external teams it is typically best to have one review from a team member, and one review from apm-dotnet. Trivial changes do not require 2 reviews. MergeQueue is NOT enabled in this repository. If you have write access to the repo, the PR has 1-2 approvals (see above), and all of the required checks have passed, you can use the Squash and Merge button to merge the PR. If you don't have write access, or you need help, reach out in the #apm-dotnet channel in Slack. --> --------- Co-authored-by: Andrew Lock <andrew.lock@datadoghq.com>
Summary of changes
Changes
Log.ErrortoLog.ErrorSkipTelemetryfor the OtlpExporter errors as they all seemed to be just standard transient network errors.Reason for change
Noticed a lot of new errors but they just seem to be transient networking errors.
https://app.datadoghq.com/error-tracking/issue/443da854-fe18-11f0-ae87-da7ad0900002
Implementation details
Changed
Log.ErrortoLog.ErrorSkipTelemetryTest coverage
Other details
Maybe it is better to keep it as is for the time being since it is so new?
Unsure honestly, fine either way. Easy to just mark the error ignored