Direct log submission#1945
Conversation
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
e380805 to
87409ee
Compare
| } | ||
|
|
||
| public async Task<IApiResponse> PostAsync(ArraySegment<byte> traces) | ||
| public async Task<IApiResponse> PostAsync(ArraySegment<byte> traces, string contentType) |
There was a problem hiding this comment.
Should we rename the traces parameter now that this isn't just traces?
| await memoryStream.FlushAsync().ConfigureAwait(false); | ||
| memoryStream.Seek(0, SeekOrigin.Begin); | ||
| var buffer = memoryStream.GetBuffer(); | ||
| _headers.Add("Content-Type", "application/json"); |
There was a problem hiding this comment.
I assume this was removed because it was redundant? and conflicted with the now dynamic contentType?
Edit: Oh I see it being set was moved to the PostSegmentAsync call.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
| /// <summary> | ||
| /// Configuration key for the transport to use for communicating with the trace agent. | ||
| /// Default value is <c>HTTP</c>. | ||
| /// Options available: <c>HTTP</c>,<c>TCP</c> |
There was a problem hiding this comment.
Where are the logs send? On the Logs side they have deprecated TCP, but maybe it makes sense in that case
There was a problem hiding this comment.
On the Logs side they have deprecated TCP
Oh, didn't realise that! I haven't actually added support for TCP yet anyway, so probably just best to remove all references to it.
| DirectLogSubmissionHost = source?.GetString(ConfigurationKeys.DirectLogSubmission.Host) | ||
| ?? HostMetadata.GetHost(); | ||
| DirectLogSubmissionSource = source?.GetString(ConfigurationKeys.DirectLogSubmission.Source) | ||
| ?? "csharp"; // default |
There was a problem hiding this comment.
[Nit] Out of curiosity, could be any language in .net, so why using "csharp" as a default?
There was a problem hiding this comment.
Anecdotally, csharp is about 99% of the market, so I think it makes sense 😉 It also mirrors the serilog sink behaviour that this was based on
There was a problem hiding this comment.
It's kind of a strange thing to even differentiate between the dotnet languages at this level as they can all intercall into eachother anyways... Like this code would have been from c#... client code could be any mix...
Is there any sort of precedent? Maybe in java? As they could have any number of JVM languages too.
There was a problem hiding this comment.
I totally agree in principal, I'm basing this on the logs documentation which describes the source as needing to be csharp https://docs.datadoghq.com/logs/log_collection/csharp/?tab=serilog#configure-your-datadog-agent I put it down to legacy choices way back!
87409ee to
d56d2bc
Compare
d56d2bc to
0fdde18
Compare
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
1201ddb to
f4992c2
Compare
| /// including the desire to send batches "when full" rather than continuing to buffer, and so-on. | ||
| /// </remarks> | ||
| internal class BatchedConnectionStatus | ||
| { |
There was a problem hiding this comment.
I don't think I am comfortable with the scale of the exponential backoff here.
We could be piling up for several minutes it seems like.
Minimally I think these retries should be capped to something like 4, and at best case they should be configurable.
There was a problem hiding this comment.
Yeah, 10 minutes does seem like a long time. That said, if there's no connection to be made, what's the best option? We have a size limit on the queue size, so we're not going to blow up indefinitely, and arguably those first logs when things started to go wrong might be more important than subsequent ones, so maybe better to hang onto those and drop new logs? Not sure what's best tbh, though agree configurable is a good way we can blame it on the user when it's wrong 😉
There was a problem hiding this comment.
Actually, looking through this again, there's significant overlap with the BatchedConnectionStatus and the LogsApi. Both are doing retries/exponential backoff, which we definitely don't want. That said, we're never throwing in DatadogSink.EmitBatch(), so none of the retry code in BatchedConnectionStatus will ever run
I'm inclined to do the following:
- Drop
BatchedConnectionStatusentirely - The
LogsApihas better knowledge about when a batch should be dropped etc, so I thinkSendLogsAsyncshould return a result with what to do (Success/DropBatch/DropQueue). That can bubble up viaEmitBatch()so we can removeBatchedConnectionStatus.
Alternatively, rip the whole thing out and do our own version 😬 The Serilog API is cleaner in some respects, as it separates the logic for when to retry etc from the implementors, but it's using exceptions for flow control, and we're handling that ourselves anyway.
What do you think?
- Rip out
BatchedConnectionStatusonly? - Tear it all down?
- Do 1. or 2. in a follow-up PR
There was a problem hiding this comment.
I'm okay with 1 coming out, do you think we should remove it now?
There was a problem hiding this comment.
Yeah, I can tear it out as part of this PR. I guess it just adds confusion for other people looking at it now anyway 👍
There was a problem hiding this comment.
Reworked the vast majority of the batching implementation in 2dfc44e.
- Dropped
BatchedConnectionStatus - Dropped
PeriodicTimer - Added
CircuitBreaker - Removed some of the complexity around starting/stopping the sink with a tcs. We could optimise it more, but seems sufficient for now to me.
In the previous version, everything was added to a giant (persistent) StringBuilder, we then call ToString(), and convert the string to a byte array (which is then thrown away). The string and utf8 array are ultimately garbage. In this version, we keep a smaller persistent StringBuilder (one for each log, instead of all logs), convert that to a string, and write the bytes to a persistent byte array. Only the intermediate string is garbage. The benefit should be that we have less chance of the string/byte array ending up on the LOH, and we can reuse the UTF8 array
…bled These are constants, and so really doesn't make sense to not be adding these automatically. Arguably, we could also be injecting span_id and trace_id, but that's probably something to think about if we end up replacing the logs injection in general.
It's not implemented anyway, and is now deprecated
As we don't just use it for sending traces any more
In case there are any issues loading the assemblies etc
Set the defaults just in tracer settings, to avoid conflicting values later
…egration. Shouldn't happen, but log a message, and write null instead.
Adding a circuit breaker, plus removing what feels like quite a lot of complexity
Helpful in upcoming changes, and why not
- Don't control log shipping via the "global" IntegrationIds. Having two ways to enable/disable the integration is confusing. Plus this change simplifies things, as we know the configuration at startup time. - Remove warning when logs injection disabled. We automatically add the values anyway, so doesn't matter - Cache the direct log submission settings and sink in the targets/appenders, seeing as none of the settings they use are currently configurable in code - Enabled some nullable ref types (as some things, e.g. the `LogFormatter` in ILogger, _may_ be null, so adds some sanity checks) - Add some try-catches around static initialization for belt-and-braces
We support NLog Logs injection from version 1.0.0.505, but currently only support direct log shipping from 2.1.0. This bumps the "default" package version to be 2.1.0 in the nlog tests
Recent benchmarking shows that we should prefer interfaces generally, so converted to that approach instead. Also tidied up some things, like removing unused properties, fixing namespaces etc.
The formatter test was actually testing Serilog, and it also wouldn't work correctly in some cases AFAICT.
bad556c to
64ef062
Compare
Code Coverage Report 📊✔️ Merging #1945 into master will not change line coverage
View the full report for further details: Datadog.Trace Breakdown ✔️
The following classes have significant coverage changes.
The following classes were added in #1945:
5 classes were removed from Datadog.Trace in #1945 View the full reports for further details: |
Benchmarks Report 🐌Benchmarks for #1945 compared to master:
The following thresholds were used for comparing the benchmark speeds:
Allocation changes below 0.5% are ignored. Benchmark detailsBenchmarks.Trace.AgentWriterBenchmark - Same speed ✔️ Fewer allocations 🎉
|
| Benchmark | Base Allocated | Diff Allocated | Change | Change % |
|---|---|---|---|---|
| Benchmarks.Trace.AgentWriterBenchmark.WriteAndFlushEnrichedTraces‑net472 | 3.17 KB | 3.16 KB | -3 B | -0.09% |
Raw results
| Branch | Method | Toolchain | Mean | StdError | StdDev | Gen 0 | Gen 1 | Gen 2 | Allocated |
|---|---|---|---|---|---|---|---|---|---|
| master | WriteAndFlushEnrichedTraces |
net472 | 998μs | 2.28μs | 8.54μs | 0 | 0 | 0 | 3.17 KB |
| master | WriteAndFlushEnrichedTraces |
netcoreapp3.1 | 785μs | 1.7μs | 6.57μs | 0 | 0 | 0 | 2.57 KB |
| #1945 | WriteAndFlushEnrichedTraces |
net472 | 983μs | 3.25μs | 12.6μs | 0 | 0 | 0 | 3.16 KB |
| #1945 | WriteAndFlushEnrichedTraces |
netcoreapp3.1 | 774μs | 1.97μs | 7.39μs | 0 | 0 | 0 | 2.57 KB |
Benchmarks.Trace.AspNetCoreBenchmark - Same speed ✔️ Same allocations ✔️
Raw results
| Branch | Method | Toolchain | Mean | StdError | StdDev | Gen 0 | Gen 1 | Gen 2 | Allocated |
|---|---|---|---|---|---|---|---|---|---|
| master | SendRequest |
net472 | 0ns | 0ns | 0ns | 0 | 0 | 0 | 0 b |
| master | SendRequest |
netcoreapp3.1 | 304μs | 1.61μs | 7.87μs | 0.147 | 0 | 0 | 19.59 KB |
| #1945 | SendRequest |
net472 | 0ns | 0ns | 0ns | 0 | 0 | 0 | 0 b |
| #1945 | SendRequest |
netcoreapp3.1 | 310μs | 1.26μs | 5.03μs | 0.15 | 0 | 0 | 19.59 KB |
Benchmarks.Trace.DbCommandBenchmark - Same speed ✔️ Same allocations ✔️
Raw results
| Branch | Method | Toolchain | Mean | StdError | StdDev | Gen 0 | Gen 1 | Gen 2 | Allocated |
|---|---|---|---|---|---|---|---|---|---|
| master | ExecuteNonQuery |
net472 | 1.77μs | 8.66ns | 35.7ns | 0.0904 | 0.000895 | 0 | 570 B |
| master | ExecuteNonQuery |
netcoreapp3.1 | 1.61μs | 9.33ns | 84.5ns | 0.00858 | 0 | 0 | 608 B |
| #1945 | ExecuteNonQuery |
net472 | 1.76μs | 8.96ns | 52.2ns | 0.0901 | 0.000867 | 0 | 570 B |
| #1945 | ExecuteNonQuery |
netcoreapp3.1 | 1.7μs | 7.58ns | 29.4ns | 0.00857 | 0 | 0 | 608 B |
Benchmarks.Trace.ElasticsearchBenchmark - Same speed ✔️ Same allocations ✔️
Raw results
| Branch | Method | Toolchain | Mean | StdError | StdDev | Gen 0 | Gen 1 | Gen 2 | Allocated |
|---|---|---|---|---|---|---|---|---|---|
| master | CallElasticsearch |
net472 | 2.65μs | 13ns | 55.2ns | 0.121 | 0 | 0 | 778 B |
| master | CallElasticsearch |
netcoreapp3.1 | 1.68μs | 6.55ns | 24.5ns | 0.0103 | 0 | 0 | 768 B |
| master | CallElasticsearchAsync |
net472 | 3.31μs | 17.5ns | 90.9ns | 0.142 | 0 | 0 | 915 B |
| master | CallElasticsearchAsync |
netcoreapp3.1 | 1.84μs | 8.37ns | 31.3ns | 0.0126 | 0 | 0 | 888 B |
| #1945 | CallElasticsearch |
net472 | 2.71μs | 13.8ns | 91.9ns | 0.121 | 0 | 0 | 778 B |
| #1945 | CallElasticsearch |
netcoreapp3.1 | 1.7μs | 9.19ns | 49.5ns | 0.0102 | 0 | 0 | 768 B |
| #1945 | CallElasticsearchAsync |
net472 | 3.32μs | 16.9ns | 75.8ns | 0.142 | 0 | 0 | 915 B |
| #1945 | CallElasticsearchAsync |
netcoreapp3.1 | 1.88μs | 7.64ns | 27.5ns | 0.0118 | 0 | 0 | 888 B |
Benchmarks.Trace.GraphQLBenchmark - Same speed ✔️ Same allocations ✔️
Raw results
| Branch | Method | Toolchain | Mean | StdError | StdDev | Gen 0 | Gen 1 | Gen 2 | Allocated |
|---|---|---|---|---|---|---|---|---|---|
| master | ExecuteAsync |
net472 | 3.19μs | 15.7ns | 64.7ns | 0.162 | 0 | 0 | 1.04 KB |
| master | ExecuteAsync |
netcoreapp3.1 | 1.97μs | 8.13ns | 29.3ns | 0.0145 | 0 | 0 | 1.01 KB |
| #1945 | ExecuteAsync |
net472 | 3.25μs | 12.2ns | 47.1ns | 0.161 | 0 | 0 | 1.04 KB |
| #1945 | ExecuteAsync |
netcoreapp3.1 | 2.07μs | 6.8ns | 26.3ns | 0.0135 | 0 | 0 | 1.01 KB |
Benchmarks.Trace.HttpClientBenchmark - Same speed ✔️ Same allocations ✔️
Raw results
| Branch | Method | Toolchain | Mean | StdError | StdDev | Gen 0 | Gen 1 | Gen 2 | Allocated |
|---|---|---|---|---|---|---|---|---|---|
| master | SendAsync |
net472 | 6.87μs | 37.5ns | 209ns | 0.351 | 0 | 0 | 2.24 KB |
| master | SendAsync |
netcoreapp3.1 | 4.57μs | 18.1ns | 70ns | 0.0295 | 0 | 0 | 2.14 KB |
| #1945 | SendAsync |
net472 | 6.84μs | 31.4ns | 122ns | 0.35 | 0 | 0 | 2.24 KB |
| #1945 | SendAsync |
netcoreapp3.1 | 4.51μs | 6.33ns | 22.8ns | 0.0297 | 0 | 0 | 2.14 KB |
Benchmarks.Trace.ILoggerBenchmark - Same speed ✔️ Same allocations ✔️
Raw results
| Branch | Method | Toolchain | Mean | StdError | StdDev | Gen 0 | Gen 1 | Gen 2 | Allocated |
|---|---|---|---|---|---|---|---|---|---|
| master | EnrichedLog |
net472 | 4.15μs | 18.9ns | 70.8ns | 0.22 | 0 | 0 | 1.41 KB |
| master | EnrichedLog |
netcoreapp3.1 | 3.31μs | 15.3ns | 59.2ns | 0.0213 | 0 | 0 | 1.49 KB |
| #1945 | EnrichedLog |
net472 | 4.31μs | 15.2ns | 56.7ns | 0.221 | 0 | 0 | 1.41 KB |
| #1945 | EnrichedLog |
netcoreapp3.1 | 3.37μs | 9.94ns | 37.2ns | 0.0198 | 0 | 0 | 1.49 KB |
Benchmarks.Trace.Log4netBenchmark - Same speed ✔️ More allocations ⚠️
More allocations ⚠️ in #1945
Benchmark
Base Allocated
Diff Allocated
Change
Change %
Benchmarks.Trace.Log4netBenchmark.EnrichedLog‑netcoreapp3.1
4.17 KB
4.17 KB
1 B
0.02%
| Benchmark | Base Allocated | Diff Allocated | Change | Change % |
|---|---|---|---|---|
| Benchmarks.Trace.Log4netBenchmark.EnrichedLog‑netcoreapp3.1 | 4.17 KB | 4.17 KB | 1 B | 0.02% |
Raw results
| Branch | Method | Toolchain | Mean | StdError | StdDev | Gen 0 | Gen 1 | Gen 2 | Allocated |
|---|---|---|---|---|---|---|---|---|---|
| master | EnrichedLog |
net472 | 179μs | 464ns | 1.74μs | 0.529 | 0.176 | 0 | 4.29 KB |
| master | EnrichedLog |
netcoreapp3.1 | 160μs | 734ns | 3.44μs | 0 | 0 | 0 | 4.17 KB |
| #1945 | EnrichedLog |
net472 | 180μs | 472ns | 1.7μs | 0.457 | 0.183 | 0 | 4.29 KB |
| #1945 | EnrichedLog |
netcoreapp3.1 | 156μs | 491ns | 1.9μs | 0 | 0 | 0 | 4.17 KB |
Benchmarks.Trace.NLogBenchmark - Same speed ✔️ Same allocations ✔️
Raw results
| Branch | Method | Toolchain | Mean | StdError | StdDev | Gen 0 | Gen 1 | Gen 2 | Allocated |
|---|---|---|---|---|---|---|---|---|---|
| master | EnrichedLog |
net472 | 8.52μs | 30.3ns | 117ns | 0.499 | 0 | 0 | 3.19 KB |
| master | EnrichedLog |
netcoreapp3.1 | 7.1μs | 38.3ns | 213ns | 0.0479 | 0 | 0 | 3.56 KB |
| #1945 | EnrichedLog |
net472 | 8.38μs | 29.4ns | 114ns | 0.497 | 0 | 0 | 3.19 KB |
| #1945 | EnrichedLog |
netcoreapp3.1 | 7.03μs | 39.5ns | 271ns | 0.0502 | 0 | 0 | 3.56 KB |
Benchmarks.Trace.RedisBenchmark - Same speed ✔️ Same allocations ✔️
Raw results
| Branch | Method | Toolchain | Mean | StdError | StdDev | Gen 0 | Gen 1 | Gen 2 | Allocated |
|---|---|---|---|---|---|---|---|---|---|
| master | SendReceive |
net472 | 2.59μs | 14ns | 81.8ns | 0.154 | 0 | 0 | 987 B |
| master | SendReceive |
netcoreapp3.1 | 2.09μs | 7.27ns | 34.1ns | 0.0137 | 0 | 0 | 984 B |
| #1945 | SendReceive |
net472 | 2.51μs | 8.79ns | 34ns | 0.154 | 0 | 0 | 987 B |
| #1945 | SendReceive |
netcoreapp3.1 | 2.13μs | 10.7ns | 50.4ns | 0.0138 | 0 | 0 | 984 B |
Benchmarks.Trace.SerilogBenchmark - Same speed ✔️ Same allocations ✔️
Raw results
| Branch | Method | Toolchain | Mean | StdError | StdDev | Gen 0 | Gen 1 | Gen 2 | Allocated |
|---|---|---|---|---|---|---|---|---|---|
| master | EnrichedLog |
net472 | 6.89μs | 33.4ns | 138ns | 0.285 | 0 | 0 | 1.83 KB |
| master | EnrichedLog |
netcoreapp3.1 | 5.77μs | 18.5ns | 71.5ns | 0.02 | 0 | 0 | 1.45 KB |
| #1945 | EnrichedLog |
net472 | 7.05μs | 33.5ns | 134ns | 0.283 | 0 | 0 | 1.83 KB |
| #1945 | EnrichedLog |
netcoreapp3.1 | 5.76μs | 6.34ns | 22.8ns | 0.02 | 0 | 0 | 1.45 KB |
Benchmarks.Trace.SpanBenchmark - Faster 🎉 Same allocations ✔️
Faster 🎉 in #1945
Benchmark
base/diff
Base Median (ns)
Diff Median (ns)
Modality
Benchmarks.Trace.SpanBenchmark.StartFinishScope‑net472
1.176
1,483.56
1,261.07
| Benchmark | base/diff | Base Median (ns) | Diff Median (ns) | Modality |
|---|---|---|---|---|
| Benchmarks.Trace.SpanBenchmark.StartFinishScope‑net472 | 1.176 | 1,483.56 | 1,261.07 |
Raw results
| Branch | Method | Toolchain | Mean | StdError | StdDev | Gen 0 | Gen 1 | Gen 2 | Allocated |
|---|---|---|---|---|---|---|---|---|---|
| master | StartFinishSpan |
net472 | 1.01μs | 4.98ns | 22.3ns | 0.0677 | 0 | 0 | 433 B |
| master | StartFinishSpan |
netcoreapp3.1 | 943ns | 3.63ns | 13.1ns | 0.00607 | 0 | 0 | 432 B |
| master | StartFinishScope |
net472 | 1.49μs | 7.6ns | 35.7ns | 0.08 | 0 | 0 | 514 B |
| master | StartFinishScope |
netcoreapp3.1 | 1.11μs | 5.12ns | 19.8ns | 0.00787 | 0 | 0 | 552 B |
| #1945 | StartFinishSpan |
net472 | 1.03μs | 3.27ns | 12.7ns | 0.0679 | 0 | 0 | 433 B |
| #1945 | StartFinishSpan |
netcoreapp3.1 | 977ns | 2.64ns | 10.2ns | 0.00591 | 0 | 0 | 432 B |
| #1945 | StartFinishScope |
net472 | 1.25μs | 4.94ns | 18.5ns | 0.0804 | 0 | 0 | 514 B |
| #1945 | StartFinishScope |
netcoreapp3.1 | 1.14μs | 3.12ns | 12.1ns | 0.00795 | 0 | 0 | 552 B |
|
Closing in favour of #2240, which is just a reworking of the same content to try and make it easier to review |
This PR adds support for directly shipping logs to the Datadog intake, so customers don't have to configure log tailing in the agent etc. Supports all our existing logs injection frameworks
Each of the commits is pretty well self contained, so may be best to review sequentially.
Approach
This essentially takes the same approach as the official Datadog Serilog Sink. In various places I deviated from their approach however, as there were various opportunities for performance improvements (not serializing too JSON, then back to Dynamic, and back to JSON etc).
I partially vendored-in the Serilog batching sink provider, removed the Serilog-isms, and created an abstraction around that. Similar to other components, we have a singleton
DatadogSink, (which implements the vendored batching sink) and is responsible for formatting and sending batches of logs. TheLogsApitype takes care of the transport (similar to theApiclass for traces). EDIT: much of the vendored code has been reworkedFor serialization, I used a similar formatting to the
CompactJsonFormatter, as that's supported in the logs documentation, serializing efficiently where possible. Without System.Text.Json's Utf8Writer, the best I could do to reduce memory impact was to reuse a stringbuilder, and have a persistent buffer for the UTF8 bytes. EDIT: Given System.Text.Json is built-in to netcore3.1+, I wonder if we should use that where available 🤔We use automatic instrumentation to add the
ILoggerProvider/SerilogSinkas required, which then farms off the logs to the singleton Serilog Sink.Current limitations
We could feasibly ship a PoC with that I think, but we may want to add TCP too. Shouldn't be too hard, but wasn't worth the effort right now I think.TCP has been deprecated, so that solves thatTheChanged with 2.0 with the introduction ofDirectLogSubmissioninstance is a singleton that's initialized with the profiler (as it requires some tracer settings etc, e.g. service names, disabled integrations). Much like runtime metrics, this cannot be reconfigured in code by creating a new Tracer, as that would cause significant additional complexity.TracerManagerThe batching options (batch size, buffer size etc) are currently hard coded but could easily be exposed as environment variablesAt Colin's request, made these configurable too.LogsApire-uses the same transports as the agent writer, but we're sending it over the public network, so there's a lot of failure points that could lead to timeouts being common. With the long timeouts we currently have, we could potentially end up buffering a lot of data. That's something we should probably address more generally though, rather than in this PR. EDIT: this has been partially addressed with the addition of theCircuitBreaker, though we should still reduce the HTTP timeouts IMOQuestions
DD_SITEsettingTesting
Added integration tests for each framework
Also did manual testing of each framework, and confirmed I can see the logs in the UI:
@DataDog/apm-dotnet