Direct log submission#2240
Conversation
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
pierotibou
left a comment
There was a problem hiding this comment.
LGTM. Only a small copy paste issue in LogFormatter.cs and nitpicking comments
kevingosse
left a comment
There was a problem hiding this comment.
Note that I've only reviewed the logic, I didn't check that you're using the correct hook in the log libraries.
| [DuckReverseMethod] | ||
| public DirectSubmissionLogger CreateLogger(string categoryName) | ||
| { | ||
| return _loggers.GetOrAdd(categoryName, CreateLoggerImplementation); |
There was a problem hiding this comment.
I don't know how often this is called, but I suggest caching the CreateLoggerImplementation into a field (populated by the constructor) to save an allocation at each invocation
There was a problem hiding this comment.
As I understand it, this is already cached by the caller of this function:
Also, note that this is returning a DirectSubmissionLogger parameterised on categoryName, so we would have to use a concurrent dictionary keyed on categoryName...
| { | ||
| private const string MessageTemplateKey = "{OriginalFormat}"; | ||
|
|
||
| public static string? FormatLogEvent<T>(LogFormatter logFormatter, in LogEntry<T> logEntry) |
There was a problem hiding this comment.
Good call on using the in keyword here 👍
| var haveEnv = false; | ||
| string? messageTemplate = null; | ||
|
|
||
| if (logEntry.State is { } state) |
| } | ||
| } | ||
|
|
||
| if (logEntry.ScopeProvider is { } scopeProvider) |
There was a problem hiding this comment.
#okstartingtofeelabitashamednow
| if (success) | ||
| { | ||
| status = _circuitBreaker.MarkSuccess(); | ||
| haveMultipleBatchesToSend = _waitingBatch.Count >= _batchSizeLimit; |
There was a problem hiding this comment.
_waitingBatch.Count can be equal to _batchSizeLimit but never bigger. I suppose you used >= as precaution.
There was a problem hiding this comment.
Yeah, exactly (actually this was part of the original Serilog code, but seemed a reasonable precaution to keep)
| return _circuitBreaker.MarkFailure(); | ||
| } | ||
| } | ||
| while (haveMultipleBatchesToSend); |
There was a problem hiding this comment.
Note that, if the application produces _batchSizeLimit logs faster than it takes to send them, we can get in a situation where we keep sending logs without ever waiting for _flushPeriod. I don't know if that's a problem (probably not).
There was a problem hiding this comment.
I'd argue that's a feature rather than a bug - if we need to constantly send logs to keep up with the app, then the alternative would effectively be a memory leak as we buffer more and more
| _queueLimit = queueLimit ?? Unbounded; | ||
| } | ||
|
|
||
| public int Count => _queue.Count; |
There was a problem hiding this comment.
Getting the count of a ConcurrentQueue is costly, we should remove the property so we're not tempted to use it
There was a problem hiding this comment.
Done in 8dc2c8d, and expose the _queue instead, so the tests still compile
| { } | ||
| finally | ||
| { | ||
| // prevent state corrupt while aborting |
There was a problem hiding this comment.
That's probably something I'll want to remove eventually. We already know that our tracer is not resilient to aborted threads, so that's really not bringing anything (and just preventing inlining because of the try/finally block)
| exception: null, | ||
| (s, ex) => $"This is {s}"); | ||
|
|
||
| var log = sink.Events.Should().ContainSingle().Subject; |
cf65229 to
1dcfa42
Compare
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
b441218 to
70f3b3f
Compare
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
70f3b3f to
59841f6
Compare
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
98eefec to
5676fd5
Compare
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
5676fd5 to
52b41eb
Compare
| { | ||
| internal static class LogsTransportStrategy | ||
| { | ||
| private static readonly IDatadogLogger Log = DatadogLogging.GetLoggerFor<Tracer>(); |
|
|
||
| try | ||
| { | ||
| // TODO: Metrics/Telemetry? |
There was a problem hiding this comment.
Did you want to act on this TODO?
| return true; | ||
| } | ||
|
|
||
| shouldRetry = response.StatusCode switch |
| } | ||
|
|
||
| private Task HandleCircuitStatus(CircuitStatus status) | ||
| { |
| { | ||
| public LockedTracerManager() | ||
| : base(null, null, null, null, null, null, null) | ||
| : base(null, null, null, null, null, null, null, null) |
Previous logic didn't guarantee we would allocate an array large enough
With the previous implementation, we were allocating a List<object>() for every struct, even if the list was never used (it's only used when there are non-dictionary scope values added using e.g. `ILogger.BeginScope("SomeValue")`)
With the new approach, we lazily allocate the list on first usage instead
52b41eb to
5ea0b65
Compare
Code Coverage Report 📊✔️ Merging #2240 into master will not change line coverage
View the full report for further details: Datadog.Trace Breakdown ✔️
The following classes have significant coverage changes.
The following classes were added in #2240: View the full reports for further details: |
Benchmarks Report 🐌Benchmarks for #2240 compared to master:
The following thresholds were used for comparing the benchmark speeds:
Allocation changes below 0.5% are ignored. Benchmark detailsBenchmarks.Trace.AgentWriterBenchmark - Same speed ✔️ Same allocations ✔️Raw results
Benchmarks.Trace.AspNetCoreBenchmark - Same speed ✔️ Same allocations ✔️Raw results
Benchmarks.Trace.DbCommandBenchmark - Same speed ✔️ Same allocations ✔️Raw results
Benchmarks.Trace.ElasticsearchBenchmark - Slower
|
| Benchmark | diff/base | Base Median (ns) | Diff Median (ns) | Modality |
|---|---|---|---|---|
| Benchmarks.Trace.ElasticsearchBenchmark.CallElasticsearch‑net472 | 1.209 | 2,503.40 | 3,025.55 |
Raw results
| Branch | Method | Toolchain | Mean | StdError | StdDev | Gen 0 | Gen 1 | Gen 2 | Allocated |
|---|---|---|---|---|---|---|---|---|---|
| master | CallElasticsearch |
net472 | 2.53μs | 14.2ns | 89.7ns | 0.122 | 0 | 0 | 778 B |
| master | CallElasticsearch |
netcoreapp3.1 | 1.75μs | 5.86ns | 21.1ns | 0.0103 | 0 | 0 | 768 B |
| master | CallElasticsearchAsync |
net472 | 2.74μs | 7.56ns | 27.3ns | 0.142 | 0 | 0 | 915 B |
| master | CallElasticsearchAsync |
netcoreapp3.1 | 1.97μs | 9.15ns | 34.2ns | 0.0121 | 0 | 0 | 888 B |
| #2240 | CallElasticsearch |
net472 | 3.04μs | 13.8ns | 53.6ns | 0.12 | 0 | 0 | 778 B |
| #2240 | CallElasticsearch |
netcoreapp3.1 | 1.74μs | 6.02ns | 22.5ns | 0.0112 | 0 | 0 | 768 B |
| #2240 | CallElasticsearchAsync |
net472 | 2.9μs | 13.1ns | 50.8ns | 0.142 | 0 | 0 | 915 B |
| #2240 | CallElasticsearchAsync |
netcoreapp3.1 | 1.91μs | 5.8ns | 21.7ns | 0.0121 | 0 | 0 | 888 B |
Benchmarks.Trace.GraphQLBenchmark - Slower ⚠️ Same allocations ✔️
Slower ⚠️ in #2240
Benchmark
diff/base
Base Median (ns)
Diff Median (ns)
Modality
Benchmarks.Trace.GraphQLBenchmark.ExecuteAsync‑net472
1.413
3,372.64
4,763.98
several?
| Benchmark | diff/base | Base Median (ns) | Diff Median (ns) | Modality |
|---|---|---|---|---|
| Benchmarks.Trace.GraphQLBenchmark.ExecuteAsync‑net472 | 1.413 | 3,372.64 | 4,763.98 | several? |
Raw results
| Branch | Method | Toolchain | Mean | StdError | StdDev | Gen 0 | Gen 1 | Gen 2 | Allocated |
|---|---|---|---|---|---|---|---|---|---|
| master | ExecuteAsync |
net472 | 3.38μs | 17.4ns | 107ns | 0.161 | 0 | 0 | 1.04 KB |
| master | ExecuteAsync |
netcoreapp3.1 | 2.03μs | 9.4ns | 36.4ns | 0.014 | 0 | 0 | 1.01 KB |
| #2240 | ExecuteAsync |
net472 | 4.69μs | 36.8ns | 368ns | 0.162 | 0 | 0 | 1.04 KB |
| #2240 | ExecuteAsync |
netcoreapp3.1 | 2.12μs | 12.2ns | 92.9ns | 0.0148 | 0 | 0 | 1.01 KB |
Benchmarks.Trace.HttpClientBenchmark - Same speed ✔️ Same allocations ✔️
Raw results
| Branch | Method | Toolchain | Mean | StdError | StdDev | Gen 0 | Gen 1 | Gen 2 | Allocated |
|---|---|---|---|---|---|---|---|---|---|
| master | SendAsync |
net472 | 6.85μs | 38.8ns | 298ns | 0.348 | 0 | 0 | 2.24 KB |
| master | SendAsync |
netcoreapp3.1 | 4.83μs | 26.6ns | 206ns | 0.0289 | 0 | 0 | 2.14 KB |
| #2240 | SendAsync |
net472 | 6.55μs | 34.5ns | 165ns | 0.348 | 0 | 0 | 2.24 KB |
| #2240 | SendAsync |
netcoreapp3.1 | 4.89μs | 23.2ns | 93ns | 0.0296 | 0 | 0 | 2.14 KB |
Benchmarks.Trace.ILoggerBenchmark - Same speed ✔️ Same allocations ✔️
Raw results
| Branch | Method | Toolchain | Mean | StdError | StdDev | Gen 0 | Gen 1 | Gen 2 | Allocated |
|---|---|---|---|---|---|---|---|---|---|
| master | EnrichedLog |
net472 | 3.6μs | 17.4ns | 69.7ns | 0.22 | 0 | 0 | 1.41 KB |
| master | EnrichedLog |
netcoreapp3.1 | 3.18μs | 11.5ns | 41.6ns | 0.0209 | 0 | 0 | 1.49 KB |
| #2240 | EnrichedLog |
net472 | 3.82μs | 17.1ns | 66.4ns | 0.22 | 0 | 0 | 1.41 KB |
| #2240 | EnrichedLog |
netcoreapp3.1 | 3.25μs | 13.8ns | 53.3ns | 0.0212 | 0 | 0 | 1.49 KB |
Benchmarks.Trace.Log4netBenchmark - Same speed ✔️ Same allocations ✔️
Raw results
| Branch | Method | Toolchain | Mean | StdError | StdDev | Gen 0 | Gen 1 | Gen 2 | Allocated |
|---|---|---|---|---|---|---|---|---|---|
| master | EnrichedLog |
net472 | 216μs | 911ns | 3.53μs | 0.422 | 0.105 | 0 | 4.29 KB |
| master | EnrichedLog |
netcoreapp3.1 | 191μs | 495ns | 1.85μs | 0 | 0 | 0 | 4.17 KB |
| #2240 | EnrichedLog |
net472 | 218μs | 888ns | 3.66μs | 0.43 | 0.108 | 0 | 4.29 KB |
| #2240 | EnrichedLog |
netcoreapp3.1 | 200μs | 1.09μs | 5.95μs | 0 | 0 | 0 | 4.17 KB |
Benchmarks.Trace.NLogBenchmark - Same speed ✔️ Same allocations ✔️
Raw results
| Branch | Method | Toolchain | Mean | StdError | StdDev | Gen 0 | Gen 1 | Gen 2 | Allocated |
|---|---|---|---|---|---|---|---|---|---|
| master | EnrichedLog |
net472 | 8.29μs | 36.1ns | 135ns | 0.497 | 0 | 0 | 3.19 KB |
| master | EnrichedLog |
netcoreapp3.1 | 6.84μs | 28.3ns | 110ns | 0.0508 | 0 | 0 | 3.56 KB |
| #2240 | EnrichedLog |
net472 | 8.12μs | 43.9ns | 240ns | 0.496 | 0 | 0 | 3.19 KB |
| #2240 | EnrichedLog |
netcoreapp3.1 | 7.42μs | 31.1ns | 121ns | 0.0484 | 0 | 0 | 3.56 KB |
Benchmarks.Trace.RedisBenchmark - Same speed ✔️ Same allocations ✔️
Raw results
| Branch | Method | Toolchain | Mean | StdError | StdDev | Gen 0 | Gen 1 | Gen 2 | Allocated |
|---|---|---|---|---|---|---|---|---|---|
| master | SendReceive |
net472 | 2.42μs | 10.3ns | 35.8ns | 0.155 | 0 | 0 | 987 B |
| master | SendReceive |
netcoreapp3.1 | 2.18μs | 6.83ns | 25.6ns | 0.0136 | 0 | 0 | 984 B |
| #2240 | SendReceive |
net472 | 2.32μs | 8.56ns | 34.2ns | 0.154 | 0 | 0 | 987 B |
| #2240 | SendReceive |
netcoreapp3.1 | 2.2μs | 12.7ns | 105ns | 0.0141 | 0 | 0 | 984 B |
Benchmarks.Trace.SerilogBenchmark - Same speed ✔️ Same allocations ✔️
Raw results
| Branch | Method | Toolchain | Mean | StdError | StdDev | Gen 0 | Gen 1 | Gen 2 | Allocated |
|---|---|---|---|---|---|---|---|---|---|
| master | EnrichedLog |
net472 | 6.95μs | 30ns | 112ns | 0.284 | 0 | 0 | 1.83 KB |
| master | EnrichedLog |
netcoreapp3.1 | 5.77μs | 33.2ns | 276ns | 0.0192 | 0 | 0 | 1.45 KB |
| #2240 | EnrichedLog |
net472 | 6.86μs | 38.3ns | 251ns | 0.285 | 0 | 0 | 1.83 KB |
| #2240 | EnrichedLog |
netcoreapp3.1 | 5.42μs | 23.5ns | 88ns | 0.0188 | 0 | 0 | 1.45 KB |
Benchmarks.Trace.SpanBenchmark - Same speed ✔️ Same allocations ✔️
Raw results
| Branch | Method | Toolchain | Mean | StdError | StdDev | Gen 0 | Gen 1 | Gen 2 | Allocated |
|---|---|---|---|---|---|---|---|---|---|
| master | StartFinishSpan |
net472 | 995ns | 5.54ns | 33.2ns | 0.0676 | 0 | 0 | 433 B |
| master | StartFinishSpan |
netcoreapp3.1 | 928ns | 4.25ns | 15.9ns | 0.00596 | 0 | 0 | 432 B |
| master | StartFinishScope |
net472 | 1.17μs | 5.55ns | 21.5ns | 0.0804 | 0 | 0 | 514 B |
| master | StartFinishScope |
netcoreapp3.1 | 1.06μs | 4.09ns | 15.3ns | 0.00741 | 0 | 0 | 552 B |
| #2240 | StartFinishSpan |
net472 | 974ns | 5.13ns | 25.6ns | 0.068 | 0 | 0 | 433 B |
| #2240 | StartFinishSpan |
netcoreapp3.1 | 977ns | 4.59ns | 17.8ns | 0.00575 | 0 | 0 | 432 B |
| #2240 | StartFinishScope |
net472 | 1.2μs | 6.14ns | 26.8ns | 0.0802 | 0 | 0 | 514 B |
| #2240 | StartFinishScope |
netcoreapp3.1 | 1.16μs | 5.6ns | 21.7ns | 0.00778 | 0 | 0 | 552 B |
Add an integration for the [OpenAI library](https://github.com/openai/openai-python). This integration provides tracing for the completion, embeddings and chat completion endpoints along with cost estimation metrics and prompt/completion sampling logs. Each log, metric and trace are tagged with service, env, version, OpenAI model, OpenAI endpoint and OpenAI organization. [Docs preview](https://output.circle-artifacts.com/output/job/fe3599b8-952e-4ceb-ac4f-0f15503e9c0d/artifacts/0/tmp/docs/integrations.html#openai) ## Design ### Logs A new log writer implementation is added to submit logs. Logs are submitted direct to intake following a similar approach that [kyle-verhoog/datadog-python](https://github.com/Kyle-Verhoog/datadog-python/blob/main/datadog/_logging.py) and the [.NET tracer](DataDog/dd-trace-dotnet#2240) have taken already. ### Metrics A statsd client is used specifically for the OpenAI integration. ## Testing Testing is done using VCR to record requests made to OpenAI to ensure ease, consistency and reliability in test cases. Logs and metrics are tested using mocking of the clients. Several integration tests using snapshots and subprocess testing ensure that the integration works in a real world OpenAI application. A manual test app was also used: https://gist.github.com/Kyle-Verhoog/1f263ed0aade076b313167d1ba3bfa16 ## Risk - Currently, logs, traces and metrics are all collected, buffered and sent individually through their respective pipelines. Due to this, there is risk that disparity occurs between the tagging and submission of the data. There is also a performance risk as this data is not aggregated or batched when submitted. - Prompts and completions are captured on spans by default with a default limit on the length of the data. This limit only applies to each prompt/completion individually but requests can contain several prompts and completions. If there are many prompts and completions of great length then there is a risk of performance overhead of encoding and transmitting the data. - Logs and metrics clients are specified specifically for this integration. If another integration were to introduce logs then there would be a need for another log writer. Having several log writers could induce thread contention and high memory usage. Co-authored-by: Kyle Verhoog <kyle@verhoog.ca> Co-authored-by: Kari Halsted <12926135+kayayarai@users.noreply.github.com>
Add an integration for the [OpenAI library](https://github.com/openai/openai-python). This integration provides tracing for the completion, embeddings and chat completion endpoints along with cost estimation metrics and prompt/completion sampling logs. Each log, metric and trace are tagged with service, env, version, OpenAI model, OpenAI endpoint and OpenAI organization. [Docs preview](https://output.circle-artifacts.com/output/job/fe3599b8-952e-4ceb-ac4f-0f15503e9c0d/artifacts/0/tmp/docs/integrations.html#openai) ## Design ### Logs A new log writer implementation is added to submit logs. Logs are submitted direct to intake following a similar approach that [kyle-verhoog/datadog-python](https://github.com/Kyle-Verhoog/datadog-python/blob/main/datadog/_logging.py) and the [.NET tracer](DataDog/dd-trace-dotnet#2240) have taken already. ### Metrics A statsd client is used specifically for the OpenAI integration. ## Testing Testing is done using VCR to record requests made to OpenAI to ensure ease, consistency and reliability in test cases. Logs and metrics are tested using mocking of the clients. Several integration tests using snapshots and subprocess testing ensure that the integration works in a real world OpenAI application. A manual test app was also used: https://gist.github.com/Kyle-Verhoog/1f263ed0aade076b313167d1ba3bfa16 ## Risk - Currently, logs, traces and metrics are all collected, buffered and sent individually through their respective pipelines. Due to this, there is risk that disparity occurs between the tagging and submission of the data. There is also a performance risk as this data is not aggregated or batched when submitted. - Prompts and completions are captured on spans by default with a default limit on the length of the data. This limit only applies to each prompt/completion individually but requests can contain several prompts and completions. If there are many prompts and completions of great length then there is a risk of performance overhead of encoding and transmitting the data. - Logs and metrics clients are specified specifically for this integration. If another integration were to introduce logs then there would be a need for another log writer. Having several log writers could induce thread contention and high memory usage. Co-authored-by: Kyle Verhoog <kyle@verhoog.ca> Co-authored-by: Kari Halsted <12926135+kayayarai@users.noreply.github.com>
Add an integration for the [OpenAI library](https://github.com/openai/openai-python). This integration provides tracing for the completion, embeddings and chat completion endpoints along with cost estimation metrics and prompt/completion sampling logs. Each log, metric and trace are tagged with service, env, version, OpenAI model, OpenAI endpoint and OpenAI organization. [Docs preview](https://output.circle-artifacts.com/output/job/fe3599b8-952e-4ceb-ac4f-0f15503e9c0d/artifacts/0/tmp/docs/integrations.html#openai) ## Design ### Logs A new log writer implementation is added to submit logs. Logs are submitted direct to intake following a similar approach that [kyle-verhoog/datadog-python](https://github.com/Kyle-Verhoog/datadog-python/blob/main/datadog/_logging.py) and the [.NET tracer](DataDog/dd-trace-dotnet#2240) have taken already. ### Metrics A statsd client is used specifically for the OpenAI integration. ## Testing Testing is done using VCR to record requests made to OpenAI to ensure ease, consistency and reliability in test cases. Logs and metrics are tested using mocking of the clients. Several integration tests using snapshots and subprocess testing ensure that the integration works in a real world OpenAI application. A manual test app was also used: https://gist.github.com/Kyle-Verhoog/1f263ed0aade076b313167d1ba3bfa16 ## Risk - Currently, logs, traces and metrics are all collected, buffered and sent individually through their respective pipelines. Due to this, there is risk that disparity occurs between the tagging and submission of the data. There is also a performance risk as this data is not aggregated or batched when submitted. - Prompts and completions are captured on spans by default with a default limit on the length of the data. This limit only applies to each prompt/completion individually but requests can contain several prompts and completions. If there are many prompts and completions of great length then there is a risk of performance overhead of encoding and transmitting the data. - Logs and metrics clients are specified specifically for this integration. If another integration were to introduce logs then there would be a need for another log writer. Having several log writers could induce thread contention and high memory usage. Co-authored-by: lievan <42917263+lievan@users.noreply.github.com> Co-authored-by: Kari Halsted <12926135+kayayarai@users.noreply.github.com> Co-authored-by: Federico Mon <federico.mon@datadoghq.com>
This PR adds support for directly shipping logs to the Datadog intake, so customers don't have to configure log tailing in the agent etc. Supports all our existing logs injection frameworks
Approach
This takes the same overall approach as the official Datadog Serilog Sink, though it deviates in various places for performance and ergonomic reasons.
I initially partially vendored-in the Serilog batching sink provider, however subsequently reworked this quite significantly after feedback in #1945 (thanks Colin!), in particular to add circuit-breaker features to handle configuration errors. Similar to other components, we have a singleton
DatadogSink, (which implements the vendoredBatchingSink) and is responsible for formatting and sending batches of logs. TheLogsApitype takes care of the transport (similar to theApiclass for traces).For serialization, I used a similar formatting to the
CompactJsonFormatter, as that's supported in the logs documentation, serializing efficiently where possible. Without System.Text.Json'sUtf8Writer, the best I could do to reduce memory impact was to reuse aStringBuilder, and have a persistent buffer for the UTF8 bytes. Given System.Text.Json is built-in tonetcore3.1+, I wonder if we should use that where available 🤔We use automatic instrumentation to add the sinks as required, which then farms off the logs to the singleton
DatadogSink.Testing
Added integration tests for each framework
Also did manual testing of each framework, and confirmed I can see the logs in the UI:
@DataDog/apm-dotnet