Direct log submission by andrewlock · Pull Request #2240 · DataDog/dd-trace-dotnet

andrewlock · 2021-12-24T16:04:04Z

This PR is a refactoring of #1945, to try and make it easier to review. Each commit is self contained and (hopefully) understandable, so should make it easier to review them in order!🤞

This PR adds support for directly shipping logs to the Datadog intake, so customers don't have to configure log tailing in the agent etc. Supports all our existing logs injection frameworks

Microsoft.Extensions.Logging.ILogger
Serilog
log4Net
NLog 2.1+ (This is more restrictive than for logs injection, but I don't think it's worth the effort to go earlier, until we have telemetry that says otherwise)

Approach

This takes the same overall approach as the official Datadog Serilog Sink, though it deviates in various places for performance and ergonomic reasons.

I initially partially vendored-in the Serilog batching sink provider, however subsequently reworked this quite significantly after feedback in #1945 (thanks Colin!), in particular to add circuit-breaker features to handle configuration errors. Similar to other components, we have a singleton DatadogSink, (which implements the vendored BatchingSink) and is responsible for formatting and sending batches of logs. The LogsApi type takes care of the transport (similar to the Api class for traces).

For serialization, I used a similar formatting to the CompactJsonFormatter, as that's supported in the logs documentation, serializing efficiently where possible. Without System.Text.Json's Utf8Writer, the best I could do to reduce memory impact was to reuse a StringBuilder, and have a persistent buffer for the UTF8 bytes. Given System.Text.Json is built-in to netcore3.1+, I wonder if we should use that where available 🤔

We use automatic instrumentation to add the sinks as required, which then farms off the logs to the singleton DatadogSink.

Testing

Added integration tests for each framework

Tested that enabling log shipping has no impact on logs injection
Can receive logs, and they have the TWoL tags added correctly (where expected)

Also did manual testing of each framework, and confirmed I can see the logs in the UI:

@DataDog/apm-dotnet

pierotibou

LGTM. Only a small copy paste issue in LogFormatter.cs and nitpicking comments

kevingosse

Note that I've only reviewed the logic, I didn't check that you're using the correct hook in the log libraries.

kevingosse · 2021-12-30T10:52:24Z

+        [DuckReverseMethod]
+        public DirectSubmissionLogger CreateLogger(string categoryName)
+        {
+            return _loggers.GetOrAdd(categoryName, CreateLoggerImplementation);


I don't know how often this is called, but I suggest caching the CreateLoggerImplementation into a field (populated by the constructor) to save an allocation at each invocation

As I understand it, this is already cached by the caller of this function:

https://github.com/dotnet/runtime/blob/57bfe474518ab5b7cfe6bf7424a79ce3af9d6657/src/libraries/Microsoft.Extensions.Logging/src/LoggerFactory.cs#L128

Also, note that this is returning a DirectSubmissionLogger parameterised on categoryName, so we would have to use a concurrent dictionary keyed on categoryName...

kevingosse · 2021-12-30T10:57:20Z

+    {
+        private const string MessageTemplateKey = "{OriginalFormat}";
+
+        public static string? FormatLogEvent<T>(LogFormatter logFormatter, in LogEntry<T> logEntry)


Good call on using the in keyword here 👍

kevingosse · 2021-12-30T11:08:44Z

+            var haveEnv = false;
+            string? messageTemplate = null;
+
+            if (logEntry.State is { } state)


#sorrynotsorry

kevingosse · 2021-12-30T11:23:23Z

+                }
+            }
+
+            if (logEntry.ScopeProvider is { } scopeProvider)


:staaaaare:

#okstartingtofeelabitashamednow

kevingosse · 2021-12-30T14:20:44Z

+                    if (success)
+                    {
+                        status = _circuitBreaker.MarkSuccess();
+                        haveMultipleBatchesToSend = _waitingBatch.Count >= _batchSizeLimit;


_waitingBatch.Count can be equal to _batchSizeLimit but never bigger. I suppose you used >= as precaution.

Yeah, exactly (actually this was part of the original Serilog code, but seemed a reasonable precaution to keep)

kevingosse · 2021-12-30T14:24:21Z

+                        return _circuitBreaker.MarkFailure();
+                    }
+                }
+                while (haveMultipleBatchesToSend);


Note that, if the application produces _batchSizeLimit logs faster than it takes to send them, we can get in a situation where we keep sending logs without ever waiting for _flushPeriod. I don't know if that's a problem (probably not).

I'd argue that's a feature rather than a bug - if we need to constantly send logs to keep up with the app, then the alternative would effectively be a memory leak as we buffer more and more

kevingosse · 2021-12-30T14:27:24Z

+            _queueLimit = queueLimit ?? Unbounded;
+        }
+
+        public int Count => _queue.Count;


Getting the count of a ConcurrentQueue is costly, we should remove the property so we're not tempted to use it

Done in 8dc2c8d, and expose the _queue instead, so the tests still compile

kevingosse · 2021-12-30T14:33:28Z

+            { }
+            finally
+            {
+                // prevent state corrupt while aborting


That's probably something I'll want to remove eventually. We already know that our tracer is not resilient to aborted threads, so that's really not bringing anything (and just preventing inlining because of the try/finally block)

Removed in 8dc2c8d

kevingosse · 2021-12-30T14:55:46Z

+                exception: null,
+                (s, ex) => $"This is {s}");
+
+            var log = sink.Events.Should().ContainSingle().Subject;


colin-higgins · 2022-01-11T15:09:48Z

+{
+    internal static class LogsTransportStrategy
+    {
+        private static readonly IDatadogLogger Log = DatadogLogging.GetLoggerFor<Tracer>();


Logger for Tracer?

colin-higgins · 2022-01-11T15:13:04Z

+
+                    try
+                    {
+                        // TODO: Metrics/Telemetry?


Did you want to act on this TODO?

Not yet 😉

colin-higgins · 2022-01-11T15:13:33Z

+                            return true;
+                        }
+
+                        shouldRetry = response.StatusCode switch


I love this, nice job

colin-higgins · 2022-01-11T15:16:19Z

+        }
+
+        private Task HandleCircuitStatus(CircuitStatus status)
+        {


colin-higgins · 2022-01-11T15:18:21Z

        {
            public LockedTracerManager()
-                : base(null, null, null, null, null, null, null)
+                : base(null, null, null, null, null, null, null, null)


MY EYYYYEEES

Previous logic didn't guarantee we would allocate an array large enough

With the previous implementation, we were allocating a List<object>() for every struct, even if the list was never used (it's only used when there are non-dictionary scope values added using e.g. `ILogger.BeginScope("SomeValue")`) With the new approach, we lazily allocate the list on first usage instead

andrewlock · 2022-01-12T10:32:13Z

Code Coverage Report 📊

✔️ Merging #2240 into master will not change line coverage
⚠️ Merging #2240 into master will will decrease branch coverage by 1%
⛔ Merging #2240 into master will will increase complexity by 733

	master	#2240	Change
Lines	`9635` / `13524`	`10542` / `14815`
Lines %	`71%`	`71%`	`0%` ✔️
Branches	`4877` / `7011`	`5290` / `7667`
Branches %	`70%`	`69%`	`-1%` ⚠️
Complexity	`8173`	`8906`	`733` ⛔

View the full report for further details:

Datadog.Trace Breakdown ✔️

	master	#2240	Change
Lines %	`71%`	`71%`	`0%` ✔️
Branches %	`70%`	`69%`	`-1%` ⚠️
Complexity	`8173`	`8906`	`733` ⛔

The following classes have significant coverage changes.

File	Line coverage change	Branch coverage change	Complexity change
Datadog.Trace.ClrProfiler.AutoInstrumentation.Couchbase.IIOServiceExecuteIntegrationBis	`-100%` ⛔	`0%` ✔️	`0` ✔️
Datadog.Trace.ClrProfiler.AutoInstrumentation.Testing.MsTestV2.TestMethodRunnerExecuteIntegration	`-33%` ⛔	`-19%` ⛔	`0` ✔️
Datadog.Trace.Ci.Agent.CIAgentWriter	`-10%` ⛔	`-21%` ⛔	`0` ✔️
Datadog.Trace.ClrProfiler.AutoInstrumentation.Testing.NUnit.NUnitCompositeWorkItemSkipChildrenIntegration	`-8%` ⛔	`-8%` ⛔	`0` ✔️
Datadog.Trace.TracerManager	`0%` ✔️	`-6%` ⛔	`6` ⛔
[Datadog.Trace.ClrProfiler.CallTarget.Handlers.BeginMethodHandler`5](https://github.com/DataDog/dd-trace-dotnet/tree/f3cbc4242cf96ef0c7d27cd9d64bd987810d3452/tracer/src/Datadog.Trace/ClrProfiler/CallTarget/Handlers/BeginMethodHandler`3.cs)	`8%` ✔️	`10%` ✔️	`0` ✔️
Datadog.Trace.Util.ThrowHelper	`9%` ✔️	`0%` ✔️	`0` ✔️

The following classes were added in #2240:

File	Line coverage	Branch coverage	Complexity
Datadog.Trace.ClrProfiler.AutoInstrumentation.Logging.ILogger.DirectSubmission.DirectSubmissionLogger	`96%`	`88%`	`12`
Datadog.Trace.ClrProfiler.AutoInstrumentation.Logging.ILogger.DirectSubmission.DirectSubmissionLoggerProvider	`77%`	`100%`	`6`
Datadog.Trace.ClrProfiler.AutoInstrumentation.Logging.ILogger.DirectSubmission.Formatting.LogEntry`1	`100%`	`100%`	`1`
Datadog.Trace.ClrProfiler.AutoInstrumentation.Logging.ILogger.DirectSubmission.Formatting.LoggerLogFormatter	`95%`	`87%`	`41`
Datadog.Trace.ClrProfiler.AutoInstrumentation.Logging.ILogger.DirectSubmission.LoggerDatadogLogEvent	`100%`	`100%`	`2`
...And 36 more

View the full reports for further details:

andrewlock · 2022-01-12T11:04:16Z

Benchmarks Report 🐌

Benchmarks for #2240 compared to master:

2 benchmarks are slower, with geometric mean 1.307
All benchmarks have the same allocations

The following thresholds were used for comparing the benchmark speeds:

Mann–Whitney U test with statistical test for significance of 5%
Only results indicating a difference greater than 10% and 0.3 ns are considered.

Allocation changes below 0.5% are ignored.

Benchmark details

Benchmarks.Trace.AgentWriterBenchmark - Same speed ✔️ Same allocations ✔️

Raw results

Branch	Method	Toolchain	Mean	StdError	StdDev	Allocated
master	`WriteAndFlushEnrichedTraces`	net472	910μs	4.48μs	19μs	3.16 KB
master	`WriteAndFlushEnrichedTraces`	netcoreapp3.1	693μs	2.14μs	7.99μs	2.57 KB
#2240	`WriteAndFlushEnrichedTraces`	net472	919μs	4.57μs	19.4μs	3.16 KB
#2240	`WriteAndFlushEnrichedTraces`	netcoreapp3.1	754μs	3.77μs	16μs	2.57 KB

Benchmarks.Trace.AspNetCoreBenchmark - Same speed ✔️ Same allocations ✔️

Raw results

Branch	Method	Toolchain	Mean	StdError	StdDev	Gen 0	Allocated
master	`SendRequest`	net472	0.00519ns	0.00391ns	0.0241ns	0	0 b
master	`SendRequest`	netcoreapp3.1	340μs	1.23μs	4.77μs	0.168	19.59 KB
#2240	`SendRequest`	net472	0ns	0ns	0ns	0	0 b
#2240	`SendRequest`	netcoreapp3.1	350μs	1.68μs	6.49μs	0.174	19.59 KB

Benchmarks.Trace.DbCommandBenchmark - Same speed ✔️ Same allocations ✔️

Raw results

Branch	Method	Toolchain	Mean	StdError	StdDev	Gen 0	Gen 1	Allocated
master	`ExecuteNonQuery`	net472	1.79μs	10.4ns	88ns	0.09	0.000978	570 B
master	`ExecuteNonQuery`	netcoreapp3.1	1.61μs	5.38ns	18.6ns	0.00801	0	608 B
#2240	`ExecuteNonQuery`	net472	1.69μs	7.94ns	30.8ns	0.09	0.000833	570 B
#2240	`ExecuteNonQuery`	netcoreapp3.1	1.54μs	5.88ns	22.8ns	0.00823	0	608 B

Benchmarks.Trace.ElasticsearchBenchmark - Slower ⚠️ Same allocations ✔️

Slower ⚠️ in #2240

Benchmark	diff/base	Base Median (ns)	Diff Median (ns)	Modality
Benchmarks.Trace.ElasticsearchBenchmark.CallElasticsearch‑net472	1.209	2,503.40	3,025.55

Raw results

Branch	Method	Toolchain	Mean	StdError	StdDev	Gen 0	Allocated
master	`CallElasticsearch`	net472	2.53μs	14.2ns	89.7ns	0.122	778 B
master	`CallElasticsearch`	netcoreapp3.1	1.75μs	5.86ns	21.1ns	0.0103	768 B
master	`CallElasticsearchAsync`	net472	2.74μs	7.56ns	27.3ns	0.142	915 B
master	`CallElasticsearchAsync`	netcoreapp3.1	1.97μs	9.15ns	34.2ns	0.0121	888 B
#2240	`CallElasticsearch`	net472	3.04μs	13.8ns	53.6ns	0.12	778 B
#2240	`CallElasticsearch`	netcoreapp3.1	1.74μs	6.02ns	22.5ns	0.0112	768 B
#2240	`CallElasticsearchAsync`	net472	2.9μs	13.1ns	50.8ns	0.142	915 B
#2240	`CallElasticsearchAsync`	netcoreapp3.1	1.91μs	5.8ns	21.7ns	0.0121	888 B

Benchmarks.Trace.GraphQLBenchmark - Slower ⚠️ Same allocations ✔️

Slower ⚠️ in #2240

Benchmark	diff/base	Base Median (ns)	Diff Median (ns)	Modality
Benchmarks.Trace.GraphQLBenchmark.ExecuteAsync‑net472	1.413	3,372.64	4,763.98	several?

Raw results

Branch	Method	Toolchain	Mean	StdError	StdDev	Gen 0	Allocated
master	`ExecuteAsync`	net472	3.38μs	17.4ns	107ns	0.161	1.04 KB
master	`ExecuteAsync`	netcoreapp3.1	2.03μs	9.4ns	36.4ns	0.014	1.01 KB
#2240	`ExecuteAsync`	net472	4.69μs	36.8ns	368ns	0.162	1.04 KB
#2240	`ExecuteAsync`	netcoreapp3.1	2.12μs	12.2ns	92.9ns	0.0148	1.01 KB

Benchmarks.Trace.HttpClientBenchmark - Same speed ✔️ Same allocations ✔️

Raw results

Branch	Method	Toolchain	Mean	StdError	StdDev	Gen 0	Allocated
master	`SendAsync`	net472	6.85μs	38.8ns	298ns	0.348	2.24 KB
master	`SendAsync`	netcoreapp3.1	4.83μs	26.6ns	206ns	0.0289	2.14 KB
#2240	`SendAsync`	net472	6.55μs	34.5ns	165ns	0.348	2.24 KB
#2240	`SendAsync`	netcoreapp3.1	4.89μs	23.2ns	93ns	0.0296	2.14 KB

Benchmarks.Trace.ILoggerBenchmark - Same speed ✔️ Same allocations ✔️

Raw results

Branch	Method	Toolchain	Mean	StdError	StdDev	Gen 0	Allocated
master	`EnrichedLog`	net472	3.6μs	17.4ns	69.7ns	0.22	1.41 KB
master	`EnrichedLog`	netcoreapp3.1	3.18μs	11.5ns	41.6ns	0.0209	1.49 KB
#2240	`EnrichedLog`	net472	3.82μs	17.1ns	66.4ns	0.22	1.41 KB
#2240	`EnrichedLog`	netcoreapp3.1	3.25μs	13.8ns	53.3ns	0.0212	1.49 KB

Benchmarks.Trace.Log4netBenchmark - Same speed ✔️ Same allocations ✔️

Raw results

Branch	Method	Toolchain	Mean	StdError	StdDev	Gen 0	Gen 1	Allocated
master	`EnrichedLog`	net472	216μs	911ns	3.53μs	0.422	0.105	4.29 KB
master	`EnrichedLog`	netcoreapp3.1	191μs	495ns	1.85μs	0	0	4.17 KB
#2240	`EnrichedLog`	net472	218μs	888ns	3.66μs	0.43	0.108	4.29 KB
#2240	`EnrichedLog`	netcoreapp3.1	200μs	1.09μs	5.95μs	0	0	4.17 KB

Benchmarks.Trace.NLogBenchmark - Same speed ✔️ Same allocations ✔️

Raw results

Branch	Method	Toolchain	Mean	StdError	StdDev	Gen 0	Allocated
master	`EnrichedLog`	net472	8.29μs	36.1ns	135ns	0.497	3.19 KB
master	`EnrichedLog`	netcoreapp3.1	6.84μs	28.3ns	110ns	0.0508	3.56 KB
#2240	`EnrichedLog`	net472	8.12μs	43.9ns	240ns	0.496	3.19 KB
#2240	`EnrichedLog`	netcoreapp3.1	7.42μs	31.1ns	121ns	0.0484	3.56 KB

Benchmarks.Trace.RedisBenchmark - Same speed ✔️ Same allocations ✔️

Raw results

Branch	Method	Toolchain	Mean	StdError	StdDev	Gen 0	Allocated
master	`SendReceive`	net472	2.42μs	10.3ns	35.8ns	0.155	987 B
master	`SendReceive`	netcoreapp3.1	2.18μs	6.83ns	25.6ns	0.0136	984 B
#2240	`SendReceive`	net472	2.32μs	8.56ns	34.2ns	0.154	987 B
#2240	`SendReceive`	netcoreapp3.1	2.2μs	12.7ns	105ns	0.0141	984 B

Benchmarks.Trace.SerilogBenchmark - Same speed ✔️ Same allocations ✔️

Raw results

Branch	Method	Toolchain	Mean	StdError	StdDev	Gen 0	Allocated
master	`EnrichedLog`	net472	6.95μs	30ns	112ns	0.284	1.83 KB
master	`EnrichedLog`	netcoreapp3.1	5.77μs	33.2ns	276ns	0.0192	1.45 KB
#2240	`EnrichedLog`	net472	6.86μs	38.3ns	251ns	0.285	1.83 KB
#2240	`EnrichedLog`	netcoreapp3.1	5.42μs	23.5ns	88ns	0.0188	1.45 KB

Benchmarks.Trace.SpanBenchmark - Same speed ✔️ Same allocations ✔️

Raw results

Branch	Method	Toolchain	Mean	StdError	StdDev	Gen 0	Allocated
master	`StartFinishSpan`	net472	995ns	5.54ns	33.2ns	0.0676	433 B
master	`StartFinishSpan`	netcoreapp3.1	928ns	4.25ns	15.9ns	0.00596	432 B
master	`StartFinishScope`	net472	1.17μs	5.55ns	21.5ns	0.0804	514 B
master	`StartFinishScope`	netcoreapp3.1	1.06μs	4.09ns	15.3ns	0.00741	552 B
#2240	`StartFinishSpan`	net472	974ns	5.13ns	25.6ns	0.068	433 B
#2240	`StartFinishSpan`	netcoreapp3.1	977ns	4.59ns	17.8ns	0.00575	432 B
#2240	`StartFinishScope`	net472	1.2μs	6.14ns	26.8ns	0.0802	514 B
#2240	`StartFinishScope`	netcoreapp3.1	1.16μs	5.6ns	21.7ns	0.00778	552 B

Add an integration for the [OpenAI library](https://github.com/openai/openai-python). This integration provides tracing for the completion, embeddings and chat completion endpoints along with cost estimation metrics and prompt/completion sampling logs. Each log, metric and trace are tagged with service, env, version, OpenAI model, OpenAI endpoint and OpenAI organization. [Docs preview](https://output.circle-artifacts.com/output/job/fe3599b8-952e-4ceb-ac4f-0f15503e9c0d/artifacts/0/tmp/docs/integrations.html#openai) ## Design ### Logs A new log writer implementation is added to submit logs. Logs are submitted direct to intake following a similar approach that [kyle-verhoog/datadog-python](https://github.com/Kyle-Verhoog/datadog-python/blob/main/datadog/_logging.py) and the [.NET tracer](DataDog/dd-trace-dotnet#2240) have taken already. ### Metrics A statsd client is used specifically for the OpenAI integration. ## Testing Testing is done using VCR to record requests made to OpenAI to ensure ease, consistency and reliability in test cases. Logs and metrics are tested using mocking of the clients. Several integration tests using snapshots and subprocess testing ensure that the integration works in a real world OpenAI application. A manual test app was also used: https://gist.github.com/Kyle-Verhoog/1f263ed0aade076b313167d1ba3bfa16 ## Risk - Currently, logs, traces and metrics are all collected, buffered and sent individually through their respective pipelines. Due to this, there is risk that disparity occurs between the tagging and submission of the data. There is also a performance risk as this data is not aggregated or batched when submitted. - Prompts and completions are captured on spans by default with a default limit on the length of the data. This limit only applies to each prompt/completion individually but requests can contain several prompts and completions. If there are many prompts and completions of great length then there is a risk of performance overhead of encoding and transmitting the data. - Logs and metrics clients are specified specifically for this integration. If another integration were to introduce logs then there would be a need for another log writer. Having several log writers could induce thread contention and high memory usage. Co-authored-by: Kyle Verhoog <kyle@verhoog.ca> Co-authored-by: Kari Halsted <12926135+kayayarai@users.noreply.github.com>

Add an integration for the [OpenAI library](https://github.com/openai/openai-python). This integration provides tracing for the completion, embeddings and chat completion endpoints along with cost estimation metrics and prompt/completion sampling logs. Each log, metric and trace are tagged with service, env, version, OpenAI model, OpenAI endpoint and OpenAI organization. [Docs preview](https://output.circle-artifacts.com/output/job/fe3599b8-952e-4ceb-ac4f-0f15503e9c0d/artifacts/0/tmp/docs/integrations.html#openai) ## Design ### Logs A new log writer implementation is added to submit logs. Logs are submitted direct to intake following a similar approach that [kyle-verhoog/datadog-python](https://github.com/Kyle-Verhoog/datadog-python/blob/main/datadog/_logging.py) and the [.NET tracer](DataDog/dd-trace-dotnet#2240) have taken already. ### Metrics A statsd client is used specifically for the OpenAI integration. ## Testing Testing is done using VCR to record requests made to OpenAI to ensure ease, consistency and reliability in test cases. Logs and metrics are tested using mocking of the clients. Several integration tests using snapshots and subprocess testing ensure that the integration works in a real world OpenAI application. A manual test app was also used: https://gist.github.com/Kyle-Verhoog/1f263ed0aade076b313167d1ba3bfa16 ## Risk - Currently, logs, traces and metrics are all collected, buffered and sent individually through their respective pipelines. Due to this, there is risk that disparity occurs between the tagging and submission of the data. There is also a performance risk as this data is not aggregated or batched when submitted. - Prompts and completions are captured on spans by default with a default limit on the length of the data. This limit only applies to each prompt/completion individually but requests can contain several prompts and completions. If there are many prompts and completions of great length then there is a risk of performance overhead of encoding and transmitting the data. - Logs and metrics clients are specified specifically for this integration. If another integration were to introduce logs then there would be a need for another log writer. Having several log writers could induce thread contention and high memory usage. Co-authored-by: lievan <42917263+lievan@users.noreply.github.com> Co-authored-by: Kari Halsted <12926135+kayayarai@users.noreply.github.com> Co-authored-by: Federico Mon <federico.mon@datadoghq.com>

andrewlock requested review from a team as code owners December 24, 2021 16:04

andrewlock added area:automatic-instrumentation Automatic instrumentation managed C# code (Datadog.Trace.ClrProfiler.Managed) area:logs-injection needs-docs-update type:new-feature labels Dec 24, 2021

andrewlock mentioned this pull request Dec 24, 2021

Direct log submission #1945

Closed