Skip to content

Direct log submission#2240

Merged
andrewlock merged 28 commits into
masterfrom
andrew/direct-log-submission-refactor
Jan 12, 2022
Merged

Direct log submission#2240
andrewlock merged 28 commits into
masterfrom
andrew/direct-log-submission-refactor

Conversation

@andrewlock

Copy link
Copy Markdown
Member

This PR is a refactoring of #1945, to try and make it easier to review. Each commit is self contained and (hopefully) understandable, so should make it easier to review them in order!🤞

This PR adds support for directly shipping logs to the Datadog intake, so customers don't have to configure log tailing in the agent etc. Supports all our existing logs injection frameworks

  • Microsoft.Extensions.Logging.ILogger
  • Serilog
  • log4Net
  • NLog 2.1+ (This is more restrictive than for logs injection, but I don't think it's worth the effort to go earlier, until we have telemetry that says otherwise)

Approach

This takes the same overall approach as the official Datadog Serilog Sink, though it deviates in various places for performance and ergonomic reasons.

I initially partially vendored-in the Serilog batching sink provider, however subsequently reworked this quite significantly after feedback in #1945 (thanks Colin!), in particular to add circuit-breaker features to handle configuration errors. Similar to other components, we have a singleton DatadogSink, (which implements the vendored BatchingSink) and is responsible for formatting and sending batches of logs. The LogsApi type takes care of the transport (similar to the Api class for traces).

For serialization, I used a similar formatting to the CompactJsonFormatter, as that's supported in the logs documentation, serializing efficiently where possible. Without System.Text.Json's Utf8Writer, the best I could do to reduce memory impact was to reuse a StringBuilder, and have a persistent buffer for the UTF8 bytes. Given System.Text.Json is built-in to netcore3.1+, I wonder if we should use that where available 🤔

We use automatic instrumentation to add the sinks as required, which then farms off the logs to the singleton DatadogSink.

Testing

Added integration tests for each framework

  • Tested that enabling log shipping has no impact on logs injection
  • Can receive logs, and they have the TWoL tags added correctly (where expected)

Also did manual testing of each framework, and confirmed I can see the logs in the UI:

image

@DataDog/apm-dotnet

@andrewlock andrewlock requested review from a team as code owners December 24, 2021 16:04
@andrewlock andrewlock added area:automatic-instrumentation Automatic instrumentation managed C# code (Datadog.Trace.ClrProfiler.Managed) area:logs-injection needs-docs-update type:new-feature labels Dec 24, 2021
@andrewlock

This comment has been minimized.

@andrewlock

This comment has been minimized.

@pierotibou pierotibou left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Only a small copy paste issue in LogFormatter.cs and nitpicking comments

Comment thread tracer/src/Datadog.Trace/Configuration/ImmutableTracerSettings.cs Outdated
Comment thread tracer/src/Datadog.Trace/Logging/DirectSubmission/Formatting/LogFormatter.cs Outdated

@kevingosse kevingosse left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that I've only reviewed the logic, I didn't check that you're using the correct hook in the log libraries.

[DuckReverseMethod]
public DirectSubmissionLogger CreateLogger(string categoryName)
{
return _loggers.GetOrAdd(categoryName, CreateLoggerImplementation);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know how often this is called, but I suggest caching the CreateLoggerImplementation into a field (populated by the constructor) to save an allocation at each invocation

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As I understand it, this is already cached by the caller of this function:

https://github.com/dotnet/runtime/blob/57bfe474518ab5b7cfe6bf7424a79ce3af9d6657/src/libraries/Microsoft.Extensions.Logging/src/LoggerFactory.cs#L128

Also, note that this is returning a DirectSubmissionLogger parameterised on categoryName, so we would have to use a concurrent dictionary keyed on categoryName...

{
private const string MessageTemplateKey = "{OriginalFormat}";

public static string? FormatLogEvent<T>(LogFormatter logFormatter, in LogEntry<T> logEntry)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good call on using the in keyword here 👍

var haveEnv = false;
string? messageTemplate = null;

if (logEntry.State is { } state)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:stare:

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

#sorrynotsorry

}
}

if (logEntry.ScopeProvider is { } scopeProvider)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:staaaaare:

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

#okstartingtofeelabitashamednow

if (success)
{
status = _circuitBreaker.MarkSuccess();
haveMultipleBatchesToSend = _waitingBatch.Count >= _batchSizeLimit;

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_waitingBatch.Count can be equal to _batchSizeLimit but never bigger. I suppose you used >= as precaution.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, exactly (actually this was part of the original Serilog code, but seemed a reasonable precaution to keep)

return _circuitBreaker.MarkFailure();
}
}
while (haveMultipleBatchesToSend);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that, if the application produces _batchSizeLimit logs faster than it takes to send them, we can get in a situation where we keep sending logs without ever waiting for _flushPeriod. I don't know if that's a problem (probably not).

@andrewlock andrewlock Dec 30, 2021

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd argue that's a feature rather than a bug - if we need to constantly send logs to keep up with the app, then the alternative would effectively be a memory leak as we buffer more and more

_queueLimit = queueLimit ?? Unbounded;
}

public int Count => _queue.Count;

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Getting the count of a ConcurrentQueue is costly, we should remove the property so we're not tempted to use it

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in 8dc2c8d, and expose the _queue instead, so the tests still compile

{ }
finally
{
// prevent state corrupt while aborting

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's probably something I'll want to remove eventually. We already know that our tracer is not resilient to aborted threads, so that's really not bringing anything (and just preventing inlining because of the try/finally block)

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed in 8dc2c8d

exception: null,
(s, ex) => $"This is {s}");

var log = sink.Events.Should().ContainSingle().Subject;

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice trick

@andrewlock andrewlock force-pushed the andrew/direct-log-submission-refactor branch 2 times, most recently from cf65229 to 1dcfa42 Compare December 30, 2021 18:06
@andrewlock

This comment has been minimized.

@andrewlock

This comment has been minimized.

@andrewlock andrewlock force-pushed the andrew/direct-log-submission-refactor branch 2 times, most recently from b441218 to 70f3b3f Compare January 4, 2022 09:53
@andrewlock

This comment has been minimized.

@andrewlock

This comment has been minimized.

@andrewlock andrewlock force-pushed the andrew/direct-log-submission-refactor branch from 70f3b3f to 59841f6 Compare January 5, 2022 11:18
@andrewlock

This comment has been minimized.

@andrewlock

This comment has been minimized.

@andrewlock andrewlock force-pushed the andrew/direct-log-submission-refactor branch 2 times, most recently from 98eefec to 5676fd5 Compare January 10, 2022 10:27
@andrewlock

This comment has been minimized.

@andrewlock

This comment has been minimized.

Comment thread tracer/src/Datadog.Trace/Configuration/ImmutableTracerSettings.cs Outdated
@andrewlock andrewlock force-pushed the andrew/direct-log-submission-refactor branch from 5676fd5 to 52b41eb Compare January 11, 2022 15:08
{
internal static class LogsTransportStrategy
{
private static readonly IDatadogLogger Log = DatadogLogging.GetLoggerFor<Tracer>();

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Logger for Tracer?


try
{
// TODO: Metrics/Telemetry?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did you want to act on this TODO?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not yet 😉

return true;
}

shouldRetry = response.StatusCode switch

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I love this, nice job

}

private Task HandleCircuitStatus(CircuitStatus status)
{

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very nice

{
public LockedTracerManager()
: base(null, null, null, null, null, null, null)
: base(null, null, null, null, null, null, null, null)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

MY EYYYYEEES

Previous logic didn't guarantee we would allocate an array large enough
With the previous implementation, we were allocating a List<object>() for every struct, even if the list was never used (it's only used when there are non-dictionary scope values added using e.g. `ILogger.BeginScope("SomeValue")`)

With the new approach, we lazily allocate the list on first usage instead
@andrewlock andrewlock force-pushed the andrew/direct-log-submission-refactor branch from 52b41eb to 5ea0b65 Compare January 12, 2022 09:23
@github-actions github-actions Bot added area:benchmarks Benchmarks, throughput tests, Crank, Bombardier, etc area:ci-visibility area:test-apps apps used to test integrations area:tests unit tests, integration tests area:tracer The core tracer library (Datadog.Trace, does not include OpenTracing, native code, or integrations) labels Jan 12, 2022
@andrewlock

Copy link
Copy Markdown
Member Author

Code Coverage Report 📊

✔️ Merging #2240 into master will not change line coverage
⚠️ Merging #2240 into master will will decrease branch coverage by 1%
⛔ Merging #2240 into master will will increase complexity by 733

master #2240 Change
Lines 9635 / 13524 10542 / 14815
Lines % 71% 71% 0% ✔️
Branches 4877 / 7011 5290 / 7667
Branches % 70% 69% -1% ⚠️
Complexity 8173 8906 733

View the full report for further details:

Datadog.Trace Breakdown ✔️

master #2240 Change
Lines % 71% 71% 0% ✔️
Branches % 70% 69% -1% ⚠️
Complexity 8173 8906 733

The following classes have significant coverage changes.

File Line coverage change Branch coverage change Complexity change
Datadog.Trace.ClrProfiler.AutoInstrumentation.Couchbase.IIOServiceExecuteIntegrationBis -100% 0% ✔️ 0 ✔️
Datadog.Trace.ClrProfiler.AutoInstrumentation.Testing.MsTestV2.TestMethodRunnerExecuteIntegration -33% -19% 0 ✔️
Datadog.Trace.Ci.Agent.CIAgentWriter -10% -21% 0 ✔️
Datadog.Trace.ClrProfiler.AutoInstrumentation.Testing.NUnit.NUnitCompositeWorkItemSkipChildrenIntegration -8% -8% 0 ✔️
Datadog.Trace.TracerManager 0% ✔️ -6% 6
[Datadog.Trace.ClrProfiler.CallTarget.Handlers.BeginMethodHandler5](https://github.com/DataDog/dd-trace-dotnet/tree/f3cbc4242cf96ef0c7d27cd9d64bd987810d3452/tracer/src/Datadog.Trace/ClrProfiler/CallTarget/Handlers/BeginMethodHandler3.cs) 8% ✔️ 10% ✔️ 0 ✔️
Datadog.Trace.Util.ThrowHelper 9% ✔️ 0% ✔️ 0 ✔️

The following classes were added in #2240:

File Line coverage Branch coverage Complexity
Datadog.Trace.ClrProfiler.AutoInstrumentation.Logging.ILogger.DirectSubmission.DirectSubmissionLogger 96% 88% 12
Datadog.Trace.ClrProfiler.AutoInstrumentation.Logging.ILogger.DirectSubmission.DirectSubmissionLoggerProvider 77% 100% 6
Datadog.Trace.ClrProfiler.AutoInstrumentation.Logging.ILogger.DirectSubmission.Formatting.LogEntry`1 100% 100% 1
Datadog.Trace.ClrProfiler.AutoInstrumentation.Logging.ILogger.DirectSubmission.Formatting.LoggerLogFormatter 95% 87% 41
Datadog.Trace.ClrProfiler.AutoInstrumentation.Logging.ILogger.DirectSubmission.LoggerDatadogLogEvent 100% 100% 2
...And 36 more

View the full reports for further details:

@andrewlock

Copy link
Copy Markdown
Member Author

Benchmarks Report 🐌

Benchmarks for #2240 compared to master:

  • 2 benchmarks are slower, with geometric mean 1.307
  • All benchmarks have the same allocations

The following thresholds were used for comparing the benchmark speeds:

  • Mann–Whitney U test with statistical test for significance of 5%
  • Only results indicating a difference greater than 10% and 0.3 ns are considered.

Allocation changes below 0.5% are ignored.

Benchmark details

Benchmarks.Trace.AgentWriterBenchmark - Same speed ✔️ Same allocations ✔️

Raw results

Branch Method Toolchain Mean StdError StdDev Gen 0 Gen 1 Gen 2 Allocated
master WriteAndFlushEnrichedTraces net472 910μs 4.48μs 19μs 0 0 0 3.16 KB
master WriteAndFlushEnrichedTraces netcoreapp3.1 693μs 2.14μs 7.99μs 0 0 0 2.57 KB
#2240 WriteAndFlushEnrichedTraces net472 919μs 4.57μs 19.4μs 0 0 0 3.16 KB
#2240 WriteAndFlushEnrichedTraces netcoreapp3.1 754μs 3.77μs 16μs 0 0 0 2.57 KB
Benchmarks.Trace.AspNetCoreBenchmark - Same speed ✔️ Same allocations ✔️

Raw results

Branch Method Toolchain Mean StdError StdDev Gen 0 Gen 1 Gen 2 Allocated
master SendRequest net472 0.00519ns 0.00391ns 0.0241ns 0 0 0 0 b
master SendRequest netcoreapp3.1 340μs 1.23μs 4.77μs 0.168 0 0 19.59 KB
#2240 SendRequest net472 0ns 0ns 0ns 0 0 0 0 b
#2240 SendRequest netcoreapp3.1 350μs 1.68μs 6.49μs 0.174 0 0 19.59 KB
Benchmarks.Trace.DbCommandBenchmark - Same speed ✔️ Same allocations ✔️

Raw results

Branch Method Toolchain Mean StdError StdDev Gen 0 Gen 1 Gen 2 Allocated
master ExecuteNonQuery net472 1.79μs 10.4ns 88ns 0.09 0.000978 0 570 B
master ExecuteNonQuery netcoreapp3.1 1.61μs 5.38ns 18.6ns 0.00801 0 0 608 B
#2240 ExecuteNonQuery net472 1.69μs 7.94ns 30.8ns 0.09 0.000833 0 570 B
#2240 ExecuteNonQuery netcoreapp3.1 1.54μs 5.88ns 22.8ns 0.00823 0 0 608 B
Benchmarks.Trace.ElasticsearchBenchmark - Slower ⚠️ Same allocations ✔️

Slower ⚠️ in #2240

Benchmark diff/base Base Median (ns) Diff Median (ns) Modality
Benchmarks.Trace.ElasticsearchBenchmark.CallElasticsearch‑net472 1.209 2,503.40 3,025.55

Raw results

Branch Method Toolchain Mean StdError StdDev Gen 0 Gen 1 Gen 2 Allocated
master CallElasticsearch net472 2.53μs 14.2ns 89.7ns 0.122 0 0 778 B
master CallElasticsearch netcoreapp3.1 1.75μs 5.86ns 21.1ns 0.0103 0 0 768 B
master CallElasticsearchAsync net472 2.74μs 7.56ns 27.3ns 0.142 0 0 915 B
master CallElasticsearchAsync netcoreapp3.1 1.97μs 9.15ns 34.2ns 0.0121 0 0 888 B
#2240 CallElasticsearch net472 3.04μs 13.8ns 53.6ns 0.12 0 0 778 B
#2240 CallElasticsearch netcoreapp3.1 1.74μs 6.02ns 22.5ns 0.0112 0 0 768 B
#2240 CallElasticsearchAsync net472 2.9μs 13.1ns 50.8ns 0.142 0 0 915 B
#2240 CallElasticsearchAsync netcoreapp3.1 1.91μs 5.8ns 21.7ns 0.0121 0 0 888 B
Benchmarks.Trace.GraphQLBenchmark - Slower ⚠️ Same allocations ✔️

Slower ⚠️ in #2240

Benchmark diff/base Base Median (ns) Diff Median (ns) Modality
Benchmarks.Trace.GraphQLBenchmark.ExecuteAsync‑net472 1.413 3,372.64 4,763.98 several?

Raw results

Branch Method Toolchain Mean StdError StdDev Gen 0 Gen 1 Gen 2 Allocated
master ExecuteAsync net472 3.38μs 17.4ns 107ns 0.161 0 0 1.04 KB
master ExecuteAsync netcoreapp3.1 2.03μs 9.4ns 36.4ns 0.014 0 0 1.01 KB
#2240 ExecuteAsync net472 4.69μs 36.8ns 368ns 0.162 0 0 1.04 KB
#2240 ExecuteAsync netcoreapp3.1 2.12μs 12.2ns 92.9ns 0.0148 0 0 1.01 KB
Benchmarks.Trace.HttpClientBenchmark - Same speed ✔️ Same allocations ✔️

Raw results

Branch Method Toolchain Mean StdError StdDev Gen 0 Gen 1 Gen 2 Allocated
master SendAsync net472 6.85μs 38.8ns 298ns 0.348 0 0 2.24 KB
master SendAsync netcoreapp3.1 4.83μs 26.6ns 206ns 0.0289 0 0 2.14 KB
#2240 SendAsync net472 6.55μs 34.5ns 165ns 0.348 0 0 2.24 KB
#2240 SendAsync netcoreapp3.1 4.89μs 23.2ns 93ns 0.0296 0 0 2.14 KB
Benchmarks.Trace.ILoggerBenchmark - Same speed ✔️ Same allocations ✔️

Raw results

Branch Method Toolchain Mean StdError StdDev Gen 0 Gen 1 Gen 2 Allocated
master EnrichedLog net472 3.6μs 17.4ns 69.7ns 0.22 0 0 1.41 KB
master EnrichedLog netcoreapp3.1 3.18μs 11.5ns 41.6ns 0.0209 0 0 1.49 KB
#2240 EnrichedLog net472 3.82μs 17.1ns 66.4ns 0.22 0 0 1.41 KB
#2240 EnrichedLog netcoreapp3.1 3.25μs 13.8ns 53.3ns 0.0212 0 0 1.49 KB
Benchmarks.Trace.Log4netBenchmark - Same speed ✔️ Same allocations ✔️

Raw results

Branch Method Toolchain Mean StdError StdDev Gen 0 Gen 1 Gen 2 Allocated
master EnrichedLog net472 216μs 911ns 3.53μs 0.422 0.105 0 4.29 KB
master EnrichedLog netcoreapp3.1 191μs 495ns 1.85μs 0 0 0 4.17 KB
#2240 EnrichedLog net472 218μs 888ns 3.66μs 0.43 0.108 0 4.29 KB
#2240 EnrichedLog netcoreapp3.1 200μs 1.09μs 5.95μs 0 0 0 4.17 KB
Benchmarks.Trace.NLogBenchmark - Same speed ✔️ Same allocations ✔️

Raw results

Branch Method Toolchain Mean StdError StdDev Gen 0 Gen 1 Gen 2 Allocated
master EnrichedLog net472 8.29μs 36.1ns 135ns 0.497 0 0 3.19 KB
master EnrichedLog netcoreapp3.1 6.84μs 28.3ns 110ns 0.0508 0 0 3.56 KB
#2240 EnrichedLog net472 8.12μs 43.9ns 240ns 0.496 0 0 3.19 KB
#2240 EnrichedLog netcoreapp3.1 7.42μs 31.1ns 121ns 0.0484 0 0 3.56 KB
Benchmarks.Trace.RedisBenchmark - Same speed ✔️ Same allocations ✔️

Raw results

Branch Method Toolchain Mean StdError StdDev Gen 0 Gen 1 Gen 2 Allocated
master SendReceive net472 2.42μs 10.3ns 35.8ns 0.155 0 0 987 B
master SendReceive netcoreapp3.1 2.18μs 6.83ns 25.6ns 0.0136 0 0 984 B
#2240 SendReceive net472 2.32μs 8.56ns 34.2ns 0.154 0 0 987 B
#2240 SendReceive netcoreapp3.1 2.2μs 12.7ns 105ns 0.0141 0 0 984 B
Benchmarks.Trace.SerilogBenchmark - Same speed ✔️ Same allocations ✔️

Raw results

Branch Method Toolchain Mean StdError StdDev Gen 0 Gen 1 Gen 2 Allocated
master EnrichedLog net472 6.95μs 30ns 112ns 0.284 0 0 1.83 KB
master EnrichedLog netcoreapp3.1 5.77μs 33.2ns 276ns 0.0192 0 0 1.45 KB
#2240 EnrichedLog net472 6.86μs 38.3ns 251ns 0.285 0 0 1.83 KB
#2240 EnrichedLog netcoreapp3.1 5.42μs 23.5ns 88ns 0.0188 0 0 1.45 KB
Benchmarks.Trace.SpanBenchmark - Same speed ✔️ Same allocations ✔️

Raw results

Branch Method Toolchain Mean StdError StdDev Gen 0 Gen 1 Gen 2 Allocated
master StartFinishSpan net472 995ns 5.54ns 33.2ns 0.0676 0 0 433 B
master StartFinishSpan netcoreapp3.1 928ns 4.25ns 15.9ns 0.00596 0 0 432 B
master StartFinishScope net472 1.17μs 5.55ns 21.5ns 0.0804 0 0 514 B
master StartFinishScope netcoreapp3.1 1.06μs 4.09ns 15.3ns 0.00741 0 0 552 B
#2240 StartFinishSpan net472 974ns 5.13ns 25.6ns 0.068 0 0 433 B
#2240 StartFinishSpan netcoreapp3.1 977ns 4.59ns 17.8ns 0.00575 0 0 432 B
#2240 StartFinishScope net472 1.2μs 6.14ns 26.8ns 0.0802 0 0 514 B
#2240 StartFinishScope netcoreapp3.1 1.16μs 5.6ns 21.7ns 0.00778 0 0 552 B

@andrewlock andrewlock removed area:tests unit tests, integration tests area:test-apps apps used to test integrations area:benchmarks Benchmarks, throughput tests, Crank, Bombardier, etc area:ci-visibility labels Jan 12, 2022
@andrewlock andrewlock merged commit 6a56d0c into master Jan 12, 2022
@andrewlock andrewlock deleted the andrew/direct-log-submission-refactor branch January 12, 2022 11:07
@github-actions github-actions Bot added this to the vNext milestone Jan 12, 2022
Kyle-Verhoog added a commit to DataDog/dd-trace-py that referenced this pull request Apr 26, 2023
Add an integration for the [OpenAI library](https://github.com/openai/openai-python). 
This integration provides tracing for the completion, embeddings and chat completion
endpoints along with cost estimation metrics and prompt/completion
sampling logs.

Each log, metric and trace are tagged with service, env, version, OpenAI
model, OpenAI endpoint and OpenAI organization.

[Docs
preview](https://output.circle-artifacts.com/output/job/fe3599b8-952e-4ceb-ac4f-0f15503e9c0d/artifacts/0/tmp/docs/integrations.html#openai)

## Design

### Logs

A new log writer implementation is added to submit logs. Logs are
submitted direct to intake following a similar approach that
[kyle-verhoog/datadog-python](https://github.com/Kyle-Verhoog/datadog-python/blob/main/datadog/_logging.py)
and the [.NET tracer](DataDog/dd-trace-dotnet#2240) have taken
already.

### Metrics

A statsd client is used specifically for the OpenAI integration.


## Testing

Testing is done using VCR to record requests made to OpenAI to ensure
ease, consistency and reliability in test cases.

Logs and metrics are tested using mocking of the clients.

Several integration tests using snapshots and subprocess testing ensure
that the integration works in a real world OpenAI application.

A manual test app was also used:
https://gist.github.com/Kyle-Verhoog/1f263ed0aade076b313167d1ba3bfa16


## Risk

- Currently, logs, traces and metrics are all collected, buffered and
sent individually through their respective pipelines. Due to this, there
is risk that disparity occurs between the tagging and submission of the
data. There is also a performance risk as this data is not aggregated or
batched when submitted.
- Prompts and completions are captured on spans by default with a
default limit on the length of the data. This limit only applies to each
prompt/completion individually but requests can contain several prompts
and completions. If there are many prompts and completions of great
length then there is a risk of performance overhead of encoding and
transmitting the data.
- Logs and metrics clients are specified specifically for this
integration. If another integration were to introduce logs then there
would be a need for another log writer. Having several log writers could
induce thread contention and high memory usage.


Co-authored-by: Kyle Verhoog <kyle@verhoog.ca>
Co-authored-by: Kari Halsted <12926135+kayayarai@users.noreply.github.com>
Kyle-Verhoog added a commit to DataDog/dd-trace-py that referenced this pull request Apr 26, 2023
Add an integration for the [OpenAI library](https://github.com/openai/openai-python). 
This integration provides tracing for the completion, embeddings and chat completion
endpoints along with cost estimation metrics and prompt/completion
sampling logs.

Each log, metric and trace are tagged with service, env, version, OpenAI
model, OpenAI endpoint and OpenAI organization.

[Docs
preview](https://output.circle-artifacts.com/output/job/fe3599b8-952e-4ceb-ac4f-0f15503e9c0d/artifacts/0/tmp/docs/integrations.html#openai)

## Design

### Logs

A new log writer implementation is added to submit logs. Logs are
submitted direct to intake following a similar approach that
[kyle-verhoog/datadog-python](https://github.com/Kyle-Verhoog/datadog-python/blob/main/datadog/_logging.py)
and the [.NET tracer](DataDog/dd-trace-dotnet#2240) have taken
already.

### Metrics

A statsd client is used specifically for the OpenAI integration.


## Testing

Testing is done using VCR to record requests made to OpenAI to ensure
ease, consistency and reliability in test cases.

Logs and metrics are tested using mocking of the clients.

Several integration tests using snapshots and subprocess testing ensure
that the integration works in a real world OpenAI application.

A manual test app was also used:
https://gist.github.com/Kyle-Verhoog/1f263ed0aade076b313167d1ba3bfa16


## Risk

- Currently, logs, traces and metrics are all collected, buffered and
sent individually through their respective pipelines. Due to this, there
is risk that disparity occurs between the tagging and submission of the
data. There is also a performance risk as this data is not aggregated or
batched when submitted.
- Prompts and completions are captured on spans by default with a
default limit on the length of the data. This limit only applies to each
prompt/completion individually but requests can contain several prompts
and completions. If there are many prompts and completions of great
length then there is a risk of performance overhead of encoding and
transmitting the data.
- Logs and metrics clients are specified specifically for this
integration. If another integration were to introduce logs then there
would be a need for another log writer. Having several log writers could
induce thread contention and high memory usage.


Co-authored-by: Kyle Verhoog <kyle@verhoog.ca>
Co-authored-by: Kari Halsted <12926135+kayayarai@users.noreply.github.com>
Kyle-Verhoog added a commit to DataDog/dd-trace-py that referenced this pull request Apr 27, 2023
Add an integration for the [OpenAI
library](https://github.com/openai/openai-python). This integration
provides tracing for the completion, embeddings and chat completion
endpoints along with cost estimation metrics and prompt/completion
sampling logs.

Each log, metric and trace are tagged with service, env, version, OpenAI
model, OpenAI endpoint and OpenAI organization.

[Docs

preview](https://output.circle-artifacts.com/output/job/fe3599b8-952e-4ceb-ac4f-0f15503e9c0d/artifacts/0/tmp/docs/integrations.html#openai)

## Design

### Logs

A new log writer implementation is added to submit logs. Logs are
submitted direct to intake following a similar approach that
[kyle-verhoog/datadog-python](https://github.com/Kyle-Verhoog/datadog-python/blob/main/datadog/_logging.py)
and the [.NET
tracer](DataDog/dd-trace-dotnet#2240) have taken
already.

### Metrics

A statsd client is used specifically for the OpenAI integration.


## Testing

Testing is done using VCR to record requests made to OpenAI to ensure
ease, consistency and reliability in test cases.

Logs and metrics are tested using mocking of the clients.

Several integration tests using snapshots and subprocess testing ensure
that the integration works in a real world OpenAI application.

A manual test app was also used:
https://gist.github.com/Kyle-Verhoog/1f263ed0aade076b313167d1ba3bfa16


## Risk

- Currently, logs, traces and metrics are all collected, buffered and
sent individually through their respective pipelines. Due to this, there
is risk that disparity occurs between the tagging and submission of the
data. There is also a performance risk as this data is not aggregated or
batched when submitted.
- Prompts and completions are captured on spans by default with a
default limit on the length of the data. This limit only applies to each
prompt/completion individually but requests can contain several prompts
and completions. If there are many prompts and completions of great
length then there is a risk of performance overhead of encoding and
transmitting the data.
- Logs and metrics clients are specified specifically for this
integration. If another integration were to introduce logs then there
would be a need for another log writer. Having several log writers could
induce thread contention and high memory usage.


Co-authored-by: lievan <42917263+lievan@users.noreply.github.com>
Co-authored-by: Kari Halsted <12926135+kayayarai@users.noreply.github.com>
Co-authored-by: Federico Mon <federico.mon@datadoghq.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area:automatic-instrumentation Automatic instrumentation managed C# code (Datadog.Trace.ClrProfiler.Managed) area:direct-log-submission area:logs-injection area:tracer The core tracer library (Datadog.Trace, does not include OpenTracing, native code, or integrations) needs-docs-update type:new-feature

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants