[Feature Flags] Add flag evaluation metrics via OpenFeature hook#8367
Conversation
78be069 to
f240989
Compare
Execution-Time Benchmarks Report ⏱️Execution-time results for samples comparing This PR (8367) and master. ✅ No regressions detected - check the details below Full Metrics ComparisonFakeDbCommand
HttpMessageHandler
Comparison explanationExecution-time benchmarks measure the whole time it takes to execute a program, and are intended to measure the one-off costs. Cases where the execution time results for the PR are worse than latest master results are highlighted in **red**. The following thresholds were used for comparing the execution times:
Note that these results are based on a single point-in-time result for each branch. For full results, see the dashboard. Graphs show the p99 interval based on the mean and StdDev of the test run, as well as the mean value of the run (shown as a diamond below the graph). Duration chartsFakeDbCommand (.NET Framework 4.8)gantt
title Execution time (ms) FakeDbCommand (.NET Framework 4.8)
dateFormat x
axisFormat %Q
todayMarker off
section Baseline
This PR (8367) - mean (73ms) : 70, 76
master - mean (73ms) : 70, 76
section Bailout
This PR (8367) - mean (78ms) : 75, 82
master - mean (77ms) : 75, 80
section CallTarget+Inlining+NGEN
This PR (8367) - mean (1,127ms) : 1072, 1181
master - mean (1,135ms) : 1090, 1180
FakeDbCommand (.NET Core 3.1)gantt
title Execution time (ms) FakeDbCommand (.NET Core 3.1)
dateFormat x
axisFormat %Q
todayMarker off
section Baseline
This PR (8367) - mean (116ms) : 111, 122
master - mean (114ms) : 109, 119
section Bailout
This PR (8367) - mean (117ms) : 111, 123
master - mean (119ms) : 113, 125
section CallTarget+Inlining+NGEN
This PR (8367) - mean (815ms) : 782, 847
master - mean (812ms) : 791, 834
FakeDbCommand (.NET 6)gantt
title Execution time (ms) FakeDbCommand (.NET 6)
dateFormat x
axisFormat %Q
todayMarker off
section Baseline
This PR (8367) - mean (103ms) : 98, 107
master - mean (104ms) : 98, 110
section Bailout
This PR (8367) - mean (104ms) : 99, 109
master - mean (102ms) : 99, 105
section CallTarget+Inlining+NGEN
This PR (8367) - mean (948ms) : 905, 991
master - mean (950ms) : 902, 998
FakeDbCommand (.NET 8)gantt
title Execution time (ms) FakeDbCommand (.NET 8)
dateFormat x
axisFormat %Q
todayMarker off
section Baseline
This PR (8367) - mean (101ms) : 97, 104
master - mean (103ms) : 97, 109
section Bailout
This PR (8367) - mean (106ms) : 99, 112
master - mean (104ms) : 99, 110
section CallTarget+Inlining+NGEN
This PR (8367) - mean (838ms) : 774, 901
master - mean (834ms) : 786, 881
HttpMessageHandler (.NET Framework 4.8)gantt
title Execution time (ms) HttpMessageHandler (.NET Framework 4.8)
dateFormat x
axisFormat %Q
todayMarker off
section Baseline
This PR (8367) - mean (204ms) : 195, 214
master - mean (215ms) : 189, 240
section Bailout
This PR (8367) - mean (210ms) : 200, 219
master - mean (216ms) : 198, 234
section CallTarget+Inlining+NGEN
This PR (8367) - mean (1,272ms) : 1228, 1316
master - mean (1,322ms) : 1267, 1378
HttpMessageHandler (.NET Core 3.1)gantt
title Execution time (ms) HttpMessageHandler (.NET Core 3.1)
dateFormat x
axisFormat %Q
todayMarker off
section Baseline
This PR (8367) - mean (297ms) : 275, 318
master - mean (308ms) : 277, 339
section Bailout
This PR (8367) - mean (301ms) : 285, 317
master - mean (309ms) : 283, 335
section CallTarget+Inlining+NGEN
This PR (8367) - mean (1,018ms) : 988, 1049
master - mean (1,043ms) : 1000, 1086
HttpMessageHandler (.NET 6)gantt
title Execution time (ms) HttpMessageHandler (.NET 6)
dateFormat x
axisFormat %Q
todayMarker off
section Baseline
This PR (8367) - mean (294ms) : 278, 310
master - mean (304ms) : 279, 330
section Bailout
This PR (8367) - mean (295ms) : 275, 315
master - mean (304ms) : 279, 330
section CallTarget+Inlining+NGEN
This PR (8367) - mean (1,188ms) : 1137, 1239
master - mean (1,207ms) : 1129, 1285
HttpMessageHandler (.NET 8)gantt
title Execution time (ms) HttpMessageHandler (.NET 8)
dateFormat x
axisFormat %Q
todayMarker off
section Baseline
This PR (8367) - mean (292ms) : 278, 306
master - mean (305ms) : 274, 335
section Bailout
This PR (8367) - mean (295ms) : 272, 317
master - mean (304ms) : 276, 332
section CallTarget+Inlining+NGEN
This PR (8367) - mean (1,082ms) : 985, 1179
master - mean (1,130ms) : 1024, 1236
|
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
BenchmarksBenchmark execution time: 2026-05-08 01:42:34 Comparing candidate commit 2a37cf3 in PR branch Some scenarios are present only in baseline or only in candidate runs. If you didn't create or remove some scenarios in your branch, this maybe a sign of crashed benchmarks 💥💥💥 Scenarios present only in baseline:
Found 3 performance improvements and 3 performance regressions! Performance is the same for 48 metrics, 18 unstable metrics, 87 known flaky benchmarks, 39 flaky benchmarks without significant changes.
|
a73acc0 to
ea7bc67
Compare
ea7bc67 to
8865c3f
Compare
| <!-- OpenFeature 2.3.0 has transitive dependencies that warn about net6.0 support --> | ||
| <SuppressTfmSupportBuildWarnings>true</SuppressTfmSupportBuildWarnings> |
There was a problem hiding this comment.
We needed to upgrade OpenFeature from 2.0.0 to 2.3.0 because it changed the FinallyAsync hook signature to include FlagEvaluationDetails<T> which is needed to record flag evaluation metrics and this matches how it was implemented in the Go SDK
| OpenFeature version | FinallyAsync signature |
|---|---|
| 2.0.0 | ValueTask FinallyAsync<T>(HookContext<T> context, ...) |
| 2.3.0 | ValueTask FinallyAsync<T>(HookContext<T> context, FlagEvaluationDetails<T> details, ...) |
I needed to suppress the warning to get the bump_package_versions CI to pass, and given the presence of SuppressTfmSupportBuildWarnings in other places in the repo I'm guessing this is acceptable. The warning comes from transitive dependencies (Microsoft.Extensions.*, System.Collections.Immutable, etc.) that ship .NET 9.0 versions and emit a build-time warning saying they haven't been "tested" with net6.0.
There was a problem hiding this comment.
AS mentioned in the comment above, the TargetFrameworks used here should match the version used by the referenced OpenFeature version, so adding SuppressTfmSupportBuildWarnings here likely isn't the right fix - instead you should make (breaking) changes to the TargetFrameworks.
However, I would really try to work out if you can avoid bumping the OpenFeature version. This is a breaking change (as it drop support for TFMs in exactly the way you're seeing here). Is there another approach we could use? Should we instead use progressive enhancement to add these hooks?
What is the documented support policy here? Because breaking changes like this are generally not how we update things in the core tracer.
There was a problem hiding this comment.
Since I'm currently leaning towards accepting the breaking change in OpenFeature https://github.com/DataDog/dd-trace-dotnet/pull/8367/changes#r3075858040, I updated the TFMs to match: net462;netstandard2.0;net8.0;net9.0. This drops net6.0 (EOL since Nov 2024) and adds net9.0. Removed SuppressTfmSupportBuildWarnings. Customers on .NET 6 can't use OpenFeature 2.3.0 anyway, so this aligns with their supported platforms.
b2040ea to
c448add
Compare
Remove irrelevant skips for: - Test_FFE_Eval_Metric_Parse_Error_Invalid_Regex - Test_FFE_Eval_No_Config_Loaded The .NET SDK uses a managed evaluator (not libdatadog), and now correctly returns PARSE_ERROR for invalid regex patterns and PROVIDER_NOT_READY when no config is loaded. Related: DataDog/dd-trace-dotnet#8367
c448add to
9c81be3
Compare
Update test assertions to match semantic evaluation reason logic: - CreateSimpleFlag has shards → expect Split (not TargetingMatch) - CreateExposureFlag has no rules/shards → expect Static - CreateTimeBasedFlagWithDates has no rules/shards → expect Static The evaluation reason logic is: - hasShards → Split - hasRules → TargetingMatch - neither → Static This aligns with system-tests fixtures and OpenFeature conventions.
- Wrap ArgumentException from invalid regex patterns in FormatException to produce PARSE_ERROR instead of GENERAL error - Return "PROVIDER_NOT_READY" error string when evaluator is null for proper OpenFeature error mapping - Handle RC reset (empty config list) by clearing evaluator to trigger PROVIDER_NOT_READY on subsequent evaluations - Add unit test for invalid regex PARSE_ERROR behavior
- Error property now contains human-readable messages for debugging - FlagMetadata["errorCode"] contains OpenFeature codes for programmatic handling - ToErrorType() reads from metadata instead of error string - Updated unit tests to verify both error and errorCode This pattern is consistent with how Go and Python SDKs handle errors.
ef1e68e to
56e2d16
Compare
- Add ToOpenFeatureReason() to map our enum to OpenFeature Reason constants - Simplify ReasonToString() to just ToLowerInvariant() since OpenFeature already uses UPPER_SNAKE_CASE format - Fix test assertions to check human-readable Error messages and errorCode in FlagMetadata separately
56e2d16 to
136880a
Compare
The Datadog.FeatureFlags.OpenFeature package requires OpenFeature >= 2.3.0 for the Reason constants and FinallyAsync with details parameter used in the flag evaluation metrics hook. Update PackageVersionsGeneratorDefinitions.json: - MinVersion: 2.0.0 -> 2.3.0 - SpecificVersions: Replace 2.0.0 with 2.3.0
439210c to
49262b1
Compare
Run ./tracer/build.sh GeneratePackageVersions to update the generated .props and .cs files after changing MinVersion in PackageVersionsGeneratorDefinitions.json. See docs/development/AutomaticInstrumentation.md for the process.
| defaultValue, | ||
| EvaluationReason.Error, | ||
| error: "PROVIDER_NOT_READY", | ||
| error: "No config loaded", |
There was a problem hiding this comment.
Using human readable messages here is consistent with the other SDKs:
Errorproperty → Human-readable message for logging/debuggingFlagMetadata["errorCode"]→ OpenFeature error code for programmatic handling
The Go SDK uses the same pattern - human-readable Go errors that get mapped to OpenFeature ResolutionError
And the Python SDK also has a human-readable message alongside an error code.
There remain some small work to do, but I think this is overall a step towards being more consistent
| OpenFeature Code | .NET Human-Readable | Go Human-Readable | Python Human-Readable |
|---|---|---|---|
| PROVIDER_NOT_READY | "No config loaded" | N/A (different flow) | "No FFE configuration loaded" |
| FLAG_NOT_FOUND | "Flag not found" | "flag not found" | "Flag not found" |
| TYPE_MISMATCH | "Type mismatch" | "type mismatch" | "Type mismatch" |
| PARSE_ERROR | (exception message) | (exception message) | (exception message) |
| TARGETING_KEY_MISSING | "Targeting key missing" | N/A | N/A |
| GENERAL | (exception message) | N/A | N/A |
|
|
||
| /// <inheritdoc/> | ||
| public override ValueTask FinallyAsync<T>( | ||
| HookContext<T> context, |
There was a problem hiding this comment.
Unhandled exception in hook
FinallyAsync calls _metrics.Record unconditionally; if Record throws (e.g. TagList overflow or counter disposed), the exception propagates out of the OpenFeature hook pipeline with no logging or recovery, silently breaking flag evaluation for the caller.
Consider wrapping the body of FinallyAsync in a try/catch that logs the exception at Debug level and returns default, so a metrics failure never surfaces to the application.
| <Version>2.0.1</Version> | ||
| <!-- These target frameworks should match the values exposed in the OpenFeature package referenced below--> | ||
| <TargetFrameworks>net462;netstandard2.0;net6.0;net8.0</TargetFrameworks> | ||
| <TargetFrameworks>net462;netstandard2.0;net8.0;net9.0</TargetFrameworks> |
There was a problem hiding this comment.
Missing test coverage for metrics hook
Consider adding tests for the un-tested logic in the hook — specifically the branching in FinallyAsync (error-type extraction, allocation-key extraction from metadata, unknown-reason fallback). A MeterListener or IMeterFactory test double could assert Counter<long> increments with expected TagList values.
There was a problem hiding this comment.
I ran into issues initially adding tests with MeterListener. Both Datadog.Trace and Datadog.FeatureFlags.OpenFeature define ValueType and EvaluationReason (via shared source files), causing CS0433 ambiguity errors when both are referenced in the test project Datadog.Trace.Tests.
error CS0433: The type 'ValueType' exists in both 'Datadog.FeatureFlags.OpenFeature' and 'Datadog.Trace'
error CS0433: The type 'EvaluationReason' exists in both 'Datadog.FeatureFlags.OpenFeature' and 'Datadog.Trace'
One workaround is to create a separate Datadog.FeatureFlags.OpenFeature.Tests project (without referencing Datadog.Trace.Tests), so I went ahead and did that: fafbc0d
There was a problem hiding this comment.
I'd rather we didn't create a new test project if possible, especially at this stage, as it generally slows down CI which is an issue we're having more and more 🙁
An alternative approach (which we uses for other tests, is to specify an alias when referencing the project like this:
and then you can reference it like this:
There was a problem hiding this comment.
I've decided to remove the tests
- They're minimally valuable: The 3 unit tests cover simple transformations (null checks, a switch statement).
- They're challenging to set up correctly: The proposed alias approach caused dependency conflicts: OpenFeature 2.3.0 transitively pulls in Microsoft.Extensions.* 9.0.0, which conflicts with the existing Microsoft.AspNetCore 2.2.0 packages in Datadog.Trace.Tests (CS0433 type ambiguity errors).
- And we already have equivalent test coverage with test_flag_eval_metrics in system tests
tests/ffe/test_flag_eval_metrics.py::Test_FFE_Eval_Reason_*- tests reason valuestests/ffe/test_flag_eval_metrics.py::Test_FFE_Eval_Error_*- tests error type extraction- Allocation key extraction is tested via the full metrics pipeline
typotter
left a comment
There was a problem hiding this comment.
generally, LGTM. Couple of missing test suggestions and an issue with the SPLIT reason not taking precedence.
- Create dedicated test project to avoid CS0433 type conflicts - Test error-type extraction, allocation-key extraction, unknown-reason fallback - Uses MeterListener to verify counter increments with expected tags
Addresses typotter's review feedback (Comment 3) on PR #8367: - Tests that RegisterOnNewConfigEventHandler callback is invoked when config is removed - Tests that Evaluate returns PROVIDER_NOT_READY after RC reset Uses RcmSubscriptionManagerMock to simulate remote config changes.
andrewlock
left a comment
There was a problem hiding this comment.
Thanks for the updates - I have a few minor suggestions, but the main one is to move the FlagEvalMetricsHookTests back into Datadog.Trace.Tests, and use aliases to avoid the ambiguity if required 🙂
| "MinVersion": "2.0.0", | ||
| "MinVersion": "2.3.0", | ||
| "MaxVersionExclusive": "3.0.0", | ||
| "SpecificVersions": [ | ||
| "2.0.0", | ||
| "2.3.0", | ||
| "2.10.0" |
There was a problem hiding this comment.
I'm fine with doing 3 as long as we're explicit about it, which we're being here, so I think that's ok 🙂 And also a good point about being experimental 👍
| <Version>2.0.1</Version> | ||
| <!-- These target frameworks should match the values exposed in the OpenFeature package referenced below--> | ||
| <TargetFrameworks>net462;netstandard2.0;net6.0;net8.0</TargetFrameworks> | ||
| <TargetFrameworks>net462;netstandard2.0;net8.0;net9.0</TargetFrameworks> |
There was a problem hiding this comment.
I'd rather we didn't create a new test project if possible, especially at this stage, as it generally slows down CI which is an issue we're having more and more 🙁
An alternative approach (which we uses for other tests, is to specify an alias when referencing the project like this:
and then you can reference it like this:
| return new TracerSettings(new NameValueConfigurationSource(collection)); | ||
| } | ||
|
|
||
| private class RcmSubscriptionManagerMock : IRcmSubscriptionManager |
There was a problem hiding this comment.
This code already exists as a nested type in more than one class. I think it's time to extract it to its own standalone type and remove the duplication 🙂
There was a problem hiding this comment.
I see 3 implementations with slightly different behaviors:
DynamicInstrumentationTests- manages ProductKeys on Replace/UnsubscribeSymbolUploaderTest- supports multiple subscriptions + hasUpdate()methodFeatureFlagsModuleTests- simplest, just stores LastSubscription
I went with the simplest 4726a7d assuming we can refactor again later when we need to unify more
| // - TargetingMatch: Allocation had targeting rules that matched | ||
| // - Split: No rules, but resolved via percentage split (shards) | ||
| // - Static: No rules, no shards - simple static value | ||
| var reason = hadRules ? EvaluationReason.TargetingMatch |
…gs.OpenFeature.csproj Co-authored-by: Andrew Lock <andrew.lock@datadoghq.com>
…NG_MATCH Per the FFE spec, SPLIT should override TARGETING_MATCH when both rules and shards are present. Updated the reason determination logic to check hadShards first.
Per review feedback - preserves stack trace for debugging.
Clarify that the NET8+ constraint is due to MeterListener (NET6+), not OpenFeature which supports all frameworks.
Reverts fafbc0d and 1d78ab2. The separate test project added CI overhead for minimal value: - Only 3 unit tests covering simple transformations (null checks, switch statement) - System-tests already provide equivalent coverage: - tests/ffe/test_flag_eval_metrics.py::Test_FFE_Eval_Reason_* tests reason values - tests/ffe/test_flag_eval_metrics.py::Test_FFE_Eval_Error_* tests error type extraction - Allocation key extraction is tested via the full metrics pipeline Additionally, moving these tests to Datadog.Trace.Tests (per review feedback) would cause dependency conflicts - OpenFeature 2.3.0's transitive deps conflict with the existing Microsoft.AspNetCore 2.2.0 packages.
The "Log at debug level and swallow the exception" comment was redundant given the code is self-explanatory.
0963903 to
4726a7d
Compare
andrewlock
left a comment
There was a problem hiding this comment.
LGTM, thanks all the work on this!
| catch (Exception ex) | ||
| { | ||
| // Metrics recording should never break flag evaluation. | ||
| System.Diagnostics.Debug.WriteLine($"[Datadog] FlagEvalMetricsHook.FinallyAsync failed: {ex}"); |
There was a problem hiding this comment.
Just so you're aware, we'll basically never know if this is happening in production... There's an enhancement we could/should make which is to have an explicit method in Datadog.Trace.Manual which is intended specifically for reporting errors to our telemetry backend (where the errors are our errors, not errors in API calling).
I'm undecided whether we should do it straight away to catch any issues here, but either way, not for this PR!
There was a problem hiding this comment.
Yep understood. These logs are only observable in dev with debug builds and agreed it would be nice in the future to have visibility in prod.
Appreciate the careful review and all the feedback!
Summary of changes
Adds flag evaluation metrics support to the Datadog OpenFeature provider. When a feature flag is evaluated, a counter metric is emitted with relevant attributes.
Reason for change
Part of the Feature Flags Evaluation (FFE) initiative to provide visibility into flag evaluations across all Datadog server SDKs. This brings parity with implementations in dd-trace-py and dd-trace-go.
Implementation details
FlagEvalMetrics.cs: Core metrics recording class using
System.Diagnostics.MetricsDatadog.FeatureFlags.OpenFeaturefeature_flag.evaluations(Counter){evaluation}feature_flag.key,feature_flag.result.variant,feature_flag.result.reason,error.type(optional),feature_flag.result.allocation_key(optional)FlagEvalMetricsHook.cs: OpenFeature hook that implements
FinallyAsyncto record metrics after each evaluation completesErrorTypeenum to snake_case stringsDatadogProvider.cs: Registers the metrics hook with the OpenFeature API
Updated OpenFeature SDK dependency from 2.0.0 to 2.3.0
Package version bumped to 2.3.0 to match OpenFeature SDK dependency version. This signals the breaking changes from OpenFeature 2.3.0:
finallysignature changed to include evaluation detailsConditional compilation: Only enabled for .NET 6+ (
#if NET6_0_OR_GREATER) sinceSystem.Diagnostics.Metricsrequires .NET 6+.Test coverage
tests/ffe/test_flag_eval_metrics.py)Datadog.FeatureFlags.OpenFeatureandDatadog.TraceassembliesOther details
Jira: https://datadoghq.atlassian.net/browse/FFL-1946
Related PRs: