Revert 'Reapply "Revert Load the tracer/profiler after guardrails checks" (#6959)' by andrewlock · Pull Request #6986 · DataDog/dd-trace-dotnet

andrewlock · 2025-05-16T16:08:48Z

Summary of changes

This reverts commit c66626c.

Reason for change

We're seeing crashes in arm64 on fedora 35 only, when we initialize the WAF. Reverting this for now while we investigate to unblock master

Implementation details

Revert the reapplied revert

Test coverage

Covered by existing

Other details

) This reverts commit c66626c.

andrewlock · 2025-05-16T16:42:27Z

Execution-Time Benchmarks Report ⏱️

Execution-time results for samples comparing the following branches/commits:

Execution-time benchmarks measure the whole time it takes to execute a program. And are intended to measure the one-off costs. Cases where the execution time results for the PR are worse than latest master results are shown in red. The following thresholds were used for comparing the execution times:

Welch test with statistical test for significance of 5%
Only results indicating a difference greater than 5% and 5 ms are considered.

Note that these results are based on a single point-in-time result for each branch. For full results, see the dashboard.

Graphs show the p99 interval based on the mean and StdDev of the test run, as well as the mean value of the run (shown as a diamond below the graph).

gantt
    title Execution time (ms) FakeDbCommand (.NET Framework 4.8) 
    dateFormat  X
    axisFormat %s
    todayMarker off
    section Baseline
    This PR (6986) - mean (73ms)  : 68, 78
     .   : milestone, 73,
    master - mean (68ms)  : 65, 70
     .   : milestone, 68,

    section CallTarget+Inlining+NGEN
    This PR (6986) - mean (1,039ms)  : 994, 1085
     .   : milestone, 1039,
    master - mean (1,007ms)  : 982, 1031
     .   : milestone, 1007,

gantt
    title Execution time (ms) FakeDbCommand (.NET Core 3.1) 
    dateFormat  X
    axisFormat %s
    todayMarker off
    section Baseline
    This PR (6986) - mean (107ms)  : 105, 110
     .   : milestone, 107,
    master - mean (102ms)  : 99, 104
     .   : milestone, 102,

    section CallTarget+Inlining+NGEN
    This PR (6986) - mean (728ms)  : 694, 763
     .   : milestone, 728,
    master - mean (691ms)  : 673, 710
     .   : milestone, 691,

gantt
    title Execution time (ms) FakeDbCommand (.NET 6) 
    dateFormat  X
    axisFormat %s
    todayMarker off
    section Baseline
    This PR (6986) - mean (93ms)  : 90, 96
     .   : milestone, 93,
    master - mean (88ms)  : 86, 91
     .   : milestone, 88,

    section CallTarget+Inlining+NGEN
    This PR (6986) - mean (684ms)  : 639, 729
     .   : milestone, 684,
    master - mean (655ms)  : 634, 677
     .   : milestone, 655,

gantt
    title Execution time (ms) HttpMessageHandler (.NET Framework 4.8) 
    dateFormat  X
    axisFormat %s
    todayMarker off
    section Baseline
    This PR (6986) - mean (189ms)  : 181, 197
     .   : milestone, 189,
    master - mean (189ms)  : 186, 192
     .   : milestone, 189,

    section CallTarget+Inlining+NGEN
    This PR (6986) - mean (1,110ms)  : 1086, 1134
     .   : milestone, 1110,
    master - mean (1,111ms)  : 1077, 1145
     .   : milestone, 1111,

gantt
    title Execution time (ms) HttpMessageHandler (.NET Core 3.1) 
    dateFormat  X
    axisFormat %s
    todayMarker off
    section Baseline
    This PR (6986) - mean (268ms)  : 263, 272
     .   : milestone, 268,
    master - mean (268ms)  : 263, 272
     .   : milestone, 268,

    section CallTarget+Inlining+NGEN
    This PR (6986) - mean (875ms)  : 845, 905
     .   : milestone, 875,
    master - mean (876ms)  : 845, 907
     .   : milestone, 876,

gantt
    title Execution time (ms) HttpMessageHandler (.NET 6) 
    dateFormat  X
    axisFormat %s
    todayMarker off
    section Baseline
    This PR (6986) - mean (261ms)  : 258, 264
     .   : milestone, 261,
    master - mean (261ms)  : 257, 264
     .   : milestone, 261,

    section CallTarget+Inlining+NGEN
    This PR (6986) - mean (865ms)  : 841, 890
     .   : milestone, 865,
    master - mean (867ms)  : 836, 898
     .   : milestone, 867,

datadog-datadog-prod-us1 · 2025-05-16T16:42:41Z

Datadog Report

All test runs 5ff205f 🔗

✅ 2 Total Test Services: 0 Failed, 2 Passed
⌛ 1 with Regressions

Test Services

Service Name	Failed	Known Flaky	New Flaky	Passed	Skipped	Total Time	Performance Regressions	Test Service View
`dd-trace-dotnet`	0	0	0	264947	2357	17h 51m 24.39s	6	Link
`exploration_tests`	0	0	0	22085	3	2m 12.95s	0	Link

⌛ Performance Regressions vs Default Branch (6)

This report shows up to 5 performance regressions.

Baseline - Samples.FakeDbCommand.windows.net48.json.scenarios 72.95ms (+5.01ms, +7%) - Details
CallTarget+Inlining+NGEN - Samples.FakeDbCommand.windows.net48.json.scenarios 1.04s (+35.07ms, +3%) - Details
CallTarget+Inlining+NGEN - Samples.FakeDbCommand.windows.netcoreapp31.json.scenarios 728.32ms (+35.31ms, +5%) - Details
Baseline - Samples.FakeDbCommand.windows.netcoreapp31.json.scenarios 107.37ms (+5.4ms, +5%) - Details
CallTarget+Inlining+NGEN - Samples.FakeDbCommand.windows.net60.json.scenarios 684.19ms (+29.78ms, +5%) - Details

gleocadie

LGTM

…ails checks" (#6959)' (#6986)" This reverts commit cc799fb.

## Summary of changes - Revert "Revert 'Reapply "Revert Load the tracer/profiler after guardrails checks" (#6959)' (#6986)" #7051 - Avoid calling `dlclose` on glibc 2.34-2.36 ## Reason for change We want to make sure we only load the tracer/profiler libraries _after_ we have run guardrails checks. However, repeated attempts to do this were causing crashes in our smoke tests on Fedora 35, on arm64 only. After an (extensive) investigation, we finally tracked this down to a bug in glibc itself. The full explanation is given below, but the upshot is that calling `dlclose` on this buggy glibc version can cause crashes at some later point. ## Implementation details - Includes the "original" implementation that moves the tracer/profiler after guardrails checks - Before calling `dlclose`, check to see if we're on a buggy version of glibc - This is complicated by the fact we run the _same_ binary on glibc and musl, so we have to make the glibc call _dynamically_, instead of linking directly against the `gnu_get_libc_version` method. - We avoid trying to load libc if we detect that we're on Alpine by checking for `/etc/alpine-release`. This is slightly annoying, but required in the native loader code. - If we _do_ manage to call the glibc version, we check against the blocklist. If the version is on it, we avoid ever calling dlclose. ## Test coverage We have run repeated tests against the previously crashing tests, and this has resolved the issue. Nevertheless, our recommendation for customers should definitely be to upgrade to a stable version of glibc wherever possible ## Other details Overview of the bug: In certain versions of glibc, there is a TLS-reuse bug that can cause crashes when unloading shared libraries. The bug was introduced in 2.34, fixed in 2.36 on x86-64, and fixed in 2.37 on aarch64. See [the bug here](https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=3921c5b40f293c57cb326f58713c924b0662ef59) or [the explanation of the bug on Fedora](https://bugzilla.redhat.com/show_bug.cgi?id=2251557) (which is where we saw the crashing issue). glibc 2.34 shipped with a regression: after a `dlclose()` of a library that carried dynamic-TLS, the loader could reuse the same "module-ID" for a different library _without_ first clearing the associated DTV (Dynamic Thread Vector) entry. The next time any code accessed that TLS slot it could read or write an unmapped address, which cases a `SIGSEGV`. This manifested as a crash in the WAF when we called `ddwaf_context_info` on arm64. It explicitly happens on arm64 when we unload the continuous profiler shortly after loading it (which is normal because it's not supported). It manifests in this scenario because `ddwaf_context_init` starts like this: ```assembly +128 bl __tls_get_addr ; ask glibc for the TLS slot for libddwaf +136 mrs x11, TPIDR_EL0 ; TLS base for this thread +140 ldrb w9, [x11, x0] ; <–– boom if x0 points to a stale DTV entry ``` When we unload the continuous profiler and call `dlcose`, it causes the glibc loader to hand out a recycled module-ID to `libddwaf`. When `ddwaf_context_init` tries to access TLS in the `ldrb` instruction, `x11 + x0` is outside every mapped version, and so it crashes. Note that although calling `dlclose` with the continuous profiler _may_ trigger the issue (the actual crash is flaky depending on load/unload timing and address layout), unloading _any_ library that is is built with `__thread`/`thread_local` data could trigger the crash. To minimize the risk of hitting this issue, we avoid calling `dlclose` entirely on the flaky glibc versions For more details [see doc](https://docs.google.com/document/d/1aptwmprnd83VTZMKxrrTqmBF6eOayjCL36kj9LujCE8/edit?tab=t.0#heading=h.nytiofltvdb5) Affected distros: | Distribution | Version / Release | glibc Version | | ----------------- | ----------------------------- | ------------- | | Fedora | 35 | 2.34 | | Fedora | 37 | 2.36 | | Ubuntu | 21.10 ("Impish Indri") | 2.34 | | Ubuntu | 22.04 LTS ("Jammy Jellyfish") | 2.35 | | Debian | 12 ("Bookworm") | 2.36 | | RHEL | 9 | 2.34 | | CentOS | 9 | 2.34 | | Amazon Linux 2023 | default | 2.34 |

…cks" (#6959)' (#6986) ## Summary of changes This reverts commit c66626c. ## Reason for change We're seeing crashes in arm64 on fedora 35 only, when we initialize the WAF. Reverting this for now while we investigate to unblock `master` ## Implementation details Revert the reapplied revert ## Test coverage Covered by existing ## Other details

## Summary of changes - Revert "Revert 'Reapply "Revert Load the tracer/profiler after guardrails checks" (#6959)' (#6986)" #7051 - Avoid calling `dlclose` on glibc 2.34-2.36 ## Reason for change We want to make sure we only load the tracer/profiler libraries _after_ we have run guardrails checks. However, repeated attempts to do this were causing crashes in our smoke tests on Fedora 35, on arm64 only. After an (extensive) investigation, we finally tracked this down to a bug in glibc itself. The full explanation is given below, but the upshot is that calling `dlclose` on this buggy glibc version can cause crashes at some later point. ## Implementation details - Includes the "original" implementation that moves the tracer/profiler after guardrails checks - Before calling `dlclose`, check to see if we're on a buggy version of glibc - This is complicated by the fact we run the _same_ binary on glibc and musl, so we have to make the glibc call _dynamically_, instead of linking directly against the `gnu_get_libc_version` method. - We avoid trying to load libc if we detect that we're on Alpine by checking for `/etc/alpine-release`. This is slightly annoying, but required in the native loader code. - If we _do_ manage to call the glibc version, we check against the blocklist. If the version is on it, we avoid ever calling dlclose. ## Test coverage We have run repeated tests against the previously crashing tests, and this has resolved the issue. Nevertheless, our recommendation for customers should definitely be to upgrade to a stable version of glibc wherever possible ## Other details Overview of the bug: In certain versions of glibc, there is a TLS-reuse bug that can cause crashes when unloading shared libraries. The bug was introduced in 2.34, fixed in 2.36 on x86-64, and fixed in 2.37 on aarch64. See [the bug here](https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=3921c5b40f293c57cb326f58713c924b0662ef59) or [the explanation of the bug on Fedora](https://bugzilla.redhat.com/show_bug.cgi?id=2251557) (which is where we saw the crashing issue). glibc 2.34 shipped with a regression: after a `dlclose()` of a library that carried dynamic-TLS, the loader could reuse the same "module-ID" for a different library _without_ first clearing the associated DTV (Dynamic Thread Vector) entry. The next time any code accessed that TLS slot it could read or write an unmapped address, which cases a `SIGSEGV`. This manifested as a crash in the WAF when we called `ddwaf_context_info` on arm64. It explicitly happens on arm64 when we unload the continuous profiler shortly after loading it (which is normal because it's not supported). It manifests in this scenario because `ddwaf_context_init` starts like this: ```assembly +128 bl __tls_get_addr ; ask glibc for the TLS slot for libddwaf +136 mrs x11, TPIDR_EL0 ; TLS base for this thread +140 ldrb w9, [x11, x0] ; <–– boom if x0 points to a stale DTV entry ``` When we unload the continuous profiler and call `dlcose`, it causes the glibc loader to hand out a recycled module-ID to `libddwaf`. When `ddwaf_context_init` tries to access TLS in the `ldrb` instruction, `x11 + x0` is outside every mapped version, and so it crashes. Note that although calling `dlclose` with the continuous profiler _may_ trigger the issue (the actual crash is flaky depending on load/unload timing and address layout), unloading _any_ library that is is built with `__thread`/`thread_local` data could trigger the crash. To minimize the risk of hitting this issue, we avoid calling `dlclose` entirely on the flaky glibc versions For more details [see doc](https://docs.google.com/document/d/1aptwmprnd83VTZMKxrrTqmBF6eOayjCL36kj9LujCE8/edit?tab=t.0#heading=h.nytiofltvdb5) Affected distros: | Distribution | Version / Release | glibc Version | | ----------------- | ----------------------------- | ------------- | | Fedora | 35 | 2.34 | | Fedora | 37 | 2.36 | | Ubuntu | 21.10 ("Impish Indri") | 2.34 | | Ubuntu | 22.04 LTS ("Jammy Jellyfish") | 2.35 | | Debian | 12 ("Bookworm") | 2.36 | | RHEL | 9 | 2.34 | | CentOS | 9 | 2.34 | | Amazon Linux 2023 | default | 2.34 |

Reapply "Revert Load the tracer/profiler after guardrails checks" (#6959

5ff205f

) This reverts commit c66626c.

andrewlock marked this pull request as ready for review May 22, 2025 13:54

andrewlock requested review from a team as code owners May 22, 2025 13:54

gleocadie approved these changes May 22, 2025

View reviewed changes

lucaspimentel approved these changes May 22, 2025

View reviewed changes

andrewlock merged commit cc799fb into master May 22, 2025
128 of 129 checks passed

andrewlock deleted the andrew/revert-revert-revert branch May 22, 2025 14:46

github-actions Bot added this to the vNext-v3 milestone May 22, 2025

andrewlock added area:native-library Automatic instrumentation native C++ code (Datadog.Trace.ClrProfiler.Native) area:shared-components labels May 23, 2025

andrewlock added a commit that referenced this pull request Jun 3, 2025

Revert "Revert 'Reapply "Revert Load the tracer/profiler after guardr…

3001163

…ails checks" (#6959)' (#6986)" This reverts commit cc799fb.

andrewlock added a commit that referenced this pull request Jun 3, 2025

Revert "Revert 'Reapply "Revert Load the tracer/profiler after guardr…

fa0e15a

…ails checks" (#6959)' (#6986)" This reverts commit cc799fb.

bouwkast mentioned this pull request Jun 3, 2025

Revert "Revert 'Reapply "Revert Load the tracer/profiler after guardrails checks" (#6959)' (#6986)" #7051

Closed

andrewlock added a commit that referenced this pull request Jun 6, 2025

Revert "Revert 'Reapply "Revert Load the tracer/profiler after guardr…

8366caa

…ails checks" (#6959)' (#6986)" This reverts commit cc799fb.

andrewlock added a commit that referenced this pull request Jun 9, 2025

Revert "Revert 'Reapply "Revert Load the tracer/profiler after guardr…

4699c73

…ails checks" (#6959)' (#6986)" This reverts commit cc799fb.

andrewlock added a commit that referenced this pull request Jun 9, 2025

Revert "Revert 'Reapply "Revert Load the tracer/profiler after guardr…

1e774a8

…ails checks" (#6959)' (#6986)" This reverts commit cc799fb.

andrewlock added a commit that referenced this pull request Jun 10, 2025

Revert "Revert 'Reapply "Revert Load the tracer/profiler after guardr…

7f1ec21

…ails checks" (#6959)' (#6986)" This reverts commit cc799fb.

andrewlock added a commit that referenced this pull request Jun 11, 2025

Revert "Revert 'Reapply "Revert Load the tracer/profiler after guardr…

d442aa9

…ails checks" (#6959)' (#6986)" This reverts commit cc799fb.

andrewlock added a commit that referenced this pull request Jun 11, 2025

Revert "Revert 'Reapply "Revert Load the tracer/profiler after guardr…

d15f68d

…ails checks" (#6959)' (#6986)" This reverts commit cc799fb.

andrewlock added a commit that referenced this pull request Jun 12, 2025

Revert "Revert 'Reapply "Revert Load the tracer/profiler after guardr…

d356620

…ails checks" (#6959)' (#6986)" This reverts commit cc799fb.

andrewlock added a commit that referenced this pull request Jun 19, 2025

Revert "Revert 'Reapply "Revert Load the tracer/profiler after guardr…

f871a50

…ails checks" (#6959)' (#6986)" This reverts commit cc799fb.

andrewlock added a commit that referenced this pull request Jun 20, 2025

Revert "Revert 'Reapply "Revert Load the tracer/profiler after guardr…

400f586

…ails checks" (#6959)' (#6986)" This reverts commit cc799fb.

andrewlock mentioned this pull request Jun 23, 2025

Avoid dlclose on glibc 2.34-2.36 #7125

Merged

andrewlock added a commit that referenced this pull request Jun 23, 2025

Revert "Revert 'Reapply "Revert Load the tracer/profiler after guardr…

99e1580

…ails checks" (#6959)' (#6986)" This reverts commit cc799fb.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Revert 'Reapply "Revert Load the tracer/profiler after guardrails checks" (#6959)'#6986

Revert 'Reapply "Revert Load the tracer/profiler after guardrails checks" (#6959)'#6986
andrewlock merged 1 commit into
masterfrom
andrew/revert-revert-revert

andrewlock commented May 16, 2025 •

edited

Loading

Uh oh!

andrewlock commented May 16, 2025

Uh oh!

datadog-datadog-prod-us1 Bot commented May 16, 2025 •

edited

Loading

Uh oh!

gleocadie left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

andrewlock commented May 16, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary of changes

Reason for change

Implementation details

Test coverage

Other details

Uh oh!

andrewlock commented May 16, 2025

Execution-Time Benchmarks Report ⏱️

Uh oh!

datadog-datadog-prod-us1 Bot commented May 16, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Datadog Report

⌛ Performance Regressions vs Default Branch (6)

Uh oh!

gleocadie left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

andrewlock commented May 16, 2025 •

edited

Loading

datadog-datadog-prod-us1 Bot commented May 16, 2025 •

edited

Loading