Skip to content

Revert 'Reapply "Revert Load the tracer/profiler after guardrails checks" (#6959)'#6986

Merged
andrewlock merged 1 commit into
masterfrom
andrew/revert-revert-revert
May 22, 2025
Merged

Revert 'Reapply "Revert Load the tracer/profiler after guardrails checks" (#6959)'#6986
andrewlock merged 1 commit into
masterfrom
andrew/revert-revert-revert

Conversation

@andrewlock

@andrewlock andrewlock commented May 16, 2025

Copy link
Copy Markdown
Member

Summary of changes

This reverts commit c66626c.

Reason for change

We're seeing crashes in arm64 on fedora 35 only, when we initialize the WAF. Reverting this for now while we investigate to unblock master

Implementation details

Revert the reapplied revert

Test coverage

Covered by existing

Other details

@andrewlock

Copy link
Copy Markdown
Member Author

Execution-Time Benchmarks Report ⏱️

Execution-time results for samples comparing the following branches/commits:

Execution-time benchmarks measure the whole time it takes to execute a program. And are intended to measure the one-off costs. Cases where the execution time results for the PR are worse than latest master results are shown in red. The following thresholds were used for comparing the execution times:

  • Welch test with statistical test for significance of 5%
  • Only results indicating a difference greater than 5% and 5 ms are considered.

Note that these results are based on a single point-in-time result for each branch. For full results, see the dashboard.

Graphs show the p99 interval based on the mean and StdDev of the test run, as well as the mean value of the run (shown as a diamond below the graph).

gantt
    title Execution time (ms) FakeDbCommand (.NET Framework 4.8) 
    dateFormat  X
    axisFormat %s
    todayMarker off
    section Baseline
    This PR (6986) - mean (73ms)  : 68, 78
     .   : milestone, 73,
    master - mean (68ms)  : 65, 70
     .   : milestone, 68,

    section CallTarget+Inlining+NGEN
    This PR (6986) - mean (1,039ms)  : 994, 1085
     .   : milestone, 1039,
    master - mean (1,007ms)  : 982, 1031
     .   : milestone, 1007,

Loading
gantt
    title Execution time (ms) FakeDbCommand (.NET Core 3.1) 
    dateFormat  X
    axisFormat %s
    todayMarker off
    section Baseline
    This PR (6986) - mean (107ms)  : 105, 110
     .   : milestone, 107,
    master - mean (102ms)  : 99, 104
     .   : milestone, 102,

    section CallTarget+Inlining+NGEN
    This PR (6986) - mean (728ms)  : 694, 763
     .   : milestone, 728,
    master - mean (691ms)  : 673, 710
     .   : milestone, 691,

Loading
gantt
    title Execution time (ms) FakeDbCommand (.NET 6) 
    dateFormat  X
    axisFormat %s
    todayMarker off
    section Baseline
    This PR (6986) - mean (93ms)  : 90, 96
     .   : milestone, 93,
    master - mean (88ms)  : 86, 91
     .   : milestone, 88,

    section CallTarget+Inlining+NGEN
    This PR (6986) - mean (684ms)  : 639, 729
     .   : milestone, 684,
    master - mean (655ms)  : 634, 677
     .   : milestone, 655,

Loading
gantt
    title Execution time (ms) HttpMessageHandler (.NET Framework 4.8) 
    dateFormat  X
    axisFormat %s
    todayMarker off
    section Baseline
    This PR (6986) - mean (189ms)  : 181, 197
     .   : milestone, 189,
    master - mean (189ms)  : 186, 192
     .   : milestone, 189,

    section CallTarget+Inlining+NGEN
    This PR (6986) - mean (1,110ms)  : 1086, 1134
     .   : milestone, 1110,
    master - mean (1,111ms)  : 1077, 1145
     .   : milestone, 1111,

Loading
gantt
    title Execution time (ms) HttpMessageHandler (.NET Core 3.1) 
    dateFormat  X
    axisFormat %s
    todayMarker off
    section Baseline
    This PR (6986) - mean (268ms)  : 263, 272
     .   : milestone, 268,
    master - mean (268ms)  : 263, 272
     .   : milestone, 268,

    section CallTarget+Inlining+NGEN
    This PR (6986) - mean (875ms)  : 845, 905
     .   : milestone, 875,
    master - mean (876ms)  : 845, 907
     .   : milestone, 876,

Loading
gantt
    title Execution time (ms) HttpMessageHandler (.NET 6) 
    dateFormat  X
    axisFormat %s
    todayMarker off
    section Baseline
    This PR (6986) - mean (261ms)  : 258, 264
     .   : milestone, 261,
    master - mean (261ms)  : 257, 264
     .   : milestone, 261,

    section CallTarget+Inlining+NGEN
    This PR (6986) - mean (865ms)  : 841, 890
     .   : milestone, 865,
    master - mean (867ms)  : 836, 898
     .   : milestone, 867,

Loading

@datadog-datadog-prod-us1

datadog-datadog-prod-us1 Bot commented May 16, 2025

Copy link
Copy Markdown

Datadog Report

All test runs 5ff205f 🔗

2 Total Test Services: 0 Failed, 2 Passed
1 with Regressions

Test Services
Service Name Failed Known Flaky New Flaky Passed Skipped Total Time Performance Regressions Test Service View
dd-trace-dotnet 0 0 0 264947 2357 17h 51m 24.39s 6 Link
exploration_tests 0 0 0 22085 3 2m 12.95s 0 Link

⌛ Performance Regressions vs Default Branch (6)

This report shows up to 5 performance regressions.

  • Baseline - Samples.FakeDbCommand.windows.net48.json.scenarios 72.95ms (+5.01ms, +7%) - Details
  • CallTarget+Inlining+NGEN - Samples.FakeDbCommand.windows.net48.json.scenarios 1.04s (+35.07ms, +3%) - Details
  • CallTarget+Inlining+NGEN - Samples.FakeDbCommand.windows.netcoreapp31.json.scenarios 728.32ms (+35.31ms, +5%) - Details
  • Baseline - Samples.FakeDbCommand.windows.netcoreapp31.json.scenarios 107.37ms (+5.4ms, +5%) - Details
  • CallTarget+Inlining+NGEN - Samples.FakeDbCommand.windows.net60.json.scenarios 684.19ms (+29.78ms, +5%) - Details

@andrewlock andrewlock marked this pull request as ready for review May 22, 2025 13:54
@andrewlock andrewlock requested review from a team as code owners May 22, 2025 13:54

@gleocadie gleocadie left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@andrewlock andrewlock merged commit cc799fb into master May 22, 2025
128 of 129 checks passed
@andrewlock andrewlock deleted the andrew/revert-revert-revert branch May 22, 2025 14:46
@github-actions github-actions Bot added this to the vNext-v3 milestone May 22, 2025
@andrewlock andrewlock added area:native-library Automatic instrumentation native C++ code (Datadog.Trace.ClrProfiler.Native) area:shared-components labels May 23, 2025
andrewlock added a commit that referenced this pull request Jun 3, 2025
andrewlock added a commit that referenced this pull request Jun 3, 2025
andrewlock added a commit that referenced this pull request Jun 6, 2025
andrewlock added a commit that referenced this pull request Jun 9, 2025
andrewlock added a commit that referenced this pull request Jun 9, 2025
andrewlock added a commit that referenced this pull request Jun 10, 2025
andrewlock added a commit that referenced this pull request Jun 11, 2025
andrewlock added a commit that referenced this pull request Jun 11, 2025
andrewlock added a commit that referenced this pull request Jun 12, 2025
andrewlock added a commit that referenced this pull request Jun 19, 2025
andrewlock added a commit that referenced this pull request Jun 20, 2025
andrewlock added a commit that referenced this pull request Jun 23, 2025
andrewlock added a commit that referenced this pull request Jun 24, 2025
## Summary of changes

- Revert "Revert 'Reapply "Revert Load the tracer/profiler after
guardrails checks" (#6959)' (#6986)" #7051
- Avoid calling `dlclose` on glibc 2.34-2.36

## Reason for change

We want to make sure we only load the tracer/profiler libraries _after_
we have run guardrails checks. However, repeated attempts to do this
were causing crashes in our smoke tests on Fedora 35, on arm64 only.
After an (extensive) investigation, we finally tracked this down to a
bug in glibc itself. The full explanation is given below, but the upshot
is that calling `dlclose` on this buggy glibc version can cause crashes
at some later point.

## Implementation details

- Includes the "original" implementation that moves the tracer/profiler
after guardrails checks
- Before calling `dlclose`, check to see if we're on a buggy version of
glibc
- This is complicated by the fact we run the _same_ binary on glibc and
musl, so we have to make the glibc call _dynamically_, instead of
linking directly against the `gnu_get_libc_version` method.
- We avoid trying to load libc if we detect that we're on Alpine by
checking for `/etc/alpine-release`. This is slightly annoying, but
required in the native loader code.
- If we _do_ manage to call the glibc version, we check against the
blocklist. If the version is on it, we avoid ever calling dlclose.

## Test coverage

We have run repeated tests against the previously crashing tests, and
this has resolved the issue. Nevertheless, our recommendation for
customers should definitely be to upgrade to a stable version of glibc
wherever possible

## Other details

Overview of the bug: 
In certain versions of glibc, there is a TLS-reuse bug that can cause
crashes when unloading shared libraries. The bug was introduced in 2.34,
fixed in 2.36 on x86-64, and fixed in 2.37 on aarch64.
See [the bug
here](https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=3921c5b40f293c57cb326f58713c924b0662ef59)
or [the explanation of the bug on
Fedora](https://bugzilla.redhat.com/show_bug.cgi?id=2251557) (which is
where we saw the crashing issue).

glibc 2.34 shipped with a regression: after a `dlclose()` of a library
that carried dynamic-TLS, the loader could reuse the same "module-ID"
for a different library _without_ first clearing the associated DTV
(Dynamic Thread Vector) entry. The next time any code accessed that TLS
slot it could read or write an unmapped address, which cases a
`SIGSEGV`.

This manifested as a crash in the WAF when we called
`ddwaf_context_info` on arm64. It explicitly happens on arm64 when we
unload the continuous profiler shortly after loading it (which is normal
because it's not supported).

It manifests in this scenario because `ddwaf_context_init` starts like
this:

```assembly
+128  bl   __tls_get_addr     ; ask glibc for the TLS slot for libddwaf
+136  mrs  x11, TPIDR_EL0     ; TLS base for this thread
+140  ldrb w9, [x11, x0]      ; <–– boom if x0 points to a stale DTV entry
```

When we unload the continuous profiler and call `dlcose`, it causes the
glibc loader to hand out a recycled module-ID to `libddwaf`. When
`ddwaf_context_init` tries to access TLS in the `ldrb` instruction, `x11
+ x0` is outside every mapped version, and so it crashes.

Note that although calling `dlclose` with the continuous profiler _may_
trigger the issue (the actual crash is flaky depending on load/unload
timing and address layout), unloading _any_ library that is is built
with `__thread`/`thread_local` data could trigger the crash. To minimize
the risk of hitting this issue, we avoid calling `dlclose` entirely on
the flaky glibc versions

For more details [see
doc](https://docs.google.com/document/d/1aptwmprnd83VTZMKxrrTqmBF6eOayjCL36kj9LujCE8/edit?tab=t.0#heading=h.nytiofltvdb5)

Affected distros: 

| Distribution      | Version / Release             | glibc Version |
| ----------------- | ----------------------------- | ------------- |
| Fedora            | 35                            | 2.34          |
| Fedora            | 37                            | 2.36          |
| Ubuntu            | 21.10 ("Impish Indri")        | 2.34          |
| Ubuntu            | 22.04 LTS ("Jammy Jellyfish") | 2.35          |
| Debian            | 12 ("Bookworm")               | 2.36          |
| RHEL              | 9                             | 2.34          |
| CentOS            | 9                             | 2.34          |
| Amazon Linux 2023 | default                       | 2.34          |
lucaspimentel pushed a commit that referenced this pull request Jul 1, 2025
## Summary of changes

- Revert "Revert 'Reapply "Revert Load the tracer/profiler after
guardrails checks" (#6959)' (#6986)" #7051
- Avoid calling `dlclose` on glibc 2.34-2.36

## Reason for change

We want to make sure we only load the tracer/profiler libraries _after_
we have run guardrails checks. However, repeated attempts to do this
were causing crashes in our smoke tests on Fedora 35, on arm64 only.
After an (extensive) investigation, we finally tracked this down to a
bug in glibc itself. The full explanation is given below, but the upshot
is that calling `dlclose` on this buggy glibc version can cause crashes
at some later point.

## Implementation details

- Includes the "original" implementation that moves the tracer/profiler
after guardrails checks
- Before calling `dlclose`, check to see if we're on a buggy version of
glibc
- This is complicated by the fact we run the _same_ binary on glibc and
musl, so we have to make the glibc call _dynamically_, instead of
linking directly against the `gnu_get_libc_version` method.
- We avoid trying to load libc if we detect that we're on Alpine by
checking for `/etc/alpine-release`. This is slightly annoying, but
required in the native loader code.
- If we _do_ manage to call the glibc version, we check against the
blocklist. If the version is on it, we avoid ever calling dlclose.

## Test coverage

We have run repeated tests against the previously crashing tests, and
this has resolved the issue. Nevertheless, our recommendation for
customers should definitely be to upgrade to a stable version of glibc
wherever possible

## Other details

Overview of the bug: 
In certain versions of glibc, there is a TLS-reuse bug that can cause
crashes when unloading shared libraries. The bug was introduced in 2.34,
fixed in 2.36 on x86-64, and fixed in 2.37 on aarch64.
See [the bug
here](https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=3921c5b40f293c57cb326f58713c924b0662ef59)
or [the explanation of the bug on
Fedora](https://bugzilla.redhat.com/show_bug.cgi?id=2251557) (which is
where we saw the crashing issue).

glibc 2.34 shipped with a regression: after a `dlclose()` of a library
that carried dynamic-TLS, the loader could reuse the same "module-ID"
for a different library _without_ first clearing the associated DTV
(Dynamic Thread Vector) entry. The next time any code accessed that TLS
slot it could read or write an unmapped address, which cases a
`SIGSEGV`.

This manifested as a crash in the WAF when we called
`ddwaf_context_info` on arm64. It explicitly happens on arm64 when we
unload the continuous profiler shortly after loading it (which is normal
because it's not supported).

It manifests in this scenario because `ddwaf_context_init` starts like
this:

```assembly
+128  bl   __tls_get_addr     ; ask glibc for the TLS slot for libddwaf
+136  mrs  x11, TPIDR_EL0     ; TLS base for this thread
+140  ldrb w9, [x11, x0]      ; <–– boom if x0 points to a stale DTV entry
```

When we unload the continuous profiler and call `dlcose`, it causes the
glibc loader to hand out a recycled module-ID to `libddwaf`. When
`ddwaf_context_init` tries to access TLS in the `ldrb` instruction, `x11
+ x0` is outside every mapped version, and so it crashes.

Note that although calling `dlclose` with the continuous profiler _may_
trigger the issue (the actual crash is flaky depending on load/unload
timing and address layout), unloading _any_ library that is is built
with `__thread`/`thread_local` data could trigger the crash. To minimize
the risk of hitting this issue, we avoid calling `dlclose` entirely on
the flaky glibc versions

For more details [see
doc](https://docs.google.com/document/d/1aptwmprnd83VTZMKxrrTqmBF6eOayjCL36kj9LujCE8/edit?tab=t.0#heading=h.nytiofltvdb5)

Affected distros: 

| Distribution      | Version / Release             | glibc Version |
| ----------------- | ----------------------------- | ------------- |
| Fedora            | 35                            | 2.34          |
| Fedora            | 37                            | 2.36          |
| Ubuntu            | 21.10 ("Impish Indri")        | 2.34          |
| Ubuntu            | 22.04 LTS ("Jammy Jellyfish") | 2.35          |
| Debian            | 12 ("Bookworm")               | 2.36          |
| RHEL              | 9                             | 2.34          |
| CentOS            | 9                             | 2.34          |
| Amazon Linux 2023 | default                       | 2.34          |
chojomok pushed a commit that referenced this pull request Jul 15, 2025
…cks" (#6959)' (#6986)

## Summary of changes

This reverts commit c66626c.

## Reason for change

We're seeing crashes in arm64 on fedora 35 only, when we initialize the
WAF. Reverting this for now while we investigate to unblock `master`

## Implementation details

Revert the reapplied revert

## Test coverage

Covered by existing

## Other details
chojomok pushed a commit that referenced this pull request Jul 15, 2025
## Summary of changes

- Revert "Revert 'Reapply "Revert Load the tracer/profiler after
guardrails checks" (#6959)' (#6986)" #7051
- Avoid calling `dlclose` on glibc 2.34-2.36

## Reason for change

We want to make sure we only load the tracer/profiler libraries _after_
we have run guardrails checks. However, repeated attempts to do this
were causing crashes in our smoke tests on Fedora 35, on arm64 only.
After an (extensive) investigation, we finally tracked this down to a
bug in glibc itself. The full explanation is given below, but the upshot
is that calling `dlclose` on this buggy glibc version can cause crashes
at some later point.

## Implementation details

- Includes the "original" implementation that moves the tracer/profiler
after guardrails checks
- Before calling `dlclose`, check to see if we're on a buggy version of
glibc
- This is complicated by the fact we run the _same_ binary on glibc and
musl, so we have to make the glibc call _dynamically_, instead of
linking directly against the `gnu_get_libc_version` method.
- We avoid trying to load libc if we detect that we're on Alpine by
checking for `/etc/alpine-release`. This is slightly annoying, but
required in the native loader code.
- If we _do_ manage to call the glibc version, we check against the
blocklist. If the version is on it, we avoid ever calling dlclose.

## Test coverage

We have run repeated tests against the previously crashing tests, and
this has resolved the issue. Nevertheless, our recommendation for
customers should definitely be to upgrade to a stable version of glibc
wherever possible

## Other details

Overview of the bug: 
In certain versions of glibc, there is a TLS-reuse bug that can cause
crashes when unloading shared libraries. The bug was introduced in 2.34,
fixed in 2.36 on x86-64, and fixed in 2.37 on aarch64.
See [the bug
here](https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=3921c5b40f293c57cb326f58713c924b0662ef59)
or [the explanation of the bug on
Fedora](https://bugzilla.redhat.com/show_bug.cgi?id=2251557) (which is
where we saw the crashing issue).

glibc 2.34 shipped with a regression: after a `dlclose()` of a library
that carried dynamic-TLS, the loader could reuse the same "module-ID"
for a different library _without_ first clearing the associated DTV
(Dynamic Thread Vector) entry. The next time any code accessed that TLS
slot it could read or write an unmapped address, which cases a
`SIGSEGV`.

This manifested as a crash in the WAF when we called
`ddwaf_context_info` on arm64. It explicitly happens on arm64 when we
unload the continuous profiler shortly after loading it (which is normal
because it's not supported).

It manifests in this scenario because `ddwaf_context_init` starts like
this:

```assembly
+128  bl   __tls_get_addr     ; ask glibc for the TLS slot for libddwaf
+136  mrs  x11, TPIDR_EL0     ; TLS base for this thread
+140  ldrb w9, [x11, x0]      ; <–– boom if x0 points to a stale DTV entry
```

When we unload the continuous profiler and call `dlcose`, it causes the
glibc loader to hand out a recycled module-ID to `libddwaf`. When
`ddwaf_context_init` tries to access TLS in the `ldrb` instruction, `x11
+ x0` is outside every mapped version, and so it crashes.

Note that although calling `dlclose` with the continuous profiler _may_
trigger the issue (the actual crash is flaky depending on load/unload
timing and address layout), unloading _any_ library that is is built
with `__thread`/`thread_local` data could trigger the crash. To minimize
the risk of hitting this issue, we avoid calling `dlclose` entirely on
the flaky glibc versions

For more details [see
doc](https://docs.google.com/document/d/1aptwmprnd83VTZMKxrrTqmBF6eOayjCL36kj9LujCE8/edit?tab=t.0#heading=h.nytiofltvdb5)

Affected distros: 

| Distribution      | Version / Release             | glibc Version |
| ----------------- | ----------------------------- | ------------- |
| Fedora            | 35                            | 2.34          |
| Fedora            | 37                            | 2.36          |
| Ubuntu            | 21.10 ("Impish Indri")        | 2.34          |
| Ubuntu            | 22.04 LTS ("Jammy Jellyfish") | 2.35          |
| Debian            | 12 ("Bookworm")               | 2.36          |
| RHEL              | 9                             | 2.34          |
| CentOS            | 9                             | 2.34          |
| Amazon Linux 2023 | default                       | 2.34          |
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area:native-library Automatic instrumentation native C++ code (Datadog.Trace.ClrProfiler.Native) area:shared-components

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants