Revert 'Reapply "Revert Load the tracer/profiler after guardrails checks" (#6959)'#6986
Conversation
) This reverts commit c66626c.
Execution-Time Benchmarks Report ⏱️Execution-time results for samples comparing the following branches/commits: Execution-time benchmarks measure the whole time it takes to execute a program. And are intended to measure the one-off costs. Cases where the execution time results for the PR are worse than latest master results are shown in red. The following thresholds were used for comparing the execution times:
Note that these results are based on a single point-in-time result for each branch. For full results, see the dashboard. Graphs show the p99 interval based on the mean and StdDev of the test run, as well as the mean value of the run (shown as a diamond below the graph). gantt
title Execution time (ms) FakeDbCommand (.NET Framework 4.8)
dateFormat X
axisFormat %s
todayMarker off
section Baseline
This PR (6986) - mean (73ms) : 68, 78
. : milestone, 73,
master - mean (68ms) : 65, 70
. : milestone, 68,
section CallTarget+Inlining+NGEN
This PR (6986) - mean (1,039ms) : 994, 1085
. : milestone, 1039,
master - mean (1,007ms) : 982, 1031
. : milestone, 1007,
gantt
title Execution time (ms) FakeDbCommand (.NET Core 3.1)
dateFormat X
axisFormat %s
todayMarker off
section Baseline
This PR (6986) - mean (107ms) : 105, 110
. : milestone, 107,
master - mean (102ms) : 99, 104
. : milestone, 102,
section CallTarget+Inlining+NGEN
This PR (6986) - mean (728ms) : 694, 763
. : milestone, 728,
master - mean (691ms) : 673, 710
. : milestone, 691,
gantt
title Execution time (ms) FakeDbCommand (.NET 6)
dateFormat X
axisFormat %s
todayMarker off
section Baseline
This PR (6986) - mean (93ms) : 90, 96
. : milestone, 93,
master - mean (88ms) : 86, 91
. : milestone, 88,
section CallTarget+Inlining+NGEN
This PR (6986) - mean (684ms) : 639, 729
. : milestone, 684,
master - mean (655ms) : 634, 677
. : milestone, 655,
gantt
title Execution time (ms) HttpMessageHandler (.NET Framework 4.8)
dateFormat X
axisFormat %s
todayMarker off
section Baseline
This PR (6986) - mean (189ms) : 181, 197
. : milestone, 189,
master - mean (189ms) : 186, 192
. : milestone, 189,
section CallTarget+Inlining+NGEN
This PR (6986) - mean (1,110ms) : 1086, 1134
. : milestone, 1110,
master - mean (1,111ms) : 1077, 1145
. : milestone, 1111,
gantt
title Execution time (ms) HttpMessageHandler (.NET Core 3.1)
dateFormat X
axisFormat %s
todayMarker off
section Baseline
This PR (6986) - mean (268ms) : 263, 272
. : milestone, 268,
master - mean (268ms) : 263, 272
. : milestone, 268,
section CallTarget+Inlining+NGEN
This PR (6986) - mean (875ms) : 845, 905
. : milestone, 875,
master - mean (876ms) : 845, 907
. : milestone, 876,
gantt
title Execution time (ms) HttpMessageHandler (.NET 6)
dateFormat X
axisFormat %s
todayMarker off
section Baseline
This PR (6986) - mean (261ms) : 258, 264
. : milestone, 261,
master - mean (261ms) : 257, 264
. : milestone, 261,
section CallTarget+Inlining+NGEN
This PR (6986) - mean (865ms) : 841, 890
. : milestone, 865,
master - mean (867ms) : 836, 898
. : milestone, 867,
|
Datadog ReportAll test runs ✅ 2 Total Test Services: 0 Failed, 2 Passed Test Services
⌛ Performance Regressions vs Default Branch (6)
|
## Summary of changes - Revert "Revert 'Reapply "Revert Load the tracer/profiler after guardrails checks" (#6959)' (#6986)" #7051 - Avoid calling `dlclose` on glibc 2.34-2.36 ## Reason for change We want to make sure we only load the tracer/profiler libraries _after_ we have run guardrails checks. However, repeated attempts to do this were causing crashes in our smoke tests on Fedora 35, on arm64 only. After an (extensive) investigation, we finally tracked this down to a bug in glibc itself. The full explanation is given below, but the upshot is that calling `dlclose` on this buggy glibc version can cause crashes at some later point. ## Implementation details - Includes the "original" implementation that moves the tracer/profiler after guardrails checks - Before calling `dlclose`, check to see if we're on a buggy version of glibc - This is complicated by the fact we run the _same_ binary on glibc and musl, so we have to make the glibc call _dynamically_, instead of linking directly against the `gnu_get_libc_version` method. - We avoid trying to load libc if we detect that we're on Alpine by checking for `/etc/alpine-release`. This is slightly annoying, but required in the native loader code. - If we _do_ manage to call the glibc version, we check against the blocklist. If the version is on it, we avoid ever calling dlclose. ## Test coverage We have run repeated tests against the previously crashing tests, and this has resolved the issue. Nevertheless, our recommendation for customers should definitely be to upgrade to a stable version of glibc wherever possible ## Other details Overview of the bug: In certain versions of glibc, there is a TLS-reuse bug that can cause crashes when unloading shared libraries. The bug was introduced in 2.34, fixed in 2.36 on x86-64, and fixed in 2.37 on aarch64. See [the bug here](https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=3921c5b40f293c57cb326f58713c924b0662ef59) or [the explanation of the bug on Fedora](https://bugzilla.redhat.com/show_bug.cgi?id=2251557) (which is where we saw the crashing issue). glibc 2.34 shipped with a regression: after a `dlclose()` of a library that carried dynamic-TLS, the loader could reuse the same "module-ID" for a different library _without_ first clearing the associated DTV (Dynamic Thread Vector) entry. The next time any code accessed that TLS slot it could read or write an unmapped address, which cases a `SIGSEGV`. This manifested as a crash in the WAF when we called `ddwaf_context_info` on arm64. It explicitly happens on arm64 when we unload the continuous profiler shortly after loading it (which is normal because it's not supported). It manifests in this scenario because `ddwaf_context_init` starts like this: ```assembly +128 bl __tls_get_addr ; ask glibc for the TLS slot for libddwaf +136 mrs x11, TPIDR_EL0 ; TLS base for this thread +140 ldrb w9, [x11, x0] ; <–– boom if x0 points to a stale DTV entry ``` When we unload the continuous profiler and call `dlcose`, it causes the glibc loader to hand out a recycled module-ID to `libddwaf`. When `ddwaf_context_init` tries to access TLS in the `ldrb` instruction, `x11 + x0` is outside every mapped version, and so it crashes. Note that although calling `dlclose` with the continuous profiler _may_ trigger the issue (the actual crash is flaky depending on load/unload timing and address layout), unloading _any_ library that is is built with `__thread`/`thread_local` data could trigger the crash. To minimize the risk of hitting this issue, we avoid calling `dlclose` entirely on the flaky glibc versions For more details [see doc](https://docs.google.com/document/d/1aptwmprnd83VTZMKxrrTqmBF6eOayjCL36kj9LujCE8/edit?tab=t.0#heading=h.nytiofltvdb5) Affected distros: | Distribution | Version / Release | glibc Version | | ----------------- | ----------------------------- | ------------- | | Fedora | 35 | 2.34 | | Fedora | 37 | 2.36 | | Ubuntu | 21.10 ("Impish Indri") | 2.34 | | Ubuntu | 22.04 LTS ("Jammy Jellyfish") | 2.35 | | Debian | 12 ("Bookworm") | 2.36 | | RHEL | 9 | 2.34 | | CentOS | 9 | 2.34 | | Amazon Linux 2023 | default | 2.34 |
## Summary of changes - Revert "Revert 'Reapply "Revert Load the tracer/profiler after guardrails checks" (#6959)' (#6986)" #7051 - Avoid calling `dlclose` on glibc 2.34-2.36 ## Reason for change We want to make sure we only load the tracer/profiler libraries _after_ we have run guardrails checks. However, repeated attempts to do this were causing crashes in our smoke tests on Fedora 35, on arm64 only. After an (extensive) investigation, we finally tracked this down to a bug in glibc itself. The full explanation is given below, but the upshot is that calling `dlclose` on this buggy glibc version can cause crashes at some later point. ## Implementation details - Includes the "original" implementation that moves the tracer/profiler after guardrails checks - Before calling `dlclose`, check to see if we're on a buggy version of glibc - This is complicated by the fact we run the _same_ binary on glibc and musl, so we have to make the glibc call _dynamically_, instead of linking directly against the `gnu_get_libc_version` method. - We avoid trying to load libc if we detect that we're on Alpine by checking for `/etc/alpine-release`. This is slightly annoying, but required in the native loader code. - If we _do_ manage to call the glibc version, we check against the blocklist. If the version is on it, we avoid ever calling dlclose. ## Test coverage We have run repeated tests against the previously crashing tests, and this has resolved the issue. Nevertheless, our recommendation for customers should definitely be to upgrade to a stable version of glibc wherever possible ## Other details Overview of the bug: In certain versions of glibc, there is a TLS-reuse bug that can cause crashes when unloading shared libraries. The bug was introduced in 2.34, fixed in 2.36 on x86-64, and fixed in 2.37 on aarch64. See [the bug here](https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=3921c5b40f293c57cb326f58713c924b0662ef59) or [the explanation of the bug on Fedora](https://bugzilla.redhat.com/show_bug.cgi?id=2251557) (which is where we saw the crashing issue). glibc 2.34 shipped with a regression: after a `dlclose()` of a library that carried dynamic-TLS, the loader could reuse the same "module-ID" for a different library _without_ first clearing the associated DTV (Dynamic Thread Vector) entry. The next time any code accessed that TLS slot it could read or write an unmapped address, which cases a `SIGSEGV`. This manifested as a crash in the WAF when we called `ddwaf_context_info` on arm64. It explicitly happens on arm64 when we unload the continuous profiler shortly after loading it (which is normal because it's not supported). It manifests in this scenario because `ddwaf_context_init` starts like this: ```assembly +128 bl __tls_get_addr ; ask glibc for the TLS slot for libddwaf +136 mrs x11, TPIDR_EL0 ; TLS base for this thread +140 ldrb w9, [x11, x0] ; <–– boom if x0 points to a stale DTV entry ``` When we unload the continuous profiler and call `dlcose`, it causes the glibc loader to hand out a recycled module-ID to `libddwaf`. When `ddwaf_context_init` tries to access TLS in the `ldrb` instruction, `x11 + x0` is outside every mapped version, and so it crashes. Note that although calling `dlclose` with the continuous profiler _may_ trigger the issue (the actual crash is flaky depending on load/unload timing and address layout), unloading _any_ library that is is built with `__thread`/`thread_local` data could trigger the crash. To minimize the risk of hitting this issue, we avoid calling `dlclose` entirely on the flaky glibc versions For more details [see doc](https://docs.google.com/document/d/1aptwmprnd83VTZMKxrrTqmBF6eOayjCL36kj9LujCE8/edit?tab=t.0#heading=h.nytiofltvdb5) Affected distros: | Distribution | Version / Release | glibc Version | | ----------------- | ----------------------------- | ------------- | | Fedora | 35 | 2.34 | | Fedora | 37 | 2.36 | | Ubuntu | 21.10 ("Impish Indri") | 2.34 | | Ubuntu | 22.04 LTS ("Jammy Jellyfish") | 2.35 | | Debian | 12 ("Bookworm") | 2.36 | | RHEL | 9 | 2.34 | | CentOS | 9 | 2.34 | | Amazon Linux 2023 | default | 2.34 |
…cks" (#6959)' (#6986) ## Summary of changes This reverts commit c66626c. ## Reason for change We're seeing crashes in arm64 on fedora 35 only, when we initialize the WAF. Reverting this for now while we investigate to unblock `master` ## Implementation details Revert the reapplied revert ## Test coverage Covered by existing ## Other details
## Summary of changes - Revert "Revert 'Reapply "Revert Load the tracer/profiler after guardrails checks" (#6959)' (#6986)" #7051 - Avoid calling `dlclose` on glibc 2.34-2.36 ## Reason for change We want to make sure we only load the tracer/profiler libraries _after_ we have run guardrails checks. However, repeated attempts to do this were causing crashes in our smoke tests on Fedora 35, on arm64 only. After an (extensive) investigation, we finally tracked this down to a bug in glibc itself. The full explanation is given below, but the upshot is that calling `dlclose` on this buggy glibc version can cause crashes at some later point. ## Implementation details - Includes the "original" implementation that moves the tracer/profiler after guardrails checks - Before calling `dlclose`, check to see if we're on a buggy version of glibc - This is complicated by the fact we run the _same_ binary on glibc and musl, so we have to make the glibc call _dynamically_, instead of linking directly against the `gnu_get_libc_version` method. - We avoid trying to load libc if we detect that we're on Alpine by checking for `/etc/alpine-release`. This is slightly annoying, but required in the native loader code. - If we _do_ manage to call the glibc version, we check against the blocklist. If the version is on it, we avoid ever calling dlclose. ## Test coverage We have run repeated tests against the previously crashing tests, and this has resolved the issue. Nevertheless, our recommendation for customers should definitely be to upgrade to a stable version of glibc wherever possible ## Other details Overview of the bug: In certain versions of glibc, there is a TLS-reuse bug that can cause crashes when unloading shared libraries. The bug was introduced in 2.34, fixed in 2.36 on x86-64, and fixed in 2.37 on aarch64. See [the bug here](https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=3921c5b40f293c57cb326f58713c924b0662ef59) or [the explanation of the bug on Fedora](https://bugzilla.redhat.com/show_bug.cgi?id=2251557) (which is where we saw the crashing issue). glibc 2.34 shipped with a regression: after a `dlclose()` of a library that carried dynamic-TLS, the loader could reuse the same "module-ID" for a different library _without_ first clearing the associated DTV (Dynamic Thread Vector) entry. The next time any code accessed that TLS slot it could read or write an unmapped address, which cases a `SIGSEGV`. This manifested as a crash in the WAF when we called `ddwaf_context_info` on arm64. It explicitly happens on arm64 when we unload the continuous profiler shortly after loading it (which is normal because it's not supported). It manifests in this scenario because `ddwaf_context_init` starts like this: ```assembly +128 bl __tls_get_addr ; ask glibc for the TLS slot for libddwaf +136 mrs x11, TPIDR_EL0 ; TLS base for this thread +140 ldrb w9, [x11, x0] ; <–– boom if x0 points to a stale DTV entry ``` When we unload the continuous profiler and call `dlcose`, it causes the glibc loader to hand out a recycled module-ID to `libddwaf`. When `ddwaf_context_init` tries to access TLS in the `ldrb` instruction, `x11 + x0` is outside every mapped version, and so it crashes. Note that although calling `dlclose` with the continuous profiler _may_ trigger the issue (the actual crash is flaky depending on load/unload timing and address layout), unloading _any_ library that is is built with `__thread`/`thread_local` data could trigger the crash. To minimize the risk of hitting this issue, we avoid calling `dlclose` entirely on the flaky glibc versions For more details [see doc](https://docs.google.com/document/d/1aptwmprnd83VTZMKxrrTqmBF6eOayjCL36kj9LujCE8/edit?tab=t.0#heading=h.nytiofltvdb5) Affected distros: | Distribution | Version / Release | glibc Version | | ----------------- | ----------------------------- | ------------- | | Fedora | 35 | 2.34 | | Fedora | 37 | 2.36 | | Ubuntu | 21.10 ("Impish Indri") | 2.34 | | Ubuntu | 22.04 LTS ("Jammy Jellyfish") | 2.35 | | Debian | 12 ("Bookworm") | 2.36 | | RHEL | 9 | 2.34 | | CentOS | 9 | 2.34 | | Amazon Linux 2023 | default | 2.34 |
Summary of changes
This reverts commit c66626c.
Reason for change
We're seeing crashes in arm64 on fedora 35 only, when we initialize the WAF. Reverting this for now while we investigate to unblock
masterImplementation details
Revert the reapplied revert
Test coverage
Covered by existing
Other details