Skip to content

fix(profiling): clean up containers post-fork#17042

Merged
gh-worker-dd-mergequeue-cf854d[bot] merged 1 commit into
mainfrom
dd/fix/profiling-fork-crash-lru-cache
Apr 24, 2026
Merged

fix(profiling): clean up containers post-fork#17042
gh-worker-dd-mergequeue-cf854d[bot] merged 1 commit into
mainfrom
dd/fix/profiling-fork-crash-lru-cache

Conversation

@KowalskiThomas

@KowalskiThomas KowalskiThomas commented Mar 20, 2026

Copy link
Copy Markdown
Contributor

Description

This PR fixes a crash in the Profiler that occurred when fork was called while the Sampling Thread was actively modifying the LRUCache for Frames (about 100 crashes per week).

Error UnixSignal: Process terminated with SI_TKILL (SIGABRT)
#0 0x0000ffffb293cb3c raise
#1 0x0000ffffb2927e00 abort
#2 0x0000ffffb2996f98 free
#3 0x0000ffffa3f47cc4 std::_List_base<std::pair<unsigned long, std::unique_ptr<Frame, std::default_delete<Frame> > >, std::allocator<std::pair<unsigned long, std::unique_ptr<Frame, std::default_delete<Frame> > > > >::_M_clear
#4 0x0000ffffa3f50ea4 Datadog::Sampler::postfork_child
#5 0x0000ffffa3f51cc0 _stack_atfork_child
#6 0x0000ffffb29c1cd4 __libc_fork
#7 0x0000aaaabbd4b998 os_fork_impl
#8 0x0000aaaabbdaff58 cfunction_vectorcall_NOARGS.llvm.1085662745629047260
#9 0x0000aaaabbe197d4 _PyEval_EvalFrameDefault
#10 0x0000aaaabbd87548 _PyObject_VectorcallTstate.llvm.15733451206353458062
#11 0x0000aaaabbd89df8 object_vacall.llvm.15733451206353458062
#12 0x0000aaaabbe7f378 PyObject_CallFunctionObjArgs
#13 0x0000ffffb1a4453c WraptFunctionWrapperBase_call (/project/src/wrapt/_wrappers.c:2455:14)
#14 0x0000aaaabbe15fe0 _PyEval_EvalFrameDefault
#15 0x0000aaaabbdcaa78 slot_tp_init
#16 0x0000aaaabbdc32b0 type_call
#17 0x0000aaaabbe15fe0 _PyEval_EvalFrameDefault
#18 0x0000aaaabbd8a464 method_vectorcall.llvm.15917642511948773313
#19 0x0000aaaabbe197d4 _PyEval_EvalFrameDefault
#20 0x0000aaaabbd8e354 gen_iternext
#21 0x0000aaaabbde9a58 builtin_next
#22 0x0000aaaabbe187d0 _PyEval_EvalFrameDefault
#23 0x0000aaaabbd8a4c4 method_vectorcall.llvm.15917642511948773313
#24 0x0000aaaabbe1a02c _PyEval_EvalFrameDefault
#25 0x0000aaaabbd8a4c4 method_vectorcall.llvm.15917642511948773313
#26 0x0000aaaabbe1a02c _PyEval_EvalFrameDefault
#27 0x0000aaaabbd89480 _PyObject_Call_Prepend
#28 0x0000aaaabbebba50 slot_tp_call
#29 0x0000aaaabbe15fe0 _PyEval_EvalFrameDefault
#30 0x0000aaaabbf15ea4 PyEval_EvalCode
#31 0x0000aaaabbf48b8c run_mod.llvm.17615948296688877498
#32 0x0000aaaabbcb34b4 pyrun_file
#33 0x0000aaaabbcb288c _PyRun_SimpleFileObject
#34 0x0000aaaabbcb2488 _PyRun_AnyFileObject
#35 0x0000aaaabbcbfc1c pymain_run_file_obj
#36 0x0000aaaabbcbf994 pymain_run_file
#37 0x0000aaaabbf60d2c Py_RunMain
#38 0x0000aaaabbf611f0 pymain_main.llvm.144270552765144348
#39 0x0000aaaabbe589d0 main
#40 0x0000ffffb2928598 __libc_start_main

The Sampling Thread is not stopped before forking. This means when fork is called, the Sampling Thread could be mutating frame_cache_ (backed by std::list + std::unordered_map) via lookup (splice) and store (emplace_front/pop_back). As a result, those data structures could end up in a corrupt state at the moment the child process is created. When the child process resumes work, calling frame_cache_.clear() post-fork, the crash can happen.

The most natural thing to do to avoid this would be to ask the Sampling Thread to stop at fork time, then making sure it is restarted both in the parent and the child post-fork (if it was running). However, this approach has a significant caveat: depending on the current state of adaptive sampling, doing this may result in blocking the fork for up to 100ms (worst case scenario) which is not an acceptable side effect for the user.
By not stopping the Thread but cleaning up in the child process, we keep state consistency without having to wait, at the cost of a small one-time leak (the data in the containers that we placement-new).

@datadog-prod-us1-5

This comment was marked as outdated.

@cit-pr-commenter-54b7da

cit-pr-commenter-54b7da Bot commented Mar 20, 2026

Copy link
Copy Markdown

Codeowners resolved as

ddtrace/internal/datadog/profiling/stack/echion/echion/cache.h          @DataDog/profiling-python
ddtrace/internal/datadog/profiling/stack/echion/echion/echion_sampler.h  @DataDog/profiling-python
ddtrace/internal/datadog/profiling/stack/include/sampler.hpp            @DataDog/profiling-python
ddtrace/internal/datadog/profiling/stack/src/sampler.cpp                @DataDog/profiling-python
releasenotes/notes/fix-profiling-fork-crash-lru-cache-b80e6574fc304037.yaml  @DataDog/apm-python

@KowalskiThomas KowalskiThomas added the Profiling Continous Profling label Mar 20, 2026
@KowalskiThomas KowalskiThomas force-pushed the dd/fix/profiling-fork-crash-lru-cache branch from c76a8d2 to 6eef3b3 Compare April 23, 2026 16:09
@KowalskiThomas KowalskiThomas added the identified-by:crashtracking Identified by Crash Tracking label Apr 23, 2026
@KowalskiThomas KowalskiThomas force-pushed the dd/fix/profiling-fork-crash-lru-cache branch 2 times, most recently from 15a0afe to c0027c0 Compare April 23, 2026 19:10
@KowalskiThomas KowalskiThomas changed the title fix(profiling): stop sampler thread before fork fix(profiling): clean up containers post-fork Apr 24, 2026
@KowalskiThomas KowalskiThomas force-pushed the dd/fix/profiling-fork-crash-lru-cache branch from c0027c0 to a494dba Compare April 24, 2026 08:47
@DataDog DataDog deleted a comment from datadog-official Bot Apr 24, 2026
@KowalskiThomas KowalskiThomas marked this pull request as ready for review April 24, 2026 09:28
@KowalskiThomas KowalskiThomas requested review from a team as code owners April 24, 2026 09:28

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: a494dbab41

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread ddtrace/internal/datadog/profiling/stack/src/sampler.cpp Outdated
Co-authored-by: KowalskiThomas <14239160+KowalskiThomas@users.noreply.github.com>
@KowalskiThomas KowalskiThomas force-pushed the dd/fix/profiling-fork-crash-lru-cache branch from a494dba to bcdd325 Compare April 24, 2026 09:37
@KowalskiThomas

Copy link
Copy Markdown
Contributor Author

@codex review again

@chatgpt-codex-connector

Copy link
Copy Markdown

Codex Review: Didn't find any major issues. Keep them coming!

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

@github-actions

Copy link
Copy Markdown
Contributor

This change is marked for backport to 4.6, but it conflicts with that branch.
Attempting to cherrypick this change to 4.6 yielded the following error:

Auto-merging ddtrace/internal/datadog/profiling/stack/echion/echion/echion_sampler.h
Auto-merging ddtrace/internal/datadog/profiling/stack/include/sampler.hpp
Auto-merging ddtrace/internal/datadog/profiling/stack/src/sampler.cpp
CONFLICT (content): Merge conflict in ddtrace/internal/datadog/profiling/stack/src/sampler.cpp
error: could not apply fe3af867b3... Merge bcdd3257d9267821f198690c0977d14003504fd6 into 791f85342ea8d32ffb561e08bc76bd819810f44b

The command used to test backporting was

git checkout 4.6 && git cherry-pick -x --mainline 1 fe3af867b3713726a726afdd0d6da10990b5ed20

@github-actions

Copy link
Copy Markdown
Contributor

This change is marked for backport to 4.5, but it conflicts with that branch.
Attempting to cherrypick this change to 4.5 yielded the following error:

Auto-merging ddtrace/internal/datadog/profiling/stack/echion/echion/echion_sampler.h
Auto-merging ddtrace/internal/datadog/profiling/stack/include/sampler.hpp
CONFLICT (content): Merge conflict in ddtrace/internal/datadog/profiling/stack/include/sampler.hpp
Auto-merging ddtrace/internal/datadog/profiling/stack/src/sampler.cpp
CONFLICT (content): Merge conflict in ddtrace/internal/datadog/profiling/stack/src/sampler.cpp
error: could not apply fe3af867b3... Merge bcdd3257d9267821f198690c0977d14003504fd6 into 791f85342ea8d32ffb561e08bc76bd819810f44b

The command used to test backporting was

git checkout 4.5 && git cherry-pick -x --mainline 1 fe3af867b3713726a726afdd0d6da10990b5ed20

@github-actions

Copy link
Copy Markdown
Contributor

This change is marked for backport to 4.7, but it conflicts with that branch.
Attempting to cherrypick this change to 4.7 yielded the following error:

Auto-merging ddtrace/internal/datadog/profiling/stack/echion/echion/echion_sampler.h
Auto-merging ddtrace/internal/datadog/profiling/stack/include/sampler.hpp
Auto-merging ddtrace/internal/datadog/profiling/stack/src/sampler.cpp
CONFLICT (content): Merge conflict in ddtrace/internal/datadog/profiling/stack/src/sampler.cpp
error: could not apply fe3af867b3... Merge bcdd3257d9267821f198690c0977d14003504fd6 into 791f85342ea8d32ffb561e08bc76bd819810f44b

The command used to test backporting was

git checkout 4.7 && git cherry-pick -x --mainline 1 fe3af867b3713726a726afdd0d6da10990b5ed20

@gh-worker-dd-mergequeue-cf854d gh-worker-dd-mergequeue-cf854d Bot merged commit 39a5f23 into main Apr 24, 2026
439 checks passed
@gh-worker-dd-mergequeue-cf854d gh-worker-dd-mergequeue-cf854d Bot deleted the dd/fix/profiling-fork-crash-lru-cache branch April 24, 2026 15:03
juanjux pushed a commit that referenced this pull request Apr 24, 2026
Backport of PR #17042 (commit 39a5f23)
to branch 4.7.

## Description

Fixes a crash in the Profiler (~100/week) when `fork` is called while
the Sampling Thread is actively modifying the `LRUCache` for Frames.

The root cause: `postfork_child` called `frame_cache_.clear()`, which
traverses the `std::list` to free nodes. If the sampling thread was
mid-operation (splice in `lookup`, `emplace_front`/`pop_back` in
`store`) at fork time, the list's internal pointers can be in a
corrupted state in the child, causing the crash.

Fix: replace `clear()` with `postfork_child()` which uses placement new
to construct fresh empty containers, abandoning the old data as an
intentional one-time leak.

Also adds `prefork()`/`postfork_parent()` atfork handlers registered via
`pthread_atfork` so that `restart_after_fork` uses a pre-saved
`was_running_at_fork_` flag instead of relying on `thread_seq_num`
parity (which `prefork` itself changes).

## Notes on conflict resolution

No semantic conflicts. The 4.7 branch uses the same `stack_`
function-name prefix as `main`, so the changes apply verbatim.
emmettbutler pushed a commit that referenced this pull request May 6, 2026
## Description

This PR fixes a crash in the Profiler that occurred when `fork` was called while the Sampling Thread was actively modifying the `LRUCache` for Frames (about 100 crashes per week).

```
Error UnixSignal: Process terminated with SI_TKILL (SIGABRT)
#0 0x0000ffffb293cb3c raise
#1 0x0000ffffb2927e00 abort
#2 0x0000ffffb2996f98 free
#3 0x0000ffffa3f47cc4 std::_List_base<std::pair<unsigned long, std::unique_ptr<Frame, std::default_delete<Frame> > >, std::allocator<std::pair<unsigned long, std::unique_ptr<Frame, std::default_delete<Frame> > > > >::_M_clear
#4 0x0000ffffa3f50ea4 Datadog::Sampler::postfork_child
#5 0x0000ffffa3f51cc0 _stack_atfork_child
#6 0x0000ffffb29c1cd4 __libc_fork
#7 0x0000aaaabbd4b998 os_fork_impl
#8 0x0000aaaabbdaff58 cfunction_vectorcall_NOARGS.llvm.1085662745629047260
#9 0x0000aaaabbe197d4 _PyEval_EvalFrameDefault
#10 0x0000aaaabbd87548 _PyObject_VectorcallTstate.llvm.15733451206353458062
#11 0x0000aaaabbd89df8 object_vacall.llvm.15733451206353458062
#12 0x0000aaaabbe7f378 PyObject_CallFunctionObjArgs
#13 0x0000ffffb1a4453c WraptFunctionWrapperBase_call (/project/src/wrapt/_wrappers.c:2455:14)
#14 0x0000aaaabbe15fe0 _PyEval_EvalFrameDefault
#15 0x0000aaaabbdcaa78 slot_tp_init
#16 0x0000aaaabbdc32b0 type_call
#17 0x0000aaaabbe15fe0 _PyEval_EvalFrameDefault
#18 0x0000aaaabbd8a464 method_vectorcall.llvm.15917642511948773313
#19 0x0000aaaabbe197d4 _PyEval_EvalFrameDefault
#20 0x0000aaaabbd8e354 gen_iternext
#21 0x0000aaaabbde9a58 builtin_next
#22 0x0000aaaabbe187d0 _PyEval_EvalFrameDefault
#23 0x0000aaaabbd8a4c4 method_vectorcall.llvm.15917642511948773313
#24 0x0000aaaabbe1a02c _PyEval_EvalFrameDefault
#25 0x0000aaaabbd8a4c4 method_vectorcall.llvm.15917642511948773313
#26 0x0000aaaabbe1a02c _PyEval_EvalFrameDefault
#27 0x0000aaaabbd89480 _PyObject_Call_Prepend
#28 0x0000aaaabbebba50 slot_tp_call
#29 0x0000aaaabbe15fe0 _PyEval_EvalFrameDefault
#30 0x0000aaaabbf15ea4 PyEval_EvalCode
#31 0x0000aaaabbf48b8c run_mod.llvm.17615948296688877498
#32 0x0000aaaabbcb34b4 pyrun_file
#33 0x0000aaaabbcb288c _PyRun_SimpleFileObject
#34 0x0000aaaabbcb2488 _PyRun_AnyFileObject
#35 0x0000aaaabbcbfc1c pymain_run_file_obj
#36 0x0000aaaabbcbf994 pymain_run_file
#37 0x0000aaaabbf60d2c Py_RunMain
#38 0x0000aaaabbf611f0 pymain_main.llvm.144270552765144348
#39 0x0000aaaabbe589d0 main
#40 0x0000ffffb2928598 __libc_start_main
```

The Sampling Thread is not stopped before forking. This means when `fork` is called, the Sampling Thread could be mutating `frame_cache_` (backed by `std::list` + `std::unordered_map`) via `lookup` (splice) and `store` (`emplace_front`/`pop_back`). As a result, those data structures could end up in a corrupt state at the moment the child process is created. When the child process resumes work, calling `frame_cache_.clear()` post-fork, the crash can happen. 

The most natural thing to do to avoid this would be to ask the Sampling Thread to stop at fork time, then making sure it is restarted both in the parent and the child post-fork (if it was running). However, this approach has a significant caveat: depending on the current state of adaptive sampling, doing this may result in blocking the fork for up to 100ms (worst case scenario) which is not an acceptable side effect for the user.  
By not stopping the Thread but cleaning up in the child process, we keep state consistency without having to wait, at the cost of a small one-time leak (the data in the containers that we placement-new).

Co-authored-by: thomas.kowalski <thomas.kowalski@datadoghq.com>
vlad-scherbich added a commit that referenced this pull request Jun 8, 2026
…s to CI

**Deterministic unit test** (`test_ddup_atfork_handler_registered_before_stack_sampler`):
- Mocks `ddup.start` and `StackCollector._init` to record their call order
  during profiler startup, then asserts ddup was first.
- POSIX guarantees FIFO ordering for post-fork child handlers, so whichever
  component calls pthread_atfork first gets its child handler run first.
  ddup must run first to rebuild profile state before the stack sampler
  clears and re-registers threads.
- Catches the regression deterministically without needing timing luck or
  fork() calls — a swap of the init order in profiler.py fails immediately.
- Regression guard for PRs #17183/#17042/#18063.

**CI: stress-ng CPU saturation + 20 iterations**:
- Runs `stress-ng --cpu $(nproc)` in the background during the fork loop to
  saturate all CPUs and widen the pthread_atfork timing race window — the
  same load profile that makes the bug manifest in production Gunicorn/gthread.
- Increased iterations from 10 to 20 to give more chances to surface the race.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants