fix(profiling): race condition in `StackChunk` by KowalskiThomas · Pull Request #16519 · DataDog/dd-trace-py

KowalskiThomas · 2026-02-16T16:09:02Z

Description

https://datadoghq.atlassian.net/browse/PROF-13774

This fixes a rare but real race condition (found in Crash Logs) that caused segmentation faults in `Frame::read`. Analysis suggests the cause: if Python updated the `StackChunk` while we were reading it, the number of bytes we copied into our `StackChunk` could be smaller than the size recorded in the chunk header. We would then trust the header, "believe" our copy held more bytes than it actually did, and read from an invalid pointer.
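To make the mismatch concrete, here is a minimal sketch of bounds-checking against the number of bytes actually copied rather than the size claimed by the header. It is not the code from this PR: `ChunkHeader`, `ChunkCopy`, `update`, `resolve`, and `header_size` are hypothetical names, and the "remote" memory is simulated with a plain in-process buffer.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical stand-in for the chunk header; only the size field matters here.
struct ChunkHeader {
    std::size_t size; // chunk size in bytes, as written by the interpreter
};

// Hypothetical local copy of a chunk that lives in the sampled thread.
struct ChunkCopy {
    std::uintptr_t origin = 0;      // address of the chunk in the sampled thread
    std::vector<std::uint8_t> data; // the bytes we actually copied
    std::size_t copied_size = 0;    // number of valid bytes in `data`

    // Copy `n` bytes from `src` and record how many bytes we really hold.
    void update(std::uintptr_t remote_addr, const void* src, std::size_t n) {
        assert(src != nullptr || n == 0);
        origin = remote_addr;
        data.resize(n);
        if (n > 0) {
            std::memcpy(data.data(), src, n);
        }
        copied_size = n;
    }

    // Size claimed by the copied header; may exceed `copied_size` after a race.
    std::size_t header_size() const {
        ChunkHeader header{0};
        if (copied_size >= sizeof(header)) {
            std::memcpy(&header, data.data(), sizeof(header));
        }
        return header.size;
    }

    // Translate a remote address into a pointer inside the local copy.
    // Returns nullptr unless [addr, addr + len) lies within the copied bytes.
    const void* resolve(std::uintptr_t addr, std::size_t len) const {
        if (addr < origin) {
            return nullptr;
        }
        const std::size_t offset = addr - origin;
        if (offset > copied_size || len > copied_size - offset) {
            return nullptr;
        }
        return data.data() + offset;
    }
};
```

In this sketch, a copy of 64 bytes whose header claims 128 still refuses to resolve any range that ends past byte 64, so the caller gets `nullptr` instead of a pointer beyond the local buffer.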

Example crash stack

Error UnixSignal: Process terminated with SEGV_MAPERR (SIGSEGV)
#0   0x00007fcde855cfc6 Frame::read 
#1   0x00007fcde855d114 unwind_frame 
#2   0x00007fcde855ec68 ThreadInfo::unwind 
#3   0x00007fcde855eda2 ThreadInfo::sample 
#4   0x00007fcde855f02e std::_Function_handler<void (_ts*, ThreadInfo&), Datadog::Sampler::sampling_thread(unsigned long)::{lambda(InterpreterInfo&)#1}::operator()(InterpreterInfo&) const::{lambda(_ts*, ThreadInfo&)#1}>::_M_invoke 
#5   0x00007fcde855f2c0 for_each_thread 
#6   0x00007fcde855f382 std::_Function_handler<void (InterpreterInfo&), Datadog::Sampler::sampling_thread(unsigned long)::{lambda(InterpreterInfo&)#1}>::_M_invoke 
#7   0x00007fcde855c2f9 for_each_interp 
#8   0x00007fcde855f6e7 Datadog::Sampler::sampling_thread 
#9   0x00007fcde855f853 call_sampling_thread

taegyunkim

Oh I thought we already handled this. Thanks for fixing this.

taegyunkim

Do you mind pasting relevant crash logs in the description for posterity?

cit-pr-commenter-54b7da · 2026-02-16T16:13:10Z

Codeowners resolved as

ddtrace/internal/datadog/profiling/stack/echion/echion/stack_chunk.h    @DataDog/profiling-python
ddtrace/internal/datadog/profiling/stack/src/echion/stack_chunk.cc      @DataDog/profiling-python
releasenotes/notes/profiling-race-condition-stack-chunk-b3efd548e57b1b8b.yaml  @DataDog/apm-python

datadog-official · 2026-02-16T16:39:39Z

⚠️ Tests

⚠️ Warnings

❄️ 1 New flaky test detected

test_sample_count[py3.10] from test_sample_count.py

Expected status 0, got 1.
=== Captured STDOUT ===
=== End of captured STDOUT ===
=== Captured STDERR ===
Traceback (most recent call last):
  File "tests/profiling/collector/test_sample_count.py", line 62, in <module>
    assert internal_metadata["sample_count"] > 0
AssertionError
=== End of captured STDERR ===

ℹ️ Info

🧪 All tests passed

This comment will be updated automatically if new data arrives.

Commit SHA: 17dc386

KowalskiThomas · 2026-02-16T22:45:40Z

/merge

gh-worker-devflow-routing-ef8351 · 2026-02-16T22:45:44Z

2026-02-16 22:45:43 UTC ℹ️ Start processing command /merge

2026-02-16 22:45:50 UTC ℹ️ MergeQueue: waiting for PR to be ready

This pull request is not mergeable according to GitHub. Common reasons include pending required checks, missing approvals, or merge conflicts — but it could also be blocked by other repository rules or settings.
It will be added to the queue as soon as checks pass and/or get approvals.
Note: if you pushed new commits since the last approval, you may need additional approval.
You can remove it from the waiting list with /remove command.

2026-02-16 23:37:05 UTC ℹ️ MergeQueue: merge request added to the queue

The expected merge time in main is approximately 5h (p90).

2026-02-17 00:45:02 UTC ℹ️ MergeQueue: Readding this merge request to the queue because another merge request processed with yours failed. No action is needed from your side.

2026-02-17 02:06:13 UTC ℹ️ MergeQueue: This merge request was merged

## Description

This fixes a crash happening in `Frame::read` caused by stale `previous` `StackChunk` entries persisting across thread iterations during stack sampling.

### Root Cause

When the Sampling Thread samples more than one Thread, it uses the same global `StackChunk` for each Thread's stack chain. `StackChunk::update_with_depth` recursively copies the linked list of `_PyStackChunk`s. However, when a stack chunk has no previous chunk, we would not clear the old `previous` pointer. This left stale `StackChunk` entries from previously-sampled threads in the chain.

When a subsequent Thread's frame address happened to fall within the remote address range of a stale chunk's `origin`, `StackChunk::resolve` would return a pointer into the stale local buffer. The stale data contained garbage field values, which would result in invalid accesses.

This is the same crash signature as #16519 (which fixed a race condition on `copied_size`) and #16631 (which added full-frame bounds checking). The stale `previous` chain was an additional vector for the same class of bug.

This is the crash we would see:

```
#0 0x00007fa8a3507fc6 Frame::read
#1 0x00007fa8a3508114 unwind_frame
#2 0x00007fa8a3509c68 ThreadInfo::unwind
#3 0x00007fa8a3509da2 ThreadInfo::sample
#4 0x00007fa8a350a02e std::_Function_handler<void (_ts*, ThreadInfo&), Datadog::Sampler::sampling_thread(unsigned long)::{lambda(InterpreterInfo&)#1}::operator()(InterpreterInfo&) const::{lambda(_ts*, ThreadInfo&)#1}>::_M_invoke
#5 0x00007fa8a350a2c0 for_each_thread
#6 0x00007fa8a350a382 std::_Function_handler<void (InterpreterInfo&), Datadog::Sampler::sampling_thread(unsigned long)::{lambda(InterpreterInfo&)#1}>::_M_invoke
#7 0x00007fa8a35072f9 for_each_interp
#8 0x00007fa8a350a6e7 Datadog::Sampler::sampling_thread
#9 0x00007fa8a350a853 call_sampling_thread
```

Co-authored-by: thomas.kowalski <thomas.kowalski@datadoghq.com>
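The stale `previous` problem described above can be illustrated with a minimal sketch of a reused chunk object that mirrors a linked list. This is not the actual `StackChunk::update_with_depth` implementation: `RemoteChunk`, `LocalChunk`, `update`, and `depth` are hypothetical names, and the remote list is read directly in-process rather than copied from another thread's memory.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>

// Hypothetical stand-in for a chunk owned by a sampled thread.
struct RemoteChunk {
    const RemoteChunk* previous; // older chunk in the same thread's chain, or nullptr
    int payload;                 // placeholder for the chunk's contents
};

// Hypothetical local mirror, reused across threads like a global chunk.
struct LocalChunk {
    const RemoteChunk* origin = nullptr;
    int payload = 0;
    std::unique_ptr<LocalChunk> previous;

    // Mirror `remote` and, recursively, every chunk it links to.
    void update(const RemoteChunk* remote) {
        assert(remote != nullptr);
        origin = remote;
        payload = remote->payload;
        if (remote->previous != nullptr) {
            if (!previous) {
                previous = std::make_unique<LocalChunk>();
            }
            previous->update(remote->previous);
        } else {
            // Without this reset, a link mirrored for an earlier thread
            // would survive and could later match an unrelated address.
            previous.reset();
        }
    }

    // Number of chunks currently mirrored in this chain.
    std::size_t depth() const {
        return 1 + (previous ? previous->depth() : 0);
    }
};
```

Updating one shared `LocalChunk` first from a two-chunk chain and then from a single-chunk chain leaves a depth of 1 in this sketch; dropping the `else` branch would leave the first thread's older chunk attached.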

KowalskiThomas requested review from a team as code owners February 16, 2026 16:09

KowalskiThomas requested review from brettlangdon and vlad-scherbich February 16, 2026 16:09

fix(profiling): race condition in StackChunk

17dc386

KowalskiThomas force-pushed the kowalski/fix-profiling-race-condition-in-stackchunk branch from a30aee8 to 17dc386 February 16, 2026 16:09

taegyunkim approved these changes Feb 16, 2026

taegyunkim reviewed Feb 16, 2026

gh-worker-dd-mergequeue-cf854d Bot merged commit 765a70d into main Feb 17, 2026
394 checks passed

gh-worker-dd-mergequeue-cf854d Bot deleted the kowalski/fix-profiling-race-condition-in-stackchunk branch February 17, 2026 02:06

KowalskiThomas mentioned this pull request Mar 20, 2026

fix(profiling): clear stale StackChunk::previous #17043

Merged

KowalskiThomas added the identified-by:crashtracking (Identified by Crash Tracking) and Profiling (Continuous Profiling) labels Apr 3, 2026
