fix(profiling): race condition in StackChunk#16519

Merged
gh-worker-dd-mergequeue-cf854d[bot] merged 1 commit into
mainfrom
kowalski/fix-profiling-race-condition-in-stackchunk
Feb 17, 2026
Conversation

@KowalskiThomas

@KowalskiThomas KowalskiThomas commented Feb 16, 2026

Description

https://datadoghq.atlassian.net/browse/PROF-13774

This fixes a rare but real race condition, found in crash logs, that caused segmentation faults in Frame::read. Analysis suggests the cause is that Python can update a StackChunk while we are reading it, so the number of bytes actually copied into our StackChunk can be smaller than the size recorded in the chunk header. When that happens, we treat our copy as holding more bytes than it does and end up reading through an invalid pointer.
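The mismatch described above can be illustrated with a minimal sketch. This is not the code from this PR: `ChunkCopy`, its fields, and the `update`/`resolve` signatures are hypothetical stand-ins, and the remote read is simulated with an in-process buffer. The idea shown is only that lookups are bounded by the number of bytes actually copied rather than by the size the header claims.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical, simplified stand-in for a local copy of a remote stack chunk.
struct ChunkCopy {
    std::uintptr_t origin = 0;    // remote address the copy was taken from
    std::size_t header_size = 0;  // size the chunk header claimed
    std::size_t copied_size = 0;  // bytes actually present in `data`
    std::vector<unsigned char> data;

    // Copy from `src`. `claimed` is the size read from the header and
    // `available` is how many bytes could really be read. The two can
    // disagree if the chunk changes while it is being read.
    void update(std::uintptr_t remote_addr, const unsigned char* src,
                std::size_t claimed, std::size_t available) {
        assert(src != nullptr);
        origin = remote_addr;
        header_size = claimed;
        copied_size = claimed < available ? claimed : available;
        data.assign(src, src + copied_size);
    }

    // Translate a remote address into a pointer inside the local copy, or
    // return nullptr if [addr, addr + len) is not fully covered by the bytes
    // that were actually copied.
    const unsigned char* resolve(std::uintptr_t addr, std::size_t len) const {
        if (addr < origin) return nullptr;
        const std::size_t offset = static_cast<std::size_t>(addr - origin);
        if (offset > copied_size || len > copied_size - offset) return nullptr;
        return data.data() + offset;
    }
};
```

In this sketch, a chunk whose header claims 64 bytes but for which only 48 were copied rejects a 16-byte read at offset 40, whereas a check against `header_size` would have accepted it and read past the end of the local buffer.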


Example crash stack

Error UnixSignal: Process terminated with SEGV_MAPERR (SIGSEGV)
#0   0x00007fcde855cfc6 Frame::read 
#1   0x00007fcde855d114 unwind_frame 
#2   0x00007fcde855ec68 ThreadInfo::unwind 
#3   0x00007fcde855eda2 ThreadInfo::sample 
#4   0x00007fcde855f02e std::_Function_handler<void (_ts*, ThreadInfo&), Datadog::Sampler::sampling_thread(unsigned long)::{lambda(InterpreterInfo&)#1}::operator()(InterpreterInfo&) const::{lambda(_ts*, ThreadInfo&)#1}>::_M_invoke 
#5   0x00007fcde855f2c0 for_each_thread 
#6   0x00007fcde855f382 std::_Function_handler<void (InterpreterInfo&), Datadog::Sampler::sampling_thread(unsigned long)::{lambda(InterpreterInfo&)#1}>::_M_invoke 
#7   0x00007fcde855c2f9 for_each_interp 
#8   0x00007fcde855f6e7 Datadog::Sampler::sampling_thread 
#9   0x00007fcde855f853 call_sampling_thread 

@KowalskiThomas KowalskiThomas force-pushed the kowalski/fix-profiling-race-condition-in-stackchunk branch from a30aee8 to 17dc386 Compare February 16, 2026 16:09

@taegyunkim taegyunkim left a comment

Oh I thought we already handled this. Thanks for fixing this.

@taegyunkim taegyunkim left a comment

Do you mind pasting relevant crash logs in the description for posterity?

@cit-pr-commenter-54b7da

Codeowners resolved as

ddtrace/internal/datadog/profiling/stack/echion/echion/stack_chunk.h    @DataDog/profiling-python
ddtrace/internal/datadog/profiling/stack/src/echion/stack_chunk.cc      @DataDog/profiling-python
releasenotes/notes/profiling-race-condition-stack-chunk-b3efd548e57b1b8b.yaml  @DataDog/apm-python

@datadog-official

datadog-official Bot commented Feb 16, 2026

⚠️ Tests

⚠️ Warnings

❄️ 1 New flaky test detected

test_sample_count[py3.10] from test_sample_count.py (Datadog)
Expected status 0, got 1.
=== Captured STDOUT ===
=== End of captured STDOUT ===
=== Captured STDERR ===
Traceback (most recent call last):
  File "tests/profiling/collector/test_sample_count.py", line 62, in <module>
    assert internal_metadata["sample_count"] > 0
AssertionError
=== End of captured STDERR ===

ℹ️ Info

🧪 All tests passed

🔗 Commit SHA: 17dc386

@KowalskiThomas

/merge

@gh-worker-devflow-routing-ef8351

gh-worker-devflow-routing-ef8351 Bot commented Feb 16, 2026

2026-02-16 22:45:43 UTC ℹ️ Start processing command /merge


2026-02-16 22:45:50 UTC ℹ️ MergeQueue: waiting for PR to be ready

This pull request is not mergeable according to GitHub. Common reasons include pending required checks, missing approvals, or merge conflicts — but it could also be blocked by other repository rules or settings.
It will be added to the queue as soon as checks pass and/or get approvals. View in MergeQueue UI.
Note: if you pushed new commits since the last approval, you may need additional approval.
You can remove it from the waiting list with /remove command.


2026-02-16 23:37:05 UTC ℹ️ MergeQueue: merge request added to the queue

The expected merge time in main is approximately 5h (p90).


2026-02-17 00:45:02 UTC ℹ️ MergeQueue: Readding this merge request to the queue because another merge request processed with yours failed. No action is needed from your side.


2026-02-17 02:06:13 UTC ℹ️ MergeQueue: This merge request was merged

@gh-worker-dd-mergequeue-cf854d gh-worker-dd-mergequeue-cf854d Bot merged commit 765a70d into main Feb 17, 2026
394 checks passed
@gh-worker-dd-mergequeue-cf854d gh-worker-dd-mergequeue-cf854d Bot deleted the kowalski/fix-profiling-race-condition-in-stackchunk branch February 17, 2026 02:06
gh-worker-dd-mergequeue-cf854d Bot pushed a commit that referenced this pull request Mar 20, 2026
## Description

This fixes a crash happening in `Frame::read` caused by stale `previous` `StackChunk` entries persisting across thread iterations during stack sampling.

### Root Cause

When the sampling thread samples more than one thread, it reuses the same global `StackChunk` for each thread's stack chain. `StackChunk::update_with_depth` recursively copies the linked list of `_PyStackChunk`s.  
However, when a stack chunk had no previous chunk, we did not clear the old `previous` pointer. This left stale `StackChunk` entries from previously sampled threads in the chain.

When a subsequent thread's frame address happened to fall within the remote address range of a stale chunk's `origin`, `StackChunk::resolve` would return a pointer into the stale local buffer. The stale data contained garbage field values, which led to invalid memory accesses.
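A minimal sketch of the stale-chain problem, under stated assumptions: `RemoteChunk` and `LocalChunk` are hypothetical types, not the real `StackChunk` or `_PyStackChunk`, the remote copy is replaced by zero-filling a buffer, and the depth limit implied by `update_with_depth` is omitted. It only illustrates why the `previous` link has to be reset when the newly copied chunk has no predecessor.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <vector>

// Hypothetical model of a remote chunk: an address range plus a link to the
// previous chunk in the same thread's chain.
struct RemoteChunk {
    std::uintptr_t addr;
    std::size_t size;
    const RemoteChunk* previous;
};

// Hypothetical local copy that is reused across sampled threads.
struct LocalChunk {
    std::uintptr_t origin = 0;
    std::vector<unsigned char> data;
    std::unique_ptr<LocalChunk> previous;

    void update(const RemoteChunk& remote) {
        assert(remote.size > 0);
        origin = remote.addr;
        data.assign(remote.size, 0);  // stands in for copying remote memory
        if (remote.previous != nullptr) {
            if (!previous) previous = std::make_unique<LocalChunk>();
            previous->update(*remote.previous);
        } else {
            // Without this reset, a chunk copied for an earlier thread would
            // stay linked and could still match later lookups.
            previous.reset();
        }
    }

    // Walk the chain and translate a remote address to a local pointer,
    // or return nullptr if no chunk in the chain covers it.
    const unsigned char* resolve(std::uintptr_t addr) const {
        if (addr >= origin && addr - origin < data.size()) {
            return data.data() + (addr - origin);
        }
        return previous ? previous->resolve(addr) : nullptr;
    }
};
```

With the reset in place, updating the shared `LocalChunk` for a second thread whose chain has a single chunk drops the first thread's `previous` entry, so an address in that old range no longer resolves.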

This is the same crash signature as #16519 (which fixed a race condition on `copied_size`) and #16631 (which added full-frame bounds checking). The stale `previous` chain was an additional vector for the same class of bug.

This is the crash we would see:

```
#0 0x00007fa8a3507fc6 Frame::read
#1 0x00007fa8a3508114 unwind_frame
#2 0x00007fa8a3509c68 ThreadInfo::unwind
#3 0x00007fa8a3509da2 ThreadInfo::sample
#4 0x00007fa8a350a02e std::_Function_handler<void (_ts*, ThreadInfo&), Datadog::Sampler::sampling_thread(unsigned long)::{lambda(InterpreterInfo&)#1}::operator()(InterpreterInfo&) const::{lambda(_ts*, ThreadInfo&)#1}>::_M_invoke
#5 0x00007fa8a350a2c0 for_each_thread
#6 0x00007fa8a350a382 std::_Function_handler<void (InterpreterInfo&), Datadog::Sampler::sampling_thread(unsigned long)::{lambda(InterpreterInfo&)#1}>::_M_invoke
#7 0x00007fa8a35072f9 for_each_interp
#8 0x00007fa8a350a6e7 Datadog::Sampler::sampling_thread
#9 0x00007fa8a350a853 call_sampling_thread
```


Co-authored-by: thomas.kowalski <thomas.kowalski@datadoghq.com>
@KowalskiThomas KowalskiThomas added identified-by:crashtracking Identified by Crash Tracking Profiling Continous Profling labels Apr 3, 2026
Labels

identified-by:crashtracking Identified by Crash Tracking Profiling Continous Profling

2 participants