fix(profiling): race condition in StackChunk#16519
Conversation
a30aee8 to
17dc386
Compare
taegyunkim
left a comment
There was a problem hiding this comment.
Oh I thought we already handled this. Thanks for fixing this.
taegyunkim
left a comment
There was a problem hiding this comment.
Do you mind pasting relevant crash logs in the description for posterity?
Codeowners resolved as |
|
|
/merge |
|
View all feedbacks in Devflow UI.
This pull request is not mergeable according to GitHub. Common reasons include pending required checks, missing approvals, or merge conflicts — but it could also be blocked by other repository rules or settings.
The expected merge time in
|
<!-- dd-meta {"pullId":"4729b268-475f-44c8-bd80-994fac3bc0b8","source":"chat","resourceId":"ad20b0b3-bac2-4224-8b7b-5f42225c5803","workflowId":"94e37a14-9b8d-462d-aafd-dc7f0d9ddf36","codeChangeId":"94e37a14-9b8d-462d-aafd-dc7f0d9ddf36","sourceType":"chat"} -->
## Description
This fixes a crash happening in `Frame::read` caused by stale `previous` `StackChunk` entries persisting across thread iterations during stack sampling.
### Root Cause
When the Sampling Thread samples more than one Thread, it uses the same global `StackChunk` for each Thread's stack chain. `StackChunk::update_with_depth` recursively copies the linked list of `_PyStackChunk`'s.
However, when a stack chunk has no previous chunk, we would not clear the old `previous` pointer. This left stale `StackChunk` entries from previously-sampled threads in the chain.
When a subsequent Thread's frame address happened to fall within the remote address range of a stale chunk's `origin`, `StackChunk::resolve` would return a pointer into the stale local buffer. The stale data contained garbage field values, which would result in invalid accesses.
This is the same crash signature as #16519 (which fixed a race condition on `copied_size`) and #16631 (which added full-frame bounds checking). The stale `previous` chain was an additional vector for the same class of bug.
This is the crash we would see:
```
#0 0x00007fa8a3507fc6 Frame::read
#1 0x00007fa8a3508114 unwind_frame
#2 0x00007fa8a3509c68 ThreadInfo::unwind
#3 0x00007fa8a3509da2 ThreadInfo::sample
#4 0x00007fa8a350a02e std::_Function_handler<void (_ts*, ThreadInfo&), Datadog::Sampler::sampling_thread(unsigned long)::{lambda(InterpreterInfo&)#1}::operator()(InterpreterInfo&) const::{lambda(_ts*, ThreadInfo&)#1}>::_M_invoke
#5 0x00007fa8a350a2c0 for_each_thread
#6 0x00007fa8a350a382 std::_Function_handler<void (InterpreterInfo&), Datadog::Sampler::sampling_thread(unsigned long)::{lambda(InterpreterInfo&)#1}>::_M_invoke
#7 0x00007fa8a35072f9 for_each_interp
#8 0x00007fa8a350a6e7 Datadog::Sampler::sampling_thread
#9 0x00007fa8a350a853 call_sampling_thread
```
Co-authored-by: thomas.kowalski <thomas.kowalski@datadoghq.com>
Description
https://datadoghq.atlassian.net/browse/PROF-13774
This fixes a rare (but real, found in Crash Logs) race condition where we would see segmentation faults in
Frame::read. After analysis, it seems this comes from a race condition where the number of bytes copied in theStackChunkdid not match (was less than) the number of bytes in theStackChunkaccording to the header, if theStackChunkwas updated by Python as we were reading it. When this happens, we would try to read from an invalid pointer (we would "believe" our copy of theStackChunkhad more bytes than it actually did).Example crash stack