chore(profiling): address issue found by fuzzer#16631
Conversation
Codeowners resolved as |
d6774d6 to
29d5ba4
Compare
This comment has been minimized.
This comment has been minimized.
taegyunkim
left a comment
There was a problem hiding this comment.
I was wondering how this can happen, maybe there's a chance where we copy over a StackChunk that is of size x, but then it has been incremented as the program executes, and we get an address that is above x?
|
https://gitlab.ddbuild.io/DataDog/apm-reliability/dd-trace-py/-/jobs/1450393487 the crash from test would need some attention There are couple more suites that crashed. |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 29d5ba4db2
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
6a66816 to
18d6d4d
Compare
<!-- dd-meta {"pullId":"4729b268-475f-44c8-bd80-994fac3bc0b8","source":"chat","resourceId":"ad20b0b3-bac2-4224-8b7b-5f42225c5803","workflowId":"94e37a14-9b8d-462d-aafd-dc7f0d9ddf36","codeChangeId":"94e37a14-9b8d-462d-aafd-dc7f0d9ddf36","sourceType":"chat"} -->
## Description
This fixes a crash happening in `Frame::read` caused by stale `previous` `StackChunk` entries persisting across thread iterations during stack sampling.
### Root Cause
When the Sampling Thread samples more than one Thread, it uses the same global `StackChunk` for each Thread's stack chain. `StackChunk::update_with_depth` recursively copies the linked list of `_PyStackChunk`'s.
However, when a stack chunk has no previous chunk, we would not clear the old `previous` pointer. This left stale `StackChunk` entries from previously-sampled threads in the chain.
When a subsequent Thread's frame address happened to fall within the remote address range of a stale chunk's `origin`, `StackChunk::resolve` would return a pointer into the stale local buffer. The stale data contained garbage field values, which would result in invalid accesses.
This is the same crash signature as #16519 (which fixed a race condition on `copied_size`) and #16631 (which added full-frame bounds checking). The stale `previous` chain was an additional vector for the same class of bug.
This is the crash we would see:
```
#0 0x00007fa8a3507fc6 Frame::read
#1 0x00007fa8a3508114 unwind_frame
#2 0x00007fa8a3509c68 ThreadInfo::unwind
#3 0x00007fa8a3509da2 ThreadInfo::sample
#4 0x00007fa8a350a02e std::_Function_handler<void (_ts*, ThreadInfo&), Datadog::Sampler::sampling_thread(unsigned long)::{lambda(InterpreterInfo&)#1}::operator()(InterpreterInfo&) const::{lambda(_ts*, ThreadInfo&)#1}>::_M_invoke
#5 0x00007fa8a350a2c0 for_each_thread
#6 0x00007fa8a350a382 std::_Function_handler<void (InterpreterInfo&), Datadog::Sampler::sampling_thread(unsigned long)::{lambda(InterpreterInfo&)#1}>::_M_invoke
#7 0x00007fa8a35072f9 for_each_interp
#8 0x00007fa8a350a6e7 Datadog::Sampler::sampling_thread
#9 0x00007fa8a350a853 call_sampling_thread
```
Co-authored-by: thomas.kowalski <thomas.kowalski@datadoghq.com>
Description
https://datadoghq.atlassian.net/browse/PROF-13838
This was reported by the fuzzer and although I don't think we've ever seen it happen in real life (according to crash stacks), it's still valid and worth fixing.