common/mutex_debug: fix arm64 SIGBUS/segfault due to data race#64273
common/mutex_debug: fix arm64 SIGBUS/segfault due to data race#64273SrinivasaBharath merged 1 commit intoceph:mainfrom
Conversation
|
i just spotted this issue when testing one of my pull requests at https://jenkins.ceph.com/job/ceph-pull-requests-arm64/76662/, hence created this fix. |
|
lgtm |
|
jenkins test make check |
|
Small typo in the commit message subject: 'mute_debug' should be 'mutex_debug'. |
The mutex_debugging_base::is_locked_by_me() member function was experiencing random SIGBUS and segfault on ARM64 due to a data race on the non-atomic locked_by member. The racing occurs when: - Thread A writes to locked_by during lock acquisition - Thread B reads from locked_by in is_locked_by_me() - These accesses happen concurrently without synchronization On ARM64, std::thread::id (8 bytes when using libstdc++) writes/reads are not atomic, causing the reader to potentially see partially written or corrupted thread ID values, leading to undefined behavior in the comparison operator. Fix by making locked_by atomic and using proper memory ordering in is_locked_by_me() to ensure synchronized access. Fixes: https://tracker.ceph.com/issues/71547 Signed-off-by: Kefu Chai <tchaikov@gmail.com>
5d0c739 to
8bcf327
Compare
|
indeed. fixed. changelog
|
|
jenkins test api |
|
@yuriw hi Yuri, could you please include this change in your next batch? |
|
@ljflores hi Laura, could you please help test this change? |
bill-scales
left a comment
There was a problem hiding this comment.
I agree that there is technically a race hazard with the update to locked_by on arm64, I don't think that this explains the segmentation faults/bus errors that are being seen on the make checks (arm64) - especially as they have been seen on both arm and x86_64.
On x86_64 the update to locked_by will be atomic unless the 8 bytes of memory used to store the value crosses a L1 cache line (64 bytes) boundary. x86_64 compilers will typically align 8 byte values on an 8 byte boundary for performance reasons so the code would have to try quite hard to misalign a mutex and make this a non-atomic update.
A further issue is that I think that the code only uses is_locked_by_me in asserts to check that the current thread acquired the lock earlier. This type of defensive assert is not going to suffer from the race hazard unless the code has another bug in which case the assert may occasionally fail to fire. The absence of lots of failures where we do hit the assert suggest this isn't the case.
|
This is an automated message by src/script/redmine-upkeep.py. I have resolved the following tracker ticket due to the merge of this PR: No backports are pending for the ticket. If this is incorrect, please update the tracker |
|
@SrinivasaBharath adding reviewed-by is required for merges. Please use src/script/ptl-tool.py or another mechanism to add these. |
The mutex_debugging_base::is_locked_by_me() member function was experiencing random SIGBUS and segfault on ARM64 due to a data race on the non-atomic locked_by member.
The racing occurs when:
On ARM64, std::thread::id (8 bytes when using libstdc++) writes/reads are not atomic, causing the reader to potentially see partially written or corrupted thread ID values, leading to undefined behavior in the comparison operator.
Fix by making locked_by atomic and using proper memory ordering in is_locked_by_me() to ensure synchronized access.
Fixes: https://tracker.ceph.com/issues/71547
Contribution Guidelines
To sign and title your commits, please refer to Submitting Patches to Ceph.
If you are submitting a fix for a stable branch (e.g. "quincy"), please refer to Submitting Patches to Ceph - Backports for the proper workflow.
When filling out the below checklist, you may click boxes directly in the GitHub web UI. When entering or editing the entire PR message in the GitHub web UI editor, you may also select a checklist item by adding an
xbetween the brackets:[x]. Spaces and capitalization matter when checking off items this way.Checklist
Show available Jenkins commands
jenkins test classic perfJenkins Job | Jenkins Job Definitionjenkins test crimson perfJenkins Job | Jenkins Job Definitionjenkins test signedJenkins Job | Jenkins Job Definitionjenkins test make checkJenkins Job | Jenkins Job Definitionjenkins test make check arm64Jenkins Job | Jenkins Job Definitionjenkins test submodulesJenkins Job | Jenkins Job Definitionjenkins test dashboardJenkins Job | Jenkins Job Definitionjenkins test dashboard cephadmJenkins Job | Jenkins Job Definitionjenkins test apiJenkins Job | Jenkins Job Definitionjenkins test docsReadTheDocs | Github Workflow Definitionjenkins test ceph-volume allJenkins Jobs | Jenkins Jobs Definitionjenkins test windowsJenkins Job | Jenkins Job Definitionjenkins test rook e2eJenkins Job | Jenkins Job Definition