Skip to content

common/mutex_debug: fix arm64 SIGBUS/segfault due to data race#64273

Merged
SrinivasaBharath merged 1 commit intoceph:mainfrom
tchaikov:wip-ceph-mutex-atomic
Sep 1, 2025
Merged

common/mutex_debug: fix arm64 SIGBUS/segfault due to data race#64273
SrinivasaBharath merged 1 commit intoceph:mainfrom
tchaikov:wip-ceph-mutex-atomic

Conversation

@tchaikov
Copy link
Contributor

@tchaikov tchaikov commented Jul 1, 2025

The mutex_debugging_base::is_locked_by_me() member function was experiencing random SIGBUS and segfault on ARM64 due to a data race on the non-atomic locked_by member.

The racing occurs when:

  • Thread A writes to locked_by during lock acquisition
  • Thread B reads from locked_by in is_locked_by_me()
  • These accesses happen concurrently without synchronization

On ARM64, std::thread::id (8 bytes when using libstdc++) writes/reads are not atomic, causing the reader to potentially see partially written or corrupted thread ID values, leading to undefined behavior in the comparison operator.

Fix by making locked_by atomic and using proper memory ordering in is_locked_by_me() to ensure synchronized access.

Fixes: https://tracker.ceph.com/issues/71547

Contribution Guidelines

  • To sign and title your commits, please refer to Submitting Patches to Ceph.

  • If you are submitting a fix for a stable branch (e.g. "quincy"), please refer to Submitting Patches to Ceph - Backports for the proper workflow.

  • When filling out the below checklist, you may click boxes directly in the GitHub web UI. When entering or editing the entire PR message in the GitHub web UI editor, you may also select a checklist item by adding an x between the brackets: [x]. Spaces and capitalization matter when checking off items this way.

Checklist

  • Tracker (select at least one)
    • References tracker ticket
    • Very recent bug; references commit where it was introduced
    • New feature (ticket optional)
    • Doc update (no ticket needed)
    • Code cleanup (no ticket needed)
  • Component impact
    • Affects Dashboard, opened tracker ticket
    • Affects Orchestrator, opened tracker ticket
    • No impact that needs to be tracked
  • Documentation (select at least one)
    • Updates relevant documentation
    • No doc update is appropriate
  • Tests (select at least one)
Show available Jenkins commands

@github-actions github-actions bot added the common label Jul 1, 2025
@tchaikov tchaikov added the arm64 label Jul 1, 2025
@tchaikov tchaikov requested review from Svelar and rosinL July 1, 2025 06:48
@tchaikov
Copy link
Contributor Author

tchaikov commented Jul 1, 2025

@Svelar and @rosinL hi Rongqi and Rixin, since this issue only manifests on the ARM64 platforms, could you help review this change?

@tchaikov
Copy link
Contributor Author

tchaikov commented Jul 1, 2025

i just spotted this issue when testing one of my pull requests at https://jenkins.ceph.com/job/ceph-pull-requests-arm64/76662/, hence created this fix.

@rosinL
Copy link
Member

rosinL commented Jul 1, 2025

lgtm

@tchaikov
Copy link
Contributor Author

tchaikov commented Jul 2, 2025

jenkins test make check

@phlogistonjohn
Copy link
Contributor

Small typo in the commit message subject: 'mute_debug' should be 'mutex_debug'.
'mute_debug' sounds like you are trying to silence debug messages ;-)

The mutex_debugging_base::is_locked_by_me() member function was
experiencing random SIGBUS and segfault on ARM64 due to a data race on
the non-atomic locked_by member.

The racing occurs when:
- Thread A writes to locked_by during lock acquisition
- Thread B reads from locked_by in is_locked_by_me()
- These accesses happen concurrently without synchronization

On ARM64, std::thread::id (8 bytes when using libstdc++) writes/reads
are not atomic, causing the reader to potentially see partially written
or corrupted thread ID values, leading to undefined behavior in the
comparison operator.

Fix by making locked_by atomic and using proper memory ordering
in is_locked_by_me() to ensure synchronized access.

Fixes: https://tracker.ceph.com/issues/71547

Signed-off-by: Kefu Chai <tchaikov@gmail.com>
@tchaikov tchaikov force-pushed the wip-ceph-mutex-atomic branch from 5d0c739 to 8bcf327 Compare July 3, 2025 04:22
@tchaikov
Copy link
Contributor Author

tchaikov commented Jul 3, 2025

indeed. fixed.

changelog

  • s/mute_debug/mutex_debug/

@tchaikov tchaikov changed the title common/mute_debug: fix arm64 SIGBUS/segfault due to data race common/mutex_debug: fix arm64 SIGBUS/segfault due to data race Jul 3, 2025
@tchaikov
Copy link
Contributor Author

jenkins test api

@tchaikov
Copy link
Contributor Author

@yuriw hi Yuri, could you please include this change in your next batch?

@tchaikov
Copy link
Contributor Author

tchaikov commented Aug 3, 2025

@ljflores hi Laura, could you please help test this change?

Copy link
Contributor

@bill-scales bill-scales left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that there is technically a race hazard with the update to locked_by on arm64, I don't think that this explains the segmentation faults/bus errors that are being seen on the make checks (arm64) - especially as they have been seen on both arm and x86_64.

On x86_64 the update to locked_by will be atomic unless the 8 bytes of memory used to store the value crosses a L1 cache line (64 bytes) boundary. x86_64 compilers will typically align 8 byte values on an 8 byte boundary for performance reasons so the code would have to try quite hard to misalign a mutex and make this a non-atomic update.

A further issue is that I think that the code only uses is_locked_by_me in asserts to check that the current thread acquired the lock earlier. This type of defensive assert is not going to suffer from the race hazard unless the code has another bug in which case the assert may occasionally fail to fire. The absence of lots of failures where we do hit the assert suggest this isn't the case.

@ljflores
Copy link
Member

@ljflores hi Laura, could you please help test this change?

@tchaikov sorry for the delay in testing- it's been batched up now. If something needs to go through the rados suite, you can ensure it does by adding the "core" label as Casey did so it gets picked up in the query that Yuri uses.

@amathuria
Copy link
Contributor

@SrinivasaBharath SrinivasaBharath merged commit da1b72e into ceph:main Sep 1, 2025
21 of 22 checks passed
@github-actions
Copy link

github-actions bot commented Sep 1, 2025

This is an automated message by src/script/redmine-upkeep.py.

I have resolved the following tracker ticket due to the merge of this PR:

No backports are pending for the ticket. If this is incorrect, please update the tracker
ticket and reset to Pending Backport state.

Update Log: https://github.com/ceph/ceph/actions/runs/17389214803

@batrick
Copy link
Member

batrick commented Sep 2, 2025

@SrinivasaBharath adding reviewed-by is required for merges. Please use src/script/ptl-tool.py or another mechanism to add these.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

9 participants