Bug #63378
rgw/multisite: Segmentation fault during full sync
Description
2023-10-30T22:07:41.493+0000 7f4899a5a640 20 rgw rados thread: cr:s=0x55e0b6295900:op=0x55e0b6478000:28RGWDataFullSyncSingleEntryCR: operate()
2023-10-30T22:07:41.494+0000 7f4899a5a640 -1 *** Caught signal (Segmentation fault) **
in thread 7f4899a5a640 thread_name:data-sync
ceph version 18.0.0-6880-g8b1cc681 (8b1cc681d09f809ade48e839fde79ae1b6bd1850) reef (dev)
1: /lib64/libc.so.6(+0x54db0) [0x7f48c2454db0]
2: radosgw(+0xc8a07d) [0x55e0aefe807d]
3: radosgw(+0x38ad82) [0x55e0ae6e8d82]
4: radosgw(+0x836fa9) [0x55e0aeb94fa9]
5: radosgw(+0x9d14c7) [0x55e0aed2f4c7]
6: (RGWCoroutinesStack::operate(DoutPrefixProvider const*, RGWCoroutinesEnv*)+0x125) [0x55e0ae90f405]
7: (RGWCoroutinesManager::run(DoutPrefixProvider const*, std::__cxx11::list<RGWCoroutinesStack*, std::allocator<RGWCoroutinesStack*> >&)+0x2b6) [0x55e0ae910c76]
8: (RGWCoroutinesManager::run(DoutPrefixProvider const*, RGWCoroutine*)+0xad) [0x55e0ae911c2d]
9: (RGWRemoteDataLog::run_sync(DoutPrefixProvider const*, int)+0x4dc) [0x55e0aed3c02c]
10: radosgw(+0x781f08) [0x55e0aeadff08]
11: (RGWRadosThread::Worker::entry()+0xb3) [0x55e0aeae2413]
12: /lib64/libc.so.6(+0x9f802) [0x7f48c249f802]
13: /lib64/libc.so.6(+0x3f450) [0x7f48c243f450]
Updated by Casey Bodley over 2 years ago
- Status changed from New to Fix Under Review
- Pull request ID set to 54278
Updated by Steven Goodliff over 2 years ago
Hi,
I think I see the same on our 18.2.0 test cluster. If there is any info you need, let us know.
*** Caught signal (Segmentation fault) **
in thread 7faf74ace700 thread_name:data-sync
ceph version 18.2.0 (5dd24139a1eada541a3bc16b6941c5dde975e26d) reef (stable)
1: /lib64/libpthread.so.0(+0x12cf0) [0x7fafcb3e9cf0]
2: (RGWCoroutinesStack::_schedule()+0xe) [0x55c2ce5782ae]
3: (RGWCoroutinesManager::run(DoutPrefixProvider const*, std::__cxx11::list<RGWCoroutinesStack*, std::allocator<RGWCoroutinesStack*> >&)+0xdc5) [0x55c2ce57afd5]
4: (RGWCoroutinesManager::run(DoutPrefixProvider const*, RGWCoroutine*)+0x91) [0x55c2ce57b721]
5: (RGWRemoteDataLog::run_sync(DoutPrefixProvider const*, int)+0x1e2) [0x55c2ceaff352]
6: (RGWDataSyncProcessorThread::process(DoutPrefixProvider const*)+0x58) [0x55c2ce846d18]
7: (RGWRadosThread::Worker::entry()+0xb3) [0x55c2ce80e003]
8: /lib64/libpthread.so.0(+0x81ca) [0x7fafcb3df1ca]
9: clone()
Updated by Steven Goodliff over 2 years ago
Hi,
Is this likely to get into the 18.2.1 release? https://tracker.ceph.com/versions/675
Updated by Casey Bodley over 2 years ago
- Status changed from Fix Under Review to In Progress
- Pull request ID deleted (54278)
Updated by Shilpa MJ about 2 years ago
This crash seems to be coming from the 'data_sync_init' test cases in:
/ceph/qa/tasks/rgw_multi/tests.py
The crash doesn't reproduce locally, but it reproduces consistently in teuthology runs.
Updated by Shilpa MJ about 2 years ago
The crash reproduces only in 3-zone or two-zonegroup configurations.
Updated by Casey Bodley almost 2 years ago
- Status changed from In Progress to New
Updated by Casey Bodley over 1 year ago
- Backport changed from reef to reef squid
Updated by J. Eric Ivancich over 1 year ago
@Shilpa MJ , apparently much of our test coverage has been turned off and Casey thinks it should be turned back on. Where do you think we stand on this bug?
Updated by Casey Bodley over 1 year ago
is it possible that this was fixed with https://github.com/ceph/ceph/pull/59329?
Updated by Shilpa MJ over 1 year ago
@Casey Bodley no, that wasn't the cause. I see an invalid read in valgrind. It looks like a use-after-free condition when RGWDataFullSyncSingleEntryCR::operate() calls RGWDataSyncShardMarkerTrack::finish().
<error>
<unique>0x423a</unique>
<tid>587</tid>
<threadname>data-sync</threadname>
<kind>InvalidRead</kind>
<what>Invalid read of size 8</what>
<stack>
<frame>
<ip>0xA422F2</ip>
<obj>/usr/bin/radosgw</obj>
<fn>RGWSyncShardMarkerTrack<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >::finish(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)</fn>
<dir>/usr/src/debug/ceph-19.3.0-4432.g37136b32.el9.x86_64/src/rgw/driver/rados</dir>
<file>rgw_sync.h</file>
<line>387</line>
</frame>
<frame>
<ip>0xC25CCE</ip>
<obj>/usr/bin/radosgw</obj>
<fn>RGWDataFullSyncSingleEntryCR::operate(DoutPrefixProvider const*)</fn>
<dir>/usr/src/debug/ceph-19.3.0-4432.g37136b32.el9.x86_64/src/rgw/driver/rados</dir>
<file>rgw_data_sync.cc</file>
<line>1732</line>
</frame>
<frame>
Updated by Shilpa MJ over 1 year ago
- File c2.client.0.log.gz c2.client.0.log.gz added
attaching valgrind logs for future reference.
Updated by Shilpa MJ over 1 year ago
- Pull request ID set to 59536
Updated by Shilpa MJ over 1 year ago
- Status changed from New to Fix Under Review
Updated by Casey Bodley over 1 year ago
- Status changed from Fix Under Review to Pending Backport
Updated by Upkeep Bot over 1 year ago
- Copied to Backport #68297: reef: rgw/multisite: Segmentation fault during full sync added
Updated by Upkeep Bot over 1 year ago
- Copied to Backport #68298: squid: rgw/multisite: Segmentation fault during full sync added
Updated by Upkeep Bot over 1 year ago
- Tags (freeform) set to backport_processed
Updated by Yuri Weinstein 12 months ago
- Target version set to v19.2.3
Updated by Upkeep Bot 8 months ago
- Merge Commit set to 5e9dcafd0038fa66b3c975afe7fa63976dd59247
- Fixed In set to v19.3.0-5220-g5e9dcafd003
- Upkeep Timestamp set to 2025-07-09T16:09:05+00:00
Updated by Upkeep Bot 8 months ago
- Fixed In changed from v19.3.0-5220-g5e9dcafd003 to v19.3.0-5220-g5e9dcafd00
- Upkeep Timestamp changed from 2025-07-09T16:09:05+00:00 to 2025-07-14T17:41:43+00:00
Updated by Upkeep Bot 5 months ago
- Released In set to v20.2.0~1918
- Upkeep Timestamp changed from 2025-07-14T17:41:43+00:00 to 2025-11-01T00:58:22+00:00