Bug #63378

rgw/multisite: Segmentation fault during full sync

Added by Shilpa MJ over 2 years ago. Updated 5 months ago.

Status:
Pending Backport
Priority:
Urgent
Assignee:
Target version:
% Done:

0%

Source:
Backport:
reef squid
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Tags (freeform):
backport_processed
Fixed In:
v19.3.0-5220-g5e9dcafd00
Released In:
v20.2.0~1918
Upkeep Timestamp:
2025-11-01T00:58:22+00:00

Description

http://qa-proxy.ceph.com/teuthology/smanjara-2023-10-30_20:18:36-rgw:multisite-wip-shilpa-rgw-test-multisite-distro-default-smithi/7441423/teuthology.log

2023-10-30T22:07:41.493+0000 7f4899a5a640 20 rgw rados thread: cr:s=0x55e0b6295900:op=0x55e0b6478000:28RGWDataFullSyncSingleEntryCR: operate()
2023-10-30T22:07:41.494+0000 7f4899a5a640 -1 *** Caught signal (Segmentation fault) **
in thread 7f4899a5a640 thread_name:data-sync

ceph version 18.0.0-6880-g8b1cc681 (8b1cc681d09f809ade48e839fde79ae1b6bd1850) reef (dev)
1: /lib64/libc.so.6(+0x54db0) [0x7f48c2454db0]
2: radosgw(+0xc8a07d) [0x55e0aefe807d]
3: radosgw(+0x38ad82) [0x55e0ae6e8d82]
4: radosgw(+0x836fa9) [0x55e0aeb94fa9]
5: radosgw(+0x9d14c7) [0x55e0aed2f4c7]
6: (RGWCoroutinesStack::operate(DoutPrefixProvider const*, RGWCoroutinesEnv*)+0x125) [0x55e0ae90f405]
7: (RGWCoroutinesManager::run(DoutPrefixProvider const*, std::__cxx11::list<RGWCoroutinesStack*, std::allocator<RGWCoroutinesStack*> >&)+0x2b6) [0x55e0ae910c76]
8: (RGWCoroutinesManager::run(DoutPrefixProvider const*, RGWCoroutine*)+0xad) [0x55e0ae911c2d]
9: (RGWRemoteDataLog::run_sync(DoutPrefixProvider const*, int)+0x4dc) [0x55e0aed3c02c]
10: radosgw(+0x781f08) [0x55e0aeadff08]
11: (RGWRadosThread::Worker::entry()+0xb3) [0x55e0aeae2413]
12: /lib64/libc.so.6(+0x9f802) [0x7f48c249f802]
13: /lib64/libc.so.6(+0x3f450) [0x7f48c243f450]

Files

c2.client.0.log.gz (144 KB) c2.client.0.log.gz Shilpa MJ, 09/04/2024 10:05 PM

Related issues 2 (1 open, 1 closed)

Copied to rgw - Backport #68297: reef: rgw/multisite: Segmentation fault during full sync (New, Shilpa MJ)
Copied to rgw - Backport #68298: squid: rgw/multisite: Segmentation fault during full sync (Resolved, Adam Emerson)
Actions #1

Updated by Casey Bodley over 2 years ago

  • Priority changed from Normal to Urgent
Actions #2

Updated by Casey Bodley over 2 years ago

  • Status changed from New to Fix Under Review
  • Pull request ID set to 54278
Actions #3

Updated by Steven Goodliff over 2 years ago

Hi,

I think I see the same on our test 18.2.0 dev cluster. If there is any info you need, let us know.

*** Caught signal (Segmentation fault) **
 in thread 7faf74ace700 thread_name:data-sync

 ceph version 18.2.0 (5dd24139a1eada541a3bc16b6941c5dde975e26d) reef (stable)
 1: /lib64/libpthread.so.0(+0x12cf0) [0x7fafcb3e9cf0]
 2: (RGWCoroutinesStack::_schedule()+0xe) [0x55c2ce5782ae]
 3: (RGWCoroutinesManager::run(DoutPrefixProvider const*, std::__cxx11::list<RGWCoroutinesStack*, std::allocator<RGWCoroutinesStack*> >&)+0xdc5) [0x55c2ce57afd5]
 4: (RGWCoroutinesManager::run(DoutPrefixProvider const*, RGWCoroutine*)+0x91) [0x55c2ce57b721]
 5: (RGWRemoteDataLog::run_sync(DoutPrefixProvider const*, int)+0x1e2) [0x55c2ceaff352]
 6: (RGWDataSyncProcessorThread::process(DoutPrefixProvider const*)+0x58) [0x55c2ce846d18]
 7: (RGWRadosThread::Worker::entry()+0xb3) [0x55c2ce80e003]
 8: /lib64/libpthread.so.0(+0x81ca) [0x7fafcb3df1ca]
 9: clone()
Actions #4

Updated by Casey Bodley over 2 years ago

  • Backport set to reef
Actions #5

Updated by Steven Goodliff over 2 years ago

Hi,

Is this likely to get into the 18.2.1 release? https://tracker.ceph.com/versions/675

Actions #6

Updated by Casey Bodley over 2 years ago

  • Status changed from Fix Under Review to In Progress
  • Pull request ID deleted (54278)
Actions #7

Updated by Shilpa MJ about 2 years ago

This crash seems to be coming from the 'data_sync_init' test cases in:
/ceph/qa/tasks/rgw_multi/tests.py

The crash doesn't reproduce locally, but reproduces consistently in teuthology runs.

Actions #8

Updated by Shilpa MJ about 2 years ago

The crash reproduces only in three-zone or two-zonegroup configurations.

Actions #9

Updated by Casey Bodley almost 2 years ago

  • Status changed from In Progress to New
Actions #10

Updated by Casey Bodley over 1 year ago

  • Backport changed from reef to reef squid
Actions #11

Updated by J. Eric Ivancich over 1 year ago

@Shilpa MJ, apparently much of our test coverage has been turned off, and Casey thinks it should be turned back on. Where do you think we stand on this bug?

Actions #12

Updated by Casey Bodley over 1 year ago

Is it possible that this was fixed with https://github.com/ceph/ceph/pull/59329?

Actions #13

Updated by Shilpa MJ over 1 year ago

@Casey Bodley no, that wasn't the cause. I see an invalid read in valgrind. It looks like a use-after-free when RGWDataFullSyncSingleEntryCR::operate() calls RGWDataSyncShardMarkerTrack::finish().

<error>
  <unique>0x423a</unique>
  <tid>587</tid>
  <threadname>data-sync</threadname>
  <kind>InvalidRead</kind>
  <what>Invalid read of size 8</what>
  <stack>
    <frame>
      <ip>0xA422F2</ip>
      <obj>/usr/bin/radosgw</obj>
      <fn>RGWSyncShardMarkerTrack&lt;std::__cxx11::basic_string&lt;char, std::char_traits&lt;char&gt;, std::allocator&lt;char&gt; &gt;, std::__cxx11::basic_string&lt;char, std::char_traits&lt;char&gt;, std::allocator&lt;char&gt; &gt; &gt;::finish(std::__cxx11::basic_string&lt;char, std::char_traits&lt;char&gt;, std::allocator&lt;char&gt; &gt; const&amp;)</fn>
      <dir>/usr/src/debug/ceph-19.3.0-4432.g37136b32.el9.x86_64/src/rgw/driver/rados</dir>
      <file>rgw_sync.h</file>
      <line>387</line>
    </frame>
    <frame>
      <ip>0xC25CCE</ip>
      <obj>/usr/bin/radosgw</obj>
      <fn>RGWDataFullSyncSingleEntryCR::operate(DoutPrefixProvider const*)</fn>
      <dir>/usr/src/debug/ceph-19.3.0-4432.g37136b32.el9.x86_64/src/rgw/driver/rados</dir>
      <file>rgw_data_sync.cc</file>
      <line>1732</line>
    </frame>
    <frame>
Actions #14

Updated by Shilpa MJ over 1 year ago

Attaching valgrind logs for future reference.

Actions #16

Updated by Shilpa MJ over 1 year ago

  • Status changed from New to Fix Under Review
Actions #17

Updated by Casey Bodley over 1 year ago

  • Status changed from Fix Under Review to Pending Backport
Actions #18

Updated by Upkeep Bot over 1 year ago

  • Copied to Backport #68297: reef: rgw/multisite: Segmentation fault during full sync added
Actions #19

Updated by Upkeep Bot over 1 year ago

  • Copied to Backport #68298: squid: rgw/multisite: Segmentation fault during full sync added
Actions #20

Updated by Upkeep Bot over 1 year ago

  • Tags (freeform) set to backport_processed
Actions #21

Updated by Yuri Weinstein 12 months ago

  • Target version set to v19.2.3
Actions #22

Updated by Upkeep Bot 8 months ago

  • Merge Commit set to 5e9dcafd0038fa66b3c975afe7fa63976dd59247
  • Fixed In set to v19.3.0-5220-g5e9dcafd003
  • Upkeep Timestamp set to 2025-07-09T16:09:05+00:00
Actions #23

Updated by Upkeep Bot 8 months ago

  • Fixed In changed from v19.3.0-5220-g5e9dcafd003 to v19.3.0-5220-g5e9dcafd00
  • Upkeep Timestamp changed from 2025-07-09T16:09:05+00:00 to 2025-07-14T17:41:43+00:00
Actions #24

Updated by Upkeep Bot 5 months ago

  • Released In set to v20.2.0~1918
  • Upkeep Timestamp changed from 2025-07-14T17:41:43+00:00 to 2025-11-01T00:58:22+00:00