Bug #71510

closed

client: crash with concurrent nonblocking fsync and write

Added by Venky Shankar 10 months ago. Updated 6 months ago.

Status: Resolved
Priority: Normal
Assignee:
Category: fsck/damage handling
Target version:
% Done: 0%
Source: other
Backport: tentacle,squid
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS): Client
Labels (FS): crash
Pull request ID:
Tags (freeform): backport_processed
Fixed In: v20.3.0-968-gf484edf976
Released In:
Upkeep Timestamp: 2025-09-23T12:45:53+00:00

Description

Execution of the asynchronous fsync state machine can be halted, with the request put on a wait queue, if Fb caps are in use (i.e., the ref count for Fb caps is > 0). Let's call this stage1. Before this stage is reached, if the execution context has to wait for unsafe operations, the request's ref count is incremented and the request is put on a wait queue (req->waitfor_safe). Let's call this stage0.

When the stage0 request is woken up, the execution context moves to stage1, where the reference is dropped. The wait in stage1 does not increment the request's reference count; however, the stage1 execution context can be retried (if Fb caps are still in use), in which case the reference is dropped again. This second drop is unbalanced, so the request can be released prematurely, which leads to the assertion failure in the backtrace below.
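
The unbalanced drop described above can be illustrated with a small toy model. This is only a sketch under stated assumptions, not the Ceph client code: `ToyRequest`, `ToyFsyncState`, `Outcome`, `simulate()` and the `fb_busy_wakeups` parameter are hypothetical names invented for illustration, and the model only covers the path where stage0 actually has to wait.

```cpp
#include <cassert>

// Hypothetical stand-in for a reference-counted request (not Ceph's MetaRequest).
struct ToyRequest {
  int refs = 1;                    // reference held by the request's owner
  bool linked = true;              // request is still linked into a tracking list
  bool freed_while_linked = false; // set when the count hits zero while still linked

  void get() { ++refs; }
  void put() {
    --refs;
    if (refs == 0 && linked) {
      // In real code the request would be destroyed here while still on a list.
      freed_while_linked = true;
    }
  }
};

// Hypothetical two-stage state machine mirroring the stage0/stage1 description.
struct ToyFsyncState {
  ToyRequest& req;
  int stage = 0;
  int fb_busy_wakeups; // how many times stage1 is woken while Fb caps are still in use

  ToyFsyncState(ToyRequest& r, int busy) : req(r), fb_busy_wakeups(busy) {}

  // Returns true when finished, false when parked on a wait queue.
  bool advance() {
    switch (stage) {
    case 0:
      req.get();  // stage0: take a reference before waiting for unsafe operations
      stage = 1;
      return false;
    case 1:
      req.put();  // stage1: drop the stage0 reference on every entry
      if (fb_busy_wakeups > 0) {
        --fb_busy_wakeups;
        return false; // parked again WITHOUT taking a reference; stage1 is retried
      }
      stage = 2;
      return true;
    default:
      return true;
    }
  }
};

struct Outcome {
  int refs;
  bool freed_while_linked;
};

// Runs the toy state machine to completion and reports the final reference state.
Outcome simulate(int fb_busy_wakeups) {
  assert(fb_busy_wakeups >= 0);
  ToyRequest req;
  ToyFsyncState st(req, fb_busy_wakeups);
  while (!st.advance()) {
    // each loop iteration models one wake-up from a wait queue
  }
  return {req.refs, req.freed_while_linked};
}
```

In this model, `simulate(0)` leaves the owner's reference intact (one get, one put), while `simulate(1)` performs a second put during the stage1 retry and drives the count to zero while the request is still linked. That would be consistent with the assertion raised from the `xlist<MetaRequest*>::item` destructor in the backtrace below.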

Client crash backtrace

    0x00007f3115b2452c in __pthread_kill_implementation () from /lib64/libc.so.6
    0x00007f3115ad7686 in raise () from /lib64/libc.so.6
    0x00007f3115ac1833 in abort () from /lib64/libc.so.6
    0x00007f3113375d0a in ceph::__ceph_assert_fail (assertion=<optimized out>, file=<optimized out>, line=<optimized out>, func=<optimized out>) at /usr/src/debug/ceph-19.2.0-124.el9cp.x86_64/src/common/assert.cc:74
    0x00007f3113375e6f in ceph::__ceph_assert_fail (ctx=...) at /usr/src/debug/ceph-19.2.0-124.el9cp.x86_64/src/common/assert.cc:79
    0x00007f311237db1d in xlist<MetaRequest*>::item::~item (this=<optimized out>, this=<optimized out>) at /usr/src/debug/ceph-19.2.0-124.el9cp.x86_64/src/include/xlist.h:31
    MetaRequest::~MetaRequest (this=<optimized out>, this=<optimized out>) at /usr/src/debug/ceph-19.2.0-124.el9cp.x86_64/src/client/MetaRequest.cc:65
    Client::put_request (this=0x564b491726c0, request=0x7f301c0165c0) at /usr/src/debug/ceph-19.2.0-124.el9cp.x86_64/src/client/Client.cc:2140
    0x00007f31123c88ad in Client::C_nonblocking_fsync_state::advance (this=0x7f307002e9f0) at /usr/src/debug/ceph-19.2.0-124.el9cp.x86_64/src/client/Client.cc:11905
    0x00007f3112331ccd in Context::complete (this=0x7f3070009250, r=<optimized out>) at /usr/src/debug/ceph-19.2.0-124.el9cp.x86_64/src/include/Context.h:99
    0x00007f311246a964 in Client::signal_context_list(std::__cxx11::list<Context*, std::allocator<Context*> >&) [clone .constprop.0] (ls=std::__cxx11::list = {...}, this=<optimized out>)
        at /usr/src/debug/ceph-19.2.0-124.el9cp.x86_64/src/client/Client.cc:4257
    0x00007f3112395f45 in Client::put_cap_ref (this=0x564b491726c0, in=0x7f306807be90, cap=<optimized out>) at /usr/src/debug/ceph-19.2.0-124.el9cp.x86_64/src/client/Client.cc:3611
    0x00007f31123331f3 in Client::C_Write_Finisher::finish_io (r=0, this=0x7f30240442d0) at /usr/src/debug/ceph-19.2.0-124.el9cp.x86_64/src/client/Client.cc:11381
    Client::CWF_iofinish::finish (this=<optimized out>, r=0) at /usr/src/debug/ceph-19.2.0-124.el9cp.x86_64/src/client/Client.h:1481
    0x00007f3112331ccd in Context::complete (this=0x7f302401afd0, r=<optimized out>) at /usr/src/debug/ceph-19.2.0-124.el9cp.x86_64/src/include/Context.h:99
    0x00007f31123c5242 in Client::C_Lock_Client_Finisher::finish (this=0x7f302403c9d0, r=0) at /usr/src/debug/ceph-19.2.0-124.el9cp.x86_64/src/client/Client.cc:11372
    0x00007f3112331ccd in Context::complete (this=0x7f302403c9d0, r=<optimized out>) at /usr/src/debug/ceph-19.2.0-124.el9cp.x86_64/src/include/Context.h:99
    0x00007f31134374ad in Finisher::finisher_thread_entry (this=0x564b491730b0) at /usr/src/debug/ceph-19.2.0-124.el9cp.x86_64/src/common/Finisher.cc:72
    0x00007f3115b227e2 in start_thread () from /lib64/libc.so.6
    0x00007f3115ba7800 in clone3 () from /lib64/libc.so.6
    0x0000000000000000 in ?? ()

Related issues 3 (1 open, 2 closed)

Related to CephFS - Bug #71515: qa: add test to validate fix for crash sue to asynchronous write and fsync running concurrently (Pending Backport, Venky Shankar)
Copied to CephFS - Backport #71708: tentacle: client: crash with concurrent nonblocking fsync and write (Resolved, Venky Shankar)
Copied to CephFS - Backport #71709: squid: client: crash with concurrent nonblocking fsync and write (Resolved, Venky Shankar)

#1

Updated by Venky Shankar 10 months ago

See Client::C_nonblocking_fsync_state::advance(), case 0 and case 1.

#2

Updated by Venky Shankar 10 months ago

  • Status changed from New to Fix Under Review
  • Pull request ID set to 63619

#3

Updated by Venky Shankar 10 months ago

  • Related to Bug #71515: qa: add test to validate fix for crash sue to asynchronous write and fsync running concurrently added

#4

Updated by Venky Shankar 10 months ago

  • Description updated (diff)

#5

Updated by Venky Shankar 9 months ago

  • Status changed from Fix Under Review to Pending Backport

#6

Updated by Upkeep Bot 9 months ago

  • Copied to Backport #71708: tentacle: client: crash with concurrent nonblocking fsync and write added

#7

Updated by Upkeep Bot 9 months ago

  • Copied to Backport #71709: squid: client: crash with concurrent nonblocking fsync and write added

#8

Updated by Upkeep Bot 9 months ago

  • Tags (freeform) set to backport_processed

#9

Updated by Upkeep Bot 9 months ago

  • Merge Commit set to f484edf976c350b2f4b42fe15e0498fb30cc449a
  • Fixed In set to v20.3.0-968-gf484edf976c
  • Upkeep Timestamp set to 2025-07-02T14:27:16+00:00

#10

Updated by Upkeep Bot 8 months ago

  • Fixed In changed from v20.3.0-968-gf484edf976c to v20.3.0-968-gf484edf976c3
  • Upkeep Timestamp changed from 2025-07-02T14:27:16+00:00 to 2025-07-14T15:20:03+00:00

#11

Updated by Upkeep Bot 8 months ago

  • Fixed In changed from v20.3.0-968-gf484edf976c3 to v20.3.0-968-gf484edf976
  • Upkeep Timestamp changed from 2025-07-14T15:20:03+00:00 to 2025-07-14T20:44:43+00:00

#12

Updated by Upkeep Bot 6 months ago

  • Status changed from Pending Backport to Resolved
  • Upkeep Timestamp changed from 2025-07-14T20:44:43+00:00 to 2025-09-23T12:45:53+00:00