Bug #70726


ceph_ll_nonblocking_readv_writev() fails to read with a pre-existing open file handle

Added by Anoop C S 12 months ago. Updated 6 months ago.

Status:
Resolved
Priority:
Urgent
Assignee:
Category:
Correctness/Safety
Target version:
% Done:

0%

Source:
Development
Backport:
tentacle,squid
Regression:
No
Severity:
2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
Client, Samba VFS, libcephfs
Labels (FS):
Samba/CIFS
Pull request ID:
Tags (freeform):
backport_processed
Fixed In:
v20.3.0-1068-gdaff91268d
Released In:
Upkeep Timestamp:
2025-09-23T12:46:31+00:00

Description

After switching to asynchronous IO using the libcephfs low-level non-blocking API, we started observing failures as described in https://github.com/samba-in-kubernetes/sit-test-cases/issues/100. Analyzing the pattern, we identified that the failure is due to an unexpected result from the underlying libcephfs API.

ceph_ll_nonblocking_readv_writev() returns 0 when trying to read data recently written by a different client, using an open file handle obtained prior to the write. See the attachment for a standalone reproducer.

# gcc -D_FILE_OFFSET_BITS=64 libcephfs_nonblocking_rw.c -lcephfs

# using ceph_ll_nonblocking_readv_writev()
# ./a.out foo /etc/ceph/ceph.conf cephvol test.dat 1
Bytes written = 131072
Bytes read = 0

# using ceph_ll_write() and ceph_ll_read()
# ./a.out foo /etc/ceph/ceph.conf cephvol test.dat 0
Bytes written = 131072
Bytes read = 131072

Files

libcephfs_nonblocking_rw.c (5.38 KB) libcephfs_nonblocking_rw.c Anoop C S, 03/31/2025 11:20 AM

Related issues 2 (0 open, 2 closed)

Copied to CephFS - Backport #71766: squid: ceph_ll_nonblocking_readv_writev() fails to read with a pre-existing open file handle (Resolved, Venky Shankar)
Copied to CephFS - Backport #71767: tentacle: ceph_ll_nonblocking_readv_writev() fails to read with a pre-existing open file handle (Resolved, Venky Shankar)
Actions #1

Updated by Anoop C S 12 months ago

  • Description updated (diff)
Actions #3

Updated by Dhairya Parmar 12 months ago · Edited

There is a cap acquisition issue; in this case the execution enters case 2:

} else if (onfinish) {
    // CASE 2 - non-blocking caller with sync read from the OSD (since client
    //          does not have the required caps or the above config is set)

instead of case 1:
if (!conf->client_debug_force_sync_read &&
      conf->client_oc &&
      (have & (CEPH_CAP_FILE_CACHE | CEPH_CAP_FILE_LAZYIO))) {
    // CASE 1 - blocking or non-blocking caller with the client holding Fc caps
    //          and client_debug_force_sync_read being default (false)

Comparison between the bugged and normal case:
normal:

2025-04-01T18:24:55.440+0530 7f5c53cb9640 10 client.4291 get_caps 0x10000000000.head(faked_ino=0 nref=8 ll_ref=2 cap_refs={4=0,1024=1,4096=0,8192=1} open={1=1,3=1} mode=100744 size=131072/4194304 nlink=1 btime=2025-04-01T18:24:54.402387+0530 mtime=2025-04-01T18:24:54.441082+0530 ctime=2025-04-01T18:24:54.441082+0530 change_attr=1 caps=pAsxLsXsxFsxcrwb(0=pAsxLsXsxFsxcrwb) dirty_caps=Fw objectset[0x10000000000 ts 0/0 objects 1 dirty_or_tx 131072] parents=0x1.head["check_async_io_with_two_clients" ref=1 ino=0x10000000000.head csg=1] 0x7f5c0c006f70) have pAsxLsXsxFsxcrwb need Fr want Fc revoking -

bugged:

2025-04-01T18:10:55.699+0530 7f2bf668b640 10 client.4301 get_caps 0x10000000000.head(faked_ino=0 nref=4 ll_ref=1 cap_refs={} open={1=1} mode=100744 size=0/0 nlink=1 btime=2025-04-01T18:10:54.637539+0530 mtime=2025-04-01T18:10:54.637539+0530 ctime=2025-04-01T18:10:54.637539+0530 change_attr=0 caps=pAsLsXsFrw(0=pAsLsXsFrw) objectset[0x10000000000 ts 0/0 objects 0 dirty_or_tx 0]

i.e. CEPH_CAP_FILE_CACHE is missing

Actions #4

Updated by Dhairya Parmar 12 months ago

I'm surprised, but I see that for this case especially, the call block for the write is:

else {
    if (f->flags & O_DIRECT)
      _flush_range(in, offset, size);

    // simple, non-atomic sync write
    if (onfinish == nullptr) {
      // We need a safer condition to wait on.
      cond_iofinish = new C_SaferCond();
      filer_iofinish.reset(cond_iofinish);
    } else {
      //Register a wrapper callback C_Lock_Client_Finisher for the C_Write_Finisher which takes 'client_lock'.
      //Use C_OnFinisher for callbacks. The op_cancel_writes has to be called without 'client_lock' held because
      //the callback registered here needs to take it. This would cause incorrect lock order i.e., objecter->rwlock
      //taken by objecter's op_cancel and then 'client_lock' taken by callback. To fix the lock order, queue
      //the callback using the finisher
      filer_iofinish.reset(new C_OnFinisher(new C_Lock_Client_Finisher(this, iofinish.get()), &objecter_finisher));
    }

but it should be:

if (cct->_conf->client_oc &&
      (have & (CEPH_CAP_FILE_BUFFER | CEPH_CAP_FILE_LAZYIO))) {
....
....

Actions #5

Updated by Shachar Sharon 12 months ago

Additional info:

if you add:
ceph_ll_getattr(r_cmount, r_inode, &stx, CEPH_STATX_BASIC_STATS, 0, r_perms);

immediately after:
ceph_ll_nonblocking_readv_writev(w_cmount, &io_info);

the problem disappears.

Actions #6

Updated by Venky Shankar 12 months ago

  • Category set to Correctness/Safety
  • Status changed from New to Triaged
  • Assignee set to Venky Shankar
  • Priority changed from Normal to Urgent
  • Target version set to v20.0.0
  • Source set to Development
  • Backport set to squid

This is likely a fallout from the recent path_walk changes.

Actions #7

Updated by Venky Shankar 12 months ago

Venky Shankar wrote in #note-6:

This is likely a fallout from the recent path_walk changes.

I read it again, this is async IO, so probably this bug was just discovered. I will have a look. Thanks for the reproducer.

Actions #8

Updated by Greg Farnum 12 months ago

Dhairya Parmar wrote in #note-3:

There is a cap acquisition issue, in this case the execution enters case 2:
[...]
instead of case 1:
[...]

Comparison between the bugged and normal case:
normal:
[...]

bugged:
[...]

i.e. CEPH_CAP_FILE_CACHE is missing

This is expected behavior: with one client it can buffer writes and cache reads, but with two clients it can't. So that looks to be as expected; if it weren't taking this separate branch, that would neatly explain the described behavior!

Actions #9

Updated by Greg Farnum 12 months ago

Shachar Sharon wrote in #note-5:

Additional info:

if you add:
ceph_ll_getattr(r_cmount, r_inode, &stx, CEPH_STATX_BASIC_STATS, 0, r_perms);

immediately after:
ceph_ll_nonblocking_readv_writev(w_cmount, &io_info);

the problem disappears.

This makes me wonder if the issue is size synchronization between the clients, as the writer is extending the file size. It should be able to do the right thing in this case, but perhaps ceph_ll_nonblocking_readv_writev inadvertently skipped that logic.
I know we have major issues with simultaneous writers invoking "append", since the logic there is just: 1) check size on the MDS, 2) issue the write at that location to the OSDs.

Actions #10

Updated by Venky Shankar 11 months ago

Update: This issue is also reproducible with synchronous IO interfaces.

Actions #11

Updated by Anoop C S 11 months ago

Venky Shankar wrote in #note-10:

Update: This issue is also reproducible with synchronous IO interfaces.

Oh.. how different is the reproducer? At least with the attached reproducer, ceph_ll_write() and ceph_ll_read() work fine.

Actions #12

Updated by Venky Shankar 11 months ago

Anoop C S wrote in #note-11:

Venky Shankar wrote in #note-10:

Update: This issue is also reproducible with synchronous IO interfaces.

Oh.. how different is the reproducer? At least with the attached reproducer, ceph_ll_write() and ceph_ll_read() work fine.

I ran into this with cephfs-mirror daemon. I will share the failed teuthology run where this is seen.

Actions #13

Updated by Anoop C S 11 months ago

just checking..

Where do we stand w.r.t the underlying issue?

Actions #14

Updated by Venky Shankar 11 months ago

Anoop C S wrote in #note-13:

just checking..

Where do we stand w.r.t the underlying issue?

Was out last week. Will prioritize for this week.

Actions #15

Updated by Venky Shankar 10 months ago

I couldn't spend focus time on this. Give me some more time and I will get it done.

Actions #16

Updated by Venky Shankar 10 months ago

The following seems buggy. In _read():

  } else if (onfinish) {
    // CASE 2 - non-blocking caller with sync read from the OSD (since client
    //          does not have the required caps or the above config is set)

    // handle _sync_read without blocking...
    // This sounds odd, but we want to accomplish what is done in the else
    // branch below but in a non-blocking fashion. The code in _read_sync
    // is duplicated and modified and exists in
    // C_Read_Sync_NonBlocking::finish().

    // trim read based on file size?
    if (std::cmp_greater_equal(offset, in->size) || (size == 0)) {
      // read is requested at the EOF or the read len is zero, therefore just
      // release managed pointers and complete the C_Read_Finisher immediately with 0 bytes

      Context *iof = iofinish.release();
      crf.release();
      iof->complete(0);

      // Signal async completion
      return 0;
    }

In this case, the client does not have Fc caps, but is freely checking in->size, which could be 0 if the MDS hasn't sent the updated mtime/size to the client as part of a write operation from another client.

Actions #17

Updated by Venky Shankar 10 months ago

  • Status changed from Triaged to Fix Under Review
  • Pull request ID set to 63383
Actions #18

Updated by Venky Shankar 9 months ago

  • Status changed from Fix Under Review to Pending Backport
  • Backport changed from squid to tentacle,squid
Actions #19

Updated by Upkeep Bot 9 months ago

  • Copied to Backport #71766: squid: ceph_ll_nonblocking_readv_writev() fails to read with a pre-existing open file handle added
Actions #20

Updated by Upkeep Bot 9 months ago

  • Copied to Backport #71767: tentacle: ceph_ll_nonblocking_readv_writev() fails to read with a pre-existing open file handle added
Actions #21

Updated by Upkeep Bot 9 months ago

  • Tags (freeform) set to backport_processed
Actions #22

Updated by Upkeep Bot 9 months ago

  • Merge Commit set to daff91268dd005d82ec5f156f23f75d07c117065
  • Fixed In set to v20.3.0-1068-gdaff91268dd
  • Upkeep Timestamp set to 2025-07-08T14:47:02+00:00
Actions #23

Updated by Upkeep Bot 8 months ago

  • Fixed In changed from v20.3.0-1068-gdaff91268dd to v20.3.0-1068-gdaff91268dd0
  • Upkeep Timestamp changed from 2025-07-08T14:47:02+00:00 to 2025-07-14T15:21:09+00:00
Actions #24

Updated by Upkeep Bot 8 months ago

  • Fixed In changed from v20.3.0-1068-gdaff91268dd0 to v20.3.0-1068-gdaff91268d
  • Upkeep Timestamp changed from 2025-07-14T15:21:09+00:00 to 2025-07-14T20:45:43+00:00
Actions #25

Updated by Upkeep Bot 6 months ago

  • Status changed from Pending Backport to Resolved
  • Upkeep Timestamp changed from 2025-07-14T20:45:43+00:00 to 2025-09-23T12:46:31+00:00