Bug #70726
ceph_ll_nonblocking_readv_writev() fails to read with a pre-existing open file handle
Status: Closed
Description
After switching to asynchronous IO using the libcephfs low-level non-blocking API, we started observing failures as described in https://github.com/samba-in-kubernetes/sit-test-cases/issues/100. Analyzing the pattern, we identified that the failure is caused by an unexpected result from the underlying libcephfs API.
ceph_ll_nonblocking_readv_writev() returns 0 when trying to read data recently written by a different client, using an open file handle obtained before the write. See the attachment for a standalone reproducer.
# gcc -D_FILE_OFFSET_BITS=64 -lcephfs libcephfs_nonblocking_rw.c
# using ceph_ll_nonblocking_readv_writev()
# ./a.out foo /etc/ceph/ceph.conf cephvol test.dat 1
Bytes written = 131072
Bytes read = 0
# using ceph_ll_write() and ceph_ll_read()
# ./a.out foo /etc/ceph/ceph.conf cephvol test.dat 0
Bytes written = 131072
Bytes read = 131072
Updated by Dhairya Parmar 12 months ago
Reproducer: https://github.com/ceph/ceph/pull/62602
Updated by Dhairya Parmar 12 months ago · Edited
There is a cap acquisition issue: in this case, execution enters case 2:
} else if (onfinish) {
// CASE 2 - non-blocking caller with sync read from the OSD (since client
// does not have the required caps or the above config is set)
instead of case 1:
if (!conf->client_debug_force_sync_read &&
conf->client_oc &&
(have & (CEPH_CAP_FILE_CACHE | CEPH_CAP_FILE_LAZYIO))) {
// CASE 1 - blocking or non-blocking caller with the client holding Fc caps
// and client_debug_force_sync_read being default (`false`)
Comparison between the bugged and the normal case:
normal:
2025-04-01T18:24:55.440+0530 7f5c53cb9640 10 client.4291 get_caps 0x10000000000.head(faked_ino=0 nref=8 ll_ref=2 cap_refs={4=0,1024=1,4096=0,8192=1} open={1=1,3=1} mode=100744 size=131072/4194304 nlink=1 btime=2025-04-01T18:24:54.402387+0530 mtime=2025-04-01T18:24:54.441082+0530 ctime=2025-04-01T18:24:54.441082+0530 change_attr=1 caps=pAsxLsXsxFsxcrwb(0=pAsxLsXsxFsxcrwb) dirty_caps=Fw objectset[0x10000000000 ts 0/0 objects 1 dirty_or_tx 131072] parents=0x1.head["check_async_io_with_two_clients" ref=1 ino=0x10000000000.head csg=1] 0x7f5c0c006f70) have pAsxLsXsxFsxcrwb need Fr want Fc revoking -
bugged:
2025-04-01T18:10:55.699+0530 7f2bf668b640 10 client.4301 get_caps 0x10000000000.head(faked_ino=0 nref=4 ll_ref=1 cap_refs={} open={1=1} mode=100744 size=0/0 nlink=1 btime=2025-04-01T18:10:54.637539+0530 mtime=2025-04-01T18:10:54.637539+0530 ctime=2025-04-01T18:10:54.637539+0530 change_attr=0 caps=pAsLsXsFrw(0=pAsLsXsFrw) objectset[0x10000000000 ts 0/0 objects 0 dirty_or_tx 0]
i.e. CEPH_CAP_FILE_CACHE is missing
Updated by Dhairya Parmar 12 months ago
I'm surprised, but I see that for this case in particular, the code block executed for the write is:
else {
if (f->flags & O_DIRECT)
_flush_range(in, offset, size);
// simple, non-atomic sync write
if (onfinish == nullptr) {
// We need a safer condition to wait on.
cond_iofinish = new C_SaferCond();
filer_iofinish.reset(cond_iofinish);
} else {
//Register a wrapper callback C_Lock_Client_Finisher for the C_Write_Finisher which takes 'client_lock'.
//Use C_OnFinisher for callbacks. The op_cancel_writes has to be called without 'client_lock' held because
//the callback registered here needs to take it. This would cause incorrect lock order i.e., objecter->rwlock
//taken by objecter's op_cancel and then 'client_lock' taken by callback. To fix the lock order, queue
//the callback using the finisher
filer_iofinish.reset(new C_OnFinisher(new C_Lock_Client_Finisher(this, iofinish.get()), &objecter_finisher));
}
but it should be:
if (cct->_conf->client_oc &&
(have & (CEPH_CAP_FILE_BUFFER | CEPH_CAP_FILE_LAZYIO))) {
....
....
Updated by Shachar Sharon 12 months ago
Additional info:
if you add:
ceph_ll_getattr(r_cmount, r_inode, &stx, CEPH_STATX_BASIC_STATS, 0, r_perms);
immediately after:
ceph_ll_nonblocking_readv_writev(w_cmount, &io_info);
the problem disappears.
Updated by Venky Shankar 12 months ago
- Category set to Correctness/Safety
- Status changed from New to Triaged
- Assignee set to Venky Shankar
- Priority changed from Normal to Urgent
- Target version set to v20.0.0
- Source set to Development
- Backport set to squid
This is likely a fallout from the recent path_walk changes.
Updated by Venky Shankar 12 months ago
Venky Shankar wrote in #note-6:
This is likely a fallout from the recent path_walk changes.
I read it again; this is async IO, so this bug was probably only just discovered. I will have a look. Thanks for the reproducer.
Updated by Greg Farnum 12 months ago
Dhairya Parmar wrote in #note-3:
There is a cap acquisition issue, in this case the execution enters case 2:
[...]
instead of case 1:
[...]
comparison b/w bugged and normal case:
normal:
[...]
bugged:
[...]
i.e. CEPH_CAP_FILE_CACHE is missing
This is expected behavior: with one client it can buffer writes and cache reads, but with two clients it can't. So that branch itself looks correct; if it weren't taking this separate branch, that would neatly explain the described behavior!
Updated by Greg Farnum 12 months ago
Shachar Sharon wrote in #note-5:
Additional info:
if you add:
> ceph_ll_getattr(r_cmount, r_inode, &stx, CEPH_STATX_BASIC_STATS, 0, r_perms);
immediately after:
> ceph_ll_nonblocking_readv_writev(w_cmount, &io_info);
the problem disappears.
This makes me wonder if the issue is size synchronization between the clients, as the writer is extending the file size. It should be able to do the right thing in this case, but perhaps ceph_ll_nonblocking_readv_writev inadvertently skipped that logic.
I know we have major issues with simultaneous writers invoking "append", since the logic there is just "1) check the size on the MDS, 2) issue the write at that location to the OSDs".
Updated by Venky Shankar 11 months ago
Update: This issue is also reproducible with synchronous IO interfaces.
Updated by Venky Shankar 11 months ago
Anoop C S wrote in #note-11:
Venky Shankar wrote in #note-10:
Update: This issue is also reproducible with synchronous IO interfaces.
Oh.. how different is the reproducer? At least with the attached reproducer, ceph_ll_write() and ceph_ll_read() work fine.
I ran into this with cephfs-mirror daemon. I will share the failed teuthology run where this is seen.
Updated by Venky Shankar 11 months ago
Anoop C S wrote in #note-13:
just checking..
Where do we stand w.r.t the underlying issue?
Was out last week. Will prioritize for this week.
Updated by Venky Shankar 10 months ago
I couldn't spend focus time on this. Give me some more time and I will get it done.
Updated by Venky Shankar 10 months ago
The following seems buggy. In _read():
} else if (onfinish) {
  // CASE 2 - non-blocking caller with sync read from the OSD (since client
  // does not have the required caps or the above config is set)

  // handle _sync_read without blocking...
  // This sounds odd, but we want to accomplish what is done in the else
  // branch below but in a non-blocking fashion. The code in _read_sync
  // is duplicated and modified and exists in
  // C_Read_Sync_NonBlocking::finish().

  // trim read based on file size?
  if (std::cmp_greater_equal(offset, in->size) || (size == 0)) {
    // read is requested at the EOF or the read len is zero, therefore just
    // release managed pointers and complete the C_Read_Finisher immediately
    // with 0 bytes
    Context *iof = iofinish.release();
    crf.release();
    iof->complete(0);

    // Signal async completion
    return 0;
  }
In this case, the client does not have Fc caps, but is freely checking in->size, which could be 0 if the MDS hasn't sent the updated mtime/size to the client as part of a write operation from another client.
Updated by Venky Shankar 10 months ago
- Status changed from Triaged to Fix Under Review
- Pull request ID set to 63383
Updated by Venky Shankar 9 months ago
- Status changed from Fix Under Review to Pending Backport
- Backport changed from squid to tentacle,squid
Updated by Upkeep Bot 9 months ago
- Copied to Backport #71766: squid: ceph_ll_nonblocking_readv_writev() fails to read with a pre-existing open file handle added
Updated by Upkeep Bot 9 months ago
- Copied to Backport #71767: tentacle: ceph_ll_nonblocking_readv_writev() fails to read with a pre-existing open file handle added
Updated by Upkeep Bot 9 months ago
- Merge Commit set to daff91268dd005d82ec5f156f23f75d07c117065
- Fixed In set to v20.3.0-1068-gdaff91268dd
- Upkeep Timestamp set to 2025-07-08T14:47:02+00:00
Updated by Upkeep Bot 8 months ago
- Fixed In changed from v20.3.0-1068-gdaff91268dd to v20.3.0-1068-gdaff91268dd0
- Upkeep Timestamp changed from 2025-07-08T14:47:02+00:00 to 2025-07-14T15:21:09+00:00
Updated by Upkeep Bot 8 months ago
- Fixed In changed from v20.3.0-1068-gdaff91268dd0 to v20.3.0-1068-gdaff91268d
- Upkeep Timestamp changed from 2025-07-14T15:21:09+00:00 to 2025-07-14T20:45:43+00:00
Updated by Upkeep Bot 6 months ago
- Status changed from Pending Backport to Resolved
- Upkeep Timestamp changed from 2025-07-14T20:45:43+00:00 to 2025-09-23T12:46:31+00:00