Skip to content

client: fix async/sync I/O stalling due to buffer list exceeding INT_MAX#58564

Merged
vshankar merged 3 commits intoceph:mainfrom
dparmar18:wip-66245
Jul 21, 2025
Merged

client: fix async/sync I/O stalling due to buffer list exceeding INT_MAX#58564
vshankar merged 3 commits intoceph:mainfrom
dparmar18:wip-66245

Conversation

@dparmar18
Copy link
Contributor

@dparmar18 dparmar18 commented Jul 12, 2024

In linux, systems calls like write() anyways don't allow writes > 2GiB, the total write size passed to the Client::_write is clamped to INT_MAX but bufferlist is not handled. This PR addresses this issue which stalls async i/o due to:

unknown file: Failure
C++ exception with description "End of buffer [buffer:2]" thrown in the test body.
2024-05-28T16:17:06.854+0530 7f9a5d24c9c0  2 client.4311 unmount
2024-05-28T16:17:06.854+0530 7f9a5d24c9c0  2 client.4311 unmounting

which results in disconnected inode and cap leaks:

2024-05-28T16:17:11.855+0530 7f9a5d24c9c0  1 client.4311 dump_inode: DISCONNECTED inode 0x10000000001 #0x10000000001 ref 3 0x10000000001.head(faked_ino=0 nref=3 ll_ref=0 cap_refs={4=0,1024=1,4096=1,8192=2} open={3=0} mode=100666 size=0/4294967296 nlink=1 btime=2024-05-28T16:17:03.387546+0530 mtime=2024-05-28T16:17:03.387546+0530 ctime=2024-05-28T16:17:03.387546+0530 change_attr=0 caps=pAsxLsXsxFsxcrwb(0=pAsxLsXsxFsxcrwb) objectset[0x10000000001 ts 0/0 objects 1 dirty_or_tx 0] 0x7f9a2c009530)
2024-05-28T16:17:11.855+0530 7f9a5d24c9c0  2 client.4311 cache still has 0+1 items, waiting (for caps to release?)

Fixes: https://tracker.ceph.com/issues/66245
Signed-off-by: Dhairya Parmar dparmar@redhat.com

Contribution Guidelines

  • To sign and title your commits, please refer to Submitting Patches to Ceph.

  • If you are submitting a fix for a stable branch (e.g. "quincy"), please refer to Submitting Patches to Ceph - Backports for the proper workflow.

  • When filling out the below checklist, you may click boxes directly in the GitHub web UI. When entering or editing the entire PR message in the GitHub web UI editor, you may also select a checklist item by adding an x between the brackets: [x]. Spaces and capitalization matter when checking off items this way.

Checklist

  • Tracker (select at least one)
    • References tracker ticket
    • Very recent bug; references commit where it was introduced
    • New feature (ticket optional)
    • Doc update (no ticket needed)
    • Code cleanup (no ticket needed)
  • Component impact
    • Affects Dashboard, opened tracker ticket
    • Affects Orchestrator, opened tracker ticket
    • No impact that needs to be tracked
  • Documentation (select at least one)
    • Updates relevant documentation
    • No doc update is appropriate
  • Tests (select at least one)
Show available Jenkins commands
  • jenkins retest this please
  • jenkins test classic perf
  • jenkins test crimson perf
  • jenkins test signed
  • jenkins test make check
  • jenkins test make check arm64
  • jenkins test submodules
  • jenkins test dashboard
  • jenkins test dashboard cephadm
  • jenkins test api
  • jenkins test docs
  • jenkins render docs
  • jenkins test ceph-volume all
  • jenkins test ceph-volume tox
  • jenkins test windows
  • jenkins test rook e2e

@github-actions github-actions bot added cephfs Ceph File System tests labels Jul 12, 2024
@dparmar18 dparmar18 requested a review from a team July 12, 2024 15:35
@dparmar18
Copy link
Contributor Author

this passes locally:

[----------] 19 tests from TestClient (27089 ms total)

[----------] Global test environment tear-down
[==========] 19 tests from 1 test suite ran. (27089 ms total)
[  PASSED  ] 19 tests.

@chrisphoffman
Copy link
Contributor

I've tried compiling this on my local machine am I'm getting:

nonblocking.cc:637:18: error: aggregate ‘statvfs stbuf’ has incomplete type and cannot be defined
637 |   struct statvfs stbuf;

I was able to work around the issue by adding
#include <sys/statvfs.h>

@dparmar18
Copy link
Contributor Author

I've tried compiling this on my local machine am I'm getting:

nonblocking.cc:637:18: error: aggregate ‘statvfs stbuf’ has incomplete type and cannot be defined
637 |   struct statvfs stbuf;

I was able to work around the issue by adding #include <sys/statvfs.h>

this is added here. I'm expecting that to be merged first but I guess I should add a temp commit adding the dependency

@dparmar18
Copy link
Contributor Author

dparmar18 commented Jul 17, 2024

Hey @chrisphoffman, added a temp commit. Thanks for taking a look :)

Comment on lines +11526 to +11527
if (total_appended >= static_cast<ssize_t>(size)) {
bl.append(nullptr, 0);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shouldn't we just stop appending nullptr buffer pointers and break out of the for loop here ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

makes sense

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PTAL

@dparmar18
Copy link
Contributor Author

last push enhanced the code to handle cases where if:

  1. current iov is gte INT_MAX
  2. total size appended is gte INT_MAX
  3. both 1 and 2.

@dparmar18 dparmar18 force-pushed the wip-66245 branch 2 times, most recently from 57bed9a to c4b591d Compare August 21, 2024 12:34
@dparmar18 dparmar18 changed the title Client: Fix async I/O stalling due to buffer list exceeding total write size Client: Fix async/sync I/O stalling due to buffer list exceeding INT_MAX Aug 21, 2024
@github-actions
Copy link

This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved

@vshankar
Copy link
Contributor

RFR @ceph/cephfs

@vshankar
Copy link
Contributor

@dparmar18 this required a rebase please.

@dparmar18 dparmar18 requested a review from batrick May 8, 2025 07:46
@dparmar18
Copy link
Contributor Author

@vshankar ping

@vshankar
Copy link
Contributor

@vshankar ping

Noted. In my review for this week.

@dparmar18
Copy link
Contributor Author

@vshankar do we have any update on this?

@vshankar
Copy link
Contributor

@vshankar do we have any update on this?

I was catching up on backlog. Getting to this soon.

@vshankar
Copy link
Contributor

jenkins retest this please

@dparmar18
Copy link
Contributor Author

jenkins test make check

@vshankar
Copy link
Contributor

jenkins test make check

@dparmar18
Copy link
Contributor Author

@vshankar was this tested?

@vshankar
Copy link
Contributor

@vshankar was this tested?

No, not really. But I'll put this with the other client side fixes.

@vshankar
Copy link
Contributor

jenkins test make check

@vshankar
Copy link
Contributor

This PR is under test in https://tracker.ceph.com/issues/71837.

vshankar added a commit to vshankar/ceph that referenced this pull request Jun 25, 2025
* refs/pull/58564/head:
	client: clamp sizes to INT_MAX in sync i/o code paths
	client: restrict bufferlist to total write size
	src/test: test sync/async i/o code paths with huge (4GiB) buffers

Reviewed-by: Venky Shankar <vshankar@redhat.com>
Reviewed-by: Kotresh Hiremath Ravishankar <khiremat@redhat.com>
Reviewed-by: Christopher Hoffman <choffman@redhat.com>
Copy link
Contributor

@vshankar vshankar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@vshankar vshankar dismissed batrick’s stale review July 21, 2025 04:06

Seems to be that the comments have been addressed.

@vshankar vshankar merged commit fa6e884 into ceph:main Jul 21, 2025
12 checks passed
zmc pushed a commit to ceph/ceph-ci that referenced this pull request Aug 5, 2025
In linux, systems calls like write() anyways don't allow writes > 2GiB,
the total write size passed to the Client::_write is clamped to INT_MAX
but bufferlist is not handled. This edge case is patched here.

bug that arises due to buffer list beyond INT_MAX stalls async i/o due to:
unknown file: Failure
C++ exception with description "End of buffer [buffer:2]" thrown in the test body.
2024-05-28T16:17:06.854+0530 7f9a5d24c9c0  2 client.4311 unmount
2024-05-28T16:17:06.854+0530 7f9a5d24c9c0  2 client.4311 unmounting

which results in disconnected inode and cap leaks:
2024-05-28T16:17:11.855+0530 7f9a5d24c9c0  1 client.4311 dump_inode: DISCONNECTED inode 0x10000000001 #0x10000000001 ref 3 0x10000000001.head(faked_ino=0 nref=3 ll_ref=0 cap_refs={4=0,1024=1,4096=1,8192=2} open={3=0} mode=100666 size=0/4294967296 nlink=1 btime=2024-05-28T16:17:03.387546+0530 mtime=2024-05-28T16:17:03.387546+0530 ctime=2024-05-28T16:17:03.387546+0530 change_attr=0 caps=pAsxLsXsxFsxcrwb(0=pAsxLsXsxFsxcrwb) objectset[0x10000000001 ts 0/0 objects 1 dirty_or_tx 0] 0x7f9a2c009530)
2024-05-28T16:17:11.855+0530 7f9a5d24c9c0  2 client.4311 cache still has 0+1 items, waiting (for caps to release?)

This commit changes the way Client::_write accepts data. So, now instead of accepting ptr,
iovec array and iovcnt, the helper now accepts a bufferlist which should be contructed by
the caller itself. The reason behind this change is simple - to declutter the API.
For more context checkout this conversation ceph/ceph#58564 (comment)

Fixes: https://tracker.ceph.com/issues/66245
Signed-off-by: Dhairya Parmar <dparmar@redhat.com>
joscollin pushed a commit to joscollin/ceph that referenced this pull request Aug 27, 2025
In linux, systems calls like write() anyways don't allow writes > 2GiB,
the total write size passed to the Client::_write is clamped to INT_MAX
but bufferlist is not handled. This edge case is patched here.

bug that arises due to buffer list beyond INT_MAX stalls async i/o due to:
unknown file: Failure
C++ exception with description "End of buffer [buffer:2]" thrown in the test body.
2024-05-28T16:17:06.854+0530 7f9a5d24c9c0  2 client.4311 unmount
2024-05-28T16:17:06.854+0530 7f9a5d24c9c0  2 client.4311 unmounting

which results in disconnected inode and cap leaks:
2024-05-28T16:17:11.855+0530 7f9a5d24c9c0  1 client.4311 dump_inode: DISCONNECTED inode 0x10000000001 #0x10000000001 ref 3 0x10000000001.head(faked_ino=0 nref=3 ll_ref=0 cap_refs={4=0,1024=1,4096=1,8192=2} open={3=0} mode=100666 size=0/4294967296 nlink=1 btime=2024-05-28T16:17:03.387546+0530 mtime=2024-05-28T16:17:03.387546+0530 ctime=2024-05-28T16:17:03.387546+0530 change_attr=0 caps=pAsxLsXsxFsxcrwb(0=pAsxLsXsxFsxcrwb) objectset[0x10000000001 ts 0/0 objects 1 dirty_or_tx 0] 0x7f9a2c009530)
2024-05-28T16:17:11.855+0530 7f9a5d24c9c0  2 client.4311 cache still has 0+1 items, waiting (for caps to release?)

This commit changes the way Client::_write accepts data. So, now instead of accepting ptr,
iovec array and iovcnt, the helper now accepts a bufferlist which should be contructed by
the caller itself. The reason behind this change is simple - to declutter the API.
For more context checkout this conversation ceph#58564 (comment)

Fixes: https://tracker.ceph.com/issues/66245
Signed-off-by: Dhairya Parmar <dparmar@redhat.com>
(cherry picked from commit f4f359e)
dparmar18 added a commit to dparmar18/ceph that referenced this pull request Aug 28, 2025
In linux, systems calls like write() anyways don't allow writes > 2GiB,
the total write size passed to the Client::_write is clamped to INT_MAX
but bufferlist is not handled. This edge case is patched here.

bug that arises due to buffer list beyond INT_MAX stalls async i/o due to:
unknown file: Failure
C++ exception with description "End of buffer [buffer:2]" thrown in the test body.
2024-05-28T16:17:06.854+0530 7f9a5d24c9c0  2 client.4311 unmount
2024-05-28T16:17:06.854+0530 7f9a5d24c9c0  2 client.4311 unmounting

which results in disconnected inode and cap leaks:
2024-05-28T16:17:11.855+0530 7f9a5d24c9c0  1 client.4311 dump_inode: DISCONNECTED inode 0x10000000001 #0x10000000001 ref 3 0x10000000001.head(faked_ino=0 nref=3 ll_ref=0 cap_refs={4=0,1024=1,4096=1,8192=2} open={3=0} mode=100666 size=0/4294967296 nlink=1 btime=2024-05-28T16:17:03.387546+0530 mtime=2024-05-28T16:17:03.387546+0530 ctime=2024-05-28T16:17:03.387546+0530 change_attr=0 caps=pAsxLsXsxFsxcrwb(0=pAsxLsXsxFsxcrwb) objectset[0x10000000001 ts 0/0 objects 1 dirty_or_tx 0] 0x7f9a2c009530)
2024-05-28T16:17:11.855+0530 7f9a5d24c9c0  2 client.4311 cache still has 0+1 items, waiting (for caps to release?)

This commit changes the way Client::_write accepts data. So, now instead of accepting ptr,
iovec array and iovcnt, the helper now accepts a bufferlist which should be contructed by
the caller itself. The reason behind this change is simple - to declutter the API.
For more context checkout this conversation ceph#58564 (comment)

Fixes: https://tracker.ceph.com/issues/66245
Signed-off-by: Dhairya Parmar <dparmar@redhat.com>
(cherry picked from commit f4f359e)
dparmar18 added a commit to dparmar18/ceph that referenced this pull request Aug 28, 2025
In linux, systems calls like write() anyways don't allow writes > 2GiB,
the total write size passed to the Client::_write is clamped to INT_MAX
but bufferlist is not handled. This edge case is patched here.

bug that arises due to buffer list beyond INT_MAX stalls async i/o due to:
unknown file: Failure
C++ exception with description "End of buffer [buffer:2]" thrown in the test body.
2024-05-28T16:17:06.854+0530 7f9a5d24c9c0  2 client.4311 unmount
2024-05-28T16:17:06.854+0530 7f9a5d24c9c0  2 client.4311 unmounting

which results in disconnected inode and cap leaks:
2024-05-28T16:17:11.855+0530 7f9a5d24c9c0  1 client.4311 dump_inode: DISCONNECTED inode 0x10000000001 #0x10000000001 ref 3 0x10000000001.head(faked_ino=0 nref=3 ll_ref=0 cap_refs={4=0,1024=1,4096=1,8192=2} open={3=0} mode=100666 size=0/4294967296 nlink=1 btime=2024-05-28T16:17:03.387546+0530 mtime=2024-05-28T16:17:03.387546+0530 ctime=2024-05-28T16:17:03.387546+0530 change_attr=0 caps=pAsxLsXsxFsxcrwb(0=pAsxLsXsxFsxcrwb) objectset[0x10000000001 ts 0/0 objects 1 dirty_or_tx 0] 0x7f9a2c009530)
2024-05-28T16:17:11.855+0530 7f9a5d24c9c0  2 client.4311 cache still has 0+1 items, waiting (for caps to release?)

This commit changes the way Client::_write accepts data. So, now instead of accepting ptr,
iovec array and iovcnt, the helper now accepts a bufferlist which should be contructed by
the caller itself. The reason behind this change is simple - to declutter the API.
For more context checkout this conversation ceph#58564 (comment)

Fixes: https://tracker.ceph.com/issues/66245
Signed-off-by: Dhairya Parmar <dparmar@redhat.com>
(cherry picked from commit f4f359e)
dparmar18 added a commit to dparmar18/ceph that referenced this pull request Aug 28, 2025
In linux, systems calls like write() anyways don't allow writes > 2GiB,
the total write size passed to the Client::_write is clamped to INT_MAX
but bufferlist is not handled. This edge case is patched here.

bug that arises due to buffer list beyond INT_MAX stalls async i/o due to:
unknown file: Failure
C++ exception with description "End of buffer [buffer:2]" thrown in the test body.
2024-05-28T16:17:06.854+0530 7f9a5d24c9c0  2 client.4311 unmount
2024-05-28T16:17:06.854+0530 7f9a5d24c9c0  2 client.4311 unmounting

which results in disconnected inode and cap leaks:
2024-05-28T16:17:11.855+0530 7f9a5d24c9c0  1 client.4311 dump_inode: DISCONNECTED inode 0x10000000001 #0x10000000001 ref 3 0x10000000001.head(faked_ino=0 nref=3 ll_ref=0 cap_refs={4=0,1024=1,4096=1,8192=2} open={3=0} mode=100666 size=0/4294967296 nlink=1 btime=2024-05-28T16:17:03.387546+0530 mtime=2024-05-28T16:17:03.387546+0530 ctime=2024-05-28T16:17:03.387546+0530 change_attr=0 caps=pAsxLsXsxFsxcrwb(0=pAsxLsXsxFsxcrwb) objectset[0x10000000001 ts 0/0 objects 1 dirty_or_tx 0] 0x7f9a2c009530)
2024-05-28T16:17:11.855+0530 7f9a5d24c9c0  2 client.4311 cache still has 0+1 items, waiting (for caps to release?)

This commit changes the way Client::_write accepts data. So, now instead of accepting ptr,
iovec array and iovcnt, the helper now accepts a bufferlist which should be contructed by
the caller itself. The reason behind this change is simple - to declutter the API.
For more context checkout this conversation ceph#58564 (comment)

Fixes: https://tracker.ceph.com/issues/66245
Signed-off-by: Dhairya Parmar <dparmar@redhat.com>
(cherry picked from commit f4f359e)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

cephfs Ceph File System tests wip-mchangir-testing not yet production ready

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants