WIP - libcephfs: add async I/O capability#44991

Closed
ffilz wants to merge 12 commits into ceph:main from ffilz:async-io
Conversation

@ffilz ffilz commented Feb 11, 2022

This is the first pass at supporting async I/O calls from Ganesha. I would appreciate another set of eyes to verify that I have appropriately covered all the bases. I am concerned that somehow I have left a path for synchronous completion which leaves confusion about the return value. The intent is to have all I/O complete async even if it completes before returning to the caller (which is prepared to deal with that situation). I'm also concerned about the best way to detect EOF. Ganesha currently assumes a 0 length read indicates EOF, but maybe we can do better.

Signed-off-by: Frank S. Filz <ffilzlnx@mindspring.com>

I haven't tried to complete the checklist, I know there is at least documentation that should be written, and tests are yet to be written.

Checklist

  • Tracker (select at least one)
    • References tracker ticket
    • Very recent bug; references commit where it was introduced
    • New feature (ticket optional)
    • Doc update (no ticket needed)
    • Code cleanup (no ticket needed)
  • Component impact
    • Affects Dashboard, opened tracker ticket
    • Affects Orchestrator, opened tracker ticket
    • No impact that needs to be tracked
  • Documentation (select at least one)
    • Updates relevant documentation
    • No doc update is appropriate
  • Tests (select at least one)

@github-actions github-actions bot added the cephfs Ceph File System label Feb 11, 2022
@vshankar vshankar requested a review from a team February 14, 2022 06:57
@github-actions

This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved

@jtlayton jtlayton changed the title WIP - libcephfs: add anych I/O capability WIP - libcephfs: add async I/O capability Feb 15, 2022
@adamemerson adamemerson self-requested a review March 3, 2022 18:30
@ajarr ajarr self-requested a review March 21, 2022 14:34
@vshankar vshankar requested a review from jtlayton March 23, 2022 13:19
@jtlayton
Contributor

jtlayton commented Mar 25, 2022

AIUI, the only reliable way to detect the EOF in ceph is to keep track of the inode size and clamp your reads there. Because we're reading from an array of objects, a short read doesn't necessarily indicate an EOF condition. It may just be that the object has a sparse region at the end. Ditto for a non-existent OSD object.

That said, the cephfs read handler should fix up the result of the read from the OSD to look like a short read if you're at the EOF.
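The size-tracking approach described above can be sketched as a small helper that clamps the requested length at the tracked inode size, so a result of 0 unambiguously means EOF. This is an illustrative sketch with invented names, not the actual libcephfs code:

```cpp
#include <algorithm>
#include <cstdint>

// Clamp a read request against the tracked inode size.
// Returns the number of bytes to actually request from the OSDs;
// 0 means the offset is at or past EOF.
uint64_t clamp_read(uint64_t inode_size, uint64_t offset, uint64_t wanted)
{
    if (offset >= inode_size)
        return 0;  // at or past EOF: nothing to read
    return std::min(wanted, inode_size - offset);
}
```

With this, a caller like Ganesha never has to infer EOF from a 0-length read of the OSDs; the clamp makes the condition explicit before the I/O is issued.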

@jtlayton
Contributor

This patch is enormous. It'd be very nice to break this up into smaller changes, as it's a little hard for me to follow what your general approach is. That'll also be nice if this causes a regression and we have to bisect to find it.

@mchangir
Contributor

AIUI, the only reliable way to detect the EOF in ceph is to keep track of the inode size and clamp your reads there. Because we're reading from an array of objects, a short read doesn't necessarily indicate an EOF condition. It may just be that the object has a sparse region at the end. Ditto for a non-existent OSD object.

That's odd. For an existing object, I'd expect the object size to be truncated to the size defined in its file layout to avoid a messy short read. Or is that intentional?

That said, the cephfs read handler should fix up the result of the read from the OSD to look like a short read if you're at the EOF.

@jtlayton
Contributor

That's odd. For an existing object, I'd expect the object size to be truncated to the size defined in its file layout to avoid a messy short read. Or is that intentional?

The OSD doesn't really have the concept of file layouts. Those are a cephfs concept and are managed by the MDS. All the OSD understands is objects, and multiple (usually 4M) objects back a single cephfs inode.

OSD objects can be sparse with byte granularity as well, so an object really only stores what the clients have written. If the client leaves a sparse area in an object, or never creates an object because it would be completely sparse, that's OK and is preferable, since it takes up less space in the backing store. This is also what makes fscrypt possible -- we have to know which parts are actually sparse to properly decrypt the data.

The upshot is that cephfs clients must be prepared to zero-fill these holes in the OSD data for the caller on a read. A hole could even be at the end of the file, which is why we have to carefully track the file size.
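The zero-fill obligation can be sketched as follows; `zero_fill_tail` and its parameters are hypothetical names for illustration, not the real Client read path:

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// After an OSD read returns `got` bytes of a `wanted`-byte request,
// pad the buffer with zeros for any hole that is still inside the
// file, i.e. within [offset, offset + wanted) and below file_size.
void zero_fill_tail(std::vector<char>& buf, uint64_t got,
                    uint64_t wanted, uint64_t offset, uint64_t file_size)
{
    uint64_t in_file = (offset < file_size)
        ? std::min(wanted, file_size - offset)
        : 0;
    if (got < in_file)
        buf.resize(in_file, '\0');  // hole inside the file: zero-fill
    // bytes past file_size are simply not returned (short read at EOF)
}
```

The key point is that padding stops at the tracked file size: a hole inside the file becomes zeros, while a genuinely short read at EOF stays short.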

@adamemerson adamemerson self-assigned this Apr 12, 2022
@ffilz ffilz force-pushed the async-io branch 2 times, most recently from dc87488 to 24e465e Compare April 21, 2022 00:09
@ffilz ffilz force-pushed the async-io branch 2 times, most recently from a20a1e8 to 5f879a8 Compare April 29, 2022 23:27
@ffilz
Contributor Author

ffilz commented May 17, 2022

I've updated this patch set. I hope I've addressed comments. I've also added a unit test and resolved several issues that arose from testing. I will shortly start testing from Ganesha.

int Client::uninline_data(Inode *in, Context *onfinish)
{
if (!in->inline_data.length()) {
// TODO - should we drop the client_lock here?
Contributor

@vshankar Can you weigh in here?

Contributor

Sorry - I wasn't following this PR closely. I'll take a look next week.

Contributor

Although the onfinish context signals a condition variable on completion, it's a no-op if there's no inline data to be uninlined, since the onfinish context is not handed over to an asynchronous network operation to signal its completion. So, IMO, we don't need to drop the client_lock here.

Contributor Author

Thanks. Removed the comment.

Contributor Author

Also, onuninline is now only done from the write path AND is done synchronously, so all the async onuninline bits that I had added are gone.

@djgalloway djgalloway changed the base branch from master to main May 25, 2022 20:03
@ffilz ffilz requested a review from batrick June 23, 2022 17:14
Contributor

@ajarr ajarr left a comment

I've gone through the write path and Client::_write in detail. I have a couple of minor questions/comments.

I still need to go through the read path, Client::_read(), in detail. I've yet to go through the fsync-related changes.

iofinished_r = 0;
onuninlinefinished_r = 0;
iofinished = false;
onuninlinefinished = onuninline == nullptr;
Contributor

onuninlinefinished = false;

Contributor Author

@ffilz ffilz Jul 12, 2022

We don't always have to uninline a file. If so, there will be no onuninline Context, so we need to mark it already finished.

I should comment to that effect though...

Member

@ajarr means the type of onuninlinefinished is bool, you should assign it to false instead of nullptr.

Contributor

@lxbsz thanks! That's what I meant. onuninlinefinished is bool.

Contributor Author

Right, onuninlinefinished is bool. It isn't being assigned onuninline.

It's being assigned the result of (onuninline == nullptr). The == operator results in a bool.
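As a minimal illustration of the idiom under discussion (with `Context` stubbed out here, not the real Ceph type):

```cpp
struct Context {};  // stand-in for the Ceph Context type

// Comparing a pointer against nullptr yields a bool, so the flag is
// initialized to "already finished" exactly when there is no uninline
// Context to wait for.
bool init_onuninlinefinished(const Context* onuninline)
{
    return onuninline == nullptr;
}
```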

Contributor

Ah OK! It looked like onunlinefinished = onunline = false. Thanks!

Yes, a comment mentioning "We don't always have to uninline a file. If so, there will be no onuninline Context, so we need to mark it already finished." should also help.

Contributor Author

Removing/changing the special onuninline handling, so I think this one becomes moot.

iofinished_r = 0;
onuninlinefinished_r = 0;
iofinished = false;
onuninlinefinished = onuninline == nullptr;
Contributor

onuninlinefinished = false;

Contributor Author

We don't always have to uninline a file. If so, there will be no onuninline Context, so we need to mark it already finished.

I should comment to that effect though...

Contributor

Got it. A comment will help.

Contributor Author

Removing/changing the special onuninline handling, so I think this one becomes moot.

// time
lat = ceph_clock_now();
lat -= start;
logger->tinc(l_c_wrlat, lat);
Contributor

The above line was replaced by:

++nr_write_request;
update_io_stat_write(lat);

Please see the following lines in the commit, 967e24fe5c0efd9d7#diff-7a3052fe46aebfed0382c9d0bb9880ea1328add824e0b10c5d551ddfee282cd1R10572

Contributor Author

Thanks for that, I put that change in.

* License version 2.1, as published by the Free Software
* Foundation. See file COPYING.
*
*/
Contributor

I still see 2021?

if (rc < 0)
goto done;
} else if (onfinish) {
// handle _sync_read asynchronously...
Contributor

Thanks for the comments. I'm able to understand this better.

Contributor

@ajarr ajarr left a comment

I went through Client::_read() in detail. I have a few more questions/comments.

I am finding Client::C_Read_Sync_Async::finish() tricky. I'm still going through it a couple more times.

I also need to go through the fsync-related changes.


lat = ceph_clock_now();
lat -= start;
clnt->logger->tinc(l_c_read, lat);
Contributor

The above line was replaced by:

  ++nr_read_request;
  update_io_stat_read(lat);

Please see 967e24fe5c0efd9d7#diff-7a3052fe46aebfed0382c9d0bb9880ea1328add824e0b10c5d551ddfee282cd1R10111

Contributor Author

Thanks for that, I put that change in.

// Caller holds client_lock so we don't need to take it.

if (r >= 0) {
if (is_read_async) {
Contributor

I don't follow the need for this if...else statement. Can you please elaborate?

Contributor Author

Ah, for that one, I had a question about whether I needed to do anything to deal with a short read.

Handling would be different depending on whether the read is sync or async (a separate sync/async distinction from what I'm doing...).

}

ceph_assert(r >= 0);
if (movepos) {
Contributor

Do we need to call update_read_io_size() above this line, as we do at success: in Client::_read()?

Contributor Author

Yes, that will be fixed in the next update.

{
bool fini;

clnt->client_lock.lock();
Contributor

Similar to Client::C_Read_Finisher::finish_io(), wouldn't the caller still hold the client_lock? I don't see where it's dropped before calling this method, finish_onunline().

Contributor Author

Hmm, uninline_data can complete immediately, clearly with the client_lock held.

If it actually runs, it's not clear to me whether the client_lock is held or not (it's all done in the Objecter); I'm thinking not.

The original onuninline using a C_SaferCond doesn't need the client_lock to signal or wait for the condition, so it doesn't care whether client_lock is held or not...

Considering we really shouldn't have to uninline data, maybe I should just make uninline synchronous and always use a C_SaferCond, waiting for it to be done before returning; then _read (for example) calls finish_onunline with the client_lock held.

Contributor

As far as I understand, yes, I think we should be okay to make uninline synchronous here and in the write path. I suspect that'd simplify the code?

Contributor Author

Cool, I have done that then.

*/
void Client::C_Read_Sync_Async::finish(int r)
{
clnt->client_lock.lock();
Contributor

Wouldn't the caller hold the client_lock? I don't see where client_lock is getting dropped in Client::_read() before this method is called.

Contributor Author

finish will be called via read_trunc, which looks like it is scheduled and so runs on another thread. I haven't followed that code path all the way to the gory end. Maybe it does take the client_lock somewhere? But it all goes down into the Filer and Objecter, so it seems unlikely any of that would acquire the client_lock, and I don't see any inline completion that could call finish on the submitting thread that WOULD still hold the client_lock.

I welcome more analysis on this; it's one place where I got in over my head following and understanding the code.

We don't even ask for the Fw caps when reading, and unless we are sure that we have been granted them, we shouldn't write contents to RADOS.

The bug was introduced by commit a0cb524
(client: Read inline data path)

Fixes: https://tracker.ceph.com/issues/56553
Signed-off-by: Xiubo Li <xiubli@redhat.com>
ffilz added 3 commits July 14, 2022 15:24
For async I/O, we will want to be able to override block_writes_upfront,
so rename the member to cfg_block_writes_upfront and add an option to
pass block_writes_upfront as a parameter, along with a member access
method so the caller can pass cfg_block_writes_upfront.

Signed-off-by: Frank S. Filz <ffilzlnx@mindspring.com>
Signed-off-by: Frank S. Filz <ffilzlnx@mindspring.com>
These bits of code need to be invoked from a separate spot when we
introduce async I/O, so break them out now.

Signed-off-by: Frank S. Filz <ffilzlnx@mindspring.com>
}
return 0;
}

Contributor

io_finish.get() is always called above, but release() only happens when onfinish != nullptr.

Is there a leaked reference when onfinish == nullptr?

Contributor Author

No, it's a managed pointer (std::unique_ptr). It will go out of scope and be cleaned up on any exit from Client::_read_async(). We release it HERE because it needs to have a life beyond the call to _read_async().
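The ownership pattern being described can be sketched with stand-in types (illustrative, not the actual _read_async() code): the unique_ptr frees the finisher on every early-return path, and release() transfers ownership only when the object must outlive the call.

```cpp
#include <memory>

struct Finisher { bool completed = false; };  // stand-in type

// Returns the raw pointer handed off to the async path, or nullptr
// when ownership stayed with the unique_ptr and the object was freed
// on function exit -- no leak either way.
Finisher* maybe_hand_off(bool go_async)
{
    auto fin = std::make_unique<Finisher>();
    if (!go_async)
        return nullptr;    // fin is destroyed here automatically
    return fin.release();  // async completion now owns the object
}
```

Whoever receives the released pointer (here, the async completion path) becomes responsible for deleting it once the I/O finishes.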

@ffilz
Contributor Author

ffilz commented Jul 18, 2022

There is some confusion about the use of sync and async in the existing code. I'm thinking maybe I need to change the wording of my efforts from async to nonblocking.

ffilz added 8 commits July 18, 2022 15:39
Signed-off-by: Frank S. Filz <ffilzlnx@mindspring.com>
Signed-off-by: Frank S. Filz <ffilzlnx@mindspring.com>
Signed-off-by: Frank S. Filz <ffilzlnx@mindspring.com>
Signed-off-by: Frank S. Filz <ffilzlnx@mindspring.com>
…xt list

To make an async version of fsync (to be used for async write and commit),
we need to be able to signal an arbitrary Context on completion of either
of these lists.

add_async_onfinish_to_context_list adds such a Context to the list.

Signed-off-by: Frank S. Filz <ffilzlnx@mindspring.com>
We will need the ability to do an async write that finishes with fsync
so we need non-blocking fsync.

Signed-off-by: Frank S. Filz <ffilzlnx@mindspring.com>
Signed-off-by: Frank S. Filz <ffilzlnx@mindspring.com>
…readv_writev

Signed-off-by: Frank S. Filz <ffilzlnx@mindspring.com>
@ffilz
Contributor Author

ffilz commented Jul 18, 2022

OK, I believe this update includes all the required fixes.

I'm still not sure if I need any short read handling in C_Read_Finisher::finish_io.

The client_lock issue in C_Read_Sync_Async::finish still needs more analysis.

@ffilz
Contributor Author

ffilz commented Jul 20, 2022

I posted a new version changing the title and names from async to nonblocking to reduce confusion.

Here is a pastebin of the diff:

https://pastebin.com/MU7ZtLaL

Here is the nonblocking-io branch:

https://github.com/ffilz/ceph/commits/nonblocking-io

If folks think this looks good, I will push the changes to this pull request (and update the pull request title).

I think this makes things easier to understand and gets rid of Sync_Async in function names...

@ffilz ffilz requested review from dang and mattbenjamin July 20, 2022 22:36
@lxbsz
Member

lxbsz commented Jul 21, 2022

@ffilz I suggest just waiting for @mchangir's uninline_data PR #44359, which will add a scrub command to uninline the inline_data, and a following PR which will remove the inline data code from Client.cc.

Or, in this PR, you could just skip the uninline-related code by adding some comments, or just return "not supported" if the file has inline data?

@ffilz
Contributor Author

ffilz commented Jul 21, 2022

Dear reviewers,

I need to figure out one more thing: maybe I need to drop the client_lock to make the callback to Ganesha. I also possibly have to close the filehandle used. For Ganesha to call ceph_ll_close, the client_lock can't be held, since ll_close will take it.

It was reasonable to add non-blocking fsync on write completion, but file close at end of I/O may be too much to ask... (plus it's a bit tricky in the Ganesha code since the callback isn't actually managing the lifetime of the filehandle).

Can we safely drop the client_lock in the callback?

@ffilz ffilz mentioned this pull request Sep 9, 2022
14 tasks
@ffilz
Contributor Author

ffilz commented Sep 9, 2022

Please see new pull request #48038

@ffilz ffilz closed this Sep 9, 2022