WIP - libcephfs: add async I/O capability #44991
Conversation
This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved.
AIUI, the only reliable way to detect EOF in Ceph is to keep track of the inode size and clamp your reads there. Because we're reading from an array of objects, a short read doesn't necessarily indicate an EOF condition; it may just be that the object has a sparse region at the end. Ditto for a non-existent OSD object. That said, the cephfs read handler should fix up the result of the read from the OSD to look like a short read if you're at EOF.
This patch is enormous. It'd be very nice to break it up into smaller changes, as it's a little hard for me to follow what your general approach is. That will also help if this causes a regression and we have to bisect to find it.
That's odd. For an existing object, I'd expect the object size to be truncated to the size defined in its file layout to avoid a messy short read. Or is that intentional?
The OSD doesn't really have the concept of file layouts. Those are a cephfs concept and are managed by the MDS. All the OSD understands is objects, and multiple (usually 4M) objects back a single cephfs inode. OSD objects can be sparse with byte granularity as well, so an object really only stores what the clients have written. If the client leaves a sparse area in an object, or never creates an object at all because it would be completely sparse, that's OK and is preferable since it takes up less space in the backing store. This is also what makes fscrypt possible -- we have to know which parts are actually sparse to properly decrypt the data. The upshot is that cephfs clients must be prepared to zero-fill these holes in OSD data for the caller on a read. A hole could even be at the end of the file, which is why we have to carefully track the file size.
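To make the clamp-and-zero-fill idea concrete, here is a minimal standalone sketch. This is not Ceph's actual client code; the helper names `clamp_read_len` and `zero_fill` are invented for illustration, and the tracked inode size stands in for what the client learns from the MDS.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Clamp a requested read [off, off+len) against the tracked inode size:
// bytes at or past EOF are never returned to the caller.
inline uint64_t clamp_read_len(uint64_t off, uint64_t len,
                               uint64_t inode_size) {
  if (off >= inode_size)
    return 0;                          // read starts at/after EOF
  return std::min(len, inode_size - off);
}

// The OSD may return fewer bytes than the clamped length because the
// object is sparse at the end (or doesn't exist at all). Inside the
// file, those holes must read back as zeros, so pad up to clamped_len.
inline std::vector<uint8_t> zero_fill(std::vector<uint8_t> osd_bytes,
                                      uint64_t clamped_len) {
  osd_bytes.resize(clamped_len, 0);    // zero-fill the sparse tail
  return osd_bytes;
}
```

Only after clamping against the inode size does a short return value actually mean EOF; a short read straight from the OSD does not.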
dc87488 to 24e465e
a20a1e8 to 5f879a8
I've updated this patch set. I hope I've addressed the comments. I've also added a unit test and resolved several issues that arose from testing. I will shortly start testing from Ganesha.
src/client/Client.cc
Outdated
int Client::uninline_data(Inode *in, Context *onfinish)
{
  if (!in->inline_data.length()) {
    // TODO - should we drop the client_lock here?
Sorry - I wasn't following this PR closely. I'll take a look next week.
Although the onfinish context signals a condition variable on completion, it's a no-op if there's no inline data to be uninlined, since in that case the onfinish context is not handed over to an asynchronous network operation to signal its completion. So, IMO, we don't need to drop the client_lock here.
Thanks. Removed the comment.
Also, onuninline is now only done for the write path AND is done synchronously, so all the async onuninline bits that I had added are gone.
ajarr left a comment:
I've gone through the write path and Client::_write in detail. I have a couple of minor questions/comments.
I still need to go through the read path, Client::_read(), in detail, and I'm yet to go through the fsync-related changes.
src/client/Client.h
Outdated
iofinished_r = 0;
onuninlinefinished_r = 0;
iofinished = false;
onuninlinefinished = onuninline == nullptr;
We don't always have to uninline a file. If so, there will be no onuninline Context, so we need to mark it already finished.
I should comment to that effect though...
@ajarr means the type of onuninlinefinished is bool, you should assign it to false instead of nullptr.
@lxbsz thanks! That's what I meant. onuninlinefinished is bool.
Right, onuninlinefinished is a bool, but it isn't being assigned onuninline. It's being assigned the result of (onuninline == nullptr), and the == operator yields a bool.
Ah OK! It looked like onuninlinefinished = onuninline = false. Thanks!
Yes, a comment mentioning "We don't always have to uninline a file. If so, there will be no onuninline Context, so we need to mark it already finished." should also help.
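The confusion above boils down to one line, so here is a tiny standalone illustration (the `Context` stand-in type and `uninline_already_finished` helper are invented for this sketch, not Ceph code): the line assigns the result of a pointer comparison, which is a bool, not the pointer itself.

```cpp
struct Context {};  // stand-in for Ceph's completion Context type

// If there is no uninline Context, there is nothing to wait for, so
// the uninline step should be marked as already finished. The ==
// comparison yields a bool, which is what gets assigned.
bool uninline_already_finished(const Context *onuninline) {
  return onuninline == nullptr;
}
```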
Removing/changing the special onuninline handling, so I think this one becomes moot.
src/client/Client.h
Outdated
iofinished_r = 0;
onuninlinefinished_r = 0;
iofinished = false;
onuninlinefinished = onuninline == nullptr;
We don't always have to uninline a file. If so, there will be no onuninline Context, so we need to mark it already finished.
I should comment to that effect though...
Got it. A comment will help.
Removing/changing the special onuninline handling, so I think this one becomes moot.
src/client/Client.cc
Outdated
// time
lat = ceph_clock_now();
lat -= start;
logger->tinc(l_c_wrlat, lat);
The above line was replaced by:
++nr_write_request;
update_io_stat_write(lat);
Please see the following lines in the commit, 967e24fe5c0efd9d7#diff-7a3052fe46aebfed0382c9d0bb9880ea1328add824e0b10c5d551ddfee282cd1R10572
Thanks for that, I put that change in.
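The replaced pattern is just "timestamp delta feeds a per-I/O stat". As a standalone model (std::chrono stands in for ceph_clock_now, and `IoStats`/`record_write_latency` are invented stand-ins for the counters and update_io_stat_write, not Ceph's actual types):

```cpp
#include <chrono>
#include <cstdint>

// Model of the write-latency accounting discussed above: count the
// request and accumulate the elapsed time for the stat.
struct IoStats {
  uint64_t nr_write_request = 0;
  std::chrono::nanoseconds total_write_lat{0};

  void record_write_latency(std::chrono::nanoseconds lat) {
    ++nr_write_request;       // one more write request observed
    total_write_lat += lat;   // accumulate its latency
  }
};
```

A caller would capture `start` before issuing the write and pass `now - start` (cast to nanoseconds) on completion, mirroring the `lat = ceph_clock_now(); lat -= start;` pattern in the quoted hunk.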
 * License version 2.1, as published by the Free Software
 * Foundation. See file COPYING.
 *
 */

    if (rc < 0)
      goto done;
  } else if (onfinish) {
    // handle _sync_read asynchronously...
Thanks for the comments. I'm able to understand this better.
ajarr left a comment:
I went through Client::_read() in detail. I've a few more questions/comments.
I am finding Client::C_Read_Sync_Async::finish() tricky; I'm still going through it a couple more times.
I also need to go through the fsync-related changes.
src/client/Client.cc
Outdated
lat = ceph_clock_now();
lat -= start;
clnt->logger->tinc(l_c_read, lat);
The above line was replaced by:
++nr_read_request;
update_io_stat_read(lat);
Please see 967e24fe5c0efd9d7#diff-7a3052fe46aebfed0382c9d0bb9880ea1328add824e0b10c5d551ddfee282cd1R10111
Thanks for that, I put that change in.
// Caller holds client_lock so we don't need to take it.

if (r >= 0) {
  if (is_read_async) {
I don't follow the need for this if...else statement. Can you please elaborate?
Ah, for that one, I had a question about whether I needed to do anything to deal with a short read. Handling would be different depending on whether the read is sync or async (a separate sync/async from what I'm doing...).
}

ceph_assert(r >= 0);
if (movepos) {
Do we need to call update_read_io_size() above this line, as we do during success: in Client::_read()?
Yes, that will be fixed in the next update.
src/client/Client.cc
Outdated
{
  bool fini;

  clnt->client_lock.lock();
Similar to Client::C_Read_Finisher::finish_io(), wouldn't the caller still hold the client_lock? I don't see where it's dropped before calling this method, finish_onuninline().
Hmm, uninline_data can complete immediately, clearly with the client_lock held.
If it actually runs asynchronously, it's not clear to me whether the client_lock is held or not (it's all done in the Objecter); I'm thinking not.
The original onuninline using a C_SaferCond doesn't need the client_lock to signal or wait on the condition, so it doesn't care whether client_lock is held.
Considering we really shouldn't have to uninline data, maybe I should just make uninline synchronous: always use a C_SaferCond and wait for it to be done before returning, so _read (for example) calls finish_onuninline with the client_lock held.
As far as I understand, yes, I think we should be okay to make uninline synchronous here and in the write path. I suspect that'd simplify the code?
Cool, I have done that then.
 */
void Client::C_Read_Sync_Async::finish(int r)
{
  clnt->client_lock.lock();
Wouldn't the caller hold the client_lock? I don't see where client_lock is getting dropped in Client::_read() before this method is called.
finish will be called via read_trunc, which looks like it is scheduled, so it runs on another thread. I haven't followed that code path all the way to the gory end. Maybe it does take the client_lock somewhere? But it all goes down into the Filer and Objecter, so it seems unlikely any of that would acquire the client_lock, and I don't see any inline completion that could call finish on the submitting thread, which WOULD still hold the client_lock.
I welcome more analysis; this is one place I got in over my head following and understanding the code.
When reading, we don't even ask for the Fw caps, and can't be sure we have been granted them, so we shouldn't write contents to RADOS. The bug was introduced by commit a0cb524 (client: Read inline data path). Fixes: https://tracker.ceph.com/issues/56553 Signed-off-by: Xiubo Li <xiubli@redhat.com>
For async I/O, we will want to be able to override block_writes_upfront, so rename the member to cfg_block_writes_upfront, add an option to pass block_writes_upfront as a parameter, and add a member access method so the caller can pass cfg_block_writes_upfront. Signed-off-by: Frank S. Filz <ffilzlnx@mindspring.com>
Signed-off-by: Frank S. Filz <ffilzlnx@mindspring.com>
These bits of code need to be invoked from a separate spot when we introduce async I/O, so break them out now. Signed-off-by: Frank S. Filz <ffilzlnx@mindspring.com>
  }
  return 0;
}
io_finish.get() is always called above, but release() is only called when onfinish != nullptr.
Is there a reference leak when onfinish == nullptr?
No, it's a managed pointer (std::unique_ptr). It will go out of scope and be cleaned up on any exit from Client::_read_async(). We release it HERE because it needs to have a life beyond the call to _read_async().
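The ownership pattern being described looks roughly like this standalone sketch (the `Completion` type and `maybe_hand_off` function are invented for illustration; they are not the actual Client code): release() transfers ownership only on the path that outlives the call, and every other path cleans up automatically.

```cpp
#include <memory>

// Toy completion object that records when it is destroyed, so the
// two ownership paths can be observed.
struct Completion {
  bool *destroyed;
  explicit Completion(bool *d) : destroyed(d) {}
  ~Completion() { *destroyed = true; }
};

// If the operation goes async, ownership is handed off via release()
// because the completion must outlive this call. Otherwise the
// unique_ptr destroys it when it goes out of scope -- no leak.
Completion *maybe_hand_off(std::unique_ptr<Completion> io_finish,
                           bool goes_async) {
  if (goes_async)
    return io_finish.release();  // async machinery now owns it
  return nullptr;                // destroyed on return, no leak
}
```

Calling get() earlier in the function is harmless: it only observes the pointer and does not affect ownership, which is why the non-release path cannot leak.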
There is some confusion about the use of sync and async in the existing code. I'm thinking maybe I need to change the wording of my efforts from async to nonblocking.
Signed-off-by: Frank S. Filz <ffilzlnx@mindspring.com>
Signed-off-by: Frank S. Filz <ffilzlnx@mindspring.com>
Signed-off-by: Frank S. Filz <ffilzlnx@mindspring.com>
Signed-off-by: Frank S. Filz <ffilzlnx@mindspring.com>
…xt list. To make an async version of fsync (to be used for async write and commit), we need to be able to signal an arbitrary Context on completion of either of these lists. add_async_onfinish_to_context_list adds such a Context to the list. Signed-off-by: Frank S. Filz <ffilzlnx@mindspring.com>
We will need the ability to do an async write that finishes with fsync, so we need a non-blocking fsync. Signed-off-by: Frank S. Filz <ffilzlnx@mindspring.com>
Signed-off-by: Frank S. Filz <ffilzlnx@mindspring.com>
…readv_writev Signed-off-by: Frank S. Filz <ffilzlnx@mindspring.com>
OK, I believe this update includes all the required fixes. I'm still not sure whether I need any short-read handling in C_Read_Finisher::finish_io. The client_lock issue in C_Read_Sync_Async::finish still needs more analysis.
I posted a new version changing the title and names from async to nonblocking to reduce confusion. Here is a pastebin of the diff: Here is the nonblocking-io branch: https://github.com/ffilz/ceph/commits/nonblocking-io If folks think this looks good, I will push the changes to this pull request (and update the pull request title). I think this makes things easier to understand and gets rid of Sync_Async in the function names...
@ffilz I suggest just waiting for @mchangir's uninline_data PR #44359, which will add a scrub command to uninline the inline_data, and a following PR which will remove the inline data code from. Or in this PR you can just skip the uninline-related code by adding some comments, or just return not supported if it has inline data?
Dear reviewers, I need to figure out one more thing: maybe I need to drop the client_lock to make the callback into Ganesha. I also possibly have to close the filehandle used. For Ganesha to call ceph_ll_close, the client_lock can't be held, since ll_close will take it. It was reasonable to add non-blocking fsync on write completion, but file close at the end of I/O may be too much to ask... (plus it's a bit tricky in the Ganesha code, since the callback isn't actually managing the lifetime of the filehandle). Can we safely drop the client_lock in the callback?
Please see new pull request #48038
This is the first pass at supporting async I/O calls from Ganesha. I would appreciate another set of eyes to verify that I have appropriately covered all the bases. I am concerned that somehow I have left a path for synchronous completion, which would leave confusion about the return value. The intent is to have all I/O complete async even if it completes before returning to the caller (which is prepared to deal with that situation). I'm also concerned about the best way to detect EOF. Ganesha currently assumes a 0-length read indicates EOF, but maybe we can do better.
Signed-off-by: Frank S. Filz <ffilzlnx@mindspring.com>
I haven't tried to complete the checklist; I know there is at least documentation that should be written, and tests are yet to be written.