osd: add cmpext and writesame ops#8175
osd: add cmpext and writesame ops#8175mikechristie wants to merge 2 commits intoceph:masterfrom mikechristie:wip-osd-vaai
Conversation
This adds a new ceph request writesame that writes a buffer of length writesame.data_length bytes at writesame.offset over writesame.length bytes. This command maps to SCSI's WRITE SAME request, so users like LIO+rbd can pass this to the OSD. Right now, it only saves having to transfer writesame.length bytes over the network, but future versions will be to fully offload it by passing it directly to the FS/devices if they support it. Signed-off-by: Mike Christie <mchristi@redhat.com>
This adds support for a new op cmpext. The request will read extent.length bytes and compare them to extent.length bytes at extent.offset on disk. If there is a miscompare the osd will return -EILSEQ, and position in the buffer the offset was found and the mismatched buffer that was read. rbd will use this in a multi op request to implement the SCSI COMPARE_AND_WRITE request which is used by VMware for its atomic test and set request. v3: - Return mismatch offset on matching failures. v2: - Merge David's tracing fixes. - Instead of returning the mismatch offset and buffer on matching failure just return the buffer. The client can figure out the offset if it needs it. Signed-off-by: Mike Christie <mchristi@redhat.com>
|
in my mind, VAAI is a block layer api spec which should integrate into librbd, why not implement this in RADOS instead of librbd? |
|
@yuyuyu101 The purpose of these commands is to do data updates on the "disk"; doing it in librbd would require Read-Modify-Write. So they definitely need to happen on the OSD. @mikechristie, is there a particular reason these need to be OSD ops and not class operations? The particular operations seem awfully specific and unlikely to be reused by other kinds of clients. |
|
@gregsfortytwo yep, I also means we could impl detail logics in librbd layer, and leave generic CAS things to rados. Oh, I guess you will impl it in kernel rbd firstly. |
|
@yuyuyu101, Yes, it is implemented in the kernel only right now. Here are the patches that are being proposed upstream: https://github.com/mikechristie/linux-kernel/commits/vaai @gregsfortytwo, You mean implement them as rbd cls ops in src/cls/rbd/cls_rbd.cc right? I swear I saw a comment in the ceph code saying something about being able to compare extents could be useful, so I went wild and implemented both as a OSD ops. I cannot find this comment, so maybe I was seeing things :) I think you are probably right and the only user would be rbd for this ESX support. |
|
Yeah, that's what I meant. (Or perhaps as a separate object class, if that's desirable in terms of auth caps or something.) I don't care that much and @athanatos is your real gate, but in general it's nice to not expand public interfaces too much. ;) |
|
Ok, actually, it looks like it is less invasive on the kernel side to implement as cls ops. The existing libceph and rbd kernel code seem to have the infrastructure I need to implement them. So if no one wants them as osd ops, it makes the kernel side nicer to do them as cls ops. |
|
@mikechristie why not consider impl librbd firstly? related to iscsi target problem? |
|
I am new'ish to ceph. What do you mean by implement in librbd? Is that in the userspace lib ceph/src/librbd? If my target was in userpsace like tgt or tcmu, I could do something like this (tgt making rbd_*() lib calls) I like [fixed a word here] this approach, but tgt is being replaced with the upstream LIO target in most distros. LIO was all kernel based, and so that was one reason I was focusing on krbd support. I also felt performance would be better and it would be a easier implementation. |
|
I also recall someone thinking a generic op would be useful, but can't find where. A class op can still be reused easily though - it also makes testing easier (no need to plumb through new client calls in userspace either). So new class methods (don't really care if they're cls_rbd or a new cls) sound good to me. |
|
These changes both look good to me: Though I'm not against implementing these as class operation, I think a generic cmpext op will be useful outside of just RBD - atomic compare-and-swap operations are already possible using xattrs and OMAP values. Allowing the same functionality with object data is complementary to this IMO. If extra tests are needed for these operations, I'll gladly put my hand up. |
|
class ops are just as reusable as new rados ops. I don't see much reason to add more rados ops (like these) that could be implemented as class ops directly. class ops are easier to write, and can be used without bloating inferfaces - no need to plumb new ops through the Objecter and librados on the client side, for example. For testing this, we'll want to update src/test/librados/(maybe misc.cc?) so we can detect regressions without going through extra layers like an application on top of iscsi. |
|
I too remember this discussion last year -- I thought it was @liewegas who suggested they be ops as opposed to Mike's original implementation using class methods. |
| tracepoint(osd, do_osd_op_pre_writesame, soid.oid.name.c_str(), soid.snap.val, oi.size, op.writesame.offset, op.writesame.length, op.writesame.data_length); | ||
|
|
||
| result = do_writesame(ctx, osd_op); | ||
| break; |
|
Rados ops seem fine to me, but yeah, a class would also work. @athanatos ? |
|
We need functional tests in test/librados/misc.cc that verify these work as expected. |
|
Rados ops seem reasonable. I'm ok either way on that part. +1 on the functional tests in test/librados/misc.cc I'd also like ceph_test_rados (src/test/osd/RadosModel.h and src/test/osd/TestRados.cc) extended with these operations. CMPEXT can probably be tested as part of the normal write op by randomly adding a CMPEXT which should succeed to one of the writes and sometimes (rand() % 3 or something) adding a CMPEXT that should fail with a write that would be noticed later in read if it happened (0's or something). WRITESAME should probably be its own op (Object.h will need to be changed a bit). |
|
@mikechristie why the CMPEXT op just does the comparison? I think it should do a write if the comparison result is equal. |
|
@wonzhq: similar to the xattr ops, atomic test-and-set updates are handled on the client side by issuing compound cmpext+write requests. |
|
@mikechristie are you still working on this? If target to jewel and no much free time, I would like to help |
|
@yuyuyu101 we'd definitely appreciate help, especially with the ceph_test_rados changes. |
|
@yuyuyu101, I might have messed up on my tests. I will push my current branch later today, and then we can see what still needs to be done and you can help with any of it. |
|
@mikechristie great, look forward to this |
|
Hi, I hope I'm not duplicating work that Mike has already done, but I've pushed a preliminary writesame librados API and some basic ceph_test_rados_api_misc tests to https://github.com/ddiss/ceph/tree/mc-wip-osd-vaai I'll continue adding writesame tests, and work on a cmpext API today/tomorrow. Preliminary feedback would be much appreciated. |
|
@ddiss looks good! |
|
Hey ddiss, A little overlap. Thanks for working on it!. I am having trouble pushing my branch. Just so I can get it out to you and yuyuyu101, I just made a new branch wip-osd-vaai2. Here is a link: It has the beginnings of cmpext librados support and tests. I did not do the C API cmpext stuff yet. |
|
Taking a look now - thanks Mike! |
|
Hey ddiss, You do not have to actually waste your time and review the ws/cmp rados patches. Just use them for a guide if you have not started the cmpext stuff. I do not think some of the patches build against the tree I uploaded against. I just pushed them incase you wanted to look at them while you were working on this. I think your ws patches are more complete so we should go forward with yours. |
|
I've pushed a new round of writesame changes to my branch, with improved test coverage (aio, etc.). I'm now expanding on Mike's cmpext API changes and tests. |
|
closing this pr since it's been superseded by #8568 and a future cmpext patchset |
The following patches made over the ceph master branch
implement OSD side support for VMware VAAI's Atomic Test
and Set (ATS) and Write Same (Zero) requests.
ATS is used for operations like locking and heartbeats. It
is implemented by as the SCSI COMPARE_AND_WRITE command which
requires the device to read N blocks, compare them to data
sent with the command, and if equal, write N blocks.
Zero is used to initialize blocks to zero. It is implemented
as the SCSI WRITE_SAME command which passes the device a
block's worth of data and has it write it multiple times.
More info on VAAI can be found here:
http://www.vmware.com/resources/techresources/10337
The krbd patches which use these requests are in vaai branch of
this tree:
https://github.com/mikechristie/linux-kernel
I did not submit them in this thread, because they depend on other
patches that are still being reviewed upstream and I did not want
to waste people's time reviewing them if they change. These OSD side
patches should be ok to review and merge, because the op format
and implementation should not change.
v3
v2