Skip to content

osd: add cmpext and writesame ops#8175

Closed
mikechristie wants to merge 2 commits intoceph:masterfrom
mikechristie:wip-osd-vaai
Closed

osd: add cmpext and writesame ops#8175
mikechristie wants to merge 2 commits intoceph:masterfrom
mikechristie:wip-osd-vaai

Conversation

@mikechristie
Copy link

The following patches made over the ceph master branch
implement OSD side support for VMware VAAI's Atomic Test
and Set (ATS) and Write Same (Zero) requests.

ATS is used for operations like locking and heartbeats. It
is implemented by as the SCSI COMPARE_AND_WRITE command which
requires the device to read N blocks, compare them to data
sent with the command, and if equal, write N blocks.

Zero is used to initialize blocks to zero. It is implemented
as the SCSI WRITE_SAME command which passes the device a
block's worth of data and has it write it multiple times.

More info on VAAI can be found here:
http://www.vmware.com/resources/techresources/10337

The krbd patches which use these requests are in vaai branch of
this tree:
https://github.com/mikechristie/linux-kernel

I did not submit them in this thread, because they depend on other
patches that are still being reviewed upstream and I did not want
to waste people's time reviewing them if they change. These OSD side
patches should be ok to review and merge, because the op format
and implementation should not change.

v3

  • Add CMPEXT miscompare offset back.
    v2
  • Integrated David's tracing fixes.
  • Dropped CMPEXT miscompare offset.

Mike Christie added 2 commits March 15, 2016 17:14
This adds a new ceph request writesame that writes a buffer of length
writesame.data_length bytes at writesame.offset over
writesame.length bytes.

This command maps to SCSI's WRITE SAME request, so users like LIO+rbd
can pass this to the OSD. Right now, it only saves having to transfer
writesame.length bytes over the network, but future versions will be
to fully offload it by passing it directly to the FS/devices if they
support it.

Signed-off-by: Mike Christie <mchristi@redhat.com>
This adds support for a new op cmpext. The request will read
extent.length bytes and compare them to extent.length bytes at
extent.offset on disk. If there is a miscompare the osd will return
-EILSEQ, and position in the buffer the offset was found and
the mismatched buffer that was read.

rbd will use this in a multi op request to implement the
SCSI COMPARE_AND_WRITE request which is used by VMware for
its atomic test and set request.

v3:
- Return mismatch offset on matching failures.

v2:
- Merge David's tracing fixes.
- Instead of returning the mismatch offset and buffer on matching
failure just return the buffer. The client can figure out the offset
if it needs it.

Signed-off-by: Mike Christie <mchristi@redhat.com>
@yuyuyu101
Copy link
Member

Hi @mikechristie

in my mind, VAAI is a block layer api spec which should integrate into librbd, why not implement this in RADOS instead of librbd?

@gregsfortytwo
Copy link
Member

@yuyuyu101 The purpose of these commands is to do data updates on the "disk"; doing it in librbd would require Read-Modify-Write. So they definitely need to happen on the OSD.

@mikechristie, is there a particular reason these need to be OSD ops and not class operations? The particular operations seem awfully specific and unlikely to be reused by other kinds of clients.

@yuyuyu101
Copy link
Member

@gregsfortytwo yep, I also means we could impl detail logics in librbd layer, and leave generic CAS things to rados.

Oh, I guess you will impl it in kernel rbd firstly.

@mikechristie
Copy link
Author

@yuyuyu101, Yes, it is implemented in the kernel only right now. Here are the patches that are being proposed upstream:

https://github.com/mikechristie/linux-kernel/commits/vaai

@gregsfortytwo, You mean implement them as rbd cls ops in src/cls/rbd/cls_rbd.cc right? I swear I saw a comment in the ceph code saying something about being able to compare extents could be useful, so I went wild and implemented both as a OSD ops. I cannot find this comment, so maybe I was seeing things :) I think you are probably right and the only user would be rbd for this ESX support.

@gregsfortytwo
Copy link
Member

Yeah, that's what I meant. (Or perhaps as a separate object class, if that's desirable in terms of auth caps or something.)

I don't care that much and @athanatos is your real gate, but in general it's nice to not expand public interfaces too much. ;)

@mikechristie
Copy link
Author

Ok, actually, it looks like it is less invasive on the kernel side to implement as cls ops. The existing libceph and rbd kernel code seem to have the infrastructure I need to implement them.

So if no one wants them as osd ops, it makes the kernel side nicer to do them as cls ops.

@yuyuyu101
Copy link
Member

@mikechristie why not consider impl librbd firstly? related to iscsi target problem?

@mikechristie
Copy link
Author

@yuyuyu101,

I am new'ish to ceph. What do you mean by implement in librbd? Is that in the userspace lib ceph/src/librbd? If my target was in userpsace like tgt or tcmu, I could do something like this

(tgt making rbd_*() lib calls)
https://github.com/fujita/tgt/blob/master/usr/bs_rbd.c

I like [fixed a word here] this approach, but tgt is being replaced with the upstream LIO target in most distros. LIO was all kernel based, and so that was one reason I was focusing on krbd support. I also felt performance would be better and it would be a easier implementation.

@jdurgin
Copy link
Member

jdurgin commented Mar 17, 2016

I also recall someone thinking a generic op would be useful, but can't find where. A class op can still be reused easily though - it also makes testing easier (no need to plumb through new client calls in userspace either).

So new class methods (don't really care if they're cls_rbd or a new cls) sound good to me.

@jdurgin jdurgin added the rbd label Mar 17, 2016
@ddiss
Copy link
Contributor

ddiss commented Mar 18, 2016

These changes both look good to me:
Reviewed-by: David Disseldorp ddiss@suse.de

Though I'm not against implementing these as class operation, I think a generic cmpext op will be useful outside of just RBD - atomic compare-and-swap operations are already possible using xattrs and OMAP values. Allowing the same functionality with object data is complementary to this IMO.

If extra tests are needed for these operations, I'll gladly put my hand up.

@jdurgin
Copy link
Member

jdurgin commented Mar 18, 2016

class ops are just as reusable as new rados ops. I don't see much reason to add more rados ops (like these) that could be implemented as class ops directly. class ops are easier to write, and can be used without bloating inferfaces - no need to plumb new ops through the Objecter and librados on the client side, for example.

For testing this, we'll want to update src/test/librados/(maybe misc.cc?) so we can detect regressions without going through extra layers like an application on top of iscsi.

@dillaman
Copy link

I too remember this discussion last year -- I thought it was @liewegas who suggested they be ops as opposed to Mike's original implementation using class methods.

@jdurgin jdurgin added this to the jewel milestone Mar 24, 2016
tracepoint(osd, do_osd_op_pre_writesame, soid.oid.name.c_str(), soid.snap.val, oi.size, op.writesame.offset, op.writesame.length, op.writesame.data_length);

result = do_writesame(ctx, osd_op);
break;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: indentation looks wonky?

@liewegas
Copy link
Member

Rados ops seem fine to me, but yeah, a class would also work. @athanatos ?

@liewegas
Copy link
Member

We need functional tests in test/librados/misc.cc that verify these work as expected.

@athanatos
Copy link
Contributor

Rados ops seem reasonable. I'm ok either way on that part.

+1 on the functional tests in test/librados/misc.cc

I'd also like ceph_test_rados (src/test/osd/RadosModel.h and src/test/osd/TestRados.cc) extended with these operations. CMPEXT can probably be tested as part of the normal write op by randomly adding a CMPEXT which should succeed to one of the writes and sometimes (rand() % 3 or something) adding a CMPEXT that should fail with a write that would be noticed later in read if it happened (0's or something). WRITESAME should probably be its own op (Object.h will need to be changed a bit).

@wonzhq
Copy link
Contributor

wonzhq commented Mar 30, 2016

@mikechristie why the CMPEXT op just does the comparison? I think it should do a write if the comparison result is equal.

@ddiss
Copy link
Contributor

ddiss commented Mar 30, 2016

@wonzhq: similar to the xattr ops, atomic test-and-set updates are handled on the client side by issuing compound cmpext+write requests.

@yuyuyu101
Copy link
Member

@mikechristie are you still working on this? If target to jewel and no much free time, I would like to help

@liewegas
Copy link
Member

liewegas commented Apr 4, 2016

@yuyuyu101 we'd definitely appreciate help, especially with the ceph_test_rados changes.

@mikechristie
Copy link
Author

@yuyuyu101, I might have messed up on my tests. I will push my current branch later today, and then we can see what still needs to be done and you can help with any of it.

@yuyuyu101
Copy link
Member

@mikechristie great, look forward to this

@ddiss
Copy link
Contributor

ddiss commented Apr 6, 2016

Hi,

I hope I'm not duplicating work that Mike has already done, but I've pushed a preliminary writesame librados API and some basic ceph_test_rados_api_misc tests to https://github.com/ddiss/ceph/tree/mc-wip-osd-vaai
It's all atop Mike's existing OSD changes.

I'll continue adding writesame tests, and work on a cmpext API today/tomorrow. Preliminary feedback would be much appreciated.

@liewegas
Copy link
Member

liewegas commented Apr 6, 2016

@ddiss looks good!

@mikechristie
Copy link
Author

Hey ddiss,

A little overlap. Thanks for working on it!.

I am having trouble pushing my branch. Just so I can get it out to you and yuyuyu101, I just made a new branch wip-osd-vaai2.

Here is a link:
https://github.com/mikechristie/ceph/commits/wip-osd-vaai2

It has the beginnings of cmpext librados support and tests. I did not do the C API cmpext stuff yet.

@ddiss
Copy link
Contributor

ddiss commented Apr 6, 2016

Taking a look now - thanks Mike!

@mikechristie
Copy link
Author

Hey ddiss,

You do not have to actually waste your time and review the ws/cmp rados patches. Just use them for a guide if you have not started the cmpext stuff. I do not think some of the patches build against the tree I uploaded against. I just pushed them incase you wanted to look at them while you were working on this.

I think your ws patches are more complete so we should go forward with yours.

@liewegas liewegas changed the title Wip osd ESX vaai v3 osd: add cmpext and writesame ops Apr 7, 2016
@ddiss
Copy link
Contributor

ddiss commented Apr 7, 2016

I've pushed a new round of writesame changes to my branch, with improved test coverage (aio, etc.). I'm now expanding on Mike's cmpext API changes and tests.

@liewegas liewegas modified the milestones: kraken, jewel Apr 15, 2016
@jdurgin
Copy link
Member

jdurgin commented Apr 28, 2016

closing this pr since it's been superseded by #8568 and a future cmpext patchset

@jdurgin jdurgin closed this Apr 28, 2016
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

9 participants