osd: add cmpext and writesame ops by mikechristie · Pull Request #8175 · ceph/ceph

mikechristie · 2016-03-17T06:03:13Z

The following patches made over the ceph master branch
implement OSD side support for VMware VAAI's Atomic Test
and Set (ATS) and Write Same (Zero) requests.

ATS is used for operations like locking and heartbeats. It
is implemented by as the SCSI COMPARE_AND_WRITE command which
requires the device to read N blocks, compare them to data
sent with the command, and if equal, write N blocks.

Zero is used to initialize blocks to zero. It is implemented
as the SCSI WRITE_SAME command which passes the device a
block's worth of data and has it write it multiple times.

More info on VAAI can be found here:
http://www.vmware.com/resources/techresources/10337

The krbd patches which use these requests are in vaai branch of
this tree:
https://github.com/mikechristie/linux-kernel

I did not submit them in this thread, because they depend on other
patches that are still being reviewed upstream and I did not want
to waste people's time reviewing them if they change. These OSD side
patches should be ok to review and merge, because the op format
and implementation should not change.

v3

Add CMPEXT miscompare offset back.
v2
Integrated David's tracing fixes.
Dropped CMPEXT miscompare offset.

This adds a new ceph request writesame that writes a buffer of length writesame.data_length bytes at writesame.offset over writesame.length bytes. This command maps to SCSI's WRITE SAME request, so users like LIO+rbd can pass this to the OSD. Right now, it only saves having to transfer writesame.length bytes over the network, but future versions will be to fully offload it by passing it directly to the FS/devices if they support it. Signed-off-by: Mike Christie <mchristi@redhat.com>

This adds support for a new op cmpext. The request will read extent.length bytes and compare them to extent.length bytes at extent.offset on disk. If there is a miscompare the osd will return -EILSEQ, and position in the buffer the offset was found and the mismatched buffer that was read. rbd will use this in a multi op request to implement the SCSI COMPARE_AND_WRITE request which is used by VMware for its atomic test and set request. v3: - Return mismatch offset on matching failures. v2: - Merge David's tracing fixes. - Instead of returning the mismatch offset and buffer on matching failure just return the buffer. The client can figure out the offset if it needs it. Signed-off-by: Mike Christie <mchristi@redhat.com>

yuyuyu101 · 2016-03-17T06:11:12Z

Hi @mikechristie

in my mind, VAAI is a block layer api spec which should integrate into librbd, why not implement this in RADOS instead of librbd?

gregsfortytwo · 2016-03-17T06:36:52Z

@yuyuyu101 The purpose of these commands is to do data updates on the "disk"; doing it in librbd would require Read-Modify-Write. So they definitely need to happen on the OSD.

@mikechristie, is there a particular reason these need to be OSD ops and not class operations? The particular operations seem awfully specific and unlikely to be reused by other kinds of clients.

yuyuyu101 · 2016-03-17T06:43:58Z

@gregsfortytwo yep, I also means we could impl detail logics in librbd layer, and leave generic CAS things to rados.

Oh, I guess you will impl it in kernel rbd firstly.

mikechristie · 2016-03-17T07:06:06Z

@yuyuyu101, Yes, it is implemented in the kernel only right now. Here are the patches that are being proposed upstream:

https://github.com/mikechristie/linux-kernel/commits/vaai

@gregsfortytwo, You mean implement them as rbd cls ops in src/cls/rbd/cls_rbd.cc right? I swear I saw a comment in the ceph code saying something about being able to compare extents could be useful, so I went wild and implemented both as a OSD ops. I cannot find this comment, so maybe I was seeing things :) I think you are probably right and the only user would be rbd for this ESX support.

gregsfortytwo · 2016-03-17T07:13:55Z

Yeah, that's what I meant. (Or perhaps as a separate object class, if that's desirable in terms of auth caps or something.)

I don't care that much and @athanatos is your real gate, but in general it's nice to not expand public interfaces too much. ;)

mikechristie · 2016-03-17T07:29:19Z

Ok, actually, it looks like it is less invasive on the kernel side to implement as cls ops. The existing libceph and rbd kernel code seem to have the infrastructure I need to implement them.

So if no one wants them as osd ops, it makes the kernel side nicer to do them as cls ops.

yuyuyu101 · 2016-03-17T07:37:14Z

@mikechristie why not consider impl librbd firstly? related to iscsi target problem?

mikechristie · 2016-03-17T07:51:31Z

@yuyuyu101,

I am new'ish to ceph. What do you mean by implement in librbd? Is that in the userspace lib ceph/src/librbd? If my target was in userpsace like tgt or tcmu, I could do something like this

(tgt making rbd_*() lib calls)
https://github.com/fujita/tgt/blob/master/usr/bs_rbd.c

I like [fixed a word here] this approach, but tgt is being replaced with the upstream LIO target in most distros. LIO was all kernel based, and so that was one reason I was focusing on krbd support. I also felt performance would be better and it would be a easier implementation.

jdurgin · 2016-03-17T17:59:06Z

I also recall someone thinking a generic op would be useful, but can't find where. A class op can still be reused easily though - it also makes testing easier (no need to plumb through new client calls in userspace either).

So new class methods (don't really care if they're cls_rbd or a new cls) sound good to me.

ddiss · 2016-03-18T01:00:24Z

These changes both look good to me:
Reviewed-by: David Disseldorp ddiss@suse.de

Though I'm not against implementing these as class operation, I think a generic cmpext op will be useful outside of just RBD - atomic compare-and-swap operations are already possible using xattrs and OMAP values. Allowing the same functionality with object data is complementary to this IMO.

If extra tests are needed for these operations, I'll gladly put my hand up.

jdurgin · 2016-03-18T02:07:11Z

class ops are just as reusable as new rados ops. I don't see much reason to add more rados ops (like these) that could be implemented as class ops directly. class ops are easier to write, and can be used without bloating inferfaces - no need to plumb new ops through the Objecter and librados on the client side, for example.

For testing this, we'll want to update src/test/librados/(maybe misc.cc?) so we can detect regressions without going through extra layers like an application on top of iscsi.

dillaman · 2016-03-18T15:17:31Z

I too remember this discussion last year -- I thought it was @liewegas who suggested they be ops as opposed to Mike's original implementation using class methods.

liewegas · 2016-03-28T13:16:18Z

src/osd/ReplicatedPG.cc

+      tracepoint(osd, do_osd_op_pre_writesame, soid.oid.name.c_str(), soid.snap.val, oi.size, op.writesame.offset, op.writesame.length, op.writesame.data_length);
+
+     result = do_writesame(ctx, osd_op);
+     break;


nit: indentation looks wonky?

liewegas · 2016-03-28T13:19:02Z

Rados ops seem fine to me, but yeah, a class would also work. @athanatos ?

liewegas · 2016-03-28T13:19:59Z

We need functional tests in test/librados/misc.cc that verify these work as expected.

athanatos · 2016-03-28T17:34:28Z

Rados ops seem reasonable. I'm ok either way on that part.

+1 on the functional tests in test/librados/misc.cc

I'd also like ceph_test_rados (src/test/osd/RadosModel.h and src/test/osd/TestRados.cc) extended with these operations. CMPEXT can probably be tested as part of the normal write op by randomly adding a CMPEXT which should succeed to one of the writes and sometimes (rand() % 3 or something) adding a CMPEXT that should fail with a write that would be noticed later in read if it happened (0's or something). WRITESAME should probably be its own op (Object.h will need to be changed a bit).

wonzhq · 2016-03-30T09:02:20Z

@mikechristie why the CMPEXT op just does the comparison? I think it should do a write if the comparison result is equal.

ddiss · 2016-03-30T09:11:16Z

@wonzhq: similar to the xattr ops, atomic test-and-set updates are handled on the client side by issuing compound cmpext+write requests.

yuyuyu101 · 2016-04-04T12:53:33Z

@mikechristie are you still working on this? If target to jewel and no much free time, I would like to help

liewegas · 2016-04-04T16:53:12Z

@yuyuyu101 we'd definitely appreciate help, especially with the ceph_test_rados changes.

mikechristie · 2016-04-04T16:59:19Z

@yuyuyu101, I might have messed up on my tests. I will push my current branch later today, and then we can see what still needs to be done and you can help with any of it.

yuyuyu101 · 2016-04-05T05:44:44Z

@mikechristie great, look forward to this

ddiss · 2016-04-06T17:33:34Z

Hi,

I hope I'm not duplicating work that Mike has already done, but I've pushed a preliminary writesame librados API and some basic ceph_test_rados_api_misc tests to https://github.com/ddiss/ceph/tree/mc-wip-osd-vaai
It's all atop Mike's existing OSD changes.

I'll continue adding writesame tests, and work on a cmpext API today/tomorrow. Preliminary feedback would be much appreciated.

liewegas · 2016-04-06T19:57:51Z

@ddiss looks good!

mikechristie · 2016-04-06T21:41:39Z

Hey ddiss,

A little overlap. Thanks for working on it!.

I am having trouble pushing my branch. Just so I can get it out to you and yuyuyu101, I just made a new branch wip-osd-vaai2.

Here is a link:
https://github.com/mikechristie/ceph/commits/wip-osd-vaai2

It has the beginnings of cmpext librados support and tests. I did not do the C API cmpext stuff yet.

ddiss · 2016-04-06T22:31:27Z

Taking a look now - thanks Mike!

mikechristie · 2016-04-07T00:52:15Z

Hey ddiss,

You do not have to actually waste your time and review the ws/cmp rados patches. Just use them for a guide if you have not started the cmpext stuff. I do not think some of the patches build against the tree I uploaded against. I just pushed them incase you wanted to look at them while you were working on this.

I think your ws patches are more complete so we should go forward with yours.

ddiss · 2016-04-07T14:19:24Z

I've pushed a new round of writesame changes to my branch, with improved test coverage (aio, etc.). I'm now expanding on Mike's cmpext API changes and tests.

jdurgin · 2016-04-28T21:32:26Z

closing this pr since it's been superseded by #8568 and a future cmpext patchset

Mike Christie added 2 commits March 15, 2016 17:14

gregsfortytwo added feature core labels Mar 17, 2016

jdurgin added the rbd label Mar 17, 2016

jdurgin added this to the jewel milestone Mar 24, 2016

liewegas reviewed Mar 28, 2016
View reviewed changes

liewegas changed the title ~~Wip osd ESX vaai v3~~ osd: add cmpext and writesame ops Apr 7, 2016

ddiss mentioned this pull request Apr 13, 2016

OSD and librados writesame support #8568

Merged

liewegas modified the milestones: kraken, jewel Apr 15, 2016

jdurgin closed this Apr 28, 2016

ddiss mentioned this pull request May 3, 2016

osd,librados: cmpext support #8911

Closed

Conversation

mikechristie commented Mar 17, 2016

Uh oh!

yuyuyu101 commented Mar 17, 2016

Uh oh!

gregsfortytwo commented Mar 17, 2016

Uh oh!

yuyuyu101 commented Mar 17, 2016

Uh oh!

mikechristie commented Mar 17, 2016

Uh oh!

gregsfortytwo commented Mar 17, 2016

Uh oh!

mikechristie commented Mar 17, 2016

Uh oh!

yuyuyu101 commented Mar 17, 2016

Uh oh!

mikechristie commented Mar 17, 2016

Uh oh!

jdurgin commented Mar 17, 2016

Uh oh!

ddiss commented Mar 18, 2016

Uh oh!

jdurgin commented Mar 18, 2016

Uh oh!

dillaman commented Mar 18, 2016

Uh oh!

liewegas Mar 28, 2016

Choose a reason for hiding this comment

Uh oh!

liewegas commented Mar 28, 2016

Uh oh!

liewegas commented Mar 28, 2016

Uh oh!

athanatos commented Mar 28, 2016

Uh oh!

wonzhq commented Mar 30, 2016

Uh oh!

ddiss commented Mar 30, 2016

Uh oh!

yuyuyu101 commented Apr 4, 2016

Uh oh!

liewegas commented Apr 4, 2016

Uh oh!

mikechristie commented Apr 4, 2016

Uh oh!

yuyuyu101 commented Apr 5, 2016

Uh oh!

ddiss commented Apr 6, 2016

Uh oh!

liewegas commented Apr 6, 2016

Uh oh!

mikechristie commented Apr 6, 2016

Uh oh!

ddiss commented Apr 6, 2016

Uh oh!

mikechristie commented Apr 7, 2016

Uh oh!

ddiss commented Apr 7, 2016

Uh oh!

jdurgin commented Apr 28, 2016

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

9 participants