Skip to content

OSD and librados writesame support#8568

Merged
jdurgin merged 7 commits intoceph:masterfrom
ddiss:osd-librados-writesame
Apr 28, 2016
Merged

OSD and librados writesame support#8568
jdurgin merged 7 commits intoceph:masterfrom
ddiss:osd-librados-writesame

Conversation

@ddiss
Copy link
Contributor

@ddiss ddiss commented Apr 13, 2016

This is a follow up to #8175 , which adds a librados API and tests atop the OSD writesame implementation from @mikechristie. I'll submit the cmpext support separately.
A few things potentially worth changing:

  • allow for librados zero-length writesame requests
  • use .*write_same() instead of *writesame(), to be consistent with write_full()
  • add librados striper API and tests (?)

@@ -256,6 +256,9 @@ extern const char *ceph_osd_state_name(int s);
f(CACHE_PIN, __CEPH_OSD_OP(WR, DATA, 36), "cache-pin") \
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should get rid of the 'cherry-picked from' in the commit message, since that ref won't exist after this one is merged

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops, will drop that in the next round.

@jdurgin
Copy link
Member

jdurgin commented Apr 14, 2016

looking good, nice a thorough on the various forms of librados i/o operations. writesame as the method name is fine with me. A libradosstriper api isn't necessary.

The one other test @athanatos was interested in seeing was for ceph_test_rados - extending src/test/osd/RadosModel.h to add writesame ops.

case CEPH_OSD_OP_WRITESAME:
++ctx->num_write;
tracepoint(osd, do_osd_op_pre_writesame, soid.oid.name.c_str(), soid.snap.val, oi.size, op.writesame.offset, op.writesame.length, op.writesame.data_length);
result = do_writesame(ctx, osd_op);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

write_update_size_and_usage?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do_writesame() recurses with do_osd_ops(op = CEPH_OSD_OP_WRITE), so I don't think an extra write_update_size_and_usage() call is needed here.

@yuyuyu101
Copy link
Member

👍

* @returns 0 on success, negative error code on failure
*/
CEPH_RADOS_API int rados_writesame(rados_ioctx_t io, const char *oid,
const char *buf, size_t data_len,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggest we change write_len to write_count, so we don't need to care about the data_len align with write_len. And it will be more intuitionistic to caller to reduce potential mistake.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Am a little undecided here, as we may choose to drop the alignment requirement in future.
I'll leave the wire protocol as is, and rework the client API to take a write_count multiplier.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you think we might drop the alignment requirement I'd leave the api as-is, so we don't need a new version of the function to handle that

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. The new round I just pushed uses the same data_len + write_len API.

@ddiss
Copy link
Contributor Author

ddiss commented Apr 14, 2016

@jdurgin and @yuyuyu101: thanks a lot for the feedback, will integrate your suggested changes and re-push.

@ddiss ddiss force-pushed the osd-librados-writesame branch from b62575b to 535e6fe Compare April 14, 2016 16:03
@ddiss
Copy link
Contributor Author

ddiss commented Apr 22, 2016

Ping - did anyone get a chance to look at the latest round of changes?

@jdurgin
Copy link
Member

jdurgin commented Apr 22, 2016

Yes, it's looking good. Ran it through the rados suite last week, and it passed, but I've been traveling this week.

Will look at merging next week when I'm at a real computer.

*/
int write_full(const std::string& oid, bufferlist& bl);
int writesame(const std::string& oid, bufferlist& bl,
size_t data_len, size_t write_len,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why would we have data_len? could we assume bl.length == data_len?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that would be a valid assumption.
I added the data_len parameter to be consistent with write(const std::string& oid, bufferlist& bl, size_t len, uint64_t off) here. I can remove the parameter if desired.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, I think it could remove redundant argument check

Copy link
Contributor Author

@ddiss ddiss Apr 22, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will remove for the next round

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sounds good to me. Not sure why we have the len parameter for write().

Mike Christie and others added 7 commits April 25, 2016 15:07
This adds a new ceph request writesame that writes a buffer of length
writesame.data_length bytes at writesame.offset over
writesame.length bytes.

This command maps to SCSI's WRITE SAME request, so users like LIO+rbd
can pass this to the OSD. Right now, it only saves having to transfer
writesame.length bytes over the network, but future versions will be
to fully offload it by passing it directly to the FS/devices if they
support it.

v2:
- Fix tab/spaces to matching coding style.
- Allow zero write length. Check for invalid data lengths.

Signed-off-by: Mike Christie <mchristi@redhat.com>
Reviewed-by: David Disseldorp <ddiss@suse.de>
The writesame operation allows callers to write the same data buffer
multiple times to a given object.

Signed-off-by: David Disseldorp <ddiss@suse.de>
Test the new ioctx.writesame() and rados_writesame() API functions.

Signed-off-by: David Disseldorp <ddiss@suse.de>
Write a buffer a number of times using writesame, read the full range
back, and check that it matches. Do this using rados_aio_writesame(),
ioctx.aio_writesame() and ioctx.aio_operate(op.writesame()).

Signed-off-by: David Disseldorp <ddiss@suse.de>
Signed-off-by: David Disseldorp <ddiss@suse.de>
Signed-off-by: David Disseldorp <ddiss@suse.de>
Can be tested via "--op writesame". Requests are currently dispatched
*without* a multiplication factor, i.e. data_len == write_len.

Signed-off-by: David Disseldorp <ddiss@suse.de>
@ddiss ddiss force-pushed the osd-librados-writesame branch from 535e6fe to 0098319 Compare April 25, 2016 14:04
@ddiss
Copy link
Contributor Author

ddiss commented Apr 25, 2016

Changes since last version:

  • removed duplicate parameter checks
  • removed redundant data_len parameter

Feedback appreciated

@yuyuyu101
Copy link
Member

good to me, thanks!

@jdurgin
Copy link
Member

jdurgin commented Apr 26, 2016

lgtm too

write_length -= op.writesame.data_length;
}

write_op.op.op = CEPH_OSD_OP_WRITE;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, I think we need to use CEPH_OSD_OP_ZERO instead. Otherwise, writesame op doesn't actually improve vmware zero operation performance.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mean for writezero the special but common case, we need to handle this properly..

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think that kind of optimisation belongs here - IMO it would be much better handled on the client (i.e. iSCSI gateway). That way it could avoid sending the (zeroed) buffer in the first place.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no, I mean ObjectStore support zero operation like fallocate api. It will much better than write zero bufferlist.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can't follow your meaning, what's the meaning of "That way it could avoid sending the (zeroed) buffer in the first place."

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think David is suggesting that the client should send a zero osd op instead of using writesame to zero out an object.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jdurgin: Exactly.

@yuyuyu101: The (current) use-case for CEPH_OSD_OP_WRITESAME is to offload handling of SCSI WRITE SAME I/O to the OSDs.
SCSI WRITE SAME can be used by a SCSI initiator in a couple of ways:

  1. write an arbitrary buffer multiple times to disk
  2. trim/discard a range on a thin-provisioned disk (UNMAP=1)

For type (2) WRITE SAME handling, the kernel client uses CEPH_OSD_OP_DELETE, CEPH_OSD_OP_TRUNCATE or CEPH_OSD_OP_ZERO to implement the trim/discard behaviour.

For type (1) WRITE SAME handling, CEPH_OSD_OP_WRITESAME will initially be used. CEPH_OSD_OP_ZERO could be used later to optimise the case where the buffer is all zeros.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants