OSD and librados writesame support#8568
Conversation
| @@ -256,6 +256,9 @@ extern const char *ceph_osd_state_name(int s); | |||
| f(CACHE_PIN, __CEPH_OSD_OP(WR, DATA, 36), "cache-pin") \ | |||
There was a problem hiding this comment.
should get rid of the 'cherry-picked from' in the commit message, since that ref won't exist after this one is merged
There was a problem hiding this comment.
Oops, will drop that in the next round.
|
looking good, nice a thorough on the various forms of librados i/o operations. writesame as the method name is fine with me. A libradosstriper api isn't necessary. The one other test @athanatos was interested in seeing was for ceph_test_rados - extending src/test/osd/RadosModel.h to add writesame ops. |
| case CEPH_OSD_OP_WRITESAME: | ||
| ++ctx->num_write; | ||
| tracepoint(osd, do_osd_op_pre_writesame, soid.oid.name.c_str(), soid.snap.val, oi.size, op.writesame.offset, op.writesame.length, op.writesame.data_length); | ||
| result = do_writesame(ctx, osd_op); |
There was a problem hiding this comment.
do_writesame() recurses with do_osd_ops(op = CEPH_OSD_OP_WRITE), so I don't think an extra write_update_size_and_usage() call is needed here.
|
👍 |
| * @returns 0 on success, negative error code on failure | ||
| */ | ||
| CEPH_RADOS_API int rados_writesame(rados_ioctx_t io, const char *oid, | ||
| const char *buf, size_t data_len, |
There was a problem hiding this comment.
I suggest we change write_len to write_count, so we don't need to care about the data_len align with write_len. And it will be more intuitionistic to caller to reduce potential mistake.
There was a problem hiding this comment.
Am a little undecided here, as we may choose to drop the alignment requirement in future.
I'll leave the wire protocol as is, and rework the client API to take a write_count multiplier.
There was a problem hiding this comment.
If you think we might drop the alignment requirement I'd leave the api as-is, so we don't need a new version of the function to handle that
There was a problem hiding this comment.
Thanks. The new round I just pushed uses the same data_len + write_len API.
|
@jdurgin and @yuyuyu101: thanks a lot for the feedback, will integrate your suggested changes and re-push. |
b62575b to
535e6fe
Compare
|
Ping - did anyone get a chance to look at the latest round of changes? |
|
Yes, it's looking good. Ran it through the rados suite last week, and it passed, but I've been traveling this week. Will look at merging next week when I'm at a real computer. |
src/include/rados/librados.hpp
Outdated
| */ | ||
| int write_full(const std::string& oid, bufferlist& bl); | ||
| int writesame(const std::string& oid, bufferlist& bl, | ||
| size_t data_len, size_t write_len, |
There was a problem hiding this comment.
why would we have data_len? could we assume bl.length == data_len?
There was a problem hiding this comment.
Yes, that would be a valid assumption.
I added the data_len parameter to be consistent with write(const std::string& oid, bufferlist& bl, size_t len, uint64_t off) here. I can remove the parameter if desired.
There was a problem hiding this comment.
yes, I think it could remove redundant argument check
There was a problem hiding this comment.
Will remove for the next round
There was a problem hiding this comment.
sounds good to me. Not sure why we have the len parameter for write().
This adds a new ceph request writesame that writes a buffer of length writesame.data_length bytes at writesame.offset over writesame.length bytes. This command maps to SCSI's WRITE SAME request, so users like LIO+rbd can pass this to the OSD. Right now, it only saves having to transfer writesame.length bytes over the network, but future versions will be to fully offload it by passing it directly to the FS/devices if they support it. v2: - Fix tab/spaces to matching coding style. - Allow zero write length. Check for invalid data lengths. Signed-off-by: Mike Christie <mchristi@redhat.com> Reviewed-by: David Disseldorp <ddiss@suse.de>
The writesame operation allows callers to write the same data buffer multiple times to a given object. Signed-off-by: David Disseldorp <ddiss@suse.de>
Test the new ioctx.writesame() and rados_writesame() API functions. Signed-off-by: David Disseldorp <ddiss@suse.de>
Write a buffer a number of times using writesame, read the full range back, and check that it matches. Do this using rados_aio_writesame(), ioctx.aio_writesame() and ioctx.aio_operate(op.writesame()). Signed-off-by: David Disseldorp <ddiss@suse.de>
Signed-off-by: David Disseldorp <ddiss@suse.de>
Signed-off-by: David Disseldorp <ddiss@suse.de>
Can be tested via "--op writesame". Requests are currently dispatched *without* a multiplication factor, i.e. data_len == write_len. Signed-off-by: David Disseldorp <ddiss@suse.de>
535e6fe to
0098319
Compare
|
Changes since last version:
Feedback appreciated |
|
good to me, thanks! |
|
lgtm too |
| write_length -= op.writesame.data_length; | ||
| } | ||
|
|
||
| write_op.op.op = CEPH_OSD_OP_WRITE; |
There was a problem hiding this comment.
Hmm, I think we need to use CEPH_OSD_OP_ZERO instead. Otherwise, writesame op doesn't actually improve vmware zero operation performance.
There was a problem hiding this comment.
I mean for writezero the special but common case, we need to handle this properly..
There was a problem hiding this comment.
I don't think that kind of optimisation belongs here - IMO it would be much better handled on the client (i.e. iSCSI gateway). That way it could avoid sending the (zeroed) buffer in the first place.
There was a problem hiding this comment.
no, I mean ObjectStore support zero operation like fallocate api. It will much better than write zero bufferlist.
There was a problem hiding this comment.
I can't follow your meaning, what's the meaning of "That way it could avoid sending the (zeroed) buffer in the first place."
There was a problem hiding this comment.
I think David is suggesting that the client should send a zero osd op instead of using writesame to zero out an object.
There was a problem hiding this comment.
@jdurgin: Exactly.
@yuyuyu101: The (current) use-case for CEPH_OSD_OP_WRITESAME is to offload handling of SCSI WRITE SAME I/O to the OSDs.
SCSI WRITE SAME can be used by a SCSI initiator in a couple of ways:
- write an arbitrary buffer multiple times to disk
- trim/discard a range on a thin-provisioned disk (UNMAP=1)
For type (2) WRITE SAME handling, the kernel client uses CEPH_OSD_OP_DELETE, CEPH_OSD_OP_TRUNCATE or CEPH_OSD_OP_ZERO to implement the trim/discard behaviour.
For type (1) WRITE SAME handling, CEPH_OSD_OP_WRITESAME will initially be used. CEPH_OSD_OP_ZERO could be used later to optimise the case where the buffer is all zeros.
This is a follow up to #8175 , which adds a librados API and tests atop the OSD writesame implementation from @mikechristie. I'll submit the cmpext support separately.
A few things potentially worth changing: