Further EC partial stripe read fixes #59480
Conversation
jenkins test make check

jenkins test api
idryomov
left a comment
It sounds like we are no longer planning to disable the optimization for multi-region reads? If so, I'd suggest merging #58749 and rebasing this on top -- I stumbled for a bit on mismatching diffstat numbers caused by Radek's commit being included here as I was pulling the new fix into my testing branch.
idryomov
left a comment
I'll run more tests, but based on
https://pulpito.ceph.com/dis-2024-08-28_23:16:38-rbd-wip-dis-testing-2-distro-default-smithi/
https://pulpito.ceph.com/dis-2024-08-28_23:17:42-rbd-wip-dis-testing-2-distro-default-smithi/
the issue that I brought up in #58749 (comment) is fixed!
src/osd/ECBackend.cc
Outdated

```diff
 ceph_assert(
   (offset + length) <=
-  (range.first.get_off() + range.first.get_len()));
+  (range_offset + range_length));
```

This assert would fit on a single line now (and parens on both sides of `<=` seem redundant).
src/osd/ECCommon.cc
Outdated

```cpp
uint64_t chunk_size = read_pipeline.sinfo.get_chunk_size();
uint64_t trim_offset = 0;
for (auto shard: wanted_to_read) {
  if (shard * chunk_size < read_offset_in_stripe) {
```
Should this be <=? Otherwise, if read_offset_in_stripe is at the shard boundary, it seems like the preceding shard would not be trimmed.
The code works and `<` is correct, but only because read.offset is always aligned to a shard boundary and hence read_offset_in_stripe is also aligned. This means we are comparing the start offset of each chunk that has been read against the start offset of the chunk we want to keep: if it is `<` we need to trim it, if it is `=` we want to keep it.
The alignment to shard boundaries (either to chunk boundaries for partial reads or to stripe boundaries otherwise) occurs in ECBackend::objects_read_async, which also has the completion callback that trims the buffer back to the required size. Overwrites and reconstructs also issue reads, but these are always stripe aligned.
However, the bounds and aligned variables above imply that this code is intended to cope with misaligned ranges, even if that currently never happens. If read_offset_in_stripe were not aligned then this code would incorrectly trim a shard that contains data we want. So I think read_offset_in_stripe should be renamed to aligned_offset_in_stripe and calculated from aligned.first rather than read.offset.
> If `read_offset_in_stripe` was not aligned then this code would incorrectly trim a shard that contains data we want.
Heh, I actually wrote (almost) that but then didn't like the wording, incorrectly inverted it and ended up saying that the "at the shard boundary" case is wrong instead of the "in the middle of the shard" case. Mea culpa.
src/osd/ECCommon.cc
Outdated

```cpp
  }
}
auto off = read.offset + trim_offset - aligned.first;
ceph_assert(read.size <= bl.length() - off);
```
I'd suggest moving this assert after the dout so that off and other values are logged before the potential crash.
src/osd/ECCommon.cc
Outdated

```cpp
}
auto off = read.offset + trim_offset - aligned.first;
ceph_assert(read.size <= bl.length() - off);
dout(20) << __func__ << "bl.length()=" << bl.length()
```
```diff
-dout(20) << __func__ << "bl.length()=" << bl.length()
+dout(20) << __func__ << " bl.length()=" << bl.length()
```
src/osd/ECCommon.cc
Outdated

```cpp
  read_pipeline.sinfo.logical_to_prev_stripe_offset(read.offset);
uint64_t chunk_size = read_pipeline.sinfo.get_chunk_size();
uint64_t trim_offset = 0;
for (auto shard: wanted_to_read) {
```
Range-based for loops are usually formatted with spaces on both sides of the colon:
```diff
-for (auto shard: wanted_to_read) {
+for (auto shard : wanted_to_read) {
```
Force-pushed from 92bb888 to b61da58
Updated version addressing @idryomov review comments

jenkins test make check
jenkins test api
This pull request has been automatically marked as stale because it has not had any activity for 60 days. It will be closed if no further activity occurs for another 30 days.

@bill-scales #58749 merged a while ago, could you please rebase this?
Force-pushed from b61da58 to 9688dc2
@idryomov - rebased on main. Passes all the tests in ceph_test_rados_io_sequence, which includes tests that submit ops with multiple partial reads.

@JonBailey1993 has a new version of ceph_test_rados_io_sequence that now includes tests for different EC plugins, and in particular for LRC, which can shuffle the order of the shards using a chunk_mapping lookup. This has found a bug where this fix does not take the chunk_mapping into account. I've got a fix that I'll tidy up and push tomorrow.
* More fixes for multi-region reads
* Enable partial stripe reads that span stripes

Signed-off-by: Bill Scales <bill_scales@uk.ibm.com>

Tidy up EC code. Extend stripe_info_t to store K, M, a complete mapping for shard->raw_shard and the reverse mapping raw_shard->shard.

Signed-off-by: Bill Scales <bill_scales@uk.ibm.com>

LRC plugin uses a mapping from raw_shard to shard; all other EC plugins have an identity mapping. Partial reads need to convert shard back to a raw_shard when calculating how to trim the buffers that were read.

Signed-off-by: Bill Scales <bill_scales@uk.ibm.com>
Force-pushed from 9688dc2 to 701d8d0
@rzarzynski Bill refreshed the PR, who would be the designated reviewer from @ceph/core here? I'm asking because pulling this into my integration branches to avoid spurious failures in the RBD suite is getting old ;)

@bill-scales, @rzarzynski mentioned you may have another fix for this PR. Is there more work you want to do here before we merge?

@markhpc, @rzarzynski I'd like to get this reviewed and merged, especially as what is currently in main has a bug that Ilya keeps tripping over with encrypted rbd images. Our EC optimizations work rewrites some of this code and has some further optimizations for reads, but that isn't ready to merge yet.

I'll look at this tomorrow.
athanatos
left a comment
LGTM, seems good to merge once tested (probably already done?). @rzarzynski do you want to weigh in first?
Looks like @idryomov already tested this?

I (continue to) include this PR in my RBD runs and I haven't seen crashes raised in #58749 (comment) since. So definitely +1 from me, but I'm not sure if that counts as "tested" for the purposes of RADOS.

Adding the needs-qa flag here. We've got an approval from Sam and a partial approval from Ilya, that's good enough for me. Feel free to remove if anyone disagrees. :)
osd: further EC partial stripe read fixes
ECUtil::stripe_info_t::chunk_aligned_offset_len_to_chunk is fixed to return the correct span of a shard that needs to be read to obtain all the data for a specified range of the object. Previously it did not read enough data if the range spanned multiple stripes and was not stripe aligned.

finish_single_request is fixed to trim the data that is read. For partial reads of more than one stripe, the set of shards that is read (wanted_to_read) is the union of the requirements of each range. This means there can be excess data that needs to be trimmed.
Example 1: chunksize = 4K, 2+2 profile. Read offset 4K for length 8K.

Example 2: chunksize = 4K, 5+2 profile. Read offset 8K for 4K, offset 20K for 4K, offset 56K for 4K.

* The stripes needed from each shard are:
  * offset 0, length 4K (first stripe) from each shard for 1st read
  * offset 4K, length 4K (second stripe) from each shard for 2nd read
  * offset 8K, length 4K (third stripe) from each shard for 3rd read
* wanted_to_read is the union of the requirements:
  * stripe 0, shards 0, 2 and 4
  * stripe 1, shards 0, 2 and 4
  * stripe 2, shards 0, 2 and 4
* The data for each read sits in the assembled buffer at:
  * offset=4K (stripe 0, shard 2), length 4K for 1st read
  * offset=12K (stripe 1, shard 0), length 4K for 2nd read
  * offset=32K (stripe 2, shard 4), length 4K for 3rd read
Unit tests have been updated. Core has no good test tool for validating multi-region reads; these regressions were found by librbd LUKS tests. A new test tool designed to perform better validation of read and write I/O to erasure-coded pools will be delivered in another PR; this change has been tested and debugged with an early version of that tool.