Skip to content

Further EC partial stripe read fixes#59480

Merged
bill-scales merged 3 commits intoceph:mainfrom
bill-scales:ec_partial_read
Jan 30, 2025
Merged

Further EC partial stripe read fixes#59480
bill-scales merged 3 commits intoceph:mainfrom
bill-scales:ec_partial_read

Conversation

@bill-scales
Copy link
Contributor

osd: further EC partial stripe read fixes

  • More fixes for multi-region reads
  • Enable partial stripe reads that span stripes

ECUtil::stripe_info_t::chunk_aligned_offset_len_to_chunk is fixed to return the correct span of a shard that needs to be read to obtain all the data for a specified range of the object. Previously this did not read enough data if the range was multiple stripes and was not stripe aligned.

finish_single_request is fixed to trim the data that is read. For partial reads of more than one stripe the shards that are read (wanted_to_read) is a union of the requirements for each range. This means there can be excess data that needs to be trimmed.

Example 1: chunksize = 4K, 2+2 profile. read offset 4K for length 8K

  • chunk_aligned_offset_len_to_chunk calculates that we need to read offset 0, length 8K (first two stripes) from each shard
  • shard 0 and shard 1 need to be read so we end up with 2*8K = 16K of data
  • finish_single_request needs to trim the data read to remove the first 4K and last 4K

Example 2: chunksize = 4K, 5+2 profile. read offset 8K for 4K, offset 20K for 4K, offset 56K for 4K

  • chunk_aligned_offset_len_to_chunk calculates that we need to read:
    ** offset 0, length 4K (first stripe) from each shard for 1st read
    ** offset 4K, length 4K (second stripe) from each shard for 2nd read
    ** offset 8K, length 4K (third stripe) from each shard for 3rd read
  • shards 0, 2 and 4 need to be read so we end up with 3*12K = 36K of data
    ** stripe 0, shard 0,2 and 4
    ** stripe 1, shard 0,2 and 4
    ** stripe 2, shard 0,2 and 4
  • finish_single_request needs to trim the data read:
    ** offset=4K (stripe 0, shard 2), length 4K for 1st read
    ** offset=12K (stripe 1, shard 0), length 4K for 2nd read
    ** offset=32K (stripe 2, shard 4), length 4K for 3rd read

Unit tests have been updated. Core has no good test tool for validating multi-region reads, regressions were found by librbd LUKs tests. A new test tool designed to perform better validation of read and write I/O to erasure coded pools will be delivered by another PR, this change has been tested and debugged with an early version of that tool.

Contribution Guidelines

  • To sign and title your commits, please refer to Submitting Patches to Ceph.

  • If you are submitting a fix for a stable branch (e.g. "quincy"), please refer to Submitting Patches to Ceph - Backports for the proper workflow.

  • When filling out the below checklist, you may click boxes directly in the GitHub web UI. When entering or editing the entire PR message in the GitHub web UI editor, you may also select a checklist item by adding an x between the brackets: [x]. Spaces and capitalization matter when checking off items this way.

Checklist

  • Tracker (select at least one)
    • References tracker ticket
    • Very recent bug; references commit where it was introduced
    • New feature (ticket optional)
    • Doc update (no ticket needed)
    • Code cleanup (no ticket needed)
  • Component impact
    • Affects Dashboard, opened tracker ticket
    • Affects Orchestrator, opened tracker ticket
    • No impact that needs to be tracked
  • Documentation (select at least one)
    • Updates relevant documentation
    • No doc update is appropriate
  • Tests (select at least one)
Show available Jenkins commands
  • jenkins retest this please
  • jenkins test classic perf
  • jenkins test crimson perf
  • jenkins test signed
  • jenkins test make check
  • jenkins test make check arm64
  • jenkins test submodules
  • jenkins test dashboard
  • jenkins test dashboard cephadm
  • jenkins test api
  • jenkins test docs
  • jenkins render docs
  • jenkins test ceph-volume all
  • jenkins test ceph-volume tox
  • jenkins test windows
  • jenkins test rook e2e

@idryomov
Copy link
Contributor

jenkins test make check

@idryomov
Copy link
Contributor

jenkins test api

Copy link
Contributor

@idryomov idryomov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It sounds like we are no longer planning to disable the optimization for multi-region reads? If so, I'd suggest merging #58749 and rebasing this on top -- I stumbled for a bit on mismatching diffstat numbers caused by Radek's commit being included here as I was pulling the new fix into my testing branch.

Copy link
Contributor

@idryomov idryomov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ceph_assert(
(offset + length) <=
(range.first.get_off() + range.first.get_len()));
(range_offset + range_length));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This assert would fit on a single line now (and parens on both sides of <= seem redundant).

uint64_t chunk_size = read_pipeline.sinfo.get_chunk_size();
uint64_t trim_offset = 0;
for (auto shard: wanted_to_read) {
if (shard * chunk_size < read_offset_in_stripe) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be <=? Otherwise, if read_offset_in_stripe is at the shard boundary, it seems like the preceding shard would not be trimmed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code works and < is correct, but only because read.offset is always aligned to a shard boundary and hence read_offset_in_stripe is also aligned. This means we are comparing the start offset of each chunk that has been read to the start offset of the chunk we want to keep - if it is < then we need to trim it, if it is = we want to keep it.

The alignment to shard boundaries - either to chunk boundaries for partial reads or to stripe boundaries otherwise occurs in ECBackend::objects_read_async (which also has the completion callback that trims the buffer back to the required size). Overwrites and reconstructs also issue reads, but these are always stripe aligned.

But the bounds and aligned variables above implies that this code is intended to cope with misaligned ranges, even if this currently never happens. If read_offset_in_stripe was not aligned then this code would incorrectly trim a shard that contains data we want. So I think that read_offset_in_stripe should be renamed as aligned_offset_in_stripe and be calculated from aligned.first rather than read.offset

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If read_offset_in_stripe was not aligned then this code would incorrectly trim a shard that contains data we want.

Heh, I actually wrote (almost) that but then didn't like the wording, incorrectly inverted it and ended up saying that the "at the shard boundary" case is wrong instead of the "in the middle of the shard" case. Mea culpa.

}
}
auto off = read.offset + trim_offset - aligned.first;
ceph_assert(read.size <= bl.length() - off);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd suggest moving this assert after the dout so that off and other values are logged before the potential crash.

}
auto off = read.offset + trim_offset - aligned.first;
ceph_assert(read.size <= bl.length() - off);
dout(20) << __func__ << "bl.length()=" << bl.length()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
dout(20) << __func__ << "bl.length()=" << bl.length()
dout(20) << __func__ << " bl.length()=" << bl.length()

read_pipeline.sinfo.logical_to_prev_stripe_offset(read.offset);
uint64_t chunk_size = read_pipeline.sinfo.get_chunk_size();
uint64_t trim_offset = 0;
for (auto shard: wanted_to_read) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Range-based for loops are usually formatted with spaces on both sides of the colon:

Suggested change
for (auto shard: wanted_to_read) {
for (auto shard : wanted_to_read) {

@bill-scales
Copy link
Contributor Author

Updated version addressing @idryomov review comments

@bill-scales
Copy link
Contributor Author

jenkins test make check

1 similar comment
@bill-scales
Copy link
Contributor Author

jenkins test make check

@bill-scales
Copy link
Contributor Author

jenkins test api

@github-actions
Copy link

github-actions bot commented Nov 2, 2024

This pull request has been automatically marked as stale because it has not had any activity for 60 days. It will be closed if no further activity occurs for another 30 days.
If you are a maintainer or core committer, please follow-up on this pull request to identify what steps should be taken by the author to move this proposed change forward.
If you are the author of this pull request, thank you for your proposed contribution. If you believe this change is still appropriate, please ensure that any feedback has been addressed and ask for a code review.

@github-actions github-actions bot added the stale label Nov 2, 2024
@idryomov
Copy link
Contributor

@bill-scales #58749 merged a while ago, could you please rebase this?

@bill-scales
Copy link
Contributor Author

@idryomov - rebased on main. Passes all the tests in ceph_test_rados_io_sequence which includes tests that submit ops with multiple partial reads.

@bill-scales
Copy link
Contributor Author

@JonBailey1993 has a new version of ceph_test_rados_io_sequence that now includes tests for different EC plugins, and in particular for LRC which can shuffle the order of the shards using a chunk_mapping lookup. This has found a bug where this fix is not taking into account the chunk_mapping.

I've got a fix that I'll tidy up and push tomorrow.

* More fixes for multi-region reads
* Enable partial stripe reads that span stripes

Signed-off-by: Bill Scales <bill_scales@uk.ibm.com>
Tidy up EC code. Extend stripe_info_t to store K, M,
a complete mapping for shard->raw_shard and the
reverse mapping raw_shard->shard.

Signed-off-by: Bill Scales <bill_scales@uk.ibm.com>
LRC plugin uses a mapping from raw_shard to shard, all other
EC plugins have an identity mapping. Partial reads need to
convert shard back to a raw_shard when calculating how to
trim the buffers that were read.

Signed-off-by: Bill Scales <bill_scales@uk.ibm.com>
@bill-scales bill-scales requested a review from a team as a code owner November 29, 2024 21:04
@idryomov
Copy link
Contributor

idryomov commented Dec 13, 2024

@rzarzynski Bill refreshed the PR, who would be the designated reviewer from @ceph/core here? I'm asking because pulling this into my integration branches to avoid spurious failures in the RBD suite is getting old ;)

@markhpc
Copy link
Member

markhpc commented Dec 19, 2024

@bill-scales, @rzarzynski mentioned you may have another fix for this PR. Is there more work you want to do here before we merge?

@bill-scales
Copy link
Contributor Author

@bill-scales, @rzarzynski mentioned you may have another fix for this PR. Is there more work you want to do here before we merge?

@markhpc , @rzarzynski I'd like to get this reviewed and merged, especially as what is currently in main has a bug that Iyia keeps tripping over with encrypted rbd images. Our ec optimizations work re-writes some of this code and has some further optimizations for reads, but this isn't ready to merge yet

@athanatos
Copy link
Contributor

I'll look at this tomorrow.

Copy link
Contributor

@athanatos athanatos left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, seems good to merge once tested (probably already done?). @rzarzynski do you want to weigh in first?

@athanatos
Copy link
Contributor

Looks like @idryomov already tested this?

@idryomov
Copy link
Contributor

idryomov commented Jan 9, 2025

Looks like @idryomov already tested this?

I (continue to) include this PR in my RBD runs and I haven't seen crashes raised in #58749 (comment) since. So definitely +1 from me, but I'm not sure if that counts as "tested" for the purposes of RADOS.

@markhpc
Copy link
Member

markhpc commented Jan 9, 2025

Adding the needs-qa flag here. We've got an approval from Sam and a partial approval from Ilya, that's good enough for me. Feel free to remove if anyone disagrees. :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants