Further EC partial stripe read fixes by bill-scales · Pull Request #59480 · ceph/ceph

bill-scales · 2024-08-28T12:44:47Z

osd: further EC partial stripe read fixes

More fixes for multi-region reads
Enable partial stripe reads that span stripes

ECUtil::stripe_info_t::chunk_aligned_offset_len_to_chunk is fixed to return the correct span of a shard that needs to be read to obtain all the data for a specified range of the object. Previously this did not read enough data if the range was multiple stripes and was not stripe aligned.

finish_single_request is fixed to trim the data that is read. For partial reads of more than one stripe the shards that are read (wanted_to_read) is a union of the requirements for each range. This means there can be excess data that needs to be trimmed.

Example 1: chunksize = 4K, 2+2 profile. read offset 4K for length 8K

chunk_aligned_offset_len_to_chunk calculates that we need to read offset 0, length 8K (first two stripes) from each shard
shard 0 and shard 1 need to be read so we end up with 2*8K = 16K of data
finish_single_request needs to trim the data read to remove the first 4K and last 4K

Example 2: chunksize = 4K, 5+2 profile. read offset 8K for 4K, offset 20K for 4K, offset 56K for 4K

chunk_aligned_offset_len_to_chunk calculates that we need to read:
** offset 0, length 4K (first stripe) from each shard for 1st read
** offset 4K, length 4K (second stripe) from each shard for 2nd read
** offset 8K, length 4K (third stripe) from each shard for 3rd read
shards 0, 2 and 4 need to be read so we end up with 3*12K = 36K of data
** stripe 0, shard 0,2 and 4
** stripe 1, shard 0,2 and 4
** stripe 2, shard 0,2 and 4
finish_single_request needs to trim the data read:
** offset=4K (stripe 0, shard 2), length 4K for 1st read
** offset=12K (stripe 1, shard 0), length 4K for 2nd read
** offset=32K (stripe 2, shard 4), length 4K for 3rd read

Unit tests have been updated. Core has no good test tool for validating multi-region reads, regressions were found by librbd LUKs tests. A new test tool designed to perform better validation of read and write I/O to erasure coded pools will be delivered by another PR, this change has been tested and debugged with an early version of that tool.

Contribution Guidelines

To sign and title your commits, please refer to Submitting Patches to Ceph.
If you are submitting a fix for a stable branch (e.g. "quincy"), please refer to Submitting Patches to Ceph - Backports for the proper workflow.
When filling out the below checklist, you may click boxes directly in the GitHub web UI. When entering or editing the entire PR message in the GitHub web UI editor, you may also select a checklist item by adding an x between the brackets: [x]. Spaces and capitalization matter when checking off items this way.

Checklist

Tracker (select at least one)
- References tracker ticket
- Very recent bug; references commit where it was introduced
- New feature (ticket optional)
- Doc update (no ticket needed)
- Code cleanup (no ticket needed)
Component impact
- Affects Dashboard, opened tracker ticket
- Affects Orchestrator, opened tracker ticket
- No impact that needs to be tracked
Documentation (select at least one)
- Updates relevant documentation
- No doc update is appropriate
Tests (select at least one)
- Includes unit test(s)
- Includes integration test(s)
- Includes bug reproducer
- No tests

Show available Jenkins commands

jenkins retest this please
jenkins test classic perf
jenkins test crimson perf
jenkins test signed
jenkins test make check
jenkins test make check arm64
jenkins test submodules
jenkins test dashboard
jenkins test dashboard cephadm
jenkins test api
jenkins test docs
jenkins render docs
jenkins test ceph-volume all
jenkins test ceph-volume tox
jenkins test windows
jenkins test rook e2e

idryomov · 2024-08-28T19:36:33Z

jenkins test make check

idryomov · 2024-08-28T19:36:50Z

jenkins test api

idryomov

It sounds like we are no longer planning to disable the optimization for multi-region reads? If so, I'd suggest merging #58749 and rebasing this on top -- I stumbled for a bit on mismatching diffstat numbers caused by Radek's commit being included here as I was pulling the new fix into my testing branch.

idryomov

I'll run more tests, but based on

https://pulpito.ceph.com/dis-2024-08-28_23:16:38-rbd-wip-dis-testing-2-distro-default-smithi/
https://pulpito.ceph.com/dis-2024-08-28_23:17:42-rbd-wip-dis-testing-2-distro-default-smithi/

the issue that I brought up in #58749 (comment) is fixed!

idryomov · 2024-08-29T05:59:42Z

src/osd/ECBackend.cc

 	  ceph_assert(
 	    (offset + length) <=
-	    (range.first.get_off() + range.first.get_len()));
+	    (range_offset + range_length));


This assert would fit on a single line now (and parens on both sides of <= seem redundant).

idryomov · 2024-08-29T06:22:08Z

src/osd/ECCommon.cc

+      uint64_t chunk_size = read_pipeline.sinfo.get_chunk_size();
+      uint64_t trim_offset = 0;
+      for (auto shard: wanted_to_read) {
+	if (shard * chunk_size < read_offset_in_stripe) {


Should this be <=? Otherwise, if read_offset_in_stripe is at the shard boundary, it seems like the preceding shard would not be trimmed.

The code works and < is correct, but only because read.offset is always aligned to a shard boundary and hence read_offset_in_stripe is also aligned. This means we are comparing the start offset of each chunk that has been read to the start offset of the chunk we want to keep - if it is < then we need to trim it, if it is = we want to keep it.

The alignment to shard boundaries - either to chunk boundaries for partial reads or to stripe boundaries otherwise occurs in ECBackend::objects_read_async (which also has the completion callback that trims the buffer back to the required size). Overwrites and reconstructs also issue reads, but these are always stripe aligned.

But the bounds and aligned variables above implies that this code is intended to cope with misaligned ranges, even if this currently never happens. If read_offset_in_stripe was not aligned then this code would incorrectly trim a shard that contains data we want. So I think that read_offset_in_stripe should be renamed as aligned_offset_in_stripe and be calculated from aligned.first rather than read.offset

If read_offset_in_stripe was not aligned then this code would incorrectly trim a shard that contains data we want.

Heh, I actually wrote (almost) that but then didn't like the wording, incorrectly inverted it and ended up saying that the "at the shard boundary" case is wrong instead of the "in the middle of the shard" case. Mea culpa.

idryomov · 2024-08-29T06:23:01Z

src/osd/ECCommon.cc

+	}
+      }
+      auto off = read.offset + trim_offset - aligned.first;
+      ceph_assert(read.size <= bl.length() - off);


I'd suggest moving this assert after the dout so that off and other values are logged before the potential crash.

idryomov · 2024-08-29T06:24:37Z

src/osd/ECCommon.cc

+      }
+      auto off = read.offset + trim_offset - aligned.first;
+      ceph_assert(read.size <= bl.length() - off);
+      dout(20) << __func__ << "bl.length()=" << bl.length()


Suggested change

dout(20) << __func__ << "bl.length()=" << bl.length()

dout(20) << __func__ << " bl.length()=" << bl.length()

idryomov · 2024-08-29T06:26:39Z

src/osd/ECCommon.cc

+	read_pipeline.sinfo.logical_to_prev_stripe_offset(read.offset);
+      uint64_t chunk_size = read_pipeline.sinfo.get_chunk_size();
+      uint64_t trim_offset = 0;
+      for (auto shard: wanted_to_read) {


Range-based for loops are usually formatted with spaces on both sides of the colon:

Suggested change

for (auto shard: wanted_to_read) {

for (auto shard : wanted_to_read) {

bill-scales · 2024-09-02T10:11:44Z

Updated version addressing @idryomov review comments

bill-scales · 2024-09-02T16:08:34Z

jenkins test make check

bill-scales · 2024-09-03T11:58:08Z

jenkins test make check

bill-scales · 2024-09-03T15:39:00Z

jenkins test api

github-actions · 2024-11-02T16:01:44Z

This pull request has been automatically marked as stale because it has not had any activity for 60 days. It will be closed if no further activity occurs for another 30 days.
If you are a maintainer or core committer, please follow-up on this pull request to identify what steps should be taken by the author to move this proposed change forward.
If you are the author of this pull request, thank you for your proposed contribution. If you believe this change is still appropriate, please ensure that any feedback has been addressed and ask for a code review.

idryomov · 2024-11-25T16:00:16Z

@bill-scales #58749 merged a while ago, could you please rebase this?

bill-scales · 2024-11-27T15:35:35Z

@idryomov - rebased on main. Passes all the tests in ceph_test_rados_io_sequence which includes tests that submit ops with multiple partial reads.

bill-scales · 2024-11-28T18:28:28Z

@JonBailey1993 has a new version of ceph_test_rados_io_sequence that now includes tests for different EC plugins, and in particular for LRC which can shuffle the order of the shards using a chunk_mapping lookup. This has found a bug where this fix is not taking into account the chunk_mapping.

I've got a fix that I'll tidy up and push tomorrow.

* More fixes for multi-region reads * Enable partial stripe reads that span stripes Signed-off-by: Bill Scales <bill_scales@uk.ibm.com>

Tidy up EC code. Extend stripe_info_t to store K, M, a complete mapping for shard->raw_shard and the reverse mapping raw_shard->shard. Signed-off-by: Bill Scales <bill_scales@uk.ibm.com>

LRC plugin uses a mapping from raw_shard to shard, all other EC plugins have an identity mapping. Partial reads need to convert shard back to a raw_shard when calculating how to trim the buffers that were read. Signed-off-by: Bill Scales <bill_scales@uk.ibm.com>

idryomov · 2024-12-13T14:43:44Z

@rzarzynski Bill refreshed the PR, who would be the designated reviewer from @ceph/core here? I'm asking because pulling this into my integration branches to avoid spurious failures in the RBD suite is getting old ;)

markhpc · 2024-12-19T17:08:46Z

@bill-scales, @rzarzynski mentioned you may have another fix for this PR. Is there more work you want to do here before we merge?

bill-scales · 2024-12-19T18:33:10Z

@bill-scales, @rzarzynski mentioned you may have another fix for this PR. Is there more work you want to do here before we merge?

@markhpc , @rzarzynski I'd like to get this reviewed and merged, especially as what is currently in main has a bug that Iyia keeps tripping over with encrypted rbd images. Our ec optimizations work re-writes some of this code and has some further optimizations for reads, but this isn't ready to merge yet

athanatos · 2025-01-02T01:18:25Z

I'll look at this tomorrow.

athanatos

LGTM, seems good to merge once tested (probably already done?). @rzarzynski do you want to weigh in first?

athanatos · 2025-01-02T18:18:56Z

Looks like @idryomov already tested this?

idryomov · 2025-01-09T09:33:10Z

Looks like @idryomov already tested this?

I (continue to) include this PR in my RBD runs and I haven't seen crashes raised in #58749 (comment) since. So definitely +1 from me, but I'm not sure if that counts as "tested" for the purposes of RADOS.

markhpc · 2025-01-09T14:58:43Z

Adding the needs-qa flag here. We've got an approval from Sam and a partial approval from Ilya, that's good enough for me. Feel free to remove if anyone disagrees. :)

bill-scales requested a review from a team as a code owner August 28, 2024 12:44

github-actions bot added core tests labels Aug 28, 2024

bill-scales requested review from idryomov and rzarzynski August 28, 2024 12:47

idryomov reviewed Aug 28, 2024

View reviewed changes

idryomov reviewed Aug 29, 2024

View reviewed changes

idryomov added the bug-fix label Aug 29, 2024

markhpc added the performance label Aug 29, 2024

bill-scales force-pushed the ec_partial_read branch from 92bb888 to b61da58 Compare September 2, 2024 10:09

github-actions bot added the stale label Nov 2, 2024

idryomov removed the stale label Nov 25, 2024

bill-scales force-pushed the ec_partial_read branch from b61da58 to 9688dc2 Compare November 27, 2024 15:28

bill-scales added 3 commits November 29, 2024 10:08

osd: further EC partial stripe read fixes

ccf1229

* More fixes for multi-region reads * Enable partial stripe reads that span stripes Signed-off-by: Bill Scales <bill_scales@uk.ibm.com>

osd: EC: helper functions for chunk mapping

32f22f4

Tidy up EC code. Extend stripe_info_t to store K, M, a complete mapping for shard->raw_shard and the reverse mapping raw_shard->shard. Signed-off-by: Bill Scales <bill_scales@uk.ibm.com>

bill-scales force-pushed the ec_partial_read branch from 9688dc2 to 701d8d0 Compare November 29, 2024 21:04

bill-scales requested a review from a team as a code owner November 29, 2024 21:04

github-actions bot added the crimson label Nov 29, 2024

athanatos approved these changes Jan 2, 2025

View reviewed changes

markhpc added the needs-qa label Jan 9, 2025

bill-scales mentioned this pull request Jan 20, 2025

OSD: Refactor EC to have two versions: Legacy and New EC #61263

Merged

14 tasks

ljflores added the wip-yuri-testing label Jan 23, 2025

yuriw removed the wip-yuri-testing label Jan 26, 2025

bill-scales merged commit 69a936d into ceph:main Jan 30, 2025

	dout(20) << __func__ << "bl.length()=" << bl.length()
	dout(20) << __func__ << " bl.length()=" << bl.length()

	for (auto shard: wanted_to_read) {
	for (auto shard : wanted_to_read) {

Conversation

bill-scales commented Aug 28, 2024

Contribution Guidelines

Checklist

Uh oh!

idryomov commented Aug 28, 2024

Uh oh!

idryomov commented Aug 28, 2024

Uh oh!

idryomov left a comment

Choose a reason for hiding this comment

Uh oh!

idryomov left a comment

Choose a reason for hiding this comment

Uh oh!

idryomov Aug 29, 2024

Choose a reason for hiding this comment

Uh oh!

idryomov Aug 29, 2024

Choose a reason for hiding this comment

Uh oh!

bill-scales Aug 29, 2024

Choose a reason for hiding this comment

Uh oh!

idryomov Aug 29, 2024

Choose a reason for hiding this comment

Uh oh!

idryomov Aug 29, 2024

Choose a reason for hiding this comment

Uh oh!

idryomov Aug 29, 2024

Choose a reason for hiding this comment

Uh oh!

idryomov Aug 29, 2024

Choose a reason for hiding this comment

Uh oh!

bill-scales commented Sep 2, 2024

Uh oh!

bill-scales commented Sep 2, 2024

Uh oh!

bill-scales commented Sep 3, 2024

Uh oh!

bill-scales commented Sep 3, 2024

Uh oh!

github-actions bot commented Nov 2, 2024

Uh oh!

idryomov commented Nov 25, 2024

Uh oh!

bill-scales commented Nov 27, 2024

Uh oh!

bill-scales commented Nov 28, 2024

Uh oh!

idryomov commented Dec 13, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

markhpc commented Dec 19, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

bill-scales commented Dec 19, 2024

Uh oh!

athanatos commented Jan 2, 2025

Uh oh!

athanatos left a comment

Choose a reason for hiding this comment

Uh oh!

athanatos commented Jan 2, 2025

Uh oh!

idryomov commented Jan 9, 2025

Uh oh!

markhpc commented Jan 9, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

6 participants

idryomov commented Dec 13, 2024 •

edited

Loading

markhpc commented Dec 19, 2024 •

edited

Loading