
osd: EC Partial Stripe Reads (Retry of #23138 and #52746) #55196

Merged
rzarzynski merged 23 commits into ceph:main from rzarzynski:wip-osd-ec-partial-reads
Jul 10, 2024

Conversation

@rzarzynski
Contributor

This PR is a resurrection of #52746 rebased on top of #54930.

Contribution Guidelines

  • To sign and title your commits, please refer to Submitting Patches to Ceph.

  • If you are submitting a fix for a stable branch (e.g. "quincy"), please refer to Submitting Patches to Ceph - Backports for the proper workflow.

  • When filling out the below checklist, you may click boxes directly in the GitHub web UI. When entering or editing the entire PR message in the GitHub web UI editor, you may also select a checklist item by adding an x between the brackets: [x]. Spaces and capitalization matter when checking off items this way.

Checklist

  • Tracker (select at least one)
    • References tracker ticket
    • Very recent bug; references commit where it was introduced
    • New feature (ticket optional)
    • Doc update (no ticket needed)
    • Code cleanup (no ticket needed)
  • Component impact
    • Affects Dashboard, opened tracker ticket
    • Affects Orchestrator, opened tracker ticket
    • No impact that needs to be tracked
  • Documentation (select at least one)
    • Updates relevant documentation
    • No doc update is appropriate
  • Tests (select at least one)
Available Jenkins commands:
  • jenkins retest this please
  • jenkins test classic perf
  • jenkins test crimson perf
  • jenkins test signed
  • jenkins test make check
  • jenkins test make check arm64
  • jenkins test submodules
  • jenkins test dashboard
  • jenkins test dashboard cephadm
  • jenkins test api
  • jenkins test docs
  • jenkins render docs
  • jenkins test ceph-volume all
  • jenkins test ceph-volume tox
  • jenkins test windows
  • jenkins test rook e2e

@rzarzynski
Contributor Author

unittest_ecbackend explodes on g_conf() access in:

```
std::pair<uint64_t, uint64_t> ECUtil::stripe_info_t::aligned_offset_len_to_chunk(
  std::pair<uint64_t, uint64_t> in) const {
  // we need to align to stripe again to deal with partial chunk read.
  auto aligned =
    g_conf()->osd_ec_partial_reads ? offset_len_to_stripe_bounds(in) : in;
  return std::make_pair(
    aligned_logical_offset_to_chunk_offset(aligned.first),
    aligned_logical_offset_to_chunk_offset(aligned.second));
}
```

The global context is not initialized? Will check!

@rzarzynski
Contributor Author

The patch...

```
$ git diff --cached
diff --git a/src/test/osd/CMakeLists.txt b/src/test/osd/CMakeLists.txt
index c9216c42d5c..c7492e238e5 100644
--- a/src/test/osd/CMakeLists.txt
+++ b/src/test/osd/CMakeLists.txt
@@ -55,6 +55,7 @@ target_link_libraries(unittest_osd_types global)
 # unittest_ecbackend
 add_executable(unittest_ecbackend
   TestECBackend.cc
+  $<TARGET_OBJECTS:unit-main>
   )
 add_ceph_unittest(unittest_ecbackend)
 target_link_libraries(unittest_ecbackend osd global)
```

... tackles the unittest's segfault, at least locally. Repushing.

@rzarzynski rzarzynski force-pushed the wip-osd-ec-partial-reads branch from cfb68a7 to 6d93c15 Compare January 17, 2024 17:45
@rzarzynski
Contributor Author

jenkins retest this please

@markhpc markhpc self-requested a review February 8, 2024 15:48
@markhpc
Member

markhpc commented Mar 7, 2024

@rzarzynski Are you still working on this branch? Do you have another one for some of the new work?

@rzarzynski
Contributor Author

@markhpc: I'm working on this branch right now. I'm currently poking at the reads.

@rzarzynski rzarzynski force-pushed the wip-osd-ec-partial-reads branch from 6d93c15 to af1e37e Compare March 8, 2024 22:17
@rzarzynski
Contributor Author

Pushed an early version of the generalized partial reads. WIP.

@rzarzynski rzarzynski force-pushed the wip-osd-ec-partial-reads branch 2 times, most recently from 3679473 to 6b159a1 Compare March 13, 2024 20:07
@rzarzynski rzarzynski requested a review from athanatos March 13, 2024 21:11
@rzarzynski rzarzynski force-pushed the wip-osd-ec-partial-reads branch from 6b159a1 to aa84e32 Compare March 14, 2024 10:41
@rzarzynski
Contributor Author

@athanatos: would you mind another round?

Contributor

@athanatos left a comment

I think I'm convinced that this is broadly correct and getting close. There are still some specific bits I'm a bit less sure of though.

```
ceph_assert(r == 0);
ceph_assert(bl.length() == sinfo.get_stripe_width());
ceph_assert(bl.length() % sinfo.get_chunk_size() == 0);
if (!g_conf()->osd_ec_partial_reads) {
```
Contributor

Given your other changes, let's remove this prior to merge.

Contributor Author

Dropped!

```
for (uint64_t i = left_shard_index; i < right_shard_index; i++) {
  auto chunk = chunk_mapping.size() > i ? chunk_mapping[i] : i;
  want_to_read->insert(static_cast<int>(chunk));
}
```
Contributor

@athanatos Mar 21, 2024

Suppose we're dealing with a 4+2 code with a chunk size of 1k and therefore a stripe size of 4k.

If the caller passes in offset=1k and length=2k, it seems like this will populate want_to_read correctly with {1, 2}.
However, if the caller passes in offset=1k and length=12k, it seems like we'd get {1, 2, 3} where we should actually have gotten {0, 1, 2, 3}.

I may be missing something though.
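The wrap-around case above can be sketched in a few lines. This is a hypothetical standalone illustration with assumed constants (`CHUNK_SIZE`, `K`) and an assumed helper name, not the Ceph implementation:

```python
# Hypothetical sketch (NOT the Ceph code) of mapping a logical extent onto
# the data chunks it touches, for an assumed 4+2 profile with 1 KiB chunks.
# An extent longer than one stripe wraps around, so each raw chunk index
# must be reduced modulo k (the data chunk count).

CHUNK_SIZE = 1024  # assumed chunk size in bytes
K = 4              # assumed number of data chunks per stripe

def want_to_read(offset, length):
    first = offset // CHUNK_SIZE
    last = (offset + length + CHUNK_SIZE - 1) // CHUNK_SIZE  # exclusive bound
    chunks = set()
    for i in range(first, last):
        chunks.add(i % K)  # wrap into the stripe
        if len(chunks) == K:
            break  # every data chunk is already needed
    return chunks

print(want_to_read(1024, 2048))   # offset=1k, length=2k: {1, 2}
print(want_to_read(1024, 12288))  # offset=1k, length=12k: {0, 1, 2, 3}
```

With the modulo in place, the offset=1k/length=12k case from the comment yields all four data chunks rather than {1, 2, 3}.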

Contributor Author

I think you're right. This was fine while the sub-stripe restriction was in place, but it requires generalization as well. Pushed a bunch of fixups for that.

Contributor

With the same 4+2 1k/4k chunk/stripe size code as before, consider a stripe aligned read of 0k~8k.

left_chunk_index = 0
right_chunk_index = 8 (from sinfo.offset_length_to_data_chunk_indices, (1024 - 1 + 0 + 8192) / 1024)

The for loop will therefore try to add 0, 1, 2, 3, 4, 5, 6, 7 to want_to_read. It seems like we're missing a % sinfo.get_data_chunk_count().

Perhaps I'm missing something?
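The effect of the missing `% sinfo.get_data_chunk_count()` can be shown with a small standalone sketch (assumed constants and names; not the actual Ceph code):

```python
# Hypothetical repro of the missing modulo: for k=4 data chunks of 1 KiB,
# a stripe-aligned 0k~8k read produces raw indices 0..7, which are only
# valid chunk ids after reduction modulo the data chunk count.

CHUNK_SIZE = 1024     # assumed chunk size
DATA_CHUNK_COUNT = 4  # assumed k

def chunk_indices(off, length, wrap):
    first = off // CHUNK_SIZE
    last = (CHUNK_SIZE - 1 + off + length) // CHUNK_SIZE  # exclusive bound
    if wrap:
        return {i % DATA_CHUNK_COUNT for i in range(first, last)}
    return set(range(first, last))

print(chunk_indices(0, 8192, wrap=False))  # raw indices 0..7, past k-1
print(chunk_indices(0, 8192, wrap=True))   # wrapped: {0, 1, 2, 3}
```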

Contributor

The index into chunk_mapping is similarly suspect.

Contributor

@athanatos Mar 22, 2024

I think this change has proved subtle enough that it's worth a unit test -- would mean converting to a free function, but that doesn't seem too hard.

Contributor Author

Added.

src/osd/ECUtil.h (outdated)

```
uint64_t off, uint64_t len) const {
  assert(chunk_size > 0);
  const auto first_chunk_idx = (off / chunk_size);
  const auto last_chunk_idx = (chunk_size - 1 + len) / chunk_size;
```
Contributor

Seems like off + len should probably show up in here somewhere.

const auto last_chunk_idx = (chunk_size - 1 + len + off) / chunk_size;

Contributor Author

Fixed.
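A quick standalone check of the two variants of the formula (the `CHUNK_SIZE` constant and function names are assumptions for illustration, not the ECUtil code):

```python
# Hypothetical sanity check of the end-chunk formula. Omitting the offset
# makes a read at off=3k, len=1k appear to end at chunk index 1; including
# it gives the correct exclusive bound of 4 (the read occupies chunk 3).

CHUNK_SIZE = 1024  # assumed

def last_chunk_idx_buggy(off, length):
    return (CHUNK_SIZE - 1 + length) // CHUNK_SIZE

def last_chunk_idx_fixed(off, length):
    return (CHUNK_SIZE - 1 + length + off) // CHUNK_SIZE

off, length = 3 * 1024, 1024
print(last_chunk_idx_buggy(off, length))  # 1 (wrong: ignores the offset)
print(last_chunk_idx_fixed(off, length))  # 4 (correct exclusive end index)
```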

@rzarzynski
Contributor Author

@athanatos: pushed a bunch of fixups. I believe it's worth a final round.

@athanatos
Contributor

The two logic bugs are troubling for a PR you intend to quickly backport to squid. How did they make it through testing?

@markhpc
Member

markhpc commented Mar 21, 2024

Speaking of testing, I'm happy to jump in and repeat the same tests that were performed for #52746 once folks are ready. I think we're going to need to get as much testing on this as possible if we are really intent on getting this into squid at this late hour.

@athanatos
Contributor

@markhpc The two bugs in question seem as though they'd have been triggered by any read with a non-stripe aligned offset and a length longer than a stripe. It seems like this should have been covered by our existing automated testing, so at the moment my main interest is why our automated testing didn't catch it.

@ljflores
Member

@rzarzynski I'm starting a build for it here and will schedule a rados suite against it when it's ready: https://shaman.ceph.com/builds/ceph/wip-lflores-testing-2-2024-06-10-1925/37bbe1d8d788b7c9a71280e7d510c295a5853f0b/

I think we can just track that directly on the PR since it's in "pre-approval" state. Then when you're ready to have it go into one of @yuriw's batches, we can push it into the "needs-qa" pipeline, etc.

@rzarzynski here are the scheduled tests:
https://pulpito.ceph.com/lflores-2024-06-10_22:21:08-rados-wip-lflores-testing-2-2024-06-10-1925-distro-default-smithi/

@ljflores
Member

jenkins retest this please

@rzarzynski rzarzynski force-pushed the wip-osd-ec-partial-reads branch 2 times, most recently from 3dac973 to e8399c0 Compare June 18, 2024 17:11
@rzarzynski
Contributor Author

jenkins retest this please

Signed-off-by: Radoslaw Zarzynski <rzarzyns@redhat.com>
This is supposed to fix:

```
2024-05-15T01:19:55.945 INFO:tasks.workunit.client.0.smithi190.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/erasure-code/test-erasure-eio.sh:243: rados_get_data_bad_size:  rados_get td/test-erasure-eio pool-jerasure obj-size-81362-1-10 fail
2024-05-15T01:19:55.946 INFO:tasks.workunit.client.0.smithi190.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/erasure-code/test-erasure-eio.sh:104: rados_get:  local dir=td/test-erasure-eio
2024-05-15T01:19:55.946 INFO:tasks.workunit.client.0.smithi190.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/erasure-code/test-erasure-eio.sh:105: rados_get:  local poolname=pool-jerasure
2024-05-15T01:19:55.946 INFO:tasks.workunit.client.0.smithi190.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/erasure-code/test-erasure-eio.sh:106: rados_get:  local objname=obj-size-81362-1-10
2024-05-15T01:19:55.946 INFO:tasks.workunit.client.0.smithi190.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/erasure-code/test-erasure-eio.sh:107: rados_get:  local expect=fail
2024-05-15T01:19:55.946 INFO:tasks.workunit.client.0.smithi190.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/erasure-code/test-erasure-eio.sh:112: rados_get:  '[' fail = fail ']'
2024-05-15T01:19:55.946 INFO:tasks.workunit.client.0.smithi190.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/erasure-code/test-erasure-eio.sh:114: rados_get:  rados --pool pool-jerasure get obj-size-81362-1-10 td/test-erasure-eio/COPY
2024-05-15T01:19:56.175 INFO:tasks.workunit.client.0.smithi190.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/erasure-code/test-erasure-eio.sh:115: rados_get:  return
2024-05-15T01:19:56.175 INFO:tasks.workunit.client.0.smithi190.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/erasure-code/test-erasure-eio.sh:243: rados_get_data_bad_size:  return 1
2024-05-15T01:19:56.175 INFO:tasks.workunit.client.0.smithi190.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/erasure-code/test-erasure-eio.sh:323: TEST_rados_get_bad_size_shard_1:  return 1
2024-05-15T01:19:56.175 INFO:tasks.workunit.client.0.smithi190.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/erasure-code/test-erasure-eio.sh:41: run:  return 1
```

(https://pulpito.ceph.com/rzarzynski-2024-05-14_22:09:16-rados-wip-osd-ec-partial-reads-distro-default-smithi/7706517/)

The failed scenario was exercising a behavior that genuinely
changed with the introduction of partial reads. Before, regardless
of the read size, the OSD always read, and checked for errors in,
the entire stripe.

In this test the first 4 KB was read from an EC pool with
m=2 k=1 while errors had been injected into shards 1 and 2.
Handling the first 4 KB doesn't actually require the damaged
shards but, because of the full-stripe alignment, EIO was
returned. This is no longer the case.

Signed-off-by: Radosław Zarzyński <rzarzyns@redhat.com>
Fixes: https://tracker.ceph.com/issues/66321
Signed-off-by: Radosław Zarzyński <rzarzyns@redhat.com>
(cherry picked from commit 5c821233d097eb6eb4287bbd1d0b6d01638e5f90)
Signed-off-by: Radoslaw Zarzynski <rzarzyns@redhat.com>
Signed-off-by: Radosław Zarzyński <rzarzyns@redhat.com>
We want to have better coverage, especially regarding:
  * subchunking,
  * want_to_read handling,
  * result bl trimming.

Signed-off-by: Radoslaw Zarzynski <rzarzyns@redhat.com>
@rzarzynski rzarzynski force-pushed the wip-osd-ec-partial-reads branch from e8399c0 to a1084f5 Compare June 20, 2024 20:38
@rzarzynski
Contributor Author

@athanatos: would you mind another round?

After a few fixes since the initial teuthology run, neither @ljflores nor I see anything obviously related in the rerun.

@rzarzynski
Contributor Author

The make check bot failed due to a known issue:

/home/jenkins-build/build/workspace/ceph-pull-requests/src/test/flap.sh:21: run:  return 1

Retriggering.

@rzarzynski
Contributor Author

jenkins test make check

@rzarzynski
Contributor Author

[PostBuildScript] - [INFO] Build does not have any of the results [ABORTED]. Did not execute build step #0.

@rzarzynski
Contributor Author

jenkins test make check

Contributor

@athanatos left a comment

It seems like my testing concerns above have been addressed.

@markhpc
Member

markhpc commented Jun 27, 2024

@rzarzynski Ready for performance testing now?

@rzarzynski
Contributor Author

@markhpc: yes!

@ljflores
Member

ljflores commented Jul 9, 2024

@rzarzynski rzarzynski merged commit f25386f into ceph:main Jul 10, 2024
@markhpc
Member

markhpc commented Jul 11, 2024

Why did this merge before we did performance testing?

@rzarzynski
Contributor Author

rzarzynski commented Jul 12, 2024

I don't see performance as the sole driving force behind having this in main. The changeset has three effects:

  1. It's a performance optimization.
  2. The entire EC read infrastructure now deals with chunk alignment instead of stripe alignment. It's hard to imagine the upcoming EC partial writes without this.
  3. Testing coverage: reads are now randomized instead of always starting at offset 0. It's hard to imagine any change from the upcoming series [1] without it.

I would want these commits in main even if they hadn't provided a performance gain, which isn't the case, at least on a developer's desk. squid is a different story, as the 2nd and 3rd points are irrelevant there, leaving the optimization as the sole driver. I would love to see numbers from your testing there.

[1]: https://github.com/bill-scales/CephErasureCodingDesign and the recent thread on ceph-dev

NitzanMordhai pushed a commit to NitzanMordhai/ceph that referenced this pull request Aug 1, 2024
osd: EC Partial Stripe Reads (Retry of ceph#23138 and ceph#52746)

Reviewed-by: Samuel Just <sjust@redhat.com>
@erichorwitz

@rzarzynski Any idea if/when this will make it into Squid, or will it be in the next release? Very interested in getting this working and testing it with our workflow, which would benefit greatly from it.
