osd: EC Partial Stripe Reads (Retry of #23138 and #52746)#55196
osd: EC Partial Stripe Reads (Retry of #23138 and #52746)#55196rzarzynski merged 23 commits intoceph:mainfrom
Conversation
|
std::pair<uint64_t, uint64_t> ECUtil::stripe_info_t::aligned_offset_len_to_chunk(
std::pair<uint64_t, uint64_t> in) const {
// we need to align to stripe again to deal with partial chunk read.
auto aligned =
g_conf()->osd_ec_partial_reads ? offset_len_to_stripe_bounds(in) : in;
return std::make_pair(
aligned_logical_offset_to_chunk_offset(aligned.first),
aligned_logical_offset_to_chunk_offset(aligned.second));
}The global context is not initialized? Will check! |
|
The patch... $ git diff --cached
diff --git a/src/test/osd/CMakeLists.txt b/src/test/osd/CMakeLists.txt
index c9216c42d5c..c7492e238e5 100644
--- a/src/test/osd/CMakeLists.txt
+++ b/src/test/osd/CMakeLists.txt
@@ -55,6 +55,7 @@ target_link_libraries(unittest_osd_types global)
# unittest_ecbackend
add_executable(unittest_ecbackend
TestECBackend.cc
+ $<TARGET_OBJECTS:unit-main>
)
add_ceph_unittest(unittest_ecbackend)
target_link_libraries(unittest_ecbackend osd global)... tackles the unittest's segfault, at least locally. Repushing. |
cfb68a7 to
6d93c15
Compare
|
jenkins retest this please |
|
@rzarzynski Are you still working on this branch? Do you have another one for some of the new work? |
|
@markhpc: I'm working on this branch right now. I'm poking with the reads' wings. |
6d93c15 to
af1e37e
Compare
|
Pushed an early version of the generalized partial reads. WIP. |
3679473 to
6b159a1
Compare
6b159a1 to
aa84e32
Compare
|
@athanatos: would you mind another round? |
athanatos
left a comment
There was a problem hiding this comment.
I think I'm convinced that this is broadly correct and getting close. There are still some specific bits I'm a bit less sure of though.
src/osd/ECUtil.cc
Outdated
| ceph_assert(r == 0); | ||
| ceph_assert(bl.length() == sinfo.get_stripe_width()); | ||
| ceph_assert(bl.length() % sinfo.get_chunk_size() == 0); | ||
| if (!g_conf()->osd_ec_partial_reads) { |
There was a problem hiding this comment.
Given your other changes, let's remove this prior to merge.
| for(uint64_t i = left_shard_index; i < right_shard_index; i++) { | ||
| auto chunk = chunk_mapping.size() > i ? chunk_mapping[i] : i; | ||
| want_to_read->insert(static_cast<int>(chunk)); | ||
| } |
There was a problem hiding this comment.
Suppose we're dealing with a 4+2 code with a chunk size of 1k and therefore a stripe size of 4k.
If the caller passes in offset=1k and length=2k, it seems like this will populate want_to_read correctly with {1, 2}.
However, if the caller passes in offset=1k and length=12k, it seeks like we'd get {1, 2, 3} where we should actually have gotten {0, 1, 2, 3}.
I may be missing something though.
There was a problem hiding this comment.
I think you're right. This was fine while the sub-stripe restriction was in place but requires generalization as well. Pushed a bunch fixups for that.
There was a problem hiding this comment.
With the same 4+2 1k/4k chunk/stripe size code as before, consider a stripe aligned read of 0k~8k.
left_chunk_index = 0
right_chunk_index = 8 (from sinfo.offset_length_to_data_chunk_indices, (1024 - 1 + 0 + 8192) / 1024)
The for loop will therefore try to add 0, 1, 2, 3, 4, 5, 6, 7 to want_to_read. It seems like we're missing a % sinfo.get_data_chunk_count().
Perhaps I'm missing something?
There was a problem hiding this comment.
The index into chunk_mapping is similarly suspect.
There was a problem hiding this comment.
I think this change has proved subtle enough that it's worth a unit test -- would mean converting to a free function, but that doesn't seem too hard.
src/osd/ECUtil.h
Outdated
| uint64_t off, uint64_t len) const { | ||
| assert(chunk_size > 0); | ||
| const auto first_chunk_idx = (off / chunk_size); | ||
| const auto last_chunk_idx = (chunk_size - 1 + len) / chunk_size; |
There was a problem hiding this comment.
Seems like off + len should probably show up in here somewhere.
const auto last_chunk_idx = (chunk_size - 1 + len + off) / chunk_size;
|
@athanatos: pushed a bunch of fixups. I believe it's worth final round. |
|
The two logic bugs are troubling for a PR you intend to quickly backport to squid. How did they make it through testing? |
|
Speaking of testing, I'm happy to jump in and repeat the same tests that were performed for #52746 once folks are ready. I think we're going to need to get as much testing on this as possible if we are really intent on getting this into squid at this late hour. |
|
@markhpc The two bugs in question seem as though they'd have been triggered by any read with a non-stripe aligned offset and a length longer than a stripe. It seems like this should have been covered by our existing automated testing, so at the moment my main interest is why our automated testing didn't catch it. |
@rzarzynski here are the scheduled tests: |
|
jenkins retest this please |
3dac973 to
e8399c0
Compare
|
jenkins retest this please |
Signed-off-by: Radoslaw Zarzynski <rzarzyns@redhat.com>
This is supposed to fix: ``` 2024-05-15T01:19:55.945 INFO:tasks.workunit.client.0.smithi190.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/erasure-code/test-erasure-eio.sh:243: rados_get_data_bad_size: rados_get td/test-erasure- eio pool-jerasure obj-size-81362-1-10 fail 2024-05-15T01:19:55.946 INFO:tasks.workunit.client.0.smithi190.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/erasure-code/test-erasure-eio.sh:104: rados_get: local dir=td/test-erasure-eio 2024-05-15T01:19:55.946 INFO:tasks.workunit.client.0.smithi190.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/erasure-code/test-erasure-eio.sh:105: rados_get: local poolname=pool-jerasure 2024-05-15T01:19:55.946 INFO:tasks.workunit.client.0.smithi190.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/erasure-code/test-erasure-eio.sh:106: rados_get: local objname=obj-size-81362-1-10 2024-05-15T01:19:55.946 INFO:tasks.workunit.client.0.smithi190.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/erasure-code/test-erasure-eio.sh:107: rados_get: local expect=fail 2024-05-15T01:19:55.946 INFO:tasks.workunit.client.0.smithi190.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/erasure-code/test-erasure-eio.sh:112: rados_get: '[' fail = fail ']' 2024-05-15T01:19:55.946 INFO:tasks.workunit.client.0.smithi190.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/erasure-code/test-erasure-eio.sh:114: rados_get: rados --pool pool-jerasure get obj-size- 81362-1-10 td/test-erasure-eio/COPY 2024-05-15T01:19:56.175 INFO:tasks.workunit.client.0.smithi190.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/erasure-code/test-erasure-eio.sh:115: rados_get: return 2024-05-15T01:19:56.175 INFO:tasks.workunit.client.0.smithi190.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/erasure-code/test-erasure-eio.sh:243: rados_get_data_bad_size: return 1 2024-05-15T01:19:56.175 INFO:tasks.workunit.client.0.smithi190.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/erasure-code/test-erasure-eio.sh:323: TEST_rados_get_bad_size_shard_1: return 1 2024-05-15T01:19:56.175 INFO:tasks.workunit.client.0.smithi190.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/erasure-code/test-erasure-eio.sh:41: run: return 1 ``` (https://pulpito.ceph.com/rzarzynski-2024-05-14_22:09:16-rados-wip-osd-ec-partial-reads-distro-default-smithi/7706517/) The failed scenario was exercising a behavior that got truly changed by introduction of partial reads. Before, regardless of read size, OSD was always reading and checking for errors entire stripe. In this test first 4 KB has been read from an EC pool with m=2 k=1 while errors had been injected to shards 1 and 2. Handling the first 4 KB doesn't really require the damaged shards but, because of the full-stripe alignment, EIO was returned. This is not anymore. Signed-off-by: Radosław Zarzyński <rzarzyns@redhat.com>
Fixes: https://tracker.ceph.com/issues/66321 Signed-off-by: Radosław Zarzyński <rzarzyns@redhat.com> (cherry picked from commit 5c821233d097eb6eb4287bbd1d0b6d01638e5f90)
Signed-off-by: Radoslaw Zarzynski <rzarzyns@redhat.com>
Signed-off-by: Radosław Zarzyński <rzarzyns@redhat.com>
We want to have better coverage espcially in the matter of: * subchunking, * want_to_read handling, * result bl trimming. Signed-off-by: Radoslaw Zarzynski <rzarzyns@redhat.com>
e8399c0 to
a1084f5
Compare
|
@athanatos: would you mind another round? After a few fixes since the initial teuthology run neither @ljflores nor I see anything obviously related in the rerun. |
|
The make check bot failed due to a known issue: Retriggering. |
|
jenkins test make check |
|
|
jenkins test make check |
athanatos
left a comment
There was a problem hiding this comment.
It seems like my testing concerns above have been addressed.
|
@rzarzynski Ready for performance testing now? |
|
@markhpc: yes! |
|
Why did this merge before we did performance testing? |
|
I don't perceive performance as the main, sole driving force behind having this in
I want these commits in [1]: https://github.com/bill-scales/CephErasureCodingDesign and the recent thread on ceph-dev |
osd: EC Partial Stripe Reads (Retry of ceph#23138 and ceph#52746) Reviewed-by: Samuel Just <sjust@redhat.com>
|
@rzarzynski Any idea if/when this will make it into Squid? or will this be in next release? Very interested in getting this working and testing with our workflow which would benefit greatly by it. |
This PR is a resurrection of #52746 rebased on top of #54930.
Contribution Guidelines
To sign and title your commits, please refer to Submitting Patches to Ceph.
If you are submitting a fix for a stable branch (e.g. "quincy"), please refer to Submitting Patches to Ceph - Backports for the proper workflow.
When filling out the below checklist, you may click boxes directly in the GitHub web UI. When entering or editing the entire PR message in the GitHub web UI editor, you may also select a checklist item by adding an
xbetween the brackets:[x]. Spaces and capitalization matter when checking off items this way.Checklist
Show available Jenkins commands
jenkins retest this pleasejenkins test classic perfjenkins test crimson perfjenkins test signedjenkins test make checkjenkins test make check arm64jenkins test submodulesjenkins test dashboardjenkins test dashboard cephadmjenkins test apijenkins test docsjenkins render docsjenkins test ceph-volume alljenkins test ceph-volume toxjenkins test windowsjenkins test rook e2e