
osd: fix partial reading during multi-region EC reads#58749

Merged
ljflores merged 1 commit into ceph:main from rzarzynski:wip-bug-67087
Sep 30, 2024

Conversation

@rzarzynski
Contributor

Fixes: https://tracker.ceph.com/issues/67087

Contribution Guidelines

  • To sign and title your commits, please refer to Submitting Patches to Ceph.

  • If you are submitting a fix for a stable branch (e.g. "quincy"), please refer to Submitting Patches to Ceph - Backports for the proper workflow.

  • When filling out the below checklist, you may click boxes directly in the GitHub web UI. When entering or editing the entire PR message in the GitHub web UI editor, you may also select a checklist item by adding an x between the brackets: [x]. Spaces and capitalization matter when checking off items this way.

Checklist

  • Tracker (select at least one)
    • References tracker ticket
    • Very recent bug; references commit where it was introduced
    • New feature (ticket optional)
    • Doc update (no ticket needed)
    • Code cleanup (no ticket needed)
  • Component impact
    • Affects Dashboard, opened tracker ticket
    • Affects Orchestrator, opened tracker ticket
    • No impact that needs to be tracked
  • Documentation (select at least one)
    • Updates relevant documentation
    • No doc update is appropriate
  • Tests (select at least one)
Show available Jenkins commands
  • jenkins retest this please
  • jenkins test classic perf
  • jenkins test crimson perf
  • jenkins test signed
  • jenkins test make check
  • jenkins test make check arm64
  • jenkins test submodules
  • jenkins test dashboard
  • jenkins test dashboard cephadm
  • jenkins test api
  • jenkins test docs
  • jenkins render docs
  • jenkins test ceph-volume all
  • jenkins test ceph-volume tox
  • jenkins test windows
  • jenkins test rook e2e

Contributor

@bill-scales left a comment


Changes are good: the change fixes the assert and tests the scenario, while still ensuring that a read of more than one stripe in length only inserts shards once.

There are obscure corner cases where too many chunks will be read. If a read op has two regions (extents?), one reading chunk A of shard 0 and the second reading chunk B of shard 1, then the code will create a map of shards <0,1> and then read chunks A0, A1, B0 and B1. Apart from performance there is no downside to reading this extra data, and performance is still vastly better than before the partial read optimization, so let's leave improving this for a future PR.
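
The over-read corner case described above can be sketched as follows (a hypothetical Python illustration, not the Ceph code; the function name and shapes are invented for clarity):

```python
# Hypothetical sketch of the corner case: shards and chunk ranges are unioned
# into a single map, so the reads become a cross product of both sets.

def chunks_read_naive(regions):
    """regions is a list of (shard, chunk) pairs actually needed.
    Unioning shards and chunks separately reads the cross product."""
    shards = {shard for shard, _ in regions}
    chunks = {chunk for _, chunk in regions}
    return {(s, c) for s in shards for c in chunks}

# Region 1 needs chunk A of shard 0; region 2 needs chunk B of shard 1.
regions = [(0, 'A'), (1, 'B')]
print(sorted(chunks_read_naive(regions)))
# Four chunks are read (A and B on both shards) although only two are needed.
```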

nit - the PR checklist needs correcting

Contributor

@idryomov left a comment


Fixes: https://tracker.ceph.com/issues/67087
Signed-off-by: Radoslaw Zarzynski <rzarzyns@redhat.com>
@rzarzynski
Contributor Author

rzarzynski commented Jul 31, 2024

Addressed the comments on ASSERT_EQ in the unittest.
I assure you there is no other change in the repush.

@rzarzynski
Contributor Author

jenkins test make check

@idryomov
Contributor

idryomov commented Aug 9, 2024

@rzarzynski I have pinned down the failures that I mentioned to you before to a single configuration. Unfortunately the job involves an ancient VM image and it's not really introspectable or even particularly reliable, but the issue does seem to lie either in #55196 or in this PR.

I built wip-dis-testing-1 branch on a merge that directly precedes the merge of #55196 and wip-dis-testing-2 branch on the merge of #55196 plus a cherry pick of this fix -- without the cherry pick the OSDs die on ceph_assert(want_to_read->size() == sinfo.get_data_chunk_count()) assert, so testing #55196 in isolation is impossible.

The job is workloads/qemu_xfstests_luks2 in rbd:encryption suite. The filter that I came up with is

-s rbd --filter-all qemu_xfstests_luks2,cache/none,data-pool/ec,features/defaults --filter-out luks2_luks2,luks2_luks1 --seed 6664

It schedules 8 variations of the same job with different objectstore snippets, just to demonstrate that the behavior isn't dependent on that part.

On wip-dis-testing-1, everything passes (mostly -- I still see a failure here and there, but I claim they are unrelated, see below):

https://pulpito.ceph.com/dis-2024-08-08_20:54:50-rbd-wip-dis-testing-1-distro-default-smithi/
https://pulpito.ceph.com/dis-2024-08-09_07:17:35-rbd-wip-dis-testing-1-distro-default-smithi/

(The second run is the same, but with each variation scheduled twice for a total of 16 jobs.)

On wip-dis-testing-2, everything eventually hits max job timeout (if you wait long enough, see below):

https://pulpito.ceph.com/dis-2024-08-09_07:16:12-rbd-wip-dis-testing-2-distro-default-smithi/

Here, each job has signs of bad data returned on reads before the VM even gets to running xfstests (which is the stage at which the "usual" unreliability comes in). Each teuthology.log file exhibits one or more of the following read corruption markers (note that /dev/sda1 is backed by an RBD image):

EXT4-fs error (device sda1): add_dirent_to_buf:1276: inode #30891: block 9580: comm apt-get: bad entry in directory: rec_len % 4 != 0 - offset=0(0), inode=1034682414, rec_len=29946, name_len=147
EXT4-fs error (device sda1): htree_dirblock_to_tree:587: inode #875: block 5834: comm update-rc.d: bad entry in directory: rec_len % 4 != 0 - offset=0(0), inode=2289242610, rec_len=9803, name_len=77
W: Could not open file '/var/log/apt/term.log' - OpenLog (5: Input/output error)
mount: wrong fs type, bad option, bad superblock on 10.0.2.2:/export/client.0,
       missing codepage or helper program, or other error
       (for several filesystems (e.g. nfs, cifs) you might
       need a /sbin/mount.<type> helper program)
       In some cases useful info is found in syslog - try
       dmesg | tail  or so

Despite the occasional failures on wip-dis-testing-1, these corruption markers weren't observed there (or in any of the prior runs that I did on other branches). This means that rather than unnecessarily consuming test nodes waiting for the max job timeout, it's enough to schedule the "bad" run and check on it 30-40 minutes later with something like:

$ cd /teuthology/dis-2024-08-09_07:16:12-rbd-wip-dis-testing-2-distro-default-smithi
$ find . -name teuthology.log | xargs egrep -l '(EXT4-fs error|Input/output error|wrong fs type, bad option, bad superblock)'
./7845341/teuthology.log
./7845334/teuthology.log
./7845337/teuthology.log
./7845339/teuthology.log
./7845340/teuthology.log
./7845338/teuthology.log
./7845335/teuthology.log
./7845336/teuthology.log

(This grep is for the run on wip-dis-testing-2 referenced above, picking up all 8 failing jobs as expected.)

@idryomov
Contributor

idryomov commented Aug 9, 2024

It schedules 8 variations of the same job with different objectstore snippets, just to demonstrate that the behavior isn't dependent on that part.

What it is dependent on though is LUKS2 and RBD cache being disabled (cache/none snippet). Specifically, with everything else being equal, workloads/qemu_xfstests_luks1 doesn't reproduce this. I suspect that it has to do with the RMW dance that librbd has to do for LUKS2 with its 4096-byte blocks (compared to 512-byte blocks for LUKS1). The guest kernel in the VM isn't aware of that and is likely sending I/Os that are "only" 512-byte aligned.
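
The alignment mismatch described above can be illustrated with a small sketch (hypothetical Python, not librbd code; `rmw_extent` is an invented helper showing only the rounding arithmetic):

```python
# Hypothetical sketch: a guest I/O that is only 512-byte aligned must be
# expanded to whole crypto blocks before re-encryption. LUKS1 uses 512-byte
# blocks (no expansion needed); LUKS2 uses 4096-byte blocks (forces RMW).

def rmw_extent(offset, length, block_size):
    """Return (start, length) of the block-aligned extent covering the I/O."""
    start = (offset // block_size) * block_size
    end = -(-(offset + length) // block_size) * block_size  # round up
    return start, end - start

# A 512-byte guest write at offset 512:
print(rmw_extent(512, 512, 512))    # (512, 512) -- LUKS1: already aligned
print(rmw_extent(512, 512, 4096))   # (0, 4096)  -- LUKS2: must read-modify-write
```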

@bill-scales
Contributor

@idryomov I've found a defect with multi-region reads and the partial stripe read code. What I haven't proved is whether it is what is causing your test failure - but I think it's likely.

The bug is a data corruption bug where a multi-region read of an object returns the wrong data - data from the object, but from the wrong offset.

Here's how I created the issue:

  • Create a 2+2 EC pool with the default 4K chunk size
  • Write an object that contains 4K of A's, 4K of B's, 4K of C's, ... 4K of G's - total object size = 28K
  • Issuing single reads with different offsets + lengths (I tried all valid permutations that were multiples of 1K) always returned correct data
  • Issuing two-region reads with read 1: offset = a multiple of 8K, length = 4K and read 2: offset = read 1 offset + 4K, length = 4K always returned correct data. Note both reads were within the same stripe
  • Issuing two-region read with read 1: offset = 4K, length = 4K and read 2: offset = 8K, length = 4K returns the wrong data - it returned A's and B's (as if the offsets were 0 and 4K) rather than the expected B's and C's
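
The layout behind the failing case can be sketched like this (hypothetical Python, not Ceph code; `locate` is an invented helper mapping a logical offset to its shard and stripe for a 2+2 pool with 4K chunks):

```python
# Hypothetical sketch of the 2+2 layout: k=2 data shards, 4K chunks,
# so one stripe covers 8K of logical object data.

K = 2
CHUNK = 4096
STRIPE = K * CHUNK  # 8192

def locate(offset):
    """Map a logical object offset to (data shard, stripe number)."""
    stripe = offset // STRIPE
    shard = (offset % STRIPE) // CHUNK
    return shard, stripe

# The failing two-region read touches offsets 4K and 8K:
print(locate(4096))   # (1, 0) -- the B's live on shard 1, stripe 0
print(locate(8192))   # (0, 1) -- the C's live on shard 0, stripe 1
# Unlike the passing case above, the two regions span different stripes
# and different shards.
```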

I think the bug is caused by:

  1. ECBackend::objects_read_async - has code to avoid performing the partial stripe optimization for single-region reads that cross stripe boundaries, but not for multi-region reads. So the failure case is performing the partial stripe optimization
  2. ECCommon::ReadPipeline::objects_read_and_reconstruct generates a single want_to_read map that is a union of the requirements of each read. In the failing case this means both reads end up reading both data shards
  3. finish_single_request (I think this is particularly suspect as it used to assert that full stripes were read) and the callback lambda in objects_read_async try to trim the buffer that was read, but get confused about how much data was read and end up returning the wrong part of the data that was read
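
The symptom in point 3 can be reproduced in miniature (a hypothetical Python sketch of the observed behaviour, not the actual Ceph trimming code; `trim_wrong` is an invented stand-in for the confused trim):

```python
# Hypothetical sketch: the reconstructed buffer covers the union of both
# regions starting at the stripe boundary, but each region is sliced from the
# start of the buffer, ignoring where the buffer really begins in the object.

CHUNK = 4096
obj = b''.join(bytes([ord('A') + i]) * CHUNK for i in range(7))  # A..G, 28K

# Regions requested: (4K, 4K) and (8K, 4K), i.e. the B's and the C's.
# Suppose the buffer actually read starts at offset 0 (stripe-aligned).
buf = obj[0:3 * CHUNK]

def trim_wrong(buf, region_index, length):
    # Buggy behaviour being illustrated: regions are taken from the buffer
    # in order, as if the buffer started at the first region's offset.
    return buf[region_index * length:(region_index + 1) * length]

print(trim_wrong(buf, 0, CHUNK)[:1])  # b'A' -- but the B's were requested
print(trim_wrong(buf, 1, CHUNK)[:1])  # b'B' -- but the C's were requested
```

This matches the report above: the read returns A's and B's (as if the offsets were 0 and 4K) rather than the expected B's and C's.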

This bug at least matches the symptom that Ilya reported of the filesystem complaining about corruption. It doesn't completely explain why the test case hangs, but we can guess that returning the wrong data (i.e. other data that it wrote, which might pass any integrity checks it performs) could well cause something like an infinite loop as it follows on-disk metadata structures.

I suggest that the short-term fix is to disable the partial read optimization if there is more than one region in the read request received by ECBackend::objects_read_async - mainly because, as discussed with @rzarzynski, there is no test coverage of multi-region reads in the core code at the moment. Longer term we need to enhance the test coverage and then debug and re-enable this optimization.

The simplest way I found of creating multi-region reads was to hack the test tool src/tools/neorados.cc. This diff creates the failure condition; without the "if (off==0)" branch it generates multi-region reads that work:

--- a/src/tools/neorados.cc
+++ b/src/tools/neorados.cc
@@ -209,10 +209,26 @@ ba::awaitable<void> read(R::RADOS& r, const std::vector<std::string>& p)
   }
 
   std::size_t off = 0;
-  ceph::buffer::list bl;
+  ceph::buffer::list bl1;
+  ceph::buffer::list bl2;
+  bool use2;
   while (auto toread = std::min(len - off, io_size)) {
     R::ReadOp op;
-    op.read(off, toread, &bl);
+    if (off==0) {
+      fmt::print("read1: offset {} length 4096\n",off);
+      op.read(off, 4096, &bl1);
+      use2 = false;
+    }else if (toread>=8192) {
+      fmt::print("read2: offset {} length 4096\n",off);
+      fmt::print("read2: offset {} length 4096\n",(off+4096));
+      op.read(off, 4096, &bl1);
+      op.read(off+4096, 4096, &bl2);
+      use2 = true;
+    }else{
+      fmt::print("read1: offset {} length {}\n",off,toread);
+      op.read(off, toread, &bl1);
+      use2 = false;
+    }
     co_await r.execute(obj, pool, std::move(op), nullptr,
                       ba::redirect_error(ba::use_awaitable, ec));
     if (ec)
@@ -221,9 +237,14 @@ ba::awaitable<void> read(R::RADOS& r, const std::vector<std::string>& p)
        fmt::format("when reading from object '{}' in pool '{}'",
                    obj, pool));
 
-    off += bl.length();
-    bl.write_stream(std::cout);
-    bl.clear();
+    off += bl1.length();
+    bl1.write_stream(std::cout);
+    bl1.clear();
+    if (use2) {
+      off += bl2.length();
+      bl2.write_stream(std::cout);
+      bl2.clear();
+    }

@idryomov
Contributor

idryomov commented Aug 19, 2024

It doesn't completely explain why the test case hangs, but we can guess that returning the wrong data (i.e other data that it wrote which might pass any integrity checks it performs) could well cause something like an infinite loop as it follows on disk metadata structures.

I wouldn't worry about the hang. It's an ancient VM image with an ancient kernel inside, from the days when filesystems didn't always react to metadata corruption or EIO errors in a well-defined way and the "go read-only" code paths weren't tested a lot. On top of that, the task is waiting for the VM to shut itself down and there might be issues with error propagation in the test script itself.

I suggest that the short term fix is to disable the partial read optimization if there more than one region in the read request received by ECBackend::objects_read_async - mainly because as discussed with @rzarzynski there is no test coverage of multi-region reads in the core code at the moment.

Eager to test the short term fix as I keep having to remember to include this PR in my integration branches and waive an occasional failure (without this PR OSDs die pretty much unconditionally in any LUKS test, producing many more failures than I'm comfortable dismissing).

@idryomov
Contributor

(without this PR OSDs die pretty much unconditionally in any LUKS test, producing many more failures than I'm comfortable dismissing).

Based on @bill-scales going through all single-read permutations and finding no issue and this PR being a definite improvement over the status quo in main, I'd request it be merged as is.

@idryomov
Contributor

jenkins test make check

1 similar comment
@idryomov
Contributor

jenkins test make check

@idryomov
Contributor

@rzarzynski @ljflores This was labeled needs-qa three weeks ago, has it been through the suite? If not, is that still pending or would you be OK with all the RBD runs that I did?

@rzarzynski
Contributor Author

@idryomov: let's add a commit disabling the multi-region reads for now (WIP). Then the rbd + rados suites would be enough.

@amathuria
Contributor

@ljflores merged commit 9afb314 into ceph:main Sep 30, 2024
@erichorwitz

@rzarzynski Will this eventually get rolled into Squid or will it wait until the next release?
