
osd: fix partial reading during multi-region EC reads#58749

Merged
ljflores merged 1 commit into ceph:main from rzarzynski:wip-bug-67087
Sep 30, 2024

Conversation

@rzarzynski
Contributor

Fixes: https://tracker.ceph.com/issues/67087

Contribution Guidelines

  • To sign and title your commits, please refer to Submitting Patches to Ceph.

  • If you are submitting a fix for a stable branch (e.g. "quincy"), please refer to Submitting Patches to Ceph - Backports for the proper workflow.

  • When filling out the below checklist, you may click boxes directly in the GitHub web UI. When entering or editing the entire PR message in the GitHub web UI editor, you may also select a checklist item by adding an x between the brackets: [x]. Spaces and capitalization matter when checking off items this way.

Checklist

  • Tracker (select at least one)
    • References tracker ticket
    • Very recent bug; references commit where it was introduced
    • New feature (ticket optional)
    • Doc update (no ticket needed)
    • Code cleanup (no ticket needed)
  • Component impact
    • Affects Dashboard, opened tracker ticket
    • Affects Orchestrator, opened tracker ticket
    • No impact that needs to be tracked
  • Documentation (select at least one)
    • Updates relevant documentation
    • No doc update is appropriate
  • Tests (select at least one)
Show available Jenkins commands
  • jenkins retest this please
  • jenkins test classic perf
  • jenkins test crimson perf
  • jenkins test signed
  • jenkins test make check
  • jenkins test make check arm64
  • jenkins test submodules
  • jenkins test dashboard
  • jenkins test dashboard cephadm
  • jenkins test api
  • jenkins test docs
  • jenkins render docs
  • jenkins test ceph-volume all
  • jenkins test ceph-volume tox
  • jenkins test windows
  • jenkins test rook e2e

Contributor

@bill-scales left a comment


Changes are good: the change fixes the assert and tests the scenario, while still ensuring that a read of more than one stripe in length only inserts shards once.

There are obscure corner cases where too many chunks will be read. If a read op has two regions (extents?), one reading chunk A of shard 0 and the second reading chunk B of shard 1, then the code will create a map of shards <0,1> and then read chunks A0, A1, B0 and B1. Apart from performance there is no downside to reading this extra data, and performance is still vastly better than before the partial read optimization, so let's leave improving this for a future PR.
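
The over-read corner case described above can be sketched as follows (a hypothetical Python illustration, not the Ceph code; the function name and shapes are invented for clarity):

```python
# Hypothetical sketch of the corner case: shards and chunk ranges are unioned
# into a single map, so the reads become a cross product of both sets.

def chunks_read_naive(regions):
    """regions is a list of (shard, chunk) pairs actually needed.
    Unioning shards and chunks separately reads the cross product."""
    shards = {shard for shard, _ in regions}
    chunks = {chunk for _, chunk in regions}
    return {(s, c) for s in shards for c in chunks}

# Region 1 needs chunk A of shard 0; region 2 needs chunk B of shard 1.
regions = [(0, 'A'), (1, 'B')]
print(sorted(chunks_read_naive(regions)))
# Four chunks are read (A and B on both shards) although only two are needed.
```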

nit - the PR checklist needs correcting

Contributor

@idryomov left a comment


Fixes: https://tracker.ceph.com/issues/67087
Signed-off-by: Radoslaw Zarzynski <rzarzyns@redhat.com>
@rzarzynski
Contributor Author

rzarzynski commented Jul 31, 2024

Addressed the comments on ASSERT_EQ in the unittest.
I assure you there is no other change in the repush.

@rzarzynski
Contributor Author

jenkins test make check

@idryomov
Contributor

idryomov commented Aug 9, 2024

@rzarzynski I have pinned down the failures that I mentioned to you before to a single configuration. Unfortunately the job involves an ancient VM image and it's not really introspectable or even particularly reliable, but the issue does seem to lie either in #55196 or in this PR.

I built wip-dis-testing-1 branch on a merge that directly precedes the merge of #55196 and wip-dis-testing-2 branch on the merge of #55196 plus a cherry pick of this fix -- without the cherry pick the OSDs die on ceph_assert(want_to_read->size() == sinfo.get_data_chunk_count()) assert, so testing #55196 in isolation is impossible.

The job is workloads/qemu_xfstests_luks2 in rbd:encryption suite. The filter that I came up with is

-s rbd --filter-all qemu_xfstests_luks2,cache/none,data-pool/ec,features/defaults --filter-out luks2_luks2,luks2_luks1 --seed 6664

It schedules 8 variations of the same job with different objectstore snippets, just to demonstrate that the behavior isn't dependent on that part.

On wip-dis-testing-1, everything passes (mostly -- I still see a failure here and there, but I claim they are unrelated, see below):

https://pulpito.ceph.com/dis-2024-08-08_20:54:50-rbd-wip-dis-testing-1-distro-default-smithi/
https://pulpito.ceph.com/dis-2024-08-09_07:17:35-rbd-wip-dis-testing-1-distro-default-smithi/

(The second run is the same, but with each variation scheduled twice for a total of 16 jobs.)

On wip-dis-testing-2, everything eventually hits max job timeout (if you wait long enough, see below):

https://pulpito.ceph.com/dis-2024-08-09_07:16:12-rbd-wip-dis-testing-2-distro-default-smithi/

Here, each job has signs of bad data returned on reads before the VM even gets to running xfstests (which is the stage at which the "usual" unreliability comes in). Each teuthology.log file exhibits one or more of the following read corruption markers (note that /dev/sda1 is backed by an RBD image):

EXT4-fs error (device sda1): add_dirent_to_buf:1276: inode #30891: block 9580: comm apt-get: bad entry in directory: rec_len % 4 != 0 - offset=0(0), inode=1034682414, rec_len=29946, name_len=147
EXT4-fs error (device sda1): htree_dirblock_to_tree:587: inode #875: block 5834: comm update-rc.d: bad entry in directory: rec_len % 4 != 0 - offset=0(0), inode=2289242610, rec_len=9803, name_len=77
W: Could not open file '/var/log/apt/term.log' - OpenLog (5: Input/output error)
mount: wrong fs type, bad option, bad superblock on 10.0.2.2:/export/client.0,
       missing codepage or helper program, or other error
       (for several filesystems (e.g. nfs, cifs) you might
       need a /sbin/mount.<type> helper program)
       In some cases useful info is found in syslog - try
       dmesg | tail  or so

Despite the occasional failures on wip-dis-testing-1, these corruption markers weren't observed there (or in any of the prior runs that I did on other branches). This means that rather than unnecessarily consuming test nodes waiting for the max job timeout, it's enough to schedule the "bad" run and check on it 30-40 minutes later with something like:

$ cd /teuthology/dis-2024-08-09_07:16:12-rbd-wip-dis-testing-2-distro-default-smithi
$ find . -name teuthology.log | xargs egrep -l '(EXT4-fs error|Input/output error|wrong fs type, bad option, bad superblock)'
./7845341/teuthology.log
./7845334/teuthology.log
./7845337/teuthology.log
./7845339/teuthology.log
./7845340/teuthology.log
./7845338/teuthology.log
./7845335/teuthology.log
./7845336/teuthology.log

(This grep is for the run on wip-dis-testing-2 referenced above, picking up all 8 failing jobs as expected.)

@idryomov
Contributor

idryomov commented Aug 9, 2024

It schedules 8 variations of the same job with different objectstore snippets, just to demonstrate that the behavior isn't dependent on that part.

What it is dependent on though is LUKS2 and RBD cache being disabled (cache/none snippet). Specifically, with everything else being equal, workloads/qemu_xfstests_luks1 doesn't reproduce this. I suspect that it has to do with the RMW dance that librbd has to do for LUKS2 with its 4096-byte blocks (compared to 512-byte blocks for LUKS1). The guest kernel in the VM isn't aware of that and is likely sending I/Os that are "only" 512-byte aligned.
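
The alignment mismatch described above can be illustrated with a small sketch (hypothetical Python, not librbd code; `rmw_extent` is an invented helper showing only the rounding arithmetic):

```python
# Hypothetical sketch: a guest I/O that is only 512-byte aligned must be
# expanded to whole crypto blocks before re-encryption. LUKS1 uses 512-byte
# blocks (no expansion needed); LUKS2 uses 4096-byte blocks (forces RMW).

def rmw_extent(offset, length, block_size):
    """Return (start, length) of the block-aligned extent covering the I/O."""
    start = (offset // block_size) * block_size
    end = -(-(offset + length) // block_size) * block_size  # round up
    return start, end - start

# A 512-byte guest write at offset 512:
print(rmw_extent(512, 512, 512))    # (512, 512) -- LUKS1: already aligned
print(rmw_extent(512, 512, 4096))   # (0, 4096)  -- LUKS2: must read-modify-write
```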

@bill-scales
Contributor

@idryomov I've found a defect with multi-region reads and the partial stripe read code. What I haven't proved is whether it is what is causing your test failure - but I think it's likely.

The bug is a data corruption bug where a multi-region read of an object returns the wrong data - data from the object, but from the wrong offset.

Here's how I created the issue:

  • Create a 2+2 EC pool with the default 4K chunk size
  • Write an object that contains 4K of A's, 4K of B's, 4K of C's, ... 4K of G's - total object size = 28K
  • Issuing single reads with different offsets + lengths (I tried all valid permutations that were multiples of 1K) always returned correct data
  • Issuing two-region reads with read 1: offset = a multiple of 8K, length = 4K and read 2: offset = read 1 offset + 4K, length = 4K always returned correct data. Note both reads were within the same stripe
  • Issuing two-region read with read 1: offset = 4K, length = 4K and read 2: offset = 8K, length = 4K returns the wrong data - it returned A's and B's (as if the offsets were 0 and 4K) rather than the expected B's and C's
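
The layout behind the failing case can be sketched like this (hypothetical Python, not Ceph code; `locate` is an invented helper mapping a logical offset to its shard and stripe for a 2+2 pool with 4K chunks):

```python
# Hypothetical sketch of the 2+2 layout: k=2 data shards, 4K chunks,
# so one stripe covers 8K of logical object data.

K = 2
CHUNK = 4096
STRIPE = K * CHUNK  # 8192

def locate(offset):
    """Map a logical object offset to (data shard, stripe number)."""
    stripe = offset // STRIPE
    shard = (offset % STRIPE) // CHUNK
    return shard, stripe

# The failing two-region read touches offsets 4K and 8K:
print(locate(4096))   # (1, 0) -- the B's live on shard 1, stripe 0
print(locate(8192))   # (0, 1) -- the C's live on shard 0, stripe 1
# Unlike the passing case above, the two regions span different stripes
# and different shards.
```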

I think the bug is caused by:

  1. ECBackend::objects_read_async - has code to avoid performing the partial stripe optimization for single-region reads that cross stripe boundaries, but not for multi-region reads. So the failure case is performing the partial stripe optimization
  2. ECCommon::ReadPipeline::objects_read_and_reconstruct generates a single want_to_read map that is a union of the requirements of each read. In the failing case this means both reads end up reading both data shards
  3. finish_single_request (I think this is particularly suspect as it used to assert that full stripes were read) and the callback lambda in objects_read_async try to trim the buffer that was read, but get confused about how much data was read and end up returning the wrong part of the data that was read
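
The symptom in point 3 can be reproduced in miniature (a hypothetical Python sketch of the observed behaviour, not the actual Ceph trimming code; `trim_wrong` is an invented stand-in for the confused trim):

```python
# Hypothetical sketch: the reconstructed buffer covers the union of both
# regions starting at the stripe boundary, but each region is sliced from the
# start of the buffer, ignoring where the buffer really begins in the object.

CHUNK = 4096
obj = b''.join(bytes([ord('A') + i]) * CHUNK for i in range(7))  # A..G, 28K

# Regions requested: (4K, 4K) and (8K, 4K), i.e. the B's and the C's.
# Suppose the buffer actually read starts at offset 0 (stripe-aligned).
buf = obj[0:3 * CHUNK]

def trim_wrong(buf, region_index, length):
    # Buggy behaviour being illustrated: regions are taken from the buffer
    # in order, as if the buffer started at the first region's offset.
    return buf[region_index * length:(region_index + 1) * length]

print(trim_wrong(buf, 0, CHUNK)[:1])  # b'A' -- but the B's were requested
print(trim_wrong(buf, 1, CHUNK)[:1])  # b'B' -- but the C's were requested
```

This matches the report above: the read returns A's and B's (as if the offsets were 0 and 4K) rather than the expected B's and C's.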

This bug at least matches the symptom that Ilya reported of the filesystem complaining about corruption. It doesn't completely explain why the test case hangs, but we can guess that returning the wrong data (i.e. other data that it wrote, which might pass any integrity checks it performs) could well cause something like an infinite loop as it follows on-disk metadata structures.

I suggest that the short-term fix is to disable the partial read optimization if there is more than one region in the read request received by ECBackend::objects_read_async - mainly because, as discussed with @rzarzynski, there is no test coverage of multi-region reads in the core code at the moment. Longer term we need to enhance the test coverage and then debug and re-enable this optimization.

The simplest way I found of creating multi-region reads was to hack the test tool src/tools/neorados.cc. This diff creates the failure condition; without the "if (off==0)" branch it generates multi-region reads that work:

--- a/src/tools/neorados.cc
+++ b/src/tools/neorados.cc
@@ -209,10 +209,26 @@ ba::awaitable<void> read(R::RADOS& r, const std::vector<std::string>& p)
   }
 
   std::size_t off = 0;
-  ceph::buffer::list bl;
+  ceph::buffer::list bl1;
+  ceph::buffer::list bl2;
+  bool use2;
   while (auto toread = std::min(len - off, io_size)) {
     R::ReadOp op;
-    op.read(off, toread, &bl);
+    if (off==0) {
+      fmt::print("read1: offset {} length 4096\n",off);
+      op.read(off, 4096, &bl1);
+      use2 = false;
+    }else if (toread>=8192) {
+      fmt::print("read2: offset {} length 4096\n",off);
+      fmt::print("read2: offset {} length 4096\n",(off+4096));
+      op.read(off, 4096, &bl1);
+      op.read(off+4096, 4096, &bl2);
+      use2 = true;
+    }else{
+      fmt::print("read1: offset {} length {}\n",off,toread);
+      op.read(off, toread, &bl1);
+      use2 = false;
+    }
     co_await r.execute(obj, pool, std::move(op), nullptr,
                       ba::redirect_error(ba::use_awaitable, ec));
     if (ec)
@@ -221,9 +237,14 @@ ba::awaitable<void> read(R::RADOS& r, const std::vector<std::string>& p)
        fmt::format("when reading from object '{}' in pool '{}'",
                    obj, pool));
 
-    off += bl.length();
-    bl.write_stream(std::cout);
-    bl.clear();
+    off += bl1.length();
+    bl1.write_stream(std::cout);
+    bl1.clear();
+    if (use2) {
+      off += bl2.length();
+      bl2.write_stream(std::cout);
+      bl2.clear();
+    }

@idryomov
Contributor

idryomov commented Aug 19, 2024

It doesn't completely explain why the test case hangs, but we can guess that returning the wrong data (i.e other data that it wrote which might pass any integrity checks it performs) could well cause something like an infinite loop as it follows on disk metadata structures.

I wouldn't worry about the hang. It's an ancient VM image with an ancient kernel inside, from the days when filesystems didn't always react to metadata corruption or EIO errors in a well-defined way and the "go read-only" code paths weren't tested a lot. On top of that, the task is waiting for the VM to shut itself down and there might be issues with error propagation in the test script itself.

I suggest that the short term fix is to disable the partial read optimization if there more than one region in the read request received by ECBackend::objects_read_async - mainly because as discussed with @rzarzynski there is no test coverage of multi-region reads in the core code at the moment.

Eager to test the short term fix as I keep having to remember to include this PR in my integration branches and waive an occasional failure (without this PR OSDs die pretty much unconditionally in any LUKS test, producing many more failures than I'm comfortable dismissing).

@idryomov
Contributor

(without this PR OSDs die pretty much unconditionally in any LUKS test, producing many more failures than I'm comfortable dismissing).

Based on @bill-scales going through all single-read permutations and finding no issue and this PR being a definite improvement over the status quo in main, I'd request it be merged as is.

@idryomov
Contributor

jenkins test make check

1 similar comment
@idryomov
Contributor

jenkins test make check

@idryomov
Contributor

@rzarzynski @ljflores This was labeled needs-qa three weeks ago, has it been through the suite? If not, is that still pending or would you be OK with all the RBD runs that I did?

@rzarzynski
Contributor Author

@idryomov: let's add a commit disabling the multi-region reads for now (WIP). Then the rbd + rados suites would be enough.

@amathuria
Contributor

@ljflores merged commit 9afb314 into ceph:main Sep 30, 2024
@erichorwitz

@rzarzynski Will this eventually get rolled into Squid or will it wait until the next release?
