osd: fix partial reading during multi-region EC reads#58749
Conversation
bill-scales
left a comment
Changes are good, it fixes the assert and tests the scenario, while still ensuring that a read of more than one stripe in length only inserts shards once.
There are obscure corner cases where too many chunks will be read: if a read op has two regions (extents?), one reading chunk A of shard 0 and the second reading chunk B of shard 1, then the code will create a map of shards <0,1> and then read chunks A0, A1, B0 and B1. Apart from performance there is no downside to reading this extra data, and performance is still vastly better than before the partial read optimization, so let's leave improving this for a future PR.
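The over-read corner case above can be modelled with a short sketch (illustrative Python, not the actual Ceph C++; `chunks_to_read` is a hypothetical stand-in for the per-op shard-map logic):

```python
# Illustrative model of the corner case: the shard set is computed once
# for the whole op (union of shards touched by any region) and then
# applied to every region, so extra chunks get read.

def chunks_to_read(regions):
    """regions: list of (chunk, shard) pairs requested by the op."""
    shard_set = sorted({shard for _, shard in regions})
    # Every region's chunk is read from every shard in the set.
    return [(chunk, shard) for chunk, _ in regions for shard in shard_set]

# Two regions: chunk A on shard 0 and chunk B on shard 1.
# The op ends up reading A0, A1, B0 and B1 instead of just A0 and B1.
print(chunks_to_read([("A", 0), ("B", 1)]))
# [('A', 0), ('A', 1), ('B', 0), ('B', 1)]
```

With a single region (or regions confined to one shard) no extra chunks are read, which is why this only shows up with multi-region ops spanning different shards.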
nit - the PR checklist needs correcting
idryomov
left a comment
Ack from the RBD perspective (based on https://pulpito.ceph.com/dis-2024-07-25_20:35:11-rbd-wip-dis-testing-distro-default-smithi/)
Fixes: https://tracker.ceph.com/issues/67087
Signed-off-by: Radoslaw Zarzynski <rzarzyns@redhat.com>
Force-pushed from 8d28553 to beb4d22.
Addressed the comments.
jenkins test make check
@rzarzynski I have pinned down the failures that I mentioned to you before to a single configuration. Unfortunately the job involves an ancient VM image and it's not really introspectable or even particularly reliable, but the issue does seem to lie either in #55196 or in this PR. I built … The job is …

It schedules 8 variations of the same job with different …

On https://pulpito.ceph.com/dis-2024-08-08_20:54:50-rbd-wip-dis-testing-1-distro-default-smithi/ … (The second run is the same, but with each variation scheduled twice for a total of 16 jobs.)

On https://pulpito.ceph.com/dis-2024-08-09_07:16:12-rbd-wip-dis-testing-2-distro-default-smithi/ …

Here, each job has signs of bad data returned on reads before the VM even gets to running xfstests (which is the stage at which the "usual" unreliability comes in). Each …

Despite the occasional failures on … (This grep is for the run on …)
What it is dependent on, though, is LUKS2 and RBD cache being disabled (…).
@idryomov I've found a defect with multi-region reads and the partial stripe read code; what I haven't proved is whether it is what is causing your test failure, but I think it's likely. The bug is a data corruption bug where a multi-region read of an object returns the wrong data: data from the object, but from the wrong offset. Here's how I created the issue:
I think the bug is caused by:
This bug at least matches the symptom that Ilya reported of XFS complaining about corruption. It doesn't completely explain why the test case hangs, but we can guess that returning the wrong data (i.e. other data that it wrote, which might pass any integrity checks it performs) could well cause something like an infinite loop as it follows on-disk metadata structures.

I suggest that the short-term fix is to disable the partial read optimization if there is more than one region in the read request received by ECBackend::objects_read_async, mainly because, as discussed with @rzarzynski, there is no test coverage of multi-region reads in the core code at the moment. Longer term we need to enhance the test coverage and then debug and re-enable this optimization.

The simplest way I found of creating multi-region reads was to hack the test tool src/tools/neorados.cc. This diff creates the failure condition; without the "if (off==0)" it generates multi-region reads that work:
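The proposed short-term gate can be sketched as follows (illustrative Python, not the actual ECBackend C++; `plan_read` and its arguments are hypothetical names for the decision being described):

```python
# Illustrative sketch of the short-term fix: only take the
# partial-stripe read path for single-region requests, and fall back
# to the safe full-stripe path whenever the request has multiple
# regions, since that is the case with the corruption bug and no
# core-code test coverage.

def plan_read(regions, partial_read_enabled=True):
    """regions: list of (offset, length) extents requested by the op."""
    if partial_read_enabled and len(regions) == 1:
        return "partial-stripe read"   # optimized path, single region only
    return "full-stripe read"          # safe path for multi-region requests

print(plan_read([(0, 4096)]))                   # partial-stripe read
print(plan_read([(0, 4096), (65536, 4096)]))    # full-stripe read
```

The same gate also gives an obvious switch-off point (`partial_read_enabled=False`) if further problems turn up before the optimization is debugged and re-enabled.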
I wouldn't worry about the hang. It's an ancient VM image with an ancient kernel inside, from the days when filesystems didn't always react to metadata corruption or EIO errors in a well-defined way and the "go read-only" code paths weren't tested a lot. On top of that, the task is waiting for the VM to shut itself down and there might be issues with error propagation in the test script itself.
Eager to test the short-term fix, as I keep having to remember to include this PR in my integration branches and waive an occasional failure (without this PR, OSDs die pretty much unconditionally in any LUKS test, producing many more failures than I'm comfortable dismissing).
Based on @bill-scales going through all single-read permutations and finding no issue, and this PR being a definite improvement over the status quo in main, I'd request it be merged as is.
jenkins test make check
@rzarzynski @ljflores This was labeled …
@idryomov: let's add a commit disabling the multi-region reads for now (WIP). Then the rbd + rados suites would be enough. |
@rzarzynski Will this eventually get rolled into Squid, or will it wait until the next release?
Fixes: https://tracker.ceph.com/issues/67087
Contribution Guidelines
To sign and title your commits, please refer to Submitting Patches to Ceph.
If you are submitting a fix for a stable branch (e.g. "quincy"), please refer to Submitting Patches to Ceph - Backports for the proper workflow.
When filling out the below checklist, you may click boxes directly in the GitHub web UI. When entering or editing the entire PR message in the GitHub web UI editor, you may also select a checklist item by adding an `x` between the brackets: `[x]`. Spaces and capitalization matter when checking off items this way.

Checklist
Show available Jenkins commands
- jenkins retest this please
- jenkins test classic perf
- jenkins test crimson perf
- jenkins test signed
- jenkins test make check
- jenkins test make check arm64
- jenkins test submodules
- jenkins test dashboard
- jenkins test dashboard cephadm
- jenkins test api
- jenkins test docs
- jenkins render docs
- jenkins test ceph-volume all
- jenkins test ceph-volume tox
- jenkins test windows
- jenkins test rook e2e