
osd: only call stat/getattrs once per object during deep-scrub#56995

Merged
yuriw merged 2 commits into ceph:main from
athanatos:sjust/wip-65185-scrub-attr-error
Apr 30, 2024

Conversation

@athanatos
Contributor

Fixes: https://tracker.ceph.com/issues/65185

See bug for details.


We call back into be_scan_list repeatedly for the same object in deep
scrub if the object is too large/has too many omap entries to be done
in a single call.  This causes us to wastefully call
ObjectStore::getattrs and stat multiple times.

This has been a long-standing but mostly harmless bug (aside from the
extra metadata traffic).  However, 1a4d3f0, between reef and squid,
switched ScrubMap::object::attrs to a map<string, bufferlist> from a
map<string, bufferptr>.  This should have been harmless, except that
the ObjectStore::getattrs overload for that type uses

      aset[i->first].append(i->second);

rather than

      aset.emplace(i.first.c_str(), i.second);

to populate the map.  Combined with this bug, that means that if we
need 7 calls into be_scan_list to deep scrub an object, we'll end up
with attributes concatenated 7 times each.  This didn't cause visible
problems with squid/main testing since all of the replicas would end
up doing the same thing and they'd still decode ok, but it did become
visible during reef->squid upgrade testing as the older osds sent
maps with the correct contents.

The next commit will fix ObjectStore::getattrs to clear the map.

Fixes: https://tracker.ceph.com/issues/65185
Signed-off-by: Samuel Just <sjust@redhat.com>
Passing in a non-empty map would otherwise exhibit quite unexpected
behavior.  For the bufferptr overload, any preexisting entries would
not be overwritten due to how std::map::emplace behaves.  For the
bufferlist overload, it would result in appending to any pre-existing
entries.

The prior commit cleans up one such inadvertent caller which resulted
in the below bug.

Fixes: https://tracker.ceph.com/issues/65185
Signed-off-by: Samuel Just <sjust@redhat.com>
@athanatos athanatos requested a review from a team as a code owner April 19, 2024 03:00
@athanatos athanatos requested a review from rzarzynski April 19, 2024 03:03
@athanatos
Contributor Author

Waiting for test branches to build.

wip-sjust-squid-testing-2024-04-18
wip-sjust-testing-2024-04-18

      store->getattrs(
      int r = 0;
      ScrubMap::object &o = map.objects[poid];
      if (!pos.metadata_done) {
Contributor

ACK. This leads to calling getattrs() only once.

      return -EINPROGRESS;

      if (pos.deep) {
        r = be_deep_scrub(poid, map, pos, o);
Contributor

ACK for decoupling the error handling of getattrs() and be_deep_scrub().

      r = -ENOENT;
      goto out;
      }
      aset.clear();
Contributor

I'm not sure this BlueStore-related commit is actually part of the fix specifically for https://tracker.ceph.com/issues/65185. In my understanding this bug is tackled by the first commit in the PR.

Although I agree BlueStore::getattrs() should clear the aset and this change is very desired to get into main right now, I'm more concerned about backporting it to squid immediately. I'm worried we might have a caller depending on the misbehavior of this method.

Is there a case for treating this wide BlueStore patch separately (e.g. with a dedicated tracker)?

Contributor Author

I'd be truly shocked if there were any users doing this on purpose. I think the balance of probability is that if there actually were any users that passed in non-empty maps, we'd want this behavior anyway.

Contributor Author

I don't want to leave this behavior in the code base. It's quite dangerous and I'm not really worried about backporting it tbh.

@yuriw
Contributor

yuriw commented Apr 19, 2024

This PR is under test in https://tracker.ceph.com/issues/65592.

@rzarzynski
Contributor

jenkins test windows

@athanatos
Contributor Author

https://pulpito.ceph.com/sjust-2024-04-19_02:04:26-upgrade:reef-x:stress-split-wip-sjust-squid-testing-2024-04-18-distro-default-smithi/ doesn't show the scrub error, but all of the tests failed anyway. Is anyone looking into these other upgrade test failures? @rzarzynski

@rzarzynski
Contributor

rzarzynski commented Apr 20, 2024

The first job in the run that died after hitting the max job timeout failed because it could not pull a container image:

rzarzynski@teuthology:/a/sjust-2024-04-19_02:04:26-upgrade:reef-x:stress-split-wip-sjust-squid-testing-2024-04-18-distro-default-smithi/7663330$ less teuthology.log 
...
2024-04-19T02:41:02.894 DEBUG:teuthology.orchestra.run.smithi059:> sudo /home/ubuntu/cephtest/cephadm --image quay.ceph.io/ceph-ci/ceph:reef shell -c /etc/ceph/ceph.conf -k /etc/ceph/ceph.client.admin.keyring --fsid f762b95a-fdf3-11ee-bc92-c7b262605968 -e sha1=0e14f5e6f0b65f450e578e8b6f6baaeb0179d877 -- bash -c 'ceph orch upgrade start --image quay.ceph.io/ceph-ci/ceph:$sha1'
...
2024-04-19T02:41:12.977 INFO:journalctl@ceph.mon.c.smithi059.stdout:Apr 19 02:41:12 smithi059 bash[31138]: cephadm 2024-04-19T02:41:11.008362+0000 mgr.y (mgr.24320) 186 : cephadm 1 Upgrade: First pull of quay.ceph.io/ceph-ci/ceph:0e14f5e6f0b65f450e578e8b6f6baaeb0179d877
...
2024-04-19T02:50:00.389 INFO:journalctl@ceph.mon.b.smithi081.stdout:Apr 19 02:50:00 smithi081 bash[23242]: cluster 2024-04-19T02:50:00.000166+0000 mon.a (mon.0) 1544 : cluster 3 [WRN] UPGRADE_FAILED_PULL: Upgrade: failed to pull target image
...
2024-04-19T02:50:00.478 INFO:journalctl@ceph.mon.c.smithi059.stdout:Apr 19 02:50:00 smithi059 bash[31138]: cluster 2024-04-19T02:50:00.000193+0000 mon.a (mon.0) 1545 : cluster 3     host smithi059 `cephadm pull` failed: cephadm exited with an error code: 1, stderr: Pulling container image quay.ceph.io/ceph-ci/ceph:0e14f5e6f0b65f450e578e8b6f6baaeb0179d877...
2024-04-19T02:50:00.478 INFO:journalctl@ceph.mon.c.smithi059.stdout:Apr 19 02:50:00 smithi059 bash[31138]: cluster 2024-04-19T02:50:00.000218+0000 mon.a (mon.0) 1546 : cluster 3 Non-zero exit code 1 from /usr/bin/docker pull quay.ceph.io/ceph-ci/ceph:0e14f5e6f0b65f450e578e8b6f6baaeb0179d877
2024-04-19T02:50:00.478 INFO:journalctl@ceph.mon.c.smithi059.stdout:Apr 19 02:50:00 smithi059 bash[31138]: cluster 2024-04-19T02:50:00.000261+0000 mon.a (mon.0) 1547 : cluster 3 /usr/bin/docker: stderr Error response from daemon: manifest for quay.ceph.io/ceph-ci/ceph:0e14f5e6f0b65f450e578e8b6f6baaeb0179d877 not found: manifest unknown: manifest unknown
2024-04-19T02:50:00.478 INFO:journalctl@ceph.mon.c.smithi059.stdout:Apr 19 02:50:00 smithi059 bash[31138]: cluster 2024-04-19T02:50:00.000281+0000 mon.a (mon.0) 1548 : cluster 3 ERROR: Failed command: /usr/bin/docker pull quay.ceph.io/ceph-ci/ceph:0e14f5e6f0b65f450e578e8b6f6baaeb0179d877
...

@rzarzynski
Contributor

Grepped specifically for this kind of error in the other two jobs:

rzarzynski@teuthology:/a/sjust-2024-04-19_02:04:26-upgrade:reef-x:stress-split-wip-sjust-squid-testing-2024-04-18-distro-default-smithi/7663331$ less teuthology.log
...
2024-04-19T02:50:01.030 INFO:journalctl@ceph.mon.b.smithi179.stdout:Apr 19 02:50:00 smithi179 bash[23642]: cluster 2024-04-19T02:50:00.000215+0000 mon.a (mon.0) 993 : cluster 3 ERROR: Failed command: /usr/bin/docker pull quay.ceph.io/ceph-ci/ceph:0e14f5e6f0b65f450e578e8b6f6baaeb0179d877
rzarzynski@teuthology:/a/sjust-2024-04-19_02:04:26-upgrade:reef-x:stress-split-wip-sjust-squid-testing-2024-04-18-distro-default-smithi/7663334$ less teuthology.log
...
2024-04-19T03:00:01.024 INFO:journalctl@ceph.mon.c.smithi031.stdout:Apr 19 03:00:00 smithi031 bash[31409]: cluster 2024-04-19T03:00:00.000187+0000 mon.a (mon.0) 1796 : cluster 3 ERROR: Failed command: /usr/bin/docker pull quay.ceph.io/ceph-ci/ceph:0e14f5e6f0b65f450e578e8b6f6baaeb0179d877

Everything else failed because of a package problem: yum-utils-4.5.0-1.el10.noarch from CentOS-BaseOS requires python3-dnf >= 4.19.0, but none of the providers can be installed.

CC: @athanatos.

@ronen-fr
Contributor

I think we can merge this now. As Sam is on PTO, I'll wait a bit for possible objections (@rzarzynski?), then probably merge.

@rzarzynski
Contributor

@ronen-fr: https://tracker.ceph.com/issues/65592 tracks the QA. @yuriw, @ljflores: ping.

@ljflores
Member

@yuriw yuriw merged commit 90bab02 into ceph:main Apr 30, 2024
