
osd: only call stat/getattrs once per object during deep-scrub#56995

Merged
yuriw merged 2 commits into ceph:main from
athanatos:sjust/wip-65185-scrub-attr-error
Apr 30, 2024

Conversation

@athanatos
Contributor

Fixes: https://tracker.ceph.com/issues/65185

See bug for details.


We call back into be_scan_list repeatedly for the same object in deep
scrub if the object is too large/has too many omap entries to be done
in a single call.  This causes us to wastefully call
ObjectStore::getattrs and stat multiple times.

This has been a long-standing but mostly harmless bug (aside from the
extra metadata traffic).  However, 1a4d3f0, between reef and squid,
switched ScrubMap::object::attrs to a map<string, bufferlist> from a
map<string, bufferptr>.  This should have been harmless, except that
the ObjectStore::getattrs overload for that type uses

      aset[i->first].append(i->second);

rather than

      aset.emplace(i.first.c_str(), i.second);

to populate the map.  Combined with this bug, that means that if we
need 7 calls into be_scan_list to deep scrub an object, we'll end up
with attributes concatenated 7 times each.  This didn't cause visible
problems with squid/main testing since all of the replicas would end
up doing the same thing and they'd still decode ok, but it did become
visible during reef->squid upgrade testing as the older osds sent
maps with the correct contents.

The next commit will fix ObjectStore::getattrs to clear the map.

Fixes: https://tracker.ceph.com/issues/65185
Signed-off-by: Samuel Just <sjust@redhat.com>
Passing in a non-empty map would otherwise exhibit quite unexpected
behavior.  For the bufferptr overload, any preexisting entries would
not be overwritten due to how std::map::emplace behaves.  For the
bufferlist overload, it would result in appending to any pre-existing
entries.

The prior commit cleans up one such inadvertent caller which resulted
in the below bug.

Fixes: https://tracker.ceph.com/issues/65185
Signed-off-by: Samuel Just <sjust@redhat.com>
@athanatos athanatos requested a review from a team as a code owner April 19, 2024 03:00
@athanatos athanatos requested a review from rzarzynski April 19, 2024 03:03
@athanatos
Contributor Author

Waiting for test branches to build.

wip-sjust-squid-testing-2024-04-18
wip-sjust-testing-2024-04-18

      store->getattrs(
      int r = 0;
      ScrubMap::object &o = map.objects[poid];
      if (!pos.metadata_done) {
Contributor

ACK. This leads to calling getattrs() only once.

      return -EINPROGRESS;

      if (pos.deep) {
        r = be_deep_scrub(poid, map, pos, o);
Contributor

ACK for decoupling the error handling of getattrs() and be_deep_scrub().

      r = -ENOENT;
      goto out;
      }
      aset.clear();
Contributor

I'm not sure this BlueStore-related commit is actually part of the fix specifically for https://tracker.ceph.com/issues/65185. In my understanding this bug is tackled by the first commit in the PR.

Although I agree BlueStore::getattrs() should clear the aset and this change is very desired to get into main right now, I'm more concerned about backporting it to squid immediately. I'm worried we might have a caller depending on the misbehavior of this method.

Is there a case for treating this wide BlueStore patch separately (e.g. with a dedicated tracker)?

Contributor Author

I'd be truly shocked if there were any users doing this on purpose. I think the balance of probability is that if there actually were any users that passed in non-empty maps, we'd want this behavior anyway.

Contributor Author

I don't want to leave this behavior in the code base. It's quite dangerous and I'm not really worried about backporting it tbh.

@yuriw
Contributor

yuriw commented Apr 19, 2024

This PR is under test in https://tracker.ceph.com/issues/65592.

@rzarzynski
Contributor

jenkins test windows

@athanatos
Contributor Author

https://pulpito.ceph.com/sjust-2024-04-19_02:04:26-upgrade:reef-x:stress-split-wip-sjust-squid-testing-2024-04-18-distro-default-smithi/ doesn't show the scrub error, but all of the tests failed anyway. Is anyone looking into these other upgrade test failures? @rzarzynski

@rzarzynski
Contributor

rzarzynski commented Apr 20, 2024

The first job in the run that died after hitting the max job timeout failed because it could not pull a container image:

rzarzynski@teuthology:/a/sjust-2024-04-19_02:04:26-upgrade:reef-x:stress-split-wip-sjust-squid-testing-2024-04-18-distro-default-smithi/7663330$ less teuthology.log 
...
2024-04-19T02:41:02.894 DEBUG:teuthology.orchestra.run.smithi059:> sudo /home/ubuntu/cephtest/cephadm --image quay.ceph.io/ceph-ci/ceph:reef shell -c /etc/ceph/ceph.conf -k /etc/ceph/ceph.client.admin.keyring --fsid f762b95a-fdf3-11ee-bc92-c7b262605968 -e sha1=0e14f5e6f0b65f450e578e8b6f6baaeb0179d877 -- bash -c 'ceph orch upgrade start --image quay.ceph.io/ceph-ci/ceph:$sha1'
...
2024-04-19T02:41:12.977 INFO:journalctl@ceph.mon.c.smithi059.stdout:Apr 19 02:41:12 smithi059 bash[31138]: cephadm 2024-04-19T02:41:11.008362+0000 mgr.y (mgr.24320) 186 : cephadm 1 Upgrade: First pull of quay.ceph.io/ceph-ci/ceph:0e14f5e6f0b65f450e578e8b6f6baaeb0179d877
...
2024-04-19T02:50:00.389 INFO:journalctl@ceph.mon.b.smithi081.stdout:Apr 19 02:50:00 smithi081 bash[23242]: cluster 2024-04-19T02:50:00.000166+0000 mon.a (mon.0) 1544 : cluster 3 [WRN] UPGRADE_FAILED_PULL: Upgrade: failed to pull target image
...
2024-04-19T02:50:00.478 INFO:journalctl@ceph.mon.c.smithi059.stdout:Apr 19 02:50:00 smithi059 bash[31138]: cluster 2024-04-19T02:50:00.000193+0000 mon.a (mon.0) 1545 : cluster 3     host smithi059 `cephadm pull` failed: cephadm exited with an error code: 1, stderr: Pulling container image quay.ceph.io/ceph-ci/ceph:0e14f5e6f0b65f450e578e8b6f6baaeb0179d877...
2024-04-19T02:50:00.478 INFO:journalctl@ceph.mon.c.smithi059.stdout:Apr 19 02:50:00 smithi059 bash[31138]: cluster 2024-04-19T02:50:00.000218+0000 mon.a (mon.0) 1546 : cluster 3 Non-zero exit code 1 from /usr/bin/docker pull quay.ceph.io/ceph-ci/ceph:0e14f5e6f0b65f450e578e8b6f6baaeb0179d877
2024-04-19T02:50:00.478 INFO:journalctl@ceph.mon.c.smithi059.stdout:Apr 19 02:50:00 smithi059 bash[31138]: cluster 2024-04-19T02:50:00.000261+0000 mon.a (mon.0) 1547 : cluster 3 /usr/bin/docker: stderr Error response from daemon: manifest for quay.ceph.io/ceph-ci/ceph:0e14f5e6f0b65f450e578e8b6f6baaeb0179d877 not found: manifest unknown: manifest unknown
2024-04-19T02:50:00.478 INFO:journalctl@ceph.mon.c.smithi059.stdout:Apr 19 02:50:00 smithi059 bash[31138]: cluster 2024-04-19T02:50:00.000281+0000 mon.a (mon.0) 1548 : cluster 3 ERROR: Failed command: /usr/bin/docker pull quay.ceph.io/ceph-ci/ceph:0e14f5e6f0b65f450e578e8b6f6baaeb0179d877
...

@rzarzynski
Contributor

Grepped specifically for this kind of error in the other two jobs:

rzarzynski@teuthology:/a/sjust-2024-04-19_02:04:26-upgrade:reef-x:stress-split-wip-sjust-squid-testing-2024-04-18-distro-default-smithi/7663331$ less teuthology.log
...
2024-04-19T02:50:01.030 INFO:journalctl@ceph.mon.b.smithi179.stdout:Apr 19 02:50:00 smithi179 bash[23642]: cluster 2024-04-19T02:50:00.000215+0000 mon.a (mon.0) 993 : cluster 3 ERROR: Failed command: /usr/bin/docker pull quay.ceph.io/ceph-ci/ceph:0e14f5e6f0b65f450e578e8b6f6baaeb0179d877
rzarzynski@teuthology:/a/sjust-2024-04-19_02:04:26-upgrade:reef-x:stress-split-wip-sjust-squid-testing-2024-04-18-distro-default-smithi/7663334$ less teuthology.log
...
2024-04-19T03:00:01.024 INFO:journalctl@ceph.mon.c.smithi031.stdout:Apr 19 03:00:00 smithi031 bash[31409]: cluster 2024-04-19T03:00:00.000187+0000 mon.a (mon.0) 1796 : cluster 3 ERROR: Failed command: /usr/bin/docker pull quay.ceph.io/ceph-ci/ceph:0e14f5e6f0b65f450e578e8b6f6baaeb0179d877

Everything else failed because of a package problem: yum-utils-4.5.0-1.el10.noarch from CentOS-BaseOS requires python3-dnf >= 4.19.0, but none of the providers can be installed.

CC: @athanatos.

@ronen-fr
Contributor

I think we can merge this now. As Sam is on PTO, I'll wait a bit for possible objections (@rzarzynski?), then probably merge.

@rzarzynski
Contributor

@ronen-fr: https://tracker.ceph.com/issues/65592 tracks the QA. @yuriw, @ljflores: ping.

@ljflores
Member

@yuriw yuriw merged commit 90bab02 into ceph:main Apr 30, 2024
