Bug #65768
Closed
rados/verify: "Health check failed: 1 osds down (OSD_DOWN)" in cluster log
Description
This is observed on squid. I couldn't find a related tracker for this test on main.
A more thorough analysis is needed to determine whether this also needs to be fixed on the main branch.
If it does, this tracker can probably be clubbed with https://tracker.ceph.com/issues/65521,
which tracks a number of other trackers related to adding expected cluster log warnings
to the ignorelist.
/a/yuriw-2024-04-30_03:21:19-rados-wip-yuri4-testing-2024-04-29-0642-distro-default-smithi/7680387
Description:
rados/verify/{centos_latest ceph clusters/{fixed-2 openstack} d-thrash/none mon_election/classic msgr-failures/few msgr/async objectstore/bluestore-low-osd-mem-target rados tasks/rados_api_tests validater/valgrind}
The OSD_DOWN is expected since it is taken down as part of the thrasher. The warning is eventually cleared.
This warning must therefore be added to the ignorelist.
2024-04-30T11:35:49.898+0000 11fae640 10 mon.a@0(leader).log v419 logging 2024-04-30T11:35:49.821422+0000 mon.a (mon.0) 1015 : cluster [WRN] Health check failed: 1 osds down (OSD_DOWN)
...
2024-04-30T11:35:55.298+0000 d3a4640  0 log_channel(cluster) log [INF] : Health check cleared: OSD_DOWN (was: 1 osds down)
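For illustration, teuthology suites suppress expected cluster log warnings via a `log-ignorelist` of regexes in the suite yaml. A hedged sketch of the kind of entry the fix would add; the exact file and placement under qa/suites are assumptions, not taken from this tracker:

```yaml
# Hypothetical suite override fragment; actual location in qa/suites is an assumption
overrides:
  ceph:
    log-ignorelist:
      - \(OSD_DOWN\)
```

Entries are regexes matched against cluster log lines, hence the escaped parentheses around the health-check code.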
Updated by Laura Flores almost 2 years ago
- Related to Bug #65521: Add expected warnings in cluster log to ignorelists added
Updated by Radoslaw Zarzynski almost 2 years ago
Sridhar, are you working on this?
Updated by Sridhar Seshasayee almost 2 years ago
- Assignee set to Sridhar Seshasayee
@Radoslaw Zarzynski I found this during a review of a squid run that included a couple of my PRs.
I wasn't working on this, but to help out I can take it up and come up with a fix.
Updated by Laura Flores almost 2 years ago
@Sridhar Seshasayee you can add me as a reviewer if you raise a PR to add this to the ignorelist, or otherwise.
Updated by Sridhar Seshasayee almost 2 years ago · Edited
Further analysis of the logs shows that the OSD_DOWN warning was generated because osd.1 exceeded the heartbeat grace period,
as shown below; the warning was cleared a few seconds later:
2024-04-30T11:35:48.878+0000 fba9640 10 mon.a@0(leader).log v418 logging 2024-04-30T11:35:48.875903+0000 mon.a (mon.0) 1013 : cluster [INF] osd.1 failed (root=default,host=smithi005) (2 reporters from different osd after 87.150472 >= grace 80.000000)
...
2024-04-30T11:35:49.809+0000 d3a4640  2 mon.a@0(leader).osd e308 osd.1 DOWN
2024-04-30T11:35:49.810+0000 d3a4640 10 mon.a@0(leader).osd e308 encode_pending encoding full map with squid features 1080873258835847684
2024-04-30T11:35:49.813+0000 d3a4640 20 mon.a@0(leader).osd e308 encode_pending mon is running version: 19.0.0-2455-g09dbd6bb
2024-04-30T11:35:49.813+0000 d3a4640 20 mon.a@0(leader).osd e308 full_crc 1068048625 inc_crc 3555704646
2024-04-30T11:35:49.819+0000 d3a4640 10 mon.a@0(leader) e1 log_health updated 1 previous 0
2024-04-30T11:35:49.819+0000 d3a4640  0 log_channel(cluster) log [WRN] : Health check failed: 1 osds down (OSD_DOWN)
...
2024-04-30T11:35:55.298+0000 d3a4640  0 log_channel(cluster) log [INF] : Health check cleared: OSD_DOWN (was: 1 osds down)
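To make the failure condition in the log explicit ("2 reporters from different osd after 87.150472 >= grace 80.000000"): the mon marks an OSD down once enough distinct reporters agree and the elapsed time exceeds the grace. A minimal sketch of that condition; the function name and the min-reporters default are illustrative assumptions, not the actual mon code:

```python
def should_mark_down(num_reporters: int, elapsed: float, grace: float,
                     min_reporters: int = 2) -> bool:
    """Hypothetical sketch of the mon's mark-down condition: an OSD is
    failed once at least min_reporters distinct OSDs report it AND the
    time since the last heartbeat exceeds osd_heartbeat_grace."""
    return num_reporters >= min_reporters and elapsed >= grace

# Values from the log excerpt: 2 reporters, 87.150472 s elapsed, grace 80 s
print(should_mark_down(2, 87.150472, 80.0))  # True: osd.1 is marked down
```

With the proposed grace of 90 seconds, the same 87.15-second gap would not have tripped the check.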
The test sets osd_heartbeat_grace to 80 seconds. Historically, this value was 40 seconds and was
increased to 80 seconds a while ago as part of the following commit:
https://github.com/ceph/ceph/pull/34011/commits/4fda9d50f09d527262fd65eab9b9cff3fd700aad
Considering the nature of the test, the osd_heartbeat_grace timeout can be increased further to 90 seconds
on main and backported just to Squid to begin with.
@Radoslaw Zarzynski, what do you think?
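For reference, the proposed bump would amount to a one-line config override along these lines; where exactly the test sets it (suite yaml vs. ceph.conf template) is an assumption:

```ini
# Hypothetical ceph.conf fragment raising the heartbeat grace for this test
[osd]
osd_heartbeat_grace = 90
```

The same option also has a mon-side counterpart (mon_osd_laggy_* handling aside, the mon honors the reported grace), so the valgrind-induced slowness of rados/verify gets 10 extra seconds of headroom before OSD_DOWN fires.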
Updated by Laura Flores almost 2 years ago
/a/yuriw-2024-04-11_17:03:54-rados-wip-yuri6-testing-2024-04-02-1310-distro-default-smithi/7652461
Updated by Laura Flores almost 2 years ago
Laura Flores wrote in #note-7:
/a/yuriw-2024-04-11_17:03:54-rados-wip-yuri6-testing-2024-04-02-1310-distro-default-smithi/7652461
Oops, this is not rados/verify; ignore
Updated by Radoslaw Zarzynski almost 2 years ago
- Status changed from New to Fix Under Review
Will be taken into the next QA batch.
Updated by Yuri Weinstein over 1 year ago
Updated by Sridhar Seshasayee over 1 year ago
- Status changed from Fix Under Review to Pending Backport
- Backport set to squid, reef
Updated by Sridhar Seshasayee over 1 year ago
- Copied to Backport #67108: squid: rados/verify: "Health check failed: 1 osds down (OSD_DOWN)" in cluster log added
Updated by Sridhar Seshasayee over 1 year ago
- Copied to Backport #67109: reef: rados/verify: "Health check failed: 1 osds down (OSD_DOWN)" in cluster log added
Updated by Upkeep Bot over 1 year ago
- Tags (freeform) set to backport_processed
Updated by Laura Flores over 1 year ago
- Related to Bug #67181: rados/verify: "Health check failed: 1 osds down (OSD_DOWN)" in cluster log added
Updated by Sridhar Seshasayee over 1 year ago
- Status changed from Pending Backport to Resolved
Updated by Upkeep Bot 8 months ago
- Merge Commit set to 1fa959e9826c66c3b55f3dff524aee9e1ec00c6a
- Fixed In set to v19.3.0-3675-g1fa959e9826
- Upkeep Timestamp set to 2025-07-11T11:09:11+00:00
Updated by Upkeep Bot 8 months ago
- Fixed In changed from v19.3.0-3675-g1fa959e9826 to v19.3.0-3675-g1fa959e982
- Upkeep Timestamp changed from 2025-07-11T11:09:11+00:00 to 2025-07-14T23:09:10+00:00
Updated by Upkeep Bot 5 months ago
- Released In set to v20.2.0~2414
- Upkeep Timestamp changed from 2025-07-14T23:09:10+00:00 to 2025-11-01T01:33:43+00:00