Bug #65768


rados/verify: "Health check failed: 1 osds down (OSD_DOWN)" in cluster log

Added by Sridhar Seshasayee almost 2 years ago. Updated 5 months ago.

Status:
Resolved
Priority:
Normal
Category:
-
Target version:
-
% Done:

0%

Source:
Backport:
squid, reef
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Tags (freeform):
backport_processed
Fixed In:
v19.3.0-3675-g1fa959e982
Released In:
v20.2.0~2414
Upkeep Timestamp:
2025-11-01T01:33:43+00:00

Description

This was observed on squid. I couldn't find a related tracker for this test on main.
A more thorough analysis of whether this also needs to be fixed on the main branch is needed.
If it does, this tracker can probably be clubbed with https://tracker.ceph.com/issues/65521,
which tracks a bunch of other trackers related to adding cluster log warnings to the ignorelist.

/a/yuriw-2024-04-30_03:21:19-rados-wip-yuri4-testing-2024-04-29-0642-distro-default-smithi/7680387

Description:
rados/verify/{centos_latest ceph clusters/{fixed-2 openstack} d-thrash/none mon_election/classic msgr-failures/few msgr/async objectstore/bluestore-low-osd-mem-target rados tasks/rados_api_tests validater/valgrind}

The OSD_DOWN warning is expected since the OSD is taken down as part of the thrasher, and the warning is eventually cleared.
It must therefore be added to the ignorelist.
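For context, teuthology suites suppress expected warnings via a `log-ignorelist` in the suite yaml. A minimal sketch of the kind of entry implied here (the exact file and placement are decided in the fix PR, so treat this as illustrative):

```yaml
# Illustrative fragment only; the actual suite file is chosen in the PR.
log-ignorelist:
  - \(OSD_DOWN\)
```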

2024-04-30T11:35:49.898+0000 11fae640 10 mon.a@0(leader).log v419  logging 2024-04-30T11:35:49.821422+0000 mon.a (mon.0) 1015 : cluster [WRN] Health check failed: 1 osds down (OSD_DOWN)

...

2024-04-30T11:35:55.298+0000 d3a4640  0 log_channel(cluster) log [INF] : Health check cleared: OSD_DOWN (was: 1 osds down)

Related issues 4 (1 open, 3 closed)

Related to RADOS - Bug #65521: Add expected warnings in cluster log to ignorelists (Closed, Laura Flores)

Related to RADOS - Bug #67181: rados/verify: "Health check failed: 1 osds down (OSD_DOWN)" in cluster log (Pending Backport, Laura Flores)

Copied to RADOS - Backport #67108: squid: rados/verify: "Health check failed: 1 osds down (OSD_DOWN)" in cluster log (Resolved, Sridhar Seshasayee)
Copied to RADOS - Backport #67109: reef: rados/verify: "Health check failed: 1 osds down (OSD_DOWN)" in cluster log (Resolved, Sridhar Seshasayee)
Actions #1

Updated by Laura Flores almost 2 years ago

  • Related to Bug #65521: Add expected warnings in cluster log to ignorelists added
Actions #2

Updated by Radoslaw Zarzynski almost 2 years ago

Sridhar, are you working on this?

Actions #3

Updated by Sridhar Seshasayee almost 2 years ago

  • Assignee set to Sridhar Seshasayee

@Radoslaw Zarzynski I found this during a review of a squid run that included a couple of my PRs.
I wasn't working on this, but to help out I can take it up and come up with a fix.

Actions #4

Updated by Laura Flores almost 2 years ago

@Sridhar Seshasayee you can add me as a reviewer if you raise a PR to add this to the ignorelist, or otherwise.

Actions #5

Updated by Sridhar Seshasayee almost 2 years ago · Edited

Further analysis of the logs shows that the OSD_DOWN warning was generated because osd.1 exceeded the heartbeat grace timeout,
as shown below; the warning was cleared a few seconds later:

2024-04-30T11:35:48.878+0000 fba9640 10 mon.a@0(leader).log v418  logging 2024-04-30T11:35:48.875903+0000 mon.a (mon.0) 1013 : cluster [INF] osd.1 failed (root=default,host=smithi005) (2 reporters from different osd after 87.150472 >= grace 80.000000)

...

2024-04-30T11:35:49.809+0000 d3a4640  2 mon.a@0(leader).osd e308  osd.1 DOWN
2024-04-30T11:35:49.810+0000 d3a4640 10 mon.a@0(leader).osd e308 encode_pending encoding full map with squid features 1080873258835847684
2024-04-30T11:35:49.813+0000 d3a4640 20 mon.a@0(leader).osd e308 encode_pending mon is running version: 19.0.0-2455-g09dbd6bb
2024-04-30T11:35:49.813+0000 d3a4640 20 mon.a@0(leader).osd e308  full_crc 1068048625 inc_crc 3555704646
2024-04-30T11:35:49.819+0000 d3a4640 10 mon.a@0(leader) e1 log_health updated 1 previous 0
2024-04-30T11:35:49.819+0000 d3a4640  0 log_channel(cluster) log [WRN] : Health check failed: 1 osds down (OSD_DOWN)

...

2024-04-30T11:35:55.298+0000 d3a4640  0 log_channel(cluster) log [INF] : Health check cleared: OSD_DOWN (was: 1 osds down)

The test sets osd_heartbeat_grace to 80 secs. Historically, this value was 40 secs and was
increased to 80 secs as part of the following commit:
https://github.com/ceph/ceph/pull/34011/commits/4fda9d50f09d527262fd65eab9b9cff3fd700aad
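The decision recorded in the log line above ("2 reporters from different osd after 87.150472 >= grace 80.000000") can be sketched as a small model. This is a simplified illustration of the reporter-count/grace check, not the actual monitor code; the function name and defaults are assumptions for the example:

```python
# Simplified model of the monitor's OSD failure decision: an OSD is
# marked down once failure reports from enough distinct peer OSDs
# have accumulated past the heartbeat grace period.

def should_mark_down(elapsed_secs: float, distinct_reporters: int,
                     grace: float = 80.0, min_reporters: int = 2) -> bool:
    """Return True when the OSD should be declared down."""
    return distinct_reporters >= min_reporters and elapsed_secs >= grace

# Values from this run: 87.15 secs elapsed with 2 reporters.
print(should_mark_down(87.150472, 2))             # True: 87.15 >= grace 80
print(should_mark_down(87.150472, 2, grace=90.0))  # False under the proposed grace of 90
```

With the grace raised to 90 secs, the same 87.15-sec delay would no longer trip the check, which is the rationale behind the proposal below.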

Considering the nature of the test, the osd_heartbeat_grace timeout can be increased further to 90 secs
on main and, to begin with, backported only to Squid.
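If that route is taken, the change would amount to bumping the grace value in the suite's teuthology override, roughly along these lines (hypothetical fragment; the real file, section, and value are settled in the PR):

```yaml
# Hypothetical override sketch; actual placement is decided in the PR.
overrides:
  ceph:
    conf:
      osd:
        osd heartbeat grace: 90
```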

@Radoslaw Zarzynski, what do you think?

Actions #6

Updated by Sridhar Seshasayee almost 2 years ago

  • Pull request ID set to 57485
Actions #7

Updated by Laura Flores almost 2 years ago

/a/yuriw-2024-04-11_17:03:54-rados-wip-yuri6-testing-2024-04-02-1310-distro-default-smithi/7652461

Actions #8

Updated by Laura Flores almost 2 years ago

Laura Flores wrote in #note-7:

/a/yuriw-2024-04-11_17:03:54-rados-wip-yuri6-testing-2024-04-02-1310-distro-default-smithi/7652461

Oops, this is not rados/verify; ignore

Actions #9

Updated by Radoslaw Zarzynski almost 2 years ago

  • Status changed from New to Fix Under Review

Will be taken into the next QA batch.

Actions #11

Updated by Sridhar Seshasayee over 1 year ago

  • Status changed from Fix Under Review to Pending Backport
  • Backport set to squid, reef
Actions #12

Updated by Sridhar Seshasayee over 1 year ago

  • Copied to Backport #67108: squid: rados/verify: "Health check failed: 1 osds down (OSD_DOWN)" in cluster log added
Actions #13

Updated by Sridhar Seshasayee over 1 year ago

  • Copied to Backport #67109: reef: rados/verify: "Health check failed: 1 osds down (OSD_DOWN)" in cluster log added
Actions #14

Updated by Upkeep Bot over 1 year ago

  • Tags (freeform) set to backport_processed
Actions #15

Updated by Laura Flores over 1 year ago

  • Related to Bug #67181: rados/verify: "Health check failed: 1 osds down (OSD_DOWN)" in cluster log added
Actions #16

Updated by Sridhar Seshasayee over 1 year ago

  • Status changed from Pending Backport to Resolved
Actions #17

Updated by Laura Flores about 1 year ago

  • Tags set to cluster-log-warning
Actions #18

Updated by Upkeep Bot 8 months ago

  • Merge Commit set to 1fa959e9826c66c3b55f3dff524aee9e1ec00c6a
  • Fixed In set to v19.3.0-3675-g1fa959e9826
  • Upkeep Timestamp set to 2025-07-11T11:09:11+00:00
Actions #19

Updated by Upkeep Bot 8 months ago

  • Fixed In changed from v19.3.0-3675-g1fa959e9826 to v19.3.0-3675-g1fa959e982
  • Upkeep Timestamp changed from 2025-07-11T11:09:11+00:00 to 2025-07-14T23:09:10+00:00
Actions #20

Updated by Upkeep Bot 5 months ago

  • Released In set to v20.2.0~2414
  • Upkeep Timestamp changed from 2025-07-14T23:09:10+00:00 to 2025-11-01T01:33:43+00:00