Bug #65768


rados/verify: "Health check failed: 1 osds down (OSD_DOWN)" in cluster log

Added by Sridhar Seshasayee almost 2 years ago. Updated 5 months ago.

Status:
Resolved
Priority:
Normal
Category:
-
Target version:
-
% Done:

0%

Source:
Backport:
squid, reef
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Tags (freeform):
backport_processed
Fixed In:
v19.3.0-3675-g1fa959e982
Released In:
v20.2.0~2414
Upkeep Timestamp:
2025-11-01T01:33:43+00:00

Description

This was observed on squid. I couldn't find a related tracker for this test on main.
A more thorough analysis of whether this also needs to be fixed on the main branch is needed.
If it does, this tracker can probably be clubbed with https://tracker.ceph.com/issues/65521,
which tracks a bunch of other trackers related to adding cluster log warnings to the ignorelist.

/a/yuriw-2024-04-30_03:21:19-rados-wip-yuri4-testing-2024-04-29-0642-distro-default-smithi/7680387

Description:
rados/verify/{centos_latest ceph clusters/{fixed-2 openstack} d-thrash/none mon_election/classic msgr-failures/few msgr/async objectstore/bluestore-low-osd-mem-target rados tasks/rados_api_tests validater/valgrind}

The OSD_DOWN warning is expected since the OSD is taken down as part of the thrasher, and the warning is eventually cleared.
It must therefore be added to the ignorelist.
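For context, teuthology suites suppress expected warnings via a `log-ignorelist` in the suite yaml. A minimal sketch of the kind of entry implied here (the exact file and placement are decided in the fix PR, so treat this as illustrative):

```yaml
# Illustrative fragment only; the actual suite file is chosen in the PR.
log-ignorelist:
  - \(OSD_DOWN\)
```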

2024-04-30T11:35:49.898+0000 11fae640 10 mon.a@0(leader).log v419  logging 2024-04-30T11:35:49.821422+0000 mon.a (mon.0) 1015 : cluster [WRN] Health check failed: 1 osds down (OSD_DOWN)

...

2024-04-30T11:35:55.298+0000 d3a4640  0 log_channel(cluster) log [INF] : Health check cleared: OSD_DOWN (was: 1 osds down)

Related issues 4 (1 open, 3 closed)

Related to RADOS - Bug #65521: Add expected warnings in cluster log to ignorelists (Closed, Laura Flores)

Related to RADOS - Bug #67181: rados/verify: "Health check failed: 1 osds down (OSD_DOWN)" in cluster log (Pending Backport, Laura Flores)

Copied to RADOS - Backport #67108: squid: rados/verify: "Health check failed: 1 osds down (OSD_DOWN)" in cluster log (Resolved, Sridhar Seshasayee)
Copied to RADOS - Backport #67109: reef: rados/verify: "Health check failed: 1 osds down (OSD_DOWN)" in cluster log (Resolved, Sridhar Seshasayee)
Actions #1

Updated by Laura Flores almost 2 years ago

  • Related to Bug #65521: Add expected warnings in cluster log to ignorelists added
Actions #2

Updated by Radoslaw Zarzynski almost 2 years ago

Sridhar, are you working on this?

Actions #3

Updated by Sridhar Seshasayee almost 2 years ago

  • Assignee set to Sridhar Seshasayee

@Radoslaw Zarzynski I found this during a review of a squid run that included a couple of my PRs.
I wasn't working on this, but to help out I can take it up and come up with a fix.

Actions #4

Updated by Laura Flores almost 2 years ago

@Sridhar Seshasayee you can add me as a reviewer if you raise a PR to add this to the ignorelist, or otherwise.

Actions #5

Updated by Sridhar Seshasayee almost 2 years ago · Edited

Further analysis of the logs shows that the OSD_DOWN warning was generated because osd.1 exceeded the heartbeat grace timeout,
as shown below; the warning was cleared a few seconds later:

2024-04-30T11:35:48.878+0000 fba9640 10 mon.a@0(leader).log v418  logging 2024-04-30T11:35:48.875903+0000 mon.a (mon.0) 1013 : cluster [INF] osd.1 failed (root=default,host=smithi005) (2 reporters from different osd after 87.150472 >= grace 80.000000)

...

2024-04-30T11:35:49.809+0000 d3a4640  2 mon.a@0(leader).osd e308  osd.1 DOWN
2024-04-30T11:35:49.810+0000 d3a4640 10 mon.a@0(leader).osd e308 encode_pending encoding full map with squid features 1080873258835847684
2024-04-30T11:35:49.813+0000 d3a4640 20 mon.a@0(leader).osd e308 encode_pending mon is running version: 19.0.0-2455-g09dbd6bb
2024-04-30T11:35:49.813+0000 d3a4640 20 mon.a@0(leader).osd e308  full_crc 1068048625 inc_crc 3555704646
2024-04-30T11:35:49.819+0000 d3a4640 10 mon.a@0(leader) e1 log_health updated 1 previous 0
2024-04-30T11:35:49.819+0000 d3a4640  0 log_channel(cluster) log [WRN] : Health check failed: 1 osds down (OSD_DOWN)

...

2024-04-30T11:35:55.298+0000 d3a4640  0 log_channel(cluster) log [INF] : Health check cleared: OSD_DOWN (was: 1 osds down)

The test sets osd_heartbeat_grace to 80 secs. Historically, this value was 40 secs and was
increased to 80 secs as part of the following commit:
https://github.com/ceph/ceph/pull/34011/commits/4fda9d50f09d527262fd65eab9b9cff3fd700aad
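The decision recorded in the log line above ("2 reporters from different osd after 87.150472 >= grace 80.000000") can be sketched as a small model. This is a simplified illustration of the reporter-count/grace check, not the actual monitor code; the function name and defaults are assumptions for the example:

```python
# Simplified model of the monitor's OSD failure decision: an OSD is
# marked down once failure reports from enough distinct peer OSDs
# have accumulated past the heartbeat grace period.

def should_mark_down(elapsed_secs: float, distinct_reporters: int,
                     grace: float = 80.0, min_reporters: int = 2) -> bool:
    """Return True when the OSD should be declared down."""
    return distinct_reporters >= min_reporters and elapsed_secs >= grace

# Values from this run: 87.15 secs elapsed with 2 reporters.
print(should_mark_down(87.150472, 2))             # True: 87.15 >= grace 80
print(should_mark_down(87.150472, 2, grace=90.0))  # False under the proposed grace of 90
```

With the grace raised to 90 secs, the same 87.15-sec delay would no longer trip the check, which is the rationale behind the proposal below.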

Considering the nature of the test, the osd_heartbeat_grace timeout can be increased further to 90 secs
on main and, to begin with, backported only to Squid.
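If that route is taken, the change would amount to bumping the grace value in the suite's teuthology override, roughly along these lines (hypothetical fragment; the real file, section, and value are settled in the PR):

```yaml
# Hypothetical override sketch; actual placement is decided in the PR.
overrides:
  ceph:
    conf:
      osd:
        osd heartbeat grace: 90
```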

@Radoslaw Zarzynski, what do you think?

Actions #6

Updated by Sridhar Seshasayee almost 2 years ago

  • Pull request ID set to 57485
Actions #7

Updated by Laura Flores almost 2 years ago

/a/yuriw-2024-04-11_17:03:54-rados-wip-yuri6-testing-2024-04-02-1310-distro-default-smithi/7652461

Actions #8

Updated by Laura Flores almost 2 years ago

Laura Flores wrote in #note-7:

/a/yuriw-2024-04-11_17:03:54-rados-wip-yuri6-testing-2024-04-02-1310-distro-default-smithi/7652461

Oops, this is not rados/verify; ignore

Actions #9

Updated by Radoslaw Zarzynski almost 2 years ago

  • Status changed from New to Fix Under Review

Will be taken into the next QA batch.

Actions #11

Updated by Sridhar Seshasayee over 1 year ago

  • Status changed from Fix Under Review to Pending Backport
  • Backport set to squid, reef
Actions #12

Updated by Sridhar Seshasayee over 1 year ago

  • Copied to Backport #67108: squid: rados/verify: "Health check failed: 1 osds down (OSD_DOWN)" in cluster log added
Actions #13

Updated by Sridhar Seshasayee over 1 year ago

  • Copied to Backport #67109: reef: rados/verify: "Health check failed: 1 osds down (OSD_DOWN)" in cluster log added
Actions #14

Updated by Upkeep Bot over 1 year ago

  • Tags (freeform) set to backport_processed
Actions #15

Updated by Laura Flores over 1 year ago

  • Related to Bug #67181: rados/verify: "Health check failed: 1 osds down (OSD_DOWN)" in cluster log added
Actions #16

Updated by Sridhar Seshasayee over 1 year ago

  • Status changed from Pending Backport to Resolved
Actions #17

Updated by Laura Flores about 1 year ago

  • Tags set to cluster-log-warning
Actions #18

Updated by Upkeep Bot 8 months ago

  • Merge Commit set to 1fa959e9826c66c3b55f3dff524aee9e1ec00c6a
  • Fixed In set to v19.3.0-3675-g1fa959e9826
  • Upkeep Timestamp set to 2025-07-11T11:09:11+00:00
Actions #19

Updated by Upkeep Bot 8 months ago

  • Fixed In changed from v19.3.0-3675-g1fa959e9826 to v19.3.0-3675-g1fa959e982
  • Upkeep Timestamp changed from 2025-07-11T11:09:11+00:00 to 2025-07-14T23:09:10+00:00
Actions #20

Updated by Upkeep Bot 5 months ago

  • Released In set to v20.2.0~2414
  • Upkeep Timestamp changed from 2025-07-14T23:09:10+00:00 to 2025-11-01T01:33:43+00:00