Bug #65521: Add expected warnings in cluster log to ignorelists

Added by Laura Flores almost 2 years ago. Updated about 1 year ago.

Status: Closed
Priority: Urgent
Assignee:
Category: -
Target version: -
% Done: 0%
Source:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Tags (freeform):
Merge Commit:
Fixed In:
Released In:
Upkeep Timestamp:

Description

Relevant Slack conversation:

Hey all, as I brought up in today's RADOS call, there has been a surge of cluster warnings in the rados and upgrade suites due to the merge of https://github.com/ceph/ceph/pull/54312 to main and squid.

Here are recent main baselines, where we have a huge percentage of failures due to cluster warnings:

Squid doesn't look nearly as bad, but still needs some attention, especially in the upgrade suite:

I've been making tracker issues to fix a lot of these warnings, but since there are so many and they are non-deterministic, I think this will need to be a group effort.
Here are some I've opened lately:

Any ideas on how we can effectively divide up the work and fix the suites are welcome. The idea is to go through each failure, identify whether the warning is expected (e.g., OSD_DOWN warnings are expected in thrash tests), and add it to the correct ignorelist in a PR like this: https://github.com/ceph/ceph/pull/56619

The mon_cluster_log_to_file change has not yet been backported to Quincy or Reef, but the same work will need to be done for these. I think we should run all suites against these patches and merge them along with ignorelist changes, rather than merging first and fixing second.
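
For illustration, the ignorelists referenced above live in the qa suite YAML fragments as log-ignorelist entries, each a regular expression matched against cluster log lines. A minimal sketch of the kind of change such a PR makes (the file path and exact patterns below are hypothetical; real entries are scoped per suite):

# qa/suites/rados/thrash/thrashers/default.yaml (hypothetical path)
# Each entry is a regex; a cluster log line matching any entry no
# longer fails the run.
overrides:
  ceph:
    log-ignorelist:
      - \(OSD_DOWN\)       # expected while the thrasher marks OSDs down
      - \(PG_DEGRADED\)    # expected during recovery after thrashing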

Related issues: 22 (12 open, 10 closed)

  • Related to RADOS - Bug #65422: upgrade/quincy-x: "1 pg degraded (PG_DEGRADED)" in cluster log (Resolved, Nitzan Mordechai)
  • Related to Orchestrator - Bug #64868: cephadm/osds, cephadm/workunits: Health check failed: 1 pool(s) do not have an application enabled (POOL_APP_NOT_ENABLED) in cluster log (New, Laura Flores)
  • Related to RADOS - Bug #65235: upgrade/reef-x/stress-split: "OSDMAP_FLAGS: noscrub flag(s) set" warning in cluster log (Resolved, Brad Hubbard)
  • Related to RADOS - Bug #62776: rados/basic: cluster [WRN] overall HEALTH_WARN - do not have an application enabled (Pending Backport, Laura Flores)
  • Related to Dashboard - Bug #64870: Health check failed: "1 osds down (OSD_DOWN)" in cluster log (Pending Backport, Nitzan Mordechai)
  • Related to Orchestrator - Bug #64872: rados/cephadm: Health check failed: 1 stray daemon(s) not managed by cephadm (CEPHADM_STRAY_DAEMON) in cluster log (Pending Backport, Nitzan Mordechai)
  • Related to Orchestrator - Bug #65728: Daemon managed by cephadm in an unknown state (CEPHADM_FAILED_DAEMON) (New, Adam King)
  • Related to RADOS - Bug #65768: rados/verify: Health check failed: "1 osds down (OSD_DOWN)" in cluster log (Resolved, Sridhar Seshasayee)
  • Related to Orchestrator - Bug #65824: rados/thrash-old-clients: cluster [WRN] Health detail: "HEALTH_WARN noscrub flag(s) set" in cluster log (Pending Backport, Kamoltat (Junior) Sirivadhna)
  • Related to RADOS - Bug #66474: rados/thrash-old-clients: HEALTH_WARN noscrub,nodeep-scrub flag(s) set; Degraded data redundancy (Duplicate, Laura Flores)
  • Related to RADOS - Bug #66602: rados/upgrade: Health check failed: 1 pool(s) do not have an application enabled (POOL_APP_NOT_ENABLED) (Pending Backport, Brad Hubbard)
  • Related to Orchestrator - Bug #66603: rados/cephadm/smoke: CEPHADM_AGENT_DOWN: 2 Cephadm Agent(s) are not reporting. Hosts may be offline (New, Adam King)
  • Related to RADOS - Bug #66604: rados/thrash-old-clients: SLOW_OPS: 17 slow ops, oldest one blocked for 213 sec, osd.11 has slow ops (Resolved, Nitzan Mordechai)
  • Related to RADOS - Bug #66809: upgrade/quincy-x; upgrade/reef-x: Health check failed: "Reduced data availability: 1 pg peering (PG_AVAILABILITY)" in cluster log (Pending Backport, Laura Flores)
  • Related to RADOS - Bug #66811: upgrade/reef-x/stress-split: Health check failed: "1/3 mons down, quorum a,b (MON_DOWN)" in cluster log (Duplicate)
  • Related to RADOS - Bug #67181: rados/verify: Health check failed: "1 osds down (OSD_DOWN)" in cluster log (Pending Backport, Laura Flores)
  • Related to RADOS - Bug #67182: rados/upgrade: Health check failed: "Degraded data redundancy: 2/6 objects degraded (33.333%), 1 pg degraded (PG_DEGRADED)" in cluster log (Resolved, Pere Díaz Bou)
  • Related to RADOS - Bug #67281: rados/upgrade/parallel - Health check failed: Reduced data availability: 1 pg peering (PG_AVAILABILITY) (Pending Backport, Laura Flores)
  • Related to RADOS - Bug #67584: upgrade:quincy-x: cluster [WRN] Health check failed: "1 osds down (OSD_DOWN)" in cluster log (Won't Fix - EOL, Laura Flores)
  • Related to RADOS - Bug #67879: upgrade/cephfs/mds_upgrade_sequence: Health detail: "HEALTH_WARN 1 osds down" in cluster log (Pending Backport, Kamoltat (Junior) Sirivadhna)
  • Related to RADOS - Bug #67970: rados/thrash-old-clients: "HEALTH_WARN Degraded data redundancy: 7/134 objects degraded (5.224%), 1 pg degraded" in cluster log (Resolved, Nitzan Mordechai)
  • Related to RADOS - Bug #68602: rados/thrash-old-clients: [WRN] PG_AVAILABILITY: Reduced data availability: 1 pg inactive, 1 pg peering (Resolved, Kamoltat (Junior) Sirivadhna)

Actions #1

Updated by Laura Flores almost 2 years ago

  • Related to Bug #65422: upgrade/quincy-x: "1 pg degraded (PG_DEGRADED)" in cluster log added
Actions #2

Updated by Laura Flores almost 2 years ago

  • Related to Bug #64868: cephadm/osds, cephadm/workunits: Health check failed: 1 pool(s) do not have an application enabled (POOL_APP_NOT_ENABLED) in cluster log added
Actions #3

Updated by Laura Flores almost 2 years ago

  • Related to Bug #65235: upgrade/reef-x/stress-split: "OSDMAP_FLAGS: noscrub flag(s) set" warning in cluster log added
  • Related to Bug #62776: rados/basic: cluster [WRN] overall HEALTH_WARN - do not have an application enabled added
  • Related to Bug #64870: Health check failed: "1 osds down (OSD_DOWN)" in cluster log added
Actions #4

Updated by Matan Breizman almost 2 years ago

/a/yuriw-2024-04-16_23:25:35-rados-wip-yuriw-testing-20240416.150233-distro-default-smithi/7659305

Actions #6

Updated by Laura Flores almost 2 years ago

/a/yuriw-2024-04-20_15:32:38-rados-wip-yuriw-testing-20240419.185239-main-distro-default-smithi/7664685

"2024-04-20T16:10:00.000158+0000 mon.a (mon.0) 1407 : cluster [WRN] Health detail: HEALTH_WARN nodeep-scrub flag(s) set; Reduced data availability: 1 pg peering" in cluster log
Actions #7

Updated by Laura Flores almost 2 years ago

In this one, we are intentionally setting OSDs down, so the warning is expected.

/a/yuriw-2024-04-20_15:32:38-rados-wip-yuriw-testing-20240419.185239-main-distro-default-smithi/7664689

2024-04-20T16:04:18.136 INFO:journalctl@ceph.mon.a.smithi012.stdout:Apr 20 16:04:17 smithi012 ceph-mon[17195]: from='mgr.14152 172.21.15.12:0/2573228170' entity='mgr.a' cmd=[{"prefix": "config generate-minimal-conf"}]: dispatch
2024-04-20T16:04:18.136 INFO:journalctl@ceph.mon.a.smithi012.stdout:Apr 20 16:04:17 smithi012 ceph-mon[17195]: from='mgr.14152 172.21.15.12:0/2573228170' entity='mgr.a' cmd=[{"prefix": "auth get", "entity": "client.admin"}]: dispatch
2024-04-20T16:04:18.136 INFO:journalctl@ceph.mon.a.smithi012.stdout:Apr 20 16:04:17 smithi012 ceph-mon[17195]: from='mgr.14152 172.21.15.12:0/2573228170' entity='mgr.a' cmd=[{"prefix": "osd df", "format": "json"}]: dispatch
2024-04-20T16:04:18.136 INFO:journalctl@ceph.mon.a.smithi012.stdout:Apr 20 16:04:17 smithi012 ceph-mon[17195]: from='mgr.14152 172.21.15.12:0/2573228170' entity='mgr.a' cmd=[{"prefix": "osd df", "format": "json"}]: dispatch
2024-04-20T16:04:18.136 INFO:journalctl@ceph.mon.a.smithi012.stdout:Apr 20 16:04:17 smithi012 ceph-mon[17195]: from='mgr.14152 172.21.15.12:0/2573228170' entity='mgr.a' cmd=[{"prefix": "osd safe-to-destroy", "ids": ["3"]}]: dispatch
2024-04-20T16:04:18.136 INFO:journalctl@ceph.mon.a.smithi012.stdout:Apr 20 16:04:17 smithi012 ceph-mon[17195]: from='mgr.14152 172.21.15.12:0/2573228170' entity='mgr.a' cmd=[{"prefix": "osd down", "ids": ["3"]}]: dispatch
2024-04-20T16:04:19.135 INFO:journalctl@ceph.mon.a.smithi012.stdout:Apr 20 16:04:18 smithi012 ceph-mon[17195]: from='mon.0 -' entity='mon.' cmd=[{"prefix": "osd df", "format": "json"}]: dispatch
2024-04-20T16:04:19.135 INFO:journalctl@ceph.mon.a.smithi012.stdout:Apr 20 16:04:18 smithi012 ceph-mon[17195]: from='mon.0 -' entity='mon.' cmd=[{"prefix": "osd df", "format": "json"}]: dispatch
2024-04-20T16:04:19.135 INFO:journalctl@ceph.mon.a.smithi012.stdout:Apr 20 16:04:18 smithi012 ceph-mon[17195]: from='mon.0 -' entity='mon.' cmd=[{"prefix": "osd safe-to-destroy", "ids": ["3"]}]: dispatch
2024-04-20T16:04:19.135 INFO:journalctl@ceph.mon.a.smithi012.stdout:Apr 20 16:04:18 smithi012 ceph-mon[17195]: Health check failed: 1 osds down (OSD_DOWN)
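
Since the test itself marks osd.3 down (the "osd down" dispatch above), an entry in that suite's ignorelist is the right fix rather than a code change. A hedged sketch of such an entry (the pattern and its placement are assumptions, not the contents of any actual PR):

# Sketch only: either the bare health code \(OSD_DOWN\) or a fuller
# pattern would match this line; entries are Python regexes.
overrides:
  ceph:
    log-ignorelist:
      - 'Health check failed: \d+ osds down \(OSD_DOWN\)'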

Actions #8

Updated by Laura Flores almost 2 years ago

  • Related to Bug #64872: rados/cephadm: Health check failed: 1 stray daemon(s) not managed by cephadm (CEPHADM_STRAY_DAEMON) in cluster log added
Actions #9

Updated by Laura Flores almost 2 years ago

/a/yuriw-2024-04-20_15:32:38-rados-wip-yuriw-testing-20240419.185239-main-distro-default-smithi/7664686

2024-04-20T16:26:04.659 INFO:teuthology.orchestra.run.smithi144.stdout:2024-04-20T16:10:00.000158+0000 mon.a (mon.0) 1407 : cluster [WRN] Health detail: HEALTH_WARN nodeep-scrub flag(s) set; Reduced data availability: 1 pg peering

Actions #10

Updated by Laura Flores almost 2 years ago · Edited

/a/yuriw-2024-04-20_15:32:38-rados-wip-yuriw-testing-20240419.185239-main-distro-default-smithi/7664765
/a/yuriw-2024-04-20_15:32:38-rados-wip-yuriw-testing-20240419.185239-main-distro-default-smithi/7664810

Actions #11

Updated by Laura Flores almost 2 years ago · Edited

/a/yuriw-2024-04-20_15:32:38-rados-wip-yuriw-testing-20240419.185239-main-distro-default-smithi/7664854
/a/yuriw-2024-04-20_15:32:38-rados-wip-yuriw-testing-20240419.185239-main-distro-default-smithi/7664891

POOL_APP_NOT_ENABLED

Actions #12

Updated by Laura Flores almost 2 years ago

/a/yuriw-2024-04-20_15:32:38-rados-wip-yuriw-testing-20240419.185239-main-distro-default-smithi/7664903

2024-04-20T17:46:51.770 INFO:teuthology.orchestra.run.smithi012.stdout:2024-04-20T17:44:38.893501+0000 mon.a (mon.0) 1023 : cluster [WRN] Health check failed: 2 Cephadm Agent(s) are not reporting. Hosts may be offline (CEPHADM_AGENT_DOWN)

Actions #13

Updated by Laura Flores almost 2 years ago

/a/yuriw-2024-04-20_15:32:38-rados-wip-yuriw-testing-20240419.185239-main-distro-default-smithi/7664940

OSD_DOWN

Actions #14

Updated by Laura Flores almost 2 years ago

  • Related to Bug #65728: Daemon managed by cephadm in an unknown state (CEPHADM_FAILED_DAEMON) added
Actions #15

Updated by Matan Breizman almost 2 years ago · Edited

/a/yuriw-2024-04-20_01:10:46-rados-wip-yuri7-testing-2024-04-18-1351-reef-distro-default-smithi/7664127
/a/yuriw-2024-04-20_01:10:46-rados-wip-yuri7-testing-2024-04-18-1351-reef-distro-default-smithi/7664245

Actions #16

Updated by Laura Flores almost 2 years ago

Partial fix for some of the warnings: https://github.com/ceph/ceph/pull/57218

Actions #17

Updated by Laura Flores almost 2 years ago

  • Related to Bug #65768: rados/verify: Health check failed: "1 osds down (OSD_DOWN)" in cluster log added
Actions #18

Updated by Kamoltat (Junior) Sirivadhna almost 2 years ago

  • Related to Bug #65824: rados/thrash-old-clients: cluster [WRN] Health detail: "HEALTH_WARN noscrub flag(s) set" in cluster log added
Actions #19

Updated by Nitzan Mordechai almost 2 years ago

/a/yuriw-2024-05-04_16:45:43-rados-wip-yuriw-testing-20240503.213524-main-distro-default-smithi/7691265

Actions #20

Updated by Laura Flores almost 2 years ago

/a/yuriw-2024-04-11_17:03:54-rados-wip-yuri6-testing-2024-04-02-1310-distro-default-smithi/7652461

Actions #21

Updated by Laura Flores almost 2 years ago

/a/yuriw-2024-04-11_17:03:54-rados-wip-yuri6-testing-2024-04-02-1310-distro-default-smithi/7652465

Actions #22

Updated by Laura Flores almost 2 years ago

/a/yuriw-2024-04-11_17:03:54-rados-wip-yuri6-testing-2024-04-02-1310-distro-default-smithi/7652467

Actions #23

Updated by Laura Flores almost 2 years ago

/a/yuriw-2024-04-11_17:03:54-rados-wip-yuri6-testing-2024-04-02-1310-distro-default-smithi/7652474

Actions #24

Updated by Laura Flores almost 2 years ago

/a/yuriw-2024-04-11_17:03:54-rados-wip-yuri6-testing-2024-04-02-1310-distro-default-smithi/7652477

Actions #26

Updated by Laura Flores almost 2 years ago

  • Related to Bug #66474: rados/thrash-old-clients: HEALTH_WARN noscrub,nodeep-scrub flag(s) set; Degraded data redundancy added
Actions #27

Updated by Laura Flores almost 2 years ago

  • Priority changed from Normal to Urgent
Actions #28

Updated by Laura Flores almost 2 years ago

  • Tracker changed from Cleanup to Bug
  • Regression set to No
  • Severity set to 3 - minor
Actions #30

Updated by Laura Flores over 1 year ago

  • Related to Bug #66602: rados/upgrade: Health check failed: 1 pool(s) do not have an application enabled (POOL_APP_NOT_ENABLED) added
Actions #31

Updated by Laura Flores over 1 year ago

  • Related to Bug #66603: rados/cephadm/smoke: CEPHADM_AGENT_DOWN: 2 Cephadm Agent(s) are not reporting. Hosts may be offline added
Actions #32

Updated by Laura Flores over 1 year ago

  • Related to Bug #66604: rados/thrash-old-clients: SLOW_OPS: 17 slow ops, oldest one blocked for 213 sec, osd.11 has slow ops added
Actions #33

Updated by Radoslaw Zarzynski over 1 year ago

  • Status changed from New to In Progress
  • Assignee set to Laura Flores

Note from bug scrub: I think @Laura Flores was already working on this.

Actions #34

Updated by Laura Flores over 1 year ago

  • Related to Bug #66809: upgrade/quincy-x; upgrade/reef-x: Health check failed: "Reduced data availability: 1 pg peering (PG_AVAILABILITY)" in cluster log added
Actions #35

Updated by Laura Flores over 1 year ago

  • Related to Bug #66810: upgrade/reef-x: "1 pg degraded (PG_DEGRADED)" in cluster log added
Actions #36

Updated by Laura Flores over 1 year ago

  • Related to deleted (Bug #66810: upgrade/reef-x: "1 pg degraded (PG_DEGRADED)" in cluster log)
Actions #37

Updated by Laura Flores over 1 year ago

  • Related to Bug #66811: upgrade/reef-x/stress-split: Health check failed: "1/3 mons down, quorum a,b (MON_DOWN)" in cluster log added
Actions #38

Updated by Laura Flores over 1 year ago

  • Related to Bug #67181: rados/verify: Health check failed: "1 osds down (OSD_DOWN)" in cluster log added
Actions #39

Updated by Laura Flores over 1 year ago

  • Related to Bug #67182: rados/upgrade: Health check failed: "Degraded data redundancy: 2/6 objects degraded (33.333%), 1 pg degraded (PG_DEGRADED)" in cluster log added
Actions #40

Updated by Nitzan Mordechai over 1 year ago

  • Related to Bug #67281: rados/upgrade/parallel - Health check failed: Reduced data availability: 1 pg peering (PG_AVAILABILITY) added
Actions #41

Updated by Laura Flores over 1 year ago

  • Related to Bug #67584: upgrade:quincy-x: cluster [WRN] Health check failed: "1 osds down (OSD_DOWN)" in cluster log added
Actions #42

Updated by Brad Hubbard over 1 year ago

@Laura Flores I've submitted a PR for https://tracker.ceph.com/issues/65235 which should address the reef-x tests. Might need your help to review which trackers need to be updated, thanks.

Actions #43

Updated by Laura Flores over 1 year ago

  • Related to Bug #67879: upgrade/cephfs/mds_upgrade_sequence: Health detail: "HEALTH_WARN 1 osds down" in cluster log added
Actions #44

Updated by Laura Flores over 1 year ago

Thanks @Brad Hubbard, will take a look.

Actions #45

Updated by Laura Flores over 1 year ago

  • Related to Bug #67970: rados/thrash-old-clients: "HEALTH_WARN Degraded data redundancy: 7/134 objects degraded (5.224%), 1 pg degraded" in cluster log added
Actions #46

Updated by Kamoltat (Junior) Sirivadhna over 1 year ago

  • Related to Bug #68602: rados/thrash-old-clients: [WRN] PG_AVAILABILITY: Reduced data availability: 1 pg inactive, 1 pg peering added
Actions #47

Updated by Laura Flores over 1 year ago

  • Status changed from In Progress to Resolved

Marking this as Resolved since most remaining issues are tracked individually.

Actions #48

Updated by Laura Flores about 1 year ago

  • Status changed from Resolved to In Progress

Reusing this tracker to track warnings whose ignorelist entries still need to be backported to Reef.

Take this wip run as an example: https://pulpito.ceph.com/yuriw-2025-03-07_23:09:12-rados-wip-yuri5-testing-2025-03-07-1307-reef-distro-default-smithi/

Actions #49

Updated by Laura Flores about 1 year ago

  • Status changed from In Progress to Closed