Bug #68005

open

smoke/basic: "Health check failed: Reduced data availability: 1 pg peering (PG_AVAILABILITY)" in cluster log

Added by Laura Flores over 1 year ago. Updated 28 days ago.

Status: Fix Under Review
Priority: Normal
Assignee:
Category: -
Target version: -
% Done: 0%
Source:
Backport: squid
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Tags (freeform):
Merge Commit:
Fixed In:
Released In:
Upkeep Timestamp:

Description

/a/sangadi-2024-09-06_06:53:40-smoke-squid-release-distro-default-smithi/7891716

In the teuthology.log, we can see that the warning occurred in mon.a on smithi100:

2024-09-06T08:06:15.563 INFO:teuthology.orchestra.run.smithi100.stdout:2024-09-06T07:57:42.942097+0000 mon.a (mon.0) 231 : cluster [WRN] Health check failed: Reduced data availability: 1 pg peering (PG_AVAILABILITY)

Checking the cluster log on smithi100, we can see that the osdmap changed epochs right before the PG_AVAILABILITY warning popped up. The warning cleared about 4 seconds later, roughly 6 seconds after the first osdmap change, and the cluster went back to a healthy state:

2024-09-06T07:57:40.939334+0000 mon.a (mon.0) 229 : cluster [DBG] osdmap e37: 4 total, 4 up, 4 in
2024-09-06T07:57:41.944173+0000 mon.a (mon.0) 230 : cluster [DBG] osdmap e38: 4 total, 4 up, 4 in
2024-09-06T07:57:42.880113+0000 mgr.x (mgr.4104) 63 : cluster [DBG] pgmap v93: 393 pgs: 1 peering, 392 active+clean; 587 KiB data, 137 MiB used, 360 GiB / 360 GiB avail; 36 KiB/s rd, 0 B/s wr, 59 op/s
2024-09-06T07:57:42.942097+0000 mon.a (mon.0) 231 : cluster [WRN] Health check failed: Reduced data availability: 1 pg peering (PG_AVAILABILITY)
2024-09-06T07:57:44.880723+0000 mgr.x (mgr.4104) 64 : cluster [DBG] pgmap v94: 393 pgs: 1 peering, 392 active+clean; 587 KiB data, 137 MiB used, 360 GiB / 360 GiB avail; 36 KiB/s rd, 0 B/s wr, 59 op/s
2024-09-06T07:57:46.881949+0000 mgr.x (mgr.4104) 65 : cluster [DBG] pgmap v95: 393 pgs: 393 active+clean; 587 KiB data, 137 MiB used, 360 GiB / 360 GiB avail; 36 KiB/s rd, 0 B/s wr, 59 op/s
2024-09-06T07:57:46.932914+0000 mon.a (mon.0) 232 : cluster [INF] Health check cleared: PG_AVAILABILITY (was: Reduced data availability: 1 pg peering)
2024-09-06T07:57:46.932956+0000 mon.a (mon.0) 233 : cluster [INF] Cluster is now healthy
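The timing above can be checked mechanically. The following is a minimal sketch, not part of the original report, that pairs the "Health check failed" and "Health check cleared" lines for one health check and returns the gap between them. The function name `warning_duration` is made up for illustration, and the timestamp format is assumed from the lines quoted above. For these two lines the gap is just under 4 seconds; the roughly 6 second figure corresponds to the window from the first osdmap change at 07:57:40 to the all-clean pgmap at 07:57:46.

```python
from datetime import datetime

# Two cluster log lines copied from the excerpt above.
CLUSTER_LOG = """\
2024-09-06T07:57:42.942097+0000 mon.a (mon.0) 231 : cluster [WRN] Health check failed: Reduced data availability: 1 pg peering (PG_AVAILABILITY)
2024-09-06T07:57:46.932914+0000 mon.a (mon.0) 232 : cluster [INF] Health check cleared: PG_AVAILABILITY (was: Reduced data availability: 1 pg peering)
"""

# Assumption: every cluster log line starts with a timestamp in this format.
TS_FORMAT = "%Y-%m-%dT%H:%M:%S.%f%z"


def warning_duration(log_text, check="PG_AVAILABILITY"):
    """Seconds between the first 'failed' line and the next 'cleared' line for `check`.

    Returns None if no complete failed/cleared pair is found.
    """
    failed_at = None
    for line in log_text.splitlines():
        if check not in line:
            continue
        stamp = datetime.strptime(line.split(" ", 1)[0], TS_FORMAT)
        if "Health check failed" in line and failed_at is None:
            failed_at = stamp
        elif "Health check cleared" in line and failed_at is not None:
            return (stamp - failed_at).total_seconds()
    return None


print(warning_duration(CLUSTER_LOG))
```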

Looking into the mon.a log, we can see that the warning comes from pg 4.19, which was reported as stuck peering for 63 seconds. This indicates that the OSDs involved in pg 4.19 (0 and 2 according to the last reported acting set) might not yet agree on their contents.

2024-09-06T07:57:44.880+0000 7f169623f640 20 mon.a@0(leader).mgrstat health checks:
{
    "PG_AVAILABILITY": {
        "severity": "HEALTH_WARN",
        "summary": {
            "message": "Reduced data availability: 1 pg peering",
            "count": 1
        },
        "detail": [
            {
                "message": "pg 4.19 is stuck peering for 63s, current state peering, last acting [0,2]" 
            }
        ]
    }
}
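To pull the PG id, the reported stuck time, and the acting set out of a health-check dump like the one above, something along these lines could be used. This is a sketch under assumptions: the message layout is inferred from the single detail line quoted here, and `DETAIL_RE` and `stuck_pgs` are illustrative names, not Ceph APIs.

```python
import json
import re

# Health-check payload as logged by mon.a above.
HEALTH_JSON = """
{
    "PG_AVAILABILITY": {
        "severity": "HEALTH_WARN",
        "summary": {
            "message": "Reduced data availability: 1 pg peering",
            "count": 1
        },
        "detail": [
            {
                "message": "pg 4.19 is stuck peering for 63s, current state peering, last acting [0,2]"
            }
        ]
    }
}
"""

# Assumption: detail messages follow the layout of the one line seen in this log.
DETAIL_RE = re.compile(
    r"pg (?P<pgid>\d+\.[0-9a-f]+) is stuck \w+ for (?P<secs>\d+)s, "
    r"current state [\w+]+, last acting \[(?P<acting>[\d,]*)\]"
)


def stuck_pgs(health):
    """Return (pgid, stuck_seconds, acting_osds) for every parsable detail message."""
    found = []
    for check in health.values():
        for item in check.get("detail", []):
            match = DETAIL_RE.search(item["message"])
            if match:
                acting = [int(osd) for osd in match["acting"].split(",") if osd]
                found.append((match["pgid"], int(match["secs"]), acting))
    return found


print(stuck_pgs(json.loads(HEALTH_JSON)))
```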

Farther up in mon.a.log, we can see that `pg-upmap-items`, from the balancer, was applied to pg 4.19, effectively changing the OSDs in the acting set.

2024-09-06T07:57:41.937+0000 7f1693a3a640  7 mon.a@0(leader).log v89 update_from_paxos applying incremental log 89 2024-09-06T07:57:40.879430+0000 mgr.x (mgr.4104) 62 : cluster [DBG] pgmap v90: 393 pgs: 393 active+clean; 587 KiB data, 137 MiB used, 360 GiB / 360 GiB avail; 24 KiB/s rd, 0 B/s wr, 39 op/s
2024-09-06T07:57:41.937+0000 7f1693a3a640  7 mon.a@0(leader).log v89 update_from_paxos applying incremental log 89 2024-09-06T07:57:40.936977+0000 mon.a (mon.0) 228 : audit [INF] from='mgr.4104 172.21.15.100:0/2983478942' entity='mgr.x' cmd='[{"prefix": "osd pg-upmap-items", "format": "json", "pgid": "4.19", "id": [1, 2]}]': finished

This explains why the PG_AVAILABILITY warning was raised for a short time while the new acting set finished peering.
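When hunting for the mapping change that precedes a peering warning, the audit entry can be parsed rather than read by eye. Below is a rough sketch that extracts the `osd pg-upmap-items` command from an audit line. Treating the `id` list as consecutive (from OSD, to OSD) pairs is an assumption about the command's semantics, and `upmap_commands` is a hypothetical helper name.

```python
import json
import re

# Audit entry copied from the mon.a log above (without the update_from_paxos prefix).
AUDIT_LINE = """2024-09-06T07:57:40.936977+0000 mon.a (mon.0) 228 : audit [INF] from='mgr.4104 172.21.15.100:0/2983478942' entity='mgr.x' cmd='[{"prefix": "osd pg-upmap-items", "format": "json", "pgid": "4.19", "id": [1, 2]}]': finished"""

# The command payload is a JSON list wrapped in cmd='...'.
CMD_RE = re.compile(r"cmd='(\[.*\])'")


def upmap_commands(line):
    """Return (pgid, [(from_osd, to_osd), ...]) for each pg-upmap-items command in `line`."""
    match = CMD_RE.search(line)
    if not match:
        return []
    results = []
    for cmd in json.loads(match.group(1)):
        if cmd.get("prefix") != "osd pg-upmap-items":
            continue
        ids = cmd["id"]
        # Assumption: ids alternate between source OSD and destination OSD.
        pairs = list(zip(ids[0::2], ids[1::2]))
        results.append((cmd["pgid"], pairs))
    return results


print(upmap_commands(AUDIT_LINE))
```

Under that reading, the entry remaps pg 4.19 from osd.1 to osd.2, which is consistent with the last acting set [0,2] reported in the health detail.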

Actions #1

Updated by Laura Flores over 1 year ago

This has happened previously with Reef, so it is not a blocker for Squid 19.2.0:

/a/teuthology-2024-09-08_05:16:02-smoke-reef-distro-default-smithi/7894222

Actions #2

Updated by Laura Flores over 1 year ago · Edited

Note from bug scrub: The pg in question likely exceeded the 60 second threshold that triggers the PG_AVAILABILITY warning, but it was really only stuck for 6 seconds. This falls in line with the pg-upmap-items mapping that was applied to it. The solution for this tracker would likely be to whitelist the warning in the smoke suite.
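As an illustration of what whitelisting means in practice, here is a minimal sketch of filtering cluster log warnings against a list of regular expressions. It assumes the suite matches warnings against such patterns; the pattern list, the `unexpected_warnings` helper, and the second sample line (OSD_DOWN) are hypothetical and are not taken from the actual fix.

```python
import re

# Hypothetical ignore patterns; the escaped parentheses match the literal "(PG_AVAILABILITY)".
IGNORELIST = [r"\(PG_AVAILABILITY\)"]

SAMPLE_LINES = [
    # Taken from the cluster log in the description.
    "2024-09-06T07:57:42.942097+0000 mon.a (mon.0) 231 : cluster [WRN] Health check failed: "
    "Reduced data availability: 1 pg peering (PG_AVAILABILITY)",
    # Made-up line, included only to show a warning that is not ignored.
    "2024-09-06T07:58:00.000000+0000 mon.a (mon.0) 240 : cluster [WRN] Health check failed: "
    "1 osds down (OSD_DOWN)",
]


def unexpected_warnings(log_lines, ignorelist):
    """Return the [WRN] lines that match none of the ignore patterns."""
    patterns = [re.compile(p) for p in ignorelist]
    return [
        line
        for line in log_lines
        if "[WRN]" in line and not any(p.search(line) for p in patterns)
    ]


for line in unexpected_warnings(SAMPLE_LINES, IGNORELIST):
    print(line)
```

With the PG_AVAILABILITY pattern in place, only the made-up OSD_DOWN line would still count against the run.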

Actions #3

Updated by Laura Flores over 1 year ago

  • Tags set to low-hanging-fruit
Actions #4

Updated by Ashrita Kollipara over 1 year ago

  • Assignee set to Ashrita Kollipara
Actions #5

Updated by Laura Flores over 1 year ago

Hey @Ashrita Kollipara how are you doing on this issue? Do you need any help?

Actions #6

Updated by Laura Flores over 1 year ago

@Ashrita Kollipara any update here?

Actions #7

Updated by Laura Flores over 1 year ago

Bump up

Actions #8

Updated by Laura Flores over 1 year ago

/a/yuriw-2024-12-13_17:07:44-smoke-squid-release-distro-default-smithi/8034622

Actions #10

Updated by Laura Flores over 1 year ago

  • Assignee changed from Ashrita Kollipara to Laura Flores
Actions #11

Updated by Laura Flores over 1 year ago

@Ronen Friedman let's file a new tracker for that one then.

Actions #12

Updated by Laura Flores about 1 year ago

  • Backport set to squid
Actions #13

Updated by Laura Flores about 1 year ago

  • Status changed from New to Fix Under Review
  • Pull request ID set to 61879
Actions #14

Updated by Laura Flores about 1 year ago

Waiting for review..

Actions #15

Updated by Radoslaw Zarzynski about 1 year ago

scrub note: approved, sent to QA.

Actions #16

Updated by Laura Flores about 1 year ago

Still in QA...

Actions #17

Updated by Laura Flores about 1 year ago

  • Status changed from Fix Under Review to In Progress
Actions #18

Updated by Radoslaw Zarzynski about 1 year ago

  • Status changed from In Progress to Fix Under Review

scrub note: review in progress.

Actions #19

Updated by Laura Flores about 1 year ago

  • Backport changed from squid to squid,reef
Actions #20

Updated by Laura Flores about 1 year ago

  • Tags changed from low-hanging-fruit to low-hanging-fruit, cluster-log-warning
Actions #21

Updated by Laura Flores about 1 year ago

Preparing backports...

Actions #22

Updated by Laura Flores about 1 year ago

/a/yuriw-2025-05-01_16:01:04-smoke-reef-release-distro-default-smithi/8267527

Actions #23

Updated by Laura Flores 12 months ago

Bump up

Actions #24

Updated by Laura Flores 3 months ago

/a/yuriw-2026-01-26_20:16:51-smoke-reef-release-distro-default-trial/18966

Actions #25

Updated by Konstantin Shalygin 28 days ago

  • Backport changed from squid,reef to squid