Bug #68005
opensmoke/basic: "Health check failed: Reduced data availability: 1 pg peering (PG_AVAILABILITY)" in cluster log
Description
/a/sangadi-2024-09-06_06:53:40-smoke-squid-release-distro-default-smithi/7891716
In the teuthology.log, we can see that the warning occurred in mon.a on smithi100:
2024-09-06T08:06:15.563 INFO:teuthology.orchestra.run.smithi100.stdout:2024-09-06T07:57:42.942097+0000 mon.a (mon.0) 231 : cluster [WRN] Health check failed: Reduced data availability: 1 pg peering (PG_AVAILABILITY)
Checking into the cluster log on smithi100, we can see that the osdmap changed epochs right before the PG_AVAILABILITY warning popped up. The warning stayed for 6 seconds before the cluster went back to a healthy state:
2024-09-06T07:57:40.939334+0000 mon.a (mon.0) 229 : cluster [DBG] osdmap e37: 4 total, 4 up, 4 in
2024-09-06T07:57:41.944173+0000 mon.a (mon.0) 230 : cluster [DBG] osdmap e38: 4 total, 4 up, 4 in
2024-09-06T07:57:42.880113+0000 mgr.x (mgr.4104) 63 : cluster [DBG] pgmap v93: 393 pgs: 1 peering, 392 active+clean; 587 KiB data, 137 MiB used, 360 GiB / 360 GiB avail; 36 KiB/s rd, 0 B/s wr, 59 op/s
2024-09-06T07:57:42.942097+0000 mon.a (mon.0) 231 : cluster [WRN] Health check failed: Reduced data availability: 1 pg peering (PG_AVAILABILITY)
2024-09-06T07:57:44.880723+0000 mgr.x (mgr.4104) 64 : cluster [DBG] pgmap v94: 393 pgs: 1 peering, 392 active+clean; 587 KiB data, 137 MiB used, 360 GiB / 360 GiB avail; 36 KiB/s rd, 0 B/s wr, 59 op/s
2024-09-06T07:57:46.881949+0000 mgr.x (mgr.4104) 65 : cluster [DBG] pgmap v95: 393 pgs: 393 active+clean; 587 KiB data, 137 MiB used, 360 GiB / 360 GiB avail; 36 KiB/s rd, 0 B/s wr, 59 op/s
2024-09-06T07:57:46.932914+0000 mon.a (mon.0) 232 : cluster [INF] Health check cleared: PG_AVAILABILITY (was: Reduced data availability: 1 pg peering)
2024-09-06T07:57:46.932956+0000 mon.a (mon.0) 233 : cluster [INF] Cluster is now healthy
Looking into the mon.a log, we can see that the warning comes from pg 4.19, which was reported as stuck peering for 63 seconds. This indicates that the OSDs involved in pg 4.19 (0 and 2, according to the last reported acting set) might not yet agree on their contents.
2024-09-06T07:57:44.880+0000 7f169623f640 20 mon.a@0(leader).mgrstat health checks:
{
"PG_AVAILABILITY": {
"severity": "HEALTH_WARN",
"summary": {
"message": "Reduced data availability: 1 pg peering",
"count": 1
},
"detail": [
{
"message": "pg 4.19 is stuck peering for 63s, current state peering, last acting [0,2]"
}
]
}
}
Farther up in mon.a.log, we can see that a `pg-upmap-items` command from the balancer was applied to pg 4.19, effectively changing the OSDs in the acting set.
2024-09-06T07:57:41.937+0000 7f1693a3a640 7 mon.a@0(leader).log v89 update_from_paxos applying incremental log 89
2024-09-06T07:57:40.879430+0000 mgr.x (mgr.4104) 62 : cluster [DBG] pgmap v90: 393 pgs: 393 active+clean; 587 KiB data, 137 MiB used, 360 GiB / 360 GiB avail; 24 KiB/s rd, 0 B/s wr, 39 op/s
2024-09-06T07:57:41.937+0000 7f1693a3a640 7 mon.a@0(leader).log v89 update_from_paxos applying incremental log 89
2024-09-06T07:57:40.936977+0000 mon.a (mon.0) 228 : audit [INF] from='mgr.4104 172.21.15.100:0/2983478942' entity='mgr.x' cmd='[{"prefix": "osd pg-upmap-items", "format": "json", "pgid": "4.19", "id": [1, 2]}]': finished
This explains why PG_AVAILABILITY dipped for a short time while the new acting set got up to speed.
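The audit entry's cmd field is plain JSON, so the remapping can be decoded mechanically. A small sketch (the log fragment is copied from the excerpt above; the decoding of "id" as flat from/to OSD pairs follows the `osd pg-upmap-items` argument convention):

```python
import json
import re

# Tail of the audit log entry from mon.a.log (copied from the excerpt above).
audit_line = (
    "from='mgr.4104 172.21.15.100:0/2983478942' entity='mgr.x' "
    "cmd='[{\"prefix\": \"osd pg-upmap-items\", \"format\": \"json\", "
    "\"pgid\": \"4.19\", \"id\": [1, 2]}]': finished"
)

# Pull the JSON payload out of the cmd='...' field.
match = re.search(r"cmd='(\[.*\])'", audit_line)
cmd = json.loads(match.group(1))[0]

pgid = cmd["pgid"]
# "id" is a flat sequence of (from_osd, to_osd) pairs.
pairs = list(zip(cmd["id"][::2], cmd["id"][1::2]))

print(f"pg {pgid}: remap {pairs}")  # pg 4.19: remap [(1, 2)]
```

So the balancer moved pg 4.19's data from osd.1 to osd.2, which forced a brief repeering.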
Updated by Laura Flores over 1 year ago
This has happened previously with Reef, so it is not a blocker for Squid 19.2.0:
/a/teuthology-2024-09-08_05:16:02-smoke-reef-distro-default-smithi/7894222
Updated by Laura Flores over 1 year ago · Edited
Note from bug scrub: The pg in question likely exceeded the 60-second threshold before the PG_AVAILABILITY warning came up, but it was really only stuck for 6 seconds. This falls in line with the pg-upmap-items mapping that was applied to it. The solution for this tracker would likely be to whitelist the warning in the smoke suite.
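In teuthology suites such warnings are typically suppressed with a `log-ignorelist` entry in the suite's ceph overrides. A sketch of what the whitelist could look like (the exact file in the smoke suite and the surrounding keys are assumptions, not taken from the PR):

```yaml
overrides:
  ceph:
    log-ignorelist:
      - \(PG_AVAILABILITY\)
```

The entry is a regex matched against cluster log lines, so escaping the parentheses keeps the match literal.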
Updated by Laura Flores over 1 year ago
Hey @Ashrita Kollipara how are you doing on this issue? Do you need any help?
Updated by Laura Flores over 1 year ago
/a/yuriw-2024-12-13_17:07:44-smoke-squid-release-distro-default-smithi/8034622
Updated by Ronen Friedman over 1 year ago
@Laura Flores
What seems to me to be the same problem appears in the upgrade suite (not 'smoke'):
https://pulpito.ceph.com/skanta-2024-12-20_03:35:31-rados-wip-bharath6-testing-2024-12-19-0956-squid-distro-default-smithi/8045472
Updated by Laura Flores over 1 year ago
- Assignee changed from Ashrita Kollipara to Laura Flores
Updated by Laura Flores over 1 year ago
@Ronen Friedman let's file a new tracker for that one then.
Updated by Laura Flores about 1 year ago
- Status changed from New to Fix Under Review
- Pull request ID set to 61879
Updated by Radoslaw Zarzynski about 1 year ago
scrub note: approved, sent to QA.
Updated by Laura Flores about 1 year ago
- Status changed from Fix Under Review to In Progress
Updated by Radoslaw Zarzynski about 1 year ago
- Status changed from In Progress to Fix Under Review
scrub note: review in progress.
Updated by Laura Flores about 1 year ago
- Backport changed from squid to squid,reef
Updated by Laura Flores about 1 year ago
- Tags changed from low-hanging-fruit to low-hanging-fruit, cluster-log-warning
Updated by Laura Flores about 1 year ago
/a/yuriw-2025-05-01_16:01:04-smoke-reef-release-distro-default-smithi/8267527
Updated by Laura Flores 3 months ago
/a/yuriw-2026-01-26_20:16:51-smoke-reef-release-distro-default-trial/18966
Updated by Konstantin Shalygin 28 days ago
- Backport changed from squid,reef to squid