Bug #68005
opensmoke/basic: "Health check failed: Reduced data availability: 1 pg peering (PG_AVAILABILITY)" in cluster log
Description
/a/sangadi-2024-09-06_06:53:40-smoke-squid-release-distro-default-smithi/7891716
In the teuthology.log, we can see that the warning occurred in mon.a on smithi100:
2024-09-06T08:06:15.563 INFO:teuthology.orchestra.run.smithi100.stdout:2024-09-06T07:57:42.942097+0000 mon.a (mon.0) 231 : cluster [WRN] Health check failed: Reduced data availability: 1 pg peering (PG_AVAILABILITY)
Checking into the cluster log on smithi100, we can see that the osdmap changed epochs right before the PG_AVAILABILITY warning popped up. The warning stayed for 6 seconds before the cluster went back to a healthy state:
2024-09-06T07:57:40.939334+0000 mon.a (mon.0) 229 : cluster [DBG] osdmap e37: 4 total, 4 up, 4 in
2024-09-06T07:57:41.944173+0000 mon.a (mon.0) 230 : cluster [DBG] osdmap e38: 4 total, 4 up, 4 in
2024-09-06T07:57:42.880113+0000 mgr.x (mgr.4104) 63 : cluster [DBG] pgmap v93: 393 pgs: 1 peering, 392 active+clean; 587 KiB data, 137 MiB used, 360 GiB / 360 GiB avail; 36 KiB/s rd, 0 B/s wr, 59 op/s
2024-09-06T07:57:42.942097+0000 mon.a (mon.0) 231 : cluster [WRN] Health check failed: Reduced data availability: 1 pg peering (PG_AVAILABILITY)
2024-09-06T07:57:44.880723+0000 mgr.x (mgr.4104) 64 : cluster [DBG] pgmap v94: 393 pgs: 1 peering, 392 active+clean; 587 KiB data, 137 MiB used, 360 GiB / 360 GiB avail; 36 KiB/s rd, 0 B/s wr, 59 op/s
2024-09-06T07:57:46.881949+0000 mgr.x (mgr.4104) 65 : cluster [DBG] pgmap v95: 393 pgs: 393 active+clean; 587 KiB data, 137 MiB used, 360 GiB / 360 GiB avail; 36 KiB/s rd, 0 B/s wr, 59 op/s
2024-09-06T07:57:46.932914+0000 mon.a (mon.0) 232 : cluster [INF] Health check cleared: PG_AVAILABILITY (was: Reduced data availability: 1 pg peering)
2024-09-06T07:57:46.932956+0000 mon.a (mon.0) 233 : cluster [INF] Cluster is now healthy
Looking into the mon.a log, we can see that the warning comes from pg 4.19, which was reported as stuck peering for 63 seconds. This indicates that the OSDs involved in pg 4.19 (0 and 2, according to the last reported acting set) might not yet agree on their contents.
2024-09-06T07:57:44.880+0000 7f169623f640 20 mon.a@0(leader).mgrstat health checks:
{
"PG_AVAILABILITY": {
"severity": "HEALTH_WARN",
"summary": {
"message": "Reduced data availability: 1 pg peering",
"count": 1
},
"detail": [
{
"message": "pg 4.19 is stuck peering for 63s, current state peering, last acting [0,2]"
}
]
}
}
Farther up in mon.a.log, we can see that a `pg-upmap-items` command from the balancer was applied to pg 4.19, effectively changing the OSDs in the acting set.
2024-09-06T07:57:41.937+0000 7f1693a3a640 7 mon.a@0(leader).log v89 update_from_paxos applying incremental log 89
2024-09-06T07:57:40.879430+0000 mgr.x (mgr.4104) 62 : cluster [DBG] pgmap v90: 393 pgs: 393 active+clean; 587 KiB data, 137 MiB used, 360 GiB / 360 GiB avail; 24 KiB/s rd, 0 B/s wr, 39 op/s
2024-09-06T07:57:41.937+0000 7f1693a3a640 7 mon.a@0(leader).log v89 update_from_paxos applying incremental log 89
2024-09-06T07:57:40.936977+0000 mon.a (mon.0) 228 : audit [INF] from='mgr.4104 172.21.15.100:0/2983478942' entity='mgr.x' cmd='[{"prefix": "osd pg-upmap-items", "format": "json", "pgid": "4.19", "id": [1, 2]}]': finished
This explains why PG_AVAILABILITY dipped for a short time while the new acting set got up to speed.
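The audit entry's cmd field is plain JSON, so the remapping can be decoded mechanically. A small sketch (the log fragment is copied from the excerpt above; the decoding of "id" as flat from/to OSD pairs follows the `osd pg-upmap-items` argument convention):

```python
import json
import re

# Tail of the audit log entry from mon.a.log (copied from the excerpt above).
audit_line = (
    "from='mgr.4104 172.21.15.100:0/2983478942' entity='mgr.x' "
    "cmd='[{\"prefix\": \"osd pg-upmap-items\", \"format\": \"json\", "
    "\"pgid\": \"4.19\", \"id\": [1, 2]}]': finished"
)

# Pull the JSON payload out of the cmd='...' field.
match = re.search(r"cmd='(\[.*\])'", audit_line)
cmd = json.loads(match.group(1))[0]

pgid = cmd["pgid"]
# "id" is a flat sequence of (from_osd, to_osd) pairs.
pairs = list(zip(cmd["id"][::2], cmd["id"][1::2]))

print(f"pg {pgid}: remap {pairs}")  # pg 4.19: remap [(1, 2)]
```

So the balancer moved pg 4.19's data from osd.1 to osd.2, which forced a brief repeering.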
Updated by Laura Flores over 1 year ago
This has happened previously with Reef, so it is not a blocker for Squid 19.2.0:
/a/teuthology-2024-09-08_05:16:02-smoke-reef-distro-default-smithi/7894222
Updated by Laura Flores over 1 year ago · Edited
Note from bug scrub: The pg in question likely exceeded the 60-second threshold before the PG_AVAILABILITY warning came up, but it was really only stuck for 6 seconds. This falls in line with the pg-upmap-items mapping that was applied to it. The solution for this tracker would likely be to whitelist the warning in the smoke suite.
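In teuthology suites such warnings are typically suppressed with a `log-ignorelist` entry in the suite's ceph overrides. A sketch of what the whitelist could look like (the exact file in the smoke suite and the surrounding keys are assumptions, not taken from the PR):

```yaml
overrides:
  ceph:
    log-ignorelist:
      - \(PG_AVAILABILITY\)
```

The entry is a regex matched against cluster log lines, so escaping the parentheses keeps the match literal.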
Updated by Laura Flores over 1 year ago
Hey @Ashrita Kollipara how are you doing on this issue? Do you need any help?
Updated by Laura Flores over 1 year ago
/a/yuriw-2024-12-13_17:07:44-smoke-squid-release-distro-default-smithi/8034622
Updated by Ronen Friedman over 1 year ago
@Laura Flores
What seems to me to be the same problem appears in the upgrade suite (not 'smoke'):
https://pulpito.ceph.com/skanta-2024-12-20_03:35:31-rados-wip-bharath6-testing-2024-12-19-0956-squid-distro-default-smithi/8045472
Updated by Laura Flores over 1 year ago
- Assignee changed from Ashrita Kollipara to Laura Flores
Updated by Laura Flores over 1 year ago
@Ronen Friedman let's file a new tracker for that one then.
Updated by Laura Flores about 1 year ago
- Status changed from New to Fix Under Review
- Pull request ID set to 61879
Updated by Radoslaw Zarzynski about 1 year ago
scrub note: approved, sent to QA.
Updated by Laura Flores about 1 year ago
- Status changed from Fix Under Review to In Progress
Updated by Radoslaw Zarzynski about 1 year ago
- Status changed from In Progress to Fix Under Review
scrub note: review in progress.
Updated by Laura Flores about 1 year ago
- Backport changed from squid to squid,reef
Updated by Laura Flores about 1 year ago
- Tags changed from low-hanging-fruit to low-hanging-fruit, cluster-log-warning
Updated by Laura Flores about 1 year ago
/a/yuriw-2025-05-01_16:01:04-smoke-reef-release-distro-default-smithi/8267527
Updated by Laura Flores 3 months ago
/a/yuriw-2026-01-26_20:16:51-smoke-reef-release-distro-default-trial/18966
Updated by Konstantin Shalygin 28 days ago
- Backport changed from squid,reef to squid