Bug #63198
rados/thrash: AssertionError: wait_for_recovery: failed before timeout expired
Description
/a/ksirivad-2023-10-13_01:58:36-rados-wip-ksirivad-fix-63183-distro-default-smithi/7423809/teuthology.log
2023-10-13T05:29:39.337 DEBUG:teuthology.orchestra.run:got remote process result: 124
2023-10-13T05:29:39.338 ERROR:teuthology.run_tasks:Manager failed: thrashosds
Traceback (most recent call last):
File "/home/teuthworker/src/git.ceph.com_teuthology_8cdab074dcca9a68965bc5a50e9c30b691949723/teuthology/run_tasks.py", line 154, in run_tasks
suppress = manager.__exit__(*exc_info)
File "/usr/lib/python3.8/contextlib.py", line 120, in __exit__
next(self.gen)
File "/home/teuthworker/src/github.com_ceph_ceph-c_5d1b5da21591c57cb0cbbbc8775b6ea0ced953a4/qa/tasks/thrashosds.py", line 215, in task
cluster_manager.wait_for_all_osds_up()
File "/home/teuthworker/src/github.com_ceph_ceph-c_5d1b5da21591c57cb0cbbbc8775b6ea0ced953a4/qa/tasks/ceph_manager.py", line 2792, in wait_for_all_osds_up
while not self.are_all_osds_up():
File "/home/teuthworker/src/github.com_ceph_ceph-c_5d1b5da21591c57cb0cbbbc8775b6ea0ced953a4/qa/tasks/ceph_manager.py", line 2782, in are_all_osds_up
x = self.get_osd_dump()
File "/home/teuthworker/src/github.com_ceph_ceph-c_5d1b5da21591c57cb0cbbbc8775b6ea0ced953a4/qa/tasks/ceph_manager.py", line 2545, in get_osd_dump
return self.get_osd_dump_json()['osds']
File "/home/teuthworker/src/github.com_ceph_ceph-c_5d1b5da21591c57cb0cbbbc8775b6ea0ced953a4/qa/tasks/ceph_manager.py", line 2537, in get_osd_dump_json
out = self.raw_cluster_cmd('osd', 'dump', '--format=json')
File "/home/teuthworker/src/github.com_ceph_ceph-c_5d1b5da21591c57cb0cbbbc8775b6ea0ced953a4/qa/tasks/ceph_manager.py", line 1611, in raw_cluster_cmd
return self.run_cluster_cmd(**kwargs).stdout.getvalue()
File "/home/teuthworker/src/github.com_ceph_ceph-c_5d1b5da21591c57cb0cbbbc8775b6ea0ced953a4/qa/tasks/ceph_manager.py", line 1602, in run_cluster_cmd
return self.controller.run(**kwargs)
File "/home/teuthworker/src/git.ceph.com_teuthology_8cdab074dcca9a68965bc5a50e9c30b691949723/teuthology/orchestra/remote.py", line 522, in run
r = self._runner(client=self.ssh, name=self.shortname, **kwargs)
File "/home/teuthworker/src/git.ceph.com_teuthology_8cdab074dcca9a68965bc5a50e9c30b691949723/teuthology/orchestra/run.py", line 455, in run
r.wait()
File "/home/teuthworker/src/git.ceph.com_teuthology_8cdab074dcca9a68965bc5a50e9c30b691949723/teuthology/orchestra/run.py", line 161, in wait
self._raise_for_status()
File "/home/teuthworker/src/git.ceph.com_teuthology_8cdab074dcca9a68965bc5a50e9c30b691949723/teuthology/orchestra/run.py", line 181, in _raise_for_status
raise CommandFailedError(
teuthology.exceptions.CommandFailedError: Command failed on smithi072 with status 124: 'sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 120 ceph --cluster ceph osd dump --format=json'
2023-10-13T05:29:39.347 DEBUG:teuthology.run_tasks:Unwinding manager ceph
2023-10-13T05:29:39.356 INFO:tasks.ceph.ceph_manager.ceph:waiting for clean
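For context on the exit code: status 124 is how GNU coreutils `timeout` reports that it killed the wrapped command after the deadline, so the `ceph osd dump` above hung for the full 120 seconds rather than failing on its own. A minimal illustration (assuming `timeout` from coreutils is on PATH; `sleep 5` stands in for the hung `ceph` call):

```python
import subprocess

# `timeout 0.2 sleep 5` kills `sleep` after 0.2s and exits with 124,
# the same status teuthology surfaced for the stuck `ceph osd dump`.
result = subprocess.run(["timeout", "0.2", "sleep", "5"])
print(result.returncode)  # 124
```

So the CommandFailedError is a symptom of the monitors/cluster being wedged, not a failure of the `osd dump` command itself.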
Updated by Neha Ojha over 2 years ago
Looks like the test had already failed much earlier:
2023-10-13T04:28:32.682 INFO:tasks.ceph.ceph_manager.ceph:PG 46.3 is not active+clean
2023-10-13T04:28:32.683 INFO:tasks.ceph.ceph_manager.ceph:PG 22.0 is not active+clean
2023-10-13T04:28:32.683 INFO:tasks.ceph.ceph_manager.ceph:PG 22.12 is not active+clean
2023-10-13T04:28:32.684 INFO:tasks.ceph.ceph_manager.ceph:PG 21.1a is not active+clean
...
2023-10-13T04:28:32.685 INFO:tasks.thrashosds.thrasher:Traceback (most recent call last):
File "/home/teuthworker/src/github.com_ceph_ceph-c_5d1b5da21591c57cb0cbbbc8775b6ea0ced953a4/qa/tasks/ceph_manager.py", line 190, in wrapper
return func(self)
File "/home/teuthworker/src/github.com_ceph_ceph-c_5d1b5da21591c57cb0cbbbc8775b6ea0ced953a4/qa/tasks/ceph_manager.py", line 1409, in _do_thrash
self.ceph_manager.wait_for_recovery(
File "/home/teuthworker/src/github.com_ceph_ceph-c_5d1b5da21591c57cb0cbbbc8775b6ea0ced953a4/qa/tasks/ceph_manager.py", line 2854, in wait_for_recovery
assert now - start < timeout, \
AssertionError: wait_for_recovery: failed before timeout expired
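For reference, the assertion in the traceback comes from a polling loop roughly like the following — a simplified sketch of `CephManager.wait_for_recovery()` in `qa/tasks/ceph_manager.py` (names and parameters approximate, not the exact implementation):

```python
import time

def wait_for_recovery(is_recovered, timeout, poll_interval=0.01,
                      clock=time.monotonic):
    """Poll until `is_recovered()` returns True, asserting if `timeout`
    seconds elapse first. Simplified model of the qa-suite helper; the
    real code polls `ceph pg dump` and checks recovery progress."""
    start = clock()
    while not is_recovered():
        now = clock()
        # This is the assertion that fires in the traceback above.
        assert now - start < timeout, \
            'wait_for_recovery: failed before timeout expired'
        time.sleep(poll_interval)
```

In other words, the AssertionError just means the PGs never reached a recovered state within the allotted time; the interesting question is why recovery stalled.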
Updated by Laura Flores about 2 years ago
- Subject changed from Command failed: ceph --cluster ceph osd dump --format=json to rados/thrash: AssertionError: wait_for_recovery: failed before timeout expired
Updated by Laura Flores about 2 years ago
/a/yuriw-2024-02-15_15:09:46-rados-wip-yuri5-testing-2024-02-14-1335-distro-default-smithi/7561084
Updated by Aishwarya Mathuria about 2 years ago
/a/yuriw-2024-03-19_00:09:45-rados-wip-yuri5-testing-2024-03-18-1144-distro-default-smithi/7609848
Updated by Radoslaw Zarzynski almost 2 years ago
Bump up but not terribly high prio.
Updated by Laura Flores almost 2 years ago
/a/yuriw-2024-04-20_15:32:38-rados-wip-yuriw-testing-20240419.185239-main-distro-default-smithi/7664923
2024-04-20T18:32:31.959 INFO:tasks.ceph.ceph_manager.ceph:PG 7.1b is not active+clean
2024-04-20T18:32:31.959 INFO:tasks.ceph.ceph_manager.ceph:{'pgid': '7.1b', 'version': "0'0", 'reported_seq': 1304, 'reported_epoch': 1285, 'state': 'active+clean+remapped', 'last_fresh': '2024-04-20T18:32:25.425840+0000', 'last_change': '2024-04-20T18:29:34.627792+0000', 'last_active': '2024-04-20T18:32:25.425840+0000', 'last_peered': '2024-04-20T18:32:25.425840+0000', 'last_clean': '2024-04-20T18:32:25.425840+0000', 'last_became_active': '2024-04-20T18:12:06.247961+0000', 'last_became_peered': '2024-04-20T18:12:06.247961+0000', 'last_unstale': '2024-04-20T18:32:25.425840+0000', 'last_undegraded': '2024-04-20T18:32:25.425840+0000', 'last_fullsized': '2024-04-20T18:32:25.425840+0000', 'mapping_epoch': 649, 'log_start': "0'0", 'ondisk_log_start': "0'0", 'created': 62, 'last_epoch_clean': 650, 'parent': '0.0', 'parent_split_bits': 0, 'last_scrub': "0'0", 'last_scrub_stamp': '2024-04-20T18:29:34.627661+0000', 'last_deep_scrub': "0'0", 'last_deep_scrub_stamp': '2024-04-20T18:08:16.466604+0000', 'last_clean_scrub_stamp': '2024-04-20T18:29:34.627661+0000', 'objects_scrubbed': 0, 'log_size': 0, 'log_dups_size': 0, 'ondisk_log_size': 0, 'stats_invalid': False, 'dirty_stats_invalid': False, 'omap_stats_invalid': False, 'hitset_stats_invalid': False, 'hitset_bytes_stats_invalid': False, 'pin_stats_invalid': False, 'manifest_stats_invalid': False, 'snaptrimq_len': 0, 'last_scrub_duration': 1, 'scrub_schedule': 'periodic scrub scheduled @ 2024-04-20T18:32:27.754898+0000', 'scrub_duration': 3, 'objects_trimmed': 0, 'snaptrim_duration': 0, 'stat_sum': {'num_bytes': 0, 'num_objects': 0, 'num_object_clones': 0, 'num_object_copies': 0, 'num_objects_missing_on_primary': 0, 'num_objects_missing': 0, 'num_objects_degraded': 0, 'num_objects_misplaced': 0, 'num_objects_unfound': 0, 'num_objects_dirty': 0, 'num_whiteouts': 0, 'num_read': 0, 'num_read_kb': 0, 'num_write': 0, 'num_write_kb': 0, 'num_scrub_errors': 0, 'num_shallow_scrub_errors': 0, 'num_deep_scrub_errors': 0, 
'num_objects_recovered': 0, 'num_bytes_recovered': 0, 'num_keys_recovered': 0, 'num_objects_omap': 0, 'num_objects_hit_set_archive': 0, 'num_bytes_hit_set_archive': 0, 'num_flush': 0, 'num_flush_kb': 0, 'num_evict': 0, 'num_evict_kb': 0, 'num_promote': 0, 'num_flush_mode_high': 0, 'num_flush_mode_low': 0, 'num_evict_mode_some': 0, 'num_evict_mode_full': 0, 'num_objects_pinned': 0, 'num_legacy_snapsets': 0, 'num_large_omap_objects': 0, 'num_objects_manifest': 0, 'num_omap_bytes': 0, 'num_omap_keys': 0, 'num_objects_repaired': 0}, 'up': [8], 'acting': [8, 9], 'avail_no_missing': ['8', '9'], 'object_location_counts': [], 'blocked_by': [], 'up_primary': 8, 'acting_primary': 8, 'purged_snaps': []}
2024-04-20T18:32:31.960 INFO:tasks.thrashosds.thrasher:Traceback (most recent call last):
File "/home/teuthworker/src/github.com_ceph_ceph-c_36c371567dddacf6207ea36f2535396ab31415fc/qa/tasks/ceph_manager.py", line 190, in wrapper
return func(self)
File "/home/teuthworker/src/github.com_ceph_ceph-c_36c371567dddacf6207ea36f2535396ab31415fc/qa/tasks/ceph_manager.py", line 1453, in _do_thrash
self.ceph_manager.wait_for_recovery(
File "/home/teuthworker/src/github.com_ceph_ceph-c_36c371567dddacf6207ea36f2535396ab31415fc/qa/tasks/ceph_manager.py", line 2918, in wait_for_recovery
assert now - start < timeout, \
AssertionError: wait_for_recovery: failed before timeout expired
Updated by Aishwarya Mathuria almost 2 years ago · Edited
/a/yuriw-2024-04-30_14:17:59-rados-wip-yuri5-testing-2024-04-17-1400-distro-default-smithi/7680980/
/a/yuriw-2024-04-30_14:17:59-rados-wip-yuri5-testing-2024-04-17-1400-distro-default-smithi/7680983/
Updated by Radoslaw Zarzynski almost 2 years ago
- Status changed from New to In Progress
Updated by Laura Flores almost 2 years ago
/a/yuriw-2024-05-02_23:59:28-rados-wip-yuriw11-testing-20240501.200505-squid-distro-default-smithi/7686989
Updated by Laura Flores almost 2 years ago
Note from bug scrub: Analysis in progress.
Updated by Sridhar Seshasayee almost 2 years ago
Seen on Squid:
/a/yuriw-2024-05-14_00:32:08-rados-wip-yuri4-testing-2024-04-29-0642-distro-default-smithi/7705408
Issue https://tracker.ceph.com/issues/64866 was also seen in the above test.
Updated by Laura Flores almost 2 years ago
In job /a/yuriw-2024-04-30_14:17:59-rados-wip-yuri5-testing-2024-04-17-1400-distro-default-smithi/7680983/, the last health state showed three "undersized+peered" pgs:
{
"fsid": "d7b7ec4f-715e-4dce-9022-a54f64cff2a8",
"health": {
"status": "HEALTH_WARN",
"checks": {
"OSDMAP_FLAGS": {
"severity": "HEALTH_WARN",
"summary": {
"message": "noscrub,nodeep-scrub flag(s) set",
"count": 20
},
"muted": false
},
"PG_AVAILABILITY": {
"severity": "HEALTH_WARN",
"summary": {
"message": "Reduced data availability: 3 pgs inactive",
"count": 3
},
"muted": false
},
"PG_DEGRADED": {
"severity": "HEALTH_WARN",
"summary": {
"message": "Degraded data redundancy: 1/141 objects degraded (0.709%), 1 pg degraded, 32 pgs undersized",
"count": 33
},
"muted": false
},
"POOL_APP_NOT_ENABLED": {
"severity": "HEALTH_WARN",
"summary": {
"message": "3 pool(s) do not have an application enabled",
"count": 3
},
"muted": false
}
},
"mutes": []
},
"election_epoch": 8,
"quorum": [
0,
1,
2
],
"quorum_names": [
"a",
"b",
"c"
],
"quorum_age": 2921,
"monmap": {
"epoch": 1,
"min_mon_release_name": "squid",
"num_mons": 3
},
"osdmap": {
"epoch": 2122,
"num_osds": 16,
"num_up_osds": 16,
"osd_up_since": 1714491421,
"num_in_osds": 2,
"osd_in_since": 1714491680,
"num_remapped_pgs": 49
},
"pgmap": {
"pgs_by_state": [
{
"state_name": "active+clean+remapped",
"count": 49
},
{
"state_name": "active+undersized",
"count": 28
},
{
"state_name": "undersized+peered",
"count": 3
},
{
"state_name": "active+undersized+degraded",
"count": 1
}
],
"num_pgs": 81,
"num_pools": 5,
"num_objects": 47,
"data_bytes": 4590667,
"bytes_used": 174608384,
"bytes_avail": 193098919936,
"bytes_total": 193273528320,
"inactive_pgs_ratio": 0.03703703731298447,
"degraded_objects": 1,
"degraded_total": 141,
"degraded_ratio": 0.0070921985815602835,
"misplaced_objects": 46,
"misplaced_total": 141,
"misplaced_ratio": 0.3262411347517731
},
"fsmap": {
"epoch": 1,
"by_rank": [],
"up:standby": 0
},
"mgrmap": {
"available": true,
"num_standbys": 1,
"modules": [
"iostat",
"nfs",
"restful"
],
"services": {}
},
"servicemap": {
"epoch": 30,
"modified": "2024-04-30T16:00:49.220070+0000",
"services": {}
},
"progress_events": {}
}
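The "undersized+peered" entries are the ones worth chasing: those PGs have peered but are not `active`, so they cannot serve I/O and are what PG_AVAILABILITY counts as inactive. A quick way to pull them out of the parsed `ceph status --format=json` output (hypothetical helper, operating on the structure shown above):

```python
def stuck_pg_states(status, required="active"):
    """Return pgs_by_state entries whose state string lacks `required`.
    `status` is parsed `ceph status --format=json` output; PG states
    are '+'-joined flags, and a PG without 'active' cannot serve I/O."""
    return [entry for entry in status["pgmap"]["pgs_by_state"]
            if required not in entry["state_name"].split("+")]

# Counts taken from the health dump above.
status = {"pgmap": {"pgs_by_state": [
    {"state_name": "active+clean+remapped", "count": 49},
    {"state_name": "active+undersized", "count": 28},
    {"state_name": "undersized+peered", "count": 3},
    {"state_name": "active+undersized+degraded", "count": 1},
]}}
print(stuck_pg_states(status))
# [{'state_name': 'undersized+peered', 'count': 3}]
```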
I chose to focus on pg 9.9, one of the "undersized+peered" pgs:
2024-04-30T15:48:39.312 INFO:tasks.ceph.ceph_manager.ceph:PG 9.9 is not active+clean
2024-04-30T15:48:39.312 INFO:tasks.ceph.ceph_manager.ceph:{'pgid': '9.9', 'version': "0'0", 'reported_seq': 1223, 'reported_epoch': 1152, 'state': 'undersized+peered', 'last_fresh': '2024-04-30T15:48:32.572917+0000', 'last_change': '2024-04-30T15:28:33.234925+0000', 'last_active': '2024-04-30T15:28:32.224347+0000', 'last_peered': '2024-04-30T15:48:32.572917+0000', 'last_clean': '2024-04-30T15:28:32.224347+0000', 'last_became_active': '0.000000', 'last_became_peered': '2024-04-30T15:28:33.234925+0000', 'last_unstale': '2024-04-30T15:48:32.572917+0000', 'last_undegraded': '2024-04-30T15:48:32.572917+0000', 'last_fullsized': '2024-04-30T15:28:33.233883+0000', 'mapping_epoch': 536, 'log_start': "0'0", 'ondisk_log_start': "0'0", 'created': 536, 'last_epoch_clean': 0, 'parent': '0.0', 'parent_split_bits': 0, 'last_scrub': "0'0", 'last_scrub_stamp': '2024-04-30T15:28:32.224347+0000', 'last_deep_scrub': "0'0", 'last_deep_scrub_stamp': '2024-04-30T15:28:32.224347+0000', 'last_clean_scrub_stamp': '2024-04-30T15:28:32.224347+0000', 'objects_scrubbed': 0, 'log_size': 0, 'log_dups_size': 0, 'ondisk_log_size': 0, 'stats_invalid': False, 'dirty_stats_invalid': False, 'omap_stats_invalid': False, 'hitset_stats_invalid': False, 'hitset_bytes_stats_invalid': False, 'pin_stats_invalid': False, 'manifest_stats_invalid': False, 'snaptrimq_len': 0, 'last_scrub_duration': 0, 'scrub_schedule': 'periodic scrub scheduled @ 2024-04-30T15:48:37.486139+0000', 'scrub_duration': 0, 'objects_trimmed': 0, 'snaptrim_duration': 0, 'stat_sum': {'num_bytes': 0, 'num_objects': 0, 'num_object_clones': 0, 'num_object_copies': 0, 'num_objects_missing_on_primary': 0, 'num_objects_missing': 0, 'num_objects_degraded': 0, 'num_objects_misplaced': 0, 'num_objects_unfound': 0, 'num_objects_dirty': 0, 'num_whiteouts': 0, 'num_read': 0, 'num_read_kb': 0, 'num_write': 0, 'num_write_kb': 0, 'num_scrub_errors': 0, 'num_shallow_scrub_errors': 0, 'num_deep_scrub_errors': 0, 'num_objects_recovered': 0, 
'num_bytes_recovered': 0, 'num_keys_recovered': 0, 'num_objects_omap': 0, 'num_objects_hit_set_archive': 0, 'num_bytes_hit_set_archive': 0, 'num_flush': 0, 'num_flush_kb': 0, 'num_evict': 0, 'num_evict_kb': 0, 'num_promote': 0, 'num_flush_mode_high': 0, 'num_flush_mode_low': 0, 'num_evict_mode_some': 0, 'num_evict_mode_full': 0, 'num_objects_pinned': 0, 'num_legacy_snapsets': 0, 'num_large_omap_objects': 0, 'num_objects_manifest': 0, 'num_omap_bytes': 0, 'num_omap_keys': 0, 'num_objects_repaired': 0}, 'up': [8], 'acting': [8], 'avail_no_missing': ['8'], 'object_location_counts': [], 'blocked_by': [], 'up_primary': 8, 'acting_primary': 8, 'purged_snaps': []}
In the log on osd.8 (the primary of pg 9.9), I found a message that said recovery raced:
2024-04-30T15:48:38.558+0000 7f564dfff640 10 osd.8 pg_epoch: 1156 pg[9.9( empty local-lis/les=0/0 n=0 ec=536/536 lis/c=0/0 les/c/f=0/0/0 sis=536) [8] r=0 lpr=536 crt=0'0 mlcod 0'0 undersized+peered mbc={}] recovery raced and were queued twice, ignoring!
I found the same issue in job /a/yuriw-2024-04-30_14:17:59-rados-wip-yuri5-testing-2024-04-17-1400-distro-default-smithi/7680983. Health reported two "undersized+peered" pgs:
{
"fsid": "1956bbde-6bdd-462a-8670-32191708a966",
"health": {
"status": "HEALTH_WARN",
"checks": {
"OSDMAP_FLAGS": {
"severity": "HEALTH_WARN",
"summary": {
"message": "noscrub,nodeep-scrub flag(s) set",
"count": 20
},
"muted": false
},
"PG_AVAILABILITY": {
"severity": "HEALTH_WARN",
"summary": {
"message": "Reduced data availability: 2 pgs inactive",
"count": 2
},
"muted": false
},
"PG_DEGRADED": {
"severity": "HEALTH_WARN",
"summary": {
"message": "Degraded data redundancy: 2 pgs undersized",
"count": 2
},
"muted": false
}
},
"mutes": []
},
"election_epoch": 6,
"quorum": [
0,
1,
2
],
"quorum_names": [
"a",
"b",
"c"
],
"quorum_age": 1891,
"monmap": {
"epoch": 1,
"min_mon_release_name": "squid",
"num_mons": 3
},
"osdmap": {
"epoch": 1156,
"num_osds": 16,
"num_up_osds": 16,
"osd_up_since": 1714490624,
"num_in_osds": 2,
"osd_in_since": 1714490877,
"num_remapped_pgs": 0
},
"pgmap": {
"pgs_by_state": [
{
"state_name": "active+clean",
"count": 23
},
{
"state_name": "undersized+peered",
"count": 2
}
],
"num_pgs": 25,
"num_pools": 3,
"num_objects": 14,
"data_bytes": 1376800,
"bytes_used": 982900736,
"bytes_avail": 192290627584,
"bytes_total": 193273528320,
"inactive_pgs_ratio": 0.07999999821186066
},
"fsmap": {
"epoch": 1,
"by_rank": [],
"up:standby": 0
},
"mgrmap": {
"available": true,
"num_standbys": 1,
"modules": [
"iostat",
"nfs",
"restful"
],
"services": {}
},
"servicemap": {
"epoch": 38,
"modified": "2024-04-30T15:48:35.863706+0000",
"services": {
"osd": {
"daemons": {
"3": {
"start_epoch": 0,
"start_stamp": "0.000000",
"gid": 0,
"addr": "(unrecognized address family 0)/0",
"metadata": {},
"task_status": {}
},
"15": {
"start_epoch": 0,
"start_stamp": "0.000000",
"gid": 0,
"addr": "(unrecognized address family 0)/0",
"metadata": {},
"task_status": {}
},
"summary": ""
}
}
}
},
"progress_events": {}
}
I chose to focus on pg 338.19:
2024-04-30T16:03:18.082 INFO:tasks.ceph.ceph_manager.ceph:PG 338.19 is not active+clean
2024-04-30T16:03:18.083 INFO:tasks.ceph.ceph_manager.ceph:{'pgid': '338.19', 'version': "0'0", 'reported_seq': 1370, 'reported_epoch': 2118, 'state': 'undersized+peered', 'last_fresh': '2024-04-30T16:03:12.216629+0000', 'last_change': '2024-04-30T15:41:32.624219+0000', 'last_active': '2024-04-30T15:41:31.404002+0000', 'last_peered': '2024-04-30T16:03:12.216629+0000', 'last_clean': '2024-04-30T15:41:31.404002+0000', 'last_became_active': '0.000000', 'last_became_peered': '2024-04-30T15:41:32.624219+0000', 'last_unstale': '2024-04-30T16:03:12.216629+0000', 'last_undegraded': '2024-04-30T16:03:12.216629+0000', 'last_fullsized': '2024-04-30T15:41:32.523134+0000', 'mapping_epoch': 1420, 'log_start': "0'0", 'ondisk_log_start': "0'0", 'created': 1420, 'last_epoch_clean': 0, 'parent': '0.0', 'parent_split_bits': 0, 'last_scrub': "0'0", 'last_scrub_stamp': '2024-04-30T15:41:31.404002+0000', 'last_deep_scrub': "0'0", 'last_deep_scrub_stamp': '2024-04-30T15:41:31.404002+0000', 'last_clean_scrub_stamp': '2024-04-30T15:41:31.404002+0000', 'objects_scrubbed': 0, 'log_size': 0, 'log_dups_size': 0, 'ondisk_log_size': 0, 'stats_invalid': False, 'dirty_stats_invalid': False, 'omap_stats_invalid': False, 'hitset_stats_invalid': False, 'hitset_bytes_stats_invalid': False, 'pin_stats_invalid': False, 'manifest_stats_invalid': False, 'snaptrimq_len': 0, 'last_scrub_duration': 0, 'scrub_schedule': 'periodic scrub scheduled @ 2024-04-30T16:03:14.055752+0000', 'scrub_duration': 0, 'objects_trimmed': 0, 'snaptrim_duration': 0, 'stat_sum': {'num_bytes': 0, 'num_objects': 0, 'num_object_clones': 0, 'num_object_copies': 0, 'num_objects_missing_on_primary': 0, 'num_objects_missing': 0, 'num_objects_degraded': 0, 'num_objects_misplaced': 0, 'num_objects_unfound': 0, 'num_objects_dirty': 0, 'num_whiteouts': 0, 'num_read': 0, 'num_read_kb': 0, 'num_write': 0, 'num_write_kb': 0, 'num_scrub_errors': 0, 'num_shallow_scrub_errors': 0, 'num_deep_scrub_errors': 0, 'num_objects_recovered': 0, 
'num_bytes_recovered': 0, 'num_keys_recovered': 0, 'num_objects_omap': 0, 'num_objects_hit_set_archive': 0, 'num_bytes_hit_set_archive': 0, 'num_flush': 0, 'num_flush_kb': 0, 'num_evict': 0, 'num_evict_kb': 0, 'num_promote': 0, 'num_flush_mode_high': 0, 'num_flush_mode_low': 0, 'num_evict_mode_some': 0, 'num_evict_mode_full': 0, 'num_objects_pinned': 0, 'num_legacy_snapsets': 0, 'num_large_omap_objects': 0, 'num_objects_manifest': 0, 'num_omap_bytes': 0, 'num_omap_keys': 0, 'num_objects_repaired': 0}, 'up': [5], 'acting': [5], 'avail_no_missing': ['5'], 'object_location_counts': [], 'blocked_by': [], 'up_primary': 5, 'acting_primary': 5, 'purged_snaps': []}
I found the same "recovery raced" message in the primary osd log:
2024-04-30T15:41:47.137+0000 7f826170c640 10 osd.5 pg_epoch: 1433 pg[338.19( empty local-lis/les=0/0 n=0 ec=1420/1420 lis/c=0/0 les/c/f=0/0/0 sis=1420) [5] r=0 lpr=1420 crt=0'0 mlcod 0'0 undersized+peered mbc={}] recovery raced and were queued twice, ignoring!
I'm still investigating what this message means and whether this is a true bug.
Updated by Laura Flores almost 2 years ago
/a/lflores-2024-06-18_23:03:53-rados-squid-release-distro-default-smithi/7762797
Updated by Laura Flores over 1 year ago
/a/yuriw-2024-07-17_13:32:02-rados-wip-yuri12-testing-2024-07-16-1122-distro-default-smithi/7805496
Updated by Ronen Friedman over 1 year ago
(seems to me that it's a real bug)
Updated by Radoslaw Zarzynski over 1 year ago
From the log entry quoted in comment 13 we can see why the PG went `undersized`: the acting set contains only a single OSD:
'up': [5], 'acting': [5], 'avail_no_missing': ['5'], 'object_location_counts': [], 'blocked_by': [], 'up_primary': 5, 'acting_primary': 5
As the problem was pretty isolated (2 undersized, many more active), this might be an instance of CRUSH running out of available retries (see: https://bugzilla.redhat.com/show_bug.cgi?id=2310830). If we had the OSDMap, we could have verified this with `crushtool --show-choose-tries`.
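For future runs, the CRUSH-retry check could be scripted roughly as follows — a sketch assuming access to a live cluster (or an archived binary OSDMap); the pool's rule id and replica count below are placeholders:

```shell
# Grab the current binary OSDMap from the monitors.
ceph osd getmap -o osdmap.bin

# Extract the CRUSH map embedded in it.
osdmaptool osdmap.bin --export-crush crushmap.bin

# Replay CRUSH mappings for the suspect rule and report how many
# choose retries each mapping needed; mappings that exhaust
# choose_total_tries come back short, leaving the PG undersized.
crushtool -i crushmap.bin --test --show-choose-tries --rule 1 --num-rep 2
```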
Updated by Laura Flores over 1 year ago
I might need to modify the test to dump the JSON of the OSDMap.