Bug #63198

open

rados/thrash: AssertionError: wait_for_recovery: failed before timeout expired

Added by Kamoltat (Junior) Sirivadhna over 2 years ago. Updated over 1 year ago.

Status:
In Progress
Priority:
Normal
Assignee:
Category:
-
Target version:
-
% Done:

0%

Source:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Tags (freeform):
Merge Commit:
Fixed In:
Released In:
Upkeep Timestamp:

Description

/a/ksirivad-2023-10-13_01:58:36-rados-wip-ksirivad-fix-63183-distro-default-smithi/7423809/teuthology.log

2023-10-13T05:29:39.337 DEBUG:teuthology.orchestra.run:got remote process result: 124
2023-10-13T05:29:39.338 ERROR:teuthology.run_tasks:Manager failed: thrashosds
Traceback (most recent call last):
  File "/home/teuthworker/src/git.ceph.com_teuthology_8cdab074dcca9a68965bc5a50e9c30b691949723/teuthology/run_tasks.py", line 154, in run_tasks
    suppress = manager.__exit__(*exc_info)
  File "/usr/lib/python3.8/contextlib.py", line 120, in __exit__
    next(self.gen)
  File "/home/teuthworker/src/github.com_ceph_ceph-c_5d1b5da21591c57cb0cbbbc8775b6ea0ced953a4/qa/tasks/thrashosds.py", line 215, in task
    cluster_manager.wait_for_all_osds_up()
  File "/home/teuthworker/src/github.com_ceph_ceph-c_5d1b5da21591c57cb0cbbbc8775b6ea0ced953a4/qa/tasks/ceph_manager.py", line 2792, in wait_for_all_osds_up
    while not self.are_all_osds_up():
  File "/home/teuthworker/src/github.com_ceph_ceph-c_5d1b5da21591c57cb0cbbbc8775b6ea0ced953a4/qa/tasks/ceph_manager.py", line 2782, in are_all_osds_up
    x = self.get_osd_dump()
  File "/home/teuthworker/src/github.com_ceph_ceph-c_5d1b5da21591c57cb0cbbbc8775b6ea0ced953a4/qa/tasks/ceph_manager.py", line 2545, in get_osd_dump
    return self.get_osd_dump_json()['osds']
  File "/home/teuthworker/src/github.com_ceph_ceph-c_5d1b5da21591c57cb0cbbbc8775b6ea0ced953a4/qa/tasks/ceph_manager.py", line 2537, in get_osd_dump_json
    out = self.raw_cluster_cmd('osd', 'dump', '--format=json')
  File "/home/teuthworker/src/github.com_ceph_ceph-c_5d1b5da21591c57cb0cbbbc8775b6ea0ced953a4/qa/tasks/ceph_manager.py", line 1611, in raw_cluster_cmd
    return self.run_cluster_cmd(**kwargs).stdout.getvalue()
  File "/home/teuthworker/src/github.com_ceph_ceph-c_5d1b5da21591c57cb0cbbbc8775b6ea0ced953a4/qa/tasks/ceph_manager.py", line 1602, in run_cluster_cmd
    return self.controller.run(**kwargs)
  File "/home/teuthworker/src/git.ceph.com_teuthology_8cdab074dcca9a68965bc5a50e9c30b691949723/teuthology/orchestra/remote.py", line 522, in run
    r = self._runner(client=self.ssh, name=self.shortname, **kwargs)
  File "/home/teuthworker/src/git.ceph.com_teuthology_8cdab074dcca9a68965bc5a50e9c30b691949723/teuthology/orchestra/run.py", line 455, in run
    r.wait()
  File "/home/teuthworker/src/git.ceph.com_teuthology_8cdab074dcca9a68965bc5a50e9c30b691949723/teuthology/orchestra/run.py", line 161, in wait
    self._raise_for_status()
  File "/home/teuthworker/src/git.ceph.com_teuthology_8cdab074dcca9a68965bc5a50e9c30b691949723/teuthology/orchestra/run.py", line 181, in _raise_for_status
    raise CommandFailedError(
teuthology.exceptions.CommandFailedError: Command failed on smithi072 with status 124: 'sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 120 ceph --cluster ceph osd dump --format=json'
2023-10-13T05:29:39.347 DEBUG:teuthology.run_tasks:Unwinding manager ceph
2023-10-13T05:29:39.356 INFO:tasks.ceph.ceph_manager.ceph:waiting for clean
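The status 124 above comes from the coreutils `timeout` wrapper in the failed command (`timeout 120 ceph --cluster ceph osd dump --format=json`): `timeout` exits with 124 when it kills the wrapped command for exceeding the limit, so the `osd dump` call hung rather than erroring out. A quick local check of that convention:

```shell
# coreutils `timeout` exits with status 124 when it has to kill the
# command for running past the limit -- the same status seen in the log.
timeout 1 sleep 5
echo "exit status: $?"   # prints "exit status: 124"
```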
Actions #1

Updated by Neha Ojha over 2 years ago

Looks like the test had already failed much earlier:

2023-10-13T04:28:32.682 INFO:tasks.ceph.ceph_manager.ceph:PG 46.3 is not active+clean
2023-10-13T04:28:32.683 INFO:tasks.ceph.ceph_manager.ceph:PG 22.0 is not active+clean
2023-10-13T04:28:32.683 INFO:tasks.ceph.ceph_manager.ceph:PG 22.12 is not active+clean
2023-10-13T04:28:32.684 INFO:tasks.ceph.ceph_manager.ceph:PG 21.1a is not active+clean
...
2023-10-13T04:28:32.685 INFO:tasks.thrashosds.thrasher:Traceback (most recent call last):
  File "/home/teuthworker/src/github.com_ceph_ceph-c_5d1b5da21591c57cb0cbbbc8775b6ea0ced953a4/qa/tasks/ceph_manager.py", line 190, in wrapper
    return func(self)
  File "/home/teuthworker/src/github.com_ceph_ceph-c_5d1b5da21591c57cb0cbbbc8775b6ea0ced953a4/qa/tasks/ceph_manager.py", line 1409, in _do_thrash
    self.ceph_manager.wait_for_recovery(
  File "/home/teuthworker/src/github.com_ceph_ceph-c_5d1b5da21591c57cb0cbbbc8775b6ea0ced953a4/qa/tasks/ceph_manager.py", line 2854, in wait_for_recovery
    assert now - start < timeout, \
AssertionError: wait_for_recovery: failed before timeout expired
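For reference, the failing check in `wait_for_recovery` is a poll-until-timeout loop; a minimal sketch of the pattern (the `is_recovered` predicate and the default timeout are illustrative, not the actual qa/tasks/ceph_manager.py code):

```python
import time

def wait_for_recovery(is_recovered, timeout=300, interval=1):
    """Poll until is_recovered() returns True, asserting before `timeout`.

    Illustrative sketch of the pattern in qa/tasks/ceph_manager.py: the
    AssertionError in the traceback above fires when the cluster is still
    not recovered after `timeout` seconds have elapsed.
    """
    start = time.time()
    while not is_recovered():
        now = time.time()
        assert now - start < timeout, \
            'wait_for_recovery: failed before timeout expired'
        time.sleep(interval)
```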
Actions #2

Updated by Laura Flores about 2 years ago

  • Subject changed from Command failed: ceph --cluster ceph osd dump --format=json to rados/thrash: AssertionError: wait_for_recovery: failed before timeout expired
Actions #3

Updated by Laura Flores about 2 years ago

/a/yuriw-2024-02-15_15:09:46-rados-wip-yuri5-testing-2024-02-14-1335-distro-default-smithi/7561084

Actions #4

Updated by Radoslaw Zarzynski about 2 years ago

  • Assignee set to Laura Flores
Actions #5

Updated by Aishwarya Mathuria about 2 years ago

/a/yuriw-2024-03-19_00:09:45-rados-wip-yuri5-testing-2024-03-18-1144-distro-default-smithi/7609848

Actions #6

Updated by Radoslaw Zarzynski almost 2 years ago

Bump up but not terribly high prio.

Actions #7

Updated by Laura Flores almost 2 years ago

/a/yuriw-2024-04-20_15:32:38-rados-wip-yuriw-testing-20240419.185239-main-distro-default-smithi/7664923

2024-04-20T18:32:31.959 INFO:tasks.ceph.ceph_manager.ceph:PG 7.1b is not active+clean
2024-04-20T18:32:31.959 INFO:tasks.ceph.ceph_manager.ceph:{'pgid': '7.1b', 'version': "0'0", 'reported_seq': 1304, 'reported_epoch': 1285, 'state': 'active+clean+remapped', 'last_fresh': '2024-04-20T18:32:25.425840+0000', 'last_change': '2024-04-20T18:29:34.627792+0000', 'last_active': '2024-04-20T18:32:25.425840+0000', 'last_peered': '2024-04-20T18:32:25.425840+0000', 'last_clean': '2024-04-20T18:32:25.425840+0000', 'last_became_active': '2024-04-20T18:12:06.247961+0000', 'last_became_peered': '2024-04-20T18:12:06.247961+0000', 'last_unstale': '2024-04-20T18:32:25.425840+0000', 'last_undegraded': '2024-04-20T18:32:25.425840+0000', 'last_fullsized': '2024-04-20T18:32:25.425840+0000', 'mapping_epoch': 649, 'log_start': "0'0", 'ondisk_log_start': "0'0", 'created': 62, 'last_epoch_clean': 650, 'parent': '0.0', 'parent_split_bits': 0, 'last_scrub': "0'0", 'last_scrub_stamp': '2024-04-20T18:29:34.627661+0000', 'last_deep_scrub': "0'0", 'last_deep_scrub_stamp': '2024-04-20T18:08:16.466604+0000', 'last_clean_scrub_stamp': '2024-04-20T18:29:34.627661+0000', 'objects_scrubbed': 0, 'log_size': 0, 'log_dups_size': 0, 'ondisk_log_size': 0, 'stats_invalid': False, 'dirty_stats_invalid': False, 'omap_stats_invalid': False, 'hitset_stats_invalid': False, 'hitset_bytes_stats_invalid': False, 'pin_stats_invalid': False, 'manifest_stats_invalid': False, 'snaptrimq_len': 0, 'last_scrub_duration': 1, 'scrub_schedule': 'periodic scrub scheduled @ 2024-04-20T18:32:27.754898+0000', 'scrub_duration': 3, 'objects_trimmed': 0, 'snaptrim_duration': 0, 'stat_sum': {'num_bytes': 0, 'num_objects': 0, 'num_object_clones': 0, 'num_object_copies': 0, 'num_objects_missing_on_primary': 0, 'num_objects_missing': 0, 'num_objects_degraded': 0, 'num_objects_misplaced': 0, 'num_objects_unfound': 0, 'num_objects_dirty': 0, 'num_whiteouts': 0, 'num_read': 0, 'num_read_kb': 0, 'num_write': 0, 'num_write_kb': 0, 'num_scrub_errors': 0, 'num_shallow_scrub_errors': 0, 'num_deep_scrub_errors': 0, 
'num_objects_recovered': 0, 'num_bytes_recovered': 0, 'num_keys_recovered': 0, 'num_objects_omap': 0, 'num_objects_hit_set_archive': 0, 'num_bytes_hit_set_archive': 0, 'num_flush': 0, 'num_flush_kb': 0, 'num_evict': 0, 'num_evict_kb': 0, 'num_promote': 0, 'num_flush_mode_high': 0, 'num_flush_mode_low': 0, 'num_evict_mode_some': 0, 'num_evict_mode_full': 0, 'num_objects_pinned': 0, 'num_legacy_snapsets': 0, 'num_large_omap_objects': 0, 'num_objects_manifest': 0, 'num_omap_bytes': 0, 'num_omap_keys': 0, 'num_objects_repaired': 0}, 'up': [8], 'acting': [8, 9], 'avail_no_missing': ['8', '9'], 'object_location_counts': [], 'blocked_by': [], 'up_primary': 8, 'acting_primary': 8, 'purged_snaps': []}
2024-04-20T18:32:31.960 INFO:tasks.thrashosds.thrasher:Traceback (most recent call last):
  File "/home/teuthworker/src/github.com_ceph_ceph-c_36c371567dddacf6207ea36f2535396ab31415fc/qa/tasks/ceph_manager.py", line 190, in wrapper
    return func(self)
  File "/home/teuthworker/src/github.com_ceph_ceph-c_36c371567dddacf6207ea36f2535396ab31415fc/qa/tasks/ceph_manager.py", line 1453, in _do_thrash
    self.ceph_manager.wait_for_recovery(
  File "/home/teuthworker/src/github.com_ceph_ceph-c_36c371567dddacf6207ea36f2535396ab31415fc/qa/tasks/ceph_manager.py", line 2918, in wait_for_recovery
    assert now - start < timeout, \
AssertionError: wait_for_recovery: failed before timeout expired

Actions #8

Updated by Aishwarya Mathuria almost 2 years ago · Edited

/a/yuriw-2024-04-30_14:17:59-rados-wip-yuri5-testing-2024-04-17-1400-distro-default-smithi/7680980/
/a/yuriw-2024-04-30_14:17:59-rados-wip-yuri5-testing-2024-04-17-1400-distro-default-smithi/7680983/

Actions #9

Updated by Radoslaw Zarzynski almost 2 years ago

  • Status changed from New to In Progress
Actions #10

Updated by Laura Flores almost 2 years ago

/a/yuriw-2024-05-02_23:59:28-rados-wip-yuriw11-testing-20240501.200505-squid-distro-default-smithi/7686989

Actions #11

Updated by Laura Flores almost 2 years ago

Note from bug scrub: Analysis in progress.

Actions #12

Updated by Sridhar Seshasayee almost 2 years ago

Seen on Squid:
/a/yuriw-2024-05-14_00:32:08-rados-wip-yuri4-testing-2024-04-29-0642-distro-default-smithi/7705408

Issue https://tracker.ceph.com/issues/64866 was also seen in the above test.

Actions #13

Updated by Laura Flores almost 2 years ago

In job /a/yuriw-2024-04-30_14:17:59-rados-wip-yuri5-testing-2024-04-17-1400-distro-default-smithi/7680983/, the last health state showed three "undersized+peered" pgs:

{
  "fsid": "d7b7ec4f-715e-4dce-9022-a54f64cff2a8",
  "health": {
    "status": "HEALTH_WARN",
    "checks": {
      "OSDMAP_FLAGS": {
        "severity": "HEALTH_WARN",
        "summary": {
          "message": "noscrub,nodeep-scrub flag(s) set",
          "count": 20
        },
        "muted": false
      },
      "PG_AVAILABILITY": {
        "severity": "HEALTH_WARN",
        "summary": {
          "message": "Reduced data availability: 3 pgs inactive",
          "count": 3
        },
        "muted": false
      },
      "PG_DEGRADED": {
        "severity": "HEALTH_WARN",
        "summary": {
          "message": "Degraded data redundancy: 1/141 objects degraded (0.709%), 1 pg degraded, 32 pgs undersized",
          "count": 33
        },
        "muted": false
      },
      "POOL_APP_NOT_ENABLED": {
        "severity": "HEALTH_WARN",
        "summary": {
          "message": "3 pool(s) do not have an application enabled",
          "count": 3
        },
        "muted": false
      }
    },
    "mutes": []
  },
  "election_epoch": 8,
  "quorum": [
    0,
    1,
    2
  ],
  "quorum_names": [
    "a",
    "b",
    "c" 
  ],
  "quorum_age": 2921,
  "monmap": {
    "epoch": 1,
    "min_mon_release_name": "squid",
    "num_mons": 3
  },
  "osdmap": {
    "epoch": 2122,
    "num_osds": 16,
    "num_up_osds": 16,
    "osd_up_since": 1714491421,
    "num_in_osds": 2,
    "osd_in_since": 1714491680,
    "num_remapped_pgs": 49
  },
  "pgmap": {
    "pgs_by_state": [
      {
        "state_name": "active+clean+remapped",
        "count": 49
      },
      {
        "state_name": "active+undersized",
        "count": 28
      },
      {
        "state_name": "undersized+peered",
        "count": 3
      },
      {
        "state_name": "active+undersized+degraded",
        "count": 1
      }
    ],
    "num_pgs": 81,
    "num_pools": 5,
    "num_objects": 47,
    "data_bytes": 4590667,
    "bytes_used": 174608384,
    "bytes_avail": 193098919936,
    "bytes_total": 193273528320,
    "inactive_pgs_ratio": 0.03703703731298447,
    "degraded_objects": 1,
    "degraded_total": 141,
    "degraded_ratio": 0.0070921985815602835,
    "misplaced_objects": 46,
    "misplaced_total": 141,
    "misplaced_ratio": 0.3262411347517731
  },
  "fsmap": {
    "epoch": 1,
    "by_rank": [],
    "up:standby": 0
  },
  "mgrmap": {
    "available": true,
    "num_standbys": 1,
    "modules": [
      "iostat",
      "nfs",
      "restful" 
    ],
    "services": {}
  },
  "servicemap": {
    "epoch": 30,
    "modified": "2024-04-30T16:00:49.220070+0000",
    "services": {}
  },
  "progress_events": {}
}

I chose to focus on pg 9.9, one of the "undersized+peered" pgs:

2024-04-30T15:48:39.312 INFO:tasks.ceph.ceph_manager.ceph:PG 9.9 is not active+clean
2024-04-30T15:48:39.312 INFO:tasks.ceph.ceph_manager.ceph:{'pgid': '9.9', 'version': "0'0", 'reported_seq': 1223, 'reported_epoch': 1152, 'state': 'undersized+peered', 'last_fresh': '2024-04-30T15:48:32.572917+0000', 'last_change': '2024-04-30T15:28:33.234925+0000', 'last_active': '2024-04-30T15:28:32.224347+0000', 'last_peered': '2024-04-30T15:48:32.572917+0000', 'last_clean': '2024-04-30T15:28:32.224347+0000', 'last_became_active': '0.000000', 'last_became_peered': '2024-04-30T15:28:33.234925+0000', 'last_unstale': '2024-04-30T15:48:32.572917+0000', 'last_undegraded': '2024-04-30T15:48:32.572917+0000', 'last_fullsized': '2024-04-30T15:28:33.233883+0000', 'mapping_epoch': 536, 'log_start': "0'0", 'ondisk_log_start': "0'0", 'created': 536, 'last_epoch_clean': 0, 'parent': '0.0', 'parent_split_bits': 0, 'last_scrub': "0'0", 'last_scrub_stamp': '2024-04-30T15:28:32.224347+0000', 'last_deep_scrub': "0'0", 'last_deep_scrub_stamp': '2024-04-30T15:28:32.224347+0000', 'last_clean_scrub_stamp': '2024-04-30T15:28:32.224347+0000', 'objects_scrubbed': 0, 'log_size': 0, 'log_dups_size': 0, 'ondisk_log_size': 0, 'stats_invalid': False, 'dirty_stats_invalid': False, 'omap_stats_invalid': False, 'hitset_stats_invalid': False, 'hitset_bytes_stats_invalid': False, 'pin_stats_invalid': False, 'manifest_stats_invalid': False, 'snaptrimq_len': 0, 'last_scrub_duration': 0, 'scrub_schedule': 'periodic scrub scheduled @ 2024-04-30T15:48:37.486139+0000', 'scrub_duration': 0, 'objects_trimmed': 0, 'snaptrim_duration': 0, 'stat_sum': {'num_bytes': 0, 'num_objects': 0, 'num_object_clones': 0, 'num_object_copies': 0, 'num_objects_missing_on_primary': 0, 'num_objects_missing': 0, 'num_objects_degraded': 0, 'num_objects_misplaced': 0, 'num_objects_unfound': 0, 'num_objects_dirty': 0, 'num_whiteouts': 0, 'num_read': 0, 'num_read_kb': 0, 'num_write': 0, 'num_write_kb': 0, 'num_scrub_errors': 0, 'num_shallow_scrub_errors': 0, 'num_deep_scrub_errors': 0, 'num_objects_recovered': 0, 
'num_bytes_recovered': 0, 'num_keys_recovered': 0, 'num_objects_omap': 0, 'num_objects_hit_set_archive': 0, 'num_bytes_hit_set_archive': 0, 'num_flush': 0, 'num_flush_kb': 0, 'num_evict': 0, 'num_evict_kb': 0, 'num_promote': 0, 'num_flush_mode_high': 0, 'num_flush_mode_low': 0, 'num_evict_mode_some': 0, 'num_evict_mode_full': 0, 'num_objects_pinned': 0, 'num_legacy_snapsets': 0, 'num_large_omap_objects': 0, 'num_objects_manifest': 0, 'num_omap_bytes': 0, 'num_omap_keys': 0, 'num_objects_repaired': 0}, 'up': [8], 'acting': [8], 'avail_no_missing': ['8'], 'object_location_counts': [], 'blocked_by': [], 'up_primary': 8, 'acting_primary': 8, 'purged_snaps': []}

In the log on osd.8 (the primary of pg 9.9), I found a message saying recovery raced:

2024-04-30T15:48:38.558+0000 7f564dfff640 10 osd.8 pg_epoch: 1156 pg[9.9( empty local-lis/les=0/0 n=0 ec=536/536 lis/c=0/0 les/c/f=0/0/0 sis=536) [8] r=0 lpr=536 crt=0'0 mlcod 0'0 undersized+peered mbc={}] recovery raced and were queued twice, ignoring!

I found the same issue in job /a/yuriw-2024-04-30_14:17:59-rados-wip-yuri5-testing-2024-04-17-1400-distro-default-smithi/7680983. Health reported two "undersized+peered" pgs:

{
  "fsid": "1956bbde-6bdd-462a-8670-32191708a966",
  "health": {
    "status": "HEALTH_WARN",
    "checks": {
      "OSDMAP_FLAGS": {
        "severity": "HEALTH_WARN",
        "summary": {
          "message": "noscrub,nodeep-scrub flag(s) set",
          "count": 20
        },
        "muted": false
      },
      "PG_AVAILABILITY": {
        "severity": "HEALTH_WARN",
        "summary": {
          "message": "Reduced data availability: 2 pgs inactive",
          "count": 2
        },
        "muted": false
      },
      "PG_DEGRADED": {
        "severity": "HEALTH_WARN",
        "summary": {
          "message": "Degraded data redundancy: 2 pgs undersized",
          "count": 2
        },
        "muted": false
      }
    },
    "mutes": []
  },
  "election_epoch": 6,
  "quorum": [
    0,
    1,
    2
  ],
  "quorum_names": [
    "a",
    "b",
    "c" 
  ],
  "quorum_age": 1891,
  "monmap": {
    "epoch": 1,
    "min_mon_release_name": "squid",
    "num_mons": 3
  },
  "osdmap": {
    "epoch": 1156,
    "num_osds": 16,
    "num_up_osds": 16,
    "osd_up_since": 1714490624,
    "num_in_osds": 2,
    "osd_in_since": 1714490877,
    "num_remapped_pgs": 0
  },
  "pgmap": {
    "pgs_by_state": [
      {
        "state_name": "active+clean",
        "count": 23
      },
      {
        "state_name": "undersized+peered",
        "count": 2
      }
    ],
    "num_pgs": 25,
    "num_pools": 3,
    "num_objects": 14,
    "data_bytes": 1376800,
    "bytes_used": 982900736,
    "bytes_avail": 192290627584,
    "bytes_total": 193273528320,
    "inactive_pgs_ratio": 0.07999999821186066
  },
  "fsmap": {
    "epoch": 1,
    "by_rank": [],
    "up:standby": 0
  },
  "mgrmap": {
    "available": true,
    "num_standbys": 1,
    "modules": [
      "iostat",
      "nfs",
      "restful" 
    ],
    "services": {}
  },
  "servicemap": {
    "epoch": 38,
    "modified": "2024-04-30T15:48:35.863706+0000",
    "services": {
      "osd": {
        "daemons": {
          "3": {
            "start_epoch": 0,
            "start_stamp": "0.000000",
            "gid": 0,
            "addr": "(unrecognized address family 0)/0",
            "metadata": {},
            "task_status": {}
          },
          "15": {
            "start_epoch": 0,
            "start_stamp": "0.000000",
            "gid": 0,
            "addr": "(unrecognized address family 0)/0",
            "metadata": {},
            "task_status": {}
          },
          "summary": "" 
        }
      }
    }
  },
  "progress_events": {}
}

I chose to focus on pg 338.19:

2024-04-30T16:03:18.082 INFO:tasks.ceph.ceph_manager.ceph:PG 338.19 is not active+clean
2024-04-30T16:03:18.083 INFO:tasks.ceph.ceph_manager.ceph:{'pgid': '338.19', 'version': "0'0", 'reported_seq': 1370, 'reported_epoch': 2118, 'state': 'undersized+peered', 'last_fresh': '2024-04-30T16:03:12.216629+0000', 'last_change': '2024-04-30T15:41:32.624219+0000', 'last_active': '2024-04-30T15:41:31.404002+0000', 'last_peered': '2024-04-30T16:03:12.216629+0000', 'last_clean': '2024-04-30T15:41:31.404002+0000', 'last_became_active': '0.000000', 'last_became_peered': '2024-04-30T15:41:32.624219+0000', 'last_unstale': '2024-04-30T16:03:12.216629+0000', 'last_undegraded': '2024-04-30T16:03:12.216629+0000', 'last_fullsized': '2024-04-30T15:41:32.523134+0000', 'mapping_epoch': 1420, 'log_start': "0'0", 'ondisk_log_start': "0'0", 'created': 1420, 'last_epoch_clean': 0, 'parent': '0.0', 'parent_split_bits': 0, 'last_scrub': "0'0", 'last_scrub_stamp': '2024-04-30T15:41:31.404002+0000', 'last_deep_scrub': "0'0", 'last_deep_scrub_stamp': '2024-04-30T15:41:31.404002+0000', 'last_clean_scrub_stamp': '2024-04-30T15:41:31.404002+0000', 'objects_scrubbed': 0, 'log_size': 0, 'log_dups_size': 0, 'ondisk_log_size': 0, 'stats_invalid': False, 'dirty_stats_invalid': False, 'omap_stats_invalid': False, 'hitset_stats_invalid': False, 'hitset_bytes_stats_invalid': False, 'pin_stats_invalid': False, 'manifest_stats_invalid': False, 'snaptrimq_len': 0, 'last_scrub_duration': 0, 'scrub_schedule': 'periodic scrub scheduled @ 2024-04-30T16:03:14.055752+0000', 'scrub_duration': 0, 'objects_trimmed': 0, 'snaptrim_duration': 0, 'stat_sum': {'num_bytes': 0, 'num_objects': 0, 'num_object_clones': 0, 'num_object_copies': 0, 'num_objects_missing_on_primary': 0, 'num_objects_missing': 0, 'num_objects_degraded': 0, 'num_objects_misplaced': 0, 'num_objects_unfound': 0, 'num_objects_dirty': 0, 'num_whiteouts': 0, 'num_read': 0, 'num_read_kb': 0, 'num_write': 0, 'num_write_kb': 0, 'num_scrub_errors': 0, 'num_shallow_scrub_errors': 0, 'num_deep_scrub_errors': 0, 'num_objects_recovered': 0, 
'num_bytes_recovered': 0, 'num_keys_recovered': 0, 'num_objects_omap': 0, 'num_objects_hit_set_archive': 0, 'num_bytes_hit_set_archive': 0, 'num_flush': 0, 'num_flush_kb': 0, 'num_evict': 0, 'num_evict_kb': 0, 'num_promote': 0, 'num_flush_mode_high': 0, 'num_flush_mode_low': 0, 'num_evict_mode_some': 0, 'num_evict_mode_full': 0, 'num_objects_pinned': 0, 'num_legacy_snapsets': 0, 'num_large_omap_objects': 0, 'num_objects_manifest': 0, 'num_omap_bytes': 0, 'num_omap_keys': 0, 'num_objects_repaired': 0}, 'up': [5], 'acting': [5], 'avail_no_missing': ['5'], 'object_location_counts': [], 'blocked_by': [], 'up_primary': 5, 'acting_primary': 5, 'purged_snaps': []}

I found the same "recovery raced" message in the primary osd log:

2024-04-30T15:41:47.137+0000 7f826170c640 10 osd.5 pg_epoch: 1433 pg[338.19( empty local-lis/les=0/0 n=0 ec=1420/1420 lis/c=0/0 les/c/f=0/0/0 sis=1420) [5] r=0 lpr=1420 crt=0'0 mlcod 0'0 undersized+peered mbc={}] recovery raced and were queued twice, ignoring!

I'm still investigating what this means and whether this is a true bug.

Actions #14

Updated by Laura Flores almost 2 years ago

/a/lflores-2024-06-18_23:03:53-rados-squid-release-distro-default-smithi/7762797

Actions #15

Updated by Laura Flores over 1 year ago

/a/yuriw-2024-07-17_13:32:02-rados-wip-yuri12-testing-2024-07-16-1122-distro-default-smithi/7805496

Actions #17

Updated by Radoslaw Zarzynski over 1 year ago

From the log entry quoted in comment 13 we can see the reason why the PG went `undersized`: the acting set contains only a single OSD:

'up': [5], 'acting': [5], 'avail_no_missing': ['5'], 'object_location_counts': [], 'blocked_by': [], 'up_primary': 5, 'acting_primary': 5

As the problem was pretty isolated (2 undersized, many more active), this might be an instance of CRUSH running out of available retries (see: https://bugzilla.redhat.com/show_bug.cgi?id=2310830). If we had the OSDMap, we would have been able to verify with `crushtool --show-choose-tries`.
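If a future run captures the OSDMap, the check described above would look roughly like this (paths are illustrative, and this assumes access to a live cluster or an archived binary osdmap):

```shell
# Grab the binary OSDMap from a live cluster
# (or use one archived by teuthology).
ceph osd getmap -o /tmp/osdmap

# Extract the CRUSH map from the OSDMap.
osdmaptool /tmp/osdmap --export-crush /tmp/crushmap

# Show the choose_tries histogram for test mappings; placements that
# exhaust the retry budget cluster at the maximum tries value.
crushtool -i /tmp/crushmap --test --show-choose-tries
```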

Actions #18

Updated by Laura Flores over 1 year ago

I might need to modify the test to dump the JSON of the OSDMap.

Actions #19

Updated by Radoslaw Zarzynski over 1 year ago

scrub note: bump up.

Actions #20

Updated by Radoslaw Zarzynski over 1 year ago

scrub note: bump up.
