Project

General

Profile

Actions

Bug #56149

open

thrash-erasure-code: AssertionError: wait_for_recovery timeout due to "active+recovering+undersized" pg

Added by Laura Flores almost 4 years ago. Updated over 1 year ago.

Status:
New
Priority:
Low
Assignee:
Category:
-
Target version:
-
% Done:

0%

Source:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Tags (freeform):
Merge Commit:
Fixed In:
Released In:
Upkeep Timestamp:

Description

Description:
rados/thrash-erasure-code/{ceph clusters/{fixed-2 openstack} fast/normal mon_election/connectivity msgr-failures/few objectstore/filestore-xfs rados recovery-overrides/{more-async-recovery} supported-random-distro$/{centos_8} thrashers/fastread thrashosds-health workloads/ec-radosbench}

/a/yuriw-2022-06-17_13:58:31-rados-wip-yuri7-testing-2022-06-16-1051-pacific-distro-default-smithi/6884394

2022-06-17T15:36:17.287 INFO:tasks.ceph.ceph_manager.ceph:{'pgid': '5.13', 'version': "880'5475", 'reported_seq': 1214, 'reported_epoch': 1855, 'state': 'active+recovery_wait+undersized+degraded+remapped', 'last_fresh': '2022-06-17T15:36:11.235599+0000', 'last_change': '2022-06-17T14:53:18.577038+0000', 'last_active': '2022-06-17T15:36:11.235599+0000', 'last_peered': '2022-06-17T15:36:11.235599+0000', 'last_clean': '2022-06-17T14:49:28.443280+0000', 'last_became_active': '2022-06-17T14:53:17.177381+0000', 'last_became_peered': '2022-06-17T14:53:17.177381+0000', 'last_unstale': '2022-06-17T15:36:11.235599+0000', 'last_undegraded': '2022-06-17T14:53:17.116488+0000', 'last_fullsized': '2022-06-17T14:53:16.633990+0000', 'mapping_epoch': 832, 'log_start': "0'0", 'ondisk_log_start': "0'0", 'created': 811, 'last_epoch_clean': 705, 'parent': '0.0', 'parent_split_bits': 5, 'last_scrub': "0'0", 'last_scrub_stamp': '2022-06-17T14:46:48.316282+0000', 'last_deep_scrub': "0'0", 'last_deep_scrub_stamp': '2022-06-17T14:46:48.316282+0000', 'last_clean_scrub_stamp': '2022-06-17T14:46:48.316282+0000', 'log_size': 5475, 'ondisk_log_size': 5475, 'stats_invalid': True, 'dirty_stats_invalid': False, 'omap_stats_invalid': False, 'hitset_stats_invalid': False, 'hitset_bytes_stats_invalid': False, 'pin_stats_invalid': False, 'manifest_stats_invalid': False, 'snaptrimq_len': 0, 'stat_sum': {'num_bytes': 111214592, 'num_objects': 1697, 'num_object_clones': 0, 'num_object_copies': 5091, 'num_objects_missing_on_primary': 2665, 'num_objects_missing': 2665, 'num_objects_degraded': 7995, 'num_objects_misplaced': 0, 'num_objects_unfound': 0, 'num_objects_dirty': 1697, 'num_whiteouts': 0, 'num_read': 0, 'num_read_kb': 0, 'num_write': 1849, 'num_write_kb': 113451, 'num_scrub_errors': 0, 'num_shallow_scrub_errors': 0, 'num_deep_scrub_errors': 0, 'num_objects_recovered': 1286, 'num_bytes_recovered': 84257451, 'num_keys_recovered': 0, 'num_objects_omap': 0, 'num_objects_hit_set_archive': 0, 'num_bytes_hit_set_archive': 0, 'num_flush': 0, 'num_flush_kb': 0, 'num_evict': 0, 'num_evict_kb': 0, 'num_promote': 0, 'num_flush_mode_high': 0, 'num_flush_mode_low': 0, 'num_evict_mode_some': 0, 'num_evict_mode_full': 0, 'num_objects_pinned': 0, 'num_legacy_snapsets': 0, 'num_large_omap_objects': 0, 'num_objects_manifest': 0, 'num_omap_bytes': 0, 'num_omap_keys': 0, 'num_objects_repaired': 0}, 'up': [3, 0, 1], 'acting': [2147483647, 0, 1], 'avail_no_missing': ['0(0)', '3(1)', '6(2)'], 'object_location_counts': [{'shards': '0(0),3(1),5(2),6(2)', 'objects': 1748}, {'shards': '0(0),3(1),6(2)', 'objects': 868}, {'shards': '0(1),1(2)', 'objects': -919}], 'blocked_by': [], 'up_primary': 0, 'acting_primary': 0, 'purged_snaps': []}
2022-06-17T15:36:17.287 INFO:tasks.thrashosds.thrasher:Traceback (most recent call last):
  File "/home/teuthworker/src/github.com_ceph_ceph-c_583dad03a4c486407143b2fa31042148953bda62/qa/tasks/ceph_manager.py", line 188, in wrapper
    return func(self)
  File "/home/teuthworker/src/github.com_ceph_ceph-c_583dad03a4c486407143b2fa31042148953bda62/qa/tasks/ceph_manager.py", line 1404, in _do_thrash
    timeout=self.config.get('timeout')
  File "/home/teuthworker/src/github.com_ceph_ceph-c_583dad03a4c486407143b2fa31042148953bda62/qa/tasks/ceph_manager.py", line 2809, in wait_for_recovery
    'wait_for_recovery: failed before timeout expired'
AssertionError: wait_for_recovery: failed before timeout expired

Last pg map before Traceback:

{
  "pgs_by_state": [
    {
      "state_name": "active+clean",
      "count": 47
    },
    {
      "state_name": "active+recovery_wait+undersized+degraded+remapped",
      "count": 4
    },
    {
      "state_name": "active+recovering+undersized+degraded+remapped",
      "count": 2
    },
    {
      "state_name": "active+recovering+undersized",
      "count": 1
    },
    {
      "state_name": "active+recovery_wait+undersized+degraded",
      "count": 1
    }
  ],
  "num_pgs": 55,
  "num_pools": 3,
  "num_objects": 80600,
  "data_bytes": 5282005016,
  "bytes_used": 15347728384,
  "bytes_avail": 656289427456,
  "bytes_total": 671637155840,
  "degraded_objects": 21240,
  "degraded_total": 241798,
  "degraded_ratio": 0.08784191763372733,
  "misplaced_objects": 9003,
  "misplaced_total": 241798,
  "misplaced_ratio": 0.037233558590228207
}


Related issues 3 (2 open1 closed)

Related to RADOS - Bug #51076: "wait_for_recovery: failed before timeout expired" during thrashosd test with EC backend.ResolvedLaura Flores

Actions
Copied to RADOS - Bug #57599: thrash-erasure-code: AssertionError: wait_for_recovery timeout due to "recovering+forced_recovery+undersized+remapped+peered" pgNewLaura Flores

Actions
Copied to RADOS - Bug #57600: thrash-erasure-code: AssertionError: wait_for_recovery timeout due to "active+recovery_unfound+degraded+remapped" pgNewLaura Flores

Actions
Actions #1

Updated by Laura Flores almost 4 years ago

  • Related to Bug #51076: "wait_for_recovery: failed before timeout expired" during thrashosd test with EC backend. added
Actions #2

Updated by Laura Flores almost 4 years ago

  • Subject changed from thrash-erasure-code to thrash-erasure-code: AssertionError: wait_for_recovery: failed before timeout expired
Actions #3

Updated by Laura Flores over 3 years ago

  • Priority changed from Normal to Low

According to some tests I ran on main and pacific, this kind of failure does not happen very frequently:

Main:
http://pulpito.front.sepia.ceph.com/lflores-2022-06-21_17:43:27-rados:thrash-erasure-code-main-distro-default-smithi/ - 100/100 tests passed

Pacific:
http://pulpito.front.sepia.ceph.com/lflores-2022-06-21_17:44:35-rados:thrash-erasure-code-pacific-distro-default-smithi/ - 100/100 tests passed

This failure is definitely annoying and something to look into, but the low frequency of it means that we can lower the priority a bit.

Actions #4

Updated by Laura Flores over 3 years ago

  • Tags set to test-failure
Actions #5

Updated by Matan Breizman over 3 years ago

/a/yuriw-2022-09-14_13:16:11-rados-wip-yuri6-testing-2022-09-13-1352-distro-default-smithi/7032356

Actions #6

Updated by Laura Flores over 3 years ago

Matan Breizman wrote:

/a/yuriw-2022-09-14_13:16:11-rados-wip-yuri6-testing-2022-09-13-1352-distro-default-smithi/7032356

This one is https://tracker.ceph.com/issues/57546:

active+clean+remapped+laggy

Actions #7

Updated by Laura Flores over 3 years ago

  • Subject changed from thrash-erasure-code: AssertionError: wait_for_recovery: failed before timeout expired to thrash-erasure-code: AssertionError: wait_for_recovery timeout due to "active+recovering+undersized" pg
Actions #8

Updated by Laura Flores over 3 years ago

  • Copied to Bug #57599: thrash-erasure-code: AssertionError: wait_for_recovery timeout due to "recovering+forced_recovery+undersized+remapped+peered" pg added
Actions #9

Updated by Laura Flores over 3 years ago

  • Copied to Bug #57600: thrash-erasure-code: AssertionError: wait_for_recovery timeout due to "active+recovery_unfound+degraded+remapped" pg added
Actions #10

Updated by Laura Flores over 1 year ago

/a/yuriw-2024-08-09_02:35:19-rados-squid-release-distro-default-smithi/7845280

Actions

Also available in: Atom PDF