Bug #46180: qa: Scrubbing terminated -- not all pgs were active and clean. - RADOS - Ceph

Actions

Copy link

Bug #46180

closed

qa: Scrubbing terminated -- not all pgs were active and clean.

Added by Casey Bodley over 5 years ago. Updated 8 months ago.

Status:

Resolved

Priority:

Immediate

Assignee:

Ilya Dryomov

Category:

Correctness/Safety

Target version:

Ceph - v16.0.0

% Done:

Source:

Q/A

Backport:

Regression:

Severity:

3 - minor

Reviewed:

Affected Versions:

ceph-qa-suite:

Component(RADOS):

Messenger

Pull request ID:

35816

Tags (freeform):

Merge Commit:

0195fd53f1c6b720c29ee978dc31fb1303494072

Fixed In:

v16.0.0-3118-g0195fd53f1

Released In:

v16.2.0~2105

Upkeep Timestamp:

2025-07-14T19:39:05+00:00

Description

Seeing several test failures in the rgw suite:

2020-06-24T06:55:00.513 INFO:tasks.ceph:Waiting for all PGs to be active+clean and split+merged, waiting on ['2.0'] to go clean and/or [] to split/merge
2020-06-24T06:55:20.514 ERROR:teuthology.contextutil:Saw exception from nested tasks
Traceback (most recent call last):
  File "/home/teuthworker/src/git.ceph.com_ceph-c_wip-yuval-test-35331-35155/qa/tasks/ceph.py", line 1830, in task
    healthy(ctx=ctx, config=dict(cluster=config['cluster']))
  File "/home/teuthworker/src/git.ceph.com_ceph-c_wip-yuval-test-35331-35155/qa/tasks/ceph.py", line 1419, in healthy
    manager.wait_for_clean()
  File "/home/teuthworker/src/git.ceph.com_ceph-c_wip-yuval-test-35331-35155/qa/tasks/ceph_manager.py", line 2518, in wait_for_clean
    'wait_for_clean: failed before timeout expired'
AssertionError: wait_for_clean: failed before timeout expired

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/contextutil.py", line 33, in nested
    yield vars
  File "/home/teuthworker/src/git.ceph.com_ceph-c_wip-yuval-test-35331-35155/qa/tasks/ceph.py", line 1839, in task
    osd_scrub_pgs(ctx, config)
  File "/home/teuthworker/src/git.ceph.com_ceph-c_wip-yuval-test-35331-35155/qa/tasks/ceph.py", line 1209, in osd_scrub_pgs
    raise RuntimeError("Scrubbing terminated -- not all pgs were active and clean.")
RuntimeError: Scrubbing terminated -- not all pgs were active and clean.

http://pulpito.ceph.com/yuvalif-2020-06-23_14:40:15-rgw-wip-yuval-test-35331-35155-distro-basic-smithi/

Related issues 4 (0 open — 4 closed)

Actions

Copy link

Updated by Sridhar Seshasayee over 5 years ago

Saw the same error during this run:
http://pulpito.ceph.com/sseshasa-2020-06-24_17:46:09-rados-wip-sseshasa-testing-2020-06-24-1858-distro-basic-smithi/

Job ID: 5176265

Failure Reason:
Scrubbing terminated -- not all pgs were active and clean.

Actions

Copy link

Updated by Patrick Donnelly over 5 years ago

Has duplicate Bug #46211: qa: pools stuck in creating added

Actions

Copy link

Updated by Patrick Donnelly over 5 years ago

Priority changed from Normal to Immediate
Target version set to v16.0.0

Actions

Copy link

Updated by Neha Ojha over 5 years ago

The common thing in all of these is that the tests are all failing while running the ceph task, no thrashing or anything going on at all at this time.

From /a/sseshasa-2020-06-24_17:46:09-rados-wip-sseshasa-testing-2020-06-24-1858-distro-basic-smithi/5176265

This is where we start seeing PGs getting stuck:

2020-06-24T21:05:50.958+0000 7fa42b7ce700 10 mgr.server operator() 8 pgs: 2 creating+peering, 6 active+clean; 0 B data, 320 KiB used, 297 GiB / 300 GiB avail

Just before that there seem to be issues in the msgr which is leading to reap_dead start

2020-06-24T21:05:50.181+0000 7fa442f31700  1 --2- [v2:172.21.15.205:6824/19930,v1:172.21.15.205:6825/19930] >> 172.21.15.205:0/627840989 conn(0x5647f6eb8000 0x5647f6ed3900 secure :-1 s=READY pgs=3 cs=0 l=1 rev1=1 rx=0x5647f6eb1740 tx=0x5647f6dd2120).ready entity=client.4179 client_cookie=0 server_cookie=0 in_seq=0 out_seq=0
2020-06-24T21:05:50.199+0000 7fa4227c8700  1 -- [v2:172.21.15.205:6824/19930,v1:172.21.15.205:6825/19930] <== osd.0 v2:172.21.15.205:6806/20452 6 ==== pg_stats(1 pgs tid 0 v 0) v2 ==== 1199+0+0 (secure 0 0 0) 0x5647f6de0900 con 0x5647f6d47400
2020-06-24T21:05:50.200+0000 7fa442f31700  1 -- [v2:172.21.15.205:6824/19930,v1:172.21.15.205:6825/19930] >> 172.21.15.205:0/627840989 conn(0x5647f6eb8000 msgr2=0x5647f6ed3900 secure :-1 s=STATE_CONNECTION_ESTABLISHED l=1).read_bulk peer close file descriptor 42
2020-06-24T21:05:50.200+0000 7fa442f31700  1 -- [v2:172.21.15.205:6824/19930,v1:172.21.15.205:6825/19930] >> 172.21.15.205:0/627840989 conn(0x5647f6eb8000 msgr2=0x5647f6ed3900 secure :-1 s=STATE_CONNECTION_ESTABLISHED l=1).read_until read failed
2020-06-24T21:05:50.200+0000 7fa442f31700  1 --2- [v2:172.21.15.205:6824/19930,v1:172.21.15.205:6825/19930] >> 172.21.15.205:0/627840989 conn(0x5647f6eb8000 0x5647f6ed3900 secure :-1 s=READY pgs=3 cs=0 l=1 rev1=1 rx=0x5647f6eb1740 tx=0x5647f6dd2120).handle_read_frame_preamble_main read frame preamble failed r=-1 ((1) Operation not permitted)
2020-06-24T21:05:50.200+0000 7fa442f31700  1 --2- [v2:172.21.15.205:6824/19930,v1:172.21.15.205:6825/19930] >> 172.21.15.205:0/627840989 conn(0x5647f6eb8000 0x5647f6ed3900 secure :-1 s=READY pgs=3 cs=0 l=1 rev1=1 rx=0x5647f6eb1740 tx=0x5647f6dd2120).stop
2020-06-24T21:05:50.200+0000 7fa442f31700  1 -- [v2:172.21.15.205:6824/19930,v1:172.21.15.205:6825/19930] reap_dead start

Actions

Copy link

Updated by Sebastian Wagner over 5 years ago

Related to Bug #46178: slow request osd_op(... (undecoded) ondisk+retry+read+ignore_overlay+known_if_redirected e49) added

Actions

Copy link

Updated by Ilya Dryomov over 5 years ago

Category set to Correctness/Safety
Assignee set to Ilya Dryomov
Component(RADOS) Messenger added

This is a msgr2.1 issue.

Actions

Copy link

Updated by Ilya Dryomov over 5 years ago

I think it has to do with reconnect handling and how connections are reused.

This part of ProtocolV2 is pretty fragile, evidenced by steadily accumulating workarounds for invalid memory use issues during msgr2.0 development and after (the last just in March). Most likely what happens is FrameAssembler is_rev1 state gets lost and a 2.0 frame ends up being assembled while the peer is expecting a 2.1 frame. I'll confirm and put out a fix ASAP.

Actions

Copy link