Bug #46180
closedqa: Scrubbing terminated -- not all pgs were active and clean.
0%
Description
Seeing several test failures in the rgw suite:
2020-06-24T06:55:00.513 INFO:tasks.ceph:Waiting for all PGs to be active+clean and split+merged, waiting on ['2.0'] to go clean and/or [] to split/merge
2020-06-24T06:55:20.514 ERROR:teuthology.contextutil:Saw exception from nested tasks
Traceback (most recent call last):
File "/home/teuthworker/src/git.ceph.com_ceph-c_wip-yuval-test-35331-35155/qa/tasks/ceph.py", line 1830, in task
healthy(ctx=ctx, config=dict(cluster=config['cluster']))
File "/home/teuthworker/src/git.ceph.com_ceph-c_wip-yuval-test-35331-35155/qa/tasks/ceph.py", line 1419, in healthy
manager.wait_for_clean()
File "/home/teuthworker/src/git.ceph.com_ceph-c_wip-yuval-test-35331-35155/qa/tasks/ceph_manager.py", line 2518, in wait_for_clean
'wait_for_clean: failed before timeout expired'
AssertionError: wait_for_clean: failed before timeout expired
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/contextutil.py", line 33, in nested
yield vars
File "/home/teuthworker/src/git.ceph.com_ceph-c_wip-yuval-test-35331-35155/qa/tasks/ceph.py", line 1839, in task
osd_scrub_pgs(ctx, config)
File "/home/teuthworker/src/git.ceph.com_ceph-c_wip-yuval-test-35331-35155/qa/tasks/ceph.py", line 1209, in osd_scrub_pgs
raise RuntimeError("Scrubbing terminated -- not all pgs were active and clean.")
RuntimeError: Scrubbing terminated -- not all pgs were active and clean.
Updated by Sridhar Seshasayee over 5 years ago
Saw the same error during this run:
http://pulpito.ceph.com/sseshasa-2020-06-24_17:46:09-rados-wip-sseshasa-testing-2020-06-24-1858-distro-basic-smithi/
Job ID: 5176265
Failure Reason:
Scrubbing terminated -- not all pgs were active and clean.
Updated by Patrick Donnelly over 5 years ago
- Has duplicate Bug #46211: qa: pools stuck in creating added
Updated by Patrick Donnelly over 5 years ago
- Priority changed from Normal to Immediate
- Target version set to v16.0.0
Updated by Neha Ojha over 5 years ago
The common thing in all of these is that the tests are all failing while running the ceph task, no thrashing or anything going on at all at this time.
From /a/sseshasa-2020-06-24_17:46:09-rados-wip-sseshasa-testing-2020-06-24-1858-distro-basic-smithi/5176265
This is where we start seeing PGs getting stuck:
2020-06-24T21:05:50.958+0000 7fa42b7ce700 10 mgr.server operator() 8 pgs: 2 creating+peering, 6 active+clean; 0 B data, 320 KiB used, 297 GiB / 300 GiB avail
Just before that there seem to be issues in the msgr which is leading to reap_dead start
2020-06-24T21:05:50.181+0000 7fa442f31700 1 --2- [v2:172.21.15.205:6824/19930,v1:172.21.15.205:6825/19930] >> 172.21.15.205:0/627840989 conn(0x5647f6eb8000 0x5647f6ed3900 secure :-1 s=READY pgs=3 cs=0 l=1 rev1=1 rx=0x5647f6eb1740 tx=0x5647f6dd2120).ready entity=client.4179 client_cookie=0 server_cookie=0 in_seq=0 out_seq=0 2020-06-24T21:05:50.199+0000 7fa4227c8700 1 -- [v2:172.21.15.205:6824/19930,v1:172.21.15.205:6825/19930] <== osd.0 v2:172.21.15.205:6806/20452 6 ==== pg_stats(1 pgs tid 0 v 0) v2 ==== 1199+0+0 (secure 0 0 0) 0x5647f6de0900 con 0x5647f6d47400 2020-06-24T21:05:50.200+0000 7fa442f31700 1 -- [v2:172.21.15.205:6824/19930,v1:172.21.15.205:6825/19930] >> 172.21.15.205:0/627840989 conn(0x5647f6eb8000 msgr2=0x5647f6ed3900 secure :-1 s=STATE_CONNECTION_ESTABLISHED l=1).read_bulk peer close file descriptor 42 2020-06-24T21:05:50.200+0000 7fa442f31700 1 -- [v2:172.21.15.205:6824/19930,v1:172.21.15.205:6825/19930] >> 172.21.15.205:0/627840989 conn(0x5647f6eb8000 msgr2=0x5647f6ed3900 secure :-1 s=STATE_CONNECTION_ESTABLISHED l=1).read_until read failed 2020-06-24T21:05:50.200+0000 7fa442f31700 1 --2- [v2:172.21.15.205:6824/19930,v1:172.21.15.205:6825/19930] >> 172.21.15.205:0/627840989 conn(0x5647f6eb8000 0x5647f6ed3900 secure :-1 s=READY pgs=3 cs=0 l=1 rev1=1 rx=0x5647f6eb1740 tx=0x5647f6dd2120).handle_read_frame_preamble_main read frame preamble failed r=-1 ((1) Operation not permitted) 2020-06-24T21:05:50.200+0000 7fa442f31700 1 --2- [v2:172.21.15.205:6824/19930,v1:172.21.15.205:6825/19930] >> 172.21.15.205:0/627840989 conn(0x5647f6eb8000 0x5647f6ed3900 secure :-1 s=READY pgs=3 cs=0 l=1 rev1=1 rx=0x5647f6eb1740 tx=0x5647f6dd2120).stop 2020-06-24T21:05:50.200+0000 7fa442f31700 1 -- [v2:172.21.15.205:6824/19930,v1:172.21.15.205:6825/19930] reap_dead start
Updated by Sebastian Wagner over 5 years ago
- Related to Bug #46178: slow request osd_op(... (undecoded) ondisk+retry+read+ignore_overlay+known_if_redirected e49) added
Updated by Ilya Dryomov over 5 years ago
- Category set to Correctness/Safety
- Assignee set to Ilya Dryomov
- Component(RADOS) Messenger added
This is a msgr2.1 issue.
Updated by Ilya Dryomov over 5 years ago
I think it has to do with reconnect handling and how connections are reused.
This part of ProtocolV2 is pretty fragile, evidenced by steadily accumulating workarounds for invalid memory use issues during msgr2.0 development and after (the last just in March). Most likely what happens is FrameAssembler is_rev1 state gets lost and a 2.0 frame ends up being assembled while the peer is expecting a 2.1 frame. I'll confirm and put out a fix ASAP.
Updated by Neha Ojha over 5 years ago
- Related to Bug #46179: Health check failed: Reduced data availability: PG_AVAILABILITY added
Updated by Neha Ojha over 5 years ago
- Related to Bug #46225: Health check failed: 1 osds down (OSD_DOWN) added
Updated by Neha Ojha over 5 years ago
Here's a reliable reproducer for the issue:
-s rados/singleton-nomsgr -c master --filter 'all/health-warnings rados' -N 20
Updated by Ilya Dryomov over 5 years ago
- Status changed from New to Fix Under Review
- Pull request ID set to 35816
Updated by Ilya Dryomov over 5 years ago
- Status changed from Fix Under Review to Resolved
Will be cherry-picked into https://github.com/ceph/ceph/pull/35720 and https://github.com/ceph/ceph/pull/35733.
Updated by Ilya Dryomov over 5 years ago
- Related to deleted (Bug #46178: slow request osd_op(... (undecoded) ondisk+retry+read+ignore_overlay+known_if_redirected e49) )
Updated by Ilya Dryomov over 5 years ago
- Has duplicate Bug #46178: slow request osd_op(... (undecoded) ondisk+retry+read+ignore_overlay+known_if_redirected e49) added
Updated by Neha Ojha over 5 years ago
- Related to deleted (Bug #46225: Health check failed: 1 osds down (OSD_DOWN))
Updated by Neha Ojha over 5 years ago
- Has duplicate Bug #46225: Health check failed: 1 osds down (OSD_DOWN) added
Updated by Neha Ojha over 5 years ago
- Related to deleted (Bug #46179: Health check failed: Reduced data availability: PG_AVAILABILITY)
Updated by Neha Ojha over 5 years ago
- Related to Bug #46179: Health check failed: Reduced data availability: PG_AVAILABILITY added
Updated by Upkeep Bot 8 months ago
- Merge Commit set to 0195fd53f1c6b720c29ee978dc31fb1303494072
- Fixed In set to v16.0.0-3118-g0195fd53f1
- Released In set to v16.2.0~2105
- Upkeep Timestamp set to 2025-07-14T19:39:05+00:00