Bug #74004
qa/standalone/ceph-helpers: ceph pg query hangs indefinitely
Description
ksirivad-2025-11-25_16:56:48-rados:standalone-wip-ksirivad-fix-67093-distro-default-smithi/8625131/teuthology.log
2025-11-25T18:00:53.017 INFO:tasks.workunit.client.0.smithi153.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:1709: wait_for_pg_clean: echo '#---------- 1.0 loop 12'
2025-11-25T18:00:53.017 INFO:tasks.workunit.client.0.smithi153.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:1710: wait_for_pg_clean: is_pg_clean 1.0
2025-11-25T18:00:53.017 INFO:tasks.workunit.client.0.smithi153.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:1572: is_pg_clean: local pgid=1.0
2025-11-25T18:00:53.018 INFO:tasks.workunit.client.0.smithi153.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:1573: is_pg_clean: local pg_state
2025-11-25T18:00:53.018 INFO:tasks.workunit.client.0.smithi153.stderr://home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:1574: is_pg_clean: ceph pg 1.0 query
2025-11-25T18:00:53.018 INFO:tasks.workunit.client.0.smithi153.stderr://home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:1574: is_pg_clean: jq -r '.state '
2025-11-25T20:31:42.739 DEBUG:teuthology.orchestra.run:got remote process result: 124
ksirivad-2025-11-25_16:56:48-rados:standalone-wip-ksirivad-fix-67093-distro-default-smithi/8625113/teuthology.log
2025-11-25T18:00:45.551 INFO:tasks.workunit.client.0.smithi126.stdout:#---------- 1.0 loop 12
2025-11-25T18:00:45.551 INFO:tasks.workunit.client.0.smithi126.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:1709: wait_for_pg_clean: echo '#---------- 1.0 loop 12'
2025-11-25T18:00:45.551 INFO:tasks.workunit.client.0.smithi126.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:1710: wait_for_pg_clean: is_pg_clean 1.0
2025-11-25T18:00:45.551 INFO:tasks.workunit.client.0.smithi126.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:1572: is_pg_clean: local pgid=1.0
2025-11-25T18:00:45.552 INFO:tasks.workunit.client.0.smithi126.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:1573: is_pg_clean: local pg_state
2025-11-25T18:00:45.552 INFO:tasks.workunit.client.0.smithi126.stderr://home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:1574: is_pg_clean: ceph pg 1.0 query
2025-11-25T18:00:45.552 INFO:tasks.workunit.client.0.smithi126.stderr://home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:1574: is_pg_clean: jq -r '.state '
2025-11-25T20:31:27.520 DEBUG:teuthology.orchestra.run:got remote process result: 124
2025-11-25T20:31:27.522 INFO:tasks.workunit.client.0.smithi126.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:1574: is_pg_clean: pg_state=
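From the xtrace above, the helper that hangs looks roughly like this (a reconstruction from the traced lines 1572-1574, not the verbatim ceph-helpers.sh; the final state comparison is an assumption):

function is_pg_clean() {
    local pgid=$1
    local pg_state
    # The query has no timeout guard: with the PG's only OSD gone, the
    # command blocks until teuthology's outer timeout kills the run (124).
    pg_state=$(ceph pg $pgid query | jq -r '.state ')
    # state check assumed; the real helper's comparison may differ
    test "$pg_state" = "active+clean"
}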
Updated by Kamoltat (Junior) Sirivadhna 4 months ago
Mostly seen in rados/standalone/{supported-random-distro$/{ubuntu_latest} workloads/misc}
Updated by Nitzan Mordechai 4 months ago
It looks like the killing of the OSD was too fast, or maybe the check after killing the OSD was wrong?
2025-11-25T17:45:31.977 INFO:tasks.workunit.client.0.smithi153.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:1950: test_pg_scrub: kill_daemons td/ceph-helpers KILL osd
2025-11-25T17:45:31.977 INFO:tasks.workunit.client.0.smithi153.stderr://home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:336: kill_daemons: shopt -q -o xtrace
2025-11-25T17:45:31.977 INFO:tasks.workunit.client.0.smithi153.stderr://home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:336: kill_daemons: echo true
2025-11-25T17:45:31.977 INFO:tasks.workunit.client.0.smithi153.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:336: kill_daemons: local trace=true
2025-11-25T17:45:31.977 INFO:tasks.workunit.client.0.smithi153.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:337: kill_daemons: true
2025-11-25T17:45:31.977 INFO:tasks.workunit.client.0.smithi153.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:337: kill_daemons: shopt -u -o xtrace
2025-11-25T17:45:32.081 INFO:tasks.workunit.client.0.smithi153.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:353: kill_daemons: return 0
2025-11-25T17:45:32.081 INFO:tasks.workunit.client.0.smithi153.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:1951: test_pg_scrub: TIMEOUT=2
2025-11-25T17:45:32.082 INFO:tasks.workunit.client.0.smithi153.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:1951: test_pg_scrub: pg_scrub 1.0
2025-11-25T17:45:32.082 INFO:tasks.workunit.client.0.smithi153.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:1923: pg_scrub: local pgid=1.0
2025-11-25T17:45:32.082 INFO:tasks.workunit.client.0.smithi153.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:1925: pg_scrub: wait_for_pg_clean 1.0
2025-11-25T17:45:32.082 INFO:tasks.workunit.client.0.smithi153.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:1702: wait_for_pg_clean: local pg_id=1.0
2025-11-25T17:45:32.082 INFO:tasks.workunit.client.0.smithi153.stderr://home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:1703: wait_for_pg_clean: get_timeout_delays 90 1 3
2025-11-25T17:45:32.082 INFO:tasks.workunit.client.0.smithi153.stderr:///home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:1593: get_timeout_delays: shopt -q -o xtrace
2025-11-25T17:45:32.082 INFO:tasks.workunit.client.0.smithi153.stderr:///home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:1593: get_timeout_delays: echo true
We recently added some more function calls after the kill; it sounds like those are the ones causing the issue.
Updated by Nitzan Mordechai 4 months ago
- Status changed from New to Fix Under Review
- Assignee set to Nitzan Mordechai
- Pull request ID set to 66457
Updated by Laura Flores 4 months ago
For the sake of establishing a history, I had logged something that looks like this here: https://tracker.ceph.com/issues/64435#note-4
Updated by Radoslaw Zarzynski 4 months ago
scrub note: @Kamoltat (Junior) Sirivadhna is going to take a look at the fix.
Updated by Laura Flores 4 months ago
Snippet from the mgr log:
/a/ksirivad-2025-11-25_16:56:48-rados:standalone-wip-ksirivad-fix-67093-distro-default-smithi/8625131/remote/smithi153/log/mgr.x.log.gz
2025-11-25T20:31:41.337+0000 7f91b6280640 20 mgr.server operator() health checks:
{
  "PG_AVAILABILITY": {
    "severity": "HEALTH_WARN",
    "summary": {
      "message": "Reduced data availability: 4 pgs stale",
      "count": 4
    },
    "detail": [
      {
        "message": "pg 1.0 is stuck stale for 2h, current state stale+active+clean, last acting [0]"
      },
      {
        "message": "pg 1.1 is stuck stale for 2h, current state stale+active+clean, last acting [0]"
      },
      {
        "message": "pg 1.2 is stuck stale for 2h, current state stale+active+clean, last acting [0]"
      },
      {
        "message": "pg 1.3 is stuck stale for 2h, current state stale+active+clean, last acting [0]"
      }
    ]
  },
  "TOO_FEW_OSDS": {
    "severity": "HEALTH_WARN",
    "summary": {
      "message": "OSD count 1 < osd_pool_default_size 3",
      "count": 2
    },
    "detail": []
  }
}
Updated by Kamoltat (Junior) Sirivadhna 4 months ago
Laura's snippet supports the case where we are over-thrashing the OSDs, leaving pg 1.0 stale; that's why we get stuck when we try to query pg 1.0.
Updated by Nitzan Mordechai 4 months ago
@Kamoltat (Junior) Sirivadhna thanks for checking that!
The stuck stale PGs warning happened because the OSD was already killed (as part of the test). We ultimately fail because of a timeout.
Let's look at what happened before the "stuck stale" warnings appeared.
The code (before my changes):
function test_pg_scrub() {
    local dir=$1
    setup $dir || return 1
    run_mon $dir a --osd_pool_default_size=1 --mon_allow_pool_size_one=true || return 1
    run_mgr $dir x || return 1
    run_osd $dir 0 || return 1
    create_rbd_pool || return 1
    wait_for_clean || return 1
    pg_scrub 1.0 || return 1
    kill_daemons $dir KILL osd || return 1
    ! TIMEOUT=2 pg_scrub 1.0 || return 1
    teardown $dir || return 1
}
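Note the negated second call: ! TIMEOUT=2 pg_scrub 1.0 is expected to fail quickly once the OSD is killed; the bug is that the helpers underneath it block far past that 2-second budget instead.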
the teuthology log shows:
2025-11-25T17:45:31.977 INFO:tasks.workunit.client.0.smithi153.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:1950: test_pg_scrub: kill_daemons td/ceph-helpers KILL osd
2025-11-25T17:45:31.977 INFO:tasks.workunit.client.0.smithi153.stderr://home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:336: kill_daemons: shopt -q -o xtrace
2025-11-25T17:45:31.977 INFO:tasks.workunit.client.0.smithi153.stderr://home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:336: kill_daemons: echo true
2025-11-25T17:45:31.977 INFO:tasks.workunit.client.0.smithi153.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:336: kill_daemons: local trace=true
2025-11-25T17:45:31.977 INFO:tasks.workunit.client.0.smithi153.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:337: kill_daemons: true
2025-11-25T17:45:31.977 INFO:tasks.workunit.client.0.smithi153.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:337: kill_daemons: shopt -u -o xtrace
2025-11-25T17:45:32.081 INFO:tasks.workunit.client.0.smithi153.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:353: kill_daemons: return 0
2025-11-25T17:45:32.081 INFO:tasks.workunit.client.0.smithi153.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:1951: test_pg_scrub: TIMEOUT=2
2025-11-25T17:45:32.082 INFO:tasks.workunit.client.0.smithi153.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:1951: test_pg_scrub: pg_scrub 1.0
2025-11-25T17:45:32.082 INFO:tasks.workunit.client.0.smithi153.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:1923: pg_scrub: local pgid=1.0
2025-11-25T17:45:32.082 INFO:tasks.workunit.client.0.smithi153.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:1925: pg_scrub: wait_for_pg_clean 1.0
2025-11-25T17:45:32.082 INFO:tasks.workunit.client.0.smithi153.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:1702: wait_for_pg_clean: local pg_id=1.0
2025-11-25T17:45:32.082 INFO:tasks.workunit.client.0.smithi153.stderr://home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:1703: wait_for_pg_clean: get_timeout_delays 90 1 3
2025-11-25T17:45:32.082 INFO:tasks.workunit.client.0.smithi153.stderr:///home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:1593: get_timeout_delays: shopt -q -o xtrace
2025-11-25T17:45:32.082 INFO:tasks.workunit.client.0.smithi153.stderr:///home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:1593: get_timeout_delays: echo true
2025-11-25T17:45:32.082 INFO:tasks.workunit.client.0.smithi153.stderr://home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:1593: get_timeout_delays: local trace=true
2025-11-25T17:45:32.082 INFO:tasks.workunit.client.0.smithi153.stderr://home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:1594: get_timeout_delays: true
2025-11-25T17:45:32.082 INFO:tasks.workunit.client.0.smithi153.stderr://home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:1594: get_timeout_delays: shopt -u -o xtrace
2025-11-25T17:45:32.289 INFO:tasks.workunit.client.0.smithi153.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:1703: wait_for_pg_clean: delays=('1' '2' '3' '3' '3' '3' '3' '3' '3' '3' '3' '3' '3' '3' '3' '3' '3' '3' '3' '3' '3' '3' '3' '3' '3' '3' '3' '3' '3' '3' '3')
2025-11-25T17:45:32.290 INFO:tasks.workunit.client.0.smithi153.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:1703: wait_for_pg_clean: local -a delays
2025-11-25T17:45:32.290 INFO:tasks.workunit.client.0.smithi153.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:1704: wait_for_pg_clean: local -i loop=0
2025-11-25T17:45:32.290 INFO:tasks.workunit.client.0.smithi153.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:1706: wait_for_pg_clean: flush_pg_stats
We killed the OSD at 17:45:32 and started "pg_scrub 1.0" with TIMEOUT=2 at 17:45:32.082, which called wait_for_pg_clean with the following delays set:
delays=('1' '2' '3' '3' '3' '3' '3' '3' '3' '3' '3' '3' '3' '3' '3' '3' '3' '3' '3' '3' '3' '3' '3' '3' '3' '3' '3' '3' '3' '3' '3')
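(For reference, get_timeout_delays <timeout> <base> <max> appears to roughly double the delay from the base up to the cap and then repeat the cap until the delays sum to the timeout budget: 1 + 2 + 29*3 = 90 seconds here.)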
wait_for_pg_clean then called flush_pg_stats and the log shows:
2025-11-25T17:45:32.767 INFO:tasks.workunit.client.0.smithi153.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:2257: flush_pg_stats: for osd in $ids
2025-11-25T17:45:32.767 INFO:tasks.workunit.client.0.smithi153.stderr://home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:2258: flush_pg_stats: ceph tell osd.0 flush_pg_stats
2025-11-25T18:00:18.752 INFO:tasks.workunit.client.0.smithi153.stderr:Error ENXIO: problem getting command descriptions from osd.0
So we waited 15 minutes before timing out. That's why my fix adds a timeout to the ceph tell line.
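In sketch form, the bounded helper looks like this (a reconstruction based on the later traces in this thread, not the verbatim PR 66457 diff; the seq-waiting tail of the real helper is omitted):

function flush_pg_stats() {
    local timeout=${1:-$TIMEOUT}
    local ids osd seq
    ids=$(ceph osd ls)
    for osd in $ids ; do
        # Bound the tell so a killed OSD cannot stall the helper for
        # 15 minutes; on failure the helper reports an error instead.
        seq=$(timeout $timeout ceph tell osd.$osd flush_pg_stats) || return 1
        test -z "$seq" && continue
        # the real helper records $osd-$seq and waits for the reported
        # seq to be reflected in the osd stats (omitted in this sketch)
    done
}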
Adding WAIT_FOR_CLEAN_TIMEOUT controls the delays set in wait_for_clean after killing the OSD, so we won't time out.
The mgr then starts to show the stuck stale PGs warning:
2025-11-25T18:01:19.554+0000 7f91b6280640 20 mgr.server operator() health checks:
{
  "PG_AVAILABILITY": {
    "severity": "HEALTH_WARN",
    "summary": {
      "message": "Reduced data availability: 4 pgs stale",
      "count": 4
    },
    "detail": [
      {
        "message": "pg 1.0 is stuck stale for 60s, current state stale+active+clean, last acting [0]"
      },
      {
        "message": "pg 1.1 is stuck stale for 60s, current state stale+active+clean, last acting [0]"
      },
      {
        "message": "pg 1.2 is stuck stale for 60s, current state stale+active+clean, last acting [0]"
      },
      {
        "message": "pg 1.3 is stuck stale for 60s, current state stale+active+clean, last acting [0]"
      }
    ]
  },
  "TOO_FEW_OSDS": {
    "severity": "HEALTH_WARN",
    "summary": {
      "message": "OSD count 1 < osd_pool_default_size 3",
      "count": 2
    },
    "detail": []
  }
}
This happened after the flush_pg_stats timeout.
Please let me know if you agree with that analysis.
Updated by Sridhar Seshasayee 3 months ago
The changes in the associated PR are causing failures in multiple existing standalone scripts, shown below:
/a/skanta-2025-12-03_02:50:04-rados-wip-bharath5-testing-2025-12-02-1511-distro-default-smithi/
[8638396, 8638388, 8638400, 8638405, 8638408]
In qa/standalone/misc/ok-to-stop.sh -> TEST_0_osd(), the test brings down osd.0 and calls wait_for_peered(),
which in turn calls flush_pg_stats(). Since flush_pg_stats() returns 1 for osd.0 (because it's down),
wait_for_peered() also fails and causes the test to fail. Either flush_pg_stats() should not return 1
(as before), or wait_for_peered() should handle the error status from flush_pg_stats(). I think the former,
with some additional checks on the return status (ENXIO), is the right approach, since flush_pg_stats()
can legitimately be called on downed OSDs.
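Concretely, the per-OSD failure could be swallowed instead of propagated (an illustrative sketch of that direction, not the exact code that landed):

function flush_pg_stats() {
    local timeout=${1:-$TIMEOUT}
    local ids osd seq seqs=''
    ids=$(ceph osd ls)
    for osd in $ids ; do
        # A down OSD makes the tell fail with ENXIO (or the timeout fires),
        # leaving seq empty; skip that OSD instead of returning 1, since
        # callers like wait_for_peered() legitimately run with OSDs down.
        seq=$(timeout $timeout ceph tell osd.$osd flush_pg_stats) || true
        test -z "$seq" && continue
        seqs="$seqs $osd-$seq"
    done
    # wait on $seqs as before (omitted in this sketch)
}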
Failure Logs:
2025-12-03T06:38:14.395 INFO:tasks.workunit.client.0.smithi202.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/misc/ok-to-stop.sh:285: TEST_0_osd: kill_daemons td/ok-to-stop TERM osd.0
2025-12-03T06:38:14.396 INFO:tasks.workunit.client.0.smithi202.stderr://home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:336: kill_daemons: shopt -q -o xtrace
2025-12-03T06:38:14.396 INFO:tasks.workunit.client.0.smithi202.stderr://home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:336: kill_daemons: echo true
2025-12-03T06:38:14.397 INFO:tasks.workunit.client.0.smithi202.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:336: kill_daemons: local trace=true
2025-12-03T06:38:14.397 INFO:tasks.workunit.client.0.smithi202.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:337: kill_daemons: true
2025-12-03T06:38:14.397 INFO:tasks.workunit.client.0.smithi202.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:337: kill_daemons: shopt -u -o xtrace
2025-12-03T06:38:14.705 INFO:tasks.workunit.client.0.smithi202.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:353: kill_daemons: return 0
2025-12-03T06:38:14.705 INFO:tasks.workunit.client.0.smithi202.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/misc/ok-to-stop.sh:286: TEST_0_osd: ceph osd down 0
2025-12-03T06:38:15.497 INFO:tasks.workunit.client.0.smithi202.stderr:osd.0 is already down.
2025-12-03T06:38:15.516 INFO:tasks.workunit.client.0.smithi202.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/misc/ok-to-stop.sh:287: TEST_0_osd: wait_for_peered
2025-12-03T06:38:15.516 INFO:tasks.workunit.client.0.smithi202.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:1731: wait_for_peered: local cmd=
2025-12-03T06:38:15.516 INFO:tasks.workunit.client.0.smithi202.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:1732: wait_for_peered: local num_peered=-1
2025-12-03T06:38:15.516 INFO:tasks.workunit.client.0.smithi202.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:1733: wait_for_peered: local cur_peered
2025-12-03T06:38:15.517 INFO:tasks.workunit.client.0.smithi202.stderr://home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:1734: wait_for_peered: get_timeout_delays 90 .1
2025-12-03T06:38:15.517 INFO:tasks.workunit.client.0.smithi202.stderr:///home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:1593: get_timeout_delays: shopt -q -o xtrace
2025-12-03T06:38:15.517 INFO:tasks.workunit.client.0.smithi202.stderr:///home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:1593: get_timeout_delays: echo true
2025-12-03T06:38:15.517 INFO:tasks.workunit.client.0.smithi202.stderr://home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:1593: get_timeout_delays: local trace=true
2025-12-03T06:38:15.517 INFO:tasks.workunit.client.0.smithi202.stderr://home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:1594: get_timeout_delays: true
2025-12-03T06:38:15.517 INFO:tasks.workunit.client.0.smithi202.stderr://home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:1594: get_timeout_delays: shopt -u -o xtrace
2025-12-03T06:38:15.611 INFO:tasks.workunit.client.0.smithi202.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:1734: wait_for_peered: delays=('0.1' '0.2' '0.4' '0.8' '1.6' '3.2' '6.4' '12.8' '15' '15' '15' '15' '4.5')
2025-12-03T06:38:15.611 INFO:tasks.workunit.client.0.smithi202.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:1734: wait_for_peered: local -a delays
2025-12-03T06:38:15.611 INFO:tasks.workunit.client.0.smithi202.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:1735: wait_for_peered: local -i loop=0
2025-12-03T06:38:15.611 INFO:tasks.workunit.client.0.smithi202.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:1737: wait_for_peered: flush_pg_stats
2025-12-03T06:38:15.611 INFO:tasks.workunit.client.0.smithi202.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:2253: flush_pg_stats: local timeout=300
2025-12-03T06:38:15.611 INFO:tasks.workunit.client.0.smithi202.stderr://home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:2255: flush_pg_stats: ceph osd ls
2025-12-03T06:38:16.193 INFO:tasks.workunit.client.0.smithi202.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:2255: flush_pg_stats: ids='0
2025-12-03T06:38:16.193 INFO:tasks.workunit.client.0.smithi202.stderr:1
2025-12-03T06:38:16.194 INFO:tasks.workunit.client.0.smithi202.stderr:2
2025-12-03T06:38:16.194 INFO:tasks.workunit.client.0.smithi202.stderr:3'
2025-12-03T06:38:16.194 INFO:tasks.workunit.client.0.smithi202.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:2256: flush_pg_stats: seqs=
2025-12-03T06:38:16.194 INFO:tasks.workunit.client.0.smithi202.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:2257: flush_pg_stats: for osd in $ids
2025-12-03T06:38:16.194 INFO:tasks.workunit.client.0.smithi202.stderr://home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:2258: flush_pg_stats: timeout 300 ceph tell osd.0 flush_pg_stats
2025-12-03T06:38:16.314 INFO:tasks.workunit.client.0.smithi202.stderr:Error ENXIO: problem getting command descriptions from osd.0
2025-12-03T06:38:16.317 INFO:tasks.workunit.client.0.smithi202.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:2258: flush_pg_stats: seq=
2025-12-03T06:38:16.317 INFO:tasks.workunit.client.0.smithi202.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:2258: flush_pg_stats: return 1
2025-12-03T06:38:16.317 INFO:tasks.workunit.client.0.smithi202.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:1737: wait_for_peered: return 1
2025-12-03T06:38:16.318 INFO:tasks.workunit.client.0.smithi202.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/misc/ok-to-stop.sh:287: TEST_0_osd: return 1
2025-12-03T06:38:16.318 INFO:tasks.workunit.client.0.smithi202.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/misc/ok-to-stop.sh:21: run: return 1
2025-12-03T06:38:16.318 INFO:tasks.workunit.client.0.smithi202.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:2402: main: code=1
2025-12-03T06:38:16.318 INFO:tasks.workunit.client.0.smithi202.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:2404: main: teardown td/ok-to-stop 1
2025-12-03T06:38:16.318 INFO:tasks.workunit.client.0.smithi202.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:155: teardown: local dir=td/ok-to-stop
Updated by Nitzan Mordechai 3 months ago
Fixed the PR and pushed the changes; you can see the results at https://pulpito.ceph.com/nmordech-2025-12-11_14:38:40-rados:standalone-wip-bharath5-testing-2025-12-02-1511-distro-default-smithi/
Updated by Laura Flores 2 months ago
The PR can be merged if tests pass, @Nitzan Mordechai.
Updated by Nitzan Mordechai 2 months ago
- Status changed from Fix Under Review to Pending Backport
Updated by Upkeep Bot 2 months ago
- Merge Commit set to bd20e52da837da33f114112778884683b85630e0
- Fixed In set to v20.3.0-4787-gbd20e52da8
- Upkeep Timestamp set to 2026-01-11T12:42:12+00:00
Updated by Upkeep Bot 2 months ago
- Copied to Backport #74374: squid: qa/standalone/ceph-helpers: ceph pg query hangs indefinitely added
Updated by Upkeep Bot 2 months ago
- Copied to Backport #74375: tentacle: qa/standalone/ceph-helpers: ceph pg query hangs indefinitely added
Updated by Laura Flores about 2 months ago
/a/lflores-2026-01-21_20:56:39-rados-main-distro-default-trial/11945
Updated by Laura Flores about 2 months ago
/a/lflores-2026-01-23_19:07:45-rados-wip-rocky10-branch-of-the-day-2026-01-23-1769128778-distro-default-trial/15368
Updated by Sridhar Seshasayee about 2 months ago
/a/skanta-2026-01-27_05:35:03-rados-wip-bharath1-testing-2026-01-26-1242-distro-default-trial/19764
Updated by Aishwarya Mathuria about 2 months ago
/a/skanta-2026-01-30_23:46:16-rados-wip-bharath7-testing-2026-01-29-2016-distro-default-trial/28572
Updated by Laura Flores about 1 month ago
/a/yuriw-2026-02-03_16:00:06-rados-wip-yuri4-testing-2026-02-02-2122-distro-default-trial/31692
Updated by Kamoltat (Junior) Sirivadhna about 1 month ago
Note for watchers: the problem occurred in main again even after the fix was merged.
Updated by Aishwarya Mathuria about 1 month ago
/a/skanta-2026-02-07_00:02:26-rados-wip-bharath7-testing-2026-02-06-0906-distro-default-trial/39117
Updated by Lee Sanders about 1 month ago
/a/skanta-2026-01-29_02:19:11-rados-wip-bharath5-testing-2026-01-28-2018-distro-default-trial/24718
Updated by Lee Sanders about 1 month ago
/a/skanta-2026-01-29_13:05:02-rados-wip-bharath5-testing-2026-01-28-2018-distro-default-trial/25718
Updated by Aishwarya Mathuria about 1 month ago
/a/skanta-2026-02-05_03:38:32-rados-wip-bharath2-testing-2026-02-03-0542-distro-default-trial/35654
Updated by Nitzan Mordechai about 1 month ago
/a/yaarit-2026-02-10_23:48:52-rados-wip-rocky10-branch-of-the-day-2026-02-09-1770676549-distro-default-trial/44401
Updated by Sridhar Seshasayee 26 days ago
/a/skanta-2026-02-21_10:47:12-rados-wip-bharath21-testing-2026-02-20-1039-distro-default-trial/62535
The issue is hit again in the same test (test_pg_scrub):
2026-02-21T11:42:16.256 INFO:tasks.workunit.client.0.trial189.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:1980: test_pg_scrub: WAIT_FOR_CLEAN_TIMEOUT=10
2026-02-21T11:42:16.256 INFO:tasks.workunit.client.0.trial189.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:1980: test_pg_scrub: TIMEOUT=2
2026-02-21T11:42:16.256 INFO:tasks.workunit.client.0.trial189.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:1980: test_pg_scrub: pg_scrub 1.0
2026-02-21T11:42:16.256 INFO:tasks.workunit.client.0.trial189.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:1952: pg_scrub: local pgid=1.0
2026-02-21T11:42:16.256 INFO:tasks.workunit.client.0.trial189.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:1954: pg_scrub: wait_for_pg_clean 1.0
2026-02-21T11:42:16.256 INFO:tasks.workunit.client.0.trial189.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:1702: wait_for_pg_clean: local pg_id=1.0
2026-02-21T11:42:16.256 INFO:tasks.workunit.client.0.trial189.stderr://home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:1703: wait_for_pg_clean: get_timeout_delays 10 1 3
2026-02-21T11:42:16.257 INFO:tasks.workunit.client.0.trial189.stderr:///home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:1593: get_timeout_delays: shopt -q -o xtrace
2026-02-21T11:42:16.257 INFO:tasks.workunit.client.0.trial189.stderr:///home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:1593: get_timeout_delays: echo true
2026-02-21T11:42:16.257 INFO:tasks.workunit.client.0.trial189.stderr://home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:1593: get_timeout_delays: local trace=true
2026-02-21T11:42:16.257 INFO:tasks.workunit.client.0.trial189.stderr://home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:1594: get_timeout_delays: true
2026-02-21T11:42:16.257 INFO:tasks.workunit.client.0.trial189.stderr://home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:1594: get_timeout_delays: shopt -u -o xtrace
2026-02-21T11:42:16.290 INFO:tasks.workunit.client.0.trial189.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:1703: wait_for_pg_clean: delays=('1' '2' '3' '3' '1')
2026-02-21T11:42:16.290 INFO:tasks.workunit.client.0.trial189.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:1703: wait_for_pg_clean: local -a delays
2026-02-21T11:42:16.290 INFO:tasks.workunit.client.0.trial189.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:1704: wait_for_pg_clean: local -i loop=0
2026-02-21T11:42:16.290 INFO:tasks.workunit.client.0.trial189.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:1706: wait_for_pg_clean: flush_pg_stats
2026-02-21T11:42:16.290 INFO:tasks.workunit.client.0.trial189.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:2282: flush_pg_stats: local timeout=2
2026-02-21T11:42:16.290 INFO:tasks.workunit.client.0.trial189.stderr://home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:2284: flush_pg_stats: ceph osd ls
2026-02-21T11:42:16.519 INFO:tasks.workunit.client.0.trial189.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:2284: flush_pg_stats: ids=0
2026-02-21T11:42:16.519 INFO:tasks.workunit.client.0.trial189.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:2285: flush_pg_stats: seqs=
2026-02-21T11:42:16.519 INFO:tasks.workunit.client.0.trial189.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:2286: flush_pg_stats: for osd in $ids
2026-02-21T11:42:16.520 INFO:tasks.workunit.client.0.trial189.stderr://home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:2287: flush_pg_stats: timeout 2 ceph tell osd.0 flush_pg_stats
2026-02-21T11:42:18.524 INFO:tasks.workunit.client.0.trial189.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:2287: flush_pg_stats: seq=
2026-02-21T11:42:18.524 INFO:tasks.workunit.client.0.trial189.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:2287: flush_pg_stats: true
2026-02-21T11:42:18.524 INFO:tasks.workunit.client.0.trial189.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:2288: flush_pg_stats: test -z ''
2026-02-21T11:42:18.524 INFO:tasks.workunit.client.0.trial189.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:2289: flush_pg_stats: continue
2026-02-21T11:42:18.524 INFO:tasks.workunit.client.0.trial189.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:1708: wait_for_pg_clean: true
2026-02-21T11:42:18.524 INFO:tasks.workunit.client.0.trial189.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:1709: wait_for_pg_clean: echo '#---------- 1.0 loop 0'
2026-02-21T11:42:18.524 INFO:tasks.workunit.client.0.trial189.stdout:#---------- 1.0 loop 0
2026-02-21T11:42:18.525 INFO:tasks.workunit.client.0.trial189.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:1710: wait_for_pg_clean: is_pg_clean 1.0
2026-02-21T11:42:18.525 INFO:tasks.workunit.client.0.trial189.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:1572: is_pg_clean: local pgid=1.0
2026-02-21T11:42:18.525 INFO:tasks.workunit.client.0.trial189.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:1573: is_pg_clean: local pg_state
2026-02-21T11:42:18.525 INFO:tasks.workunit.client.0.trial189.stderr://home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:1574: is_pg_clean: ceph pg 1.0 query
2026-02-21T11:42:18.525 INFO:tasks.workunit.client.0.trial189.stderr://home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:1574: is_pg_clean: jq -r '.state '
2026-02-21T14:39:24.182 DEBUG:teuthology.orchestra.run:got remote process result: 124
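The trace above shows the flush_pg_stats timeout now working (the tell fails fast, seq stays empty, and the loop continues), yet the ceph pg 1.0 query inside is_pg_clean still blocks until the harness timeout (exit 124). A guard of the same shape around the query would be one possible direction (a purely hypothetical sketch; the follow-up is tracked in the new bug below):

function is_pg_clean() {
    local pgid=$1
    local pg_state
    # Hypothetical: bound the query the same way flush_pg_stats was
    # bounded, so a PG whose OSDs are gone cannot hang the harness.
    pg_state=$(timeout $TIMEOUT ceph pg $pgid query | jq -r '.state ') || return 1
    # state check assumed; the real helper's comparison may differ
    test "$pg_state" = "active+clean"
}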
Updated by Nitzan Mordechai 12 days ago
- Related to Bug #75406: qa/standalone/ceph-helpers: is_pg_clean hang after osd teardown added
Updated by Nitzan Mordechai 12 days ago
I opened a new tracker for the new issue: https://tracker.ceph.com/issues/75406