Bug #74004


qa/standalone/ceph-helpers: ceph pg query hangs indefinitely

Added by Kamoltat (Junior) Sirivadhna 4 months ago. Updated 12 days ago.

Status:
Pending Backport
Priority:
Normal
Category:
-
Target version:
-
% Done:

0%

Source:
Backport:
tentacle, squid
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Tags (freeform):
backport_processed
Fixed In:
v20.3.0-4787-gbd20e52da8
Released In:
Upkeep Timestamp:
2026-01-11T12:42:12+00:00

Description

ksirivad-2025-11-25_16:56:48-rados:standalone-wip-ksirivad-fix-67093-distro-default-smithi/8625131/teuthology.log

2025-11-25T18:00:53.017 INFO:tasks.workunit.client.0.smithi153.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:1709: wait_for_pg_clean:  echo '#---------- 1.0 loop 12'
2025-11-25T18:00:53.017 INFO:tasks.workunit.client.0.smithi153.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:1710: wait_for_pg_clean:  is_pg_clean 1.0
2025-11-25T18:00:53.017 INFO:tasks.workunit.client.0.smithi153.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:1572: is_pg_clean:  local pgid=1.0
2025-11-25T18:00:53.018 INFO:tasks.workunit.client.0.smithi153.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:1573: is_pg_clean:  local pg_state
2025-11-25T18:00:53.018 INFO:tasks.workunit.client.0.smithi153.stderr://home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:1574: is_pg_clean:  ceph pg 1.0 query
2025-11-25T18:00:53.018 INFO:tasks.workunit.client.0.smithi153.stderr://home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:1574: is_pg_clean:  jq -r '.state '
2025-11-25T20:31:42.739 DEBUG:teuthology.orchestra.run:got remote process result: 124

ksirivad-2025-11-25_16:56:48-rados:standalone-wip-ksirivad-fix-67093-distro-default-smithi/8625113/teuthology.log

2025-11-25T18:00:45.551 INFO:tasks.workunit.client.0.smithi126.stdout:#---------- 1.0 loop 12
2025-11-25T18:00:45.551 INFO:tasks.workunit.client.0.smithi126.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:1709: wait_for_pg_clean:  echo '#---------- 1.0 loop 12'
2025-11-25T18:00:45.551 INFO:tasks.workunit.client.0.smithi126.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:1710: wait_for_pg_clean:  is_pg_clean 1.0
2025-11-25T18:00:45.551 INFO:tasks.workunit.client.0.smithi126.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:1572: is_pg_clean:  local pgid=1.0
2025-11-25T18:00:45.552 INFO:tasks.workunit.client.0.smithi126.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:1573: is_pg_clean:  local pg_state
2025-11-25T18:00:45.552 INFO:tasks.workunit.client.0.smithi126.stderr://home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:1574: is_pg_clean:  ceph pg 1.0 query
2025-11-25T18:00:45.552 INFO:tasks.workunit.client.0.smithi126.stderr://home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:1574: is_pg_clean:  jq -r '.state '
2025-11-25T20:31:27.520 DEBUG:teuthology.orchestra.run:got remote process result: 124
2025-11-25T20:31:27.522 INFO:tasks.workunit.client.0.smithi126.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:1574: is_pg_clean:  pg_state=


Related issues 3 (3 open, 0 closed)

Related to RADOS - Bug #75406: qa/standalone/ceph-helpers: is_pg_clean hang after osd teardown (Fix Under Review, Nitzan Mordechai)
Copied to RADOS - Backport #74374: squid: qa/standalone/ceph-helpers: ceph pg query hangs indefinitely (In Progress, Nitzan Mordechai)
Copied to RADOS - Backport #74375: tentacle: qa/standalone/ceph-helpers: ceph pg query hangs indefinitely (In Progress, Nitzan Mordechai)
Actions #1

Updated by Kamoltat (Junior) Sirivadhna 4 months ago

Mostly seen in rados/standalone/{supported-random-distro$/{ubuntu_latest} workloads/misc}

Actions #2

Updated by Nitzan Mordechai 4 months ago

It looks like the OSD was killed too quickly, or perhaps the check performed after killing the OSD was wrong?

2025-11-25T17:45:31.977 INFO:tasks.workunit.client.0.smithi153.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:1950: test_pg_scrub:  kill_daemons td/ceph-helpers KILL osd
2025-11-25T17:45:31.977 INFO:tasks.workunit.client.0.smithi153.stderr://home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:336: kill_daemons:  shopt -q -o xtrace
2025-11-25T17:45:31.977 INFO:tasks.workunit.client.0.smithi153.stderr://home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:336: kill_daemons:  echo true
2025-11-25T17:45:31.977 INFO:tasks.workunit.client.0.smithi153.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:336: kill_daemons:  local trace=true
2025-11-25T17:45:31.977 INFO:tasks.workunit.client.0.smithi153.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:337: kill_daemons:  true
2025-11-25T17:45:31.977 INFO:tasks.workunit.client.0.smithi153.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:337: kill_daemons:  shopt -u -o xtrace
2025-11-25T17:45:32.081 INFO:tasks.workunit.client.0.smithi153.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:353: kill_daemons:  return 0
2025-11-25T17:45:32.081 INFO:tasks.workunit.client.0.smithi153.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:1951: test_pg_scrub:  TIMEOUT=2
2025-11-25T17:45:32.082 INFO:tasks.workunit.client.0.smithi153.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:1951: test_pg_scrub:  pg_scrub 1.0
2025-11-25T17:45:32.082 INFO:tasks.workunit.client.0.smithi153.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:1923: pg_scrub:  local pgid=1.0
2025-11-25T17:45:32.082 INFO:tasks.workunit.client.0.smithi153.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:1925: pg_scrub:  wait_for_pg_clean 1.0
2025-11-25T17:45:32.082 INFO:tasks.workunit.client.0.smithi153.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:1702: wait_for_pg_clean:  local pg_id=1.0
2025-11-25T17:45:32.082 INFO:tasks.workunit.client.0.smithi153.stderr://home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:1703: wait_for_pg_clean:  get_timeout_delays 90 1 3
2025-11-25T17:45:32.082 INFO:tasks.workunit.client.0.smithi153.stderr:///home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:1593: get_timeout_delays:  shopt -q -o xtrace
2025-11-25T17:45:32.082 INFO:tasks.workunit.client.0.smithi153.stderr:///home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:1593: get_timeout_delays:  echo true

We recently added some more function calls after the kill; it sounds like those are the ones causing the issue.

Actions #3

Updated by Nitzan Mordechai 4 months ago

  • Status changed from New to Fix Under Review
  • Assignee set to Nitzan Mordechai
  • Pull request ID set to 66457
Actions #4

Updated by Laura Flores 4 months ago

For the sake of establishing a history, I had logged something that looks like this here: https://tracker.ceph.com/issues/64435#note-4

Actions #5

Updated by Radoslaw Zarzynski 4 months ago

scrub note: @Kamoltat (Junior) Sirivadhna is going to take a look at the fix.

Actions #6

Updated by Laura Flores 4 months ago

Snippet from the mgr log:

/a/ksirivad-2025-11-25_16:56:48-rados:standalone-wip-ksirivad-fix-67093-distro-default-smithi/8625131/remote/smithi153/log/mgr.x.log.gz

2025-11-25T20:31:41.337+0000 7f91b6280640 20 mgr.server operator() health checks:
{
    "PG_AVAILABILITY": {
        "severity": "HEALTH_WARN",
        "summary": {
            "message": "Reduced data availability: 4 pgs stale",
            "count": 4
        },
        "detail": [
            {
                "message": "pg 1.0 is stuck stale for 2h, current state stale+active+clean, last acting [0]" 
            },
            {
                "message": "pg 1.1 is stuck stale for 2h, current state stale+active+clean, last acting [0]" 
            },
            {
                "message": "pg 1.2 is stuck stale for 2h, current state stale+active+clean, last acting [0]" 
            },
            {
                "message": "pg 1.3 is stuck stale for 2h, current state stale+active+clean, last acting [0]" 
            }
        ]
    },
    "TOO_FEW_OSDS": {
        "severity": "HEALTH_WARN",
        "summary": {
            "message": "OSD count 1 < osd_pool_default_size 3",
            "count": 2
        },
        "detail": []
    }
}

Actions #7

Updated by Kamoltat (Junior) Sirivadhna 4 months ago

Laura's snippet supports the theory that we are over-thrashing the OSDs, leaving pg 1.0 stale; that is why the query of pg 1.0 gets stuck.

Actions #8

Updated by Nitzan Mordechai 4 months ago

@Kamoltat (Junior) Sirivadhna thanks for checking that!
The stuck stale PGs warning appeared because the OSD had already been killed (as part of the test); we ultimately fail because of a timeout.
Let's look at what happened before the "stuck stale" warning appeared:

the code (before my changes):

function test_pg_scrub() {
    local dir=$1

    setup $dir || return 1
    run_mon $dir a --osd_pool_default_size=1 --mon_allow_pool_size_one=true || return 1
    run_mgr $dir x || return 1
    run_osd $dir 0 || return 1
    create_rbd_pool || return 1
    wait_for_clean || return 1
    pg_scrub 1.0 || return 1
    kill_daemons $dir KILL osd || return 1
    ! TIMEOUT=2 pg_scrub 1.0 || return 1
    teardown $dir || return 1
}

the teuthology log shows:

2025-11-25T17:45:31.977 INFO:tasks.workunit.client.0.smithi153.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:1950: test_pg_scrub:  kill_daemons td/ceph-helpers KILL osd
2025-11-25T17:45:31.977 INFO:tasks.workunit.client.0.smithi153.stderr://home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:336: kill_daemons:  shopt -q -o xtrace
2025-11-25T17:45:31.977 INFO:tasks.workunit.client.0.smithi153.stderr://home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:336: kill_daemons:  echo true
2025-11-25T17:45:31.977 INFO:tasks.workunit.client.0.smithi153.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:336: kill_daemons:  local trace=true
2025-11-25T17:45:31.977 INFO:tasks.workunit.client.0.smithi153.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:337: kill_daemons:  true
2025-11-25T17:45:31.977 INFO:tasks.workunit.client.0.smithi153.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:337: kill_daemons:  shopt -u -o xtrace
2025-11-25T17:45:32.081 INFO:tasks.workunit.client.0.smithi153.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:353: kill_daemons:  return 0
2025-11-25T17:45:32.081 INFO:tasks.workunit.client.0.smithi153.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:1951: test_pg_scrub:  TIMEOUT=2
2025-11-25T17:45:32.082 INFO:tasks.workunit.client.0.smithi153.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:1951: test_pg_scrub:  pg_scrub 1.0
2025-11-25T17:45:32.082 INFO:tasks.workunit.client.0.smithi153.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:1923: pg_scrub:  local pgid=1.0
2025-11-25T17:45:32.082 INFO:tasks.workunit.client.0.smithi153.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:1925: pg_scrub:  wait_for_pg_clean 1.0
2025-11-25T17:45:32.082 INFO:tasks.workunit.client.0.smithi153.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:1702: wait_for_pg_clean:  local pg_id=1.0
2025-11-25T17:45:32.082 INFO:tasks.workunit.client.0.smithi153.stderr://home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:1703: wait_for_pg_clean:  get_timeout_delays 90 1 3
2025-11-25T17:45:32.082 INFO:tasks.workunit.client.0.smithi153.stderr:///home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:1593: get_timeout_delays:  shopt -q -o xtrace
2025-11-25T17:45:32.082 INFO:tasks.workunit.client.0.smithi153.stderr:///home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:1593: get_timeout_delays:  echo true
2025-11-25T17:45:32.082 INFO:tasks.workunit.client.0.smithi153.stderr://home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:1593: get_timeout_delays:  local trace=true
2025-11-25T17:45:32.082 INFO:tasks.workunit.client.0.smithi153.stderr://home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:1594: get_timeout_delays:  true
2025-11-25T17:45:32.082 INFO:tasks.workunit.client.0.smithi153.stderr://home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:1594: get_timeout_delays:  shopt -u -o xtrace
2025-11-25T17:45:32.289 INFO:tasks.workunit.client.0.smithi153.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:1703: wait_for_pg_clean:  delays=('1' '2' '3' '3' '3' '3' '3' '3' '3' '3' '3' '3' '3' '3' '3' '3' '3' '3' '3' '3' '3' '3' '3' '3' '3' '3' '3' '3' '3' '3' '3')
2025-11-25T17:45:32.290 INFO:tasks.workunit.client.0.smithi153.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:1703: wait_for_pg_clean:  local -a delays
2025-11-25T17:45:32.290 INFO:tasks.workunit.client.0.smithi153.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:1704: wait_for_pg_clean:  local -i loop=0
2025-11-25T17:45:32.290 INFO:tasks.workunit.client.0.smithi153.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:1706: wait_for_pg_clean:  flush_pg_stats

we killed the osd at 17:45:32 and started "pg_scrub 1.0" with TIMEOUT=2 at 17:45:32.082, which called wait_for_pg_clean with the following delays set:
delays=('1' '2' '3' '3' '3' '3' '3' '3' '3' '3' '3' '3' '3' '3' '3' '3' '3' '3' '3' '3' '3' '3' '3' '3' '3' '3' '3' '3' '3' '3' '3')
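For context, the delays arrays seen throughout these logs come from get_timeout_delays. The sketch below is a hypothetical, integer-only re-implementation of its apparent semantics (the real helper lives in qa/standalone/ceph-helpers.sh and also supports fractional steps such as the `get_timeout_delays 90 .1` call later in this thread): emit doubling sleep intervals, capped at a maximum step, clipped so they sum to exactly the timeout.

```shell
# Hypothetical, integer-only sketch of get_timeout_delays; the real helper is
# in qa/standalone/ceph-helpers.sh. Emits doubling sleep intervals, capped at
# $max_step, with the last one clipped so the intervals sum to $timeout.
get_timeout_delays() {
    local timeout=$1
    local step=${2:-1}
    local max_step=${3:-$(( timeout / 6 ))}   # assumed default cap
    local total=0 d
    local -a delays=()
    while (( total < timeout )); do
        d=$step
        if (( total + d > timeout )); then
            d=$(( timeout - total ))          # clip the final delay
        fi
        delays+=("$d")
        total=$(( total + d ))
        step=$(( step * 2 ))
        if (( step > max_step )); then
            step=$max_step
        fi
    done
    echo "${delays[@]}"
}

get_timeout_delays 10 1 3    # → 1 2 3 3 1
```

With the arguments from the log above, `get_timeout_delays 90 1 3` reproduces the 31-entry array quoted there: 1 + 2 + 29×3 = 90.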

wait_for_pg_clean then called flush_pg_stats and the log shows:

2025-11-25T17:45:32.767 INFO:tasks.workunit.client.0.smithi153.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:2257: flush_pg_stats:  for osd in $ids
2025-11-25T17:45:32.767 INFO:tasks.workunit.client.0.smithi153.stderr://home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:2258: flush_pg_stats:  ceph tell osd.0 flush_pg_stats
2025-11-25T18:00:18.752 INFO:tasks.workunit.client.0.smithi153.stderr:Error ENXIO: problem getting command descriptions from osd.0

So we waited 15 minutes before we timed out. That is why my fix adds a timeout to the `ceph tell` line.
Adding WAIT_FOR_CLEAN_TIMEOUT will control the delays set in wait_for_clean after killing the OSD, so we won't time out.
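As a side note on the exit status in these logs: `got remote process result: 124` is the exit code GNU coreutils `timeout` uses when it has to kill the wrapped command, the same wrapper the fix applies to the `ceph tell` line. It can be demonstrated without a cluster:

```shell
# GNU coreutils `timeout` exits with status 124 when the command had to be
# killed before finishing; this is the "remote process result: 124" seen in
# the teuthology logs above.
rc=0
timeout 1 sleep 10 || rc=$?
echo "exit status: $rc"    # prints: exit status: 124
```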

The mgr then starts to show the stuck stale PGs warning:

2025-11-25T18:01:19.554+0000 7f91b6280640 20 mgr.server operator() health checks:
{
    "PG_AVAILABILITY": {
        "severity": "HEALTH_WARN",
        "summary": {
            "message": "Reduced data availability: 4 pgs stale",
            "count": 4
        },
        "detail": [
            {
                "message": "pg 1.0 is stuck stale for 60s, current state stale+active+clean, last acting [0]" 
            },
            {
                "message": "pg 1.1 is stuck stale for 60s, current state stale+active+clean, last acting [0]" 
            },
            {
                "message": "pg 1.2 is stuck stale for 60s, current state stale+active+clean, last acting [0]" 
            },
            {
                "message": "pg 1.3 is stuck stale for 60s, current state stale+active+clean, last acting [0]" 
            }
        ]
    },
    "TOO_FEW_OSDS": {
        "severity": "HEALTH_WARN",
        "summary": {
            "message": "OSD count 1 < osd_pool_default_size 3",
            "count": 2
        },
        "detail": []
    }
}

which happened after the flush_pg_stats timeout.
Please let me know if you agree with this analysis.

Actions #9

Updated by Nitzan Mordechai 4 months ago

  • Backport set to tentacle, squid
Actions #10

Updated by Radoslaw Zarzynski 3 months ago

In QA.

Actions #11

Updated by Sridhar Seshasayee 3 months ago

The changes in the associated PR are causing failures in multiple existing standalone scripts, shown below:

/a/skanta-2025-12-03_02:50:04-rados-wip-bharath5-testing-2025-12-02-1511-distro-default-smithi/
[8638396, 8638388, 8638400, 8638405, 8638408]

In qa/standalone/misc/ok-to-stop.sh -> TEST_0_osd(), the test brings down osd.0 and calls wait_for_peered(),
which in turn calls flush_pg_stats(). Since flush_pg_stats() returns 1 for osd.0 (because it is down), wait_for_peered()
also fails and causes the test to fail. In this case, either flush_pg_stats() should not return 1 (as before), or
wait_for_peered() should handle the error status from flush_pg_stats(). I think the former, with an additional
check on the return status (ENXIO), is the right approach, since flush_pg_stats() can be called on downed OSDs.
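The suggested behaviour (tolerate ENXIO from downed OSDs instead of propagating the failure) could look like the sketch below. This is a hypothetical illustration using a mock `tell_osd` stand-in, not the actual ceph-helpers.sh code, which calls `ceph tell osd.$id flush_pg_stats`:

```shell
# Mock stand-in for `ceph tell osd.$id flush_pg_stats`: osd.0 plays dead and
# fails the way a downed OSD does ("Error ENXIO: problem getting command
# descriptions"). This is purely illustrative.
tell_osd() {
    if [ "$1" -eq 0 ]; then
        echo "Error ENXIO: problem getting command descriptions from osd.$1" >&2
        return 6    # ENXIO
    fi
    echo "flushed osd.$1"
}

# Hypothetical tolerant variant: skip OSDs that are down rather than
# returning 1 to the caller (the behaviour that broke wait_for_peered).
flush_pg_stats_tolerant() {
    local osd
    for osd in "$@"; do
        if ! tell_osd "$osd"; then
            continue    # a downed OSD is expected after kill_daemons
        fi
    done
    return 0
}

flush_pg_stats_tolerant 0 1 2 3    # succeeds despite osd.0 being "down"
```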

Failure Logs:

2025-12-03T06:38:14.395 INFO:tasks.workunit.client.0.smithi202.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/misc/ok-to-stop.sh:285: TEST_0_osd:  kill_daemons td/ok-to-stop TERM osd.0
2025-12-03T06:38:14.396 INFO:tasks.workunit.client.0.smithi202.stderr://home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:336: kill_daemons:  shopt -q -o xtrace
2025-12-03T06:38:14.396 INFO:tasks.workunit.client.0.smithi202.stderr://home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:336: kill_daemons:  echo true
2025-12-03T06:38:14.397 INFO:tasks.workunit.client.0.smithi202.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:336: kill_daemons:  local trace=true
2025-12-03T06:38:14.397 INFO:tasks.workunit.client.0.smithi202.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:337: kill_daemons:  true
2025-12-03T06:38:14.397 INFO:tasks.workunit.client.0.smithi202.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:337: kill_daemons:  shopt -u -o xtrace
2025-12-03T06:38:14.705 INFO:tasks.workunit.client.0.smithi202.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:353: kill_daemons:  return 0
2025-12-03T06:38:14.705 INFO:tasks.workunit.client.0.smithi202.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/misc/ok-to-stop.sh:286: TEST_0_osd:  ceph osd down 0
2025-12-03T06:38:15.497 INFO:tasks.workunit.client.0.smithi202.stderr:osd.0 is already down.
2025-12-03T06:38:15.516 INFO:tasks.workunit.client.0.smithi202.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/misc/ok-to-stop.sh:287: TEST_0_osd:  wait_for_peered
2025-12-03T06:38:15.516 INFO:tasks.workunit.client.0.smithi202.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:1731: wait_for_peered:  local cmd=
2025-12-03T06:38:15.516 INFO:tasks.workunit.client.0.smithi202.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:1732: wait_for_peered:  local num_peered=-1
2025-12-03T06:38:15.516 INFO:tasks.workunit.client.0.smithi202.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:1733: wait_for_peered:  local cur_peered
2025-12-03T06:38:15.517 INFO:tasks.workunit.client.0.smithi202.stderr://home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:1734: wait_for_peered:  get_timeout_delays 90 .1
2025-12-03T06:38:15.517 INFO:tasks.workunit.client.0.smithi202.stderr:///home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:1593: get_timeout_delays:  shopt -q -o xtrace
2025-12-03T06:38:15.517 INFO:tasks.workunit.client.0.smithi202.stderr:///home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:1593: get_timeout_delays:  echo true
2025-12-03T06:38:15.517 INFO:tasks.workunit.client.0.smithi202.stderr://home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:1593: get_timeout_delays:  local trace=true
2025-12-03T06:38:15.517 INFO:tasks.workunit.client.0.smithi202.stderr://home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:1594: get_timeout_delays:  true
2025-12-03T06:38:15.517 INFO:tasks.workunit.client.0.smithi202.stderr://home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:1594: get_timeout_delays:  shopt -u -o xtrace
2025-12-03T06:38:15.611 INFO:tasks.workunit.client.0.smithi202.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:1734: wait_for_peered:  delays=('0.1' '0.2' '0.4' '0.8' '1.6' '3.2' '6.4' '12.8' '15' '15' '15' '15' '4.5')
2025-12-03T06:38:15.611 INFO:tasks.workunit.client.0.smithi202.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:1734: wait_for_peered:  local -a delays
2025-12-03T06:38:15.611 INFO:tasks.workunit.client.0.smithi202.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:1735: wait_for_peered:  local -i loop=0
2025-12-03T06:38:15.611 INFO:tasks.workunit.client.0.smithi202.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:1737: wait_for_peered:  flush_pg_stats
2025-12-03T06:38:15.611 INFO:tasks.workunit.client.0.smithi202.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:2253: flush_pg_stats:  local timeout=300
2025-12-03T06:38:15.611 INFO:tasks.workunit.client.0.smithi202.stderr://home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:2255: flush_pg_stats:  ceph osd ls
2025-12-03T06:38:16.193 INFO:tasks.workunit.client.0.smithi202.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:2255: flush_pg_stats:  ids='0
2025-12-03T06:38:16.193 INFO:tasks.workunit.client.0.smithi202.stderr:1
2025-12-03T06:38:16.194 INFO:tasks.workunit.client.0.smithi202.stderr:2
2025-12-03T06:38:16.194 INFO:tasks.workunit.client.0.smithi202.stderr:3'
2025-12-03T06:38:16.194 INFO:tasks.workunit.client.0.smithi202.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:2256: flush_pg_stats:  seqs=
2025-12-03T06:38:16.194 INFO:tasks.workunit.client.0.smithi202.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:2257: flush_pg_stats:  for osd in $ids
2025-12-03T06:38:16.194 INFO:tasks.workunit.client.0.smithi202.stderr://home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:2258: flush_pg_stats:  timeout 300 ceph tell osd.0 flush_pg_stats
2025-12-03T06:38:16.314 INFO:tasks.workunit.client.0.smithi202.stderr:Error ENXIO: problem getting command descriptions from osd.0
2025-12-03T06:38:16.317 INFO:tasks.workunit.client.0.smithi202.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:2258: flush_pg_stats:  seq=
2025-12-03T06:38:16.317 INFO:tasks.workunit.client.0.smithi202.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:2258: flush_pg_stats:  return 1
2025-12-03T06:38:16.317 INFO:tasks.workunit.client.0.smithi202.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:1737: wait_for_peered:  return 1
2025-12-03T06:38:16.318 INFO:tasks.workunit.client.0.smithi202.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/misc/ok-to-stop.sh:287: TEST_0_osd:  return 1
2025-12-03T06:38:16.318 INFO:tasks.workunit.client.0.smithi202.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/misc/ok-to-stop.sh:21: run:  return 1
2025-12-03T06:38:16.318 INFO:tasks.workunit.client.0.smithi202.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:2402: main:  code=1
2025-12-03T06:38:16.318 INFO:tasks.workunit.client.0.smithi202.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:2404: main:  teardown td/ok-to-stop 1
2025-12-03T06:38:16.318 INFO:tasks.workunit.client.0.smithi202.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:155: teardown:  local dir=td/ok-to-stop

Actions #12

Updated by Radoslaw Zarzynski 3 months ago

scrub note: blocking the PR.

Actions #14

Updated by Radoslaw Zarzynski 2 months ago

Re-reviewed.

Actions #15

Updated by Laura Flores 2 months ago

PR can be merged if tests pass @Nitzan Mordechai

Actions #16

Updated by Nitzan Mordechai 2 months ago

  • Status changed from Fix Under Review to Pending Backport
Actions #17

Updated by Upkeep Bot 2 months ago

  • Merge Commit set to bd20e52da837da33f114112778884683b85630e0
  • Fixed In set to v20.3.0-4787-gbd20e52da8
  • Upkeep Timestamp set to 2026-01-11T12:42:12+00:00
Actions #18

Updated by Upkeep Bot 2 months ago

  • Copied to Backport #74374: squid: qa/standalone/ceph-helpers: ceph pg query hangs indefinitely added
Actions #19

Updated by Upkeep Bot 2 months ago

  • Copied to Backport #74375: tentacle: qa/standalone/ceph-helpers: ceph pg query hangs indefinitely added
Actions #20

Updated by Upkeep Bot 2 months ago

  • Tags (freeform) set to backport_processed
Actions #21

Updated by Laura Flores about 2 months ago

/a/lflores-2026-01-21_20:56:39-rados-main-distro-default-trial/11945

Actions #22

Updated by Laura Flores about 2 months ago

/a/lflores-2026-01-23_19:07:45-rados-wip-rocky10-branch-of-the-day-2026-01-23-1769128778-distro-default-trial/15368

Actions #23

Updated by Sridhar Seshasayee about 2 months ago

/a/skanta-2026-01-27_05:35:03-rados-wip-bharath1-testing-2026-01-26-1242-distro-default-trial/19764

Actions #24

Updated by Aishwarya Mathuria about 2 months ago

/a/skanta-2026-01-30_23:46:16-rados-wip-bharath7-testing-2026-01-29-2016-distro-default-trial/28572

Actions #25

Updated by Laura Flores about 1 month ago

/a/yuriw-2026-02-03_16:00:06-rados-wip-yuri4-testing-2026-02-02-2122-distro-default-trial/31692

Actions #26

Updated by Kamoltat (Junior) Sirivadhna about 1 month ago

Watcher: note that the problem has occurred in main again, even after the fix was merged.

Actions #27

Updated by Aishwarya Mathuria about 1 month ago

/a/skanta-2026-02-07_00:02:26-rados-wip-bharath7-testing-2026-02-06-0906-distro-default-trial/39117

Actions #28

Updated by Lee Sanders about 1 month ago

/a/skanta-2026-01-29_02:19:11-rados-wip-bharath5-testing-2026-01-28-2018-distro-default-trial/24718

Actions #29

Updated by Lee Sanders about 1 month ago

/a/skanta-2026-01-29_13:05:02-rados-wip-bharath5-testing-2026-01-28-2018-distro-default-trial/25718

Actions #30

Updated by Aishwarya Mathuria about 1 month ago

/a/skanta-2026-02-05_03:38:32-rados-wip-bharath2-testing-2026-02-03-0542-distro-default-trial/35654

Actions #31

Updated by Nitzan Mordechai about 1 month ago

/a/yaarit-2026-02-10_23:48:52-rados-wip-rocky10-branch-of-the-day-2026-02-09-1770676549-distro-default-trial/44401

Actions #32

Updated by Sridhar Seshasayee 26 days ago

/a/skanta-2026-02-21_10:47:12-rados-wip-bharath21-testing-2026-02-20-1039-distro-default-trial/62535

The issue is hit again in the same test (test_pg_scrub):

2026-02-21T11:42:16.256 INFO:tasks.workunit.client.0.trial189.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:1980: test_pg_scrub:  WAIT_FOR_CLEAN_TIMEOUT=10
2026-02-21T11:42:16.256 INFO:tasks.workunit.client.0.trial189.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:1980: test_pg_scrub:  TIMEOUT=2
2026-02-21T11:42:16.256 INFO:tasks.workunit.client.0.trial189.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:1980: test_pg_scrub:  pg_scrub 1.0
2026-02-21T11:42:16.256 INFO:tasks.workunit.client.0.trial189.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:1952: pg_scrub:  local pgid=1.0
2026-02-21T11:42:16.256 INFO:tasks.workunit.client.0.trial189.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:1954: pg_scrub:  wait_for_pg_clean 1.0
2026-02-21T11:42:16.256 INFO:tasks.workunit.client.0.trial189.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:1702: wait_for_pg_clean:  local pg_id=1.0
2026-02-21T11:42:16.256 INFO:tasks.workunit.client.0.trial189.stderr://home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:1703: wait_for_pg_clean:  get_timeout_delays 10 1 3
2026-02-21T11:42:16.257 INFO:tasks.workunit.client.0.trial189.stderr:///home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:1593: get_timeout_delays:  shopt -q -o xtrace
2026-02-21T11:42:16.257 INFO:tasks.workunit.client.0.trial189.stderr:///home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:1593: get_timeout_delays:  echo true
2026-02-21T11:42:16.257 INFO:tasks.workunit.client.0.trial189.stderr://home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:1593: get_timeout_delays:  local trace=true
2026-02-21T11:42:16.257 INFO:tasks.workunit.client.0.trial189.stderr://home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:1594: get_timeout_delays:  true
2026-02-21T11:42:16.257 INFO:tasks.workunit.client.0.trial189.stderr://home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:1594: get_timeout_delays:  shopt -u -o xtrace
2026-02-21T11:42:16.290 INFO:tasks.workunit.client.0.trial189.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:1703: wait_for_pg_clean:  delays=('1' '2' '3' '3' '1')
2026-02-21T11:42:16.290 INFO:tasks.workunit.client.0.trial189.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:1703: wait_for_pg_clean:  local -a delays
2026-02-21T11:42:16.290 INFO:tasks.workunit.client.0.trial189.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:1704: wait_for_pg_clean:  local -i loop=0
2026-02-21T11:42:16.290 INFO:tasks.workunit.client.0.trial189.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:1706: wait_for_pg_clean:  flush_pg_stats
2026-02-21T11:42:16.290 INFO:tasks.workunit.client.0.trial189.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:2282: flush_pg_stats:  local timeout=2
2026-02-21T11:42:16.290 INFO:tasks.workunit.client.0.trial189.stderr://home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:2284: flush_pg_stats:  ceph osd ls
2026-02-21T11:42:16.519 INFO:tasks.workunit.client.0.trial189.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:2284: flush_pg_stats:  ids=0
2026-02-21T11:42:16.519 INFO:tasks.workunit.client.0.trial189.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:2285: flush_pg_stats:  seqs=
2026-02-21T11:42:16.519 INFO:tasks.workunit.client.0.trial189.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:2286: flush_pg_stats:  for osd in $ids
2026-02-21T11:42:16.520 INFO:tasks.workunit.client.0.trial189.stderr://home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:2287: flush_pg_stats:  timeout 2 ceph tell osd.0 flush_pg_stats
2026-02-21T11:42:18.524 INFO:tasks.workunit.client.0.trial189.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:2287: flush_pg_stats:  seq=
2026-02-21T11:42:18.524 INFO:tasks.workunit.client.0.trial189.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:2287: flush_pg_stats:  true
2026-02-21T11:42:18.524 INFO:tasks.workunit.client.0.trial189.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:2288: flush_pg_stats:  test -z ''
2026-02-21T11:42:18.524 INFO:tasks.workunit.client.0.trial189.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:2289: flush_pg_stats:  continue
2026-02-21T11:42:18.524 INFO:tasks.workunit.client.0.trial189.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:1708: wait_for_pg_clean:  true
2026-02-21T11:42:18.524 INFO:tasks.workunit.client.0.trial189.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:1709: wait_for_pg_clean:  echo '#---------- 1.0 loop 0'
2026-02-21T11:42:18.524 INFO:tasks.workunit.client.0.trial189.stdout:#---------- 1.0 loop 0
2026-02-21T11:42:18.525 INFO:tasks.workunit.client.0.trial189.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:1710: wait_for_pg_clean:  is_pg_clean 1.0
2026-02-21T11:42:18.525 INFO:tasks.workunit.client.0.trial189.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:1572: is_pg_clean:  local pgid=1.0
2026-02-21T11:42:18.525 INFO:tasks.workunit.client.0.trial189.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:1573: is_pg_clean:  local pg_state
2026-02-21T11:42:18.525 INFO:tasks.workunit.client.0.trial189.stderr://home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:1574: is_pg_clean:  ceph pg 1.0 query
2026-02-21T11:42:18.525 INFO:tasks.workunit.client.0.trial189.stderr://home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:1574: is_pg_clean:  jq -r '.state '
2026-02-21T14:39:24.182 DEBUG:teuthology.orchestra.run:got remote process result: 124

Actions #33

Updated by Nitzan Mordechai 12 days ago

  • Related to Bug #75406: qa/standalone/ceph-helpers: is_pg_clean hang after osd teardown added
Actions #34

Updated by Nitzan Mordechai 12 days ago

I opened a new tracker issue for the new problem: https://tracker.ceph.com/issues/75406
