Bug #72778


crimson-debug: Health check failure - SLOW_OPS (osd.0 blocked for 32s)

Added by Kautilya Tripathi 7 months ago. Updated 4 months ago.

Status:
Resolved
Priority:
High
Category:
Performance
Target version:
-
% Done:

0%

Source:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Tags (freeform):
Fixed In:
v20.3.0-4255-gd1eb244099
Released In:
Upkeep Timestamp:
2025-11-19T07:23:06+00:00

Description

https://pulpito.ceph.com/knrt-2025-08-26_11:25:36-crimson-rados-main-distro-crimson-debug-smithi/8463629/

failure_reason: '"2025-08-26T13:05:11.032708+0000 mon.a (mon.0) 1538 : cluster [WRN]
  Health check failed: 1 slow ops, oldest one blocked for 32 sec, osd.0 has slow ops
  (SLOW_OPS)" in cluster log'
flavor: crimson-debug

Result: Tests passed, but slow ops reported.

Actions #1

Updated by Matan Breizman 7 months ago

Thanks for reporting this tracker!
I would recommend providing as much context as possible:
1) What was the slow op that is reported on osd.0?
The attached logs only show that we had one, without additional information.
For the actual information we would need to look at the osd log of this job and grep for `slow request`, `slow_ops`, and so on.
2) In which job did the error occur (e.g., rbd_api_tests)?
3) If relevant, which backend was used for the reported job? This could further help with debugging the issue.

See for example an older slow op report we had: https://tracker.ceph.com/issues/69612
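
The grep step in (1) can be sketched in Python against a hypothetical log excerpt (real OSD logs live in the teuthology job's archive; the sample lines below are illustrative):

```python
import re

# Hypothetical OSD log excerpt; in practice, read the osd.*.log file
# from the job archive instead.
sample_log = """\
DEBUG 2025-08-26 13:05:28,100 [shard 0:main] seastore - SeaStore::stat: ...
WARN  2025-08-26 13:05:28,100 [shard 0:main] osd - OSD::get_health_metrics: slow request m=[osd_op(...)] initiated 2025-08-26T13:04:37.130137+0000
ERROR 2025-08-26 13:05:28,100 [shard 0:main] osd - OSD::get_health_metrics: get_health_metrics reporting 1 slow ops, oldest is m=[osd_op(...)]
"""

# Catches "slow request", "slow_ops", and "slow ops" variants.
pattern = re.compile(r"slow[ _]?(request|ops)", re.IGNORECASE)

hits = [line for line in sample_log.splitlines() if pattern.search(line)]
for line in hits:
    print(line)
```

The WARN line is the interesting one: it carries the offending osd_op and its "initiated" timestamp.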

Actions #3

Updated by Matan Breizman 6 months ago

  • Tags changed from crimson to crimson, frequent-failure
  • Priority changed from Normal to High
Actions #4

Updated by Jose J Palacios Perez 6 months ago

"2025-09-17T22:10:41.129589+0000 osd.2 (osd.2) 1 : cluster [WRN] slow requests (most affected pool [ 'test-librbd-smithi114-35807-49' : 1 ])" in cluster log

Executed as:

teuthology-suite -m smithi -s crimson-rados -p 101 --force-priority -c wip-perezjos-crimson-only-17-09-2025-PR65166  -f crimson-debug --filter seastore --distro centos --distro-version 9 --suite-branch main --suite-repo https://github.com/ceph/ceph.git #PR 65166
Actions #6

Updated by Matan Breizman 6 months ago

  • Assignee set to Shraddha Agrawal

Hey Shraddha, would you like to take a look at this one?
Please feel free to ping me or Aishwarya if you have any questions.
Thanks! :)

Actions #7

Updated by Shraddha Agrawal 6 months ago

Hey Matan, thanks for assigning this tracker to me. I'll investigate and get back with my findings!

Actions #8

Updated by Shraddha Agrawal 5 months ago

I see mon.a reported SLOW_OPS:

2025-08-26T13:05:18.908+0000 7f43849f1640 20 mon.a@0(leader).mgrstat health checks:
{
    "SLOW_OPS": {
        "severity": "HEALTH_WARN",
        "summary": {
            "message": "1 slow ops, oldest one blocked for 37 sec, osd.0 has slow ops",
            "count": 1
        },
        "detail": []
    }
}

In osd.0 logs I see:

DEBUG 2025-08-26 13:05:27,456 [shard 0:main] ms - [0x7c7ab5c7d1c0 osd.0(cluster) v2:172.21.15.50:6804/668993626 >> osd.1 v2:172.21.15.50:6800/3018203893@62110] GOT AckFrame: seq=53987
DEBUG 2025-08-26 13:05:28,100 [shard 0:main] seastore - SeaStore::stat: ...
DEBUG 2025-08-26 13:05:28,100 [shard 0:main] seastore - SeaStoreS::stat: stat=store_statfs(0x63be49000/0x0/0x63c000000, data 0x1b7000/0x1b7000, compress 0x0/0x0/0x0, omap 0x0, meta 0x0)
WARN  2025-08-26 13:05:28,100 [shard 0:main] osd - OSD::get_health_metrics: slow request m=[osd_op(client.4462.0:1 51.1a 51:5bdd5772:::image1.rbd:head [stat] snapc 0=[] ondisk+read+known_if_redirected+supports_pool_eio e1209)] initiated 2025-08-26T13:04:37.130137+0000
ERROR 2025-08-26 13:05:28,100 [shard 0:main] osd - OSD::get_health_metrics: get_health_metrics reporting 1 slow ops, oldest is m=[osd_op(client.4462.0:1 51.1a 51:5bdd5772:::image1.rbd:head [stat] snapc 0=[] ondisk+read+known_if_redirected+supports_pool_eio e1209)]
WARN  2025-08-26 13:05:28,100 [shard 0:main] osd - OSD::get_health_metrics: slow requests (most affected pool [ 'test-librbd-smithi050-36017-49' : 1 ])
WARN  2025-08-26 13:05:28,100 [shard 0:main] monc - log [WRN] : slow requests (most affected pool [ 'test-librbd-smithi050-36017-49' : 1 ])
DEBUG 2025-08-26 13:05:28,101 [shard 1:main] seastore - SeaStoreS::stat: stat=store_statfs(0x63bf02000/0x0/0x63c000000, data 0xfe000/0xfe000, compress 0x0/0x0/0x0, omap 0x0, meta 0x0)
DEBUG 2025-08-26 13:05:28,101 [shard 0:main] seastore - SeaStore::stat: done, stat=store_statfs(0xc77d4b000/0x0/0xc78000000, data 0x2b5000/0x2b5000, compress 0x0/0x0/0x0, omap 0x0, meta 0x0)

At first glance, this looks similar to https://tracker.ceph.com/issues/69612. I am still trying to determine whether that is indeed the case.
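
As a quick sanity check, the blocked duration at the moment of that WARN line can be computed from its two timestamps; a minimal Python sketch (osd_op detail elided for brevity):

```python
from datetime import datetime, timezone

# The WARN line from the osd.0 log above, with the osd_op detail elided.
warn_line = ("WARN  2025-08-26 13:05:28,100 [shard 0:main] osd - "
             "OSD::get_health_metrics: slow request m=[osd_op(...)] "
             "initiated 2025-08-26T13:04:37.130137+0000")

# The log line's own timestamp (crimson logs are in UTC).
logged = datetime.strptime(
    warn_line[6:29], "%Y-%m-%d %H:%M:%S,%f").replace(tzinfo=timezone.utc)

# The "initiated" field is an ISO-8601 timestamp with a UTC offset.
initiated = datetime.strptime(
    warn_line.split("initiated ")[1], "%Y-%m-%dT%H:%M:%S.%f%z")

blocked = (logged - initiated).total_seconds()
print(f"op had been blocked for ~{blocked:.0f}s when the warning fired")
```

This gives roughly 51 seconds at 13:05:28, consistent with the "blocked for 37 sec" figure mon.a reported ten seconds earlier.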

Actions #9

Updated by Shraddha Agrawal 5 months ago

  • Status changed from New to In Progress
Actions #11

Updated by Shraddha Agrawal 4 months ago · Edited

Here is the summary of all the above instances:

No. 1
Job: crimson-rados/rbd/{clusters/fixed-1 crimson-supported-all-distro/centos_latest crimson_qa_overrides deploy/ceph objectstore/seastore/seastore-segmented tasks/rbd_api_tests}
Slow op time: 24s (started 13:05:08, ended 13:05:32)
Object store: seastore
Operation: image1.rbd:head

No. 2
Job: crimson-rados/rbd/{clusters/fixed-1 crimson-supported-all-distro/centos_latest crimson_qa_overrides deploy/ceph objectstore/seastore/seastore-segmented tasks/rbd_api_tests}
Slow op time: 4m 48s (started 23:03:05, ended 23:07:53)
Object store: seastore
Operation: undecoded

No. 3
Job: crimson-rados/rbd/{clusters/fixed-1 crimson-supported-all-distro/centos_latest crimson_qa_overrides deploy/ceph objectstore/seastore/seastore-segmented tasks/rbd_api_tests}
Slow op time: 22s (started 22:10:41, ended 22:11:03)
Object store: seastore
Operation: undecoded

No. 4
Job: crimson-rados/singleton/{all/osd-pg-splitting crimson-supported-all-distro/centos_latest crimson_qa_overrides objectstore/bluestore rados}
Slow op time: 1m 27s (started 06:20:09, ended 06:21:36)
Object store: bluestore
Operation: benchmark_data_smithi032_39257_object10731

No. 5
Job: crimson-rados/perf/{clusters/{fixed-2} crimson-supported-all-distro/centos_latest crimson_qa_overrides deploy/ceph objectstore/seastore/seastore-segmented settings/optimized workloads/radosbench_4K_rand_read}
Slow op time: 2s (started 5:37:58, ended 5:38:00)
Object store: seastore
Operation: benchmark_data_smithi083_37953_object9185

No. 6
Job: crimson-rados/rbd/{clusters/crimson-fixed-1 crimson-supported-all-distro/centos_latest crimson_qa_overrides deploy/ceph objectstore/seastore/segmented$/{crimson_seastore_segmented_2q}
Slow op time: 1m 9s (started 17:32:36, ended 17:33:45)
Object store: seastore
Operation: undecoded
Actions #12

Updated by Matan Breizman 4 months ago · Edited

Shraddha Agrawal wrote in #note-11:

Here is the summary of all the above instances:
No. 6
Job: crimson-rados/rbd/{clusters/crimson-fixed-1 crimson-supported-all-distro/centos_latest crimson_qa_overrides deploy/ceph objectstore/seastore/segmented$/{crimson_seastore_segmented_2q}
Slow op time: 1m 9s (started 17:32:36, ended 17:33:45)
Object store: seastore
Operation: undecoded

Based on the summary above, we can see that the slow_op reports are quite random (object store, workload, and so on). I've tracked down the last report below:


Looking at:
https://pulpito.ceph.com/matan-2025-11-09_14:54:31-crimson-rados-main-distro-crimson-debug-smithi/8590288/
With:

"2025-11-09T17:32:43.041212+0000 mon.a (mon.0) 1538 : cluster [WRN] Health check failed: 1 slow ops, oldest one blocked for 34 sec, osd.1 has slow ops (SLOW_OPS)" in cluster log

From mon.0 log:

            "message": "1 slow ops, oldest one blocked for 34 sec, osd.1 has slow ops",
            ...
            "message": "1 slow ops, oldest one blocked for 94 sec, osd.1 has slow ops",
            "message": "1 slow ops, oldest one blocked for 94 sec, osd.1 has slow ops",

While the message reported says "for 34 sec", the operation was actually blocked for longer than that; due to the way we scan the log for warnings, the first matching message is the one that gets reported.
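
That scanning behavior can be sketched as follows (an assumption about the log scanner, not its actual implementation: it flags the first matching cluster-log line):

```python
# Two of the SLOW_OPS messages from the mon.0 log above, in order.
cluster_log = [
    '[WRN] Health check failed: 1 slow ops, oldest one blocked for 34 sec, osd.1 has slow ops (SLOW_OPS)',
    '[WRN] Health check update: 1 slow ops, oldest one blocked for 94 sec, osd.1 has slow ops (SLOW_OPS)',
]

# The scanner stops at the first match, so the "34 sec" line is what
# ends up in the failure_reason, even though the op stayed blocked longer.
first_hit = next(line for line in cluster_log if "SLOW_OPS" in line)
print(first_hit)
```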

From osd.1 log:

DEBUG 2025-11-09 17:33:40,744 [shard 0:main] seastore - SeaStore::stat: ...
DEBUG 2025-11-09 17:33:40,744 [shard 0:main] seastore - SeaStoreS::stat: stat=store_statfs(0x63a461000/0x0/0x63c000000, data 0x1b9f000/0x1b9f000, compress 0x0/0x0/0x0, omap 0x0, meta 0x0)
WARN  2025-11-09 17:33:40,744 [shard 0:main] osd - OSD::get_health_metrics: slow request m=[osd_op(client.4429.0:2 51.17 51.ba46737 (undecoded) ondisk+read+known_if_redirected+supports_pool_eio e1222)] initiated 2025-11-09T17:32:05.773478+0000
ERROR 2025-11-09 17:33:40,745 [shard 0:main] osd - OSD::get_health_metrics: get_health_metrics reporting 1 slow ops, oldest is m=[osd_op(client.4429.0:2 51.17 51.ba46737 (undecoded) ondisk+read+known_if_redirected+supports_pool_eio e1222)]
WARN  2025-11-09 17:33:40,745 [shard 0:main] osd - OSD::get_health_metrics: slow requests (most affected pool [ 'test-librbd-smithi022-35747-49' : 1 ])
WARN  2025-11-09 17:33:40,745 [shard 0:main] monc - log [WRN] : slow requests (most affected pool [ 'test-librbd-smithi022-35747-49' : 1 ])

Tracking client.4429.0:2 51.17 51.ba46737:

DEBUG 2025-11-09 17:32:05,773 [shard 0:main] ms - [0x7cb2d7513840 osd.1(client) v2:172.21.15.22:6809/3486956418 >> client.4429 172.21.15.22:0/4286543537@45138] <== #1 === osd_op(client.4429.0:2 51.17 51.ba46737 (undecoded) ondisk+read+known_if_redirected+supports_pool_eio e1222) v9 (42)
DEBUG 2025-11-09 17:32:05,773 [shard 0:main] osd - client_request(id=209132, detail=m=[osd_op(client.4429.0:2 51.17 51.ba46737 (undecoded) ondisk+read+known_if_redirected+supports_pool_eio e1222)]): starting start_pg_operation
DEBUG 2025-11-09 17:32:05,773 [shard 0:main] osd - client_request(id=209132, detail=m=[osd_op(client.4429.0:2 51.17 51.ba46737 (undecoded) ondisk+read+known_if_redirected+supports_pool_eio e1222)]): start_pg_operation in await_active stage
DEBUG 2025-11-09 17:32:05,774 [shard 0:main] osd - client_request(id=209132, detail=m=[osd_op(client.4429.0:2 51.17 51.ba46737 (undecoded) ondisk+read+known_if_redirected+supports_pool_eio e1222)]): start_pg_operation active, entering await_map
DEBUG 2025-11-09 17:32:05,774 [shard 0:main] osd - client_request(id=209132, detail=m=[osd_op(client.4429.0:2 51.17 51.ba46737 (undecoded) ondisk+read+known_if_redirected+supports_pool_eio e1222)]): start_pg_operation await_map stage
DEBUG 2025-11-09 17:32:07,905 [shard 0:main] osd - client_request(id=209132, detail=m=[osd_op(client.4429.0:2 51.17 51.ba46737 (undecoded) ondisk+read+known_if_redirected+supports_pool_eio e1222)]): got map 1222, entering get_pg_mapping
DEBUG 2025-11-09 17:32:07,908 [shard 0:main] osd - client_request(id=209132, detail=m=[osd_op(client.4429.0:2 51.17 51.ba46737 (undecoded) ondisk+read+known_if_redirected+supports_pool_eio e1222)]): can_create=false, target-core=0
DEBUG 2025-11-09 17:32:07,908 [shard 0:main] osd - client_request(id=209132, detail=m=[osd_op(client.4429.0:2 51.17 51.ba46737 (undecoded) ondisk+read+known_if_redirected+supports_pool_eio e1222)]): entering create_or_wait_pg

...

Operation is blocked for more than one minute!

...

DEBUG 2025-11-09 17:33:41,611 [shard 0:main] osd - client_request(id=209132, detail=m=[osd_op(client.4429.0:2 51.17 51.ba46737 (undecoded) ondisk+read+known_if_redirected+supports_pool_eio e1222)]): have_pg

Step 2: Why was create_or_wait_pg blocked?

Following get_pg_mapping:

DEBUG 2025-11-09 17:32:07,905 [shard 0:main] osd - client_request(id=209132, detail=m=[osd_op(client.4429.0:2 51.17 51.ba46737 (undecoded) ondisk+read+known_if_redirected+supports_pool_eio e1222)]): got map 1222, entering get_pg_mapping

DEBUG 2025-11-09 17:32:07,906 [shard 0:main] osd - PGShardMapping::get_or_create_pg_mapping: calling primary to add mapping for pg 51.17 to the expected core 4294967295
DEBUG 2025-11-09 17:32:07,906 [shard 0:main] osd - PGShardMapping::get_or_create_pg_mapping: mapping pg 51.17 to core 0 (primary): num_pgs 23
DEBUG 2025-11-09 17:32:07,907 [shard 1:main] osd - PGShardMapping::get_or_create_pg_mapping: mapping pg 51.17 to core 0 (others)
DEBUG 2025-11-09 17:32:07,908 [shard 0:main] osd - PGShardMapping::get_or_create_pg_mapping: returning pg 51.17 mapping to core 0 after broadcasted

Following PG 51.17:

DEBUG 2025-11-09 17:33:41,582 [shard 0:main] osd - PGMap::set_creating: Creating 51.17
DEBUG 2025-11-09 17:33:41,583 [shard 0:main] seastore - SeaStoreS::create_new_collection: cid=51.17_head
DEBUG 2025-11-09 17:33:41,597 [shard 0:main] osd - PGAdvanceMap: pg_advance_map(id=0, detail=PGAdvanceMap(pg=51.17 from=4294967295 to=1230 do_init)): created

DEBUG 2025-11-09 17:33:41,611 [shard 0:main] osd - client_request(id=209132, detail=m=[osd_op(client.4429.0:2 51.17 51.ba46737 (undecoded) ondisk+read+known_if_redirected+supports_pool_eio e1222)]): have_pg

It seems that pg 51.17 was created only after a minute or so (at 17:33:41), which is exactly when the client_request moved to the "have_pg" stage and was unblocked.
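
The wait can be confirmed arithmetically from the two timestamps above; a quick sketch:

```python
from datetime import datetime

fmt = "%Y-%m-%d %H:%M:%S,%f"
# Timestamps copied from the osd.1 log excerpts above.
entered_wait = datetime.strptime("2025-11-09 17:32:07,908", fmt)  # entering create_or_wait_pg
have_pg      = datetime.strptime("2025-11-09 17:33:41,611", fmt)  # PG 51.17 created, op unblocked

wait = (have_pg - entered_wait).total_seconds()
print(f"client_request waited ~{wait:.0f}s for PG creation")
```

The result, roughly 94 seconds, lines up with the final "blocked for 94 sec" message in the mon.0 log.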

Conclusions:
The reported operation was blocked for more than a minute. Tracking down the cause of the block, we found that it was waiting for the PG to be created. The positive side is that everything is working as expected; there is no underlying issue with the behavior. However, we should still understand how we can avoid these kinds of reports. There are two possible approaches:
1) Increase the slow op report threshold
2) Bump the number of reactors/cores used for testing

For 1, we've seen in the report above that the op was blocked for almost 2 minutes. Increasing the threshold that high would possibly a) make us miss valuable reports and b) still not be high enough for future reports that are blocked even longer.

For 2, we currently use 2 or 3 reactors per OSD, which is low. As seen in the mapping above:

mapping pg 51.17 to core 0 (primary): num_pgs 23

Core 0 has 23 PGs mapped to it! This should explain the delay in creating the PG. Increasing the core/reactor count should distribute the PGs across cores more evenly and possibly resolve the delays we see in the suite. See how we currently set the number of reactors with "crimson cpu num:" in the crimson-rados suite.
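
The effect of adding reactors can be illustrated with a toy model (explicitly not Crimson's actual PGShardMapping placement logic), assuming a uniform round-robin spread of PGs over cores:

```python
from collections import Counter

# Toy model: spread N PGs round-robin over the available reactor cores
# and count how many PGs land on each core.
def pgs_per_core(num_pgs: int, num_cores: int) -> Counter:
    return Counter(pg % num_cores for pg in range(num_pgs))

few_cores  = pgs_per_core(46, 2)   # 23 PGs per core, as in the log above
more_cores = pgs_per_core(46, 8)   # the same PGs spread much thinner
print(few_cores)
print(more_cores)
```

With only 2 cores, every PG-creation request on a core queues behind ~23 PGs' worth of work; with 8 cores, no core hosts more than 6.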

Feel free to reach out if any of the above is unclear! By the way, my favorite tool for log investigation, used above, is "klogg".

Actions #13

Updated by Shraddha Agrawal 4 months ago

  • Status changed from In Progress to Fix Under Review
  • Pull request ID set to 66307
Actions #14

Updated by Upkeep Bot 4 months ago

  • Status changed from Fix Under Review to Resolved
  • Merge Commit set to d1eb2440991ec1337a2213c426fd7ec581817c59
  • Fixed In set to v20.3.0-4255-gd1eb244099
  • Upkeep Timestamp set to 2025-11-19T07:23:06+00:00