Bug #70010
Status: Closed
[WRN] BLUESTORE_SLOW_OP_ALERT: x OSD(s) experiencing slow operations in BlueStore
Description
Originally documented in a Rook-Ceph discussion, where I was advised to open a Ceph tracker issue:
https://github.com/rook/rook/discussions/15403
The cluster is running rook-version v1.16.3 and ceph-version 19.2.1-0, on Ubuntu 24.04.2 LTS with kernel 6.8.0-51-generic:
  cluster:
    id:     cb82340a-2eaf-4597-b83e-cc0e62a9d019
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum as,au,av (age 24m)
    mgr: a(active, since 66m), standbys: b
    mds: 1/1 daemons up, 1 hot standby
    osd: 12 osds: 12 up (since 2m), 12 in (since 2w)
    rgw: 2 daemons active (2 hosts, 1 zones)

  data:
    volumes: 1/1 healthy
    pools:   13 pools, 225 pgs
    objects: 128.74k objects, 393 GiB
    usage:   1.1 TiB used, 12 TiB / 13 TiB avail
    pgs:     225 active+clean

  io:
    client: 281 MiB/s rd, 894 KiB/s wr, 515 op/s rd, 17 op/s wr
The only change I made was upgrading Ceph to 19.2.1-0, and I have been getting messages about slow ops ever since:
$ ceph health detail
HEALTH_WARN 3 OSD(s) experiencing slow operations in BlueStore
[WRN] BLUESTORE_SLOW_OP_ALERT: 3 OSD(s) experiencing slow operations in BlueStore
osd.6 observed slow operation indications in BlueStore
osd.9 observed slow operation indications in BlueStore
osd.10 observed slow operation indications in BlueStore
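The health warning itself does not say which BlueStore operation is slow. One way to dig further (a hedged sketch; `osd.6` is just one of the flagged OSDs from this report, and the exact counter names vary by release, so the filter just greps for "slow" rather than assuming specific keys):

```shell
# Dump perf counters for one of the flagged OSDs and keep only entries
# whose key mentions "slow". Guarded so this is a no-op without a cluster.
if command -v ceph >/dev/null 2>&1; then
  ceph tell osd.6 perf dump | grep -i slow
fi
```

`ceph daemon osd.6 dump_historic_ops`, run from inside the OSD container under Rook, can additionally show where recent slow ops spent their time.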
When I review the OSD logs, I'm not seeing anything that points to slow operations. For example, osd.10 and osd.9 only show entries like this (keyword search for "slow"):
Stalls(count): 0 level0_slowdown, 0 level0_slowdown_with_compaction, 0 level0_numfiles, 0 level0_numfiles_with_compaction, 0 stop for pending_compaction_bytes, 0 slowdown for pending_compaction_bytes, 0 memtable_compaction, 0 memtable_slowdown, interval 0 total count
(the same Stalls(count) line repeats several times; every counter is zero)
At least OSD.6 does have a reference to slow ops:
debug 2025-02-15T14:05:51.957+0000 7d0ff0600640  0 bluestore(/var/lib/ceph/osd/ceph-6) log_latency slow operation observed for kv_final, latency = 5.877769947s
debug 2025-02-15T14:05:51.957+0000 7d0ff0600640  0 bluestore(/var/lib/ceph/osd/ceph-6) log_latency_fn slow operation observed for _txc_committed_kv, latency = 6.474522114s, txc = 0x5886f210d800
debug 2025-02-15T14:05:51.957+0000 7d0ff0600640  0 bluestore(/var/lib/ceph/osd/ceph-6) log_latency_fn slow operation observed for _txc_committed_kv, latency = 6.471610069s, txc = 0x5886ef3abb00
debug 2025-02-15T14:05:51.957+0000 7d0ff0600640  0 bluestore(/var/lib/ceph/osd/ceph-6) log_latency_fn slow operation observed for _txc_committed_kv, latency = 6.455352306s, txc = 0x5886ef110c00
debug 2025-02-15T14:05:51.957+0000 7d0ff0600640  0 bluestore(/var/lib/ceph/osd/ceph-6) log_latency_fn slow operation observed for _txc_committed_kv, latency = 6.227544785s, txc = 0x5886f820e000
debug 2025-02-15T14:05:51.962+0000 7d0ff0600640  0 bluestore(/var/lib/ceph/osd/ceph-6) log_latency_fn slow operation observed for _txc_committed_kv, latency = 6.109148026s, txc = 0x5886ef3ab200
debug 2025-02-15T14:05:51.962+0000 7d0ff0600640  0 bluestore(/var/lib/ceph/osd/ceph-6) log_latency_fn slow operation observed for _txc_committed_kv, latency = 6.107332230s, txc = 0x5886ef427500
debug 2025-02-15T14:05:51.972+0000 7d0ff0600640  0 bluestore(/var/lib/ceph/osd/ceph-6) log_latency_fn slow operation observed for _txc_committed_kv, latency = 5.493177414s, txc = 0x5886f2601200
debug 2025-02-15T14:05:51.976+0000 7d0ff0600640  0 bluestore(/var/lib/ceph/osd/ceph-6) log_latency_fn slow operation observed for _txc_committed_kv, latency = 5.065506935s, txc = 0x5886ef85a900
Stalls(count): 0 level0_slowdown, 0 level0_slowdown_with_compaction, 0 level0_numfiles, 0 level0_numfiles_with_compaction, 0 stop for pending_compaction_bytes, 0 slowdown for pending_compaction_bytes, 0 memtable_compaction, 0 memtable_slowdown, interval 0 total count
(the same Stalls(count) line repeats several times; every counter is zero)
Each device listed is an NVMe drive. No issues are reported by smartctl or nvme-cli. Over a few days the number of affected OSDs varies between 1 and 3, but it tends to be the same three recurring. I'm not finding much in the documentation on how to troubleshoot this, and I'm not seeing any messages about scrub issues or slow trimming. Restarting the OSD resolves it for a few hours, but it comes back.
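To see which BlueStore stage dominates, the `log_latency` lines can be tallied per operation type. A minimal sketch (the log path and device name below are hypothetical examples; under Rook the OSD logs go to container stdout, so `kubectl logs` output can be piped through the same filter instead):

```shell
# Count "slow operation observed" entries per operation type in an OSD log.
# The path is an example; adjust it for your deployment.
LOG=/var/log/ceph/ceph-osd.6.log
grep -o 'slow operation observed for [_a-z]*' "$LOG" 2>/dev/null \
  | sort | uniq -c | sort -rn

# Device-level write latency for the backing NVMe (run on the OSD host);
# the device name is an example:
#   iostat -x 5 nvme0n1
```

If `_txc_committed_kv` and `kv_final` dominate while `iostat` shows the device itself is idle, the delay is in the OSD's kv-commit path rather than the media, which matches the discard-thread issue referenced below.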
Updated by Igor Fedotov about 1 year ago
@reefland - please confirm that the issue has been fixed/worked around by some configuration tricks around async discards as stated in https://tracker.ceph.com/issues/70327.
Updated by Richard Durso about 1 year ago
Igor Fedotov wrote in #note-2:
@reefland - please confirm that the issue has been fixed/worked around by some configuration tricks around async discards as stated in https://tracker.ceph.com/issues/70327.
Yes. Setting bdev_async_discard_threads: "1" appears to have resolved the issue.
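For reference, the same workaround can be applied cluster-wide through the Ceph config database (a sketch based on the option name in this note; confirm it exists on your release with `ceph config help bdev_async_discard_threads` first). Under Rook the equivalent can also go into the rook-config-override ConfigMap.

```shell
# Guarded so this is a no-op where no cluster is available.
if command -v ceph >/dev/null 2>&1; then
  ceph config set osd bdev_async_discard_threads 1
  ceph config get osd bdev_async_discard_threads   # verify the value took
fi
```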
Updated by Igor Fedotov about 1 year ago
- Related to Bug #70327: v19.2.1 breaking compatibility with v19.2.0 and consumes all CPU available to OSDs added
Updated by Igor Fedotov about 1 year ago
- Is duplicate of Bug #67835: blk/kerneldevice: discard threads wake up each other if discard thread num >= 2. added