Bug #70010 (closed)

[WRN] BLUESTORE_SLOW_OP_ALERT: x OSD(s) experiencing slow operations in BlueStore

Added by Richard Durso about 1 year ago. Updated about 1 year ago.

Status: Duplicate
Priority: Normal
Assignee: -
Target version: -
% Done: 0%
Regression: No
Severity: 3 - minor

Description

This was originally reported in a Rook-Ceph discussion, where it was suggested that I open a Ceph tracker issue:
https://github.com/rook/rook/discussions/15403

Stable with rook-version v1.16.3 and ceph-version 19.2.1-0, on Ubuntu 24.04.2 LTS with kernel 6.8.0-51-generic:

  cluster:
    id:     cb82340a-2eaf-4597-b83e-cc0e62a9d019
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum as,au,av (age 24m)
    mgr: a(active, since 66m), standbys: b
    mds: 1/1 daemons up, 1 hot standby
    osd: 12 osds: 12 up (since 2m), 12 in (since 2w)
    rgw: 2 daemons active (2 hosts, 1 zones)

  data:
    volumes: 1/1 healthy
    pools:   13 pools, 225 pgs
    objects: 128.74k objects, 393 GiB
    usage:   1.1 TiB used, 12 TiB / 13 TiB avail
    pgs:     225 active+clean

  io:
    client:   281 MiB/s rd, 894 KiB/s wr, 515 op/s rd, 17 op/s wr

The only change I made was upgrading Ceph to 19.2.1-0, and I have been getting messages about slow ops ever since:
$ ceph health detail
HEALTH_WARN 3 OSD(s) experiencing slow operations in BlueStore
[WRN] BLUESTORE_SLOW_OP_ALERT: 3 OSD(s) experiencing slow operations in BlueStore
     osd.6 observed slow operation indications in BlueStore
     osd.9 observed slow operation indications in BlueStore
     osd.10 observed slow operation indications in BlueStore
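
For anyone tuning this check: if I'm reading the Squid docs right, the alert fires once an OSD accumulates bluestore_slow_ops_warn_threshold slow-op indications within bluestore_slow_ops_warn_lifetime seconds, so its sensitivity should be adjustable. A sketch, with illustrative values rather than recommendations:

$ ceph config set osd bluestore_slow_ops_warn_threshold 10   # indications needed before the warning is raised
$ ceph config set osd bluestore_slow_ops_warn_lifetime 86400 # seconds an indication counts toward the threshold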

When I review the OSD logs, I don't see anything indicating slowness. For example, osd.10 and osd.9 look like this (keyword search for "slow"):
Stalls(count): 0 level0_slowdown, 0 level0_slowdown_with_compaction, 0 level0_numfiles, 0 level0_numfiles_with_compaction, 0 stop for pending_compaction_bytes, 0 slowdown for pending_compaction_bytes, 0 memtable_compaction, 0 memtable_slowdown, interval 0 total count
[identical line repeated 8 times in the log]

osd.6, however, does have references to slow ops:
debug 2025-02-15T14:05:51.957+0000 7d0ff0600640  0 bluestore(/var/lib/ceph/osd/ceph-6) log_latency slow operation observed for kv_final, latency = 5.877769947s
debug 2025-02-15T14:05:51.957+0000 7d0ff0600640  0 bluestore(/var/lib/ceph/osd/ceph-6) log_latency_fn slow operation observed for _txc_committed_kv, latency = 6.474522114s, txc = 0x5886f210d800
debug 2025-02-15T14:05:51.957+0000 7d0ff0600640  0 bluestore(/var/lib/ceph/osd/ceph-6) log_latency_fn slow operation observed for _txc_committed_kv, latency = 6.471610069s, txc = 0x5886ef3abb00
debug 2025-02-15T14:05:51.957+0000 7d0ff0600640  0 bluestore(/var/lib/ceph/osd/ceph-6) log_latency_fn slow operation observed for _txc_committed_kv, latency = 6.455352306s, txc = 0x5886ef110c00
debug 2025-02-15T14:05:51.957+0000 7d0ff0600640  0 bluestore(/var/lib/ceph/osd/ceph-6) log_latency_fn slow operation observed for _txc_committed_kv, latency = 6.227544785s, txc = 0x5886f820e000
debug 2025-02-15T14:05:51.962+0000 7d0ff0600640  0 bluestore(/var/lib/ceph/osd/ceph-6) log_latency_fn slow operation observed for _txc_committed_kv, latency = 6.109148026s, txc = 0x5886ef3ab200
debug 2025-02-15T14:05:51.962+0000 7d0ff0600640  0 bluestore(/var/lib/ceph/osd/ceph-6) log_latency_fn slow operation observed for _txc_committed_kv, latency = 6.107332230s, txc = 0x5886ef427500
debug 2025-02-15T14:05:51.972+0000 7d0ff0600640  0 bluestore(/var/lib/ceph/osd/ceph-6) log_latency_fn slow operation observed for _txc_committed_kv, latency = 5.493177414s, txc = 0x5886f2601200
debug 2025-02-15T14:05:51.976+0000 7d0ff0600640  0 bluestore(/var/lib/ceph/osd/ceph-6) log_latency_fn slow operation observed for _txc_committed_kv, latency = 5.065506935s, txc = 0x5886ef85a900

Stalls(count): 0 level0_slowdown, 0 level0_slowdown_with_compaction, 0 level0_numfiles, 0 level0_numfiles_with_compaction, 0 stop for pending_compaction_bytes, 0 slowdown for pending_compaction_bytes, 0 memtable_compaction, 0 memtable_slowdown, interval 0 total count
[identical line repeated 4 times in the log]

Each of the devices listed is an NVMe drive, and neither smartctl nor nvme-cli reports any issues. Over a few days the number of affected OSDs varies between 1 and 3, but it tends to be the same three repeating. I'm not finding much in the documentation on how to troubleshoot this, and I'm not seeing any messages about scrub issues or slow trimming. Restarting the OSD resolves it for a few hours, but it comes back.
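
In the absence of documented troubleshooting steps, this is roughly how I have been watching for new indications; the namespace and deployment names below are the ones from my Rook cluster, so adjust as needed:

$ kubectl -n rook-ceph logs deploy/rook-ceph-osd-6 --since=1h | grep 'slow operation observed'
# restarting the affected OSD clears the alert, but only for a few hours:
$ kubectl -n rook-ceph rollout restart deploy/rook-ceph-osd-6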


Related issues 2 (0 open, 2 closed)

Related to bluestore - Bug #70327: v19.2.1 breaking compatibility with v19.2.0 and consumes all CPU available to OSDs (Resolved, Yite Gu)

Is duplicate of bluestore - Bug #67835: blk/kerneldevice: discard threads wake up each other if discard thread num >= 2 (Resolved, Yite Gu)

Actions #1

Updated by Igor Fedotov about 1 year ago

  • Project changed from Ceph to bluestore
Actions #2

Updated by Igor Fedotov about 1 year ago

@reefland - please confirm that the issue has been fixed/worked around by some configuration tricks around async discards as stated in https://tracker.ceph.com/issues/70327.

Actions #3

Updated by Richard Durso about 1 year ago

Igor Fedotov wrote in #note-2:

@reefland - please confirm that the issue has been fixed/worked around by some configuration tricks around async discards as stated in https://tracker.ceph.com/issues/70327.

Yes. Setting bdev_async_discard_threads: "1" seemed to resolve the issue.
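
For anyone else landing here, one way to apply that setting is via the ceph CLI (run from the rook-ceph-tools pod in a Rook cluster; osd.6 is just one of the OSDs that was flapping for me):

$ ceph config set osd bdev_async_discard_threads 1
$ ceph config show osd.6 bdev_async_discard_threads   # confirm the value in effect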

Actions #4

Updated by Igor Fedotov about 1 year ago

  • Related to Bug #70327: v19.2.1 breaking compatibility with v19.2.0 and consumes all CPU available to OSDs added
Actions #5

Updated by Igor Fedotov about 1 year ago

  • Is duplicate of Bug #67835: blk/kerneldevice: discard threads wake up each other if discard thread num >= 2. added
Actions #6

Updated by Igor Fedotov about 1 year ago

  • Status changed from New to Duplicate
