os/bluestore: fix bluefs log growth (#35473)
Conversation
@aclamk - did you see my comment in https://tracker.ceph.com/issues/45903?

Additionally, for the last couple of weeks I've seen another(?) BlueFS log issue which looks related - under some circumstances (heavy fragmentation and/or massive space allocation) BlueFS might overflow its log runway (4MB) and either assert with "ceph_assert(h->file->fnode.ino != 1)" or even corrupt the log. See https://tracker.ceph.com/issues/45519
@ifed01 |
    logger->inc(l_bluefs_logged_bytes, bl.length());

    if (just_expanded_log) {
      ceph_assert(bl.length() <= runway); // if we write this, we will have an unrecoverable data loss
Shouldn't we check the assertion unconditionally, no matter whether the log has just expanded or not?
No, it deliberately checks the previous size of the runway. The new runway size will likely be enough to accommodate writes to the log, but if there is not enough space to fit the record within the previous extents, then replaying the extension will not be possible.
src/os/bluestore/BlueFS.h (outdated)
-  std::lock_guard l(lock);
+  std::unique_lock l(lock);
   _flush(h, false);
   _maybe_compact_log(l);
This looks like overkill to me - IIUC, flush doesn't guarantee data persistence, and hence the KV store has to use fsync from time to time. I.e., compacting the log from fsync is likely to be enough. Besides, I'm curious about the performance impact of this call on each flush.
@ifed01 Modified to check only when flush actually wrote something to disk.
@aclamk - but I'm still curious whether compacting the log on fsync only would be enough. Does RocksDB call fsync often enough during the infrequent-write processing you're referring to in the description?
@ifed01 The customer had a cluster with almost no write traffic. The only op was a 23-byte write to the WAL once every second. Unfortunately, it translated to 109 kB of BlueFS log per second (mostly due to the BlueFS log containing >70000 extents).
2c8174e to bd1efbd
src/test/objectstore/test_bluefs.cc (outdated)
TEST(BlueFS, test_replay_growth) {
  uint64_t size = 1048576LL * (2 * 1024 + 128);
  TempBdev bdev{size};
  g_ceph_context->_conf.set_val(
test_bluefs.cc lacks config settings reversion, which is generally a bad practice. I suggest not following this approach any more and resetting all the changed parameters back to their defaults. store_test.cc is a good example of how one can do that automatically if needed.
This partially fixes https://tracker.ceph.com/issues/45903
Signed-off-by: Adam Kupczyk <akupczyk@redhat.com>

…lay log

This partially fixes https://tracker.ceph.com/issues/45903
Signed-off-by: Adam Kupczyk <akupczyk@redhat.com>
bd1efbd to 8a3f9db
@ifed01 I rate this PR as a partial fix for #45903, because it is still possible for the BlueFS log to grow so much that it will be stopped by the assert that prevents log corruption. Although this is of course very unlikely to happen, the final solution should compact the log immediately if such a condition is detected.

@ifed01 I changed the conf setting in test_bluefs.cc; I asked for re-review because of that.
ifed01 left a comment:

LGTM, a pair of nits which you might want to fix...
src/os/bluestore/BlueFS.cc (outdated)
-int BlueFS::_flush(FileWriter *h, bool force)
+int BlueFS::_flush(FileWriter *h, bool force, std::unique_lock<ceph::mutex>& l)
 {
   bool flushed;
nit: it looks a bit safer (currently it's fine, but from a future-modifications perspective...) to initialize flushed with false.
src/test/objectstore/test_bluefs.cc (outdated)
std::string skey(key);
std::string prev_val;
conf.get_val(skey, &prev_val);
conf.set_val_or_die(key, val);
…ync ops

This partially fixes https://tracker.ceph.com/issues/45903
Signed-off-by: Adam Kupczyk <akupczyk@redhat.com>

Signed-off-by: Adam Kupczyk <akupczyk@redhat.com>
8a3f9db to 9a59242
jenkins test docs
This is a fix for:
https://tracker.ceph.com/issues/45903
https://bugzilla.redhat.com/show_bug.cgi?id=1821133
The original problem stemmed from BlueFS's inability to replay its log: BlueFS had previously written a replay log that was corrupted, which was caused by the BlueFS log growing to an extreme size (~600 GB), which in turn was caused by the OSD operating in a way where BlueFS::sync_metadata was never invoked.
This was possible only because the OSD was working under an extremely low write load, and a single RocksDB WAL went unpurged for more than 14 days.