os/bluestore: fix bluefs log growth. #35473

Merged: tchaikov merged 4 commits into ceph:master from aclamk:fix-45903-bluefs-log-growth on Jun 24, 2020
Conversation

@aclamk aclamk commented Jun 8, 2020

This is a fix for:
https://tracker.ceph.com/issues/45903
https://bugzilla.redhat.com/show_bug.cgi?id=1821133

The original problem stemmed from BlueFS being unable to replay its log: BlueFS had earlier written a replay log that was corrupted, because the BlueFS log had grown to an extreme size (~600GB), which in turn happened because the OSD was operating in a way in which BlueFS::sync_metadata was never invoked.
This was possible only because the OSD was under an extremely low write load, and a single RocksDB WAL went unpurged for more than 14 days.

@aclamk aclamk requested review from ifed01, jdurgin and neha-ojha June 8, 2020 11:08
ifed01 commented Jun 8, 2020

@aclamk - did you see my comment in https://tracker.ceph.com/issues/45903?
That ticket is for the Luminous release, which lacks the patch making bluefs log compaction more aggressive; see #34876

ifed01 commented Jun 8, 2020

Additionally, over the last couple of weeks I've seen another(?) bluefs log issue that looks related: under some circumstances (heavily fragmented and/or massive space allocation) BlueFS can overflow its log runway (4MB) and either assert with "ceph_assert(h->file->fnode.ino != 1)" or even corrupt the log. See https://tracker.ceph.com/issues/45519
I'm still not 100% sure whether all this stuff is related, though.

ifed01 commented Jun 8, 2020

@aclamk - wondering if you were aware of #34876 and did all the analysis for master, not luminous? In other words - are your changes still needed when #17354/#34876 are in place?

aclamk commented Jun 8, 2020

@ifed01
I have a case when BlueFS::sync_metadata was never invoked.
I think that #34876 does not cover this situation.

aclamk commented Jun 8, 2020

@ifed01
I agree that https://tracker.ceph.com/issues/45519 is related to https://tracker.ceph.com/issues/45903.
In my observations, actual corruption of the BlueFS log can happen earlier than the detection of a runway overflow; but indeed, having highly fragmented BlueFS files can make the log grow very quickly.

logger->inc(l_bluefs_logged_bytes, bl.length());

if (just_expanded_log) {
ceph_assert(bl.length() <= runway); // if we write this, we will have an unrecoverable data loss
Reviewer:

shouldn't we check the assertion unconditionally, no matter if it has just expanded or not?

aclamk (author):

No, it is deliberately checking the previous size of the runway. The new runway size will likely be enough to accommodate writes to the log, but if there is not enough space to fit within the previous extents, then replaying the extension will not be possible.

std::lock_guard l(lock);
std::unique_lock l(lock);
_flush(h, false);
_maybe_compact_log(l);
ifed01 (reviewer):

This looks like overkill to me - IIUC, flush doesn't guarantee data persistence, so KV has to use fsync from time to time. I.e., compacting the log from fsync is likely to be enough. Besides, I'm curious about the performance impact of this call on each flush.

aclamk (author):

@ifed01 Modified to check only when flush really flushed something to disk.

ifed01:

@aclamk - but I'm still curious whether compacting the log on fsync only would be enough. Does RocksDB call fsync often enough during the infrequent-writes processing you're referring to in the description?

aclamk (author):

@ifed01 The customer had a cluster with almost no write traffic. The only op was a 23-byte write to the WAL once every second. Unfortunately, that translated to 109 kB of BlueFS log per second (mostly because the BlueFS log contained >70000 extents).

@aclamk aclamk force-pushed the fix-45903-bluefs-log-growth branch from 2c8174e to bd1efbd on June 8, 2020 17:13
@ifed01 ifed01 changed the title from "Fix 45903 bluefs log growth" to "os/bluestore: fix bluefs log growth." on Jun 10, 2020
TEST(BlueFS, test_replay_growth) {
uint64_t size = 1048576LL * (2 * 1024 + 128);
TempBdev bdev{size};
g_ceph_context->_conf.set_val(
ifed01 (reviewer):

test_bluefs.cc lacks reversion of config settings, which is generally bad practice. I suggest not following this approach any more and resetting all changed parameters back to their defaults. store_test.cc is a good example of how one can do that automatically if needed.

ifed01 commented Jun 10, 2020

@aclamk - wondering why this patch fixes #45903 only partially? What's not covered?

aclamk added 2 commits June 16, 2020 19:12
This partially fixes https://tracker.ceph.com/issues/45903

Signed-off-by: Adam Kupczyk <akupczyk@redhat.com>
…lay log

This partially fixes https://tracker.ceph.com/issues/45903

Signed-off-by: Adam Kupczyk <akupczyk@redhat.com>
@aclamk aclamk force-pushed the fix-45903-bluefs-log-growth branch from bd1efbd to 8a3f9db on June 17, 2020 11:13
aclamk commented Jun 17, 2020

@ifed01 I rate this PR as a partial fix for #45903, because it is still possible for the bluefs log to grow so much that it is stopped by the assert preventing log corruption. Although that is of course very unlikely to happen, the final solution should compact the log immediately when such a condition is detected.

@aclamk aclamk requested a review from ifed01 June 17, 2020 11:18
aclamk commented Jun 17, 2020

@ifed01 I changed the conf setting in test_bluefs.cc; I requested a re-review because of that.

@ifed01 ifed01 left a comment

LGTM, a pair of nits which you might want to fix....

int BlueFS::_flush(FileWriter *h, bool force)
int BlueFS::_flush(FileWriter *h, bool force, std::unique_lock<ceph::mutex>& l)
{
bool flushed;
ifed01:

nit: it looks a bit safer (it's fine currently, but from a future-modifications perspective...) to initialize flushed to false.

aclamk (author):

@ifed01 thanks, I want to fix all nits!

std::string skey(key);
std::string prev_val;
conf.get_val(skey, &prev_val);
conf.set_val_or_die(key, val);
ifed01:

nit: use skey not key here

aclamk (author):

thx

aclamk added 2 commits June 18, 2020 13:40
…ync ops

This partially fixes https://tracker.ceph.com/issues/45903

Signed-off-by: Adam Kupczyk <akupczyk@redhat.com>
Signed-off-by: Adam Kupczyk <akupczyk@redhat.com>
@tchaikov:
jenkins test docs
