os/bluestore: fix bluefs log growth (#35473)
Conversation
@aclamk - did you see my comment in https://tracker.ceph.com/issues/45903?

Additionally, for the last couple of weeks I've seen another(?) BlueFS log issue which looks related - under some circumstances (heavy fragmentation and/or massive space allocation) BlueFS might overflow its log runway (4MB) and either assert with "ceph_assert(h->file->fnode.ino != 1)" or even corrupt the log. See https://tracker.ceph.com/issues/45519
@ifed01 |
    logger->inc(l_bluefs_logged_bytes, bl.length());

    if (just_expanded_log) {
      ceph_assert(bl.length() <= runway); // if we write this, we will have an unrecoverable data loss
Shouldn't we check the assertion unconditionally, no matter whether the log has just expanded or not?
No, it deliberately checks the previous size of the runway. The new runway size will likely be enough to accommodate writes to the log, but if there is not enough space to fit the record within the previous extents, then replaying the extension will not be possible.
src/os/bluestore/BlueFS.h (outdated)
-  std::lock_guard l(lock);
+  std::unique_lock l(lock);
   _flush(h, false);
   _maybe_compact_log(l);
This looks like overkill to me - IIUC, flush doesn't guarantee data persistence, and hence the KV store has to use fsync from time to time. I.e., compacting the log from fsync is likely to be enough. Besides, I'm curious about the performance impact of this call on each flush.
@ifed01 Modified to check only when flush actually wrote something to disk.
@aclamk - but I'm still curious whether compacting the log on fsync only would be enough. Does RocksDB call fsync often enough during the infrequent-write processing you're referring to in the description?
@ifed01 The customer had a cluster with almost no write traffic. The only op was a 23-byte write to the WAL once every second. Unfortunately, it translated to 109 kB of BlueFS log per second (mostly due to the BlueFS log containing >70000 extents).
2c8174e to bd1efbd
src/test/objectstore/test_bluefs.cc (outdated)
TEST(BlueFS, test_replay_growth) {
  uint64_t size = 1048576LL * (2 * 1024 + 128);
  TempBdev bdev{size};
  g_ceph_context->_conf.set_val(
test_bluefs.cc lacks config settings reversion, which is generally a bad practice. I suggest not following this approach any more and resetting all the changed parameters back to their defaults. store_test.cc is a good example of how one can do that automatically if needed.
This partially fixes https://tracker.ceph.com/issues/45903
Signed-off-by: Adam Kupczyk <akupczyk@redhat.com>

…lay log

This partially fixes https://tracker.ceph.com/issues/45903
Signed-off-by: Adam Kupczyk <akupczyk@redhat.com>
bd1efbd to 8a3f9db
@ifed01 I rate this PR as a partial fix for #45903, because it is still possible for the BlueFS log to grow so much that it will be stopped by the assert that prevents log corruption. Although this is of course very unlikely to happen, the final solution should compact the log immediately if such a condition is detected.

@ifed01 I changed the conf setting in test_bluefs.cc; I asked for re-review because of that.
ifed01 left a comment:

LGTM, a pair of nits which you might want to fix...
src/os/bluestore/BlueFS.cc (outdated)
-int BlueFS::_flush(FileWriter *h, bool force)
+int BlueFS::_flush(FileWriter *h, bool force, std::unique_lock<ceph::mutex>& l)
 {
   bool flushed;
nit: it looks a bit safer (currently it's fine, but from a future-modifications perspective...) to initialize flushed with false.
src/test/objectstore/test_bluefs.cc (outdated)
std::string skey(key);
std::string prev_val;
conf.get_val(skey, &prev_val);
conf.set_val_or_die(key, val);
…ync ops

This partially fixes https://tracker.ceph.com/issues/45903
Signed-off-by: Adam Kupczyk <akupczyk@redhat.com>

Signed-off-by: Adam Kupczyk <akupczyk@redhat.com>
8a3f9db to 9a59242
jenkins test docs
This is a fix for:
https://tracker.ceph.com/issues/45903
https://bugzilla.redhat.com/show_bug.cgi?id=1821133
The original problem stemmed from BlueFS's inability to replay its log: BlueFS had previously written a replay log that was corrupted, which was caused by the BlueFS log growing to an extreme size (~600 GB), which in turn was caused by the OSD operating in a way where BlueFS::sync_metadata was never invoked.
This was possible only because the OSD was working under an extremely low write load, and a single RocksDB WAL went unpurged for more than 14 days.