os/bluestore: enable 4K allocation unit for BlueFS (#48854)
Signed-off-by: Igor Fedotov <ifedotov@croit.io>
We can reuse _compact_log_dump_metadata_NF() instead. Signed-off-by: Igor Fedotov <igor.fedotov@croit.io>
Signed-off-by: Igor Fedotov <igor.fedotov@croit.io>
Signed-off-by: Igor Fedotov <igor.fedotov@croit.io>
Signed-off-by: Igor Fedotov <igor.fedotov@croit.io>
+ minor refactoring. Signed-off-by: Igor Fedotov <igor.fedotov@croit.io>
Force-pushed from 7b09b5a to 6248340
aclamk left a comment:
Splendid work!
Some comments on possible improvements, but the logic makes sense.
* that jumps the log write position to the new extent. At this point, the
* old extent(s) won't be written to, and reflect everything to compact.
* New events will be written to the new region that we'll keep.
* ASYNC LOG COMPACTION
I think we could cut the backward-size dependency if we split op_update_inc(ino1,) between the STARTER and META_DUMP blocks.
SUPERBLOCK:
- contains ino.1 extents of STARTER
STARTER (seq 1):
- op_init
- op_update_inc(ino.1, extents of META_DUMP)
- op_jump(2, sizeof(STARTER))
- unused space
META_DUMP (seq 2):
- ... dump_metadata
- op_update_inc(ino.1, extents of LOG_CONTINUATION)
- op_jump(LOG_CONT.seq, sizeof(STARTER) + sizeof(META_DUMP))
- unused space
LOG_CONT (seq cont):
- the continuation of previous log
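The proposed disposition can be sketched as a toy replay loop. This is a hypothetical simplification for illustration only: the `Fnode` struct, the tuple-based op encoding, and the block contents below are invented stand-ins, not the real BlueFS transaction format (which carries encoded fnodes, lengths, and checksums).

```python
from dataclasses import dataclass, field

@dataclass
class Fnode:
    ino: int
    extents: list = field(default_factory=list)  # list of (offset, length)

def replay(blocks, log_fnode):
    """Apply ops block by block, mimicking how a reader discovers the
    rest of the log: each block reveals the extents of the next one."""
    seq = 1
    for ops in blocks:
        for op in ops:
            kind = op[0]
            if kind == "op_init":
                pass                        # start of a fresh log
            elif kind == "op_update_inc":   # append extents to ino.1
                log_fnode.extents.extend(op[1])
            elif kind == "op_jump":         # advance seq, skip unused space
                seq = op[1]
    return seq, log_fnode

# The superblock only knows STARTER's extent; everything else is chained.
starter   = [("op_init",),
             ("op_update_inc", [(0x10000, 0x1000)]),  # extents of META_DUMP
             ("op_jump", 2)]
meta_dump = [("op_update_inc", [(0x20000, 0x1000)]),  # extents of LOG_CONT
             ("op_jump", 3)]
log_cont  = [("op_update_inc", [(0x30000, 0x1000)])]  # ongoing updates

seq, fnode = replay([starter, meta_dump, log_cont],
                    Fnode(ino=1, extents=[(0x0, 0x1000)]))
print(seq, fnode.extents)
```

The point of the chaining is that the superblock never has to describe more than the STARTER extent; every later region is introduced by an op_update_inc in the region before it.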
The rationale is to have the initial log fnode after compaction small enough to fit into the 4K superblock. Without that, compacted metadata might require an fnode longer than 4K, which goes beyond the existing 4K superblock; BlueFS asserts in this case for now. Hence the resulting log allocation disposition is like:
- superblock (4K) keeps the initial log fnode, which refers to: op_init, op_update_inc(log), op_jump(next seq)
- updated log fnode built from superblock + the above op_update_inc refers to: compacted meta (a bunch of op_update and others)
- more op_update_inc(log) to follow if the log is extended

Signed-off-by: Igor Fedotov <igor.fedotov@croit.io>
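The "small enough to fit into the 4K superblock" constraint amounts to simple arithmetic on the number of extents the initial fnode may carry. The per-extent cost and header overhead below are illustrative assumptions, not the real BlueFS encoding (which uses variable-length encoding):

```python
SUPERBLOCK_SIZE = 4096
HEADER_BYTES = 512      # assumed overhead for other superblock fields
BYTES_PER_EXTENT = 16   # assumed encoded size of one extent

def initial_fnode_fits(n_extents):
    """True if an initial log fnode with n_extents extents fits the superblock."""
    return HEADER_BYTES + n_extents * BYTES_PER_EXTENT <= SUPERBLOCK_SIZE

max_extents = (SUPERBLOCK_SIZE - HEADER_BYTES) // BYTES_PER_EXTENT
print(max_extents)
```

Whatever the exact encoding, the budget is fixed at 4K, which is why keeping the initial fnode down to a handful of extents (and chaining the rest via op_update_inc) avoids the assert.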
Signed-off-by: Igor Fedotov <igor.fedotov@croit.io>
Signed-off-by: Igor Fedotov <igor.fedotov@croit.io>
This includes finer position specification during replay and logging the read size in hex. Signed-off-by: Igor Fedotov <igor.fedotov@croit.io>
This effectively enables 4K allocation units for BlueFS. But it doesn't turn them on by default, for the sake of performance. Using a main device which lacks enough free large contiguous extents might do the trick, though. Signed-off-by: Igor Fedotov <igor.fedotov@croit.io>
Signed-off-by: Igor Fedotov <igor.fedotov@croit.io>
When using bluefs_shared_alloc_size one might get into a long-lasting state in which chunks that large are no longer available and a fallback to the shared device's min alloc size occurs. The introduced cooldown is intended to prevent repetitive allocation attempts with bluefs_shared_alloc_size for a while. The rationale is to eliminate the performance penalty these failing attempts might cause. Signed-off-by: Igor Fedotov <igor.fedotov@croit.io>
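The cooldown idea can be sketched as a wrapper that, after one failed large-unit allocation, goes straight to the min-alloc fallback until the window expires. All names here (`FallbackAllocator`, the callables, the 600 s default) are illustrative; they do not mirror the actual BlueStore code:

```python
import time

class FallbackAllocator:
    """Try the large shared alloc unit first; after a failure, skip the
    large attempt entirely until the cooldown expires."""
    def __init__(self, try_large, try_small, cooldown_s=600.0):
        self.try_large = try_large      # callable(size) -> bool, large-unit alloc
        self.try_small = try_small      # callable(size) -> bool, min-alloc fallback
        self.cooldown_s = cooldown_s
        self.cooldown_until = 0.0

    def allocate(self, size, now=None):
        now = time.monotonic() if now is None else now
        if now >= self.cooldown_until and self.try_large(size):
            return "large"
        if now >= self.cooldown_until:
            # The large attempt just failed: open the cooldown window so we
            # don't pay for repeated failing attempts on every allocation.
            self.cooldown_until = now + self.cooldown_s
        if self.try_small(size):
            return "small"
        return None

calls = []
alloc = FallbackAllocator(try_large=lambda s: (calls.append("L"), False)[1],
                          try_small=lambda s: (calls.append("S"), True)[1],
                          cooldown_s=600.0)
print(alloc.allocate(65536, now=0.0))   # 'small' - large attempt failed
print(alloc.allocate(65536, now=10.0))  # 'small' - large skipped, in cooldown
print(calls.count("L"))                 # 1: only one large attempt was made
```

The design choice is the same one the commit describes: a single failed large-unit attempt is cheap, but retrying it on every allocation while the device stays fragmented is not, so the failure is remembered for a while.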
Force-pushed from 6248340 to e52bcc8
jonkins test make check

jenkins test make check

jenkins test make check

We've been hitting this issue on Pacific while increasing @ifed01 Are you planning on backporting this PR to Pacific? Any chance it would make it into 16.2.11?
Alright, thanks for the info! I tried to cherry-pick your commits on top of I'm trying to compile a version of I'm assuming that with this fix, we would manage to bring the OSD back up, it would allocate what it needs on BlueFS using the 4KiB blocks, and then even if we downgraded to the mainline Pacific version, it would still be able to start as long as it doesn't need to allocate more space on BlueFS. Is that correct, or would it not be able to cope with the smaller blocks?
I tried to do a backport yesterday but it looks like this needs some additional non-trivial PRs to be backported first for Pacific. Hence I postponed that until this one is merged into the main branch.
I wouldn't recommend downgrading to Ceph versions which don't support 4K BlueFS once you have brought that support in.
Rados suite review: https://pulpito.ceph.com/?branch=wip-yuri2-testing-2023-01-23-0928 Failures, unrelated: Details:
Hi @ifed01, I would like to know whether the problem this PR fixes will appear in Ceph 15.2.x?
@ifed01 thank you very much for your reply. If I use the Ceph 15 release it means this bug will always exist, so I can only adapt this PR myself.
Yes, that's true. Or upgrade to Pacific. And the bug is not that frequent, though...
Thank you 😀
@NUABO - just realized that Pacific lacks 4K BlueFS support as well. You'll need the Quincy release to get it on board. Sorry for the misleading info.
thanks for your friendly reminder (:
Hi, is it intended?
Yes, that's correct. By default, for the sake of performance, BlueFS still uses a 64K alloc unit on the shared device.
So is lowering from 64K to 4K done by Ceph automatically, like a built-in mechanism, or is it the admin's duty to fix it when needed?

This is done by Ceph automatically.
Fixes: https://tracker.ceph.com/issues/53466
Signed-off-by: Igor Fedotov igor.fedotov@croit.io