os/bluestore: enable 4K allocation unit for BlueFS #48854

Merged
yuriw merged 13 commits into ceph:main from ifed01:wip-ifed-small-chunk-bluefs
Jan 25, 2023

Conversation

@ifed01
Contributor

@ifed01 ifed01 commented Nov 11, 2022

Fixes: https://tracker.ceph.com/issues/53466
Signed-off-by: Igor Fedotov igor.fedotov@croit.io

Contribution Guidelines

Checklist

  • Tracker (select at least one)
    • References tracker ticket
    • Very recent bug; references commit where it was introduced
    • New feature (ticket optional)
    • Doc update (no ticket needed)
    • Code cleanup (no ticket needed)
  • Component impact
    • Affects Dashboard, opened tracker ticket
    • Affects Orchestrator, opened tracker ticket
    • No impact that needs to be tracked
  • Documentation (select at least one)
    • Updates relevant documentation
    • No doc update is appropriate
  • Tests (select at least one)
Show available Jenkins commands
  • jenkins retest this please
  • jenkins test classic perf
  • jenkins test crimson perf
  • jenkins test signed
  • jenkins test make check
  • jenkins test make check arm64
  • jenkins test submodules
  • jenkins test dashboard
  • jenkins test dashboard cephadm
  • jenkins test api
  • jenkins test docs
  • jenkins render docs
  • jenkins test ceph-volume all
  • jenkins test ceph-volume tox
  • jenkins test windows

Signed-off-by: Igor Fedotov <ifedotov@croit.io>
We can reuse _compact_log_dump_metadata_NF() instead

Signed-off-by: Igor Fedotov <igor.fedotov@croit.io>
Signed-off-by: Igor Fedotov <igor.fedotov@croit.io>
Signed-off-by: Igor Fedotov <igor.fedotov@croit.io>
Signed-off-by: Igor Fedotov <igor.fedotov@croit.io>
+ minor refactoring.

Signed-off-by: Igor Fedotov <igor.fedotov@croit.io>
Contributor

@aclamk aclamk left a comment


Splendid work!
Some comments on possible improvements, but the logic makes sense.

* ASYNC LOG COMPACTION
* that jumps the log write position to the new extent. At this point, the
* old extent(s) won't be written to, and reflect everything to compact.
* New events will be written to the new region that we'll keep.
Contributor


I think we could cut the backward size dependency if we split op_update_inc(ino.1, ...) between the STARTER and META_DUMP blocks.

  SUPERBLOCK:
  - contains ino.1 extents of STARTER

  STARTER (seq 1):
  - op_init
  - op_update_inc(ino.1, extents of META_DUMP)
  - op_jump(2, sizeof(STARTER))
  - unused space

  META_DUMP (seq 2):
  - ... dump_metadata
  - op_update_inc(ino.1, extents of LOG_CONTINUATION)
  - op_jump(LOG_CONT.seq, sizeof(STARTER) + sizeof(META_DUMP))
  - unused space

  LOG_CONT (seq cont):
  - the continuation of previous log

The rationale is to have the initial log fnode after compaction small
enough to fit into the 4K superblock. Without that, compacted metadata might
require an fnode longer than 4K, which goes beyond the existing 4K
superblock; BlueFS asserts in this case for now.
Hence the resulting log allocation layout is:
- superblock (4K) keeps the initial log fnode, which refers to:
  op_init, op_update_inc(log), op_jump(next seq)
- the updated log fnode built from the superblock plus the above op_update_inc refers to:
  compacted meta (a bunch of op_update and others)
- *
- more op_update_inc(log) to follow if log is extended
- *
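The superblock-size constraint behind this proposal can be illustrated with a toy model. All names and sizes below are illustrative assumptions, not BlueFS's actual fnode encoding:

```python
# Illustrative sketch (not BlueFS's real encoding): why the initial log
# fnode after compaction must stay small enough to fit the 4K superblock.

EXTENT_ENCODED_SIZE = 16   # hypothetical bytes per (offset, length) pair
FNODE_HEADER_SIZE = 64     # hypothetical fixed fnode overhead
SUPERBLOCK_SIZE = 4096

def fnode_encoded_size(num_extents: int) -> int:
    """Rough encoded size of a log fnode with the given extent count."""
    return FNODE_HEADER_SIZE + num_extents * EXTENT_ENCODED_SIZE

def fits_superblock(num_extents: int) -> bool:
    return fnode_encoded_size(num_extents) <= SUPERBLOCK_SIZE

# A small STARTER block needs only a handful of extents -> fits.
print(fits_superblock(4))      # True
# An fnode covering a large, fragmented metadata dump may not fit;
# with 4K allocation units a dump can easily span hundreds of extents.
print(fits_superblock(500))    # False
```

This is why the scheme keeps only the STARTER extents in the superblock and chains the META_DUMP and continuation extents via op_update_inc records inside the log itself.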

Signed-off-by: Igor Fedotov <igor.fedotov@croit.io>
Signed-off-by: Igor Fedotov <igor.fedotov@croit.io>
Signed-off-by: Igor Fedotov <igor.fedotov@croit.io>
This includes a finer position specification during replay
and logging the read size in hex.

Signed-off-by: Igor Fedotov <igor.fedotov@croit.io>
This effectively enables 4K allocation units for BlueFS,
but doesn't turn them on by default for the sake of performance.
Using a main device that lacks enough large contiguous free extents
will trigger them, though.

Signed-off-by: Igor Fedotov <igor.fedotov@croit.io>
Signed-off-by: Igor Fedotov <igor.fedotov@croit.io>
When using bluefs_shared_alloc_size one might get into a long-lasting state
in which chunks of that size are no longer available and a fallback to the
shared device's min alloc size occurs. The introduced cooldown is intended to
prevent repetitive allocation attempts with bluefs_shared_alloc_size for
a while. The rationale is to eliminate the performance penalty these failing
attempts might cause.

Signed-off-by: Igor Fedotov <igor.fedotov@croit.io>
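The cooldown idea in this commit can be sketched as follows. This is a toy model; the class, the timer source, and the 180 s period are assumptions for illustration, not Ceph's actual implementation:

```python
# Hypothetical sketch of an allocation-retry cooldown: after a failed
# attempt to allocate a large (bluefs_shared_alloc_size) chunk, skip
# further large-chunk attempts for a cooldown period and go straight
# to the small (min alloc size) fallback.

class AllocCooldown:
    def __init__(self, cooldown_secs: float):
        self.cooldown_secs = cooldown_secs
        self.last_failure = None  # timestamp of last failed large alloc

    def should_try_large(self, now: float) -> bool:
        """True if enough time has passed since the last failure."""
        if self.last_failure is None:
            return True
        return (now - self.last_failure) >= self.cooldown_secs

    def record_failure(self, now: float) -> None:
        self.last_failure = now

cd = AllocCooldown(cooldown_secs=180.0)
print(cd.should_try_large(now=0.0))    # True: no failure seen yet
cd.record_failure(now=0.0)
print(cd.should_try_large(now=60.0))   # False: still cooling down
print(cd.should_try_large(now=240.0))  # True: cooldown elapsed
```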
@ifed01 ifed01 force-pushed the wip-ifed-small-chunk-bluefs branch from 6248340 to e52bcc8 Compare November 16, 2022 23:48
@ifed01
Contributor Author

ifed01 commented Nov 17, 2022

jenkins test make check

@ifed01
Contributor Author

ifed01 commented Nov 17, 2022

jenkins test make check

1 similar comment
@ifed01
Contributor Author

ifed01 commented Nov 17, 2022

jenkins test make check

@BenoitKnecht
Contributor

We've been hitting this issue on Pacific while increasing pg_num in one of our pools, causing several OSDs to crash, some with as little as 70% usage.

@ifed01 Are you planning on backporting this PR to Pacific? Any chance it would make it into 16.2.11?

@ifed01
Contributor Author

ifed01 commented Jan 11, 2023

> We've been hitting this issue on Pacific while increasing pg_num in one of our pools, causing several OSDs to crash, some with as little as 70% usage.
>
> @ifed01 Are you planning on backporting this PR to Pacific? Any chance it would make it into 16.2.11?

Yes, backports are planned, but I doubt this will make it into 16.2.11, which is to be released soon.

@BenoitKnecht
Contributor

Alright, thanks for the info!

I tried to cherry-pick your commits on top of pacific, but I'm really not sure I managed to resolve conflicts correctly, so if you're already working on a backport on a different branch, I'd be very much interested in checking it out. 🙂

I'm trying to compile a version of ceph-osd with this fix in case we get crashes again, because otherwise the only workaround would be to copy the DB to a separate device, which is rather cumbersome.

I'm assuming that with this fix, we would manage to bring the OSD back up; it would allocate what it needs on BlueFS using 4KiB blocks, and then even if we downgraded to the mainline Pacific version, it would still be able to start as long as it doesn't need to allocate more space on BlueFS. Is that correct, or would it not be able to cope with the smaller blocks?

@ifed01
Contributor Author

ifed01 commented Jan 12, 2023

> Alright, thanks for the info!
>
> I tried to cherry-pick your commits on top of pacific, but I'm really not sure I managed to resolve conflicts correctly, so if you're already working on a backport on a different branch, I'd be very much interested in checking it out. 🙂

I tried to do a backport yesterday, but it looks like this needs some additional non-trivial PRs to be backported to Pacific first. Hence I've postponed it until this one is merged into the main branch.

> I'm trying to compile a version of ceph-osd with this fix in case we get crashes again, because otherwise the only workaround would be to copy the DB to a separate device, which is rather cumbersome.
>
> I'm assuming that with this fix, we would manage to bring the OSD back up, it would allocate what it needs on BlueFS using the 4KiB blocks, and then even if we downgraded to the mainline Pacific version, it would still be able to start as long as it doesn't need to allocate more space on BlueFS. Is that correct, or would it not be able to cope with the smaller blocks?

I wouldn't recommend downgrading to Ceph versions that don't support 4K BlueFS once you have brought that support in.
Instead of all the tricks with unofficial backports, you might want to set bluefs_shared_alloc_size to 32K to recover broken OSDs. We have used this workaround a few times and it looks safe enough (which isn't the case when setting it to 4K!). But generally this won't fix the issue permanently, and you may face it again after a while if space fragmentation goes on...

@ljflores
Member

Rados suite review: https://pulpito.ceph.com/?branch=wip-yuri2-testing-2023-01-23-0928

Failures, unrelated:
1. https://tracker.ceph.com/issues/58585 -- new tracker
2. https://tracker.ceph.com/issues/58256 -- fix merged to latest main
3. https://tracker.ceph.com/issues/58475
4. https://tracker.ceph.com/issues/57754 -- closed
5. https://tracker.ceph.com/issues/57546 -- fix is in testing

Details:
1. rook: failed to pull kubelet image - Ceph - Orchestrator
2. ObjectStore/StoreTestSpecificAUSize.SpilloverTest/2: Expected: (logger->get(l_bluefs_slow_used_bytes)) >= (16 * 1024 * 1024), actual: 0 vs 16777216 - Ceph - RADOS
3. test_dashboard_e2e.sh: Conflicting peer dependency: postcss@8.4.21 - Ceph - Mgr - Dashboard
4. test_envlibrados_for_rocksdb.sh: update-alternatives: error: alternative path /usr/bin/gcc-11 doesn't exist - Ceph - RADOS
5. rados/thrash-erasure-code: wait_for_recovery timeout due to "active+clean+remapped+laggy" pgs - Ceph - RADOS

@yuriw yuriw merged commit c0309cf into ceph:main Jan 25, 2023
@NUABO
Contributor

NUABO commented May 29, 2023

Hi @ifed01, I would like to know whether the problem this PR fixes can also appear in Ceph 15.2.x?

@ifed01
Contributor Author

ifed01 commented May 29, 2023

> Hi @ifed01, I would like to know whether the problem this PR fixes can also appear in Ceph 15.2.x?

Hi @NUABO - no, there are no plans to backport this fix to Octopus, as it's at end of life now

@NUABO
Contributor

NUABO commented May 29, 2023

@ifed01 Thank you very much for your reply. If I use Ceph 15, it means this bug will always exist and I can only adapt this PR by myself.

@ifed01
Contributor Author

ifed01 commented May 29, 2023

> @ifed01 Thank you very much for your reply. If I use Ceph 15, it means this bug will always exist and I can only adapt this PR by myself.

Yes, that's true. Or upgrade to Pacific. The bug is not that frequent, though...

@NUABO
Contributor

NUABO commented May 29, 2023

> @ifed01 Thank you very much for your reply. If I use Ceph 15, it means this bug will always exist and I can only adapt this PR by myself.
>
> Yes, that's true. Or upgrade to Pacific. The bug is not that frequent, though...

thank you 😀

@ifed01
Contributor Author

ifed01 commented May 29, 2023

> @ifed01 Thank you very much for your reply. If I use Ceph 15, it means this bug will always exist and I can only adapt this PR by myself.
>
> Yes, that's true. Or upgrade to Pacific. The bug is not that frequent, though...
>
> thank you 😀

@NUABO - just realized that Pacific lacks 4K BlueFS support as well. You'll need the Quincy release to get it on board. Sorry for the misleading advice.

@NUABO
Contributor

NUABO commented May 29, 2023

> @ifed01 Thank you very much for your reply. If I use Ceph 15, it means this bug will always exist and I can only adapt this PR by myself.
>
> Yes, that's true. Or upgrade to Pacific. The bug is not that frequent, though...
>
> thank you 😀
>
> @NUABO - just realized that Pacific lacks 4K BlueFS support as well. You'll need the Quincy release to get it on board. Sorry for the misleading advice.

thanks for your friendly reminder (:

@Badb0yBadb0y

Hi,
After updating to Quincy 17.2.7 on Ubuntu, it still shows 64K:

ceph daemon /var/run/ceph/ceph-osd.257.asok config get "bluefs_shared_alloc_size"
{
    "bluefs_shared_alloc_size": "65536"
}

Is it intended?

@ifed01
Contributor Author

ifed01 commented Nov 27, 2024

> Hi, After updating to Quincy 17.2.7 on Ubuntu, it still shows 64K:
>
>     ceph daemon /var/run/ceph/ceph-osd.257.asok config get "bluefs_shared_alloc_size"
>     {
>         "bluefs_shared_alloc_size": "65536"
>     }
>
> Is it intended?

Yes, that's correct. By default, for the sake of performance, BlueFS still uses a 64K alloc unit on the shared device.
4K units are used only when space is very fragmented and there are not enough free contiguous 64K chunks.
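The fallback behavior described here can be sketched as a toy model. The function, names, and the "usable large space" heuristic below are illustrative assumptions, not actual BlueFS allocator code:

```python
# Illustrative sketch (not actual BlueFS code) of preferring the 64K
# allocation unit and falling back to 4K only when the free space is
# too fragmented to supply contiguous 64K chunks.

LARGE_AU = 64 * 1024
SMALL_AU = 4 * 1024

def pick_alloc_unit(free_extent_lengths: list, want: int) -> int:
    """Return the allocation unit to use for a request of `want` bytes."""
    # How much of the free space can be carved into whole LARGE_AU chunks?
    usable_large = sum(l - l % LARGE_AU for l in free_extent_lengths)
    if usable_large >= want:
        return LARGE_AU
    return SMALL_AU  # fragmented space: fall back to 4K units

# Plenty of large contiguous space -> 64K units.
print(pick_alloc_unit([1 << 20, 1 << 20], want=256 * 1024))  # 65536
# Heavily fragmented: only sub-64K extents left -> 4K fallback.
print(pick_alloc_unit([48 * 1024] * 10, want=256 * 1024))    # 4096
```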

@Badb0yBadb0y

Badb0yBadb0y commented Nov 27, 2024 via email

@ifed01
Contributor Author

ifed01 commented Nov 27, 2024

> So is lowering from 64K to 4K done by Ceph if needed, like an automated mechanism, or is it the admin's duty to fix it?

This is done by Ceph automatically.

8 participants