os/bluestore: enable 4K allocation unit for BlueFS #48854

Merged
yuriw merged 13 commits into ceph:main from ifed01:wip-ifed-small-chunk-bluefs
Jan 25, 2023

Conversation

@ifed01
Contributor

@ifed01 ifed01 commented Nov 11, 2022

Fixes: https://tracker.ceph.com/issues/53466
Signed-off-by: Igor Fedotov igor.fedotov@croit.io

Contribution Guidelines

Checklist

  • Tracker (select at least one)
    • References tracker ticket
    • Very recent bug; references commit where it was introduced
    • New feature (ticket optional)
    • Doc update (no ticket needed)
    • Code cleanup (no ticket needed)
  • Component impact
    • Affects Dashboard, opened tracker ticket
    • Affects Orchestrator, opened tracker ticket
    • No impact that needs to be tracked
  • Documentation (select at least one)
    • Updates relevant documentation
    • No doc update is appropriate
  • Tests (select at least one)
Show available Jenkins commands
  • jenkins retest this please
  • jenkins test classic perf
  • jenkins test crimson perf
  • jenkins test signed
  • jenkins test make check
  • jenkins test make check arm64
  • jenkins test submodules
  • jenkins test dashboard
  • jenkins test dashboard cephadm
  • jenkins test api
  • jenkins test docs
  • jenkins render docs
  • jenkins test ceph-volume all
  • jenkins test ceph-volume tox
  • jenkins test windows

Signed-off-by: Igor Fedotov <ifedotov@croit.io>
We can reuse _compact_log_dump_metadata_NF() instead

Signed-off-by: Igor Fedotov <igor.fedotov@croit.io>
Signed-off-by: Igor Fedotov <igor.fedotov@croit.io>
Signed-off-by: Igor Fedotov <igor.fedotov@croit.io>
Signed-off-by: Igor Fedotov <igor.fedotov@croit.io>
+ minor refactoring.

Signed-off-by: Igor Fedotov <igor.fedotov@croit.io>
Contributor

@aclamk aclamk left a comment


Splendid work!
Some comments on possible improvements, but the logic makes sense.

* ASYNC LOG COMPACTION
* that jumps the log write position to the new extent. At this point, the
* old extent(s) won't be written to, and reflect everything to compact.
* New events will be written to the new region that we'll keep.
Contributor


I think we could cut the backward size dependency if we split op_update_inc(ino.1, ...) between the STARTER and META_DUMP blocks.

  SUPERBLOCK:
  - contains ino.1 extents of STARTER

  STARTER (seq 1):
  - op_init
  - op_update_inc(ino.1, extents of META_DUMP)
  - op_jump(2, sizeof(STARTER))
  - unused space

  META_DUMP (seq 2):
  - ... dump_metadata
  - op_update_inc(ino.1, extents of LOG_CONTINUATION)
  - op_jump(LOG_CONT.seq, sizeof(STARTER) + sizeof(META_DUMP))
  - unused space

  LOG_CONT (seq cont):
  - the continuation of previous log

The rationale is to have the initial log fnode after compaction small
enough to fit into the 4K superblock. Without that, compacted metadata might
require an fnode longer than 4K, which goes beyond the existing 4K
superblock; BlueFS asserts in this case for now.
Hence the resulting log allocation layout is:
- superblock (4K) keeps the initial log fnode, which refers to:
  op_init, op_update_inc(log), op_jump(next seq)
- the updated log fnode built from the superblock plus the above op_update_inc refers to:
  compacted meta (a bunch of op_update and others)
- *
- more op_update_inc(log) to follow if log is extended
- *
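The superblock-size constraint behind this proposal can be illustrated with a toy model. All names and sizes below are illustrative assumptions, not BlueFS's actual fnode encoding:

```python
# Illustrative sketch (not BlueFS's real encoding): why the initial log
# fnode after compaction must stay small enough to fit the 4K superblock.

EXTENT_ENCODED_SIZE = 16   # hypothetical bytes per (offset, length) pair
FNODE_HEADER_SIZE = 64     # hypothetical fixed fnode overhead
SUPERBLOCK_SIZE = 4096

def fnode_encoded_size(num_extents: int) -> int:
    """Rough encoded size of a log fnode with the given extent count."""
    return FNODE_HEADER_SIZE + num_extents * EXTENT_ENCODED_SIZE

def fits_superblock(num_extents: int) -> bool:
    return fnode_encoded_size(num_extents) <= SUPERBLOCK_SIZE

# A small STARTER block needs only a handful of extents -> fits.
print(fits_superblock(4))      # True
# An fnode covering a large, fragmented metadata dump may not fit;
# with 4K allocation units a dump can easily span hundreds of extents.
print(fits_superblock(500))    # False
```

This is why the scheme keeps only the STARTER extents in the superblock and chains the META_DUMP and continuation extents via op_update_inc records inside the log itself.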

Signed-off-by: Igor Fedotov <igor.fedotov@croit.io>
Signed-off-by: Igor Fedotov <igor.fedotov@croit.io>
Signed-off-by: Igor Fedotov <igor.fedotov@croit.io>
This includes a finer position specification during replay
and logging the read size in hex.

Signed-off-by: Igor Fedotov <igor.fedotov@croit.io>
This effectively enables 4K allocation units for BlueFS,
but doesn't turn them on by default for the sake of performance.
Using a main device that lacks enough large contiguous free extents
will trigger them, though.

Signed-off-by: Igor Fedotov <igor.fedotov@croit.io>
Signed-off-by: Igor Fedotov <igor.fedotov@croit.io>
When using bluefs_shared_alloc_size one might get into a long-lasting state
in which chunks of that size are no longer available and a fallback to the
shared device's min alloc size occurs. The introduced cooldown is intended to
prevent repetitive allocation attempts with bluefs_shared_alloc_size for
a while. The rationale is to eliminate the performance penalty these failing
attempts might cause.

Signed-off-by: Igor Fedotov <igor.fedotov@croit.io>
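The cooldown idea in this commit can be sketched as follows. This is a toy model; the class, the timer source, and the 180 s period are assumptions for illustration, not Ceph's actual implementation:

```python
# Hypothetical sketch of an allocation-retry cooldown: after a failed
# attempt to allocate a large (bluefs_shared_alloc_size) chunk, skip
# further large-chunk attempts for a cooldown period and go straight
# to the small (min alloc size) fallback.

class AllocCooldown:
    def __init__(self, cooldown_secs: float):
        self.cooldown_secs = cooldown_secs
        self.last_failure = None  # timestamp of last failed large alloc

    def should_try_large(self, now: float) -> bool:
        """True if enough time has passed since the last failure."""
        if self.last_failure is None:
            return True
        return (now - self.last_failure) >= self.cooldown_secs

    def record_failure(self, now: float) -> None:
        self.last_failure = now

cd = AllocCooldown(cooldown_secs=180.0)
print(cd.should_try_large(now=0.0))    # True: no failure seen yet
cd.record_failure(now=0.0)
print(cd.should_try_large(now=60.0))   # False: still cooling down
print(cd.should_try_large(now=240.0))  # True: cooldown elapsed
```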
@ifed01 ifed01 force-pushed the wip-ifed-small-chunk-bluefs branch from 6248340 to e52bcc8 Compare November 16, 2022 23:48
@ifed01
Contributor Author

ifed01 commented Nov 17, 2022

jenkins test make check

@ifed01
Contributor Author

ifed01 commented Nov 17, 2022

jenkins test make check

1 similar comment
@ifed01
Contributor Author

ifed01 commented Nov 17, 2022

jenkins test make check

@BenoitKnecht
Contributor

We've been hitting this issue on Pacific while increasing pg_num in one of our pools, causing several OSDs to crash, some with as little as 70% usage.

@ifed01 Are you planning on backporting this PR to Pacific? Any chance it would make it into 16.2.11?

@ifed01
Contributor Author

ifed01 commented Jan 11, 2023

> We've been hitting this issue on Pacific while increasing pg_num in one of our pools, causing several OSDs to crash, some with as little as 70% usage.
>
> @ifed01 Are you planning on backporting this PR to Pacific? Any chance it would make it into 16.2.11?

Yes, backports are planned, but I doubt this will make it into 16.2.11, which is to be released soon.

@BenoitKnecht
Contributor

Alright, thanks for the info!

I tried to cherry-pick your commits on top of pacific, but I'm really not sure I managed to resolve conflicts correctly, so if you're already working on a backport on a different branch, I'd be very much interested in checking it out. 🙂

I'm trying to compile a version of ceph-osd with this fix in case we get crashes again, because otherwise the only workaround would be to copy the DB to a separate device, which is rather cumbersome.

I'm assuming that with this fix, we would manage to bring the OSD back up; it would allocate what it needs on BlueFS using 4KiB blocks, and then even if we downgraded to the mainline Pacific version, it would still be able to start as long as it doesn't need to allocate more space on BlueFS. Is that correct, or would it not be able to cope with the smaller blocks?

@ifed01
Contributor Author

ifed01 commented Jan 12, 2023

> Alright, thanks for the info!
>
> I tried to cherry-pick your commits on top of pacific, but I'm really not sure I managed to resolve conflicts correctly, so if you're already working on a backport on a different branch, I'd be very much interested in checking it out. 🙂

I tried to do a backport yesterday, but it looks like this needs some additional non-trivial PRs to be backported to Pacific first. Hence I've postponed it until this one is merged into the main branch.

> I'm trying to compile a version of ceph-osd with this fix in case we get crashes again, because otherwise the only workaround would be to copy the DB to a separate device, which is rather cumbersome.
>
> I'm assuming that with this fix, we would manage to bring the OSD back up, it would allocate what it needs on BlueFS using the 4KiB blocks, and then even if we downgraded to the mainline Pacific version, it would still be able to start as long as it doesn't need to allocate more space on BlueFS. Is that correct, or would it not be able to cope with the smaller blocks?

I wouldn't recommend downgrading to Ceph versions that don't support 4K BlueFS once you have brought that support in.
Instead of all the tricks with unofficial backports, you might want to set bluefs_shared_alloc_size to 32K to recover broken OSDs. We have used this workaround a few times and it looks safe enough (which isn't the case when setting it to 4K!). But generally this won't fix the issue permanently, and you may face it again after a while if space fragmentation goes on...

@ljflores
Member

Rados suite review: https://pulpito.ceph.com/?branch=wip-yuri2-testing-2023-01-23-0928

Failures, unrelated:
1. https://tracker.ceph.com/issues/58585 -- new tracker
2. https://tracker.ceph.com/issues/58256 -- fix merged to latest main
3. https://tracker.ceph.com/issues/58475
4. https://tracker.ceph.com/issues/57754 -- closed
5. https://tracker.ceph.com/issues/57546 -- fix is in testing

Details:
1. rook: failed to pull kubelet image - Ceph - Orchestrator
2. ObjectStore/StoreTestSpecificAUSize.SpilloverTest/2: Expected: (logger->get(l_bluefs_slow_used_bytes)) >= (16 * 1024 * 1024), actual: 0 vs 16777216 - Ceph - RADOS
3. test_dashboard_e2e.sh: Conflicting peer dependency: postcss@8.4.21 - Ceph - Mgr - Dashboard
4. test_envlibrados_for_rocksdb.sh: update-alternatives: error: alternative path /usr/bin/gcc-11 doesn't exist - Ceph - RADOS
5. rados/thrash-erasure-code: wait_for_recovery timeout due to "active+clean+remapped+laggy" pgs - Ceph - RADOS

@yuriw yuriw merged commit c0309cf into ceph:main Jan 25, 2023
@NUABO
Contributor

NUABO commented May 29, 2023

Hi @ifed01, I would like to know whether the problem this PR fixes can also appear in Ceph 15.2.x?

@ifed01
Contributor Author

ifed01 commented May 29, 2023

> Hi @ifed01, I would like to know whether the problem this PR fixes can also appear in Ceph 15.2.x?

Hi @NUABO - no, there are no plans to backport this fix to Octopus, as it's at end of life now

@NUABO
Contributor

NUABO commented May 29, 2023

@ifed01 Thank you very much for your reply. If I use Ceph 15, it means this bug will always exist and I can only adapt this PR by myself.

@ifed01
Contributor Author

ifed01 commented May 29, 2023

> @ifed01 Thank you very much for your reply. If I use Ceph 15, it means this bug will always exist and I can only adapt this PR by myself.

Yes, that's true. Or upgrade to Pacific. The bug is not that frequent, though...

@NUABO
Contributor

NUABO commented May 29, 2023

> @ifed01 Thank you very much for your reply. If I use Ceph 15, it means this bug will always exist and I can only adapt this PR by myself.
>
> Yes, that's true. Or upgrade to Pacific. The bug is not that frequent, though...

thank you 😀

@ifed01
Contributor Author

ifed01 commented May 29, 2023

> @ifed01 Thank you very much for your reply. If I use Ceph 15, it means this bug will always exist and I can only adapt this PR by myself.
>
> Yes, that's true. Or upgrade to Pacific. The bug is not that frequent, though...
>
> thank you 😀

@NUABO - just realized that Pacific lacks 4K BlueFS support as well. You'll need the Quincy release to get it on board. Sorry for the misleading advice.

@NUABO
Contributor

NUABO commented May 29, 2023

> @ifed01 Thank you very much for your reply. If I use Ceph 15, it means this bug will always exist and I can only adapt this PR by myself.
>
> Yes, that's true. Or upgrade to Pacific. The bug is not that frequent, though...
>
> thank you 😀
>
> @NUABO - just realized that Pacific lacks 4K BlueFS support as well. You'll need the Quincy release to get it on board. Sorry for the misleading advice.

thanks for your friendly reminder (:

@Badb0yBadb0y

Hi,
After updating to Quincy 17.2.7 on Ubuntu, it still shows 64K:

ceph daemon /var/run/ceph/ceph-osd.257.asok config get "bluefs_shared_alloc_size"
{
    "bluefs_shared_alloc_size": "65536"
}

Is it intended?

@ifed01
Contributor Author

ifed01 commented Nov 27, 2024

> Hi, After updating to Quincy 17.2.7 on Ubuntu, it still shows 64K:
>
>     ceph daemon /var/run/ceph/ceph-osd.257.asok config get "bluefs_shared_alloc_size"
>     {
>         "bluefs_shared_alloc_size": "65536"
>     }
>
> Is it intended?

Yes, that's correct. By default, for the sake of performance, BlueFS still uses a 64K alloc unit on the shared device.
4K units are used only when space is very fragmented and there are not enough free contiguous 64K chunks.
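The fallback behavior described here can be sketched as a toy model. The function, names, and the "usable large space" heuristic below are illustrative assumptions, not actual BlueFS allocator code:

```python
# Illustrative sketch (not actual BlueFS code) of preferring the 64K
# allocation unit and falling back to 4K only when the free space is
# too fragmented to supply contiguous 64K chunks.

LARGE_AU = 64 * 1024
SMALL_AU = 4 * 1024

def pick_alloc_unit(free_extent_lengths: list, want: int) -> int:
    """Return the allocation unit to use for a request of `want` bytes."""
    # How much of the free space can be carved into whole LARGE_AU chunks?
    usable_large = sum(l - l % LARGE_AU for l in free_extent_lengths)
    if usable_large >= want:
        return LARGE_AU
    return SMALL_AU  # fragmented space: fall back to 4K units

# Plenty of large contiguous space -> 64K units.
print(pick_alloc_unit([1 << 20, 1 << 20], want=256 * 1024))  # 65536
# Heavily fragmented: only sub-64K extents left -> 4K fallback.
print(pick_alloc_unit([48 * 1024] * 10, want=256 * 1024))    # 4096
```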

@Badb0yBadb0y

Badb0yBadb0y commented Nov 27, 2024 via email

@ifed01
Contributor Author

ifed01 commented Nov 27, 2024

> So is lowering from 64K to 4K done by Ceph if needed, like an automated mechanism, or is it the admin's duty to fix it?

This is done by Ceph automatically.

8 participants