
bluefs: bluefs alloc unit should only shrink #57015

Merged

yuriw merged 1 commit into ceph:main from liangmingyuanneo:wip-bluefs-max-alloc-size on Jun 5, 2024

Conversation

@liangmingyuanneo
Contributor

The alloc unit is already forbidden from changing for BlueStore; likewise, it must be forbidden from increasing in BlueFS. Otherwise it can lead to a coredump or corrupted data. Taking the bitmap allocator as an example, the problem shows up in two ways:
a) In BitmapAllocator::init_rm_free(offset, length), (offset + length) must be greater than the rounded offset offs. When get_min_alloc_size() grows, this can no longer be guaranteed.
b) Even if init_rm_free() happens to succeed, then during RocksDB compaction, when release() is called, releasing a small extent may cause larger extents to be released to the bitmap. As a result, the RocksDB data is corrupted and the OSD cannot boot again.

https://tracker.ceph.com/issues/65600

Signed-off-by: Mingyuan Liang liangmingyuan@baidu.com

@liangmingyuanneo force-pushed the wip-bluefs-max-alloc-size branch 2 times, most recently from c75118d to f1372de on April 25, 2024 at 13:46
@yuriw
Contributor

yuriw commented May 3, 2024

This PR is under test in https://tracker.ceph.com/issues/65797.

Contributor

@Matan-B left a comment

2024-05-16T02:27:59.753 INFO:tasks.workunit.client.0.smithi104.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/osd/osd-bluefs-volume-ops.sh:162: TEST_bluestore:  kill 121010
2024-05-16T02:27:59.753 INFO:tasks.workunit.client.0.smithi104.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/osd/osd-bluefs-volume-ops.sh:162: TEST_bluestore:  sleep 1
2024-05-16T02:28:00.754 INFO:tasks.workunit.client.0.smithi104.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/osd/osd-bluefs-volume-ops.sh:162: TEST_bluestore:  kill 121010
2024-05-16T02:28:00.755 INFO:tasks.workunit.client.0.smithi104.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/osd/osd-bluefs-volume-ops.sh: line 162: kill: (121010) - No such process
2024-05-16T02:28:00.755 INFO:tasks.workunit.client.0.smithi104.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/osd/osd-bluefs-volume-ops.sh:163: TEST_bluestore:  ceph osd down 3
2024-05-16T02:28:01.275 INFO:tasks.workunit.client.0.smithi104.stderr:osd.3 is already down.
2024-05-16T02:28:01.291 INFO:tasks.workunit.client.0.smithi104.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/osd/osd-bluefs-volume-ops.sh:166: TEST_bluestore:  ceph-bluestore-tool --path td/osd-bluefs-volume-ops/0 fsck
2024-05-16T02:28:02.596 INFO:tasks.workunit.client.0.smithi104.stdout:fsck success
2024-05-16T02:28:02.617 INFO:tasks.workunit.client.0.smithi104.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/osd/osd-bluefs-volume-ops.sh:168: TEST_bluestore:  dd if=/dev/zero of=td/osd-bluefs-volume-ops/0/wal count=512 bs=1M
2024-05-16T02:28:02.922 INFO:tasks.workunit.client.0.smithi104.stderr:512+0 records in
2024-05-16T02:28:02.922 INFO:tasks.workunit.client.0.smithi104.stderr:512+0 records out
2024-05-16T02:28:02.922 INFO:tasks.workunit.client.0.smithi104.stderr:536870912 bytes (537 MB, 512 MiB) copied, 0.302297 s, 1.8 GB/s
2024-05-16T02:28:02.922 INFO:tasks.workunit.client.0.smithi104.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/osd/osd-bluefs-volume-ops.sh:169: TEST_bluestore:  ceph-bluestore-tool --path td/osd-bluefs-volume-ops/0 --dev-target td/osd-bluefs-volume-ops/0/wal --command bluefs-bdev-new-wal
2024-05-16T02:28:02.945 INFO:tasks.workunit.client.0.smithi104.stdout:inferring bluefs devices from bluestore path
2024-05-16T02:28:07.054 INFO:tasks.workunit.client.0.smithi104.stderr:*** Caught signal (Segmentation fault) **
2024-05-16T02:28:07.054 INFO:tasks.workunit.client.0.smithi104.stderr: in thread 7f08e70feac0 thread_name:ceph-bluestore-
2024-05-16T02:28:07.056 INFO:tasks.workunit.client.0.smithi104.stderr: ceph version 19.0.0-3728-g6cd0f801 (6cd0f8013e2dea00c3a29a0d8b10656b132d7c80) squid (dev)
2024-05-16T02:28:07.056 INFO:tasks.workunit.client.0.smithi104.stderr: 1: /lib/x86_64-linux-gnu/libc.so.6(+0x42520) [0x7f08e8303520]
2024-05-16T02:28:07.056 INFO:tasks.workunit.client.0.smithi104.stderr: 2: (BlueFS::_write_super(int)+0xd5) [0x55b54ce483a5]
2024-05-16T02:28:07.056 INFO:tasks.workunit.client.0.smithi104.stderr: 3: (BlueFS::_init_alloc()+0x3aa) [0x55b54ce49d3a]
2024-05-16T02:28:07.057 INFO:tasks.workunit.client.0.smithi104.stderr: 4: (BlueFS::mount()+0xd7) [0x55b54ce4c477]
2024-05-16T02:28:07.057 INFO:tasks.workunit.client.0.smithi104.stderr: 5: (BlueStore::add_new_bluefs_device(int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0x45b) [0x55b54cec6c6b]
2024-05-16T02:28:07.057 INFO:tasks.workunit.client.0.smithi104.stderr: 6: main()
2024-05-16T02:28:07.057 INFO:tasks.workunit.client.0.smithi104.stderr: 7: /lib/x86_64-linux-gnu/libc.so.6(+0x29d90) [0x7f08e82ead90]
2024-05-16T02:28:07.057 INFO:tasks.workunit.client.0.smithi104.stderr: 8: __libc_start_main()
2024-05-16T02:28:07.057 INFO:tasks.workunit.client.0.smithi104.stderr: 9: _start()
2024-05-16T02:28:07.057 INFO:tasks.workunit.client.0.smithi104.stderr:2024-05-16T02:28:07.075+0000 7f08e70feac0 -1 *** Caught signal (Segmentation fault) **
2024-05-16T02:28:07.057 INFO:tasks.workunit.client.0.smithi104.stderr: in thread 7f08e70feac0 thread_name:ceph-bluestore-
2024-05-16T02:28:07.057 INFO:tasks.workunit.client.0.smithi104.stderr:

https://pulpito.ceph.com/yuriw-2024-05-15_21:09:29-rados-wip-yuri5-testing-2024-05-15-0804-distro-default-smithi/7707982

@liangmingyuanneo, Looks related

@liangmingyuanneo
Contributor Author

liangmingyuanneo commented May 24, 2024

@Matan-B Thanks, I will check it. Could I get the OSD log?

The alloc unit is already forbidden from changing for BlueStore; likewise,
it must be forbidden from increasing in BlueFS. Otherwise it can lead to a
coredump or corrupted data. Taking the bitmap allocator as an example, the
problem shows up in two ways:
a) In BitmapAllocator::init_rm_free(offset, length),
(offset + length) must be greater than the rounded offset offs. When
get_min_alloc_size() grows, this can no longer be guaranteed.
b) Even if init_rm_free() happens to succeed, then during RocksDB
compaction, when release() is called, releasing a small extent may cause
larger extents to be released to the bitmap. As a result, the RocksDB data
is corrupted and the OSD cannot boot again.

Signed-off-by: Mingyuan Liang <liangmingyuan@baidu.com>
@liangmingyuanneo force-pushed the wip-bluefs-max-alloc-size branch from f1372de to eed0312 on May 24, 2024 at 06:40
@liangmingyuanneo
Contributor Author

liangmingyuanneo commented May 24, 2024

@ifed01 When a new WAL is added, there may be no new DB device for the OSD. Please review again at your convenience, thank you.

Contributor

@ifed01 left a comment

LGTM

_write_super(BDEV_DB);
}

alloc_size_changed = false;
Contributor

IMO it's better to move this under the if statement above.

Contributor Author

Sorry for the late reply. Yes, I also thought about moving it under the if statement. But because the NEWWAL branch may not be executed, the original _write_super() is still needed before the NEWWAL branch, so I placed _write_super() after the if statement. Admittedly, this makes the code harder to read. If you think moving it inside the if statement is better, I will do it.

@ifed01
Contributor

ifed01 commented May 24, 2024

jenkins test make check

@ifed01
Contributor

ifed01 commented May 24, 2024

jenkins test api

@ifed01
Contributor

ifed01 commented May 24, 2024

jenkins test windows

@Matan-B Matan-B self-requested a review May 27, 2024 08:20
@Matan-B Matan-B dismissed their stale review May 27, 2024 08:21

addressed

@Matan-B Matan-B removed their request for review May 27, 2024 08:21
@aclamk
Contributor

aclamk commented Jun 4, 2024

jenkins test windows

@yuriw yuriw merged commit fa83b90 into ceph:main Jun 5, 2024
ifed01 added a commit to ifed01/ceph that referenced this pull request Mar 7, 2025
Additionally, this locks the tail of DB/WAL volumes which is unaligned to the configured (not minimal!) BlueFS allocation unit.

Effectively replaces changes from
ceph#57015

Fixes: https://tracker.ceph.com/issues/68772

Signed-off-by: Igor Fedotov <igor.fedotov@croit.io>

6 participants