
os/bluestore: Blazingly fast new BlueFS WAL disk format 🚀 #56927

Closed
pereman2 wants to merge 4 commits into ceph:main from pereman2:wal-fsync

Conversation

@pereman2
Contributor

@pereman2 pereman2 commented Apr 16, 2024

problem

BlueFS writes are expensive because every BlueFS::fsync invokes disk->flush twice: once for the file data and once for the BlueFS log metadata. We can avoid this duplication by merging the metadata and the data into the same envelope. This PR delivers on that front.

New format

Previously, a BlueFS log transaction would hold a file_update_inc that records the increase in file size; that is how we know how much data the file holds in its own extents. Therefore, every write would perform fnode->size += delta and consequently mark the file as dirty.

The new format is essentially an envelope that holds both the data and the delta metadata, plus some error-detection fields:

  • Flush length (u64) -> the length of the data in the envelope
  • Payload (flush length bytes) -> the data of the requested WAL write
  • Marker (u64) -> id of the file, used for error detection (there is ongoing discussion about changing this to a CRC or something else)

With this new format, for every fsync we create this envelope and flush it without marking the file as dirty, so no log disk flush is generated. This yields large performance benefits, which we will look at next; a sketch of the envelope layout and write path follows.
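
The following is a minimal, self-contained sketch of the envelope write path described above, in plain C++ with little-endian byte packing. The struct name WalEnvelope and the byte vector standing in for the device are hypothetical; the actual PR encodes through Ceph's bufferlist machinery and the File::WALFlush types, which are not shown here.

#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical stand-in for the on-disk envelope described above.
struct WalEnvelope {
  uint64_t flush_length;             // length of the payload that follows
  std::vector<uint8_t> payload;      // the WAL data being fsync'ed
  uint64_t marker;                   // file id, used for error detection

  // Append the envelope to an in-memory buffer standing in for the device.
  void append_to(std::vector<uint8_t>& dev) const {
    auto put_u64 = [&dev](uint64_t v) {
      uint8_t b[8];
      std::memcpy(b, &v, sizeof(b)); // assumes a little-endian host
      dev.insert(dev.end(), b, b + 8);
    };
    put_u64(flush_length);
    dev.insert(dev.end(), payload.begin(), payload.end());
    put_u64(marker);
    // Trailing zeros overwrite the slot where the next flush_length would
    // land, acting as the end-of-data terminator described under "EOF tricks".
    for (int i = 0; i < 8; ++i)
      dev.push_back(0);
  }
};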

EOF tricks

A "huge" problem is: how do we know we cannot read more data from the file. Either we reach end of allocated extents or... in this case we append some 0s to the evenlope so that next flush_length is overwritten and therefore we can check if next flush is not yet flushed to disk. This basically works like a null terminated string.

Preliminary results:

I ran multiple fio jobs covering different workloads: randrw, random writes, random reads, etc.

[fio result charts]

Comparing a simple flush counter against a vector of flush extents showed a significant performance degradation with the vector; it may be worth using the vector only during replay and not storing flush extents during the run:
[comparison chart]

Contribution Guidelines

  • To sign and title your commits, please refer to Submitting Patches to Ceph.

  • If you are submitting a fix for a stable branch (e.g. "quincy"), please refer to Submitting Patches to Ceph - Backports for the proper workflow.

  • When filling out the below checklist, you may click boxes directly in the GitHub web UI. When entering or editing the entire PR message in the GitHub web UI editor, you may also select a checklist item by adding an x between the brackets: [x]. Spaces and capitalization matter when checking off items this way.

Checklist

  • Tracker (select at least one)
    • References tracker ticket
    • Very recent bug; references commit where it was introduced
    • New feature (ticket optional)
    • Doc update (no ticket needed)
    • Code cleanup (no ticket needed)
  • Component impact
    • Affects Dashboard, opened tracker ticket
    • Affects Orchestrator, opened tracker ticket
    • No impact that needs to be tracked
  • Documentation (select at least one)
    • Updates relevant documentation
    • No doc update is appropriate
  • Tests (select at least one)
Show available Jenkins commands
  • jenkins retest this please
  • jenkins test classic perf
  • jenkins test crimson perf
  • jenkins test signed
  • jenkins test make check
  • jenkins test make check arm64
  • jenkins test submodules
  • jenkins test dashboard
  • jenkins test dashboard cephadm
  • jenkins test api
  • jenkins test docs
  • jenkins render docs
  • jenkins test ceph-volume all
  • jenkins test ceph-volume tox
  • jenkins test windows
  • jenkins test rook e2e

@pereman2 pereman2 requested review from aclamk and ifed01 April 16, 2024 16:34
@pereman2 pereman2 changed the title from "os/bluestore: new BlueFS WAL disk format" to "os/bluestore: Blazingly fast new BlueFS WAL disk format 🚀" Apr 17, 2024
vselector->get_hint_by_dir(dirname);
vselector->add_usage(file->vselector_hint, file->fnode);

if (boost::algorithm::ends_with(filename, ".log")) {
Contributor

ends_with() is now part of C++ (since C++20, which we are using)

bufferlist t;
t.substr_of(buf.bl, flush_offset - buf.bl_off, sizeof(File::WALFlush::WALLength));
dout(30) << "length dump\n";
t.hexdump(*_dout);
Contributor

are we OK with log lines like this one, without the preamble?

}
}

int64_t BlueFS::_read_wal(
Contributor

a style/pref comment: this is a pretty long function. Can it be broken down into logical, named, sub-functions?

Contributor Author

I might abstract some parts that are reused in different places, but imho I prefer reading top to bottom what this function does instead of jumping to the definitions of sub-functions.

@markhpc
Member

markhpc commented Apr 22, 2024

@pereman2 I was very excited about your PR, and did some librbd fio testing on mako over the weekend. These are relatively fast NVMe drives, so I'm not sure how limited they are by fsync (which should be a no-op for the drive, but still requires the syscall). I suspect we may see better numbers as other bottlenecks are eliminated. It would be very interesting, however, to see how this performs on consumer-grade flash and HDDs.

Here are the results:

Single NVMe OSD 4K Random Write (1X)

Ceph Version    IOPS    Avg Latency (ms)    99.9% Latency (ms)    ceph-osd CPU Usage %
v18.2.2         94030   16.49               301.73                1508
main            94445   16.58               333.71                1538
main + #56927   97048   16.02               321.56                1561

30 NVMe OSD 4K Random Write (3X)

Ceph Version    IOPS     Avg Latency (ms)    99.9% Latency (ms)    ceph-osd CPU Usage % (mako06)
v18.2.2         476575   13.46               385.54                1051
main            465904   13.76               322.02                1068
main + #56927   469424   13.88               410.33                1098

@pereman2
Contributor Author

Here are the results:

Oh cool! I wonder what the difference is between the NVMes we used.

If you are up for it, I will update the code to remove some obvious inefficiencies that might have an effect on CPU, and you can run it again to see whether that fixes anything. In any case, I will attach my results again after that fix.

@pereman2
Contributor Author

pereman2 commented Apr 22, 2024

@markhpc I'm curious. How many times did you run the 4k randwrite benchmark? Did you pre-fill the cluster to simulate some real data, or was it run on a real cluster?
My benchmark basically does:

iterations = 8
bssplit = "--bssplit=4k/16:8k/10:12k/9:16k/8:20k/7:24k/7:28k/6:32k/6:36k/5:40k/5:44k/4:48k/4:52k/4:56k/3:60k/3:64k/3"
for run in range(iterations):
  for t in ['randrw', 'randwrite', 'randread', 'rw']:
     # do fio test with "t"

@pereman2
Contributor Author

@markhpc I got new results! Looks like randwrites are somehow not getting any better, but randrw does get better:

[chart: randwrite, 1 host, 3 OSDs, bssplit]
[chart: randrw, 1 host, 3 OSDs, bssplit]

@markhpc
Member

markhpc commented Apr 22, 2024

@markhpc I'm curious. How many times did you run the 4k randwrite benchmark? Did you pre-fill the cluster to simulate some real data, or was it run on a real cluster? My benchmark basically does:

iterations = 8
bssplit = "--bssplit=4k/16:8k/10:12k/9:16k/8:20k/7:24k/7:28k/6:32k/6:36k/5:40k/5:44k/4:48k/4:52k/4:56k/3:60k/3:64k/3"
for run in range(iterations):
  for t in ['randrw', 'randwrite', 'randread', 'rw']:
     # do fio test with "t"

Yep, these are prefilled rbd volumes. Only ran the set of tests for one iteration this time. 4k and 4m randreads and randwrites for 5 minutes. It's pretty easy to run repeated tests though if we want.

@mheler
Contributor

mheler commented Apr 23, 2024

@markhpc I'm curious. How many times did you run the 4k randwrite benchmark? Did you pre-fill the cluster to simulate some real data, or was it run on a real cluster? My benchmark basically does:

iterations = 8
bssplit = "--bssplit=4k/16:8k/10:12k/9:16k/8:20k/7:24k/7:28k/6:32k/6:36k/5:40k/5:44k/4:48k/4:52k/4:56k/3:60k/3:64k/3"
for run in range(iterations):
  for t in ['randrw', 'randwrite', 'randread', 'rw']:
     # do fio test with "t"

Yep, these are prefilled rbd volumes. Only ran the set of tests for one iteration this time. 4k and 4m randreads and randwrites for 5 minutes. It's pretty easy to run repeated tests though if we want.

Was the NVMe drive pre-filled? That's actually going to matter more than the rbd volumes.

@pereman2
Contributor Author

@markhpc I'm curious. How many times did you run the 4k randwrite benchmark? Did you pre-fill the cluster to simulate some real data, or was it run on a real cluster? My benchmark basically does:

iterations = 8
bssplit = "--bssplit=4k/16:8k/10:12k/9:16k/8:20k/7:24k/7:28k/6:32k/6:36k/5:40k/5:44k/4:48k/4:52k/4:56k/3:60k/3:64k/3"
for run in range(iterations):
  for t in ['randrw', 'randwrite', 'randread', 'rw']:
     # do fio test with "t"

Yep, these are prefilled rbd volumes. Only ran the set of tests for one iteration this time. 4k and 4m randreads and randwrites for 5 minutes. It's pretty easy to run repeated tests though if we want.

Was the NVMe drive pre-filled? That's actually going to matter more than the rbd volumes.

Yes!

@pereman2 pereman2 marked this pull request as ready for review April 29, 2024 14:03
@pereman2 pereman2 requested a review from a team as a code owner April 29, 2024 14:03
@markhpc
Member

markhpc commented May 1, 2024

@markhpc I'm curious. How many times did you run the 4k randwrite benchmark? Did you pre-fill the cluster to simulate some real data, or was it run on a real cluster? My benchmark basically does:

iterations = 8
bssplit = "--bssplit=4k/16:8k/10:12k/9:16k/8:20k/7:24k/7:28k/6:32k/6:36k/5:40k/5:44k/4:48k/4:52k/4:56k/3:60k/3:64k/3"
for run in range(iterations):
  for t in ['randrw', 'randwrite', 'randread', 'rw']:
     # do fio test with "t"

Yep, these are prefilled rbd volumes. Only ran the set of tests for one iteration this time. 4k and 4m randreads and randwrites for 5 minutes. It's pretty easy to run repeated tests though if we want.

Was the NVMe drive pre-filled? That's actually going to matter more than the rbd volumes.

Naw, this was a quick test. I was curious what kind of syscall overhead reduction we might see here compared to previous tests, where I could see some overhead for 4k random writes. We could certainly do tests at larger fill values though.

@pereman2 pereman2 force-pushed the wal-fsync branch 2 times, most recently from 72dc822 to 7e436c8 May 9, 2024 15:57
enum bluefs_node_type {
LEGACY = 0,
WAL_V2 = 1,
NODE_TYPE_END = 0x100,
Contributor

NODE_TYPE_END is 2 currently.
We can move it up if more types appear.

while(flush_offset < file->fnode.allocated) {
// read first part of wal flush
bufferlist bl;
_read(h, flush_offset, sizeof(File::WALFlush::WALLength), &bl, nullptr);
Contributor

This is bad.
FileReader will redirect reads to _read_wal, which is already trying to cut the envelope out.

Contributor Author

Here I treat _read as reading raw data; fnode.size includes all the extra envelope data. BlueFS::read is the one that redirects to _read or _read_wal.

@pereman2 pereman2 force-pushed the wal-fsync branch 2 times, most recently from 9cbc2a3 to 51fefc7 July 31, 2024 10:24
@ifed01
Contributor

ifed01 commented Jul 31, 2024

jenkins test make check

// Ensure no dangling wal v2 files are inside transactions.
_compact_log_sync_LNF_LD();

_write_super(BDEV_DB, true);
Contributor

s/true/1

Contributor Author

ahg, should've been a compile-time error, thanks for pointing it out

// Ensure no dangling wal v2 files are inside transactions.
_compact_log_sync_LNF_LD();

_write_super(BDEV_DB, 2);
Contributor

This is a downgrade to 1, so the version shouldn't be 2 then?

Contributor Author

not my brightest day

@ifed01
Contributor

ifed01 commented Jul 31, 2024

jenkins test make

@ifed01
Contributor

ifed01 commented Jul 31, 2024

jenkins test make check

@aclamk aclamk added the aclamk-testing-nauvoo bluestore testing label Jul 31, 2024
@aclamk
Contributor

aclamk commented Aug 6, 2024

jenkins test make check

@github-actions

github-actions bot commented Aug 7, 2024

This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved

Signed-off-by: Pere Diaz Bou <pere-altea@hotmail.com>
Signed-off-by: Pere Diaz Bou <pere-altea@hotmail.com>
Signed-off-by: Pere Diaz Bou <pere-altea@hotmail.com>
Signed-off-by: Pere Diaz Bou <pere-altea@hotmail.com>
@github-actions

This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved

@github-actions

This pull request has been automatically marked as stale because it has not had any activity for 60 days. It will be closed if no further activity occurs for another 30 days.
If you are a maintainer or core committer, please follow-up on this pull request to identify what steps should be taken by the author to move this proposed change forward.
If you are the author of this pull request, thank you for your proposed contribution. If you believe this change is still appropriate, please ensure that any feedback has been addressed and ask for a code review.

@github-actions github-actions bot added the stale label Nov 24, 2024
@github-actions

This pull request has been automatically closed because there has been no activity for 90 days. Please feel free to reopen this pull request (or open a new one) if the proposed change is still appropriate. Thank you for your contribution!

@github-actions github-actions bot closed this Dec 24, 2024
@aclamk aclamk mentioned this pull request Mar 11, 2025