os/bluestore: Blazingly fast new BlueFS WAL disk format 🚀 #56927
Conversation
src/os/bluestore/BlueFS.cc (outdated)
    vselector->get_hint_by_dir(dirname);
    vselector->add_usage(file->vselector_hint, file->fnode);

    if (boost::algorithm::ends_with(filename, ".log")) {
`ends_with()` is now part of C++ (since C++20, which we are using).
src/os/bluestore/BlueFS.cc (outdated)
    bufferlist t;
    t.substr_of(buf.bl, flush_offset - buf.bl_off, sizeof(File::WALFlush::WALLength));
    dout(30) << "length dump\n";
    t.hexdump(*_dout);
are we OK with log lines like this one, without the preamble?
      }
    }

    int64_t BlueFS::_read_wal(
A style/preference comment: this is a pretty long function. Can it be broken down into logical, named sub-functions?
I might abstract some parts that are reused in different places, but IMHO I prefer reading top to bottom what this function does instead of jumping to the definitions of sub-functions.
@pereman2 I was very excited about your PR, and did some librbd fio testing on mako over the weekend. These are relatively fast NVMe drives, so I'm not sure how limited they are by fsync (which should be a no-op for the drive, but still requires the syscall). I suspect we may see better numbers as other bottlenecks are eliminated. It would be very interesting, however, to see how this performs on consumer grade flash and HDDs. Here are the results:

[chart: Single NVMe OSD 4K Random Write (1X)]
[chart: 30 NVMe OSD 4K Random Write (3X)]
Oh cool! I wonder what the difference between the NVMes we used is. If you are up for it, I will update the code to remove some obvious inefficiencies that might have an effect on CPU, and you can run it again to see whether that fixed anything. Nevertheless, I will attach my results again after that fix.
@markhpc I'm curious. How many times did you run the 4k randwrite benchmark? Did you pre-fill the cluster to simulate some real data, or was it run on a real cluster?

    iterations = 8
    bssplit = "--bssplit=4k/16:8k/10:12k/9:16k/8:20k/7:24k/7:28k/6:32k/6:36k/5:40k/5:44k/4:48k/4:52k/4:56k/3:60k/3:64k/3"
    for run in range(iterations):
        for t in ['randrw', 'randwrite', 'randread', 'rw']:
            # do fio test with "t"
@markhpc I got new results! Looks like randwrites are not getting any better somehow, but randrw does get better:

Yep, these are prefilled rbd volumes. Only ran the set of tests for one iteration this time: 4k and 4m randreads and randwrites for 5 minutes. It's pretty easy to run repeated tests, though, if we want.

Was the NVMe drive pre-filled? That's actually going to matter more than the rbd volumes.

Yes!

Naw, this was a quick test. I was curious what kind of syscall overhead reduction we might see here versus other previous tests where I can see some overhead for 4k random writes. We could certainly do tests at larger fill values, though.
Force-pushed from 72dc822 to 7e436c8
src/os/bluestore/bluefs_types.h (outdated)
    enum bluefs_node_type {
      LEGACY = 0,
      WAL_V2 = 1,
      NODE_TYPE_END = 0x100,
NODE_TYPE_END is 2 currently.
We can move it up if more types appear.
src/os/bluestore/BlueFS.cc (outdated)
    while (flush_offset < file->fnode.allocated) {
      // read first part of wal flush
      bufferlist bl;
      _read(h, flush_offset, sizeof(File::WALFlush::WALLength), &bl, nullptr);
This is bad. FileReader will redirect reading to _read_wal, which is already trying to cut the envelope out.
Here I treat _read as reading raw data; fnode.size includes all the extra envelope data. BlueFS::read is the one that redirects to _read or _read_wal.
Force-pushed from 9cbc2a3 to 51fefc7
jenkins test make check
src/os/bluestore/BlueFS.cc (outdated)
    // Ensure no dangling wal v2 files are inside transactions.
    _compact_log_sync_LNF_LD();

    _write_super(BDEV_DB, true);
Ahg, should've been a compile-time error. Thanks for pointing it out.
src/os/bluestore/BlueFS.cc (outdated)
    // Ensure no dangling wal v2 files are inside transactions.
    _compact_log_sync_LNF_LD();

    _write_super(BDEV_DB, 2);
This is a downgrade to 1, so the version shouldn't be 2 then?
Not my brightest day.
jenkins test make

jenkins test make check

jenkins test make check
This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved |
Signed-off-by: Pere Diaz Bou <pere-altea@hotmail.com>
This pull request has been automatically marked as stale because it has not had any activity for 60 days. It will be closed if no further activity occurs for another 30 days. |
This pull request has been automatically closed because there has been no activity for 90 days. Please feel free to reopen this pull request (or open a new one) if the proposed change is still appropriate. Thank you for your contribution! |


Problem
BlueFS writes are expensive because every `BlueFS::fsync` invokes `disk->flush` twice: once for the file data and once more for the BlueFS log metadata. We can avoid this duality by merging metadata and data into the same envelope. This PR delivers on that front.

New format
Previously, a BlueFS log transaction would hold a `file_update_inc` that includes the increase in file size; that way we know the length of the data the file holds in its own extents. Every write would therefore perform `fnode->size += delta` and consequently mark the file as dirty.

The new format is basically an envelope that holds both data and delta metadata, plus some error detection:

- Flush length (u64): the length of the data in the envelope
- Payload (flush length bytes): the data of the WAL write that was requested
- Marker (u64): the id of the file, used for error detection (this is in talks to change to a CRC or something else)

With this new format, for every fsync we create this envelope and flush it without marking the file as dirty, therefore not generating the log disk flush. This brought huge performance benefits, which we will look at next.
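As a rough sketch of the layout described above (not the PR's actual code, which works on BlueFS bufferlists; the function name `encode_wal_envelope` is hypothetical), an envelope is the length, the payload, and the file marker concatenated:

```cpp
#include <cstdint>
#include <cstring>
#include <string>
#include <vector>

// Hypothetical sketch of the WAL v2 envelope layout described above:
//   [flush length (u64)] [payload (flush length bytes)] [marker (u64)]
// Values are copied in native byte order for illustration only.
std::vector<uint8_t> encode_wal_envelope(const std::string& payload,
                                         uint64_t file_marker) {
  std::vector<uint8_t> out(sizeof(uint64_t) + payload.size() + sizeof(uint64_t));
  uint8_t* p = out.data();
  uint64_t len = payload.size();
  std::memcpy(p, &len, sizeof(len));                  // flush length
  p += sizeof(len);
  std::memcpy(p, payload.data(), payload.size());     // payload (WAL data)
  p += payload.size();
  std::memcpy(p, &file_marker, sizeof(file_marker));  // marker for error detection
  return out;
}
```

The key point is that the envelope is self-describing: the size delta that used to live in the BlueFS log transaction now travels with the data itself, so fsync needs only one disk flush.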
EOF tricks
A "huge" problem is: how do we know we cannot read more data from the file? Either we reach the end of the allocated extents, or, as done here, we append some zeroes after the envelope so that the next flush length reads as zero; that way we can check whether the next flush has not yet been flushed to disk. This basically works like a null-terminated string.
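Under the same simplifying assumptions as the envelope sketch (a raw byte buffer instead of bufferlists, native byte order, and a hypothetical helper name `replay_wal_envelopes`), the zero-terminator check during replay could look like this:

```cpp
#include <cstdint>
#include <cstring>
#include <string>
#include <vector>

// Hypothetical sketch of WAL replay: walk a buffer of concatenated
// envelopes ([len u64][payload][marker u64]) and stop at a zero length,
// which acts like a null terminator for the WAL stream.
std::vector<std::string> replay_wal_envelopes(const std::vector<uint8_t>& buf,
                                              uint64_t expected_marker) {
  std::vector<std::string> flushes;
  size_t off = 0;
  while (off + sizeof(uint64_t) <= buf.size()) {
    uint64_t len = 0;
    std::memcpy(&len, buf.data() + off, sizeof(len));
    if (len == 0) {
      break;  // next flush was never written: end of valid data
    }
    if (off + sizeof(uint64_t) + len + sizeof(uint64_t) > buf.size()) {
      break;  // truncated envelope: stop replay
    }
    const uint8_t* payload = buf.data() + off + sizeof(uint64_t);
    uint64_t marker = 0;
    std::memcpy(&marker, payload + len, sizeof(marker));
    if (marker != expected_marker) {
      break;  // marker mismatch: treat as corruption / end of stream
    }
    flushes.emplace_back(reinterpret_cast<const char*>(payload), len);
    off += sizeof(uint64_t) + len + sizeof(uint64_t);
  }
  return flushes;
}
```

The trailing zeroes guarantee that a partially written next envelope is never mistaken for valid data, at the cost of a small amount of write amplification per flush.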
Preliminary results:
I ran multiple fio jobs covering different workloads: randrw, random writes, random reads, etc.
Comparing flush tracking with a simple counter versus a vector of flush extents, we saw a significant performance degradation with the vector; it might be worth using the vector only during replay and not storing flush extents during the run:

Contribution Guidelines
To sign and title your commits, please refer to Submitting Patches to Ceph.
If you are submitting a fix for a stable branch (e.g. "quincy"), please refer to Submitting Patches to Ceph - Backports for the proper workflow.
When filling out the below checklist, you may click boxes directly in the GitHub web UI. When entering or editing the entire PR message in the GitHub web UI editor, you may also select a checklist item by adding an `x` between the brackets: `[x]`. Spaces and capitalization matter when checking off items this way.

Checklist
Show available Jenkins commands
- jenkins retest this please
- jenkins test classic perf
- jenkins test crimson perf
- jenkins test signed
- jenkins test make check
- jenkins test make check arm64
- jenkins test submodules
- jenkins test dashboard
- jenkins test dashboard cephadm
- jenkins test api
- jenkins test docs
- jenkins render docs
- jenkins test ceph-volume all
- jenkins test ceph-volume tox
- jenkins test windows
- jenkins test rook e2e