Skip to content

os/bluestore: Fast WAL for RocksDB#62224

Merged
aclamk merged 18 commits intoceph:mainfrom
aclamk:wip-aclamk-pere-wal-fsync
Apr 9, 2025
Merged

os/bluestore: Fast WAL for RocksDB#62224
aclamk merged 18 commits intoceph:mainfrom
aclamk:wip-aclamk-pere-wal-fsync

Conversation

@aclamk
Copy link
Contributor

@aclamk aclamk commented Mar 11, 2025

The work here is based on #56927.
It implements the design principles showed there, but with a bit different implementation details.

Contribution Guidelines

  • To sign and title your commits, please refer to Submitting Patches to Ceph.

  • If you are submitting a fix for a stable branch (e.g. "quincy"), please refer to Submitting Patches to Ceph - Backports for the proper workflow.

  • When filling out the below checklist, you may click boxes directly in the GitHub web UI. When entering or editing the entire PR message in the GitHub web UI editor, you may also select a checklist item by adding an x between the brackets: [x]. Spaces and capitalization matter when checking off items this way.

Checklist

  • Tracker (select at least one)
    • References tracker ticket
    • Very recent bug; references commit where it was introduced
    • New feature (ticket optional)
    • Doc update (no ticket needed)
    • Code cleanup (no ticket needed)
  • Component impact
    • Affects Dashboard, opened tracker ticket
    • Affects Orchestrator, opened tracker ticket
    • No impact that needs to be tracked
  • Documentation (select at least one)
    • Updates relevant documentation
    • No doc update is appropriate
  • Tests (select at least one)
Show available Jenkins commands

@github-actions
Copy link

This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved

@aclamk aclamk force-pushed the wip-aclamk-pere-wal-fsync branch from 31e4731 to c114aea Compare March 11, 2025 17:59
@aclamk aclamk added aclamk-testing-phoebe bluestore testing and removed aclamk-testing-phoebe bluestore testing labels Mar 11, 2025
@aclamk aclamk force-pushed the wip-aclamk-pere-wal-fsync branch from c114aea to ab34d64 Compare March 11, 2025 18:42
@aclamk
Copy link
Contributor Author

aclamk commented Mar 11, 2025

jenkins test make check

1 similar comment
@aclamk
Copy link
Contributor Author

aclamk commented Mar 12, 2025

jenkins test make check

@aclamk aclamk added the aclamk-testing-phoebe bluestore testing label Mar 12, 2025
@aclamk aclamk force-pushed the wip-aclamk-pere-wal-fsync branch from ab34d64 to 41e8d24 Compare March 17, 2025 17:16
@aclamk
Copy link
Contributor Author

aclamk commented Mar 18, 2025

jenkins test api

@aclamk aclamk force-pushed the wip-aclamk-pere-wal-fsync branch from 6780770 to 10842c6 Compare March 18, 2025 10:42
@aclamk aclamk marked this pull request as ready for review March 18, 2025 12:38
@aclamk aclamk requested a review from a team as a code owner March 18, 2025 12:38
type: bool
level: advanced
desc: Enables a faster backend in BlueFS for WAL writes.
long_desc: Enabling this feature will reduce ~50% the amount of fdatasync syscalls issued by WAL writes. This happens because we embed metadata
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fdatasync double backticks

desc: Enables a faster backend in BlueFS for WAL writes.
long_desc: Enabling this feature will reduce ~50% the amount of fdatasync syscalls issued by WAL writes. This happens because we embed metadata
with the data itself. Downgrading from a version that uses v2 to v1 will require running `ceph-bluestore-tool --command downgrade-wal-to-v1`
to move wal files to previous format.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

to migrate WAL entries to the previous format.

using ::ceph::encode; \
do {

#define ENCODE_START_FILLER(v, compat, filler_in) \
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To be removed, not used any more

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed.

* @param bl bufferlist we were encoding to
* @param new_struct_compat struct-compat value to use
*/
#define ENCODE_FINISH_FILLER() \
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To be removed, not used any more

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed.

h->file->fnode.wal_size += content_length;
ceph_le64 flush_length_le(content_length);
h->wal_header_filler.copy_in(sizeof(flush_length_le), (char*)&flush_length_le);
uint64_t length = sizeof(File::wal_flush_t::wal_length_t) + content_length + sizeof(h->file->wal_marker);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wouldn't it be better to use head_size()/tail_size() helpers instead of sizeof?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Replaced all sizeofs with header_size() / tail_size() calls.

hashed_ino ^= hashed_ino << 11;
hashed_ino ^= hashed_ino << 23;
// use hashed ino in a endiness-agnostic way
for (int i = 0; i < 8; i++) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

8 to be replaced with sizeof(wal_marker) or something

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure with "8", but I added comment to explain the code.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AFAIU 8 comes from marker's length. Shouldn't it be sizeof(marker) then?

}
if (const unsigned tail = bl.length() & ~super.block_mask(); tail) {
const auto padding_len = super.block_size - tail;
unsigned padding_len = 0;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Apaprently no need to declare outside 'if' statement

@aclamk aclamk force-pushed the wip-aclamk-pere-wal-fsync branch 2 times, most recently from 1032eea to c2fdd21 Compare March 21, 2025 21:49
to move wal files to previous format.
long_desc: In envelope mode BlueFS files do not need to update metadata. When applied to RocksDB WAL files,
it reduces by ~50% the amount of fdatasync syscalls.
Downgrading from an envelope mode to legacy mode requires `ceph-bluestore-tool --command downgrade-wal-to-v1`.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

'wal-to-v1' is kinda weird now...

@github-actions
Copy link

This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved

pereman2 and others added 17 commits March 28, 2025 16:05
WAL file mode allows to cut by 50% of fdatasync().
For regular files, we independently sync file data and file metadata.
In WAL mode we are able to recover most metadata from file bytestream.
Hence - we only sync file data.

Signed-off-by: Pere Diaz Bou <pere-altea@hotmail.com>
(cherry picked from commit 38519ff)
Amended-by: Adam Kupczyk
Signed-off-by: Pere Diaz Bou <pere-altea@hotmail.com>
(cherry picked from commit 4961dca)
This is useful to have ability to create a contigous_filler variable.
Otherwise one needs std::optional<>.

Signed-off-by: Adam Kupczyk <akupczyk@ibm.com>
Main changes:
- remove bluefs_wal_header_t; we no longer use denc directly,
  instead we read length and marker directly
- get rid of wal_limit; this is replaced with splitting
  WAL_V2=>{WAL_V2,WAL_V2_FIN}. WAL_V2 means there could be more wal data,
  WAL_V2_FIN means file was orderly closed = no more data.
- modified wal_marker_t (previously WALMarker) to be calculated in
  endiness-agnostic way.
- replaced _wal_update_size with _wal_index_file; it does a similar
  function, but is less convoluted now
- simplified _wal_read
- now BlueFS::close_writer() flushes data
- BlueRocksEnv's BlueRocksWritableFile::truncate() now does the action

Signed-off-by: Adam Kupczyk <akupczyk@ibm.com>
Macros
are not longer used.

Signed-off-by: Adam Kupczyk <akupczyk@ibm.com>
It is just not used.

Signed-off-by: Adam Kupczyk <akupczyk@ibm.com>
Signed-off-by: Adam Kupczyk <akupczyk@ibm.com>
If we happen to shutdown with dirty.pending_releases, after mount
we will most likely want to free them. And its an error - they will
already be free.

Signed-off-by: Adam Kupczyk <akupczyk@ibm.com>
The fact that test was broken was hidden by the error
in BlueFS::_init_alloc, that created the allocator with invalid
alloc unit size.
After that was fixed, also test needed to be fixed.

Signed-off-by: Adam Kupczyk <akupczyk@ibm.com>
Signed-off-by: Pere Diaz Bou <pere-altea@hotmail.com>
(cherry picked from commit 041edad)
Signed-off-by: Pere Diaz Bou <pere-altea@hotmail.com>
(cherry picked from commit a65118d)
Dropped checks around free space.
Now downgrades one file at a time.

Signed-off-by: Adam Kupczyk <akupczyk@ibm.com>
Changed logic for selecting bluefs_super_t wal_version field.
Now it is more a result of the data that is in bluefs log and
less directly controlled.

Signed-off-by: Adam Kupczyk <akupczyk@ibm.com>
The check involves timing check.
Jenkins is too unpredictable with performance.

Signed-off-by: Adam Kupczyk <akupczyk@ibm.com>
This refactor turns WAL_v2 into globally available ENVELOPE_MODE.
The RocksDB WAL files are now just using existing feature.
Now adding new file encoding schemes will be easier.

Signed-off-by: Adam Kupczyk <akupczyk@ibm.com>
…pe_mode

No logic change, just conf name and constants adaptation.

Signed-off-by: Adam Kupczyk <akupczyk@ibm.com>
This is the variant of DENC_START that manually controls encoding version.
As being incompatible with static compile-time checking of
StructVChecker, new (legacy) mode had to be introduced.

Also,
Removed struct_v and struct_compat from denc_finish.

Signed-off-by: Adam Kupczyk <akupczyk@ibm.com>
@aclamk aclamk removed the needs-qa label Mar 28, 2025
@aclamk
Copy link
Contributor Author

aclamk commented Mar 28, 2025

@markhpc I removed your "need-qa" label. This thing passed preliminary teuthology, and this version is not final.

Note: critical bluestore PRs we are weekly testing / reviewing with aclamk-testing-* labels.

@aclamk aclamk force-pushed the wip-aclamk-pere-wal-fsync branch from 13ed631 to 11f9a67 Compare March 28, 2025 16:28
+ relocated config bluefs_wal_envelope_mode close to other bluefs
  configs

Signed-off-by: Adam Kupczyk <akupczyk@ibm.com>
@aclamk
Copy link
Contributor Author

aclamk commented Mar 29, 2025

jenkins test make check arm64

@aclamk aclamk added aclamk-testing-ganymede bluestore testing and removed aclamk-testing-phoebe bluestore testing labels Apr 2, 2025
@aclamk
Copy link
Contributor Author

aclamk commented Apr 7, 2025

Approved by https://tracker.ceph.com/issues/70787.

@aclamk aclamk merged commit 09e8bd4 into ceph:main Apr 9, 2025
12 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants