Skip to content

os/bluestore: Recompression, part 4. Scanner, Estimator and core recompression.#56975

Merged
aclamk merged 9 commits intoceph:mainfrom
aclamk:wip-aclamk-bs-compression-recompression
Apr 25, 2025
Merged

os/bluestore: Recompression, part 4. Scanner, Estimator and core recompression.#56975
aclamk merged 9 commits intoceph:mainfrom
aclamk:wip-aclamk-bs-compression-recompression

Conversation

@aclamk
Copy link
Contributor

@aclamk aclamk commented Apr 18, 2024

This work is a partial solution to the problem bluestore compression quality on RBD.

Original (v1) compression has following characteristics:

  • is as fast as possible on write
  • does not care for read speed
  • relaxed disk space management
  • ignores snaps

New (v2) compression is doing:

  • has a (fast - best compresison) dial
  • keeps high read speeds
  • is very conscious of used space
  • is snap aware

#54075 Nice debugs.
#54504 New write path.
#57448 Segmented onode.
#56975 Main
#57450 Test

Initial engineering tests of the feature:
old: https://docs.google.com/spreadsheets/d/1ZRfTYFlc8gdDoU76qrgTFswzViho4ameabIk9TIx6Uw/edit#gid=560728694
moved, no restructions: https://docs.google.com/spreadsheets/d/1M_k2huVDlqZOPC5lz5JoKM1jrPSP9K43ErBoo5cMHDo/edit#gid=1978643008

Below extracted summary:
ref - reference
1.1 - most agressive recompression, it is enough to predict 1.1x gain to do recompress
..
1.7 - relaxed recompression, it requires 1.7x gain estimation to do recompress

Allocated space:
image

Write IOPs:
image

CPU used for writes:
image

Read IOPs: (Ref degradation is not a fluke. It is real and repeatable.):
image

CPU used for reads: (Ref degradation is not a fluke. It is real and repeatable.):
image

Contribution Guidelines

  • To sign and title your commits, please refer to Submitting Patches to Ceph.

  • If you are submitting a fix for a stable branch (e.g. "quincy"), please refer to Submitting Patches to Ceph - Backports for the proper workflow.

  • When filling out the below checklist, you may click boxes directly in the GitHub web UI. When entering or editing the entire PR message in the GitHub web UI editor, you may also select a checklist item by adding an x between the brackets: [x]. Spaces and capitalization matter when checking off items this way.

Review checklist

Checklist

  • Tracker (select at least one)
    • References tracker ticket
    • Very recent bug; references commit where it was introduced
    • New feature (ticket optional)
    • Doc update (no ticket needed)
    • Code cleanup (no ticket needed)
  • Component impact
    • Affects Dashboard, opened tracker ticket
    • Affects Orchestrator, opened tracker ticket
    • No impact that needs to be tracked
  • Documentation (select at least one)
    • Updates relevant documentation
    • No doc update is appropriate
  • Tests (select at least one)
Show available Jenkins commands
  • jenkins retest this please
  • jenkins test classic perf
  • jenkins test crimson perf
  • jenkins test signed
  • jenkins test make check
  • jenkins test make check arm64
  • jenkins test submodules
  • jenkins test dashboard
  • jenkins test dashboard cephadm
  • jenkins test api
  • jenkins test docs
  • jenkins render docs
  • jenkins test ceph-volume all
  • jenkins test ceph-volume tox
  • jenkins test windows
  • jenkins test rook e2e

@markhpc
Copy link
Member

markhpc commented Apr 18, 2024

@aclamk Very nice! Digging those 1.7 results especially. They are nearly as good as ref in write IOPS and significantly better in all other metrics.

@aclamk aclamk force-pushed the wip-aclamk-bs-compression-recompression branch 2 times, most recently from d35088d to 2bfe098 Compare April 19, 2024 16:32
@aclamk
Copy link
Contributor Author

aclamk commented May 7, 2024

jenkins test make check

@aclamk aclamk force-pushed the wip-aclamk-bs-compression-recompression branch from 348ae8e to 22976f6 Compare May 13, 2024 12:19
@aclamk aclamk changed the title BlueStore compression - recompression os/bluestore: Recompression, part 4. Scanner, Estimator and core recompression. May 13, 2024
@aclamk aclamk force-pushed the wip-aclamk-bs-compression-recompression branch 3 times, most recently from 045c200 to 38b5305 Compare May 27, 2024 07:03
@aclamk
Copy link
Contributor Author

aclamk commented Jun 7, 2024

jenkins test make check

@aclamk aclamk force-pushed the wip-aclamk-bs-compression-recompression branch from 5785400 to 48d4a6b Compare June 21, 2024 15:05
@aclamk
Copy link
Contributor Author

aclamk commented Jun 27, 2024

jenkins test make check

@aclamk aclamk force-pushed the wip-aclamk-bs-compression-recompression branch 3 times, most recently from beae52b to ef69846 Compare July 1, 2024 10:38
@github-actions
Copy link

github-actions bot commented Jul 9, 2024

This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved

@github-actions
Copy link

This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved

@aclamk aclamk force-pushed the wip-aclamk-bs-compression-recompression branch from 4862e11 to 7795cc5 Compare April 2, 2025 10:58
derr << __func__ << " 0 reshard-range doing 0x" << std::hex << needs_reshard_begin
<< "-0x" << needs_reshard_end << std::dec << " on"
<< pretty_binary_string(onode->oid.hobj.to_str()) << dendl;
encoded_segment_estimate = 500; // just something, instead div0 ....
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we use some ratio * segment_size here?

@aclamk aclamk force-pushed the wip-aclamk-bs-compression-recompression branch from 7795cc5 to 88c20a5 Compare April 8, 2025 14:36
struct Collection;
struct Onode;
class Scanner;
class Estimator;
Copy link
Contributor

@liu-chunmei liu-chunmei Apr 8, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@aclamk Does it make sense Estimator is not defined in this commit or in prior commit, but we use it here? better to let each commit pass the build.

@liu-chunmei
Copy link
Contributor

@aclamk I have no other questions except for the repacked issue.

aclamk added 2 commits April 9, 2025 13:59
This is borrowed from ceph#57631;
selective cherry-picked from commit:
    os/bluestore: implement data reformatting on reads
    Signed-off-by: Garry Drankovich <garry.drankovich@clyso.com>
Signed-off-by: Adam Kupczyk <akupczyk@ibm.com>
Split do_write into do_write and do_write_with_blobs.
The original is used when only uncompressed data is written.
The new one accepts stream of data formatted into blobs;
the blobs can be compressed or uncompressed.

Add blob_create_full_compressed.
Fix do_put_new_blobs to handle compressed.

Signed-off-by: Adam Kupczyk <akupczyk@ibm.com>
@aclamk aclamk force-pushed the wip-aclamk-bs-compression-recompression branch from 88c20a5 to dc9c6b3 Compare April 10, 2025 06:37
* modify it under the terms of the GNU Lesser General Public
* License version 2.1, as published by the Free Software
* Foundation. See file COPYING.
*/
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

compression.cc and compression.h are empty, not necessary to separate a commit, could you please merge this commit with next commit?

@aclamk aclamk added the aclamk-testing-phoebe bluestore testing label Apr 23, 2025
Copy link
Contributor

@liu-chunmei liu-chunmei left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, only need merge two commit.

aclamk added 7 commits April 24, 2025 06:45
Add CMake rules to compile.
Add bluestore_compression dout subsys.

Created Estimator class.
It is used by Scanner to decide if specific extent is to be recompressed.
Prepare for future machine learning / adaptive algorithm for estimation.

So far logic of Estimator is relatively simple.
It learns expected recompression values and uses them in next iterations to predict.

Signed-off-by: Adam Kupczyk <akupczyk@ibm.com>
Add feature of recompression scanner that looks around write region to see how much
would be gained, if we read some more around and wrote more.
Added Compression.h / Compression.cc.
Added debug_bluestore_compression dout.
Created Scanner class.
Provides write_lookaround() for scanning loaded extents.

Signed-off-by: Adam Kupczyk <akupczyk@ibm.com>
Modify do_write_v2() to branch into do_write_v2_compressed().
Segmented and regular cases are recognized and handled properly.
New do_write_v2_compressed() oversees compression / recompression.

Make one Estimator per Collection.
It makes possible for estimator to learn in collection specific compressibility.
In write_v2_compressed use compressor already selected in choose_write_options.
Make Collection create Estimator on first use.

Signed-off-by: Adam Kupczyk <akupczyk@ibm.com>
Tests that use original write path specific knowledge are failing now.
Falling back to write_v1 in these tests.

Signed-off-by: Adam Kupczyk <akupczyk@ibm.com>
Add admin socket commands:
1) bluestore collections
Lists collections.
2) bluestore list <coll> [start object] [max count]
Lists collection coll starting from object (optional). Default 100 entries. 0 = unlimited.
3) bluestore onode metadata <object>
Prints onode metadata as seen by BlueStore.

It might happen (usually in tests) that 2 BlueStore instances are created at the same time.
Since admin commands are unique, it fails to register.
Use first register to detect whether we can register at all.

Signed-off-by: Adam Kupczyk <akupczyk@ibm.com>
Add new admin socket command to inspect Estimator stats per collection.

Signed-off-by: Adam Kupczyk <akupczyk@ibm.com>
Standalone tests for ceph_test_objectstore require separate instances
for testing write_v2=true.

Signed-off-by: Adam Kupczyk <akupczyk@ibm.com>
@aclamk aclamk force-pushed the wip-aclamk-bs-compression-recompression branch from 93c3e5a to 117a2d9 Compare April 24, 2025 06:47
@rzarzynski
Copy link
Contributor

jenkins test make check arm64

1 similar comment
@rzarzynski
Copy link
Contributor

jenkins test make check arm64

@aclamk
Copy link
Contributor Author

aclamk commented Apr 25, 2025

Approved by: https://tracker.ceph.com/issues/71057

@aclamk aclamk merged commit ff2d2a2 into ceph:main Apr 25, 2025
12 checks passed
BlobRef blob = onode->c->new_blob();

bluestore_blob_t &bblob = blob->dirty_blob();
uint32_t csum_order = // conv 8 -> 32 so "<<" does not overflow
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could be moved under 'if' below

bstore->_buffer_cache_write(this->txc, onode, pos, b.disk_data,
wctx->buffered ? 0 : Buffer::FLAG_NOCACHE);
pos += b.disk_data.length();
if (b.is_compressed()) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

having
auto& data = b.is_compressed() ? b.object_data : b.disk_data;
and using that data var for the remaining stuff
would look less bulky to me...

while (i != extra_recompress.end()) {
dout(25) << std::hex << i->first
<< "~" << i->second << dendl;
if (end == unset) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why not just having
if (i->first != end) {

here ?

// accumulated size of compressed, used in feedback learn stage
uint32_t actual_compressed = 0;
uint32_t actual_compressed_plus_pad = 0;
std::map<uint32_t, uint32_t> extra_recompress;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wouldn't interval_set<uint32_t> be a replacement? It can merge adjustent extents for you.

dout(25) << std::hex << i->first
<< "~" << i->second << dendl;
if (end == unset) {
regions.emplace_back();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps it's better to reserve space for 'regions' beforehand to avoid reallocations

using Scan = BlueStore::Scanner::Scan;
using P = BlueStore::printer;
using Estimator = BlueStore::Estimator;
using P = BlueStore::printer;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

duplicate

inline void Estimator::mark_recompress(const BlueStore::Extent* e)
{
ceph_assert(!extra_recompress.contains(e->logical_offset));
dout(25) << "recompress: " << e->print(P::NICK + P::JUSTID) << dendl;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

better have before the assert

inline void Estimator::mark_main(uint32_t location, uint32_t length)
{
ceph_assert(!extra_recompress.contains(location));
dout(25) << "main data compress: " << std::hex
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same here

ceph::buffer::list& data_bl,
Writer::blob_vec& bd)
{
uint32_t au_size = bluestore->min_alloc_size;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unless I missed something it looks like bluestore member is used here and for cct access only. Likely we can pass min_alloc_size as a parameter and keep cct as a member instead.
IMO the less references to such a bulky class the better...

extent_map_t::iterator seek_lextent(uint64_t offset);
extent_map_t::const_iterator seek_lextent(uint64_t offset) const;
/// seek to the exactly the extent, or after offset
extent_map_t::iterator seek_nextent(uint64_t offset);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wouldn't it be better to add 'exact' parameter to seet_lextent funcs instead?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants