os/bluestore: Recompression, part 4. Scanner, Estimator and core recompression. by aclamk · Pull Request #56975 · ceph/ceph

aclamk · 2024-04-18T08:45:21Z

This work is a partial solution to the problem bluestore compression quality on RBD.

Original (v1) compression has following characteristics:

is as fast as possible on write
does not care for read speed
relaxed disk space management
ignores snaps

New (v2) compression is doing:

has a (fast - best compresison) dial
keeps high read speeds
is very conscious of used space
is snap aware

#54075 Nice debugs.
#54504 New write path.
#57448 Segmented onode.
#56975 Main
#57450 Test

Initial engineering tests of the feature:
old: https://docs.google.com/spreadsheets/d/1ZRfTYFlc8gdDoU76qrgTFswzViho4ameabIk9TIx6Uw/edit#gid=560728694
moved, no restructions: https://docs.google.com/spreadsheets/d/1M_k2huVDlqZOPC5lz5JoKM1jrPSP9K43ErBoo5cMHDo/edit#gid=1978643008

Below extracted summary:
ref - reference
1.1 - most agressive recompression, it is enough to predict 1.1x gain to do recompress
..
1.7 - relaxed recompression, it requires 1.7x gain estimation to do recompress

Allocated space:

Write IOPs:

CPU used for writes:

Read IOPs: (Ref degradation is not a fluke. It is real and repeatable.):

CPU used for reads: (Ref degradation is not a fluke. It is real and repeatable.):

Contribution Guidelines

To sign and title your commits, please refer to Submitting Patches to Ceph.
If you are submitting a fix for a stable branch (e.g. "quincy"), please refer to Submitting Patches to Ceph - Backports for the proper workflow.
When filling out the below checklist, you may click boxes directly in the GitHub web UI. When entering or editing the entire PR message in the GitHub web UI editor, you may also select a checklist item by adding an x between the brackets: [x]. Spaces and capitalization matter when checking off items this way.

Review checklist

Checklist

Tracker (select at least one)
- References tracker ticket
- Very recent bug; references commit where it was introduced
- New feature (ticket optional)
- Doc update (no ticket needed)
- Code cleanup (no ticket needed)
Component impact
- Affects Dashboard, opened tracker ticket
- Affects Orchestrator, opened tracker ticket
- No impact that needs to be tracked
Documentation (select at least one)
- Updates relevant documentation
- No doc update is appropriate
Tests (select at least one)
- Includes unit test(s)
- Includes integration test(s)
- Includes bug reproducer
- No tests

Show available Jenkins commands

jenkins retest this please
jenkins test classic perf
jenkins test crimson perf
jenkins test signed
jenkins test make check
jenkins test make check arm64
jenkins test submodules
jenkins test dashboard
jenkins test dashboard cephadm
jenkins test api
jenkins test docs
jenkins render docs
jenkins test ceph-volume all
jenkins test ceph-volume tox
jenkins test windows
jenkins test rook e2e

markhpc · 2024-04-18T14:23:57Z

@aclamk Very nice! Digging those 1.7 results especially. They are nearly as good as ref in write IOPS and significantly better in all other metrics.

aclamk · 2024-05-07T12:22:48Z

jenkins test make check

aclamk · 2024-06-07T07:56:43Z

jenkins test make check

aclamk · 2024-06-27T07:18:58Z

jenkins test make check

github-actions · 2024-07-09T11:27:54Z

This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved

github-actions · 2025-03-17T11:23:03Z

This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved

liu-chunmei · 2025-04-06T06:44:40Z

src/os/bluestore/BlueStore.cc

+      derr << __func__ << " 0 reshard-range doing 0x" << std::hex << needs_reshard_begin
+        << "-0x" << needs_reshard_end << std::dec << " on"
+        << pretty_binary_string(onode->oid.hobj.to_str()) << dendl;
+      encoded_segment_estimate = 500; // just something, instead div0 ....


can we use some ratio * segment_size here?

src/os/bluestore/Compression.cc

src/os/bluestore/BlueStore.cc

liu-chunmei · 2025-04-08T21:45:33Z

src/os/bluestore/BlueStore.h

  struct Collection;
  struct Onode;
+  class Scanner;
+  class Estimator;


@aclamk Does it make sense Estimator is not defined in this commit or in prior commit, but we use it here? better to let each commit pass the build.

liu-chunmei · 2025-04-09T05:55:59Z

@aclamk I have no other questions except for the repacked issue.

This is borrowed from ceph#57631; selective cherry-picked from commit: os/bluestore: implement data reformatting on reads Signed-off-by: Garry Drankovich <garry.drankovich@clyso.com> Signed-off-by: Adam Kupczyk <akupczyk@ibm.com>

Split do_write into do_write and do_write_with_blobs. The original is used when only uncompressed data is written. The new one accepts stream of data formatted into blobs; the blobs can be compressed or uncompressed. Add blob_create_full_compressed. Fix do_put_new_blobs to handle compressed. Signed-off-by: Adam Kupczyk <akupczyk@ibm.com>

liu-chunmei · 2025-04-10T23:01:16Z

src/os/bluestore/Compression.cc

+ * modify it under the terms of the GNU Lesser General Public
+ * License version 2.1, as published by the Free Software
+ * Foundation.  See file COPYING.
+ */


compression.cc and compression.h are empty, not necessary to separate a commit, could you please merge this commit with next commit?

liu-chunmei

LGTM, only need merge two commit.

Add CMake rules to compile. Add bluestore_compression dout subsys. Created Estimator class. It is used by Scanner to decide if specific extent is to be recompressed. Prepare for future machine learning / adaptive algorithm for estimation. So far logic of Estimator is relatively simple. It learns expected recompression values and uses them in next iterations to predict. Signed-off-by: Adam Kupczyk <akupczyk@ibm.com>

Add feature of recompression scanner that looks around write region to see how much would be gained, if we read some more around and wrote more. Added Compression.h / Compression.cc. Added debug_bluestore_compression dout. Created Scanner class. Provides write_lookaround() for scanning loaded extents. Signed-off-by: Adam Kupczyk <akupczyk@ibm.com>

Modify do_write_v2() to branch into do_write_v2_compressed(). Segmented and regular cases are recognized and handled properly. New do_write_v2_compressed() oversees compression / recompression. Make one Estimator per Collection. It makes possible for estimator to learn in collection specific compressibility. In write_v2_compressed use compressor already selected in choose_write_options. Make Collection create Estimator on first use. Signed-off-by: Adam Kupczyk <akupczyk@ibm.com>

Tests that use original write path specific knowledge are failing now. Falling back to write_v1 in these tests. Signed-off-by: Adam Kupczyk <akupczyk@ibm.com>

Add admin socket commands: 1) bluestore collections Lists collections. 2) bluestore list <coll> [start object] [max count] Lists collection coll starting from object (optional). Default 100 entries. 0 = unlimited. 3) bluestore onode metadata <object> Prints onode metadata as seen by BlueStore. It might happen (usually in tests) that 2 BlueStore instances are created at the same time. Since admin commands are unique, it fails to register. Use first register to detect whether we can register at all. Signed-off-by: Adam Kupczyk <akupczyk@ibm.com>

Add new admin socket command to inspect Estimator stats per collection. Signed-off-by: Adam Kupczyk <akupczyk@ibm.com>

Standalone tests for ceph_test_objectstore require separate instances for testing write_v2=true. Signed-off-by: Adam Kupczyk <akupczyk@ibm.com>

rzarzynski · 2025-04-24T12:22:30Z

jenkins test make check arm64

rzarzynski · 2025-04-24T18:02:50Z

jenkins test make check arm64

aclamk · 2025-04-25T10:18:30Z

Approved by: https://tracker.ceph.com/issues/71057

ifed01 · 2025-04-24T14:36:05Z

src/os/bluestore/Writer.cc

+  BlobRef blob = onode->c->new_blob();
+
+  bluestore_blob_t &bblob = blob->dirty_blob();
+  uint32_t csum_order = // conv 8 -> 32 so "<<" does not overflow


Could be moved under 'if' below

ifed01 · 2025-04-24T14:41:28Z

src/os/bluestore/Writer.cc

-    bstore->_buffer_cache_write(this->txc, onode, pos, b.disk_data,
-      wctx->buffered ? 0 : Buffer::FLAG_NOCACHE);
-    pos += b.disk_data.length();
+    if (b.is_compressed()) {


having
auto& data = b.is_compressed() ? b.object_data : b.disk_data;
and using that data var for the remaining stuff
would look less bulky to me...

ifed01 · 2025-04-24T17:42:41Z

src/os/bluestore/Compression.cc

+  while (i != extra_recompress.end()) {
+    dout(25) << std::hex << i->first
+      << "~" << i->second << dendl;
+    if (end == unset) {


why not just having
if (i->first != end) {

here ?

ifed01 · 2025-04-24T17:45:33Z

src/os/bluestore/Compression.h

+  // accumulated size of compressed, used in feedback learn stage
+  uint32_t actual_compressed = 0;
+  uint32_t actual_compressed_plus_pad = 0;
+  std::map<uint32_t, uint32_t> extra_recompress;


wouldn't interval_set<uint32_t> be a replacement? It can merge adjustent extents for you.

ifed01 · 2025-04-24T17:47:00Z

src/os/bluestore/Compression.cc

+    dout(25) << std::hex << i->first
+      << "~" << i->second << dendl;
+    if (end == unset) {
+      regions.emplace_back();


Perhaps it's better to reserve space for 'regions' beforehand to avoid reallocations

ifed01 · 2025-04-25T09:04:39Z

src/os/bluestore/Compression.cc

+using Scan = BlueStore::Scanner::Scan;
+using P = BlueStore::printer;
+using Estimator = BlueStore::Estimator;
+using P = BlueStore::printer;


ifed01 · 2025-04-25T09:07:46Z

src/os/bluestore/Compression.cc

+inline void Estimator::mark_recompress(const BlueStore::Extent* e)
+{
+  ceph_assert(!extra_recompress.contains(e->logical_offset));
+  dout(25) << "recompress: " << e->print(P::NICK + P::JUSTID) << dendl;


better have before the assert

ifed01 · 2025-04-25T09:07:55Z

src/os/bluestore/Compression.cc

+inline void Estimator::mark_main(uint32_t location, uint32_t length)
+{
+  ceph_assert(!extra_recompress.contains(location));
+  dout(25) << "main data compress: " << std::hex


ifed01 · 2025-04-25T09:11:38Z

src/os/bluestore/Compression.cc

+  ceph::buffer::list& data_bl,
+  Writer::blob_vec& bd)
+{
+  uint32_t au_size = bluestore->min_alloc_size;


Unless I missed something it looks like bluestore member is used here and for cct access only. Likely we can pass min_alloc_size as a parameter and keep cct as a member instead.
IMO the less references to such a bulky class the better...

ifed01 · 2025-04-25T09:17:40Z

src/os/bluestore/BlueStore.h

    extent_map_t::iterator seek_lextent(uint64_t offset);
    extent_map_t::const_iterator seek_lextent(uint64_t offset) const;
+    /// seek to the exactly the extent, or after offset
+    extent_map_t::iterator seek_nextent(uint64_t offset);


wouldn't it be better to add 'exact' parameter to seet_lextent funcs instead?

aclamk requested a review from a team as a code owner April 18, 2024 08:45

github-actions bot added bluestore build/ops common core tests labels Apr 18, 2024

aclamk added the performance label Apr 18, 2024

aclamk requested a review from a team as a code owner April 18, 2024 11:11

github-actions bot added the crimson label Apr 18, 2024

aclamk force-pushed the wip-aclamk-bs-compression-recompression branch 2 times, most recently from d35088d to 2bfe098 Compare April 19, 2024 16:32

aclamk force-pushed the wip-aclamk-bs-compression-recompression branch from 348ae8e to 22976f6 Compare May 13, 2024 12:19

aclamk changed the title ~~BlueStore compression - recompression~~ os/bluestore: Recompression, part 4. Scanner, Estimator and core recompression. May 13, 2024

This was referenced May 13, 2024

os/bluestore: Recompression, part 5. Testing. #57450

Closed

os/bluestore: Recompression, part 1. Nice debugs. #54075

Merged

os/bluestore: Recompression, part 2. New write path. #54504

Merged

os/bluestore: Recompression, part 3. Segmented onode. #57448

Merged

aclamk force-pushed the wip-aclamk-bs-compression-recompression branch 3 times, most recently from 045c200 to 38b5305 Compare May 27, 2024 07:03

aclamk force-pushed the wip-aclamk-bs-compression-recompression branch from 5785400 to 48d4a6b Compare June 21, 2024 15:05

aclamk force-pushed the wip-aclamk-bs-compression-recompression branch 3 times, most recently from beae52b to ef69846 Compare July 1, 2024 10:38

github-actions bot added the needs-rebase label Jul 9, 2024

github-actions bot removed the needs-rebase label Mar 6, 2025

github-actions bot added the needs-rebase label Mar 17, 2025

aclamk force-pushed the wip-aclamk-bs-compression-recompression branch from 4862e11 to 7795cc5 Compare April 2, 2025 10:58

github-actions bot removed the needs-rebase label Apr 2, 2025

liu-chunmei reviewed Apr 6, 2025

View reviewed changes

src/os/bluestore/Compression.cc Outdated Show resolved Hide resolved

liu-chunmei reviewed Apr 6, 2025

View reviewed changes

src/os/bluestore/BlueStore.cc Outdated Show resolved Hide resolved

aclamk force-pushed the wip-aclamk-bs-compression-recompression branch from 7795cc5 to 88c20a5 Compare April 8, 2025 14:36

liu-chunmei reviewed Apr 8, 2025

View reviewed changes

aclamk added 2 commits April 9, 2025 13:59

aclamk force-pushed the wip-aclamk-bs-compression-recompression branch from 88c20a5 to dc9c6b3 Compare April 10, 2025 06:37

liu-chunmei reviewed Apr 10, 2025

View reviewed changes

aclamk added the aclamk-testing-phoebe bluestore testing label Apr 23, 2025

liu-chunmei approved these changes Apr 24, 2025

View reviewed changes

aclamk added 7 commits April 24, 2025 06:45

test/objectstore/store_test: Adapt tests to write_v2

4d6e7ac

Tests that use original write path specific knowledge are failing now. Falling back to write_v1 in these tests. Signed-off-by: Adam Kupczyk <akupczyk@ibm.com>

os/bluestore: Add "bluestore compression stats"

177ff7a

Add new admin socket command to inspect Estimator stats per collection. Signed-off-by: Adam Kupczyk <akupczyk@ibm.com>

qa/suites/rados/bluestore: Add standalone tests for write_v2

117a2d9

Standalone tests for ceph_test_objectstore require separate instances for testing write_v2=true. Signed-off-by: Adam Kupczyk <akupczyk@ibm.com>

aclamk force-pushed the wip-aclamk-bs-compression-recompression branch from 93c3e5a to 117a2d9 Compare April 24, 2025 06:47

aclamk merged commit ff2d2a2 into ceph:main Apr 25, 2025
12 checks passed

ifed01 reviewed Apr 25, 2025

View reviewed changes

Conversation

aclamk commented Apr 18, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Contribution Guidelines

Review checklist

Checklist

Uh oh!

markhpc commented Apr 18, 2024

Uh oh!

aclamk commented May 7, 2024

Uh oh!

aclamk commented Jun 7, 2024

Uh oh!

aclamk commented Jun 27, 2024

Uh oh!

github-actions bot commented Jul 9, 2024

Uh oh!

github-actions bot commented Mar 17, 2025

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

liu-chunmei Apr 8, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

liu-chunmei commented Apr 9, 2025

Uh oh!

Choose a reason for hiding this comment

Uh oh!

liu-chunmei left a comment

Choose a reason for hiding this comment

Uh oh!

rzarzynski commented Apr 24, 2025

Uh oh!

rzarzynski commented Apr 24, 2025

Uh oh!

aclamk commented Apr 25, 2025

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

aclamk commented Apr 18, 2024 •

edited

Loading

liu-chunmei Apr 8, 2025 •

edited

Loading