Conversation
@aclamk Very nice! Digging those 1.7 results especially. They are nearly as good as ref in write IOPS and significantly better in all other metrics.
jenkins test make check
jenkins test make check
jenkins test make check
This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved
derr << __func__ << " 0 reshard-range doing 0x" << std::hex << needs_reshard_begin
     << "-0x" << needs_reshard_end << std::dec << " on "
     << pretty_binary_string(onode->oid.hobj.to_str()) << dendl;
encoded_segment_estimate = 500; // arbitrary placeholder, avoids div-by-zero
Can we use some ratio * segment_size here?
struct Collection;
struct Onode;
class Scanner;
class Estimator;
@aclamk Does it make sense that Estimator is not defined in this commit or any prior commit, yet is used here? It would be better to let each commit pass the build.
@aclamk I have no other questions except for the repacked issue.
This is borrowed from ceph#57631; selectively cherry-picked from commit "os/bluestore: implement data reformatting on reads".
Signed-off-by: Garry Drankovich <garry.drankovich@clyso.com>
Signed-off-by: Adam Kupczyk <akupczyk@ibm.com>
Split do_write into do_write and do_write_with_blobs. The original is used when only uncompressed data is written. The new one accepts a stream of data formatted into blobs; the blobs can be compressed or uncompressed. Add blob_create_full_compressed. Fix do_put_new_blobs to handle compressed blobs.
Signed-off-by: Adam Kupczyk <akupczyk@ibm.com>
 * modify it under the terms of the GNU Lesser General Public
 * License version 2.1, as published by the Free Software
 * Foundation. See file COPYING.
 */
compression.cc and compression.h are empty, so this does not need to be a separate commit; could you please merge this commit with the next one?
liu-chunmei left a comment
LGTM; the only remaining item is merging the two commits.
Add CMake rules to compile. Add bluestore_compression dout subsys. Create the Estimator class, used by Scanner to decide whether a specific extent is to be recompressed. Prepare for a future machine-learning / adaptive estimation algorithm. So far the logic of Estimator is relatively simple: it learns expected recompression values and uses them to predict in subsequent iterations.
Signed-off-by: Adam Kupczyk <akupczyk@ibm.com>
Add a recompression scanner feature that looks around the write region to see how much would be gained if we read and rewrote some more data around it. Added Compression.h / Compression.cc. Added debug_bluestore_compression dout. Created the Scanner class; it provides write_lookaround() for scanning loaded extents.
Signed-off-by: Adam Kupczyk <akupczyk@ibm.com>
Modify do_write_v2() to branch into do_write_v2_compressed(). Segmented and regular cases are recognized and handled properly. The new do_write_v2_compressed() oversees compression / recompression. Make one Estimator per Collection, which makes it possible for the estimator to learn collection-specific compressibility. In write_v2_compressed use the compressor already selected in choose_write_options. Make Collection create its Estimator on first use.
Signed-off-by: Adam Kupczyk <akupczyk@ibm.com>
Tests that rely on knowledge specific to the original write path are failing now. Fall back to write_v1 in these tests.
Signed-off-by: Adam Kupczyk <akupczyk@ibm.com>
Add admin socket commands:
1) bluestore collections - lists collections.
2) bluestore list <coll> [start object] [max count] - lists collection coll, starting from object (optional); default 100 entries, 0 = unlimited.
3) bluestore onode metadata <object> - prints onode metadata as seen by BlueStore.
It might happen (usually in tests) that two BlueStore instances are created at the same time. Since admin commands must be unique, the second registration fails. Use the first register to detect whether we can register at all.
Signed-off-by: Adam Kupczyk <akupczyk@ibm.com>
Add new admin socket command to inspect Estimator stats per collection. Signed-off-by: Adam Kupczyk <akupczyk@ibm.com>
Standalone tests for ceph_test_objectstore require separate instances for testing write_v2=true. Signed-off-by: Adam Kupczyk <akupczyk@ibm.com>
jenkins test make check arm64
Approved by: https://tracker.ceph.com/issues/71057
BlobRef blob = onode->c->new_blob();
bluestore_blob_t &bblob = blob->dirty_blob();
uint32_t csum_order = // conv 8 -> 32 so "<<" does not overflow
Could be moved under the 'if' below.
bstore->_buffer_cache_write(this->txc, onode, pos, b.disk_data,
                            wctx->buffered ? 0 : Buffer::FLAG_NOCACHE);
pos += b.disk_data.length();
if (b.is_compressed()) {
Having
auto& data = b.is_compressed() ? b.object_data : b.disk_data;
and using that data variable for the remaining code would look less bulky to me...
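The suggestion above can be sketched like this; the struct and its field names (compressed, object_data, disk_data) are simplified stand-ins for the real BlueStore blob data, not the actual types:

```cpp
#include <cassert>
#include <string>

// Simplified stand-in for the blob record in the write path; the field
// names are illustrative only, not BlueStore's real definitions.
struct BlobData {
  bool compressed = false;
  std::string object_data;  // uncompressed logical bytes
  std::string disk_data;    // bytes as laid out on disk
};

// Bind one reference up front instead of repeating the ternary at every
// use site, as the review comment suggests.
inline const std::string& payload(const BlobData& b) {
  return b.compressed ? b.object_data : b.disk_data;
}
```

The ternary over two lvalues of the same type yields an lvalue, so a single reference binding works and every later use reads the right buffer.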
while (i != extra_recompress.end()) {
  dout(25) << std::hex << i->first
           << "~" << i->second << dendl;
  if (end == unset) {
Why not just have
if (i->first != end) {
here?
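A minimal sketch of the loop shape this comment proposes, using a plain offset -> length map in place of the real extra_recompress structure (function and variable names are illustrative):

```cpp
#include <cstdint>
#include <map>
#include <utility>
#include <vector>

// Walk an offset -> length map and open a new region whenever the next
// entry is not contiguous with the previous one. The contiguity test is
// simply i->first != end, with no separate 'unset' sentinel; the
// regions.empty() check covers the first iteration.
std::vector<std::pair<uint32_t, uint32_t>>
build_regions(const std::map<uint32_t, uint32_t>& extents) {
  std::vector<std::pair<uint32_t, uint32_t>> regions;
  uint32_t end = 0;
  for (auto i = extents.begin(); i != extents.end(); ++i) {
    if (regions.empty() || i->first != end) {
      regions.emplace_back(i->first, 0);  // start a new region
    }
    regions.back().second = i->first + i->second - regions.back().first;
    end = i->first + i->second;
  }
  return regions;
}
```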
// accumulated size of compressed, used in feedback learn stage
uint32_t actual_compressed = 0;
uint32_t actual_compressed_plus_pad = 0;
std::map<uint32_t, uint32_t> extra_recompress;
Wouldn't interval_set<uint32_t> be a replacement? It can merge adjacent extents for you.
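For illustration, here is a toy version of the insertion-time merging that interval_set provides. This is not the real Ceph interval_set API, just a sketch of its adjacent-extent coalescing behavior over a std::map:

```cpp
#include <algorithm>
#include <cstdint>
#include <iterator>
#include <map>

// Toy model of interval_set<uint32_t> insertion: any interval that
// touches or overlaps the new [off, off+len) range is absorbed into it,
// so adjacent extents merge automatically on insert.
struct ToyIntervalSet {
  std::map<uint32_t, uint32_t> m;  // offset -> length

  void insert(uint32_t off, uint32_t len) {
    uint32_t end = off + len;
    auto it = m.lower_bound(off);
    // step back if the previous interval reaches up to (or past) off
    if (it != m.begin()) {
      auto prev = std::prev(it);
      if (prev->first + prev->second >= off) it = prev;
    }
    // absorb every interval starting at or before the running end
    while (it != m.end() && it->first <= end) {
      off = std::min(off, it->first);
      end = std::max(end, it->first + it->second);
      it = m.erase(it);
    }
    m[off] = end - off;
  }
};
```

With insertion-time merging, the caller never has to coalesce neighboring extents by hand, which is the point of the review comment.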
dout(25) << std::hex << i->first
         << "~" << i->second << dendl;
if (end == unset) {
  regions.emplace_back();
Perhaps it's better to reserve space for 'regions' beforehand to avoid reallocations
using Scan = BlueStore::Scanner::Scan;
using P = BlueStore::printer;
using Estimator = BlueStore::Estimator;
inline void Estimator::mark_recompress(const BlueStore::Extent* e)
{
  ceph_assert(!extra_recompress.contains(e->logical_offset));
  dout(25) << "recompress: " << e->print(P::NICK + P::JUSTID) << dendl;
Better to have this before the assert.
inline void Estimator::mark_main(uint32_t location, uint32_t length)
{
  ceph_assert(!extra_recompress.contains(location));
  dout(25) << "main data compress: " << std::hex
ceph::buffer::list& data_bl,
Writer::blob_vec& bd)
{
  uint32_t au_size = bluestore->min_alloc_size;
Unless I missed something, it looks like the bluestore member is used here only for min_alloc_size and cct access. We could likely pass min_alloc_size as a parameter and keep cct as a member instead.
IMO the fewer references to such a bulky class, the better...
extent_map_t::iterator seek_lextent(uint64_t offset);
extent_map_t::const_iterator seek_lextent(uint64_t offset) const;
/// seek to exactly the extent, or after offset
extent_map_t::iterator seek_nextent(uint64_t offset);
Wouldn't it be better to add an 'exact' parameter to the seek_lextent funcs instead?
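A rough sketch of what a combined seek with an 'exact' flag could look like. extent_map_t is modeled here as a plain offset -> length map; BlueStore's real extent map is an intrusive set of Extent, so the function name and shape are illustrative only:

```cpp
#include <cstdint>
#include <iterator>
#include <map>

// Illustrative model: logical_offset -> length.
using extent_map_t = std::map<uint64_t, uint32_t>;

// exact == false: return the extent containing offset, or the first one
// after it (seek_lextent-style).
// exact == true: skip an extent that merely covers offset and return the
// first extent starting at or after it (seek_nextent-style).
extent_map_t::iterator seek(extent_map_t& map, uint64_t offset, bool exact) {
  auto it = map.lower_bound(offset);  // first extent starting at >= offset
  if (!exact && it != map.begin()) {
    auto prev = std::prev(it);
    if (prev->first + prev->second > offset) {
      return prev;  // previous extent overlaps offset
    }
  }
  return it;
}
```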
This work is a partial solution to the problem of BlueStore compression quality on RBD.
The original (v1) compression has the following characteristics:
The new (v2) compression does the following:
#54075 Nice debugs.
#54504 New write path.
#57448 Segmented onode.
#56975 Main
#57450 Test
Initial engineering tests of the feature:
old: https://docs.google.com/spreadsheets/d/1ZRfTYFlc8gdDoU76qrgTFswzViho4ameabIk9TIx6Uw/edit#gid=560728694
moved, no restrictions: https://docs.google.com/spreadsheets/d/1M_k2huVDlqZOPC5lz5JoKM1jrPSP9K43ErBoo5cMHDo/edit#gid=1978643008
Below is the extracted summary:
ref - reference
1.1 - most aggressive recompression; predicting a 1.1x gain is enough to recompress
..
1.7 - relaxed recompression; it requires a 1.7x estimated gain to recompress
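For clarity, a sketch of what these ratio settings amount to as a decision rule; the helper name and signature are hypothetical, not the actual Estimator API:

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical decision helper mirroring the settings above: recompress
// only when the current on-disk size divided by the predicted compressed
// size meets the configured gain ratio (1.1 = aggressive ... 1.7 = relaxed).
inline bool should_recompress(uint64_t current_size,
                              uint64_t predicted_size,
                              double ratio) {
  if (predicted_size == 0) {
    return false;  // guard against division by zero
  }
  return static_cast<double>(current_size) / predicted_size >= ratio;
}
```

A lower threshold recompresses more often (more CPU, better space), a higher one only when the predicted win is large, which matches the ref/1.1/1.7 trade-offs in the charts below.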
Allocated space:

Write IOPs:

CPU used for writes:

Read IOPs: (Ref degradation is not a fluke. It is real and repeatable.):

CPU used for reads: (Ref degradation is not a fluke. It is real and repeatable.):

Review checklist
Implemented.
Tested by randomly selecting write path on mount: https://github.com/aclamk/ceph/tree/wip-aclamk-bs-compression-recompression-test-writerandom
Tested by randomly selecting segment size on mount: https://github.com/aclamk/ceph/tree/wip-aclamk-bs-compression-recompression-test-randomsegment