os/bluestore: implement object content recompression/defragmentation when scrubbing#57631
Conversation
-  r = _do_read(c, o, offset, length, bl, op_flags);
+  r = _do_read(c, o, offset, length, bl, op_flags, 0,
+               might_need_reformatting ? &span_stat : nullptr);
If we can potentially defrag on any read, it could be interesting once the extents are read off disk to make a decision about whether or not an object is fragmented enough to warrant defragmentation on-the-fly.
@markhpc I am excited to see it. Opens interesting paths.
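Mark's idea above, deciding after the read whether an object is fragmented enough to warrant on-the-fly defragmentation, could look roughly like this. All names and thresholds here are hypothetical, not from the PR:

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical per-span statistics, loosely modeled on the PR's span_stat_t.
struct SpanStat {
  uint64_t bytes = 0;            // logical bytes read for the span
  uint64_t physical_extents = 0; // disk extents backing those bytes
};

// Decide whether a span is fragmented enough to rewrite on-the-fly.
// ideal = ceil(bytes / max_blob_size) is the fewest extents the span
// could occupy; rewrite only when we exceed that by `ratio`.
inline bool should_defragment(const SpanStat& s,
                              uint64_t max_blob_size = 0x80000, // 512 KiB
                              double ratio = 2.0) {
  if (s.bytes == 0 || s.physical_extents <= 1) {
    return false;
  }
  uint64_t ideal = (s.bytes + max_blob_size - 1) / max_blob_size;
  return s.physical_extents >= static_cast<uint64_t>(ideal * ratio);
}
```

The attraction of making this decision here is that the fragmented read has already been paid for, so the threshold check itself is nearly free.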
I'm not sure "data_layout_reformatting" makes it clear to users what it is doing. Maybe instead use "optimize_during_scrub" with options "defrag", "recompress", "all", ... The second option is flexible if we want to allow additional optimizations or modes in the future.
src/mon/OSDMonitor.cc (outdated):
     {"bulk", BULK},
-    {"read_ratio", READ_RATIO}
+    {"read_ratio", READ_RATIO},
+    {"data_layout_reformatting", DATA_LAYOUT_REFORMATTING},
This is missing corresponding "data_reformatting" get and set entries in MonCommands.h (easy to miss!)
Very much dislike "optimize_during_scrub". I'd favor the boolean flag approach with something like "scrub_defragmentation" and "scrub_recompression". The number of pool options we have, and the way we display them in the command help, has become pretty unwieldy, though, and this doesn't do anything to help it. Having said that, I agree that this is best configured at the pool level, especially if we want to allow lazy compression of objects on scrub.
I did a very minimal initial test where I created 4 128GB RBD images, pre-filled them with 4MB writes, then did 5 minutes of 4K random writes to them in a pool with 512 PGs. This is on NVMe, so it's pretty quick. I then looked at a certain PG 2.1ff:
and ran deep-scrub on it 3 times:
And the fragmentation score and reformat counters at this point are:
Next, set the pool data_reformatting to defragment and rerun deep-scrub on 2.1ff:
Next set the pool data_reformatting to both and rerun deep-scrub on 2.1ff:
Next set the pool compression_mode to passive and rerun deep-scrub on 2.1ff. Results are similar to above.
Running another deep scrub on that PG results in an attempt to recompress the objects again:
A couple of thoughts:
Morning update: I tried deep-scrubbing another PG this morning with data_reformatting set to both and watched onode_extents:
It looks like the number of extents actually went up?
Here's a debug_bluestore = 10 log while deep-scrubbing one of the other PGs (2.1fb).
Are you referring to the onode_extents perf counter here? IMO that's the wrong metric, as it counts logical (not physical) extents in the cache(!)
Looks good to me; the majority of reads have log output like:
I could see no initial compression, e.g. sc (aka storage compressed) = 0.
2024-05-23T12:58:04.864+0000 7fed59ea4700 10 bluestore(/tmp/cbt/mnt/osd-device-2-data) read recompress info, 0x80000 vs. 0x80000
The above line shows no saving due to recompression: the recompressed data needs the same 512K of space. So I presume your input data flow isn't compressible, or something is wrong with the compression settings.
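For context, the "0x80000 vs. 0x80000" comparison in that log line is new-required-space vs. currently-occupied-space; since allocations round up to the allocation unit, near-incompressible data yields zero saving. A toy sketch of that arithmetic (helper names are mine, not BlueStore's):

```cpp
#include <cassert>
#include <cstdint>

// Round a length up to the allocation unit (akin to BlueStore's min_alloc_size).
inline uint64_t round_up(uint64_t len, uint64_t alloc_unit) {
  return (len + alloc_unit - 1) / alloc_unit * alloc_unit;
}

// Bytes saved by replacing `current_alloc` bytes of on-disk space with
// data that compressed down to `compressed_len` bytes; zero if rounding
// to the allocation unit eats the whole gain.
inline uint64_t recompress_saving(uint64_t current_alloc,
                                  uint64_t compressed_len,
                                  uint64_t alloc_unit = 0x1000) { // 4 KiB
  uint64_t needed = round_up(compressed_len, alloc_unit);
  return needed < current_alloc ? current_alloc - needed : 0;
}
```

With a 512K span that recompresses to 512K, the saving is zero, matching the "0x80000 vs. 0x80000" log output above.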
Yikes! That's confusing. I don't know if I knew that at one point, but clearly I didn't remember it. :) We should rename it. I'll try to take a look with the histogram. |
There was no initial compression. I wrote the data out uncompressed with data_reformatting off, then set it to "defragment", then "both". Next I set the compression mode to passive, then aggressive, then force. Each time, I ran a deep scrub on a different PG to see what would happen.
This is a great point. I think the intuitive user expectation is that this feature could be used to compress old objects written uncompressed before this feature was available. Also earlier we spoke about a feature like:
(force-pushed: 462da8c to 56f74a1)
     length = o->onode.size;
...
-  r = _do_read(c, o, offset, length, bl, op_flags);
+  r = _do_read(c, o, offset, length, bl, op_flags, 0,
The implementation is very strongly tied to existing BlueStore execution paths.
This is a problem, because it gives very little room to actually work on optimization algorithms.
That's true, but we really want to save some cycles by not doing an independent standalone optimization.
-    IOContext* ioc)
+    IOContext* ioc,
+    span_stat_t* res_span_stat)
     {
_prepare_read_ioc is modified to provide scanning for optimization.
We should have a separate function that will scan the object and decide the action.
For example, it might happen that an object has 3 chunks for a total of 4MB,
and in the attempt to defragment it we again allocate 3 chunks.
_prepare_read_ioc is modified to provide scanning for optimization. We should have a separate function that will scan the object and decide the action.
That's again an attempt to design things for an independent optimization process. Which still sounds questionable to me.
For example it might happen that in object we have 3 chunks for total of 4MB, and in attempt to defragment it we allocated 3 chunks.
Not sure I understand this example. When speaking of defragmentation only, it would happen if the number of new fragments is less than the original number for a specific data span (512K when triggered from deep scrub).
If recompression is enabled as well, reformatting unconditionally occurs when the new required space for a data span is less than the original. Fragmentation is taken into consideration only when recompression is not beneficial. This recompression+defragmentation prioritization might look a bit inflexible (one might prefer defragmentation over recompression), but for now that looks like overkill to me.
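The prioritization described in this comment, recompress when it needs less space and otherwise defragment when it reduces the fragment count, can be sketched as follows (hypothetical names; the actual logic is inline in the read path):

```cpp
#include <cassert>
#include <cstdint>

enum class Action { None, Recompress, Defragment };

// Decide what to do with a data span after deep-scrub has read it.
// Per the PR discussion: reformat unconditionally when recompression
// needs less space than the span occupies now; otherwise fall back to
// pure defragmentation when it would reduce the number of fragments.
inline Action choose_reformat(bool recompression_enabled,
                              uint64_t current_space,
                              uint64_t recompressed_space,
                              uint64_t current_fragments,
                              uint64_t new_fragments) {
  if (recompression_enabled && recompressed_space < current_space) {
    return Action::Recompress;
  }
  if (new_fragments < current_fragments) {
    return Action::Defragment;
  }
  return Action::None;
}
```

Expressing the policy as one pure function like this would also make the "defrag before recompress" variant a one-line change, if that flexibility were ever wanted.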
The big reason I think we want to piggyback on deep-scrub is because we're already reading the content of the object to do the scrub, so it's the perfect time to make decisions about de-fragmentation and compression as the most expensive part of this can easily be reading the fragmented extents. I don't think we should scan objects as a separate loop unless we have a very good reason. It's much better in my mind to opportunistically defrag/compress when we've already done the expensive fragmented read.
@markhpc
I (now) agree that performing optimization on an object as an extra step during deep-scrub is good.
"I don't think we should scan objects as a separate loop unless we have a very good reason."
The code here is scanning the object; it's just piggybacking _read_cache to do it.
There is no change to the behaviour of _read_cache.
The filling of span_stat_t could be done in a separate loop, at the cost of that loop.
Similarly, we now load shared blobs in _prepare_read_ioc, which is useless for the reading process.
That would make sense as an optimization if a significant part of the processing were reused, but it's only the loop.
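For illustration, the stats collection under discussion is just a few additions per extent in a loop the read path already runs. A rough approximation (names modeled on, but not copied from, the PR's span_stat_t):

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Simplified physical extent: offset/length on disk.
struct PExtent { uint64_t offset, length; };

// Approximation of the PR's span_stat_t: aggregate layout stats for a span.
struct SpanStat {
  uint64_t total_bytes = 0;
  uint64_t extent_count = 0;

  void add(const PExtent& e) { // simple arithmetic ops, as the PR notes
    total_bytes += e.length;
    ++extent_count;
  }
};

// Collect stats while walking extents -- the PR piggybacks this on a loop
// the read path already runs, and callers pass nullptr when reformatting
// is disabled, so the stats land in a throwaway local.
inline SpanStat collect(const std::vector<PExtent>& extents, SpanStat* out) {
  SpanStat dummy;                  // dummy local when reformatting is off
  SpanStat& stat = out ? *out : dummy;
  for (const auto& e : extents) {
    stat.add(e);
  }
  return stat;
}
```

This is the crux of both positions: the accumulation itself is trivial either way, so the debate is about where the loop lives, not about the arithmetic.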
@gardran My opinion is that whenever we make a new feature we should make it as detached from existing code as possible to minimize accidental bug injection. Within reason of course...
@gardran My opinion is that whenever we make a new feature we should make it as detached from existing code as possible to minimize accidental bug injection. Within reason of course...
Generally I agree with this point when we're talking about a severe intervention into the execution path: changing run flow, having new algorithms, etc...
Instead, I just perform stats collection (using simple arithmetic ops) here. So, apart from a trivial null/invalid pointer dereference, this shouldn't introduce any issues to the existing functionality. Moreover, if reformatting is disabled, this function collects stats into a dummy local variable, which is an absolutely trivial operation.
So IMO this sort of change (in this specific function) is quite safe and isn't worth the separation; basic code review/testing should reveal any potential issues.
(force-pushed: 5dcfa15 to c10b8dd)
Config Diff Tool Output
! changed: bluestore_compression_mode: old: (global.yaml.in)
! changed: bluestore_compression_mode: new: ['bluestore_prefer_deferred_size'] (global.yaml.in)
! changed: bluestore_compression_mode: old: 'none' means never use compression. 'passive' means use compression when clients hint that data is compressible. 'aggressive' means use compression unless clients hint that data is not compressible. This option is used when the per-pool property for the compression mode is not present. (global.yaml.in)
! changed: bluestore_compression_mode: new: 'none' means never use compression. 'passive' means use compression when clients hint that data is compressible. 'aggressive' means use compression unless clients hint that data is not compressible. '*_lazy' counterparts instruct to apply compression as per above during deep-scrubbing only. Which has to be additionally enabled at per-pool level using 'deep_scrub_recompression' pool setting. This option is used when the per-pool property for the compression mode is not present. (global.yaml.in)
! changed: bluestore_compression_mode: old: ['none', 'passive', 'aggressive', 'force'] (global.yaml.in)
! changed: bluestore_compression_mode: new: ['none', 'passive', 'aggressive', 'force', 'passive_lazy', 'aggressive_lazy', 'force_lazy'] (global.yaml.in)
! changed: bluestore_compression_mode: old: The default policy for using compression if the per-pool property ``compression_mode`` is not set. ``none`` means never use compression. ``passive`` means use compression when :c:func:`clients hint <rados_set_alloc_hint>` that data is compressible. ``aggressive`` means use compression unless clients hint that data is not compressible. ``force`` means use compression under all circumstances even if the clients hint that the data is not compressible. (global.yaml.in)
! changed: bluestore_compression_mode: new: The default policy for using compression if the per-pool property ``compression_mode`` is not set. ``none`` means never use compression. ``passive`` means use compression when :c:func:`clients hint <rados_set_alloc_hint>` that data is compressible. ``aggressive`` means use compression unless clients hint that data is not compressible. ``force`` means use compression under all circumstances even if the clients hint that the data is not compressible. ''*_lazy'' modes are similar to their counterpart ones but compression to be performed during deep scrubbing only. Which has to be additionally enabled on per-pool basis using ''deep_scrub_recompress'' pool setting. (global.yaml.in)
The above configuration changes are found in the PR. Please update the relevant release documentation if necessary.
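The new '*_lazy' modes listed in the config diff can be viewed as an existing base mode plus a "defer to deep scrub" flag. A sketch of parsing them that way (this helper is hypothetical and not the actual Ceph option parser):

```cpp
#include <cassert>
#include <string>
#include <utility>

// Split a compression mode string such as "aggressive_lazy" into its
// base mode and a lazy flag. The value set comes from the config diff:
// none, passive, aggressive, force, plus their *_lazy variants.
inline std::pair<std::string, bool> parse_compression_mode(std::string mode) {
  const std::string suffix = "_lazy";
  bool lazy = mode.size() > suffix.size() &&
              mode.compare(mode.size() - suffix.size(),
                           suffix.size(), suffix) == 0;
  if (lazy) {
    mode.resize(mode.size() - suffix.size()); // strip "_lazy"
  }
  return {std::move(mode), lazy};
}
```

Per the diff text, the lazy flag alone is not enough: deep-scrub recompression must also be enabled via the per-pool setting it names before any lazy compression happens.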
This pull request has been automatically marked as stale because it has not had any activity for 60 days. It will be closed if no further activity occurs for another 30 days.
retest this please
Any news on this PR?
Might need rebasing on top of this once merged: #66552
This pull request has been automatically marked as stale because it has not had any activity for 60 days. It will be closed if no further activity occurs for another 30 days.
This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved |
Signed-off-by: Garry Drankovich <garry.drankovich@clyso.com>
Plus a few more PR comment resolutions from Adam. Signed-off-by: Garry Drankovich <garry.drankovich@clyso.com>
This is required to avoid buffer caching which would prevent content reformatting. Signed-off-by: Garry Drankovich <garry.drankovich@clyso.com>
This could occur when store_test tunes 'bluestore_allocator_lookup_policy' config parameter and restores default value on shutdown. Signed-off-by: Garry Drankovich <garry.dranckovich@clyso.com>
(force-pushed: c10b8dd to 33baf66)
This PR adds the ability for BlueStore to perform object content recompression and defragmentation when handling read requests during deep-scrubbing.
After reading an object's chunk, BlueStore assesses whether it can improve space savings by recompressing the object and/or optimizing its physical layout.
A per-pool option, 'data_layout_reformatting', has been introduced to control the behavior. It supports the following options:
Contribution Guidelines
To sign and title your commits, please refer to Submitting Patches to Ceph.
If you are submitting a fix for a stable branch (e.g. "quincy"), please refer to Submitting Patches to Ceph - Backports for the proper workflow.
When filling out the below checklist, you may click boxes directly in the GitHub web UI. When entering or editing the entire PR message in the GitHub web UI editor, you may also select a checklist item by adding an `x` between the brackets: `[x]`. Spaces and capitalization matter when checking off items this way.
Checklist
Show available Jenkins commands
jenkins retest this please
jenkins test classic perf
jenkins test crimson perf
jenkins test signed
jenkins test make check
jenkins test make check arm64
jenkins test submodules
jenkins test dashboard
jenkins test dashboard cephadm
jenkins test api
jenkins test docs
jenkins render docs
jenkins test ceph-volume all
jenkins test ceph-volume tox
jenkins test windows
jenkins test rook e2e