
os/bluestore: implement object content recompression/defragmentation when scrubbing#57631

Open

gardran wants to merge 10 commits into ceph:main from gardran:wip-gdran-defregment

Conversation

@gardran
Contributor

@gardran gardran commented May 22, 2024

This PR adds the ability for BlueStore to perform object content recompression and defragmentation when handling read requests during deep-scrubbing.
After reading an object's chunk, BlueStore assesses whether it can improve space savings by recompressing the object and/or optimizing the object's physical layout.
The per-pool option 'data_layout_reformatting' has been introduced to control this behavior. It supports the following values:

  • recompress - enables object content recompression
  • defragment - enables object defragmentation
  • both - enables both recompression and defragmentation
  • (empty) - original behavior, no object reformatting

Contribution Guidelines

  • To sign and title your commits, please refer to Submitting Patches to Ceph.

  • If you are submitting a fix for a stable branch (e.g. "quincy"), please refer to Submitting Patches to Ceph - Backports for the proper workflow.

  • When filling out the below checklist, you may click boxes directly in the GitHub web UI. When entering or editing the entire PR message in the GitHub web UI editor, you may also select a checklist item by adding an x between the brackets: [x]. Spaces and capitalization matter when checking off items this way.

Checklist

  • Tracker (select at least one)
    • References tracker ticket
    • Very recent bug; references commit where it was introduced
    • New feature (ticket optional)
    • Doc update (no ticket needed)
    • Code cleanup (no ticket needed)
  • Component impact
    • Affects Dashboard, opened tracker ticket
    • Affects Orchestrator, opened tracker ticket
    • No impact that needs to be tracked
  • Documentation (select at least one)
    • Updates relevant documentation
    • No doc update is appropriate
  • Tests (select at least one)


-  r = _do_read(c, o, offset, length, bl, op_flags);
+  r = _do_read(c, o, offset, length, bl, op_flags, 0,
+               might_need_reformatting ? &span_stat : nullptr);
Member

If we can potentially defrag on any read, it could be interesting once the extents are read off disk to make a decision about whether or not an object is fragmented enough to warrant defragmentation on-the-fly.

Contributor

@markhpc I am excited to see it. It opens up interesting paths.

@dvanders
Contributor

dvanders commented May 22, 2024

I'm not sure "data_layout_reformatting" makes it clear to users what it is doing.

Maybe instead use "optimize_during_scrub" with options "defrag", "recompress", "all", ...
Or perhaps better: a different boolean pool flag for each: defrag_during_scrub, recompress_during_scrub, ...

The second option is flexible if we want to allow additional optimizations or modes in the future.

@markhpc markhpc removed the bluestore label May 23, 2024
  {"bulk", BULK},
- {"read_ratio", READ_RATIO}
+ {"read_ratio", READ_RATIO},
+ {"data_layout_reformatting", DATA_LAYOUT_REFORMATTING},
Member

@markhpc markhpc May 23, 2024

This is missing corresponding "data_reformatting" get and set entries in MonCommands.h (easy to miss!)

@markhpc
Member

markhpc commented May 23, 2024

I'm not sure "data_layout_reformatting" makes it clear to users what it is doing.

Maybe instead use "optimize_during_scrub" with options "defrag", "recompress", "all", ... Or perhaps better: a different boolean pool flag for each: defrag_during_scrub, recompress_during_scrub, ...

The second option is flexible if we want to allow additional optimizations or modes in the future.

Very much dislike "optimize_during_scrub". I'd favor the boolean flag approach with something like "scrub_defragmentation", and "scrub_recompression". The number of pool options we have and the way we display them in the command help has become pretty unwieldy though. This doesn't do anything to help it. Having said that, I agree that this is best configurable at the pool level, especially if we want to allow lazy compression of objects on scrub.

@markhpc
Member

markhpc commented May 23, 2024

I did a very minimal initial test where I created 4 128GB RBD images, pre-filled them with 4MB writes, then did 5 minutes of 4K random writes to them in a pool with 512 PGs. This is on NVMe, so it's pretty quick. I then looked at a certain PG 2.1ff:

PG_STAT OBJECTS BYTES OMAP_BYTES* OMAP_KEYS* LOG LOG_DUPS DISK_LOG
2.1ff 273 1145044992 0 0 618 3000 618

and ran deep-scrub on it 3 times:

Start Time End Time Duration
2024-05-23T01:48:36.667+0000 2024-05-23T01:48:43.199+0000 6.532
2024-05-23T01:55:48.326+0000 2024-05-23T01:55:54.804+0000 6.478
2024-05-23T01:59:05.596+0000 2024-05-23T01:59:12.075+0000 6.479

And the fragmentation score and reformat counters at this point are:

"fragmentation_rating": 0.01635269932531749
"reformat_compress_attempted": 0,
"reformat_compress_omitted": 0,
"reformat_defragment_attempted": 0,
"reformat_defragment_omitted": 0,
"reformat_issued": 0

Next set the pool data_reformatting to defragment and rerun deep-scrub on 2.1ff:

Start Time End Time Duration
2024-05-23T04:01:18.327+0000 2024-05-23T04:01:24.845+0000 6.518
"fragmentation_rating": 0.016560792018378661
"reformat_compress_attempted": 0,
"reformat_compress_omitted": 0,
"reformat_defragment_attempted": 2184,
"reformat_defragment_omitted": 0,
"reformat_issued": 2184

Next set the pool data_reformatting to both and rerun deep-scrub on 2.1ff:

Start Time End Time Duration
2024-05-23T04:15:42.800+0000 2024-05-23T04:15:49.279+0000 6.479
"fragmentation_rating": 0.016560788158529967
"reformat_compress_attempted": 0,
"reformat_compress_omitted": 0,
"reformat_defragment_attempted": 2184,
"reformat_defragment_omitted": 0,
"reformat_issued": 2184

Next set the pool compression_mode to passive and rerun deep-scrub on 2.1ff. Results are similar to above.
Next set the pool compression_mode to aggressive and rerun deep-scrub on 2.1ff. (forgot to check results)
Next set the pool compression_mode to force and rerun deep-scrub on 2.1ff.

Start Time End Time Duration
2024-05-23T04:26:35.157+0000 2024-05-23T04:26:41.637+0000 6.48
"fragmentation_rating": 0.016575220146748098
"reformat_compress_attempted": 4368,
"reformat_compress_omitted": 4074,
"reformat_defragment_attempted": 2184,
"reformat_defragment_omitted": 0,
"reformat_issued": 2478

Running another deep scrub on that PG results in an attempt to recompress the objects again:

"reformat_compress_attempted": 6552,
"reformat_compress_omitted": 6111,
"reformat_defragment_attempted": 2184,
"reformat_defragment_omitted": 0,
"reformat_issued": 2625

A couple of thoughts:

  1. It would be nice if we had additional compression modes like "passive_lazy, aggressive_lazy, and force_lazy". The idea would be to allow objects to be written to the pool in an uncompressed state and only compress them on deep-scrub. Not sure how that would work with the global OSD level settings if we don't have the compress on scrub stuff set for the pool though.
  2. It looks like most of the re-compression attempts here were omitted. After re-reading the code, it looks like this should pretty much only happen if need > allocated or need == allocated.
  3. It's not clear to me why the fragmentation_rating goes up slightly after the first defrag pass. I probably should have been looking at onode_extents though. I'll try to do some better testing tomorrow.

Morning update: I tried deep-scrubbing another PG this morning with data_reformatting set to both and watched onode_extents:

pg onode_extents before onode_extents after
2.1fe 1259816 1276370

It looks like the number of extents actually went up?

@markhpc
Member

markhpc commented May 23, 2024

Here's a debug_bluestore = 10 log while deep-scrubbing one of the other PGs (2.1fb).
osd.2.log.gz

@gardran
Contributor Author

gardran commented May 24, 2024

extents actually went up?

Are you referring to the onode_extents perf counter here? IMO that's the wrong metric, as it counts logical (not physical) extents in the cache(!)
The proper way would be to check the report from "ceph tell osd.0 bluestore allocator dump block" and see how fragmented it is.
The output of "bluestore allocator fragmentation histogram block" is another means for that analysis.

@gardran
Contributor Author

gardran commented May 24, 2024

Here's a debug_bluestore = 10 log while deep-scrubbing one of the other PGs (2.1fb). osd.2.log.gz

Looks good to me; the majority of reads have log output like:
75+0000 7fed59ea4700 10 bluestore(/tmp/cbt/mnt/osd-device-2-data) read layout reformatting = both span stat {s=524288 ch=0 a=524288 sc=0 ac=0 sh=0 e=8 f=28}
2024-05-23T12:58:06.176+0000 7fed59ea4700 10 bluestore(/tmp/cbt/mnt/osd-device-2-data) read recompress info, 0x80000 vs. 0x80000
2024-05-23T12:58:06.176+0000 7fed59ea4700 10 HybridAllocator allocate 0x80000/1000,80000,0
2024-05-23T12:58:06.176+0000 7fed59ea4700 10 bluestore(/tmp/cbt/mnt/osd-device-2-data) read preallocated: 0x80000 new frags:1 defragment: 1
Meaning there had been 28 physical fragments originally and a single one to be written out.

@gardran
Contributor Author

gardran commented May 24, 2024

"reformat_compress_attempted": 4368,
"reformat_compress_omitted": 4074,

Here's a debug_bluestore = 10 log while deep-scrubbing one of the other PGs (2.1fb). osd.2.log.gz

I could see no initial compression, e.g.
2024-05-23T12:58:04.863+0000 7fed59ea4700 10 bluestore(/tmp/cbt/mnt/osd-device-2-data) read layout reformatting = both span stat {s=524288 ch=0 a=524288 sc=0 ac=0 sh=0 e=8 f=11}

sc (aka storage compressed) = 0
ac (aka allocated compressed) = 0
mean that no data has been compressed before scrubbing. And no recompression is applied:

2024-05-23T12:58:04.864+0000 7fed59ea4700 10 bluestore(/tmp/cbt/mnt/osd-device-2-data) read recompress info, 0x80000 vs. 0x80000

the above line shows no saving due to recompression - recompressed data needs the same 512K of space.

So I presume your input data flow isn't compressible, or something is wrong with the compression settings...

@markhpc
Member

markhpc commented May 24, 2024

extents actually went up?

Are you referring to the onode_extents perf counter here? IMO that's the wrong metric, as it counts logical (not physical) extents in the cache(!) The proper way would be to check the report from "ceph tell osd.0 bluestore allocator dump block" and see how fragmented it is. The output of "bluestore allocator fragmentation histogram block" is another means for that analysis.

Yikes! That's confusing. I don't know if I knew that at one point, but clearly I didn't remember it. :) We should rename it. I'll try to take a look with the histogram.

@markhpc
Member

markhpc commented May 24, 2024

I could see no initial compression, e.g. 2024-05-23T12:58:04.863+0000 7fed59ea4700 10 bluestore(/tmp/cbt/mnt/osd-device-2-data) read layout reformatting = both span stat {s=524288 ch=0 a=524288 sc=0 ac=0 sh=0 e=8 f=11}

sc (aka storage compressed) = 0 ac (aka allocated compressed) = 0 mean that no data has been compressed before scrubbing. And no recompression is applied:

2024-05-23T12:58:04.864+0000 7fed59ea4700 10 bluestore(/tmp/cbt/mnt/osd-device-2-data) read recompress info, 0x80000 vs. 0x80000

the above line shows no saving due to recompression - recompressed data needs the same 512K of space.

So I presume your input data flow isn't compressible, or something is wrong with the compression settings...

There was no initial compression. I wrote the data out uncompressed with data_reformatting off, then set it to "defragment", then "both". Next I set the compression mode to passive, then aggressive, then force. Each time, I ran a deep scrub on different PGs to see what would happen.

@dvanders
Contributor

I could see no initial compression, e.g. 2024-05-23T12:58:04.863+0000 7fed59ea4700 10 bluestore(/tmp/cbt/mnt/osd-device-2-data) read layout reformatting = both span stat {s=524288 ch=0 a=524288 sc=0 ac=0 sh=0 e=8 f=11}
sc (aka storage compressed) = 0 ac (aka allocated compressed) = 0 mean that no data has been compressed before scrubbing. And no recompression is applied:
2024-05-23T12:58:04.864+0000 7fed59ea4700 10 bluestore(/tmp/cbt/mnt/osd-device-2-data) read recompress info, 0x80000 vs. 0x80000
the above line shows no saving due to recompression - recompressed data needs the same 512K of space.
So I presume your input data flow isn't compressible, or something is wrong with the compression settings...

There was no initial compression. I wrote the data out uncompressed with data_reformatting off, then set it to "defragment", then "both". Next I set the compression mode to passive, then aggressive, then force. Each time, I ran a deep scrub on different PGs to see what would happen.

This is a great point. I think the intuitive user expectation is that this feature could be used to compress old objects written uncompressed before this feature was available.

Also earlier we spoke about a feature like:

  • pool compression is disabled, so initial writes are written uncompressed
  • scrub recompress works to compress the objects in the background

@gardran gardran force-pushed the wip-gdran-defregment branch from 462da8c to 56f74a1 on May 27, 2024 16:48
  length = o->onode.size;

- r = _do_read(c, o, offset, length, bl, op_flags);
+ r = _do_read(c, o, offset, length, bl, op_flags, 0,
Contributor

The implementation is very strongly tied to existing BlueStore execution paths.
This is a problem, because it gives very little room to actually work on optimization algorithms.

Contributor Author

That's true, but we really want to save some cycles by not doing an independent standalone optimization pass.

-  IOContext* ioc)
+  IOContext* ioc,
+  span_stat_t* res_span_stat)
   {
Contributor

_prepare_read_ioc is modified to provide scanning for optimization.
We should have a separate function that scans the object and decides on an action.
For example, it might happen that an object has 3 chunks for a total of 4MB,
and in an attempt to defragment it we allocate 3 chunks.

Contributor Author

_prepare_read_ioc is modified to provide scanning for optimization. We should have a separate function that scans the object and decides on an action.

That's again an attempt to design things for an independent optimization process, which still sounds questionable to me.

For example, it might happen that an object has 3 chunks for a total of 4MB, and in an attempt to defragment it we allocate 3 chunks.

Not sure I understand this example. Speaking of defragmentation only: it happens when the number of new fragments is less than the original number for a specific data span (512K when triggered from deep scrub).
If recompression is enabled as well, reformatting unconditionally occurs when the new required space for a data span is less than the original. Fragmentation is taken into consideration only when recompression is not beneficial. This recompression+defragmentation prioritization might look a bit inflexible - one might prefer defragmentation over recompression - but for now that looks like overkill to me.

Member

@markhpc markhpc Jun 3, 2024

The big reason I think we want to piggyback on deep-scrub is because we're already reading the content of the object to do the scrub, so it's the perfect time to make decisions about de-fragmentation and compression as the most expensive part of this can easily be reading the fragmented extents. I don't think we should scan objects as a separate loop unless we have a very good reason. It's much better in my mind to opportunistically defrag/compress when we've already done the expensive fragmented read.

Contributor

@aclamk aclamk Jun 12, 2024

@markhpc
I (now) agree that performing optimization on objects as an extra step during deep-scrub is good.
"I don't think we should scan objects as a separate loop unless we have a very good reason."
The code here is scanning the object - it's just piggybacking on _read_cache to do it.
There is no change to the behaviour of _read_cache.
The filling of span_stat_t could be done in a separate loop at the cost of a loop.
Similarly, we now load shared blobs in _prepare_read_ioc - useless for the reading process.
That would have made sense as an optimization if a significant part of the processing were reused - but it's only the loop.

@gardran My opinion is that whenever we make a new feature we should make it as detached from existing code as possible to minimize accidental bug injection. Within reason of course...

Contributor Author

@gardran My opinion is that whenever we make a new feature we should make it as detached from existing code as possible to minimize accidental bug injection. Within reason of course...

Generally I agree with this point when we're talking about a severe intervention into the execution path: changing the run flow, introducing new algorithms, etc.
Here I just perform stats collection (using simple arithmetic ops). So apart from a trivial null/invalid pointer dereference, this shouldn't introduce any issues to existing functionality. Moreover, if reformatting is disabled, this function collects stats into a dummy local var, which is an absolutely trivial operation.
So IMO this sort of change (in this specific function) is quite safe and isn't worth the separation - basic code review/testing should reveal any potential issues.

@github-actions

github-actions bot commented Jul 10, 2025

Config Diff Tool Output

! changed: bluestore_compression_mode: old:  (global.yaml.in)
! changed: bluestore_compression_mode: new: ['bluestore_prefer_deferred_size'] (global.yaml.in)
! changed: bluestore_compression_mode: old: 'none' means never use compression.  'passive' means use compression when clients hint that data is compressible.  'aggressive' means use compression unless clients hint that data is not compressible.  This option is used when the per-pool property for the compression mode is not present. (global.yaml.in)
! changed: bluestore_compression_mode: new: 'none' means never use compression.  'passive' means use compression when clients hint that data is compressible.  'aggressive' means use compression unless clients hint that data is not compressible.  '*_lazy' counterparts instruct to apply compression as per above during deep-scrubbing only. Which has to be additionally enabled at per-pool level using 'deep_scrub_recompression' pool setting. This option is used when the per-pool property for the compression mode is not present. (global.yaml.in)
! changed: bluestore_compression_mode: old: ['none', 'passive', 'aggressive', 'force'] (global.yaml.in)
! changed: bluestore_compression_mode: new: ['none', 'passive', 'aggressive', 'force', 'passive_lazy', 'aggressive_lazy', 'force_lazy'] (global.yaml.in)
! changed: bluestore_compression_mode: old: The default policy for using compression if the per-pool property ``compression_mode`` is not set. ``none`` means never use compression. ``passive`` means use compression when :c:func:`clients hint <rados_set_alloc_hint>` that data is compressible.  ``aggressive`` means use compression unless clients hint that data is not compressible.  ``force`` means use compression under all circumstances even if the clients hint that the data is not compressible. (global.yaml.in)
! changed: bluestore_compression_mode: new: The default policy for using compression if the per-pool property ``compression_mode`` is not set. ``none`` means never use compression. ``passive`` means use compression when :c:func:`clients hint <rados_set_alloc_hint>` that data is compressible.  ``aggressive`` means use compression unless clients hint that data is not compressible.  ``force`` means use compression under all circumstances even if the clients hint that the data is not compressible. ''*_lazy'' modes are similar to their counterpart ones but compression to be performed during deep scrubbing only. Which has to be additionally enabled on per-pool basis using ''deep_scrub_recompress'' pool setting. (global.yaml.in)

The above configuration changes are found in the PR. Please update the relevant release documentation if necessary.
Ignore this comment if docs are already updated. To make the "Check ceph config changes" CI check pass, please comment /config check ok and re-run the test.

@github-actions github-actions bot removed the stale label Jul 10, 2025
@github-actions

github-actions bot commented Sep 8, 2025

This pull request has been automatically marked as stale because it has not had any activity for 60 days. It will be closed if no further activity occurs for another 30 days.
If you are a maintainer or core committer, please follow-up on this pull request to identify what steps should be taken by the author to move this proposed change forward.
If you are the author of this pull request, thank you for your proposed contribution. If you believe this change is still appropriate, please ensure that any feedback has been addressed and ask for a code review.

@github-actions github-actions bot added the stale label Sep 8, 2025
@markhpc
Member

markhpc commented Sep 16, 2025

retest this please

@github-actions github-actions bot removed the stale label Sep 16, 2025
@dvanders
Contributor

Any news on this PR?
IMHO it's more important than ever now that we have fast EC, which I expect will increase fragmentation.

@ifed01
Contributor

ifed01 commented Dec 9, 2025

Might need rebasing on top of this once merged: #66552

@github-actions

github-actions bot commented Feb 7, 2026

This pull request has been automatically marked as stale because it has not had any activity for 60 days. It will be closed if no further activity occurs for another 30 days.
If you are a maintainer or core committer, please follow-up on this pull request to identify what steps should be taken by the author to move this proposed change forward.
If you are the author of this pull request, thank you for your proposed contribution. If you believe this change is still appropriate, please ensure that any feedback has been addressed and ask for a code review.

@github-actions

github-actions bot commented Feb 9, 2026

This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved

@github-actions github-actions bot removed the stale label Feb 9, 2026
Signed-off-by: Garry Drankovich <garry.drankovich@clyso.com>
Signed-off-by: Garry Drankovich <garry.drankovich@clyso.com>
Signed-off-by: Garry Drankovich <garry.drankovich@clyso.com>
Signed-off-by: Garry Drankovich <garry.drankovich@clyso.com>
Signed-off-by: Garry Drankovich <garry.drankovich@clyso.com>
Plus a few more PR comment resolutions from Adam.

Signed-off-by: Garry Drankovich <garry.drankovich@clyso.com>
Signed-off-by: Garry Drankovich <garry.drankovich@clyso.com>
Signed-off-by: Garry Drankovich <garry.drankovich@clyso.com>
This is required to avoid buffer caching which would prevent content
reformatting.

Signed-off-by: Garry Drankovich <garry.drankovich@clyso.com>
This could occur when store_test tunes
'bluestore_allocator_lookup_policy' config parameter and restores
default value on shutdown.

Signed-off-by: Garry Drankovich <garry.dranckovich@clyso.com>


6 participants