
os/bluestore: implement object content recompression/defragmentation when scrubbing#57631

Open

gardran wants to merge 10 commits into ceph:main from gardran:wip-gdran-defregment

Conversation

@gardran
Contributor

@gardran gardran commented May 22, 2024

This PR adds the ability for BlueStore to perform object content recompression and defragmentation when handling read requests during deep-scrubbing.
After reading an object's chunk, BlueStore assesses whether it can improve space savings by recompressing the object and/or optimizing the object's physical layout.
The per-pool option 'data_layout_reformatting' has been introduced to control this behavior. It supports the following values:

  • recompress - enables object content recompression
  • defragment - enables object defragmentation
  • both - enables both recompression and defragmentation
  • (empty) - original behavior, no object reformatting

Contribution Guidelines

  • To sign and title your commits, please refer to Submitting Patches to Ceph.

  • If you are submitting a fix for a stable branch (e.g. "quincy"), please refer to Submitting Patches to Ceph - Backports for the proper workflow.

  • When filling out the below checklist, you may click boxes directly in the GitHub web UI. When entering or editing the entire PR message in the GitHub web UI editor, you may also select a checklist item by adding an x between the brackets: [x]. Spaces and capitalization matter when checking off items this way.

Checklist

  • Tracker (select at least one)
    • References tracker ticket
    • Very recent bug; references commit where it was introduced
    • New feature (ticket optional)
    • Doc update (no ticket needed)
    • Code cleanup (no ticket needed)
  • Component impact
    • Affects Dashboard, opened tracker ticket
    • Affects Orchestrator, opened tracker ticket
    • No impact that needs to be tracked
  • Documentation (select at least one)
    • Updates relevant documentation
    • No doc update is appropriate
  • Tests (select at least one)


-  r = _do_read(c, o, offset, length, bl, op_flags);
+  r = _do_read(c, o, offset, length, bl, op_flags, 0,
+               might_need_reformatting ? &span_stat : nullptr);
Member

If we can potentially defrag on any read, it could be interesting once the extents are read off disk to make a decision about whether or not an object is fragmented enough to warrant defragmentation on-the-fly.

Contributor

@markhpc I am excited to see it. It opens up interesting paths.

@dvanders
Contributor

dvanders commented May 22, 2024

I'm not sure "data_layout_reformatting" makes it clear to users what it is doing.

Maybe instead use "optimize_during_scrub" with options "defrag", "recompress", "all", ...
Or perhaps better: a different boolean pool flag for each: defrag_during_scrub, recompress_during_scrub, ...

The second option is flexible if we want to allow additional optimizations or modes in the future.

@markhpc markhpc removed the bluestore label May 23, 2024
  {"bulk", BULK},
- {"read_ratio", READ_RATIO}
+ {"read_ratio", READ_RATIO},
+ {"data_layout_reformatting", DATA_LAYOUT_REFORMATTING},
Member

@markhpc markhpc May 23, 2024

This is missing corresponding "data_reformatting" get and set entries in MonCommands.h (easy to miss!)

@markhpc
Member

markhpc commented May 23, 2024

I'm not sure "data_layout_reformatting" makes it clear to users what it is doing.

Maybe instead use "optimize_during_scrub" with options "defrag", "recompress", "all", ... Or perhaps better: a different boolean pool flag for each: defrag_during_scrub, recompress_during_scrub, ...

The second option is flexible if we want to allow additional optimizations or modes in the future.

Very much dislike "optimize_during_scrub". I'd favor the boolean flag approach with something like "scrub_defragmentation", and "scrub_recompression". The number of pool options we have and the way we display them in the command help has become pretty unwieldy though. This doesn't do anything to help it. Having said that, I agree that this is best configurable at the pool level, especially if we want to allow lazy compression of objects on scrub.

@markhpc
Member

markhpc commented May 23, 2024

I did a very minimal initial test where I created 4 128GB RBD images, pre-filled them with 4MB writes, then did 5 minutes of 4K random writes to them in a pool with 512 PGs. This is on NVMe, so it's pretty quick. I then looked at a certain PG 2.1ff:

PG_STAT OBJECTS BYTES OMAP_BYTES* OMAP_KEYS* LOG LOG_DUPS DISK_LOG
2.1ff 273 1145044992 0 0 618 3000 618

and ran deep-scrub on it 3 times:

Start Time End Time Duration
2024-05-23T01:48:36.667+0000 2024-05-23T01:48:43.199+0000 6.532
2024-05-23T01:55:48.326+0000 2024-05-23T01:55:54.804+0000 6.478
2024-05-23T01:59:05.596+0000 2024-05-23T01:59:12.075+0000 6.479

And the fragmentation score and reformat counters at this point are:

"fragmentation_rating": 0.01635269932531749
"reformat_compress_attempted": 0,
"reformat_compress_omitted": 0,
"reformat_defragment_attempted": 0,
"reformat_defragment_omitted": 0,
"reformat_issued": 0

Next set the pool data_reformatting to defragment and rerun deep-scrub on 2.1ff:

Start Time End Time Duration
2024-05-23T04:01:18.327+0000 2024-05-23T04:01:24.845+0000 6.518
"fragmentation_rating": 0.016560792018378661
"reformat_compress_attempted": 0,
"reformat_compress_omitted": 0,
"reformat_defragment_attempted": 2184,
"reformat_defragment_omitted": 0,
"reformat_issued": 2184

Next set the pool data_reformatting to both and rerun deep-scrub on 2.1ff:

Start Time End Time Duration
2024-05-23T04:15:42.800+0000 2024-05-23T04:15:49.279+0000 6.479
"fragmentation_rating": 0.016560788158529967
"reformat_compress_attempted": 0,
"reformat_compress_omitted": 0,
"reformat_defragment_attempted": 2184,
"reformat_defragment_omitted": 0,
"reformat_issued": 2184

Next set the pool compression_mode to passive and rerun deep-scrub on 2.1ff. Results are similar to above.
Next set the pool compression_mode to aggressive and rerun deep-scrub on 2.1ff. (forgot to check results)
Next set the pool compression_mode to force and rerun deep-scrub on 2.1ff.

Start Time End Time Duration
2024-05-23T04:26:35.157+0000 2024-05-23T04:26:41.637+0000 6.48
"fragmentation_rating": 0.016575220146748098
"reformat_compress_attempted": 4368,
"reformat_compress_omitted": 4074,
"reformat_defragment_attempted": 2184,
"reformat_defragment_omitted": 0,
"reformat_issued": 2478

Running another deep scrub on that PG results in an attempt to recompress the objects again:

"reformat_compress_attempted": 6552,
"reformat_compress_omitted": 6111,
"reformat_defragment_attempted": 2184,
"reformat_defragment_omitted": 0,
"reformat_issued": 2625

A couple of thoughts:

  1. It would be nice if we had additional compression modes like "passive_lazy, aggressive_lazy, and force_lazy". The idea would be to allow objects to be written to the pool in an uncompressed state and only compress them on deep-scrub. Not sure how that would work with the global OSD level settings if we don't have the compress on scrub stuff set for the pool though.
  2. It looks like most of the re-compression attempts here were omitted. After re-reading the code, it looks like this should pretty much only happen if need > allocated or need == allocated.
  3. It's not clear to me why the fragmentation_rating goes up slightly after the first defrag pass. I probably should have been looking at onode_extents though. I'll try to do some better testing tomorrow.

Morning update: I tried deep-scrubbing another PG this morning with data_reformatting set to both and watched onode_extents:

pg onode_extents before onode_extents after
2.1fe 1259816 1276370

It looks like the number of extents actually went up?

@markhpc
Member

markhpc commented May 23, 2024

Here's a debug_bluestore = 10 log while deep-scrubbing one of the other PGs (2.1fb).
osd.2.log.gz

@gardran
Contributor Author

gardran commented May 24, 2024

extents actually went up?

Are you referring to the onode_extents perf counter here? IMO that's the wrong metric, as it counts logical (not physical) extents in the cache(!)
The proper way would be to check the report from "ceph tell osd.0 bluestore allocator dump block" and see how fragmented it is.
The output of "bluestore allocator fragmentation histogram block" is another means for that analysis.

@gardran
Contributor Author

gardran commented May 24, 2024

Here's a debug_bluestore = 10 log while deep-scrubbing one of the other PGs (2.1fb). osd.2.log.gz

Looks good to me; the majority of reads have log output like:
75+0000 7fed59ea4700 10 bluestore(/tmp/cbt/mnt/osd-device-2-data) read layout reformatting = both span stat {s=524288 ch=0 a=524288 sc=0 ac=0 sh=0 e=8 f=28}
2024-05-23T12:58:06.176+0000 7fed59ea4700 10 bluestore(/tmp/cbt/mnt/osd-device-2-data) read recompress info, 0x80000 vs. 0x80000
2024-05-23T12:58:06.176+0000 7fed59ea4700 10 HybridAllocator allocate 0x80000/1000,80000,0
2024-05-23T12:58:06.176+0000 7fed59ea4700 10 bluestore(/tmp/cbt/mnt/osd-device-2-data) read preallocated: 0x80000 new frags:1 defragment: 1
Meaning there had been 28 physical fragments originally and a single one to be written out.

@gardran
Contributor Author

gardran commented May 24, 2024

"reformat_compress_attempted": 4368,
"reformat_compress_omitted": 4074,

Here's a debug_bluestore = 10 log while deep-scrubbing one of the other PGs (2.1fb). osd.2.log.gz

I could see no initial compression, e.g.
2024-05-23T12:58:04.863+0000 7fed59ea4700 10 bluestore(/tmp/cbt/mnt/osd-device-2-data) read layout reformatting = both span stat {s=524288 ch=0 a=524288 sc=0 ac=0 sh=0 e=8 f=11}

sc (aka storage compressed) = 0
ac (aka allocated compressed) = 0
mean that no data has been compressed before scrubbing. And no recompression is applied:

2024-05-23T12:58:04.864+0000 7fed59ea4700 10 bluestore(/tmp/cbt/mnt/osd-device-2-data) read recompress info, 0x80000 vs. 0x80000

the above line shows no saving due to recompression - recompressed data needs the same 512K of space.

So I presume your input data flow isn't compressible, or something is wrong with the compression settings...

@markhpc
Member

markhpc commented May 24, 2024

extents actually went up?

Are you referring to the onode_extents perf counter here? IMO that's the wrong metric, as it counts logical (not physical) extents in the cache(!) The proper way would be to check the report from "ceph tell osd.0 bluestore allocator dump block" and see how fragmented it is. The output of "bluestore allocator fragmentation histogram block" is another means for that analysis.

Yikes! That's confusing. I don't know if I knew that at one point, but clearly I didn't remember it. :) We should rename it. I'll try to take a look with the histogram.

@markhpc
Member

markhpc commented May 24, 2024

I could see no initial compression, e.g. 2024-05-23T12:58:04.863+0000 7fed59ea4700 10 bluestore(/tmp/cbt/mnt/osd-device-2-data) read layout reformatting = both span stat {s=524288 ch=0 a=524288 sc=0 ac=0 sh=0 e=8 f=11}

sc (aka storage compressed) = 0 ac (aka allocated compressed) = 0 mean that no data has been compressed before scrubbing. And no recompression is applied:

2024-05-23T12:58:04.864+0000 7fed59ea4700 10 bluestore(/tmp/cbt/mnt/osd-device-2-data) read recompress info, 0x80000 vs. 0x80000

the above line shows no saving due to recompression - recompressed data needs the same 512K of space.

So I presume your input data flow isn't compressible, or something is wrong with the compression settings...

There was no initial compression. I wrote the data out uncompressed with data_reformatting off, then set it to "defragment", then "both". Next I set the compression mode to passive, then aggressive, then force. Each time, I ran a deep scrub on different PGs to see what would happen.

@dvanders
Contributor

I could see no initial compression, e.g. 2024-05-23T12:58:04.863+0000 7fed59ea4700 10 bluestore(/tmp/cbt/mnt/osd-device-2-data) read layout reformatting = both span stat {s=524288 ch=0 a=524288 sc=0 ac=0 sh=0 e=8 f=11}
sc (aka storage compressed) = 0 ac (aka allocated compressed) = 0 mean that no data has been compressed before scrubbing. And no recompression is applied:
2024-05-23T12:58:04.864+0000 7fed59ea4700 10 bluestore(/tmp/cbt/mnt/osd-device-2-data) read recompress info, 0x80000 vs. 0x80000
the above line shows no saving due to recompression - recompressed data needs the same 512K of space.
So I presume your input data flow isn't compressible, or something is wrong with the compression settings...

There was no initial compression. I wrote the data out uncompressed with data_reformatting off, then set it to "defragment", then "both". Next I set the compression mode to passive, then aggressive, then force. Each time, I ran a deep scrub on different PGs to see what would happen.

This is a great point. I think the intuitive user expectation is that this feature could be used to compress old objects written uncompressed before this feature was available.

Also earlier we spoke about a feature like:

  • pool compression is disabled, so initial writes are written uncompressed
  • scrub recompress works to compress the objects in the background

@gardran gardran force-pushed the wip-gdran-defregment branch from 462da8c to 56f74a1 on May 27, 2024 16:48
  length = o->onode.size;

- r = _do_read(c, o, offset, length, bl, op_flags);
+ r = _do_read(c, o, offset, length, bl, op_flags, 0,
Contributor

The implementation is very strongly tied to existing BlueStore execution paths.
This is a problem, because it gives very little room to actually work on optimization algorithms.

Contributor Author

That's true, but we really want to save some cycles by not doing an independent standalone optimization pass.

-  IOContext* ioc)
+  IOContext* ioc,
+  span_stat_t* res_span_stat)
   {
Contributor

_prepare_read_ioc is modified to provide scanning for optimization.
We should have a separate function that scans the object and decides on an action.
For example, it might happen that an object has 3 chunks for a total of 4MB,
and in an attempt to defragment it we allocate 3 chunks.

Contributor Author

_prepare_read_ioc is modified to provide scanning for optimization. We should have a separate function that scans the object and decides on an action.

That's again an attempt to design things for an independent optimization process, which still sounds questionable to me.

For example, it might happen that an object has 3 chunks for a total of 4MB, and in an attempt to defragment it we allocate 3 chunks.

Not sure I understand this example. Speaking of defragmentation only: it happens when the number of new fragments is less than the original number for a specific data span (512K when triggered from deep scrub).
If recompression is enabled as well, reformatting unconditionally occurs when the new required space for a data span is less than the original. Fragmentation is taken into consideration only when recompression is not beneficial. This recompression+defragmentation prioritization might look a bit inflexible - one might prefer defragmentation over recompression - but for now that looks like overkill to me.

Member

@markhpc markhpc Jun 3, 2024

The big reason I think we want to piggyback on deep-scrub is because we're already reading the content of the object to do the scrub, so it's the perfect time to make decisions about de-fragmentation and compression as the most expensive part of this can easily be reading the fragmented extents. I don't think we should scan objects as a separate loop unless we have a very good reason. It's much better in my mind to opportunistically defrag/compress when we've already done the expensive fragmented read.

Contributor

@aclamk aclamk Jun 12, 2024

@markhpc
I (now) agree that performing optimization on objects as an extra step during deep-scrub is good.
"I don't think we should scan objects as a separate loop unless we have a very good reason."
The code here is scanning the object - it's just piggybacking on _read_cache to do it.
There is no change to the behaviour of _read_cache.
The filling of span_stat_t could be done in a separate loop at the cost of a loop.
Similarly, we now load shared blobs in _prepare_read_ioc - useless for the reading process.
That would have made sense as an optimization if a significant part of the processing were reused - but it's only the loop.

@gardran My opinion is that whenever we make a new feature we should make it as detached from existing code as possible to minimize accidental bug injection. Within reason of course...

Contributor Author

@gardran My opinion is that whenever we make a new feature we should make it as detached from existing code as possible to minimize accidental bug injection. Within reason of course...

Generally I agree with this point when we're talking about a severe intervention into the execution path: changing the run flow, introducing new algorithms, etc.
Here I just perform stats collection (using simple arithmetic ops). So apart from a trivial null/invalid pointer dereference, this shouldn't introduce any issues to existing functionality. Moreover, if reformatting is disabled, this function collects stats into a dummy local var, which is an absolutely trivial operation.
So IMO this sort of change (in this specific function) is quite safe and isn't worth the separation - basic code review/testing should reveal any potential issues.

@github-actions

github-actions bot commented Jul 10, 2025

Config Diff Tool Output

! changed: bluestore_compression_mode: old:  (global.yaml.in)
! changed: bluestore_compression_mode: new: ['bluestore_prefer_deferred_size'] (global.yaml.in)
! changed: bluestore_compression_mode: old: 'none' means never use compression.  'passive' means use compression when clients hint that data is compressible.  'aggressive' means use compression unless clients hint that data is not compressible.  This option is used when the per-pool property for the compression mode is not present. (global.yaml.in)
! changed: bluestore_compression_mode: new: 'none' means never use compression.  'passive' means use compression when clients hint that data is compressible.  'aggressive' means use compression unless clients hint that data is not compressible.  '*_lazy' counterparts instruct to apply compression as per above during deep-scrubbing only. Which has to be additionally enabled at per-pool level using 'deep_scrub_recompression' pool setting. This option is used when the per-pool property for the compression mode is not present. (global.yaml.in)
! changed: bluestore_compression_mode: old: ['none', 'passive', 'aggressive', 'force'] (global.yaml.in)
! changed: bluestore_compression_mode: new: ['none', 'passive', 'aggressive', 'force', 'passive_lazy', 'aggressive_lazy', 'force_lazy'] (global.yaml.in)
! changed: bluestore_compression_mode: old: The default policy for using compression if the per-pool property ``compression_mode`` is not set. ``none`` means never use compression. ``passive`` means use compression when :c:func:`clients hint <rados_set_alloc_hint>` that data is compressible.  ``aggressive`` means use compression unless clients hint that data is not compressible.  ``force`` means use compression under all circumstances even if the clients hint that the data is not compressible. (global.yaml.in)
! changed: bluestore_compression_mode: new: The default policy for using compression if the per-pool property ``compression_mode`` is not set. ``none`` means never use compression. ``passive`` means use compression when :c:func:`clients hint <rados_set_alloc_hint>` that data is compressible.  ``aggressive`` means use compression unless clients hint that data is not compressible.  ``force`` means use compression under all circumstances even if the clients hint that the data is not compressible. ''*_lazy'' modes are similar to their counterpart ones but compression to be performed during deep scrubbing only. Which has to be additionally enabled on per-pool basis using ''deep_scrub_recompress'' pool setting. (global.yaml.in)

The above configuration changes are found in the PR. Please update the relevant release documentation if necessary.
Ignore this comment if docs are already updated. To make the "Check ceph config changes" CI check pass, please comment /config check ok and re-run the test.

@github-actions github-actions bot removed the stale label Jul 10, 2025
@github-actions

github-actions bot commented Sep 8, 2025

This pull request has been automatically marked as stale because it has not had any activity for 60 days. It will be closed if no further activity occurs for another 30 days.
If you are a maintainer or core committer, please follow-up on this pull request to identify what steps should be taken by the author to move this proposed change forward.
If you are the author of this pull request, thank you for your proposed contribution. If you believe this change is still appropriate, please ensure that any feedback has been addressed and ask for a code review.

@github-actions github-actions bot added the stale label Sep 8, 2025
@markhpc
Member

markhpc commented Sep 16, 2025

retest this please

@github-actions github-actions bot removed the stale label Sep 16, 2025
@dvanders
Contributor

Any news on this PR?
IMHO it's more important than ever now that we have fast EC, which I expect will increase fragmentation.

@ifed01
Contributor

ifed01 commented Dec 9, 2025

Might need rebasing on top of this once merged: #66552

@github-actions

github-actions bot commented Feb 7, 2026

This pull request has been automatically marked as stale because it has not had any activity for 60 days. It will be closed if no further activity occurs for another 30 days.
If you are a maintainer or core committer, please follow-up on this pull request to identify what steps should be taken by the author to move this proposed change forward.
If you are the author of this pull request, thank you for your proposed contribution. If you believe this change is still appropriate, please ensure that any feedback has been addressed and ask for a code review.

@github-actions

github-actions bot commented Feb 9, 2026

This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved

@github-actions github-actions bot removed the stale label Feb 9, 2026
Signed-off-by: Garry Drankovich <garry.drankovich@clyso.com>
Signed-off-by: Garry Drankovich <garry.drankovich@clyso.com>
Signed-off-by: Garry Drankovich <garry.drankovich@clyso.com>
Signed-off-by: Garry Drankovich <garry.drankovich@clyso.com>
Signed-off-by: Garry Drankovich <garry.drankovich@clyso.com>
Plus a few more PR comment resolutions from Adam.

Signed-off-by: Garry Drankovich <garry.drankovich@clyso.com>
Signed-off-by: Garry Drankovich <garry.drankovich@clyso.com>
Signed-off-by: Garry Drankovich <garry.drankovich@clyso.com>
This is required to avoid buffer caching which would prevent content
reformatting.

Signed-off-by: Garry Drankovich <garry.drankovich@clyso.com>
This could occur when store_test tunes
'bluestore_allocator_lookup_policy' config parameter and restores
default value on shutdown.

Signed-off-by: Garry Drankovich <garry.dranckovich@clyso.com>


6 participants