
os/bluestore: Recompression, part 3. Segmented onode.#57448

Merged
idryomov merged 9 commits into ceph:main from aclamk:wip-aclamk-bs-recompression-segmented-data
Apr 9, 2025
Conversation


@aclamk aclamk commented May 13, 2024

This is an adaptation of the standalone #55968 for the recompression PR series.
The original PR will be closed in the future.

#54075 Nice debugs.
#54504 New write path.
#57448 Segmented onode.
#56975 Main
#57450 Test

Contribution Guidelines

  • To sign and title your commits, please refer to Submitting Patches to Ceph.

  • If you are submitting a fix for a stable branch (e.g. "quincy"), please refer to Submitting Patches to Ceph - Backports for the proper workflow.

  • When filling out the below checklist, you may click boxes directly in the GitHub web UI. When entering or editing the entire PR message in the GitHub web UI editor, you may also select a checklist item by adding an x between the brackets: [x]. Spaces and capitalization matter when checking off items this way.

Checklist

  • Tracker (select at least one)
    • References tracker ticket
    • Very recent bug; references commit where it was introduced
    • New feature (ticket optional)
    • Doc update (no ticket needed)
    • Code cleanup (no ticket needed)
  • Component impact
    • Affects Dashboard, opened tracker ticket
    • Affects Orchestrator, opened tracker ticket
    • No impact that needs to be tracked
  • Documentation (select at least one)
    • Updates relevant documentation
    • No doc update is appropriate
  • Tests (select at least one)
Available Jenkins commands:
  • jenkins retest this please
  • jenkins test classic perf
  • jenkins test crimson perf
  • jenkins test signed
  • jenkins test make check
  • jenkins test make check arm64
  • jenkins test submodules
  • jenkins test dashboard
  • jenkins test dashboard cephadm
  • jenkins test api
  • jenkins test docs
  • jenkins render docs
  • jenkins test ceph-volume all
  • jenkins test ceph-volume tox
  • jenkins test windows
  • jenkins test rook e2e

@aclamk aclamk requested a review from a team as a code owner May 13, 2024 12:21
@aclamk aclamk changed the title from "bluestore: Recompression part 3, segmented onode" to "os/bluestore: Recompression part 3, segmented onode" May 13, 2024
@aclamk aclamk changed the title from "os/bluestore: Recompression part 3, segmented onode" to "os/bluestore: Recompression, part 3, Segmented onode." May 13, 2024
@aclamk aclamk changed the title from "os/bluestore: Recompression, part 3, Segmented onode." to "os/bluestore: Recompression, part 3. Segmented onode." May 13, 2024
@aclamk aclamk force-pushed the wip-aclamk-bs-recompression-segmented-data branch 3 times, most recently from db369ad to 293e5cb on May 27, 2024 07:03
return -ENOENT;
std::unique_lock l{c->lock};
c->pool_opts = opts;
c->segment_size = c->pool_opts.value_or(
Contributor:

Do we really need this to be a per-pool option? The latter implies this is user-configurable. What would inspire users to tune it? Any metrics, recommendations, etc.?

Contributor Author:

Segmentation brings hard data separation. It limits the scan range for compression / recompression and for onode sharding.
Larger segments enable larger compressed blobs and reduce wastage.
Smaller segments guarantee a finer granularity for sharding that does not create spanning blobs.

Do we really need this to be a per-pool option? Maybe not, but we already have pool options for min/max compressed blob size, minimal compression gain and compression mode. It just follows the pattern.
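As a side note, the per-pool override pattern referred to can be sketched roughly as follows (illustrative key and function names only; this is not the actual pool_opts_t API):

```cpp
#include <cstdint>
#include <map>
#include <string>

// Hedged sketch of the per-pool-option pattern: a pool-level setting
// overrides the global conf default when present. The plain map stands in
// for the real pool_opts_t; the option key here is hypothetical.
uint64_t effective_segment_size(
    const std::map<std::string, uint64_t>& pool_opts,
    uint64_t global_default) {
  auto it = pool_opts.find("onode_segment_size");
  return it != pool_opts.end() ? it->second : global_default;
}
```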

Contributor:

Maybe we should leave this out until customers request the pool option.

Contributor Author (aclamk, Sep 25, 2024):

In some relevant sense one could argue that there is a relation between

  • onode_segment_size
    and
  • max_compressed_blob_size

But how would we explain why non-compressed onodes use a setting meant for the compressed case?

The logic of choosing a perfect onode_segment_size is very convoluted. It impacts the size of onode shards, and we want shards that are neither too big nor too numerous.
Could we devise some formula that takes into account:

  • onode checksum type
  • onode checksum size
  • typical encoding sizes of allocated extents

But it is difficult to predict the compression ratio (and that determines the amount of checksum data).
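For illustration only, such a formula could look something like this. All names and per-extent costs here are hypothetical placeholders, not values from BlueStore; predicting the real encoded sizes is exactly the hard part noted above:

```cpp
#include <cstdint>

// Back-of-the-envelope estimate of how many bytes of extent-map metadata
// one segment would encode, assuming a fixed per-extent encoding cost and a
// fixed per-checksum-value cost. Purely illustrative.
uint64_t estimated_segment_metadata_bytes(uint64_t segment_size,
                                          uint64_t typical_extent_size,
                                          uint64_t extent_encoding_bytes,
                                          uint64_t csum_chunk_size,
                                          uint64_t csum_value_bytes) {
  uint64_t extents = segment_size / typical_extent_size;
  uint64_t csum_values = segment_size / csum_chunk_size;
  return extents * extent_encoding_bytes + csum_values * csum_value_bytes;
}
```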

@aclamk aclamk force-pushed the wip-aclamk-bs-recompression-segmented-data branch from 293e5cb to 12db6d8 on June 21, 2024 15:05
@aclamk aclamk force-pushed the wip-aclamk-bs-recompression-segmented-data branch 2 times, most recently from ed7b4c5 to f643ee6 on July 1, 2024 10:38
github-actions bot commented Jul 9, 2024

This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved

@aclamk aclamk force-pushed the wip-aclamk-bs-recompression-segmented-data branch from f643ee6 to 364bad7 on July 12, 2024 09:52
@aclamk aclamk force-pushed the wip-aclamk-bs-recompression-segmented-data branch 2 times, most recently from 4ef5b1d to ff87a86 on July 14, 2024 16:35
@@ -5103,7 +5103,7 @@ options:
type: bool
level: advanced
Contributor:

This should be 'dev' IMO

// bound encode
size_t bound = 0;
denc(o->onode, bound);
uint64_t flag = onode_segmentation ? 0 : bluestore_onode_t::FLAG_DEBUG_FORCE_V2;
Contributor:

Wouldn't it be better to use bitmap flags for everything: 'onode_segmentation' var, Onode::decode(), etc?

Contributor Author:

I'm not really getting your suggestion.

// we want to decide whether to continue streaming to the current shard
// or move to the next one
if ((estimate >= target /*we have enough already*/) ||
(estimate + encoded_segment_estimate >= (target * 3 / 2))
Contributor:

Why not use slop here?

Contributor Author:

Because slop is a static config. Here the segment size might differ for different collections / objects.
The other question might be: should slop and segment work together?
Maybe, but that would be close to a full reimplementation of reshard(),
which technically I am all for, but not in this PR.

Contributor:

Not sure I'm getting your point.
For the 'segment_size == 0' case we use a 'target + slop' expression:
if ((estimate > 0)
&& (estimate + extent_avg > target + (would_span ? slop : 0))) {
make_shard_here = true;
}

where
unsigned slop = target * cct->_conf->bluestore_extent_map_shard_target_size_slop;

But for the 'segment_size != 0' case you're using 'target * 3 / 2' as a threshold:
if ((estimate >= target /*we have enough already*/) ||
(estimate + encoded_segment_estimate >= (target * 3 / 2))
/*we will be too large if we wait for next segment*/) {
make_shard_here = true;
}

I.e. given the default bluestore_extent_map_shard_target_size_slop value of 0.2,
we have 1.2 * target vs. 1.5 * target in the expressions. Not a big difference. So why use a hard-coded 3/2 instead of the configurable bluestore_extent_map_shard_target_size_slop?

Contributor Author:

Slop means: if closing the shard here would require creating a spanning blob, go and take one more extent; unless you have already overshot by more than "slop", in which case accept creating the spanning blob.

There is no such logic with segmentation. It is not allowed to take a bit of the next segment; we know we would have to take the entire next segment, and we know that at that point we could close the shard.

I see these as different cases that should not be covered by the same setting.

However, I think the condition can be improved.
How about:
estimate >= target - encoded_segment_estimate/2 &&
estimate <= target + encoded_segment_estimate/2
That way we could approach the best fit for bluestore_extent_map_shard_target_size from both sides.
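A minimal sketch of that proposed symmetric condition (a hypothetical helper, not code from the PR):

```cpp
#include <cstdint>
#include <cstdlib>

// Close the shard when the current estimate is within half of the next
// segment's encoded size of the target, i.e. when taking one more segment
// would move us further from bluestore_extent_map_shard_target_size than
// stopping now. Equivalent to:
//   estimate >= target - encoded_segment_estimate/2 &&
//   estimate <= target + encoded_segment_estimate/2
bool make_shard_here(int64_t estimate, int64_t target,
                     int64_t encoded_segment_estimate) {
  return std::llabs(estimate - target) <= encoded_segment_estimate / 2;
}
```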

Contributor:

OK. Sounds reasonable to me.

Contributor Author:

Reworked condition.

@aclamk aclamk force-pushed the wip-aclamk-bs-recompression-segmented-data branch from ac9de0a to aede0d4 on March 26, 2025 08:53
(*pm)["bluestore_min_alloc_size"] = stringify(min_alloc_size);
(*pm)["bluestore_allocation_from_file"] = stringify(fm && fm->is_null_manager());
(*pm)["bluestore_write_mode"] = use_write_v2 ? "new" : "classic";
(*pm)["bluestore_onode_segmentation"] = segment_size == 0 ? "inactive" : "active";
Contributor:

Why not report the segment size as-is? It's more helpful IMO.

Contributor:

What about this one?

Contributor Author:

I thought about printing the size here, but segment_size might differ between pools.
I preferred to be imprecise rather than wrong.

Contributor:

Well, I'd agree with this statement if the global segment_size controlled segmentation for every onode, including ones with a persistent non-zero segment size. ;)

@aclamk aclamk force-pushed the wip-aclamk-bs-recompression-segmented-data branch from aede0d4 to f8783c9 on March 26, 2025 13:26

aclamk commented Mar 28, 2025

jenkins test make check

@aclamk aclamk force-pushed the wip-aclamk-bs-recompression-segmented-data branch 2 times, most recently from 257b63f to ae9bf6f on March 28, 2025 10:00
github-actions bot:
This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved

aclamk added 9 commits March 31, 2025 07:32
Split object data into segments of conf.bluestore_onode_segment_size bytes.
This means that no blob will span two segments at the same time.
Modified the reshard function to prefer segment separation lines.
As a result, no spanning blobs are created.

Signed-off-by: Adam Kupczyk <akupczyk@ibm.com>
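The boundary rule described in this commit message can be sketched like this (illustrative only, not the actual BlueStore code):

```cpp
#include <cstdint>

// With segmentation enabled, a write is clipped at segment boundaries so
// that no blob crosses from one segment into the next. 'segment_size'
// stands in for conf.bluestore_onode_segment_size and must be non-zero.
uint64_t segment_end(uint64_t offset, uint64_t segment_size) {
  // first segment boundary strictly after 'offset'
  return (offset / segment_size + 1) * segment_size;
}
```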
Add perf counter "onode_spanning_blobs".
Measures how many spanning blobs are currently present in loaded
onodes.
The intended use is to verify whether the conf parameter
"bluestore_onode_segment_size" eliminates spanning blobs.

Signed-off-by: Adam Kupczyk <akupczyk@ibm.com>
We had 3 constructors that did basically the same thing.

Signed-off-by: Adam Kupczyk <akupczyk@ibm.com>
Add a segment_size field to bluestore_onode_t.
It is more reliable when the onode keeps its own segment size.
Upgraded encoding v2 to v3.

Object creation on v3 (new BlueStore):
The object gets its segment_size field initialized from bluestore_onode_segment_size.
However, if the pool opt compression_max_blob_size is set and is larger, it will be used.

Upgrade:
An upgraded object has segment_size = 0 by default, so the new BlueStore
does not use segmentation for it. Spanning blobs can be created.

Downgrade:
An object downgraded (by being written by an older BlueStore) loses its segment_size setting.
The older BlueStore will have no problem understanding it.

Signed-off-by: Adam Kupczyk <akupczyk@ibm.com>
This debug-grade conf selects bluestore_onode_t v2 or v3.
In v2 mode it can read v3 but drops onode segmentation and pretends
it does not exist, acting like a version that does not handle v3.
It also writes onodes back to the DB in v2, clearing segmentation.

Signed-off-by: Adam Kupczyk <akupczyk@ibm.com>
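The compatibility behaviour described above amounts to something like this on the decode side (a simplified sketch, not the actual denc code; the helper name is hypothetical):

```cpp
#include <cstdint>

// A v2 reader never decodes segment_size, so the field stays at its default
// of 0 and segmentation is effectively off; a v3 reader picks up the
// encoded value. The struct versions and field are from the commit message.
uint64_t decoded_segment_size(uint8_t struct_v, uint64_t encoded_value) {
  return struct_v >= 3 ? encoded_value : 0;
}
```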
Added 4 variants of bluestore onode segment size as random selections:
0 (disabled)
256K
512K
1024K

Signed-off-by: Adam Kupczyk <akupczyk@ibm.com>
This is a variant of DENC_START that manually controls the encoding version.
Since it is incompatible with the static compile-time checking of
StructVChecker, a new (legacy) mode had to be introduced.

Also, removed struct_v and struct_compat from denc_finish.

Signed-off-by: Adam Kupczyk <akupczyk@ibm.com>
The bluestore_onode_t encoding version is directly controlled by options.
We cannot use the static checker.

Signed-off-by: Adam Kupczyk <akupczyk@ibm.com>
1. Inserted fallback code to avoid division by 0.
2. Fixed an inconsistency in the segment_size value between
  - onode created
  - onode upgraded
  when settings say to disable segmentation.
3. Improved the decision on appending one more segment to the current shard.

Signed-off-by: Adam Kupczyk <akupczyk@ibm.com>
@aclamk aclamk force-pushed the wip-aclamk-bs-recompression-segmented-data branch from 09200f6 to 7472911 on March 31, 2025 07:51

aclamk commented Apr 2, 2025

jenkins test make check

@aclamk aclamk added aclamk-testing-ganymede bluestore testing and removed aclamk-testing-phoebe bluestore testing labels Apr 2, 2025

aclamk commented Apr 7, 2025

Approved by https://tracker.ceph.com/issues/70787


aclamk commented Apr 7, 2025

jenkins test api


aclamk commented Apr 7, 2025

jenkins test make check arm64


aclamk commented Apr 7, 2025

jenkins test api

@idryomov idryomov merged commit 43369eb into ceph:main Apr 9, 2025
12 checks passed
5 participants