os/bluestore: Recompression, part 3. Segmented onode (#57448)
(force-pushed db369ad to 293e5cb)
src/os/bluestore/BlueStore.cc (outdated):

    return -ENOENT;
    std::unique_lock l{c->lock};
    c->pool_opts = opts;
    c->segment_size = c->pool_opts.value_or(
Do we really need this to be a per-pool option? The latter implies it is user-configurable. What would inspire users to tune it? Any metrics, recommendations, etc.?
Segmentation brings hard data separation. It limits the scan range for compression / recompression and for onode sharding.
Having larger segments enables larger compressed blob and reduced wastage.
Having smaller segments guarantees finer grain for sharding that does not create spanning blobs.
Do we really need this to be a per-pool option? Maybe not, but we already have options for min/max compressed blob size, minimal compression gain and compression mode. It just follows the pattern.
Maybe we should leave this out until customers request the pool option.
In some relevant sense one could argue that there is a relation between:
- onode_segment_size
- max_compressed_blob_size

But how would we explain why non-compressed onodes use settings meant for the compressed case?
The logic of picking a perfect onode_segment_size is very convoluted. It impacts the size of onode shards, and we know we want our shards not too big, but at the same time we do not want to have too many of them.
Could we devise some formula that takes into account:
- onode checksum type
- onode checksum size
- typical encoding sizes of allocated extents?

But it is difficult to predict the compression ratio (and this determines the amount of checksum data).
(force-pushed 293e5cb to 12db6d8)
(force-pushed ed7b4c5 to f643ee6)
This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved.
(force-pushed f643ee6 to 364bad7)
(force-pushed 4ef5b1d to ff87a86)
    @@ -5103,7 +5103,7 @@ options:
      type: bool
      level: advanced
src/os/bluestore/BlueStore.cc (outdated):

    // bound encode
    size_t bound = 0;
    denc(o->onode, bound);
    uint64_t flag = onode_segmentation ? 0 : bluestore_onode_t::FLAG_DEBUG_FORCE_V2;
Wouldn't it be better to use bitmap flags for everything: 'onode_segmentation' var, Onode::decode(), etc?
I'm not really getting your suggestion.
src/os/bluestore/BlueStore.cc (outdated):

    // we want to decide whether to continue streaming to the current shard
    // or move to the next one
    if ((estimate >= target /*we have enough already*/) ||
        (estimate + encoded_segment_estimate >= (target * 3 / 2))
Because slop is a static config value, while here the segment might differ between collections / objects.
The other question might be: should slop and segment work together? Maybe, but that's practically a full reimplementation of reshard(), which, technically, I am all for, but not in this PR.
Not sure I'm getting your point.
For the 'segment_size == 0' case we use a 'target + slop' expression:

    if ((estimate > 0)
        && (estimate + extent_avg > target + (would_span ? slop : 0))) {
      make_shard_here = true;
    }

where

    unsigned slop = target * cct->_conf->bluestore_extent_map_shard_target_size_slop;

But for the 'segment_size != 0' case you're using 'target * 3 / 2' as a threshold:

    if ((estimate >= target /*we have enough already*/) ||
        (estimate + encoded_segment_estimate >= (target * 3 / 2))
        /*we will be too large if we wait for next segment*/) {
      make_shard_here = true;
    }

I.e. given the default bluestore_extent_map_shard_target_size_slop value of 0.2, we have 1.2 * target vs. 1.5 * target in the two expressions. Not a big difference, so why use a hard-coded 3/2 instead of the configurable bluestore_extent_map_shard_target_size_slop?
Slop means: if closing the shard here would require creating a spanning blob, go and take one more extent; unless you have already overbooked by more than "slop", then accept creating the spanning blob.
There is no such logic with segmentation. It is not allowed to take just a bit of the next segment; we know we would have to take the entire next segment, and we know that at that point we could close the shard.
I see two different cases that should not be covered by the same setting.
However, I think the condition can be improved. How about:

    estimate >= target - encoded_segment_estimate/2 &&
    estimate <= target + encoded_segment_estimate/2

That way we could approach the best fit for bluestore_extent_map_shard_target_size from both sides.
OK. Sounds reasonable to me.
(force-pushed ac9de0a to aede0d4)
    (*pm)["bluestore_min_alloc_size"] = stringify(min_alloc_size);
    (*pm)["bluestore_allocation_from_file"] = stringify(fm && fm->is_null_manager());
    (*pm)["bluestore_write_mode"] = use_write_v2 ? "new" : "classic";
    (*pm)["bluestore_onode_segmentation"] = segment_size == 0 ? "inactive" : "active";
Why not report the segment size as-is? It's more helpful IMO.
I thought about printing the size here, but segment_size might differ between pools.
I preferred to be imprecise rather than wrong.
Well, I would agree with this statement if the global segment_size controlled segmentation for any onode, including ones with a persistent non-zero segment size. ;)
(force-pushed aede0d4 to f8783c9)
jenkins test make check
(force-pushed 257b63f to ae9bf6f)
This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved.
Split object data into segments of conf.bluestore_onode_segment_size bytes. This means that no blob will be in two segments at the same time. Modified the reshard function to prefer segment separation lines; as a result, no spanning blobs are created.
Signed-off-by: Adam Kupczyk <akupczyk@ibm.com>
Add perf counter "onode_spanning_blobs". It measures how many spanning blobs are currently present in loaded onodes. The intended use is to verify whether the conf parameter "bluestore_onode_segment_size" does eliminate spanning blobs.
Signed-off-by: Adam Kupczyk <akupczyk@ibm.com>
We had 3 constructors that did basically the same thing.
Signed-off-by: Adam Kupczyk <akupczyk@ibm.com>
Add a segment_size field to bluestore_onode_t. It is more reliable when the onode keeps its own segment size. Upgraded v2 to v3.
Object creation on v3 (new BlueStore): the object gets its segment_size field initialized from bluestore_onode_segment_size; however, if the pool option compression_max_blob_size is set and is larger, that value is used instead.
Upgrade: an upgraded object has segment_size = 0 by default, so new BlueStore does not use segmentation and spanning blobs can be created.
Downgrade: a downgraded object (one written by an older BlueStore) loses its segment_size setting; the older BlueStore has no problem understanding it.
Signed-off-by: Adam Kupczyk <akupczyk@ibm.com>
This debug-grade conf selects bluestore_onode_t v2 or v3. In v2 mode it can read v3 but drops onode_segmentation and pretends it does not exist, acting as a version that does not handle v3. It also writes onodes back to the DB in v2, clearing segmentation.
Signed-off-by: Adam Kupczyk <akupczyk@ibm.com>
Added 4 variants of bluestore onode segment size as random selections: 0 (disabled), 256K, 512K, 1024K.
Signed-off-by: Adam Kupczyk <akupczyk@ibm.com>
This is a variant of DENC_START that manually controls the encoding version. Since it is incompatible with the static compile-time checking of StructVChecker, a new (legacy) mode had to be introduced. Also removed struct_v and struct_compat from denc_finish.
Signed-off-by: Adam Kupczyk <akupczyk@ibm.com>
The bluestore_onode_t encoding version is directly controlled by options, so we cannot use the static checker.
Signed-off-by: Adam Kupczyk <akupczyk@ibm.com>
1. Inserted fallback code to avoid division by zero.
2. Fixed an inconsistency in the segment_size value between a created onode and an upgraded onode when settings disable segmentation.
3. Improved the decision on appending one more segment to the current shard.
Signed-off-by: Adam Kupczyk <akupczyk@ibm.com>
(force-pushed 09200f6 to 7472911)
jenkins test make check
Approved by https://tracker.ceph.com/issues/70787
jenkins test api
jenkins test make check arm64
jenkins test api
This is an adaptation of the standalone #55968 for the recompression PR.
The original PR will be closed in the future.
- #54075 Nice debugs.
- #54504 New write path.
- #57448 Segmented onode.
- #56975 Main
- #57450 Test