os/bluestore: Recompression, part 2. New write path. #54504
Conversation
9c7fb12 to fb1615f
fb1615f to 21b5dfb
21b5dfb to 736de0c
src/os/bluestore/BlueStore.cc (Outdated)

    _dump_extent_map<LogLevelV>(cct, o.extent_map);
    }

    std::ostream& operator<<(std::ostream& out, const BlueStore::Onode::printer &p)
Shouldn't this go to the BlueStore_debug.cc module?

More generally, doesn't it make sense to move all of BlueStore's operator<< overloads to the BlueStore_debug.cc module?

Moved.
I think we could move all operator<< overloads there, but I'd prefer a separate PR for that.
src/os/bluestore/BlueStore.h (Outdated)

    }
    struct printer : public BlueStore::printer {
      const Extent& ext;
      uint8_t mode;
shouldn't this be uint16_t to match the other printers?

It only uses the first few bits, but I synchronized it to uint16_t.
    blob->shared_blob->get_cache()->rm_extent();
    }
    }
    struct printer : public BlueStore::printer {
We could declare BlueStore::printer as a template class implementing all this type-specific logic, and hence wouldn't need to copy-paste it for each "printed" struct/class.

We could, but I tied the verbosity of printing to dout levels. It works very nicely and I don't want to lose it.
Ping.

This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved.
736de0c to b0235a6
Now #54075 is a base PR for this one.

This pull request has been automatically marked as stale because it has not had any activity for 60 days. It will be closed if no further activity occurs for another 30 days.
    /// shared_changed - set of shared blobs that are modified,
    ///                  including the case of shared blob being empty
    /// statfs_delta - delta of stats
    BlueStore::extent_map_t::iterator BlueStore::_punch_hole_2(
I really don't like the idea of having methods of non-Writer classes defined in Writer.cc.

Makes sense.
A new file then? It will never be inlined, so there is no penalty.

Since this is a BlueStore method, it likely has to go in BlueStore.cc.

We might want to split BlueStore.cc into multiple modules later, though.
    }

    // should _maybe_expand_blob go to Blob ?
    inline void BlueStore::Writer::_maybe_expand_blob(
IMO better to move it close to the other Writer methods.

Or indeed move it to the Blob class and locate it accordingly.
src/os/bluestore/Writer.cc (Outdated)

    for (auto loc : disk_allocs) {
      ceph_assert(initial_offset <= loc.length);
      bufferlist data_chunk;
      uint32_t data_to_write = std::min(data.length(), loc.length - initial_offset);
IMO better to check data_to_write for zero explicitly before submitting an I/O, just to be sure there is no unexpected behavior due to that.
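The suggested guard can be sketched in isolation. All names below (`split_into_chunks`, `Extent`) are hypothetical stand-ins for illustration, not BlueStore code: when slicing a buffer across allocated extents, a chunk length computed with `std::min` can legitimately come out zero, and skipping it explicitly guarantees no empty I/O is submitted.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

struct Extent { uint64_t offset; uint32_t length; };

// Hypothetical helper: split data_len bytes across allocs, starting
// initial_offset bytes into the first extent. Returns the chunk
// lengths that would actually be submitted as I/O.
std::vector<uint32_t> split_into_chunks(uint32_t data_len,
                                        uint32_t initial_offset,
                                        const std::vector<Extent>& allocs) {
  std::vector<uint32_t> chunks;
  for (const auto& loc : allocs) {
    assert(initial_offset <= loc.length);
    uint32_t data_to_write = std::min(data_len, loc.length - initial_offset);
    // The explicit zero check from the review: never emit an empty I/O.
    if (data_to_write == 0) {
      initial_offset = 0;
      continue;
    }
    chunks.push_back(data_to_write);
    data_len -= data_to_write;
    initial_offset = 0;  // only the first extent is entered mid-way
  }
  return chunks;
}
```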
src/os/bluestore/Writer.cc (Outdated)

    bluestore_deferred_op_t *op = bstore->_get_deferred_op(txc, data.length());
    op->op = bluestore_deferred_op_t::OP_WRITE;
    for (auto& e : disk_allocs) {
      op->extents.emplace_back(e);
We can resize op->extents explicitly before the loop.
And I'm curious whether we need disk_allocs after the call, and whether we can start using swap instead.

const on disk_allocs is not useful.
Dropping it; swap() will be used.

Update: can't do. In some cases disk_allocs is used afterwards.

But I can do op->extents = disk_allocs; instead of the loop.
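The two options discussed can be sketched with a stand-in element type (`PExtent` below is hypothetical, not the real bluestore_pextent_t): copy-assignment replaces the emplace loop while keeping the source intact, whereas swap() avoids the copy but empties the source, which is why it doesn't fit when disk_allocs is read afterwards.

```cpp
#include <cstdint>
#include <vector>

// Stand-in for the extent element type; a simple aggregate.
struct PExtent {
  uint64_t offset;
  uint32_t length;
  bool operator==(const PExtent& o) const {
    return offset == o.offset && length == o.length;
  }
};

// Option 1: copy-assign instead of an emplace_back loop. One
// allocation, one bulk copy, and the source stays usable -- which
// matters when disk_allocs is sometimes read later.
std::vector<PExtent> copy_extents(const std::vector<PExtent>& disk_allocs) {
  std::vector<PExtent> extents;
  extents = disk_allocs;  // replaces: for (auto& e : disk_allocs) extents.emplace_back(e);
  return extents;
}

// Option 2: swap() avoids the copy entirely, but it empties the
// source, so it is only valid when disk_allocs is dead afterwards.
std::vector<PExtent> steal_extents(std::vector<PExtent>& disk_allocs) {
  std::vector<PExtent> extents;
  extents.swap(disk_allocs);
  return extents;
}
```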
src/os/bluestore/bluestore_types.cc (Outdated)

    if (p->length - offset) {
      released_disk->emplace_back(p->offset + offset, p->length - offset);
    }
    released_length += p->length - offset;
This and the next two lines could be moved under the above 'if' statement.

On the other hand, (p->length - offset) is always greater than zero, so there is no need for the above 'if' at all...

Let's optimize, even if only a bit.
On the other hand, I wonder whether the optimizer would not squeeze it out anyway.
src/os/bluestore/bluestore_types.cc (Outdated)

    auto anchor_it = extents.begin() + (anchor - begin);
    if (insert_element_cnt != 0) {
      if (insert_element_cnt > 0) {
        anchor_it = extents.insert(anchor_it, insert_element_cnt, bluestore_pextent_t(0, 0));
nit: IMO better to insert bluestore_pextent_t(EMPTY, 0)
src/os/bluestore/bluestore_types.cc (Outdated)

    // and insert hold in this place
    int32_t insert_element_cnt = hold_size - (p - anchor);
    auto anchor_it = extents.begin() + (anchor - begin);
    if (insert_element_cnt != 0) {
IMO more readable/elegant would be:

    if (insert_element_cnt > 0) {
      ...
    } else if (insert_element_cnt < 0) {
      ...
    }

Changed it; now two independent ifs.
        anchor_it = extents.erase(anchor_it, anchor_it + (-insert_element_cnt));
      }
    }
    for (uint32_t i = 0; i < hold_size; i++) {
hint: one could use std::copy if hold were defined as a std::array.

I look at this function and I think it is too clever.
What do you think about refactoring it to be two-phased:
- find it_begin, offset_begin; it_end, offset_end
- move to released, cut out the KNOWN region

Do you feel it might produce clearer code?
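The insert/erase pattern under discussion can be sketched with plain ints standing in for bluestore_pextent_t (the function and parameter names below are illustrative, not the real code): a signed count decides whether the region [anchor, p) grows by inserting placeholders or shrinks by erasing surplus elements.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Sketch of the pattern: hold_size is how many elements the range
// [anchor, p) must end up holding; the signed delta decides whether
// we grow (insert placeholders) or shrink (erase surplus).
void resize_range(std::vector<int>& extents,
                  size_t anchor,       // index where the hold region starts
                  size_t p,            // index one past the current region
                  size_t hold_size) {  // desired region size
  int32_t insert_element_cnt = int32_t(hold_size) - int32_t(p - anchor);
  auto anchor_it = extents.begin() + anchor;
  // Two independent ifs, as settled on in the review thread; only one
  // branch can run, so anchor_it is never used after invalidation.
  if (insert_element_cnt > 0) {
    extents.insert(anchor_it, insert_element_cnt, 0 /* placeholder */);
  }
  if (insert_element_cnt < 0) {
    extents.erase(anchor_it, anchor_it + (-insert_element_cnt));
  }
}
```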
b0235a6 to 9566e0f
src/os/bluestore/Writer.cc (Outdated)

    for (auto q : bd) {
      *_dout << " " << lof << "~" << q.disk_data.length();
      lof += q.disk_data.length();
    }
skip this if not on level 25?

We do make the code execute only on level 25+:

    dout(25);
    ... conditional code in new scope ...
    *_dout << dendl;
    dout(25) << "after punch_hole_2: " << std::endl << onode->print(pp_mode) << dendl;

    uint32_t au_size = bstore->min_alloc_size;
    if (au_size != bstore->block_size) {
why can't this work when au_size == block_size?

That would be an empty run.
The code is intended to use portions of blobs that are allocated (AU=64K) but have never been written.
    if (left_do_pad) {
      tmp.append_zero(location - left_location);
    } else {
      tmp = _read_self(left_location, location - left_location);
we may be able to make this read async.

I am considering it, but first let's get metrics on how often it actually happens.
Maybe telemetry can help in this case.
src/os/bluestore/Writer.cc (Outdated)

    // whether to use deferred or direct writes.
    // The assumption behind it is that having parts of write executed as
    // deferred and other parts as direct is suboptimal in any case.
    void BlueStore::Writer::_deferred_decision(uint32_t need_size)
add something like "allocates" to the function name, so that we know this is where disk allocations really happen?

_defer_or_allocate?
src/os/bluestore/Writer.cc (Outdated)

    _try_put_data_on_allocated(location, data_end, ref_end, bd, after_punch_it);
    }
    if (location != data_end) {
      uint32_t need_size = p2roundup(data_end, au_size) - p2align(location, au_size);
dumb question: what if need_size is so big that it conflicts with other writes, causing an ENOSPC?

This particular code got changed under pressure from compressed blobs.
Now need_size is calculated differently.
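The need_size expression in the diff rounds the written range out to allocation-unit boundaries. A minimal sketch of the underlying arithmetic, assuming au_size is a power of two (the functions below are local stand-ins for the helpers the code uses, written out with the classic bit tricks):

```cpp
#include <cstdint>

// Round x down / up to a power-of-two alignment.
constexpr uint32_t p2align_(uint32_t x, uint32_t align) {
  return x & ~(align - 1);                // round down
}
constexpr uint32_t p2roundup_(uint32_t x, uint32_t align) {
  return (x + align - 1) & ~(align - 1);  // round up
}

// need_size covers every allocation unit the write [location, data_end)
// touches, including the partial AUs at the front and back.
constexpr uint32_t need_size(uint32_t location, uint32_t data_end,
                             uint32_t au_size) {
  return p2roundup_(data_end, au_size) - p2align_(location, au_size);
}
```

For example, with a 64K AU, a write spanning 0x2a500..0x4b000 touches AUs 0x20000..0x50000, so need_size is 0x30000.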
New "write_lat" tracks wallclock time spent in execution of BlueStore::_write(). Signed-off-by: Adam Kupczyk <akupczyk@ibm.com>
Added _do_write_v2 function to BlueStore. Function is not integrated yet. Signed-off-by: Adam Kupczyk <akupczyk@ibm.com>
Add new conf variable. bluestore_write_v2 = true : use new _do_write_v2() function bluestore_write_v2 = false : use legacy _do_write() function This variable is read only at start time. Signed-off-by: Adam Kupczyk <akupczyk@ibm.com>
We need to scan all extents that are under proposed read / padding. It was broken and we could clear out something defined. Signed-off-by: Adam Kupczyk <akupczyk@ibm.com>
The search must go over entire range where we can have aligned mutable blobs. Signed-off-by: Adam Kupczyk <akupczyk@ibm.com>
Signed-off-by: Adam Kupczyk <akupczyk@ibm.com>
A review proposal. Signed-off-by: Adam Kupczyk <akupczyk@ibm.com>
Result of ongoing review. Signed-off-by: Adam Kupczyk <akupczyk@ibm.com>
It was not checked if necessary location is still within blob range. Signed-off-by: Adam Kupczyk <akupczyk@ibm.com>
Signed-off-by: Adam Kupczyk <akupczyk@ibm.com>
Plus making const& variables in `for`. Signed-off-by: Adam Kupczyk <akupczyk@ibm.com>
BufferSpace is now with Onode, not Blob. Modify code to adapt to this change. Signed-off-by: Adam Kupczyk <akupczyk@ibm.com>
Usually the data we put to disk is AU aligned. In weird cases like AU=16K we put less data than we allocated. _crop_allocs_to_io trims allocated extents into disk block extents to reflect real IO. Signed-off-by: Adam Kupczyk <akupczyk@ibm.com>
Added missing files to alienstore CMake list. Signed-off-by: Adam Kupczyk <akupczyk@ibm.com>
Clang fails at _construct_at(). Signed-off-by: Adam Kupczyk <akupczyk@ibm.com>
More diligent calculation algorithm for need_size. Takes into account front and back alignment. Signed-off-by: Adam Kupczyk <akupczyk@ibm.com>
1) The algorithm assumed that blob->blob_start() is aligned to csum size. This is true for blobs created by write_v2, but write_v1 can generate a blob like: begin = 0x9000, size = 0x6000, csum = 0x2000. 2) Blobs with unused set were selected even if they needed to be expanded. This is illegal since we cannot expand unused. Fixed blob selection algorithm. Signed-off-by: Adam Kupczyk <akupczyk@ibm.com>
For write_v2 create fallback to write_v1 if compression is selected. This is temporary until compression dedicated to benefit from v2 is merged. Signed-off-by: Adam Kupczyk <akupczyk@ibm.com>
Added conf.bluestore_write_v2_random. This is useful only for testing. If set, it overrides value of bluestore_write_v2 with a random true/false selection. It is useful for v1 / v2 compatibility testing. Signed-off-by: Adam Kupczyk <akupczyk@ibm.com>
Add framework for various random options for debug bluestore. Use framework to select: - write_v1 - write_v2 - write_v1 / write_v2 selected at random Signed-off-by: Adam Kupczyk <akupczyk@ibm.com>
…e_extents The #1 and #2 elements could form a continuous sequence but still not be joined:

    Expected equality of these values:
      result
        Which is: { 0x7b138000~48000, 0x883b0000~48000, 0xf0c10000~10000, 0x727b8000~38000 }
      mid
        Which is: { 0x7b138000~30000, 0x7b168000~18000, 0x883b0000~48000, 0xf0c10000~10000, 0x727b8000~38000 }

Signed-off-by: Adam Kupczyk <akupczyk@ibm.com>
1) moved stats and blobs update to Writer::do_write 2) preallocate space in Writer::_split_data 3) fixed Writer::_write_expand_l that could check one extent too many Signed-off-by: Adam Kupczyk <akupczyk@ibm.com>
4) remove Writer::shared_changed and use txc::shared_blobs directly Signed-off-by: Adam Kupczyk <akupczyk@ibm.com>
746e82c to 59857a7
jenkins test make check

1 similar comment

jenkins test make check

jenkins test make check
Refactor of the write path.
The goal is a write path in which one is no longer limited by the shape and locations of the blobs that were present before.
If deferred writes are in use, reused allocations can be placed in any blob and at any position, not only in the blobs that previously held them. As a result, compressed blobs of sizes different than before can replace the previous blobs.
#54075 Nice debugs.
#54504 New write path.
#57448 Segmented onode.
#56975 Main
#57450 Test