
os/bluestore: Recompression, part 2. New write path.#54504

Merged
aclamk merged 43 commits into ceph:main from aclamk:wip-aclamk-bs-refactor-write-path
Aug 13, 2024

Conversation

@aclamk
Contributor

@aclamk aclamk commented Nov 14, 2023

Refactor write path.
The goal is to make the write path:

  1. punch hole
  2. stream blobs into the empty space

The difference is that the write is no longer limited by the shape and locations of the blobs that were present before. If deferred writes are in use, reused allocations can be placed in any blob and at any position, not only in the blobs that held them before. As a result, compressed blobs of sizes different than before can replace the previous blobs.
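The two-phase flow above can be sketched with a toy extent map. This is a minimal illustration, not the actual BlueStore API: `ExtentMap`, `punch_hole`, and `stream_blob` are hypothetical stand-ins for the real onode extent-map machinery.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <map>
#include <utility>
#include <vector>

// Toy extent map: logical offset -> length.
using ExtentMap = std::map<uint32_t, uint32_t>;

// Phase 1: remove any part of existing extents overlapping [off, off+len),
// returning the released ranges (which a real store would recycle or free).
std::vector<std::pair<uint32_t, uint32_t>>
punch_hole(ExtentMap& m, uint32_t off, uint32_t len) {
  std::vector<std::pair<uint32_t, uint32_t>> released;
  uint32_t end = off + len;
  auto it = m.lower_bound(off);
  if (it != m.begin()) {
    auto prev = std::prev(it);
    if (prev->first + prev->second > off) it = prev;  // overlaps from the left
  }
  while (it != m.end() && it->first < end) {
    uint32_t e_off = it->first, e_end = it->first + it->second;
    it = m.erase(it);
    if (e_off < off) m[e_off] = off - e_off;                   // left remainder
    if (e_end > end) it = m.emplace(end, e_end - end).first;   // right remainder
    released.emplace_back(std::max(e_off, off),
                          std::min(e_end, end) - std::max(e_off, off));
  }
  return released;
}

// Phase 2: stream a new blob into the now-empty place.
void stream_blob(ExtentMap& m, uint32_t off, uint32_t len) {
  m[off] = len;
}
```

The point of the refactor is exactly this decoupling: phase 2 is free to choose blob shapes independently of whatever phase 1 removed.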

#54075 Nice debugs.
#54504 New write path.
#57448 Segmented onode.
#56975 Main
#57450 Test

Contribution Guidelines

  • To sign and title your commits, please refer to Submitting Patches to Ceph.

  • If you are submitting a fix for a stable branch (e.g. "quincy"), please refer to Submitting Patches to Ceph - Backports for the proper workflow.

  • When filling out the below checklist, you may click boxes directly in the GitHub web UI. When entering or editing the entire PR message in the GitHub web UI editor, you may also select a checklist item by adding an x between the brackets: [x]. Spaces and capitalization matter when checking off items this way.

Checklist

  • Tracker (select at least one)
    • References tracker ticket
    • Very recent bug; references commit where it was introduced
    • New feature (ticket optional)
    • Doc update (no ticket needed)
    • Code cleanup (no ticket needed)
  • Component impact
    • Affects Dashboard, opened tracker ticket
    • Affects Orchestrator, opened tracker ticket
    • No impact that needs to be tracked
  • Documentation (select at least one)
    • Updates relevant documentation
    • No doc update is appropriate
  • Tests (select at least one)
Show available Jenkins commands
  • jenkins retest this please
  • jenkins test classic perf
  • jenkins test crimson perf
  • jenkins test signed
  • jenkins test make check
  • jenkins test make check arm64
  • jenkins test submodules
  • jenkins test dashboard
  • jenkins test dashboard cephadm
  • jenkins test api
  • jenkins test docs
  • jenkins render docs
  • jenkins test ceph-volume all
  • jenkins test ceph-volume tox
  • jenkins test windows
  • jenkins test rook e2e

@aclamk aclamk requested a review from a team as a code owner November 14, 2023 17:21
@aclamk aclamk marked this pull request as draft November 14, 2023 17:21
@aclamk aclamk requested a review from pereman2 November 14, 2023 17:31
@aclamk aclamk force-pushed the wip-aclamk-bs-refactor-write-path branch from 9c7fb12 to fb1615f Compare January 5, 2024 09:12
@aclamk aclamk force-pushed the wip-aclamk-bs-refactor-write-path branch from fb1615f to 21b5dfb Compare January 16, 2024 11:21
@aclamk aclamk force-pushed the wip-aclamk-bs-refactor-write-path branch from 21b5dfb to 736de0c Compare January 16, 2024 11:22
@aclamk aclamk marked this pull request as ready for review January 23, 2024 14:36
_dump_extent_map<LogLevelV>(cct, o.extent_map);
}

std::ostream& operator<<(std::ostream& out, const BlueStore::Onode::printer &p)
Contributor

Shouldn't this go to BlueStore_debug.cc module?

Contributor

@ifed01 ifed01 Jan 25, 2024

More generally - doesn't it make sense to move all Bluestore's operator<< to BlueStore_debug.cc module?

Contributor Author

Moved.
I think we may move all operator<< there, but I would prefer a separate PR for that.

}
struct printer : public BlueStore::printer {
const Extent& ext;
uint8_t mode;
Contributor

shouldn't this be uint16_t to match the other printers?

Contributor Author

It only uses the first few bits, but I synchronized it to uint16_t.

blob->shared_blob->get_cache()->rm_extent();
}
}
struct printer : public BlueStore::printer {
Contributor

Could we declare BlueStore::printer as a template class implementing all this type-specific logic? Then we wouldn't need to copy-paste it for each printed struct/class.

Contributor Author

We could, but I tied the verbosity of printing to dout levels. It works very nicely and I don't want to lose it.

@rzarzynski
Contributor

Ping.

@github-actions

github-actions bot commented Feb 6, 2024

This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved

@aclamk aclamk force-pushed the wip-aclamk-bs-refactor-write-path branch from 736de0c to b0235a6 Compare February 28, 2024 20:56
@aclamk
Contributor Author

aclamk commented Feb 29, 2024

Now #54075 is a base PR for this one.

@github-actions

This pull request has been automatically marked as stale because it has not had any activity for 60 days. It will be closed if no further activity occurs for another 30 days.
If you are a maintainer or core committer, please follow-up on this pull request to identify what steps should be taken by the author to move this proposed change forward.
If you are the author of this pull request, thank you for your proposed contribution. If you believe this change is still appropriate, please ensure that any feedback has been addressed and ask for a code review.

@github-actions github-actions bot added the stale label Apr 29, 2024
/// shared_changed - set of shared blobs that are modified,
/// including the case of shared blob being empty
/// statfs_delta - delta of stats
BlueStore::extent_map_t::iterator BlueStore::_punch_hole_2(
Contributor

I really don't like the idea of having methods from non-Writer classes defined in Writer.cc.

Contributor Author

Makes sense.
New file then? It will never be inlined, so there is no penalty.

Contributor

Since this is a BlueStore method, it likely has to be in BlueStore.cc.

Contributor

We might want to split BlueStore.cc into multiple modules later though.

}

// should _maybe_expand_blob go to Blob ?
inline void BlueStore::Writer::_maybe_expand_blob(
Contributor

IMO better to move it close to the other Writer methods.

Contributor

Or indeed move to Blob class and locate accordingly

for (auto loc : disk_allocs) {
ceph_assert(initial_offset <= loc.length);
bufferlist data_chunk;
uint32_t data_to_write = std::min(data.length(), loc.length - initial_offset);
Contributor

IMO better to check data_to_write for zero explicitly before submitting an I/O, just to be sure there is no unexpected behavior due to that.

bluestore_deferred_op_t *op = bstore->_get_deferred_op(txc, data.length());
op->op = bluestore_deferred_op_t::OP_WRITE;
for (auto& e : disk_allocs) {
op->extents.emplace_back(e);
Contributor

We can resize op->extents explicitly before the loop.
And I am curious whether we need disk_allocs after the call and whether we can start using swap instead.

Contributor Author

const on disk_allocs is not useful.
Dropping. swap() will be used.

Contributor Author

Update: can't do. In some cases disk_allocs is used afterwards.

Contributor Author

But I can do op->extents = disk_allocs; instead of the loop.

if (p->length - offset) {
released_disk->emplace_back(p->offset + offset, p->length - offset);
}
released_length += p->length - offset;
Contributor

This and two next lines could be moved under the above 'if' statement.

Contributor

On the other hand - (p->length - offset) is always greater than zero so no need in the above 'if' at all...

Contributor Author

Let's optimize, even if only a bit.
On the other hand, I wonder whether the optimizer would squeeze it out anyway.

auto anchor_it = extents.begin() + (anchor - begin);
if (insert_element_cnt != 0) {
if (insert_element_cnt > 0) {
anchor_it = extents.insert(anchor_it, insert_element_cnt, bluestore_pextent_t(0, 0));
Contributor

nit: IMO better insert bluestore_pextent_t(EMPTY, 0)

// and insert hold in this place
int32_t insert_element_cnt = hold_size - (p - anchor);
auto anchor_it = extents.begin() + (anchor - begin);
if (insert_element_cnt != 0) {
Contributor

IMO more readable/elegant would be:
if (insert_element_cnt > 0) {
...
} else if (insert_element_cnt < 0) {
...
}

Contributor Author

changed it; now two independent ifs.
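The "two independent ifs" shape can be sketched on a plain std::vector. `pextent` below is a simplified stand-in for bluestore_pextent_t, and `adjust` is a hypothetical helper, not the actual code:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

struct pextent { uint64_t offset, length; };  // stand-in for bluestore_pextent_t

// Grow or shrink the extent vector at anchor_it so that exactly the needed
// number of elements exists there. Positive count inserts placeholders,
// negative count erases; zero does nothing. Two independent ifs, as in the fix.
std::vector<pextent>::iterator adjust(std::vector<pextent>& extents,
                                      std::vector<pextent>::iterator anchor_it,
                                      int32_t insert_element_cnt) {
  if (insert_element_cnt > 0) {
    anchor_it = extents.insert(anchor_it, insert_element_cnt, pextent{0, 0});
  }
  if (insert_element_cnt < 0) {
    anchor_it = extents.erase(anchor_it, anchor_it + (-insert_element_cnt));
  }
  return anchor_it;
}
```

Note that insert/erase on a std::vector invalidate iterators, which is why the anchor is recomputed from the returned iterator rather than reused.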

anchor_it = extents.erase(anchor_it, anchor_it + (-insert_element_cnt));
}
}
for (uint32_t i = 0; i < hold_size; i++) {
Contributor

Hint: one could use std::copy if hold were defined as a std::array.

Contributor Author

I looked at this function and I think it is too clever.
What do you think about refactoring it to be 2-phased:

  1. find it_begin, offset_begin; it_end, offset_end
  2. move to released, cut out the KNOWN region

Do you feel it might produce clearer code?

Comment on lines +876 to +901
for (auto q: bd) {
*_dout << " " << lof << "~" << q.disk_data.length();
lof += q.disk_data.length();
}
Contributor

skip this if not on level 25?

Contributor Author

We can make code execute only on level 25+:

dout(25);
... conditional code in new scope...
*_dout << dendl;

dout(25) << "after punch_hole_2: " << std::endl << onode->print(pp_mode) << dendl;
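The pattern described above — expensive dump code that only runs at log level 25+ — can be modeled in miniature. Real Ceph uses the dout/dendl macros and subsystem log levels; `log_level`, `last_line`, and the MY_* macros here are simplified stand-ins for illustration only:

```cpp
#include <cassert>
#include <sstream>
#include <string>

int log_level = 20;      // stand-in for the subsystem's configured level
std::string last_line;   // stand-in for the log sink

// Everything between MY_DOUT(lvl) and MY_DENDL lives inside an if-block,
// so the (potentially expensive) formatting is skipped at lower levels.
#define MY_DOUT(lvl) if (log_level >= (lvl)) { std::ostringstream _out;
#define MY_DENDL     last_line = _out.str(); }

void dump_expensive() {
  MY_DOUT(25)
    _out << "after punch_hole_2: expensive onode dump";
  MY_DENDL
}
```

This is why the author can leave the per-extent dump loop in place: at the default log level the loop body is never entered.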

uint32_t au_size = bstore->min_alloc_size;
if (au_size != bstore->block_size) {
Contributor

Why does au_size have to differ from block_size for this to work?

Contributor Author

That would be an empty run.
The code is intended to use portions of blobs that are allocated (AU=64K) but have never been written.

if (left_do_pad) {
tmp.append_zero(location - left_location);
} else {
tmp = _read_self(left_location, location - left_location);
Contributor

we may be able to make this read async.

Contributor Author

I am considering it, but first let's get metrics on how often it actually happens.
Maybe telemetry can help in this case.

// whether to use deferred or direct writes.
// The assumption behind it is that having parts of write executed as
// deferred and other parts as direct is suboptimal in any case.
void BlueStore::Writer::_deferred_decision(uint32_t need_size)
Contributor

add something like "allocates" to function name so that we know this is where disk allocations really happen?

Contributor Author

_defer_or_allocate ?

Contributor

looks good

_try_put_data_on_allocated(location, data_end, ref_end, bd, after_punch_it);
}
if (location != data_end) {
uint32_t need_size = p2roundup(data_end, au_size) - p2align(location, au_size);
Contributor

Dumb question: what if need_size is too big and it conflicts with other writes, causing an ENOSPC?

Contributor Author

This particular code got changed under pressure of compressed blobs.
Now need_size is calculated differently.
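For reference, the quoted formula rounds the written range out to allocation-unit boundaries. A minimal sketch, with the power-of-two helpers reimplemented in the spirit of Ceph's p2align/p2roundup (`need_size_for` is an illustrative wrapper, not a real BlueStore function):

```cpp
#include <cassert>
#include <cstdint>

// Power-of-two alignment helpers (align must be a power of two).
constexpr uint32_t p2align(uint32_t x, uint32_t align) {
  return x & ~(align - 1);
}
constexpr uint32_t p2roundup(uint32_t x, uint32_t align) {
  return (x + align - 1) & ~(align - 1);
}

// Allocation size needed to cover [location, data_end) in whole AUs:
// round the end up and the start down, then take the difference.
constexpr uint32_t need_size_for(uint32_t location, uint32_t data_end,
                                 uint32_t au_size) {
  return p2roundup(data_end, au_size) - p2align(location, au_size);
}
```

For example, with au_size = 0x1000, a write of [0x1800, 0x4200) needs 0x5000 - 0x1000 = 0x4000 bytes of allocation.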

aclamk added 23 commits August 7, 2024 10:55
New "write_lat" tracks wallclock time spent in execution of BlueStore::_write().

Signed-off-by: Adam Kupczyk <akupczyk@ibm.com>
Added _do_write_v2 function to BlueStore.
Function is not integrated yet.

Signed-off-by: Adam Kupczyk <akupczyk@ibm.com>
Add new conf variable.
bluestore_write_v2 = true : use new _do_write_v2() function
bluestore_write_v2 = false : use legacy _do_write() function
This variable is read only at start time.

Signed-off-by: Adam Kupczyk <akupczyk@ibm.com>
We need to scan all extents that are under proposed read / padding.
It was broken and we could clear out something defined.

Signed-off-by: Adam Kupczyk <akupczyk@ibm.com>
The search must go over entire range where we can have aligned mutable blobs.

Signed-off-by: Adam Kupczyk <akupczyk@ibm.com>
Signed-off-by: Adam Kupczyk <akupczyk@ibm.com>
A review proposal.

Signed-off-by: Adam Kupczyk <akupczyk@ibm.com>
Result of ongoing review.

Signed-off-by: Adam Kupczyk <akupczyk@ibm.com>
It was not checked if necessary location is still within blob range.

Signed-off-by: Adam Kupczyk <akupczyk@ibm.com>
Signed-off-by: Adam Kupczyk <akupczyk@ibm.com>
Plus making const& variables in `for`.

Signed-off-by: Adam Kupczyk <akupczyk@ibm.com>
BufferSpace is now with Onode, not Blob.
Modify code to adapt to this change.

Signed-off-by: Adam Kupczyk <akupczyk@ibm.com>
Usually the data we put to disk is AU aligned.
In weird cases like AU=16K we put less data than we allocated.
_crop_allocs_to_io trims allocated extents into disk block extents
to reflect real IO.

Signed-off-by: Adam Kupczyk <akupczyk@ibm.com>
Added missing files to alienstore CMake list.

Signed-off-by: Adam Kupczyk <akupczyk@ibm.com>
Clang fails at _construct_at().

Signed-off-by: Adam Kupczyk <akupczyk@ibm.com>
More diligent calculation algorithm for need_size.
Takes into account front and back alignment.

Signed-off-by: Adam Kupczyk <akupczyk@ibm.com>
1) The algorithm assumed that blob->blob_start() is aligned to the csum size.
This is true for blobs created by write_v2, but write_v1 can generate a
blob like: begin = 0x9000, size = 0x6000, csum = 0x2000.
2) Blobs with unused regions were selected even if they needed to be expanded.
This is illegal since we cannot expand unused.

Fixed blob selection algorithm.

Signed-off-by: Adam Kupczyk <akupczyk@ibm.com>
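The misalignment in the commit's example can be checked arithmetically; `csum_aligned` below is a hypothetical helper illustrating the assumption write_v2 makes, not code from the patch:

```cpp
#include <cassert>
#include <cstdint>

// True when a blob start honors the csum-chunk alignment that write_v2
// assumes. The commit's write_v1 counterexample: begin = 0x9000 with
// csum chunk 0x2000 leaves a remainder of 0x1000, so it is NOT aligned.
constexpr bool csum_aligned(uint32_t blob_start, uint32_t csum_chunk) {
  return (blob_start % csum_chunk) == 0;
}
```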
For write_v2 create fallback to write_v1 if compression is selected.
This is temporary until compression dedicated to benefit from v2 is
merged.

Signed-off-by: Adam Kupczyk <akupczyk@ibm.com>
Added conf.bluestore_write_v2_random. This is useful only for testing.
If set, it overrides value of bluestore_write_v2 with a random
true/false selection.
It is useful for v1 / v2 compatibility testing.

Signed-off-by: Adam Kupczyk <akupczyk@ibm.com>
Add framework for various random options for debug bluestore.
Use framework to select:
- write_v1
- write_v2
- write_v1 / write_v2 selected at random

Signed-off-by: Adam Kupczyk <akupczyk@ibm.com>
…e_extents

The #1 and #2 elements could form a contiguous sequence but still not
be joined:
Expected equality of these values:
  result
    Which is: { 0x7b138000~48000, 0x883b0000~48000, 0xf0c10000~10000, 0x727b8000~38000 }
  mid
    Which is: { 0x7b138000~30000, 0x7b168000~18000, 0x883b0000~48000, 0xf0c10000~10000, 0x727b8000~38000 }

Signed-off-by: Adam Kupczyk <akupczyk@ibm.com>
1) moved stats and blobs update to Writer::do_write
2) preallocate space in Writer:_split_data
3) fixed Writer::_write_expand_l that could check one extent too much

Signed-off-by: Adam Kupczyk <akupczyk@ibm.com>
4) remove Writer::shared_changed and use txc::shared_blobs directly

Signed-off-by: Adam Kupczyk <akupczyk@ibm.com>
@aclamk aclamk force-pushed the wip-aclamk-bs-refactor-write-path branch from 746e82c to 59857a7 Compare August 7, 2024 10:56
@ifed01
Contributor

ifed01 commented Aug 8, 2024

jenkins test make check

1 similar comment
@aclamk
Contributor Author

aclamk commented Aug 8, 2024

jenkins test make check

@aclamk
Contributor Author

aclamk commented Aug 12, 2024

jenkins test make check

@aclamk aclamk merged commit a787a91 into ceph:main Aug 13, 2024


4 participants