
crimson/os/seastore: add fine-grained cache #57787

Closed
ljx023 wants to merge 6 commits into ceph:main from ljx023:wip-seastore-fine-grained-cache

Conversation

@ljx023 commented May 30, 2024

The fine-grained cache is introduced to improve performance and reduce read amplification when reading small chunks of data from large extents.
Overall, we add a structure named "BufferSpace" to manage the small buffers in CachedExtent so that it can load exactly the required data from disk. The ObjectDataBlock can now be incomplete, and some paths are modified to adapt to this change.
Currently, we don't calculate the CRC while the extent is not fully loaded.
This is a refreshed version of PR #50421.
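To illustrate the idea (this is a minimal sketch, not the PR's actual API — `LoadedRanges`, `Range`, and `get_unloaded_ranges` are hypothetical names), a BufferSpace-like structure tracks which byte ranges of an extent are loaded and computes exactly the sub-ranges that still require a disk read:

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <map>
#include <vector>

// A half-open byte range [off, off + len) within an extent.
struct Range { uint64_t off; uint64_t len; };

class LoadedRanges {
  std::map<uint64_t, uint64_t> loaded_;  // offset -> length, kept non-overlapping
public:
  // Record that [off, off + len) is now in memory, coalescing neighbours.
  void mark_loaded(uint64_t off, uint64_t len) {
    auto [it0, inserted] = loaded_.try_emplace(off, len);
    if (!inserted) it0->second = std::max(it0->second, len);
    auto it = loaded_.begin();
    while (it != loaded_.end()) {
      auto next = std::next(it);
      if (next != loaded_.end() && it->first + it->second >= next->first) {
        // Adjacent or overlapping: merge next into it.
        it->second = std::max(it->second,
                              next->first + next->second - it->first);
        loaded_.erase(next);
      } else {
        ++it;
      }
    }
  }

  // Sub-ranges of [off, off + len) that still need to be read from disk.
  std::vector<Range> get_unloaded_ranges(uint64_t off, uint64_t len) const {
    std::vector<Range> out;
    uint64_t cur = off, end = off + len;
    for (const auto& [o, l] : loaded_) {
      if (o + l <= cur) continue;   // loaded region entirely before cursor
      if (o >= end) break;          // loaded region entirely after the query
      if (o > cur) out.push_back({cur, o - cur});  // gap before this region
      cur = std::max(cur, o + l);
    }
    if (cur < end) out.push_back({cur, end - cur});  // trailing gap
    return out;
  }
};
```

With such a structure, a 4KB read against a partially loaded large extent only issues disk I/O for the unloaded gaps, which is the read-amplification reduction the PR targets.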

Contribution Guidelines

  • To sign and title your commits, please refer to Submitting Patches to Ceph.

  • If you are submitting a fix for a stable branch (e.g. "quincy"), please refer to Submitting Patches to Ceph - Backports for the proper workflow.

  • When filling out the below checklist, you may click boxes directly in the GitHub web UI. When entering or editing the entire PR message in the GitHub web UI editor, you may also select a checklist item by adding an x between the brackets: [x]. Spaces and capitalization matter when checking off items this way.

Checklist

  • Tracker (select at least one)
    • References tracker ticket
    • Very recent bug; references commit where it was introduced
    • New feature (ticket optional)
    • Doc update (no ticket needed)
    • Code cleanup (no ticket needed)
  • Component impact
    • Affects Dashboard, opened tracker ticket
    • Affects Orchestrator, opened tracker ticket
    • No impact that needs to be tracked
  • Documentation (select at least one)
    • Updates relevant documentation
    • No doc update is appropriate
  • Tests (select at least one)

@ljx023 ljx023 requested a review from a team as a code owner May 30, 2024 08:55
@ljx023 (Author) commented May 30, 2024:

[image]

@ljx023 ljx023 force-pushed the wip-seastore-fine-grained-cache branch 2 times, most recently from 8672ada to 3b222d2 Compare May 31, 2024 06:32
@ljx023 ljx023 force-pushed the wip-seastore-fine-grained-cache branch from 3b222d2 to 8ad32ea Compare June 6, 2024 08:07
@cyx1231st (Member) commented:

I've tried fio 4KB random read, the device-level read size has become 4KB after this change.

@cyx1231st (Member) commented Jun 7, 2024:

@ljx023 I think we should by-default disable seastore_max_data_allocation_size in this PR according to #50421 (comment)

@xxhdx1985126 (Contributor) left a comment:

should we add a UT case for this feature?

@ljx023 ljx023 force-pushed the wip-seastore-fine-grained-cache branch from 8ad32ea to a0a0d64 Compare June 13, 2024 06:16
@ljx023 (Author) commented Jun 13, 2024:

@ljx023 I think we should by-default disable seastore_max_data_allocation_size in this PR according to #50421 (comment)

Shall we delete the related code or just disable seastore_max_data_allocation_size?

@ljx023 ljx023 force-pushed the wip-seastore-fine-grained-cache branch from a0a0d64 to 476dd1e Compare June 13, 2024 06:39
@ljx023 (Author) commented Jun 13, 2024:

should we add a UT case for this feature?

The current UT in test_object_data_handler includes partial reads from a large extent that is present in the transaction and the cache.

@cyx1231st (Member) commented:

Shall we delete the related code or just disable seastore_max_data_allocation_size?

By-default disable it, see 0ce3d36

Cherry-pick that commit into this PR and adjust it if necessary.

@ljx023 ljx023 force-pushed the wip-seastore-fine-grained-cache branch from 476dd1e to 7c801dc Compare June 19, 2024 05:32
@ljx023 ljx023 force-pushed the wip-seastore-fine-grained-cache branch 2 times, most recently from c4b4f8b to fbc522f Compare June 19, 2024 07:02
@cyx1231st cyx1231st self-requested a review June 20, 2024 02:18
@cyx1231st (Member) left a comment:

Maybe an important question:

I think read requests may be concurrent; is that considered in the fine-grained cache?

For example: before the range is actually loaded by one read request, another read may request the same or an overlapping range. How is that handled in this PR?

@ljx023 ljx023 force-pushed the wip-seastore-fine-grained-cache branch from fbc522f to 48ee917 Compare June 24, 2024 09:03
@cyx1231st (Member) commented:

I think read requests may be concurrent; is that considered in the fine-grained cache?

Concurrent partial reads inside an extent are serialized by CachedExtent::wait_io() in this PR. This can be improved later as a next step.
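As a rough single-threaded illustration of this serialization (seastore actually uses seastar futures and CachedExtent::wait_io(); `FakeExtent`, `read`, and `complete_io` below are invented names), the second overlapping read queues behind the in-flight disk read instead of issuing its own:

```cpp
#include <cassert>
#include <functional>
#include <queue>
#include <string>
#include <vector>

struct FakeExtent {
  bool io_pending = false;
  std::queue<std::function<void()>> waiters;  // readers blocked on the I/O
  std::vector<std::string> log;               // records what actually happened

  void read(const std::string& who) {
    if (io_pending) {
      // An overlapping read is already loading from disk: wait for it.
      log.push_back(who + ":queued");
      waiters.push([this, who] { log.push_back(who + ":served-from-cache"); });
      return;
    }
    io_pending = true;  // first reader issues the disk read
    log.push_back(who + ":disk-read");
  }

  void complete_io() {  // disk read finished: wake all queued readers
    io_pending = false;
    while (!waiters.empty()) {
      waiters.front()();
      waiters.pop();
    }
  }
};
```

The trade-off the reviewer notes still holds in this sketch: a waiter is blocked even if its range does not actually overlap the in-flight one, which is why finer-grained range-level waiting is left as future work.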

@cyx1231st (Member) left a comment:

Probably sufficient comments for the next round of review.

@ljx023 ljx023 force-pushed the wip-seastore-fine-grained-cache branch 5 times, most recently from 32cdeb3 to 150915c Compare June 26, 2024 12:34
@ljx023 ljx023 force-pushed the wip-seastore-fine-grained-cache branch 3 times, most recently from 45c461c to 5b8334f Compare June 27, 2024 03:52
@cyx1231st (Member) left a comment:

Still looking.

@ljx023 ljx023 force-pushed the wip-seastore-fine-grained-cache branch from 5b8334f to 20d80ce Compare June 28, 2024 02:05
@cyx1231st (Member) left a comment:

Should be sufficient comments for the next round of review.

@ljx023 ljx023 force-pushed the wip-seastore-fine-grained-cache branch 2 times, most recently from 37d0f10 to ce279b7 Compare July 5, 2024 07:52
Jianxin Li and others added 6 commits July 8, 2024 09:54
Signed-off-by: Jianxin Li <jianxin1.li@intel.com>
Get unloaded intervals in the extent and load them in parallel.

Signed-off-by: Jianxin Li <jianxin1.li@intel.com>
Signed-off-by: Jianxin Li <jianxin1.li@intel.com>
Signed-off-by: Jianxin Li <jianxin1.li@intel.com>
…ault

Fine-grained cache is supposed to address the read amplification
issue. Disable seastore_max_data_allocation_size by default with
fine-grained cache, since seastore_full_integrity_check is disabled
by default.

Signed-off-by: Yingxin Cheng <yingxin.cheng@intel.com>
Currently, a partial read will occupy the whole extent's io lock.
We will support partial read checks in the future.

Signed-off-by: Jianxin Li <jianxin1.li@intel.com>
@ljx023 ljx023 force-pushed the wip-seastore-fine-grained-cache branch from ce279b7 to 4e7c872 Compare July 8, 2024 01:55
@cyx1231st (Member) left a comment:

Still looking, #57787 (comment) isn't addressed.

CachedExtent(ceph::bufferptr &&_ptr) : ptr(std::move(_ptr)) {
length = ptr->length();
loaded_length = length;
assert(length > 0);
Member:

assert(is_fully_loaded());

CachedExtent(extent_len_t _length)
: length(_length),
buffer_space(std::in_place) {
assert(_length > 0);
Member:

assert(!is_fully_loaded());

auto bp = alloc_cache_buf(p_extent->get_length());
p_extent->set_bptr(std::move(bp));
return read_extent<CachedExtent>(CachedExtentRef(p_extent));
}
Member:

This might be related to @xxhdx1985126 's suggestion (#57787 (comment)) about handling the partially loaded extents outside the lba manager implementations, that seems fine.

However, I think we should add necessary comments to remember this special design, for example:

// Note, get_extent_viewable_by_trans() is designed to leak not-fully-loaded extents
// to the lba manager for simplicity, however, it also means that not-fully-loaded extents
// must be properly handled after that.
//
// Currently they are the users of PhysicalNodeMapping::get_logical_extent(). 
// They should be handled by Cache::read_range_in_extent() properly.

Please add it to the comment of this method if reasonable.

return fut.si_then([&t, this, e_off, e_len](auto npin) {
// checking the lba child must be atomic with creating
// and linking the absent child
auto ret = get_extent_if_linked<T>(t, std::move(npin));
Member:

See another comment, as the user of PhysicalNodeMapping::get_logical_extent(), I think TM::read_pin_by_type() is leaking the partially loaded states to its users.

extent_len_t e_len)
{
return extent->wait_io(
).then([extent=std::move(extent), e_off, e_len, this]() mutable
Member:

nit: unnecessary 2-space extra indentation for the entire code section.

get_extent_ertr::ready_future_marker{},
std::move(extent));
} else {
assert(extent->is_stable());
Member:

Currently, I think we should assert this at the beginning of read_range_in_extent(), right?

) {
assert(extent->state == CachedExtent::extent_state_t::CLEAN_PENDING ||
extent->state == CachedExtent::extent_state_t::EXIST_CLEAN ||
extent->state == CachedExtent::extent_state_t::CLEAN);
Member:

Let's assert(!extent->is_pending_io()); before setting the io wait.

} else {
regions = extent->get_unloaded_ranges(offset, length);
}
return seastar::do_with(regions, [extent, this](auto &read_regions) {
Member:

Let's keep the indentation below clear by using 2 spaces per nesting level.

@cyx1231st (Member) left a comment:

Probably we need to validate that the size of LRU is accurate in the debug build.

}

lru.maybe_increase_cached_size(
*extent, extent->get_loaded_length() - old_length);
Member:

It seems wrong to adjust the LRU size and change the actual loaded_length (in extent->load_range()) separately.

I think they need to be atomic: otherwise, if the extent is removed in the middle (lru.remove_from_lru(extent)), the LRU size will become wrong, because extent->get_loaded_length() might have been changed implicitly.

How about following the original design: 1. create the buffer, 2. increase the extent size, and 3. increase the LRU size, atomically, at the beginning of Cache::read_extent()?
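The pairing the reviewer suggests can be sketched as follows (a minimal illustration, not seastore code — `Extent`, `Lru`, `reserve_load`, and `remove` are hypothetical names; "atomic" here means no scheduling point between the paired updates in seastar's single-threaded reactor model, not hardware atomics):

```cpp
#include <cassert>
#include <cstdint>

struct Extent {
  uint64_t loaded_length = 0;  // bytes currently held in memory
};

struct Lru {
  uint64_t cached_size = 0;  // must always equal the sum of loaded lengths

  // Account for the to-be-loaded bytes up front, in one step, so the
  // counter and the extent never disagree even if the extent is removed
  // from the LRU before the asynchronous disk read completes.
  void reserve_load(Extent& e, uint64_t bytes) {
    e.loaded_length += bytes;
    cached_size += bytes;
  }

  void remove(Extent& e) {
    // Consistent because every loaded_length change was paired with a
    // matching cached_size change.
    cached_size -= e.loaded_length;
  }
};
```

The point of the design is the invariant in `remove`: since `loaded_length` and `cached_size` only ever change together, evicting an extent mid-load cannot leave the LRU accounting stale.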

@github-actions:

This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved

@cyx1231st (Member) commented:

Local fio randread test hits an assert failure.

crimson-osd: /root/ceph/src/crimson/os/seastore/cached_extent.h:645:
virtual const bufferptr& crimson::os::seastore::CachedExtent::get_bptr() const: Assertion `ptr.has_value()' failed.

Backtrace unavailable, and I cannot reproduce it yet.

@cyx1231st (Member) commented:

Reproduced with an info build and got a backtrace:

...
ceph::buffer::v15_2_0::ptr::c_str() const at /root/cyx/ceph/src/common/buffer.cc:456 (discriminator 1)
crimson::os::seastore::CachedExtent::calc_crc32c() const at /root/cyx/ceph/src/crimson/os/seastore/cached_extent.h:635 (discriminator 1)
crimson::os::seastore::Cache::read_extent<crimson::os::seastore::ObjectDataBlock>(boost::intrusive_ptr<crimson::os::seastore::ObjectDataBlock>&&, unsigned int, unsigned int)::{lambda()#1}::operator()() at /root/cyx/ceph/src/crimson/os/seastore/cache.h:1750 (discriminator 1)
crimson::errorator<crimson::unthrowable_wrapper<std::error_code const&, crimson::ec<5> > >::_future<crimson::errorated_future_marker<boost::intrusive_ptr<crimson::os::seastore::ObjectDataBlock> > > seastar::futurize<crimson::errorator<crimson::unthrowable_wrapper<std::error_code const&, crimson::ec<5> > >::_future<crimson::errorated_future_marker<boost::intrusive_ptr<crimson::os::seastore::ObjectDataBlock> > > >::invoke<crimson::os::seastore::Cache::read_extent<crimson::os::seastore::ObjectDataBlock>(boost::intrusive_ptr<crimson::os::seastore::ObjectDataBlock>&&, unsigned int, unsigned int)::{lambda()#1}>(crimson::os::seastore::Cache::read_extent<crimson::os::seastore::ObjectDataBlock>(boost::intrusive_ptr<crimson::os::seastore::ObjectDataBlock>&&, unsigned int, unsigned int)::{lambda()#1}&&) at /root/cyx/ceph/src/crimson/common/errorator.h:1399
(inlined by) auto seastar::futurize_invoke<crimson::os::seastore::Cache::read_extent<crimson::os::seastore::ObjectDataBlock>(boost::intrusive_ptr<crimson::os::seastore::ObjectDataBlock>&&, unsigned int, unsigned int)::{lambda()#1}>(crimson::os::seastore::Cache::read_extent<crimson::os::seastore::ObjectDataBlock>(boost::intrusive_ptr<crimson::os::seastore::ObjectDataBlock>&&, unsigned int, unsigned int)::{lambda()#1}&&) at /root/cyx/ceph/src/seastar/include/seastar/core/future.hh:2065
...

@cyx1231st (Member) commented:

I'll pick up this PR.

@cyx1231st (Member) commented:

Please refer to #60654

@cyx1231st cyx1231st closed this Nov 7, 2024