crimson/os/seastore: add fine-grained cache #57787
Conversation
Force-pushed 8672ada to 3b222d2
Force-pushed 3b222d2 to 8ad32ea
I've tried fio 4KB random read; the device-level read size has become 4KB after this change.
@ljx023 I think we should by-default disable …
xxhdx1985126 left a comment:
Should we add a UT case for this feature?
Force-pushed 8ad32ea to a0a0d64
Shall we delete the related code or just disable …
Force-pushed a0a0d64 to 476dd1e
The current UT in test_object_data_handler includes partial reads from a large extent present in the transaction and cache.
Disabled it by default, see 0ce3d36. Cherry-pick that commit into this PR, or adjust it if necessary.
Force-pushed 476dd1e to 7c801dc
Force-pushed c4b4f8b to fbc522f
Maybe an important question:
I think read requests may be concurrent; is this considered in the fine-grained cache?
For example: before a range is actually loaded by one read request, another read may request the same or an overlapping range. How is that handled in this PR?
Force-pushed fbc522f to 48ee917
Concurrent partial reads inside an extent are serialized by …
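The serialization pattern discussed here can be sketched with a per-extent pending flag: the first reader becomes the loader, and overlapping readers block until it finishes. This is a hypothetical, thread-based model (the names ExtentIoState, begin_or_wait, and complete are made up for illustration); the real code is seastar-future based and uses wait_io() rather than threads.

```cpp
#include <condition_variable>
#include <mutex>

// Hypothetical sketch of the per-extent io-wait used to serialize
// overlapping partial reads. Not the actual seastore implementation.
class ExtentIoState {
  std::mutex m;
  std::condition_variable cv;
  bool pending = false;  // true while a load is in flight
public:
  // Returns true if this caller became the loader; false if it
  // waited for another in-flight load to finish instead.
  bool begin_or_wait() {
    std::unique_lock<std::mutex> lk(m);
    if (!pending) {
      pending = true;
      return true;
    }
    cv.wait(lk, [this] { return !pending; });
    return false;
  }
  // The loader calls this once the device read completes.
  void complete() {
    {
      std::lock_guard<std::mutex> lk(m);
      pending = false;
    }
    cv.notify_all();
  }
};
```

The key property is that only one reader issues the device read for a given extent at a time; the others observe the loaded data after waking up.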
cyx1231st left a comment:
Probably sufficient comments for the next round of review.
Force-pushed 32cdeb3 to 150915c
Force-pushed 45c461c to 5b8334f
Force-pushed 5b8334f to 20d80ce
cyx1231st left a comment:
Should be sufficient comments for the next round of review.
Force-pushed 37d0f10 to ce279b7
Signed-off-by: Jianxin Li <jianxin1.li@intel.com>
Get the unloaded intervals in an extent and load them in parallel. Signed-off-by: Jianxin Li <jianxin1.li@intel.com>
Signed-off-by: Jianxin Li <jianxin1.li@intel.com>
Signed-off-by: Jianxin Li <jianxin1.li@intel.com>
…ault Supposing that fine-grained-cache should address the read amplification issue, disable seastore_max_data_allocation_size by default with fine-grained-cache, since seastore_full_integrity_check is disabled by default. Signed-off-by: Yingxin Cheng <yingxin.cheng@intel.com>
Currently, a partial read occupies the whole extent's io lock. We will support partial-read checks in the future. Signed-off-by: Jianxin Li <jianxin1.li@intel.com>
Force-pushed ce279b7 to 4e7c872
cyx1231st left a comment:
Still looking, #57787 (comment) isn't addressed.
CachedExtent(ceph::bufferptr &&_ptr) : ptr(std::move(_ptr)) {
  length = ptr->length();
  loaded_length = length;
  assert(length > 0);
CachedExtent(extent_len_t _length)
  : length(_length),
    buffer_space(std::in_place) {
  assert(_length > 0);
auto bp = alloc_cache_buf(p_extent->get_length());
p_extent->set_bptr(std::move(bp));
return read_extent<CachedExtent>(CachedExtentRef(p_extent));
}
This might be related to @xxhdx1985126's suggestion (#57787 (comment)) about handling the partially loaded extents outside the lba manager implementations; that seems fine.
However, I think we should add the necessary comments to record this special design, for example:
// Note, get_extent_viewable_by_trans() is designed to leak not-fully-loaded extents
// to the lba manager for simplicity, however, it also means that not-fully-loaded extents
// must be properly handled after that.
//
// Currently they are the users of PhysicalNodeMapping::get_logical_extent().
// They should be handled by Cache::read_range_in_extent() properly.
Please add it to the comment of this method if reasonable.
return fut.si_then([&t, this, e_off, e_len](auto npin) {
  // checking the lba child must be atomic with creating
  // and linking the absent child
  auto ret = get_extent_if_linked<T>(t, std::move(npin));
See the other comment: as a user of PhysicalNodeMapping::get_logical_extent(), I think TM::read_pin_by_type() is leaking the partially loaded state to its users.
extent_len_t e_len)
{
  return extent->wait_io(
  ).then([extent=std::move(extent), e_off, e_len, this]() mutable
Nit: unnecessary 2-space extra indentation for the entire code section.
  get_extent_ertr::ready_future_marker{},
  std::move(extent));
} else {
  assert(extent->is_stable());
Currently, I think we should assert this at the beginning of read_range_in_extent(), right?
) {
  assert(extent->state == CachedExtent::extent_state_t::CLEAN_PENDING ||
         extent->state == CachedExtent::extent_state_t::EXIST_CLEAN ||
         extent->state == CachedExtent::extent_state_t::CLEAN);
Let's assert(!extent->is_pending_io()); before setting the io wait.
} else {
  regions = extent->get_unloaded_ranges(offset, length);
}
return seastar::do_with(regions, [extent, this](auto &read_regions) {
Let's keep the indentation below clear by using 2 spaces per nesting level.
cyx1231st left a comment:
We probably need to validate that the size of the LRU is accurate in debug builds.
}

lru.maybe_increase_cached_size(
  *extent, extent->get_loaded_length() - old_length);
It seems wrong to adjust the LRU size and change the actual loaded_length (in extent->load_range()) separately.
I think they need to be atomic: otherwise, if the extent is removed in the middle (lru.remove_from_lru(extent)), the LRU size becomes wrong, because extent->get_loaded_length() might have changed implicitly.
How about following the original design: 1. create the buffer, 2. increase the extent size, and 3. increase the LRU size atomically at the beginning of Cache::read_extent()?
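The atomicity concern raised here can be sketched by routing both counters through a single call, so the LRU's cached size can never drift from the extent's loaded length. Extent, LRU, grow_loaded, and remove below are simplified, hypothetical stand-ins for the real CachedExtent/LRU types, not the seastore API.

```cpp
#include <cassert>
#include <cstdint>

// Minimal stand-in for an extent that may be partially loaded.
struct Extent {
  std::uint32_t loaded_length = 0;
};

// Sketch: the LRU owns the only entry point that grows an extent's
// loaded length, so its cached-size accounting is updated in the
// same step and the two counters stay consistent.
class LRU {
  std::uint64_t cached_size = 0;
public:
  void grow_loaded(Extent& e, std::uint32_t delta) {
    e.loaded_length += delta;  // extent state and ...
    cached_size += delta;      // ... LRU accounting move together
  }
  void remove(Extent& e) {
    // Safe even if a read grew the extent in the meantime, because
    // every growth also updated cached_size.
    assert(cached_size >= e.loaded_length);
    cached_size -= e.loaded_length;
  }
  std::uint64_t size() const { return cached_size; }
};
```

If loaded_length were instead mutated inside load_range() while the LRU size was adjusted later, a removal between the two updates would subtract a stale length, which is exactly the inconsistency the review comment describes.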
This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved.
Local fio randread test hits an assert failure. Backtrace unavailable, and cannot reproduce yet.
Reproduced with an info build, and got a backtrace:
[backtrace screenshot]
I'll pick up this PR.
Please refer to #60654
The fine-grained cache is introduced to improve performance and reduce read amplification when reading small chunks of data from large extents.
Overall, we add a structure named "BufferSpace" to manage the small buffers in CachedExtent, so that it can load exactly the required data from disk. The ObjectDataBlock can now be incomplete, and some paths are modified to adapt to this change.
Currently, we don't calculate the CRC while the extent is not fully loaded.
This is a refreshed version of PR50421.
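The "BufferSpace" idea described above can be sketched as an interval map of loaded chunks, from which a partial read computes the gaps it still has to fetch from disk. All names here (BufferSpace, mark_loaded, get_unloaded_ranges, range_t) are illustrative simplifications for this sketch, not the actual seastore types.

```cpp
#include <algorithm>
#include <cstdint>
#include <map>
#include <vector>

using extent_len_t = std::uint32_t;
struct range_t { extent_len_t offset, length; };

// Hypothetical sketch: track which byte ranges of an extent are in
// memory, and compute the gaps a partial read still needs to load.
class BufferSpace {
  // offset -> length of each loaded chunk (assumed disjoint here)
  std::map<extent_len_t, extent_len_t> loaded;
public:
  void mark_loaded(extent_len_t off, extent_len_t len) {
    loaded[off] = len;
  }
  // Return the sub-ranges of [offset, offset + length) not yet loaded.
  std::vector<range_t> get_unloaded_ranges(extent_len_t offset,
                                           extent_len_t length) const {
    std::vector<range_t> gaps;
    extent_len_t cur = offset;
    const extent_len_t end = offset + length;
    for (const auto& [off, len] : loaded) {
      if (off + len <= cur) continue;  // chunk entirely before cursor
      if (off >= end) break;           // chunk entirely after the range
      if (off > cur) gaps.push_back({cur, off - cur});
      cur = std::max<extent_len_t>(cur, off + len);
    }
    if (cur < end) gaps.push_back({cur, end - cur});
    return gaps;
  }
};
```

With such a structure, a 4KB read into a mostly-unloaded large extent only fetches the missing 4KB ranges, which is the read-amplification reduction the PR description claims.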