crimson/os/seastore: add fine-grained cache #57787
Conversation
Force-pushed 8672ada to 3b222d2
Force-pushed 3b222d2 to 8ad32ea
I've tried fio 4KB random read; the device-level read size has become 4KB after this change.
@ljx023 I think we should by-default disable …
xxhdx1985126 left a comment:
Should we add a UT case for this feature?
Force-pushed 8ad32ea to a0a0d64
Shall we delete the related code or just disable …
Force-pushed a0a0d64 to 476dd1e
The current UT in test_object_data_handler includes partial reads from a large extent present in the transaction and cache.
Disabled it by default, see 0ce3d36. Cherry-pick that commit into this PR, or adjust it if necessary.
Force-pushed 476dd1e to 7c801dc
Force-pushed c4b4f8b to fbc522f
Maybe an important question:
I think read requests may be concurrent; is this considered in the fine-grained cache?
For example: before a range is actually loaded by one read request, another read may request the same or an overlapping range. How is that handled in this PR?
Force-pushed fbc522f to 48ee917
Concurrent partial reads inside an extent are serialized by …
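The serialization pattern discussed here can be sketched with a per-extent pending flag: the first reader becomes the loader, and overlapping readers block until it finishes. This is a hypothetical, thread-based model (the names ExtentIoState, begin_or_wait, and complete are made up for illustration); the real code is seastar-future based and uses wait_io() rather than threads.

```cpp
#include <condition_variable>
#include <mutex>

// Hypothetical sketch of the per-extent io-wait used to serialize
// overlapping partial reads. Not the actual seastore implementation.
class ExtentIoState {
  std::mutex m;
  std::condition_variable cv;
  bool pending = false;  // true while a load is in flight
public:
  // Returns true if this caller became the loader; false if it
  // waited for another in-flight load to finish instead.
  bool begin_or_wait() {
    std::unique_lock<std::mutex> lk(m);
    if (!pending) {
      pending = true;
      return true;
    }
    cv.wait(lk, [this] { return !pending; });
    return false;
  }
  // The loader calls this once the device read completes.
  void complete() {
    {
      std::lock_guard<std::mutex> lk(m);
      pending = false;
    }
    cv.notify_all();
  }
};
```

The key property is that only one reader issues the device read for a given extent at a time; the others observe the loaded data after waking up.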
cyx1231st left a comment:
Probably sufficient comments for the next round of review.
Force-pushed 32cdeb3 to 150915c
Force-pushed 45c461c to 5b8334f
Force-pushed 5b8334f to 20d80ce
cyx1231st left a comment:
Should be sufficient comments for the next round of review.
Force-pushed 37d0f10 to ce279b7
Signed-off-by: Jianxin Li <jianxin1.li@intel.com>
Get the unloaded intervals in an extent and load them in parallel. Signed-off-by: Jianxin Li <jianxin1.li@intel.com>
Signed-off-by: Jianxin Li <jianxin1.li@intel.com>
Signed-off-by: Jianxin Li <jianxin1.li@intel.com>
…ault Supposing that fine-grained-cache should address the read amplification issue, disable seastore_max_data_allocation_size by default with fine-grained-cache, since seastore_full_integrity_check is disabled by default. Signed-off-by: Yingxin Cheng <yingxin.cheng@intel.com>
Currently, a partial read occupies the whole extent's io lock. We will support partial-read checks in the future. Signed-off-by: Jianxin Li <jianxin1.li@intel.com>
Force-pushed ce279b7 to 4e7c872
cyx1231st left a comment:
Still looking, #57787 (comment) isn't addressed.
CachedExtent(ceph::bufferptr &&_ptr) : ptr(std::move(_ptr)) {
  length = ptr->length();
  loaded_length = length;
  assert(length > 0);
CachedExtent(extent_len_t _length)
  : length(_length),
    buffer_space(std::in_place) {
  assert(_length > 0);
auto bp = alloc_cache_buf(p_extent->get_length());
p_extent->set_bptr(std::move(bp));
return read_extent<CachedExtent>(CachedExtentRef(p_extent));
}
This might be related to @xxhdx1985126's suggestion (#57787 (comment)) about handling the partially loaded extents outside the lba manager implementations; that seems fine.
However, I think we should add the necessary comments to record this special design, for example:
// Note, get_extent_viewable_by_trans() is designed to leak not-fully-loaded extents
// to the lba manager for simplicity, however, it also means that not-fully-loaded extents
// must be properly handled after that.
//
// Currently they are the users of PhysicalNodeMapping::get_logical_extent().
// They should be handled by Cache::read_range_in_extent() properly.
Please add it to the comment of this method if reasonable.
return fut.si_then([&t, this, e_off, e_len](auto npin) {
  // checking the lba child must be atomic with creating
  // and linking the absent child
  auto ret = get_extent_if_linked<T>(t, std::move(npin));
See the other comment: as a user of PhysicalNodeMapping::get_logical_extent(), I think TM::read_pin_by_type() is leaking the partially loaded state to its users.
extent_len_t e_len)
{
  return extent->wait_io(
  ).then([extent=std::move(extent), e_off, e_len, this]() mutable
Nit: unnecessary 2-space extra indentation for the entire code section.
  get_extent_ertr::ready_future_marker{},
  std::move(extent));
} else {
  assert(extent->is_stable());
Currently, I think we should assert this at the beginning of read_range_in_extent(), right?
) {
  assert(extent->state == CachedExtent::extent_state_t::CLEAN_PENDING ||
         extent->state == CachedExtent::extent_state_t::EXIST_CLEAN ||
         extent->state == CachedExtent::extent_state_t::CLEAN);
Let's assert(!extent->is_pending_io()); before setting the io wait.
} else {
  regions = extent->get_unloaded_ranges(offset, length);
}
return seastar::do_with(regions, [extent, this](auto &read_regions) {
Let's keep the indentation below clear by using 2 spaces per nesting level.
cyx1231st left a comment:
We probably need to validate that the size of the LRU is accurate in debug builds.
}

lru.maybe_increase_cached_size(
  *extent, extent->get_loaded_length() - old_length);
It seems wrong to adjust the LRU size and change the actual loaded_length (in extent->load_range()) separately.
I think they need to be atomic: otherwise, if the extent is removed in the middle (lru.remove_from_lru(extent)), the LRU size becomes wrong, because extent->get_loaded_length() might have changed implicitly.
How about following the original design: 1. create the buffer, 2. increase the extent size, and 3. increase the LRU size atomically at the beginning of Cache::read_extent()?
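The atomicity concern raised here can be sketched by routing both counters through a single call, so the LRU's cached size can never drift from the extent's loaded length. Extent, LRU, grow_loaded, and remove below are simplified, hypothetical stand-ins for the real CachedExtent/LRU types, not the seastore API.

```cpp
#include <cassert>
#include <cstdint>

// Minimal stand-in for an extent that may be partially loaded.
struct Extent {
  std::uint32_t loaded_length = 0;
};

// Sketch: the LRU owns the only entry point that grows an extent's
// loaded length, so its cached-size accounting is updated in the
// same step and the two counters stay consistent.
class LRU {
  std::uint64_t cached_size = 0;
public:
  void grow_loaded(Extent& e, std::uint32_t delta) {
    e.loaded_length += delta;  // extent state and ...
    cached_size += delta;      // ... LRU accounting move together
  }
  void remove(Extent& e) {
    // Safe even if a read grew the extent in the meantime, because
    // every growth also updated cached_size.
    assert(cached_size >= e.loaded_length);
    cached_size -= e.loaded_length;
  }
  std::uint64_t size() const { return cached_size; }
};
```

If loaded_length were instead mutated inside load_range() while the LRU size was adjusted later, a removal between the two updates would subtract a stale length, which is exactly the inconsistency the review comment describes.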
This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved.
Local fio randread test hits an assert failure. Backtrace unavailable, and cannot reproduce yet.
Reproduced with an info build, and got a backtrace:
[backtrace screenshot]
I'll pick up this PR.
Please refer to #60654
The fine-grained cache is introduced to improve performance and reduce read amplification when reading small chunks of data from large extents.
Overall, we add a structure named "BufferSpace" to manage the small buffers in CachedExtent, so that it can load exactly the required data from disk. The ObjectDataBlock can now be incomplete, and some paths are modified to adapt to this change.
Currently, we don't calculate the CRC while the extent is not fully loaded.
This is a refreshed version of PR50421.
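The "BufferSpace" idea described above can be sketched as an interval map of loaded chunks, from which a partial read computes the gaps it still has to fetch from disk. All names here (BufferSpace, mark_loaded, get_unloaded_ranges, range_t) are illustrative simplifications for this sketch, not the actual seastore types.

```cpp
#include <algorithm>
#include <cstdint>
#include <map>
#include <vector>

using extent_len_t = std::uint32_t;
struct range_t { extent_len_t offset, length; };

// Hypothetical sketch: track which byte ranges of an extent are in
// memory, and compute the gaps a partial read still needs to load.
class BufferSpace {
  // offset -> length of each loaded chunk (assumed disjoint here)
  std::map<extent_len_t, extent_len_t> loaded;
public:
  void mark_loaded(extent_len_t off, extent_len_t len) {
    loaded[off] = len;
  }
  // Return the sub-ranges of [offset, offset + length) not yet loaded.
  std::vector<range_t> get_unloaded_ranges(extent_len_t offset,
                                           extent_len_t length) const {
    std::vector<range_t> gaps;
    extent_len_t cur = offset;
    const extent_len_t end = offset + length;
    for (const auto& [off, len] : loaded) {
      if (off + len <= cur) continue;  // chunk entirely before cursor
      if (off >= end) break;           // chunk entirely after the range
      if (off > cur) gaps.push_back({cur, off - cur});
      cur = std::max<extent_len_t>(cur, off + len);
    }
    if (cur < end) gaps.push_back({cur, end - cur});
    return gaps;
  }
};
```

With such a structure, a 4KB read into a mostly-unloaded large extent only fetches the missing 4KB ranges, which is the read-amplification reduction the PR description claims.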