Skip to content

crimson/os/seastore: support partial read for data extents#60654

Merged
cyx1231st merged 20 commits intoceph:mainfrom
cyx1231st:wip-seastore-fine-grained-cache
Nov 28, 2024
Merged

crimson/os/seastore: support partial read for data extents#60654
cyx1231st merged 20 commits intoceph:mainfrom
cyx1231st:wip-seastore-fine-grained-cache

Conversation

@cyx1231st
Copy link
Member

@cyx1231st cyx1231st commented Nov 7, 2024

This PR supersedes #57787

The first 10 commits are cleanups and preparations, the rest commits are the implementation:

In CachedExtent, introduce a BufferSpace to maintain/index partially loaded buffers. And once fully loaded, convert to page aligned bptr.

Cache must tolerate that in LRU, extents may be partially loaded, and they may be read by concurrent transactions. This is also possible for EXIST_CLEAN extents.

This PR is expected to control/minimize read amplification to 1x when the data extent sizes are inconsistent with the read sizes.

CC @ljx023 @zhscn

Performance impacts

To roughly understand the impacts, the tests were simplified.

Test scenarios

In the same local environment, based on main (d6dfc1c):
A. baseline_limit_32K: seastore max-alloc-size=32KB
B. baseline_unlimited: seastore max-alloc-size disabled
C. fine-grained-cache: seastore max-alloc-size disabled
D. classic+bluestore: osd_op_num_shards = 32 (no further tuning applied)

Constraints

  1. single OSD allocated 32 cores (the memory usage is tuned to be around 33GB)
  2. pool/image: 128 pgs, single replication, 4GB x 32 images (totaly 128GB)
  3. 32 fio-rbd clients running in another 32 cores concurrently
    3.1. rampup: fill the image with 4KB sequential writes
    3.2. workload: for each client, 60 seconds, depth=128, 4KB random reads

Results

A. baseline_limit_32K:

  • client read IOPS: ~338K
  • disk read IOPS: ~240K; io-size: 31.1KB
  • cpu: 3200%; reactor-utilization: 81%; memory: 32.9GB;
  • seastore cache (per-shard): 31.1KB/extent, 28K extents; lru in&out: 7.7K extents/second (236MB/s)
  • seastore cache hit ratio: ~28%

B. baseline_unlimited:

  • client read IOPS: ~26K
  • disk read IOPS: ~40.4K; io-size: 413KB
  • cpu: 2520%; reactor-utilization: 19%; memory: 33.9GB;
  • seastore cache (per-shard): 415KB/extent, 2K extents; lru in&out: 0.6K extents/second (263MB/s)
  • seastore cache hit ratio: ~19.8%

C. fine-grained-cache:

  • client read IOPS: ~427K
  • disk read IOPS: ~278K; io-size: 4KB
  • cpu: 3200%; reactor-utilization: 82%; memory: 33.1GB;
  • seastore cache (per-shard): 2.4KB/extent (?somehow lower than 4KB), 333K extents; lru in&out: 8.9K extents/second (34.9MiB/s)
  • seastore cache hit ratio: ~91.8%

D. classic+bluestore:

  • client read IOPS: ~340K
  • disk read IOPS: ~250K; io-size: 4KB
  • cpu: 1522%; memory: 32.6GB;

Short/rough conclusions

In this specifc setting, after this PR:

  1. From A to C: IOPS 1.26x better, disk read size 8x smaller, hit ratio 3.3x better, 11.9x more extents cached.
  2. From B to C: IOPS 16.4x better, disk read size 103x smaller, hit ratio 4.6x better, 555x more extents cached.
  3. From D to C: IOPS 1.25x better, disk read same size (no internal information to compare).

Contribution Guidelines

  • To sign and title your commits, please refer to Submitting Patches to Ceph.

  • If you are submitting a fix for a stable branch (e.g. "quincy"), please refer to Submitting Patches to Ceph - Backports for the proper workflow.

  • When filling out the below checklist, you may click boxes directly in the GitHub web UI. When entering or editing the entire PR message in the GitHub web UI editor, you may also select a checklist item by adding an x between the brackets: [x]. Spaces and capitalization matter when checking off items this way.

Checklist

  • Tracker (select at least one)
    • References tracker ticket
    • Very recent bug; references commit where it was introduced
    • New feature (ticket optional)
    • Doc update (no ticket needed)
    • Code cleanup (no ticket needed)
  • Component impact
    • Affects Dashboard, opened tracker ticket
    • Affects Orchestrator, opened tracker ticket
    • No impact that needs to be tracked
  • Documentation (select at least one)
    • Updates relevant documentation
    • No doc update is appropriate
  • Tests (select at least one)
Show available Jenkins commands
  • jenkins retest this please
  • jenkins test classic perf
  • jenkins test crimson perf
  • jenkins test signed
  • jenkins test make check
  • jenkins test make check arm64
  • jenkins test submodules
  • jenkins test dashboard
  • jenkins test dashboard cephadm
  • jenkins test api
  • jenkins test docs
  • jenkins render docs
  • jenkins test ceph-volume all
  • jenkins test ceph-volume tox
  • jenkins test windows
  • jenkins test rook e2e

@cyx1231st cyx1231st force-pushed the wip-seastore-fine-grained-cache branch 4 times, most recently from 6b0a2fc to bf70e77 Compare November 12, 2024 06:02
@cyx1231st
Copy link
Member Author

jenkins test make check

@cyx1231st cyx1231st marked this pull request as ready for review November 13, 2024 02:45
@cyx1231st cyx1231st requested a review from a team as a code owner November 13, 2024 02:45
Copy link
Contributor

@xxhdx1985126 xxhdx1985126 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still looking.

@cyx1231st cyx1231st force-pushed the wip-seastore-fine-grained-cache branch from bf70e77 to b413c70 Compare November 15, 2024 03:27
@cyx1231st
Copy link
Member Author

Changeset: fixed mistakes causing test errors.

Copy link
Contributor

@xxhdx1985126 xxhdx1985126 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still looking.

@cyx1231st
Copy link
Member Author

Updated the performance impact analysis: #60654 (comment)

@cyx1231st
Copy link
Member Author

jenkins test docs

@github-actions
Copy link

This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved

Copy link
Contributor

@Matan-B Matan-B left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great work!

Comment on lines +108 to +109
desc: Max size in bytes that an extent can be, 0 to disable
default: 0
Copy link
Contributor

@Matan-B Matan-B Nov 17, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No reason to keep this option beside nice-to-have, right?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We still need it, otherwise the read amplification can be very large in scenarios where full extent integrity check is needed.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, CRC is still validated at the extent granularity, not 4K-grained. So when 4KB is loaded from a 4MB extent, CRC can't be checked.

@cyx1231st
Copy link
Member Author

@cyx1231st
Copy link
Member Author

cyx1231st commented Nov 18, 2024

^indirect_info_t::get_bptr() is wrong -- the extent may be partially loaded under indirection.

Fixing: see commit "implement and use maybe_indirect_extent_t::get_bl()"

@cyx1231st cyx1231st force-pushed the wip-seastore-fine-grained-cache branch 2 times, most recently from 8cb3b52 to b1b90d6 Compare November 18, 2024 08:53
cyx1231st and others added 20 commits November 28, 2024 09:32
Signed-off-by: Yingxin Cheng <yingxin.cheng@intel.com>
…ents

Signed-off-by: Yingxin Cheng <yingxin.cheng@intel.com>
Signed-off-by: Yingxin Cheng <yingxin.cheng@intel.com>
Signed-off-by: Yingxin Cheng <yingxin.cheng@intel.com>
Signed-off-by: Yingxin Cheng <yingxin.cheng@intel.com>
Mostly convert length to the hex format.

Signed-off-by: Yingxin Cheng <yingxin.cheng@intel.com>
… to get absent extent

Signed-off-by: Yingxin Cheng <yingxin.cheng@intel.com>
Make it easier for TM::read_pin() users to consume extent without
worrying about the indirections.

This basically reverts 9cdcd06

Signed-off-by: Yingxin Cheng <yingxin.cheng@intel.com>
Signed-off-by: Yingxin Cheng <yingxin.cheng@intel.com>
Signed-off-by: Yingxin Cheng <yingxin.cheng@intel.com>
…reads

Signed-off-by: Yingxin Cheng <yingxin.cheng@intel.com>
Signed-off-by: Jianxin Li <jianxin1.li@intel.com>
…ject_data_handler

Signed-off-by: Yingxin Cheng <yingxin.cheng@intel.com>
Signed-off-by: Jianxin Li <jianxin1.li@intel.com>
Signed-off-by: Yingxin Cheng <yingxin.cheng@intel.com>
Signed-off-by: Jianxin Li <jianxin1.li@intel.com>
…ault

Supposing that fine-grained-cache should address the read amplification
issue. By-default disable seastore_max_data_allocation_size with
fine-grained-cache since seastore_full_integrity_check is by-default
disabled.

Signed-off-by: Yingxin Cheng <yingxin.cheng@intel.com>
Signed-off-by: Jianxin Li <jianxin1.li@intel.com>
…se of read

Signed-off-by: Yingxin Cheng <yingxin.cheng@intel.com>
…rect_extent_t::get_bl()

Return bufferlist because the extent may be partially loaded under
indirection.

Signed-off-by: Yingxin Cheng <yingxin.cheng@intel.com>
Signed-off-by: Yingxin Cheng <yingxin.cheng@intel.com>
Signed-off-by: Yingxin Cheng <yingxin.cheng@intel.com>
… rewritting it

Signed-off-by: Yingxin Cheng <yingxin.cheng@intel.com>
@cyx1231st cyx1231st force-pushed the wip-seastore-fine-grained-cache branch from 39cca24 to e4efceb Compare November 28, 2024 01:33
@cyx1231st
Copy link
Member Author

Rebased to see if the make check can pass.

@cyx1231st
Copy link
Member Author

jenkins test make check

@cyx1231st
Copy link
Member Author

jenkins test docs

@cyx1231st
Copy link
Member Author

jenkins test make check

1 similar comment
@cyx1231st
Copy link
Member Author

jenkins test make check

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants