
crimson/os/seastore: redirect log operations to 16K-leaf-size omap #59213

Merged
cyx1231st merged 18 commits into ceph:main from myoungwon:wip-dec-omap-leaf-size
Feb 28, 2025

Conversation

@myoungwon
Member

Signed-off-by: Myoungwon Oh &lt;myoungwon.oh@samsung.com&gt;


@myoungwon myoungwon requested a review from a team as a code owner August 14, 2024 08:17
@myoungwon
Member Author

myoungwon commented Aug 14, 2024

@cyx1231st I'm trying to figure out why the number is lower than the previous result.

@cyx1231st #57782 (comment) Is there any reason why OMAP_LEAF_BLOCK_SIZE was previously set to 64K? This change improves 4K random write performance in my environment to 3.8K IOPS (3.3 K --> 3.8 K)

@xxhdx1985126
Contributor

@cyx1231st I'm trying to figure out why the number is lower than the previous result.

@cyx1231st #57782 (comment) Is there any reason why OMAP_LEAF_BLOCK_SIZE was previously set to 64K? This change improves 4K random write performance in my environment to 3.8K IOPS (3.3 K --> 3.8 K)

According to teuthology test cases, OmapManager needs to handle large omap values: #56114

@cyx1231st
Member

Is there any reason why OMAP_LEAF_BLOCK_SIZE was previously set to 64K?

Please also refer to git blame: #55840

The configuration is related to the maximum value (and key) size seastore needs to support.

This change improves 4K random write performance in my environment to 3.8K IOPS (3.3 K --> 3.8 K)

What is the typical size of the omap value in your tests (and of the omap key name)? I think this suggests that there is room to optimize the omap-tree to store small values more efficiently, but we need to find a way to store large omap values.

@xxhdx1985126
Contributor

@cyx1231st I'm trying to figure out why the number is lower than the previous result.

@cyx1231st #57782 (comment) Is there any reason why OMAP_LEAF_BLOCK_SIZE was previously set to 64K? This change improves 4K random write performance in my environment to 3.8K IOPS (3.3 K --> 3.8 K)

I think this degradation is caused by duplicating omap extents which is further caused by the pglog entry associated with each write request.

It looks to us that PGLog entries don't necessarily need to be kept in omaps; maybe this is an opportunity to consider some pglog-specific mechanism in SeaStore. What do you think? @cyx1231st @athanatos

@xxhdx1985126
Contributor

Is there any reason why OMAP_LEAF_BLOCK_SIZE was previously set to 64K?

Please also refer to git blame: #55840

The configuration is related to the maximum value (and key) size seastore needs to support.

This change improves 4K random write performance in my environment to 3.8K IOPS (3.3 K --> 3.8 K)

What is the typical size of the omap value in your tests (and of the omap key name)? I think this suggests that there is room to optimize the omap-tree to store small values more efficiently, but we need to find a way to store large omap values.

I think, for now, maybe we can make the omap extent size object-specific. For example, pglog entries are kept in omaps of the pg meta object, we can let pg meta object use small omap extent size, while keep user objects' omap extent size large.

@myoungwon
Member Author

What is the typical size of the omap value in your tests (and of the omap key name)? I think this suggests that there is room to optimize the omap-tree to store small values more efficiently, but we need to find a way to store large omap values.

The test was done under RBD, issuing I/O using fio (a 4K random write workload). So, the data types stored in the omap tree include the snapset (with key name snapset) and the pglog (with eversion_t-style key names, which generate many leaf nodes). It seems that storing the pglog is problematic for performance based on my observation; I have observed a significant performance improvement without storing the pglog.

When it comes to handling the pglog, there is a limitation: a small omap leaf causes excessive splits and new allocations, while a large omap leaf leads to memory-copy overhead (e.g., duplicate_for_write).
Given this, I think the reasonable approach, for now, is to differentiate the omap leaf size depending on the data type. Can I take on this implementation to check for other side effects while testing performance, if you agree with this?

Also, as @xxhdx1985126 mentioned, we probably need another way to handle pglog entries for performance as a long-term plan: a small log-structured store could be a solution, because it allows writing pg entries sequentially (which makes it easier to delete logs based on time), even though that comes at the cost of slower read performance.

@cyx1231st
Member

I think, for now, maybe we can make the omap extent size object-specific. For example, pglog entries are kept in omaps of the pg meta object, we can let pg meta object use small omap extent size, while keep user objects' omap extent size large.

Yeah, this looks to be a simple workaround because the omap trees are independent per object.

The test was done under RBD, issuing I/O using fio (a 4K random write workload). So, the data types stored in the omap tree include the snapset (with key name snapset) and the pglog (with eversion_t-style key names, which generate many leaf nodes). It seems that storing the pglog is problematic for performance based on my observation; I have observed a significant performance improvement without storing the pglog.

In order to decide appropriately, we need to understand the most common value size of pglog.

Given this, I think the reasonable approach, for now, is to differentiate the omap leaf size depending on the data types. Can I take on this implementation to check other side effects while testing performance, if you agree on this?

@myoungwon I think the b+ tree leaf size needs to be the same within a b+ tree to make rebalance, split and merge easy.

And sure! Please let us know what you find interesting.

For longer term, see #59213 (comment), I think this requirement is general among all the SeaStore b+ trees, because:

  • the lba tree might need to store block-level crcs, which will be variable-sized;
  • the onode tree value needs to store variable-sized attrs in-place, which can be either large or small;

@cyx1231st
Member

It looks to us that PGLog entries don't necessarily need to be kept in omaps; maybe this is an opportunity to consider some pglog-specific mechanism in SeaStore. What do you think? @cyx1231st @athanatos

When it comes to handling the pglog, there is a limitation: a small-omap-leaf causes excessive split and new allocation, while a large-omap-leaf leads to memory copy overhead (e.g., duplicate_for_write).

Actually, I'm not sure whether, and how much, this is worthwhile. The specific mechanism would need to specialize the ObjectStore interfaces and OSD implementations, would need to be transactional, and, if the append sizes are small per transaction, minimizing write amplification while staying block-aligned might be an issue.

Reusing the omap tree is much simpler considering the above requirements, and since we ought to find a way for the b+ trees to store variable-sized values much more efficiently anyway, it is probably a meaningful direction to prioritize.

What do you think? @xxhdx1985126 @myoungwon

@myoungwon
Member Author

@cyx1231st Reusing the omap tree is the priority if we can find a way to resolve the performance issue. I was just concerned that it would turn out to be a trade-off situation where the performance problem is difficult to fix with the current data structure.

@cyx1231st
Member

cyx1231st commented Aug 15, 2024

Probably. Let's look at the 2 baselines first:

  1. Lowerbound: the performance after adjusting the omap leaf node size of the pg-meta object appropriately.
  2. Upperbound: the performance with the pglog excluded.

This should be useful for understanding the margin of the possible performance improvements.
(It is probably worth observing longer, until the omap tree nodes become relatively stable from an empty state.)

We also need a better understanding of the real-time patterns of pglogs.

@myoungwon
Member Author

myoungwon commented Aug 15, 2024

@cyx1231st Please note that the test has been done with 1 core, RBD, 4K random write, fio

  1. Lowerbound: 3810 IOPS
  2. Upperbound: 4388 IOPS

The number could improve if delta-overwrite is enabled (with a filled image).
Anyway, seastore needs to achieve more than 4500 IOPS to be competitive, compared to other stores (e.g., bluestore + crimson, bluestore + classic osd).

@cyx1231st
Member

Thanks, I have further questions:

Please note that the test has been done with 1 core, RBD, 4K random write, fio

Does replication need to be more than 1 to include pglogs?

When it comes to handling the pglog, there is a limitation: a small-omap-leaf causes excessive split and new allocation, while a large-omap-leaf leads to memory copy overhead (e.g., duplicate_for_write).

Did you mean that setting the omap leaf node size to 16KiB results in better performance than both the smaller 4KiB and the larger 64KiB, due to the above reasons?

Anyway, seastore needs to achieve more than 4500 IOPS to be competitive, compared to other stores (e.g., bluestore + crimson, bluestore + classic osd).

Yeah, this is my current focus. It seems there are plenty of efforts (and opportunities) to make SeaStore faster. But the possible scenarios are vast, and correctness and maintainability are important, so I'm working on making it easier to observe the internals of SeaStore under the configuration crimson_osd_stat_interval.

@myoungwon
Member Author

Thanks, I have further questions:

Please note that the test has been done with 1 core, RBD, 4K random write, fio

Does replication need to be more than 1 to include pglogs?

Let me check this. I noticed an omap write with an eversion_t-style key name, so I thought it was a pg_log_entry.

When it comes to handling the pglog, there is a limitation: a small-omap-leaf causes excessive split and new allocation, while a large-omap-leaf leads to memory copy overhead (e.g., duplicate_for_write).

Did you mean that setting the omap leaf node size to 16KiB results in better performance than both the smaller 4KiB and the larger 64KiB, due to the above reasons?

Yeah. I didn't conduct the test with a 4KiB leaf size, but 16KiB's performance is better than 64KiB's.
I'll update both results here.

Anyway, seastore needs to achieve more than 4500 IOPS to be competitive, compared to other stores (e.g., bluestore + crimson, bluestore + classic osd).

Yeah, this is my current focus. It seems there are plenty of efforts (and opportunities) to make SeaStore faster. But the possible scenarios are vast, and correctness and maintainability are important, so I'm working on making it easier to observe the internals of SeaStore under the configuration crimson_osd_stat_interval.

I agree. Correctness should be our priority, but we also need to focus on improving performance to demonstrate the advantage of seastore.

@cyx1231st
Member

As for demonstrations, it is appreciated to have draft PRs (like this one) so everyone can reproduce the results and discuss the appropriate way to proceed.

@myoungwon
Member Author

Sure. I'll post draft PRs.

@athanatos
Contributor

athanatos commented Aug 21, 2024

Personally, I'd be in favor of creating a specialized structure and interface for pg logs -- alienstore could probably implement the interface using bluestore's omap as it does now. Until then, I'd also be fine with making some improvements to omap to better handle pg logs -- I think radosgw actually uses omaps to store some log-like data for its own purposes (maybe only for multisite?), in which case it probably wouldn't be wasted work.

@myoungwon
Member Author

@cyx1231st To reuse the existing map tree and make the changes easily applicable (as a first step for optimization), I considered using two omap roots: HOT and COLD. The key idea is to insert items that are frequent small read/write/remove requests into the HOT omap root, while larger, long-lasting items are inserted into the COLD root.
This can be achieved by differentiating leaf node sizes (16K and 64K), which is relatively easy compared to other approaches. However, the challenge that arises from this change is how to handle range operations, such as omap_list(). A simple approach might be to choose the tree for a range operation based on the key value. For example, if the key name is an integer, we could consider it pg_log_entry_t or some other kind of hot data. However, I'm not sure this is an acceptable change. What do you think?

@cyx1231st
Member

Seems like a working first step. The names HOT/COLD might be inappropriate, simply because the small-leaf omap cannot store large values. How about naming them SmallLeaf and LargeLeaf to reflect that fact?

I'm not sure whether the SmallLeaf size should be 16K; probably 8K (the same as the internal node size) is better, if it's more efficient for storing the most frequent values (in the case of RBD)? Anyway, it depends on the real results.

Probably good to start with only putting pg_log_entry_t into the SmallLeaf omap tree, and ceph_assert that these value sizes must be small enough.

This is a temporary/intermediate optimization, so let's try to encapsulate the special modifications in order to make them easier to replace or revert.

For example, if the key name is an integer, we could consider it as pg_log_entry_t or some kind of hot data.

I'm not sure whether "integer" is a reasonable classifier, what is the omap key pattern of the pg log entries?

@myoungwon
Member Author

Seems like a working first step. The names HOT/COLD might be inappropriate, simply because the small-leaf omap cannot store large values. How about naming them SmallLeaf and LargeLeaf to reflect that fact?

OK.

I'm not sure whether the SmallLeaf size should be 16K; probably 8K (the same as the internal node size) is better, if it's more efficient for storing the most frequent values (in the case of RBD)? Anyway, it depends on the real results.

Probably good to start with only putting pg_log_entry_t into the SmallLeaf omap tree, and ceph_assert that these value sizes must be small enough.

This is a temporary/intermediate optimization, so let's try to encapsulate the special modifications in order to make them easier to replace or revert.

For example, if the key name is an integer, we could consider it as pg_log_entry_t or some kind of hot data.

I'm not sure whether "integer" is a reasonable classifier, what is the omap key pattern of the pg log entries?

The key name looks like:
BtreeOMapManager::omap_set_key: 0000000013.00000000000000000038 ->
This is generated by the code below.

    inline void get_key_name(char* key) const {
      // Below is equivalent of sprintf("%010u.%020llu");
      key[31] = 0;
      ritoa<uint64_t, 10, 20>(version, key + 31);
      key[10] = '.';
      ritoa<uint32_t, 10, 10>(epoch, key + 10);
    }

The key name contains integers and a period. That is why I mentioned "integer".

@cyx1231st
Member

Thanks for the pattern!

The key name contains integers and a period. That is why I mentioned "integer".

I can think of 2 different ways to manage the 2 omap trees:

  1. Define an allowlist to redirect certain (pattern of) keys into the SmallLeaf tree, and ceph_assert that the value size must be small enough.
  2. Generally, upon insert, select the tree only by the value size -- but this means that query / update / remove / insert must access both trees.

@myoungwon
Member Author

OK. I'll go ahead and prototype the two omap trees as we discussed.

@myoungwon myoungwon force-pushed the wip-dec-omap-leaf-size branch from f8c33b1 to a2df840 Compare October 3, 2024 09:17
@github-actions github-actions bot added the tests label Oct 3, 2024
@myoungwon myoungwon force-pushed the wip-dec-omap-leaf-size branch 4 times, most recently from d523134 to 50151da Compare October 4, 2024 04:45
myoungwon and others added 15 commits February 20, 2025 12:27
@myoungwon myoungwon force-pushed the wip-dec-omap-leaf-size branch from 34876dc to 5026c1d Compare February 20, 2025 12:29
@myoungwon
Member Author

I found another issue while reviewing the results (https://pulpito.ceph.com/matan-2025-02-18_11:50:20-rados-ci-yingxin-seastore-omap-log-distro-default-smithi/8136040/). I've added a commit to fix it, as shown below. It seems that objectstore uses init_pg_ondisk() for testing without pool information.

 - if (pool->is_crimson()) {
 + if (pool && pool->is_crimson()) {

@Matan-B
Contributor

Matan-B commented Feb 20, 2025

I found another issue while reviewing the results (https://pulpito.ceph.com/matan-2025-02-18_11:50:20-rados-ci-yingxin-seastore-omap-log-distro-default-smithi/8136040/). I've added a commit to fix it, as shown below. It seems that objectstore uses init_pg_ondisk() for testing without pool information.

 - if (pool->is_crimson()) {
 + if (pool && pool->is_crimson()) {

Cool! Thanks for taking a look, I'll reschedule with a new image.
@cyx1231st, I think that Crimson side should be ok and there's no need to retest

@Matan-B
Contributor

Matan-B commented Feb 24, 2025

I found another issue while reviewing the results (https://pulpito.ceph.com/matan-2025-02-18_11:50:20-rados-ci-yingxin-seastore-omap-log-distro-default-smithi/8136040/). I've added a commit to fix it, as shown below. It seems that objectstore uses init_pg_ondisk() for testing without pool information.

 - if (pool->is_crimson()) {
 + if (pool && pool->is_crimson()) {

Cool! Thanks for taking a look, I'll reschedule with a new image. @cyx1231st, I think that Crimson side should be ok and there's no need to retest

https://shaman.ceph.com/builds/ceph/wip-matanb-crimson-59213

https://pulpito.ceph.com/matan-2025-02-24_12:31:07-rados-wip-matanb-crimson-59213-distro-default-smithi/

8149875, 8149963 - target is busy https://tracker.ceph.com/issues/43591 
8149877 - cls_rgw_reshard_get_ret https://tracker.ceph.com/issues/69009
8149923 - BLUESTORE_SLOW_OP_ALERT https://tracker.ceph.com/issues/68337
8149955 - erasure_code_plugin_exists https://tracker.ceph.com/issues/69758
8149894 - infra

@cyx1231st, lgtm

@cyx1231st
Member

I'll merge this week if there are no more concerns.

@cyx1231st cyx1231st merged commit 326f877 into ceph:main Feb 28, 2025
11 checks passed
@myoungwon
Member Author

myoungwon commented May 8, 2025

To further improve performance and address the remaining tasks (#59213 (comment), #59213 (comment)), I created a draft document outlining a proposed manager structure for storing PG log entries, called the Log Store Manager. Can you take a look?

https://docs.google.com/document/d/1LX5nDX91kAAsEJD3BOfPv1o0F3HeMSwuJMXFNLl-QXw/edit?usp=sharing

@Matan-B
Contributor

Matan-B commented May 8, 2025

To further improve performance and address the remaining tasks, I created a draft document outlining a proposed manager structure for storing PG log entries, called the Log Store Manager. Can you take a look?

https://docs.google.com/document/d/1LX5nDX91kAAsEJD3BOfPv1o0F3HeMSwuJMXFNLl-QXw/edit?usp=sharing

Can you please share this in the crimson slack channel?


Projects

Status: Merged (Pre Tentacle Freeze)
