crimson/os/seastore: redirect log operations to 16K-leaf-size omap#59213
Conversation
@cyx1231st #57782 (comment) Is there any reason why OMAP_LEAF_BLOCK_SIZE was previously set to 64K? This change improves 4K random write performance in my environment from 3.3K to 3.8K IOPS.
According to teuthology test cases, OmapManager needs to handle large omap values: #56114
Please also refer to git blame: #55840 The configuration is related to the maximum value (and key) size seastore needs to support.
What is the typical size of the omap value in your tests (and the omap key name)? I think this suggests that there is room to optimize the omap-tree to store small values more efficiently, but we need to find a way to store large omap values.
I think this degradation is caused by duplicating omap extents, which is in turn caused by the pglog entry associated with each write request. It looks to us that PGLog entries don't necessarily need to be kept in omaps; maybe this is an opportunity to consider some pglog-specific mechanism in SeaStore. What do you think? @cyx1231st @athanatos
I think, for now, maybe we can make the omap extent size object-specific. For example, pglog entries are kept in omaps of the pg meta object, so we can let the pg meta object use a small omap extent size, while keeping user objects' omap extent size large.
The test was done under RBD, issuing I/O using fio (4K random write workload), so several data types are stored in the omap tree. When it comes to handling the pglog, there is a limitation: a small omap leaf causes excessive splits and new allocations, while a large omap leaf leads to memory-copy overhead (e.g., duplicate_for_write). Also, as @xxhdx1985126 mentioned, we probably need another way to handle pglog entries for performance as a long-term plan: a small log-structured store could be a solution because it allows writing pg entries sequentially (which makes it easier to delete logs based on time), even though it comes at the cost of slower read performance.
Yeah, this looks to be a simple workaround because the omap trees are independent per object.
In order to decide appropriately, we need to understand the most common value size of pglog.
@myoungwon I think the b+ tree leaf size needs to be the same per b+ tree to make rebalance, split and merge easy. And sure! Please let us know what you find interesting. For the longer term, see #59213 (comment); I think this requirement is general among all the SeaStore b+ trees, because:
Actually, I'm not sure whether and how much this is worthwhile. A pglog-specific mechanism needs to specialize the ObjectStore interfaces and OSD implementations, needs to be transactional, and, if the append sizes are small per transaction, minimizing write amplification to stay block-aligned might be an issue. Reusing the omap tree is much simpler considering the above requirements, and if we ought to find a way for the b+ trees to store variable-sized values much more efficiently, that is probably a meaningful direction to prioritize. What do you think? @xxhdx1985126 @myoungwon
@cyx1231st Reusing the omap tree is the priority if we can find a way to resolve the performance issue. I was just concerned that it could turn out to be a trade-off situation where the performance problem is difficult to fix with the current data structure.
Probably. Let's look at the 2 baselines first:
This should be useful for understanding the margin of the possible performance improvements. We also need a better understanding of the real-time patterns of pglogs.
@cyx1231st Please note that the test has been done with 1 core, RBD, 4K random write, fio
The number could improve if delta-overwrite is enabled (with a filled image).
Thanks, I have further questions:
Does replication need to be more than 1 to include pglogs?
Did you mean that setting the omap leaf node size to 16KiB results in better performance than both the smaller 4KiB and the larger 64KiB due to the above reasons?
Yeah, this is my current focus. It seems there are plenty of efforts (and opportunities) to make SeaStore faster. But the possible scenarios are vast, and correctness and maintainability are important -- I'm working on making it easier to observe the internals of SeaStore under the configuration
Let me check this. I noticed an omap write with key name eversion_t, so I thought that is pg_log_entry.
Yeah, I didn't conduct the test with a 4KiB block size, but 16KiB performs better than 64KiB.
I agree. Correctness should be our priority, but we also need to focus on improving performance to demonstrate the advantages of seastore.
As for demonstrations, it is appreciated to have draft PRs (like this one) so everyone can reproduce and discuss the appropriate ways to proceed.
Sure. I'll post draft PRs.
Personally, I'd be in favor of creating a specialized structure and interface for pg logs -- alienstore could probably implement the interface using bluestore's omap as it does now. Until then, I'd also be fine with making some improvements to omap to better handle pg logs -- I think radosgw actually uses omaps to store some log-like data for its own purposes (maybe only for multisite?), in which case it probably wouldn't be wasted work.
@cyx1231st To reuse the existing omap tree and make the changes easily applicable (as a first step for optimization), I considered using two omap roots: HOT and COLD. The key idea is to insert items with frequent small read/write/remove requests into the HOT omap root, while larger, long-lasting items are inserted into the COLD root.
Seems like a workable first step. The names HOT/COLD might be inappropriate, simply because a small-leaf omap cannot store large values. How about naming them SmallLeaf and LargeLeaf to reflect that fact? I'm not sure whether the SmallLeaf size should be 16K; maybe 8K (the same as the internal node size) is better if it's more efficient at storing the most frequent values (in the case of RBD)? Anyway, it depends on the real results. Probably good to start with only putting This is a temporary/intermediate optimization, so let's try to encapsulate the special modifications in order to make them easier to replace/revert.
I'm not sure whether "integer" is a reasonable classifier; what is the omap key pattern of the pg log entries?
OK.
The key name looks like: The key name contains integers and a period. That was why I mentioned "integer".
Thanks for the pattern!
I can think of 2 different ways to manage the 2 omap trees:
OK. I'll go ahead and prototype the two omap trees as we discussed.
I found another issue while reviewing the results (https://pulpito.ceph.com/matan-2025-02-18_11:50:20-rados-ci-yingxin-seastore-omap-log-distro-default-smithi/8136040/). I've added a commit to fix it, as shown below. It seems that objectstore uses init_pg_ondisk() for testing without pool information.
Cool! Thanks for taking a look, I'll reschedule with a new image. |
https://shaman.ceph.com/builds/ceph/wip-matanb-crimson-59213 @cyx1231st, lgtm
I'll merge this week if there are no more concerns. |
To further improve performance and address the remaining tasks (#59213 (comment), #59213 (comment)), I created a draft document outlining a proposed manager structure for storing PG log entries, called the Log Store Manager. Can you take a look? https://docs.google.com/document/d/1LX5nDX91kAAsEJD3BOfPv1o0F3HeMSwuJMXFNLl-QXw/edit?usp=sharing
Can you please share this in the crimson slack channel? |
Signed-off-by: Myoungwon Oh <myoungwon.oh@samsung.com>