[RFC] test/objectstore/store_test: First pass at an OmapBench test#39976
Conversation
Additional thoughts: create multiple collections? Create/remove lots of kv pairs between repeated iteration tests?
Quick wallclock profile of thread 1 in both bluefs_buffered_io cases, repeatedly running the lower_bound iteration test: bluefs_buffered_io=false vs. bluefs_buffered_io=true
Going to retry the profile using the detach/attach method in gdbpmp, which is much slower but may provide better sample quality.
@mattbenjamin I probably should run with: which will probably help (and I think/hope is how we are still running OSDs in production). This is why I keep harping on trying to recycle memory (or objects!) for short-lived stuff, though. What a pain. That said, the thing I'm especially noticing is that we are indeed following different prefetch paths in rocksdb depending on whether buffered IO is used. With bluefs_buffered_io=false, rocksdb does the prefetch in the FilePrefetchBuffer via the BlockFetcher, but with bluefs_buffered_io=true it does it directly in NewDataBlockIterator. This is what we thought was happening based on the analysis in #38044 (comment), so it's nice to see we were right. I'm also noticing that in the direct case we are spending a ton of time in ShardedCache::Lookup, but never in ShardedCache::Insert like in the buffered case! I'm not sure what that means exactly yet, but clearly we're seeing very different behavior in rocksdb when buffered IO is enabled.
Ok, final update for tonight. I manually disabled Prefetch in bluerocksenv and saw no change at all in performance, though I haven't profiled yet. What I have noticed is that in the direct IO case, the first lower_bound iteration test is slow, then subsequent ones are fast for a little while before becoming slow again for good. This may be key to understanding what's going on.
I think the drop in performance is likely due to hitting the 4GB memory limit defined by osd_memory_target. Going to rewrite the benchmark to submit data in batches.
(force-pushed 41e44a0 to 7bf8ebe)
Ok, the slowdown in the previous results was due to in-memory data structures lying around, causing the priority cache manager to shrink the in-memory caches since it wants to keep process memory below osd_memory_target. Now we can optionally do omap_setkeys in batches to stop that from happening. The benchmark can also spread OMAP across multiple objects, which likewise reduces the submission size. Current test results for 100K objects with 100 keys each:
These results look pretty odd, with some suspiciously good numbers for filestore (and for kstore on setkeys). The speed difference for bluestore with buffered IO seems pretty apparent in the iteration tests, though. Edit: I should add that filestore has access to a lot (384GB!) of pagecache on this node. I probably need to figure out if I can use a cgroup to limit pagecache or something.
Ok, I ran some tests with different cgroup memory limits. I'm still wary of these filestore numbers. We should also keep in mind that this doesn't exactly represent how an OSD would behave in real situations (much less parallelism!), and remember that filestore completely ignores osd_memory_target and uses rocksdb much less aggressively (relying on the filesystem for more). In any event, one thing I notice in the iteration tests is that bluestore can do very well with direct IO when there is enough memory to seemingly keep all omap blocks present in the block cache. I'm back to strongly suspecting that the block cache is being polluted by something else when iteration gets slow. Since this is master it should no longer be onodes, but we still might be thrashing the cache with other stuff (allocation data seems like a strong contender). Need to investigate more. The tests cover: omap_setkeys, omap_get, seek_to_first iteration, lower_bound iteration, remove.
I'm trusting the filestore results a little more now. We do see some slowdown (but it's still surprisingly good) even at much lower cgroup memory limits. Filestore cgroup memory.limit_in_bytes tests:
Filestore really does appear to be faster in these tests, likely due to significantly lighter rocksdb usage and thus a greater ability to keep things cached with less memory (though I should say that bluestore is still far faster in many non-omap tests!)
aclamk
left a comment
Nice, simple and very functional bench test.
(force-pushed 7bf8ebe to 701a943)
Latest test results are available here: https://docs.google.com/spreadsheets/d/1l2gJ3xsmh4AdpOUFmv5oLyEofoJ9FEMbkBm9DhfqVn0/edit?usp=sharing
aclamk
left a comment
Good OMAP test.
Only cosmetic changes required.
Signed-off-by: Mark Nelson <mnelson@redhat.com>
(force-pushed 701a943 to 6ed95b8)
Before we merge this, we probably want to figure out whether users should be able to pass in their own options, and what the defaults should be. This test can take an extremely long time (1 hour+) to run in some configurations (low memory, direct IO, unsorted).
At least: controlling the number of objects (to enable short verification tests)
This pull request has been automatically marked as stale because it has not had any activity for 60 days. It will be closed if no further activity occurs for another 30 days.
@markhpc ping
Probably the next step for this is to figure out how to turn it into an independent benchmark. It talks to the objectstore directly, so it might need some rethinking if we want to run it (or something like it) against existing clusters. It wouldn't be impossible, though, to rip out the gtest part and replace it with some kind of CLI wrapper.
This pull request has been automatically marked as stale because it has not had any activity for 60 days. It will be closed if no further activity occurs for another 30 days.
we still want this!
As Neha already said:
This pull request has been automatically marked as stale because it has not had any activity for 60 days. It will be closed if no further activity occurs for another 30 days.
This pull request has been automatically closed because there has been no activity for 90 days. Please feel free to reopen this pull request (or open a new one) if the proposed change is still appropriate. Thank you for your contribution!
This is a quick and dirty first pass at a new gtest microbenchmark for omap that heavily borrows from the generic OmapSimple test but could be extended further. It creates a configurable number of random kv pairs with configurable (if you modify the test) key and value lengths. Arguably we may want something a little more sophisticated for kv pairs with common prefixes or other cases that this isn't covering.
For now, here's a quick set of tests from Officinalis using key/value sizes roughly based on what we saw when investigating for the trocksdb folks:
https://docs.google.com/spreadsheets/d/1fNFI8U-JRkU5uaRJzgg5rNxqhgRJFlDB4TsTAVsuYkk/edit?usp=sharing
1M 64-byte keys with 256-byte values (buffered off / buffered on):
Edit: A couple of quick passes make these results look relatively repeatable, though I literally just populated the OmapSimple tests with more random data and timed them; I haven't thought about the specifics much yet.
If anyone would like to give it a try, you can run something like:
Signed-off-by: Mark Nelson <mnelson@redhat.com>