common/PriorityCache: Updated Implementation of Cache Age Binning #43299
yuriw merged 1 commit into ceph:master
Conversation
(force-pushed from 8af5780 to a15c204)
So ultimately this PR does what it was intended to do as far as I can tell. We do a much better job of favoring hot caches over cold caches, as can be seen in various workload scenarios here: https://docs.google.com/spreadsheets/d/1lSp2cLzYmRfPILDCyLMXciIfdf0OvSFngwXukQFXIqQ/edit?usp=sharing

The bad news is that overall it's basically a draw from a performance and CPU usage perspective when testing on our very fast Mako nodes. It's possible that further tuning may show additional differences, but one of the primary lessons from developing this code is that caching OMAP in the rocksdb block cache is much less useful than I expected. For instance in RGW, one of the only common use cases where having cached omap appears to be helpful is during bucket index listing. Since that involves doing iterative scans, the far more important thing to have is readahead rather than a typical LRU-style cache. The one exception might be if you have many clients doing overlapping bucket index listings. Nevertheless, this may partially explain why it was far better in previous versions of Ceph to give almost all of the cache to bluestore's onode cache rather than dedicating a significant amount to the rocksdb block cache.

There are a couple of cases where cache age binning is slightly faster (primarily when we have repeated data cache reads), but also some cases where it's a little slower (primarily where we cached data on large writes that was seldom read but forced out semi-cold onodes that later had to be re-read). At this point the primary benefit of this PR appears to be that it more aggressively trims old onodes, leading to less memory fragmentation (almost like background compaction), and that it provides much better insight into the relative ages of different caches, so we can now better understand and communicate how we are prioritizing memory. It also gives us much finer-grained control over how we prioritize different caches.

One other possibility is that this PR may show a bigger improvement with slower devices, especially with mixed workloads. On this test system, data cache misses during reads had no noticeable penalty due to the high performance of the underlying device; only a mixed read/write workload showed improvement with cached data. At the moment I don't have easy access to a system with HDDs to test on, so we have not yet tried a scenario with slower devices.
(force-pushed from a15c204 to 164d9c7)
@markhpc The http://pulpito.front.sepia.ceph.com/yuriw-2021-10-05_14:30:24-rados-wip-yuri-testing-master-2021-10-04-1227-distro-basic-smithi/6423250/
(force-pushed from 164d9c7 to ebd9c1e)
Ok, reproduced the segfault fairly easily on my test node and traced it back to some shady unprotected accesses to the age bin data structure. Fixed those and got through a couple of rounds of ceph_test_objectstore without segfaults. I did hit a couple of failures, but I don't think they are related.
Failures, unrelated.
jenkins retest this please |
@Neha @aclamk if one of you wants to approve we can go for it. I was initially nervous about the locking issues after the first round of testing, but it was more straightforward to diagnose and fix than I was expecting. Basically an unprotected access to the age bin, along with a locked-read/modify/locked-write sequence that was completely unnecessary and easily replaced with a straight locked write. We aren't likely to see significant performance wins with this, which is disappointing, but it does give us (and power users) much better insight into and control over memory allocation behavior.
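The bug class described here can be sketched abstractly (this is an illustrative example, not the actual PriorityCache code): reading under one lock, modifying outside it, then writing under a second lock leaves a window where a concurrent writer's update is silently overwritten; folding the whole read-modify-write into one critical section — the "straight locked write" — closes it.

```cpp
#include <cstdint>
#include <mutex>
#include <vector>

// Illustrative sketch of the race described above (hypothetical names).
struct BinStats {
  std::mutex lock;
  std::vector<uint64_t> bins;

  explicit BinStats(size_t n) : bins(n, 0) {}

  // BROKEN: locked read, unlocked modify, locked write. Another thread
  // can update bins[bin] between the two critical sections; that update
  // is then lost when we write our stale computed value back.
  void add_racy(size_t bin, uint64_t v) {
    uint64_t cur;
    { std::lock_guard<std::mutex> g(lock); cur = bins[bin]; }
    cur += v;  // race window: a concurrent add here gets overwritten
    { std::lock_guard<std::mutex> g(lock); bins[bin] = cur; }
  }

  // FIXED: the entire read-modify-write is one critical section.
  void add(size_t bin, uint64_t v) {
    std::lock_guard<std::mutex> g(lock);
    bins[bin] += v;
  }
};
```

Both versions produce the same result single-threaded; only under concurrency does the split-lock version lose updates, which matches how this bug surfaced only in stress testing.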
jenkins retest this please |
This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved |
aclamk
left a comment
Nice, simple, functional code.
Cleanup still required.
(force-pushed from 55a3aad to 0d8e913)
Signed-off-by: Mark Nelson <mnelson@redhat.com>
(force-pushed from 0d8e913 to 5635183)
@neha-ojha Ok, updated, rebased on master, compiles locally. I'll run some performance tests with #44538 applied and rerun make check just to dot the i's. Feel free to run this through teuthology as well.
local ceph_test_objectstore passed with only failures in kstore. |
jenkins test api |
Performance numbers below are from the age binning test suite (using a base ceph version of de40e94 + PR #44538). Tests are performed in descending order as shown. Initial RBD read performance might have dropped a little bit, though subsequent iterations of the tests show almost exactly the same numbers as master. It was initially hoped that the age binning code would dramatically improve behavior during transitions from rbd to rgw workloads and back, but it turns out that favoring the onode cache is almost always the right choice even for RGW (and we tend to do this implicitly, as rocksdb compaction appears to often invalidate data in the block cache). It's possible we may want to revisit the bin<->priority mappings, but for now the primary benefit of this PR is gaining more control over, and insight into, how the caches are receiving and utilizing memory.

Edit: looking at the perf-test results from jenkins, we may possibly see a slight gain in rados bench tests, but it's also pretty close.

[Performance table: Prefill, Trial 0, Trial 1, Trial 2 — results not captured in this export]
This should eliminate duplicate onode releases that could happen before. Additionally, onode pinning is performed during cache trimming, not at onode ref count increment. [Hopefully] fixes: https://tracker.ceph.com/issues/53002 Signed-off-by: Igor Fedotov <igor.fedotov@croit.io> (cherry picked from commit a3057f4) Conflicts: src/os/bluestore/BlueStore.cc src/os/bluestore/BlueStore.h <lack of ceph#46036 and ceph#43299 backporting>
Signed-off-by: Igor Fedotov <igor.fedotov@croit.io> (cherry picked from commit 5aaa2e0) Conflicts: src/os/bluestore/BlueStore.cc <no backport for ceph#43299>
Signed-off-by: Igor Fedotov <igor.fedotov@croit.io> (cherry picked from commit 1d4abf7) Conflicts: src/os/bluestore/BlueStore.cc <no backport for ceph#43299>
This should eliminate duplicate onode releases that could happen before. Additionally, onode pinning is performed during cache trimming, not at onode ref count increment. [Hopefully] fixes: https://tracker.ceph.com/issues/53002 Signed-off-by: Igor Fedotov <igor.fedotov@croit.io> (cherry picked from commit a3057f4) Conflicts: src/os/bluestore/BlueStore.cc src/os/bluestore/BlueStore.h <lack of ceph#46036 and ceph#43299 backporting> (cherry picked from commit 4a80641) Resolves: rhbz#2218445
Signed-off-by: Igor Fedotov <igor.fedotov@croit.io> (cherry picked from commit 5aaa2e0) Conflicts: src/os/bluestore/BlueStore.cc <no backport for ceph#43299> (cherry picked from commit 61d094d) Resolves: rhbz#2218445
Signed-off-by: Igor Fedotov <igor.fedotov@croit.io> (cherry picked from commit 1d4abf7) Conflicts: src/os/bluestore/BlueStore.cc <no backport for ceph#43299> (cherry picked from commit e5d9afc) Resolves: rhbz#2218445
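The pinning change described in these commit messages can be sketched roughly as follows (hypothetical types, not BlueStore's actual code): rather than pinning an onode on every ref count increment, the trim loop checks at trim time whether an entry is still externally referenced and skips it if so, avoiding pin/unpin churn and the duplicate-release window on the ref path.

```cpp
#include <list>
#include <memory>

// Illustrative sketch only (hypothetical types and cache structure).
struct Onode {
  int id;
};

struct OnodeCache {
  std::list<std::shared_ptr<Onode>> lru;

  // Trim evicts only entries the cache holds exclusively:
  // use_count() == 1 means no external reference exists, so the
  // liveness check happens here, at trim time, instead of via a
  // pin flag maintained on every ref count increment.
  size_t trim(size_t target) {
    size_t evicted = 0;
    for (auto it = lru.begin();
         it != lru.end() && lru.size() > target;) {
      if (it->use_count() == 1) {
        it = lru.erase(it);
        ++evicted;
      } else {
        ++it;  // still referenced elsewhere: leave it in place
      }
    }
    return evicted;
  }
};
```

Note that `use_count()` is only a heuristic in truly concurrent code; a real implementation would need the cache lock held around the check-and-erase, which is the kind of locking subtlety this series of fixes deals with.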
Updated version of the age binning code. It functions and appears to improve cache behavior, but performance tests are showing little gain right now (significantly more testing is needed).
Signed-off-by: Mark Nelson <mnelson@redhat.com>