common/PriorityCache: Updated Implementation of Cache Age Binning#43299

Merged
yuriw merged 1 commit into ceph:master from markhpc:wip-age-binning-rebase-20210923
Jan 13, 2022

Conversation

@markhpc
Member

@markhpc markhpc commented Sep 24, 2021

Updated version of the age binning code. It functions and appears to improve cache behavior, but performance tests are showing little gain right now (significantly more testing is needed).

Signed-off-by: Mark Nelson mnelson@redhat.com

Checklist

  • References tracker ticket
  • Updates documentation if necessary
  • Includes tests for new functionality or reproducer for bug


@markhpc
Member Author

markhpc commented Oct 1, 2021

So ultimately this PR does what it was intended to do as far as I can tell. We do a much better job of favoring hot caches than cold caches as can be seen in various workload scenarios here:

https://docs.google.com/spreadsheets/d/1lSp2cLzYmRfPILDCyLMXciIfdf0OvSFngwXukQFXIqQ/edit?usp=sharing

The bad news is that overall it's basically a draw from a performance and CPU usage perspective when testing on our very fast Mako nodes. It's possible that further tuning may show additional differences, but one of the primary lessons from developing this code is that caching OMAP in the rocksdb block cache is much less useful than I expected. For instance, in RGW one of the only common use cases where having cached omap appears to be helpful is bucket index listing. Since that involves iterative scans, the far more important thing to have is readahead rather than a typical LRU-style cache. The one exception might be if you have many clients doing overlapping bucket index listings. Nevertheless, this may partially explain why it was far better in previous versions of Ceph to give almost all of the cache to bluestore's onode cache rather than dedicating a significant amount to the rocksdb block cache.

There are a couple of cases where cache age binning is slightly faster (primarily when we have repeated data cache reads), but also some cases where it's a little slower (primarily when we cached data on large writes that was seldom read but forced out semi-cold onodes that later had to be re-read).

At this point the primary benefit of this PR appears to be that it more aggressively trims old onodes leading to less memory fragmentation (almost like background compaction) and provides much better insight into the relative ages of different caches so that we now can better understand and communicate how we are prioritizing memory. It also gives us much finer grained control over how we prioritize different caches.

One other possibility is that this PR may show a bigger improvement with slower devices, especially with mixed workloads. On this test system, data cache misses during reads had no noticeable penalty due to the high performance of the underlying device. Only a mixed read/write workload showed improvement with cached data. At the moment I don't have easy access to a system with HDDs to test on so we have not yet tried a scenario with slower devices.

@markhpc markhpc changed the title [DNM] common/PriorityCache: Updated Implementation of Cache Age Binning common/PriorityCache: Updated Implementation of Cache Age Binning Oct 1, 2021
@markhpc markhpc force-pushed the wip-age-binning-rebase-20210923 branch from a15c204 to 164d9c7 Compare October 1, 2021 17:42
@neha-ojha
Member

@markhpc The rados/objectstore tests are failing with the following

2021-10-05T22:01:47.368 INFO:teuthology.orchestra.run.smithi078.stderr:*** Caught signal (Segmentation fault) **
2021-10-05T22:01:47.371 INFO:teuthology.orchestra.run.smithi078.stderr: in thread 7f563b969240 thread_name:ceph_test_objec
2021-10-05T22:01:47.381 INFO:teuthology.orchestra.run.smithi078.stderr: ceph version 17.0.0-8027-g46ca647a (46ca647adc2b185994da3f30968749d1167d0e3a) quincy (dev)
2021-10-05T22:01:47.382 INFO:teuthology.orchestra.run.smithi078.stderr: 1: /lib/x86_64-linux-gnu/libpthread.so.0(+0x153c0) [0x7f563d19b3c0]
2021-10-05T22:01:47.382 INFO:teuthology.orchestra.run.smithi078.stderr: 2: (rocksdb_cache::BinnedLRUCacheShard::LRU_Insert(rocksdb_cache::BinnedLRUHandle*)+0x102) [0x560407a0af22]
2021-10-05T22:01:47.382 INFO:teuthology.orchestra.run.smithi078.stderr: 3: (rocksdb_cache::BinnedLRUCacheShard::Release(rocksdb::Cache::Handle*, bool)+0x1f4) [0x560407a0b1e4]
2021-10-05T22:01:47.383 INFO:teuthology.orchestra.run.smithi078.stderr: 4: (rocksdb::CachableEntry<rocksdb::Block>::ReleaseCacheHandle(void*, void*)+0x94) [0x560407d4bcc3]
2021-10-05T22:01:47.383 INFO:teuthology.orchestra.run.smithi078.stderr: 5: (rocksdb::Cleanable::DoCleanup()+0x3b) [0x560407b34b6b]
2021-10-05T22:01:47.383 INFO:teuthology.orchestra.run.smithi078.stderr: 6: (rocksdb::Cleanable::~Cleanable()+0x1c) [0x560407d7915a]
2021-10-05T22:01:47.383 INFO:teuthology.orchestra.run.smithi078.stderr: 7: (RocksDBStore::get(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, ceph::buffer::v15_2_0::list*)+0x231) [0x5604079e4651]
2021-10-05T22:01:47.384 INFO:teuthology.orchestra.run.smithi078.stderr: 8: ceph_test_objectstore(+0xb842af) [0x5604077d72af]
2021-10-05T22:01:47.384 INFO:teuthology.orchestra.run.smithi078.stderr: 9: ceph_test_objectstore(+0xb67ca1) [0x5604077baca1]
2021-10-05T22:01:47.385 INFO:teuthology.orchestra.run.smithi078.stderr: 10: (BlueStore::ExtentMap::fault_range(KeyValueDB*, unsigned int, unsigned int)+0x217) [0x5604078207d7]
2021-10-05T22:01:47.385 INFO:teuthology.orchestra.run.smithi078.stderr: 11: (BlueStore::fsck_check_objects_shallow(BlueStore::FSCKDepth, long, boost::intrusive_ptr<BlueStore::Collection>, ghobject_t const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, ceph::buffer::v15_2_0::list const&, std::__cxx11::list<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, mempool::pool_allocator<(mempool::pool_index_t)11, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >*, std::map<boost::intrusive_ptr<BlueStore::Blob>, unsigned short, std::less<boost::intrusive_ptr<BlueStore::Blob> >, std::allocator<std::pair<boost::intrusive_ptr<BlueStore::Blob> const, unsigned short> > >*, BlueStore::FSCK_ObjectCtx const&)+0x244) [0x56040783f4c4]
2021-10-05T22:01:47.385 INFO:teuthology.orchestra.run.smithi078.stderr: 12: (BlueStore::_fsck_check_objects(BlueStore::FSCKDepth, BlueStore::FSCK_ObjectCtx&)+0x1a24) [0x560407862f84]
2021-10-05T22:01:47.385 INFO:teuthology.orchestra.run.smithi078.stderr: 13: (BlueStore::_fsck_on_open(BlueStore::FSCKDepth, bool)+0x17e6) [0x560407867ac6]
2021-10-05T22:01:47.386 INFO:teuthology.orchestra.run.smithi078.stderr: 14: (BlueStore::_fsck(BlueStore::FSCKDepth, bool)+0x698) [0x560407872828]
2021-10-05T22:01:47.386 INFO:teuthology.orchestra.run.smithi078.stderr: 15: (StoreTest::doSyntheticTest(int, unsigned long, unsigned long, unsigned long)+0xc95) [0x560407638205]

http://pulpito.front.sepia.ceph.com/yuriw-2021-10-05_14:30:24-rados-wip-yuri-testing-master-2021-10-04-1227-distro-basic-smithi/6423250/
http://pulpito.front.sepia.ceph.com/yuriw-2021-10-04_21:47:53-rados-wip-yuri-testing-master-2021-10-04-1227-distro-basic-smithi/6421719/

@markhpc markhpc force-pushed the wip-age-binning-rebase-20210923 branch from 164d9c7 to ebd9c1e Compare January 4, 2022 04:56
@markhpc
Member Author

markhpc commented Jan 4, 2022

Ok, reproduced the segfault fairly easily on my test node and traced it back to some shady unprotected accesses to the age bin data structure. Fixed those and got through a couple of rounds of ceph_test_objectstore without segfaults. I did hit a couple of failures but I don't think they are likely related.

[ OK ] ObjectStore/StoreTestOmapUpgrade.LargeLegacyToPG/2 (6056 ms)
[ RUN ] ObjectStore/StoreTestOmapUpgrade.LargeLegacyToPG/3
[ OK ] ObjectStore/StoreTestOmapUpgrade.LargeLegacyToPG/3 (0 ms)
[----------] 12 tests from ObjectStore/StoreTestOmapUpgrade (15639 ms total)

[----------] Global test environment tear-down
[==========] 484 tests from 3 test suites ran. (3788154 ms total)
[ PASSED ] 478 tests.
[ FAILED ] 6 tests, listed below:
[ FAILED ] ObjectStore/StoreTest.CompressionTest/2, where GetParam() = "bluestore"
[ FAILED ] ObjectStore/StoreTest.ZeroLengthZero/3, where GetParam() = "kstore"
[ FAILED ] ObjectStore/StoreTest.OMapTest/3, where GetParam() = "kstore"
[ FAILED ] ObjectStore/StoreTest.OMapIterator/3, where GetParam() = "kstore"
[ FAILED ] ObjectStore/StoreTestSpecificAUSize.BluestoreStatFSTest/2, where GetParam() = "bluestore"
[ FAILED ] ObjectStore/StoreTestSpecificAUSize.garbageCollection/2, where GetParam() = "bluestore"

@ljflores
Member

ljflores commented Jan 5, 2022

http://pulpito.front.sepia.ceph.com/yuriw-2022-01-04_21:52:15-rados-wip-yuri7-testing-2022-01-04-1159-distro-default-smithi/

Failures, unrelated:
https://tracker.ceph.com/issues/53723
https://tracker.ceph.com/issues/38357
https://tracker.ceph.com/issues/53294
https://tracker.ceph.com/issues/53424
https://tracker.ceph.com/issues/53680
https://tracker.ceph.com/issues/53782
https://tracker.ceph.com/issues/53781

Details:
Bug_#53723: Cephadm agent fails to report and causes a health timeout - Ceph - Orchestrator
Bug_#38357: ClsLock.TestExclusiveEphemeralStealEphemeral failed - Ceph - RADOS
Bug_#53294: rados/test.sh hangs while running LibRadosTwoPoolsPP.TierFlushDuringFlush - Ceph - RADOS
Bug_#53424: CEPHADM_DAEMON_PLACE_FAIL in orch:cephadm/mgr-nfs-upgrade/ - Ceph - Orchestrator
Bug_#53680: ERROR:tasks.rook:'waiting for service removal' reached maximum tries (90) after waiting for 900 seconds - Ceph - Orchestrator
Bug_#53782: site-packages/paramiko/transport.py: Invalid packet blocking causes unexpected end of data - Infrastructure
Bug_#53781: cephadm/test_dashboard_e2e.sh: Unable to find element cd-modal when testing on orchestrator/03-inventory.e2e-spec.ts - Ceph - Mgr - Dashboard

@ljflores
Member

ljflores commented Jan 5, 2022

jenkins retest this please

@ljflores
Member

ljflores commented Jan 5, 2022

@markhpc @aclamk Tests are passing, but this still needs a final review from Adam.

@markhpc
Member Author

markhpc commented Jan 6, 2022

@Neha @aclamk if one of you wants to approve we can go for it. I was initially nervous about the locking issues after the first round of testing, but it was more straightforward to diagnose and fix than I expected: basically an unprotected access to the age bin, plus a locked read/modify/locked write sequence that was completely unnecessary and easily replaced with a straight locked write.

We aren't likely to see significant performance wins with this which is disappointing, but it does give us (and power users) much better insight and control into memory allocation behavior.

@ljflores
Member

ljflores commented Jan 6, 2022

jenkins retest this please

@github-actions

github-actions bot commented Jan 8, 2022

This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved

Contributor

@aclamk aclamk left a comment


Nice, simple, functional code.
Cleanup still required.

@markhpc markhpc force-pushed the wip-age-binning-rebase-20210923 branch from 55a3aad to 0d8e913 Compare January 11, 2022 20:30
Signed-off-by: Mark Nelson <mnelson@redhat.com>
@markhpc markhpc force-pushed the wip-age-binning-rebase-20210923 branch from 0d8e913 to 5635183 Compare January 11, 2022 21:17
@markhpc
Member Author

markhpc commented Jan 11, 2022

@neha-ojha Ok, updated, rebased on master, compiles locally. I'll run some performance tests with #44538 applied and rerun make check just to dot the i's. Feel free to run this through teuthology as well.

@markhpc
Member Author

markhpc commented Jan 12, 2022

Local ceph_test_objectstore passed, with failures only in kstore.

@yuriw
Contributor

yuriw commented Jan 12, 2022

jenkins test api

@markhpc
Member Author

markhpc commented Jan 12, 2022

Performance numbers below are from the age binning test suite (using a base ceph version of de40e94 + PR #44538). Tests are performed in descending order as shown. Initial RBD read performance might have dropped a little, though subsequent iterations of tests show almost exactly the same numbers as in master. It was initially hoped that the age binning code would dramatically improve behavior during workload transitions from RBD to RGW and back, but it turns out that favoring the onode cache is almost always the right choice even for RGW (and we tend to do this implicitly, as rocksdb compaction appears to often invalidate data in the block cache). It's possible we may want to revisit the bin<->priority mappings, but for now the primary benefit of this PR is gaining more control and insight into how the caches are receiving and utilizing memory.

Edit: looking at the perf-test results from jenkins, we may see a slight gain in rados bench tests, but it's also pretty close.

Prefill

| test | notes |
|------|-------|
| rbd  | 4 128GB volumes using 4MB writes |
| rgw  | 4 million 4KB objects |

Trial 0

| test | master | wip-age-binning |
|------|--------|-----------------|
| 4K RBD randread  | 104628 iops | 100544 iops |
| 4K RBD randwrite | 62239 iops  | 62679 iops  |
| 4K RGW gets      | 3718 gets/s | 3731 gets/s |
| RGW Bucket List  | 155.93 s    | 158.53 s    |

Trial 1

| test | master | wip-age-binning |
|------|--------|-----------------|
| 4K RBD randread  | 98197 iops  | 97362 iops  |
| 4K RBD randwrite | 61445 iops  | 62371 iops  |
| 4K RGW gets      | 3742 gets/s | 3744 gets/s |
| RGW Bucket List  | 158.63 s    | 158.18 s    |

Trial 2

| test | master | wip-age-binning |
|------|--------|-----------------|
| 4K RBD randread  | 97601 iops  | 96828 iops  |
| 4K RBD randwrite | 59776 iops  | 59992 iops  |
| 4K RGW gets      | 3742 gets/s | 3719 gets/s |
| RGW Bucket List  | 157.80 s    | 158.09 s    |

@yuriw yuriw merged commit 10be79e into ceph:master Jan 13, 2022
@markhpc markhpc deleted the wip-age-binning-rebase-20210923 branch January 13, 2022 13:45
ifed01 added a commit to ifed01/ceph that referenced this pull request Feb 10, 2023
This should eliminate duplicate onode releases that could happen before.
Additionally onode pinning is performed during cache trimming not onode
ref count increment.

[Hopefully] fixes: https://tracker.ceph.com/issues/53002

Signed-off-by: Igor Fedotov <igor.fedotov@croit.io>
(cherry picked from commit a3057f4)

 Conflicts:
	src/os/bluestore/BlueStore.cc
	src/os/bluestore/BlueStore.h
 <lack of ceph#46036 and
  ceph#43299 backporting>
ifed01 added a commit to ifed01/ceph that referenced this pull request Feb 10, 2023
Signed-off-by: Igor Fedotov <igor.fedotov@croit.io>
(cherry picked from commit 5aaa2e0)

 Conflicts:
	src/os/bluestore/BlueStore.cc
 <no backport for ceph#43299>
ifed01 added a commit to ifed01/ceph that referenced this pull request Feb 10, 2023
Signed-off-by: Igor Fedotov <igor.fedotov@croit.io>
(cherry picked from commit 1d4abf7)

  Conflicts:
    src/os/bluestore/BlueStore.cc
  <no bacport for ceph#43299>
mkogan1 pushed a commit to mkogan1/ceph that referenced this pull request Mar 25, 2024
This should eliminate duplicate onode releases that could happen before.
Additionally onode pinning is performed during cache trimming not onode
ref count increment.

[Hopefully] fixes: https://tracker.ceph.com/issues/53002

Signed-off-by: Igor Fedotov <igor.fedotov@croit.io>
(cherry picked from commit a3057f4)

 Conflicts:
	src/os/bluestore/BlueStore.cc
	src/os/bluestore/BlueStore.h
 <lack of ceph#46036 and
  ceph#43299 backporting>

(cherry picked from commit 4a80641)

Resolves: rhbz#2218445
mkogan1 pushed a commit to mkogan1/ceph that referenced this pull request Mar 25, 2024
Signed-off-by: Igor Fedotov <igor.fedotov@croit.io>
(cherry picked from commit 5aaa2e0)

 Conflicts:
	src/os/bluestore/BlueStore.cc
 <no backport for ceph#43299>

(cherry picked from commit 61d094d)

Resolves: rhbz#2218445
mkogan1 pushed a commit to mkogan1/ceph that referenced this pull request Mar 25, 2024
Signed-off-by: Igor Fedotov <igor.fedotov@croit.io>
(cherry picked from commit 1d4abf7)

  Conflicts:
    src/os/bluestore/BlueStore.cc
  <no bacport for ceph#43299>

(cherry picked from commit e5d9afc)

Resolves: rhbz#2218445