Skip to content

os/bluestore: set upper and lower bounds on rocksdb omap iterators#45904

Merged
yuriw merged 3 commits intoceph:masterfrom
cfsnyder:fix_rocksdb_iter_perf
Apr 25, 2022
Merged

os/bluestore: set upper and lower bounds on rocksdb omap iterators#45904
yuriw merged 3 commits intoceph:masterfrom
cfsnyder:fix_rocksdb_iter_perf

Conversation

@cfsnyder
Copy link
Contributor

@cfsnyder cfsnyder commented Apr 13, 2022

Limits RocksDB omap Seek operations to the relevant key range of the object's omap.
This prevents RocksDB from unnecessarily iterating over delete range tombstones in
irrelevant omap CF shards. Avoids extreme performance degradation commonly caused
by tombstones generated from RGW bucket resharding cleanup. Also prefer CFIteratorImpl
over ShardMergeIteratorImpl when we can determine that all keys within specified
IteratorBounds must be in a single CF.

Fixes: https://tracker.ceph.com/issues/55324
Signed-off-by: Cory Snyder csnyder@iland.com

Checklist

  • Tracker (select at least one)
    • References tracker ticket
    • Very recent bug; references commit where it was introduced
    • New feature (ticket optional)
    • Doc update (no ticket needed)
    • Code cleanup (no ticket needed)
  • Component impact
    • Affects Dashboard, opened tracker ticket
    • Affects Orchestrator, opened tracker ticket
    • No impact that needs to be tracked
  • Documentation (select at least one)
    • Updates relevant documentation
    • No doc update is appropriate
  • Tests (select at least one)

@cfsnyder
Copy link
Contributor Author

jenkins test api

@cfsnyder cfsnyder marked this pull request as ready for review April 15, 2022 07:08
@cfsnyder cfsnyder requested a review from cbodley April 15, 2022 07:08
@cfsnyder
Copy link
Contributor Author

jenkins test api

Comment on lines +664 to +667
* If the specified IteratorBounds arg has both an upper and a lower bound defined, and they have equal placement hash
* strings, we can be sure that the entire iteration range exists in a single CF. In that case, we return the relevant
* CF handle. In all other cases, we return a nullptr to indicate that the specified bounds cannot necessarily be mapped
* to a single CF.
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is indeed elegant! Are there specific things that users can do in their config to ensure this is the case most of the time?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is always the case for omap, as long as the bluestore_rocksdb_cfs is set to the default value (or is using an alternative hash constraint that still maps all keys for an individual omap object to the same CF).

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, great - just wondering if it's worth throwing notes in the config docs so that users know what the optimized path is.

Comment on lines +664 to +667
* If the specified IteratorBounds arg has both an upper and a lower bound defined, and they have equal placement hash
* strings, we can be sure that the entire iteration range exists in a single CF. In that case, we return the relevant
* CF handle. In all other cases, we return a nullptr to indicate that the specified bounds cannot necessarily be mapped
* to a single CF.
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is indeed elegant! Are there specific things that users can do in their config to ensure this is the case most of the time?

Copy link
Contributor

@cbodley cbodley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@cfsnyder cfsnyder force-pushed the fix_rocksdb_iter_perf branch from 5cce1d3 to 0839807 Compare April 18, 2022 15:43
Copy link
Contributor

@cbodley cbodley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

otherwise looks great. i'll defer to bluestore folks for final approval

Limits RocksDB omap Seek operations to the relevant key range of the object's omap.
This prevents RocksDB from unnecessarily iterating over delete range tombstones in
irrelevant omap CF shards. Avoids extreme performance degradation commonly caused
by tombstones generated from RGW bucket resharding cleanup. Also prefer CFIteratorImpl
over ShardMergeIteratorImpl when we can determine that all keys within specified
IteratorBounds must be in a single CF.

Fixes: https://tracker.ceph.com/issues/55324
Signed-off-by: Cory Snyder <csnyder@iland.com>
@cfsnyder cfsnyder force-pushed the fix_rocksdb_iter_perf branch from 0839807 to 850c16c Compare April 18, 2022 16:34
@cfsnyder
Copy link
Contributor Author

@cbodley should I tag someone specific from the bluestore side for review?

@ifed01
Copy link
Contributor

ifed01 commented Apr 21, 2022

Generally this looks good to me. I have a concern though: may be it's better to make this feature's enablement configurable given that we're planning to backport it ASAP and hence can provide quite a limited testing.
@cfsnyder @neha-ojha @aclamk @markhpc - what do you think?

@markhpc
Copy link
Member

markhpc commented Apr 21, 2022

@ifed01 It's probably a valid concern given how touchy this stuff can be. We've been bitten by strange corner cases in the past.

Copy link
Member

@markhpc markhpc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

superficially looks good to me, I trust @ifed01 and @cbodley's take on the code. Once the option to disable the behavior gets added (default on) let's get this tested and merged.

@ljflores
Copy link
Member

@yuriw ^

rocksdb::ColumnFamilyHandle *get_key_cf(const prefix_shards& shards, const char* key, const size_t keylen);
rocksdb::ColumnFamilyHandle *get_cf_handle(const std::string& prefix, const std::string& key);
rocksdb::ColumnFamilyHandle *get_cf_handle(const std::string& prefix, const char* key, size_t keylen);
rocksdb::ColumnFamilyHandle *get_cf_handle(const std::string& prefix, const IteratorBounds& bounds);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Love that this new function is made similar to existing ones.

…isabled

Add osd_rocksdb_iterator_bounds_enabled config option to allow rocksdb iterator bounds to be disabled.
Also includes minor refactoring to shorten code associated with IteratorBounds initialization in bluestore.

Signed-off-by: Cory Snyder <csnyder@iland.com>
@cfsnyder cfsnyder force-pushed the fix_rocksdb_iter_perf branch from ab4141c to ca3ccd9 Compare April 21, 2022 17:28
@ljflores
Copy link
Member

@yuriw extra commit has been added ^

…rBounds)

Adds a precondition to RocksDBStore::get_cf_handle(string, IteratorBounds)
to avoid duplicating logic of the only caller (RocksDBStore::get_iterator).
Assertions will fail if preconditions are not met.

Signed-off-by: Cory Snyder <csnyder@iland.com>
}
auto iter = cf_handles.find(prefix);
if (iter == cf_handles.end() || iter->second.hash_l != 0) {
return nullptr;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see here a check for hash_l==0.
This makes the implementation valid.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@markhpc @ifed01 I personally don't feel the need to add config option that would control this modification.
We do not modify here any parts that modify DB state.
I do not believe that reading LESS could have negative impact.

Copy link
Contributor

@aclamk aclamk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Excellent improvement!

@ifed01
Copy link
Contributor

ifed01 commented Apr 22, 2022

jenkins test make check

@cfsnyder
Copy link
Contributor Author

@neha-ojha anything else that I can do to get this one merged?

@ljflores
Copy link
Member

@cfsnyder it's going through some final testing. We plan to get this merged today once the testing is validated.

@ljflores
Copy link
Member

@neha-ojha
Copy link
Member

jenkins test make check

Copy link
Member

@neha-ojha neha-ojha left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @cfsnyder!

@neha-ojha
Copy link
Member

jenkins test make check

1 similar comment
@neha-ojha
Copy link
Member

jenkins test make check

@yuriw yuriw merged commit f6d1cb6 into ceph:master Apr 25, 2022
@idryomov
Copy link
Contributor

This change broke upgrades, from octopus to pacific at least. Reverting #45963 backport from pacific restores omap functionality. See https://tracker.ceph.com/issues/55444 for details.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

10 participants