Fix PinSelf() read-after-free in DB::GetMergeOperands()#9507
Fix PinSelf() read-after-free in DB::GetMergeOperands()#9507hx235 wants to merge 4 commits intofacebook:mainfrom
Conversation
|
@hx235 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
c79bf82 to
669e53b
Compare
|
@hx235 has updated the pull request. You must reimport the pull request before landing. |
669e53b to
139351d
Compare
|
@hx235 has updated the pull request. You must reimport the pull request before landing. |
139351d to
4801e05
Compare
|
@hx235 has updated the pull request. You must reimport the pull request before landing. |
4801e05 to
b75f73d
Compare
|
@hx235 has updated the pull request. You must reimport the pull request before landing. |
|
@hx235 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
b75f73d to
7daf2b9
Compare
|
@hx235 has updated the pull request. You must reimport the pull request before landing. |
Got you - is it because it's slow? Then now it makes sense to me what the previous PR was not using shared_ptr stuff. Will update this. |
siying
left a comment
There was a problem hiding this comment.
Thank you for debugging this and found the bug!
db/db_impl/db_impl.cc
Outdated
There was a problem hiding this comment.
Besides the good comment by @ajkr , shared_ptr here doesn't appear to be necessary, as it is never shared. A unique_ptr is more straight-forward, and something allocated in the stack would be even better, if possible.
db/version_set.cc
Outdated
There was a problem hiding this comment.
Passing shared_ptr in this way can be expensive, as it will increment an atomic counter. We had several performance regressions due to this.
There was a problem hiding this comment.
Not related to this PR, but the way we do it is that if the caller holds the/an ownership to this object, we just pass the raw pointer down to the child functions. Not sure whether there is going to be a better way of doing it with C++17 though.
I agree that we should use stack-allocated object. And just to clarify, for stack-allocated object, we can't use shared_ptr nor unique_ptr so your first suggestion about unique_ptr won't be adopted. |
7daf2b9 to
d1011cc
Compare
|
@hx235 has updated the pull request. You must reimport the pull request before landing. |
|
@hx235 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
db/version_set.cc
Outdated
There was a problem hiding this comment.
Do we still need to allocate it here if in the common path, there is one passed in?
There was a problem hiding this comment.
I don't like this either - still working around it but couldn't think of another solution to bypass the allocation other than (void) it later. (In fact, not even sure if void-ing it actually frees the memory earlier).
The issue I had is I tried to do some conditional allocation like
if (use_local_pinned_iters_mgr) {
PinnedIteratorsManager local_pinned_iters_mgr;
pinned_iters_mgr = & local_pinned_iters_mgr;
}
But then local_pinned_iters_mgr will be local to the conditional branch scope.
Maybe I'm just not aware of a general coding pattern to this kind of issue - any thought?
There was a problem hiding this comment.
Can we always force the caller to allocate one?
There was a problem hiding this comment.
Good question - I was thinking of a minimum code change so didn't touch any of the callers. But yeah we could do that to all the callers (6 of them) and make sure the passed-in PinnedIteratorsManager life-time is well taken cared. Will update in the next draft.
d1011cc to
8f4a0c7
Compare
|
@hx235 has updated the pull request. You must reimport the pull request before landing. |
|
Update:
|
|
@hx235 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
ajkr
left a comment
There was a problem hiding this comment.
LGTM. I don't know what the range of measured values was in the 10 benchmark runs but am assuming the regression on average would be well within the noise.
Let me double check with the team |
Update - will run more benchmarking as suggested by team. |
|
@ajkr - updated the benchmark command and result. The result seems fine to me but would like to have an extra review on that before merging the PR. |
8f4a0c7 to
944ea6a
Compare
|
@hx235 has updated the pull request. You must reimport the pull request before landing. |
|
Update: rebase |
|
@hx235 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
There was a problem hiding this comment.
LGTM still!
std of micros/op
I had a feeling this was about to show up. @jay-zhuang's google benchmark integration will probably be useful for automating this process of finding a stable result.
Summary:
**Context:**
Running the new test `DBMergeOperandTest.MergeOperandReadAfterFreeBug` prior to this fix surfaces the read-after-free bug of PinSef() as below:
```
READ of size 8 at 0x60400002529d thread T0
#5 0x7f199a in rocksdb::PinnableSlice::PinSelf(rocksdb::Slice const&) include/rocksdb/slice.h:171
#6 0x7f199a in rocksdb::DBImpl::GetImpl(rocksdb::ReadOptions const&, rocksdb::Slice const&, rocksdb::DBImpl::GetImplOptions&) db/db_impl/db_impl.cc:1919
#7 0x540d63 in rocksdb::DBImpl::GetMergeOperands(rocksdb::ReadOptions const&, rocksdb::ColumnFamilyHandle*, rocksdb::Slice const&, rocksdb::PinnableSlice*, rocksdb::GetMergeOperandsOptions*, int*) db/db_impl/db_impl.h:203
freed by thread T0 here:
#3 0x1191399 in rocksdb::cache_entry_roles_detail::RegisteredDeleter<rocksdb::Block, (rocksdb::CacheEntryRole)0>::Delete(rocksdb::Slice const&, void*) cache/cache_entry_roles.h:99
#4 0x719348 in rocksdb::LRUHandle::Free() cache/lru_cache.h:205
#5 0x71047f in rocksdb::LRUCacheShard::Release(rocksdb::Cache::Handle*, bool) cache/lru_cache.cc:547
#6 0xa78f0a in rocksdb::Cleanable::DoCleanup() include/rocksdb/cleanable.h:60
#7 0xa78f0a in rocksdb::Cleanable::Reset() include/rocksdb/cleanable.h:38
#8 0xa78f0a in rocksdb::PinnedIteratorsManager::ReleasePinnedData() db/pinned_iterators_manager.h:71
#9 0xd0c21b in rocksdb::PinnedIteratorsManager::~PinnedIteratorsManager() db/pinned_iterators_manager.h:24
#10 0xd0c21b in rocksdb::Version::Get(rocksdb::ReadOptions const&, rocksdb::LookupKey const&, rocksdb::PinnableSlice*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >*, rocksdb::Status*, rocksdb::MergeContext*, unsigned long*, bool*, bool*, unsigned long*, rocksdb::ReadCallback*, bool*, bool) db/pinned_iterators_manager.h:22
#11 0x7f0fdf in rocksdb::DBImpl::GetImpl(rocksdb::ReadOptions const&, rocksdb::Slice const&, rocksdb::DBImpl::GetImplOptions&) db/db_impl/db_impl.cc:1886
#12 0x540d63 in rocksdb::DBImpl::GetMergeOperands(rocksdb::ReadOptions const&, rocksdb::ColumnFamilyHandle*, rocksdb::Slice const&, rocksdb::PinnableSlice*, rocksdb::GetMergeOperandsOptions*, int*) db/db_impl/db_impl.h:203
previously allocated by thread T0 here:
#1 0x1239896 in rocksdb::AllocateBlock(unsigned long, **rocksdb::MemoryAllocator*)** memory/memory_allocator.h:35
#2 0x1239896 in rocksdb::BlockFetcher::CopyBufferToHeapBuf() table/block_fetcher.cc:171
#3 0x1239896 in rocksdb::BlockFetcher::GetBlockContents() table/block_fetcher.cc:206
#4 0x122eae5 in rocksdb::BlockFetcher::ReadBlockContents() table/block_fetcher.cc:325
#5 0x11b1f45 in rocksdb::Status rocksdb::BlockBasedTable::MaybeReadBlockAndLoadToCache<rocksdb::Block>(rocksdb::FilePrefetchBuffer*, rocksdb::ReadOptions const&, rocksdb::BlockHandle const&, rocksdb::UncompressionDict const&, bool, rocksdb::CachableEntry<rocksdb::Block>*, rocksdb::BlockType, rocksdb::GetContext*, rocksdb::BlockCacheLookupContext*, rocksdb::BlockContents*) const table/block_based/block_based_table_reader.cc:1503
```
Here is the analysis:
- We have [PinnedIteratorsManager](https://github.com/facebook/rocksdb/blob/6.28.fb/db/version_set.cc#L1980) with `Cleanable` capability in our `Version::Get()` path. It's responsible for managing the life-time of pinned iterator and invoking registered cleanup functions during its own destruction.
- For example in case above, the merge operands's clean-up gets associated with this manger in [GetContext::push_operand](https://github.com/facebook/rocksdb/blob/6.28.fb/table/get_context.cc#L405). During PinnedIteratorsManager's [destruction](https://github.com/facebook/rocksdb/blob/6.28.fb/db/pinned_iterators_manager.h#L67), the release function associated with those merge operand data is invoked.
**And that's what we see in "freed by thread T955 here" in ASAN.**
- Bug 🐛: `PinnedIteratorsManager` is local to `Version::Get()` while the data of merge operands need to outlive `Version::Get` and stay till they get [PinSelf()](https://github.com/facebook/rocksdb/blob/6.28.fb/db/db_impl/db_impl.cc#L1905), **which is the read-after-free in ASAN.**
- This bug is likely to be an overlook of `PinnedIteratorsManager` when developing the API `DB::GetMergeOperands` cuz the current logic works fine with the existing case of getting the *merged value* where the operands do not need to live that long.
- This bug was not surfaced much (even in its unit test) due to the release function associated with the merge operands (which are actually blocks put in cache as you can see in `BlockBasedTable::MaybeReadBlockAndLoadToCache` **in "previously allocated by" in ASAN report**) is a cache entry deleter.
The deleter will call `Cache::Release()` which, for LRU cache, won't immediately deallocate the block based on LRU policy [unless the cache is full or being instructed to force erase](https://github.com/facebook/rocksdb/blob/6.28.fb/cache/lru_cache.cc#L521-L531)
- `DBMergeOperandTest.MergeOperandReadAfterFreeBug` makes the cache extremely small to force cache full.
**Summary:**
- Fix the bug by align `PinnedIteratorsManager`'s lifetime with the merge operands
Pull Request resolved: #9507
Test Plan:
- New test `DBMergeOperandTest.MergeOperandReadAfterFreeBug`
- db bench on read path
- Setup (LSM tree with several levels, cache the whole db to avoid read IO, warm cache with readseq to avoid read IO): `TEST_TMPDIR=/dev/shm/rocksdb ./db_bench -benchmarks="fillrandom,readseq -num=1000000 -cache_size=100000000 -write_buffer_size=10000 -statistics=1 -max_bytes_for_level_base=10000 -level0_file_num_compaction_trigger=1``TEST_TMPDIR=/dev/shm/rocksdb ./db_bench -benchmarks="readrandom" -num=1000000 -cache_size=100000000 `
- Actual command run (run 20-run for 20 times and then average the 20-run's average micros/op)
- `for j in {1..20}; do (for i in {1..20}; do rm -rf /dev/shm/rocksdb/ && TEST_TMPDIR=/dev/shm/rocksdb ./db_bench -benchmarks="fillrandom,readseq,readrandom" -num=1000000 -cache_size=100000000 -write_buffer_size=10000 -statistics=1 -max_bytes_for_level_base=10000 -level0_file_num_compaction_trigger=1 | egrep 'readrandom'; done > rr_output_pre.txt && (awk '{sum+=$3; sum_sqrt+=$3^2}END{print sum/20, sqrt(sum_sqrt/20-(sum/20)^2)}' rr_output_pre.txt) >> rr_output_pre_2.txt); done`
- **Result: Pre-change: 3.79193 micros/op; Post-change: 3.79528 micros/op (+0.09%)**
(pre-change)sorted avg micros/op of each 20-run | std of micros/op of each 20-run | (post-change) sorted avg micros/op of each 20-run | std of micros/op of each 20-run
-- | -- | -- | --
3.58355 | 0.265209 | 3.48715 | 0.382076
3.58845 | 0.519927 | 3.5832 | 0.382726
3.66415 | 0.452097 | 3.677 | 0.563831
3.68495 | 0.430897 | 3.68405 | 0.495355
3.70295 | 0.482893 | 3.68465 | 0.431438
3.719 | 0.463806 | 3.71945 | 0.457157
3.7393 | 0.453423 | 3.72795 | 0.538604
3.7806 | 0.527613 | 3.75075 | 0.444509
3.7817 | 0.426704 | 3.7683 | 0.468065
3.809 | 0.381033 | 3.8086 | 0.557378
3.80985 | 0.466011 | 3.81805 | 0.524833
3.8165 | 0.500351 | 3.83405 | 0.529339
3.8479 | 0.430326 | 3.86285 | 0.44831
3.85125 | 0.434108 | 3.8717 | 0.544098
3.8556 | 0.524602 | 3.895 | 0.411679
3.8656 | 0.476383 | 3.90965 | 0.566636
3.8911 | 0.488477 | 3.92735 | 0.608038
3.898 | 0.493978 | 3.9439 | 0.524511
3.97235 | 0.515008 | 3.9623 | 0.477416
3.9768 | 0.519993 | 3.98965 | 0.521481
- CI
Reviewed By: ajkr
Differential Revision: D34030519
Pulled By: hx235
fbshipit-source-id: a99ac585c11704c5ed93af033cb29ba0a7b16ae8
Example of this in test plan of #9611 |
Context:
Running the new test
DBMergeOperandTest.MergeOperandReadAfterFreeBugprior to this fix surfaces the read-after-free bug of PinSef() as below:Here is the analysis:
Cleanablecapability in ourVersion::Get()path. It's responsible for managing the life-time of pinned iterator and invoking registered cleanup functions during its own destruction.And that's what we see in "freed by thread T955 here" in ASAN.
PinnedIteratorsManageris local toVersion::Get()while the data of merge operands need to outliveVersion::Getand stay till they get PinSelf(), which is the read-after-free in ASAN.PinnedIteratorsManagerwhen developing the APIDB::GetMergeOperandscuz the current logic works fine with the existing case of getting the merged value where the operands do not need to live that long.BlockBasedTable::MaybeReadBlockAndLoadToCachein "previously allocated by" in ASAN report) is a cache entry deleter.The deleter will call
Cache::Release()which, for LRU cache, won't immediately deallocate the block based on LRU policy unless the cache is full or being instructed to force eraseDBMergeOperandTest.MergeOperandReadAfterFreeBugmakes the cache extremely small to force cache full.Summary:
PinnedIteratorsManager's lifetime with the longest possible lifetime needed for merge operandsTest:
DBMergeOperandTest.MergeOperandReadAfterFreeBugTEST_TMPDIR=/dev/shm/rocksdb ./db_bench -benchmarks="fillrandom,readseq -num=1000000 -cache_size=100000000 -write_buffer_size=10000 -statistics=1 -max_bytes_for_level_base=10000 -level0_file_num_compaction_trigger=1TEST_TMPDIR=/dev/shm/rocksdb ./db_bench -benchmarks="readrandom" -num=1000000 -cache_size=100000000for j in {1..20}; do (for i in {1..20}; do rm -rf /dev/shm/rocksdb/ && TEST_TMPDIR=/dev/shm/rocksdb ./db_bench -benchmarks="fillrandom,readseq,readrandom" -num=1000000 -cache_size=100000000 -write_buffer_size=10000 -statistics=1 -max_bytes_for_level_base=10000 -level0_file_num_compaction_trigger=1 | egrep 'readrandom'; done > rr_output_pre.txt && (awk '{sum+=$3; sum_sqrt+=$3^2}END{print sum/20, sqrt(sum_sqrt/20-(sum/20)^2)}' rr_output_pre.txt) >> rr_output_pre_2.txt); done