
bluestore: do not _collection_list when _remove_collection #49617

Closed
Rethan wants to merge 1 commit into ceph:main from Rethan:wip-remove-pg-check

Conversation

@Rethan (Contributor) commented Jan 3, 2023

During PG removal we see very high latencies in BlueStore::_collection_list() operations. These latencies regularly exceed the OSD suicide timeout and cause the OSDs to thrash. We can remove the check here, since PG::do_delete_work already checks with collection_list that there are no objects left in the DB for the collection before it calls remove_collection.

Fixes: https://tracker.ceph.com/issues/58274
Signed-off-by: haoyixing <haoyixing@kuaishou.com>
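The argument in the description can be made concrete with a toy model. This is a sketch under stated assumptions, not Ceph code: `ToyStore`, `do_delete_work`, and the two removal variants are simplified stand-ins that only count listing calls, to show that the store-level emptiness check repeats a scan the caller has already done.

```cpp
// Toy model (all names are simplified stand-ins, not Ceph code): the
// caller deletes objects in small batches and its final listing already
// proves the collection is empty, so the listing inside the checked
// removal is one extra full-range scan.
#include <cassert>
#include <map>
#include <string>
#include <vector>

struct ToyStore {
  std::map<std::string, int> objects;  // collection contents
  int list_calls = 0;                  // how many listings happened

  std::vector<std::string> collection_list(size_t max) {
    ++list_calls;
    std::vector<std::string> out;
    for (const auto& kv : objects) {
      if (out.size() == max) break;
      out.push_back(kv.first);
    }
    return out;
  }

  // Current behaviour: re-list to enforce emptiness, fail with -ENOTEMPTY.
  int remove_collection_checked() {
    if (!collection_list(1).empty()) return -39;  // -ENOTEMPTY
    return 0;
  }

  // This PR's behaviour: trust that the caller already verified emptiness.
  int remove_collection_unchecked() { return 0; }
};

// Modelled on PG::do_delete_work: delete in batches of `batch`, stopping
// only when a listing comes back empty (the caller-side emptiness proof).
void do_delete_work(ToyStore& s, size_t batch) {
  while (true) {
    auto objs = s.collection_list(batch);
    if (objs.empty()) break;
    for (const auto& o : objs) s.objects.erase(o);
  }
}
```

In this model the unchecked removal performs no listing at all, while the checked removal always performs one more scan over a keyspace the caller just proved empty.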


@Rethan Rethan requested a review from a team as a code owner January 3, 2023 07:59
@cfsnyder (Contributor) commented Jan 8, 2023

I do agree that it would be prudent to avoid iterating over the PG namespace twice during PG deletion, since we know that iterating over the tombstones can be so painful. However, I would argue that it makes more sense to remove the iteration within do_delete_work at the OSD level rather than this one at the BlueStore level. I think it's important that BlueStore maintains and enforces the invariant that a collection must be empty before it can be removed, so that we can definitively avoid dark data without relying on each BlueStore client to uphold that promise. We should be able to handle the ENOTEMPTY return value from BlueStore::_remove_collection within do_delete_work to maintain effectively the same logic for iteratively retrying the PG deletion in the case where new objects are detected.
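The retry-on-ENOTEMPTY alternative suggested here can be sketched as follows. This is a hypothetical illustration, not Ceph code: `ToyColl`, `racy_inserts`, and `delete_then_remove` are invented names modelling the idea that the store keeps its emptiness invariant while the OSD-level caller treats -ENOTEMPTY as "new objects appeared, run another delete pass" instead of asserting.

```cpp
// Hypothetical sketch (not Ceph's API): the store still refuses to remove
// a non-empty collection, and the caller loops on -ENOTEMPTY.
#include <cassert>
#include <set>
#include <string>

struct ToyColl {
  std::set<std::string> objects;
  int racy_inserts = 0;  // objects that race in between delete passes

  int remove_collection() {
    if (racy_inserts > 0) {        // model a racing writer
      objects.insert("late_obj");
      --racy_inserts;
    }
    return objects.empty() ? 0 : -39;  // -ENOTEMPTY
  }
};

// Returns the number of delete passes needed, or -1 if we gave up.
int delete_then_remove(ToyColl& c, int max_retries) {
  for (int attempt = 1; attempt <= max_retries; ++attempt) {
    c.objects.clear();             // stands in for the batched delete loop
    if (c.remove_collection() == 0) return attempt;
  }
  return -1;
}
```

The design point is that the invariant stays inside the store, so no client can leave dark data behind, while the caller keeps effectively the same iterative-retry logic it has today.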

@cfsnyder (Contributor) commented Jan 9, 2023

Although, on closer inspection of the code, I do acknowledge that even the BlueStore-level check does not provide a "definitive" guarantee that there are no objects left behind, since the check does not happen transactionally.

@Rethan (Contributor, Author) commented Jan 9, 2023

> Although, with closer inspection of the code, I do acknowledge that even the bluestore level check does not provide a "definitive" guarantee that there are no objects left behind since the check does not happen transactionally.

Based on this, it seems we can still remove the check in BlueStore: it does not provide a guarantee, there is already a check in the OSD, and the OSD is the main user of BlueStore, so BlueStore need not check by itself.

@ifed01 (Contributor) commented Jan 9, 2023

Honestly, I have conflicting feelings about this patch.
On one hand, this final check is useless most of the time (indeed, an ENOTEMPTY return triggers an assertion at BlueStore::_txc_add_transaction(), which we don't face on a regular basis), and it might indeed have some negative impact on performance.
On the other hand, do_delete_work calls collection_list() multiple times, and the fix just decrements the number of such calls by one. That is apparently not a big gain given the number of listings performed during a large collection removal... Not to mention that the removed check is helpful to QA in catching newly introduced bugs.
@Rethan, do you have any estimates of the savings this could provide? Does this last check cause OSD suicide much more frequently than the collection_list() calls from do_delete_work()? And wouldn't #49438 provide sufficient improvement?

@Rethan (Contributor, Author) commented Jan 10, 2023

> Honestly I have conflicting feelings about this patch. [...] And wouldn't #49438 provide sufficient improvement?

In PG::do_delete_work the max parameter of the collection list is set to osd_target_transaction_size (default 30), but in BlueStore it is set to nonexistent_count (the number of deleted objects), which can be far larger than 30. I think (personally) that's why the OSD asserts when BlueStore runs _collection_list: the former does several small checks, while the latter does one big check. @ifed01
Splitting the iteration into small steps only inside BlueStore::_collection_list won't be sufficient, since the caller BlueStore::_remove_collection still has to wait a long time. Limiting the iteration is a good idea for solving this, but it needs extra modifications.

@cfsnyder (Contributor) commented:

> On the other hand - do_delete_work calls collection_list() multiple times and the fix just decrements amount of such calls by one. Which is apparently not a big gain given the amount of listings obtained during large collection removal...

Not all collection_list calls are problematic for performance, and it isn't really related to the max parameter. None of the initial collection_list calls in do_delete_work iterate over tombstones, because they are part of the initial sequential iteration that is generating the tombstones. Two invocations of collection_list are problematic: the very last one in do_delete_work (the one that verifies that no new objects were created while individual objects were being deleted iteratively), and the one in BlueStore::_remove_collection. These two are problematic because they set the start and end parameters to the start/end of the PG keyspace, which is filled with contiguous tombstones. When RocksDB attempts to seek or iterate over a bunch of contiguous tombstones like that, those operations become O(N * log(N)) instead of the normal O(log(N)).
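The cost asymmetry described here can be illustrated with a toy model. This is not RocksDB's actual implementation: `Entry`, `seek_first_live`, and the touch counter are invented to show that a seek into a range of contiguous tombstones degenerates from logarithmic to roughly linear work, because the iterator must step over every dead entry before finding a live key.

```cpp
// Toy model (not RocksDB internals): binary search finds the seek
// position in O(log N), but the iterator then skips contiguous
// tombstones one by one, so tombstone-heavy ranges cost O(N)-ish work.
#include <cassert>
#include <vector>

struct Entry { int key; bool tombstone; };

// Seek to the first live key with key >= target. Returns its index and
// counts how many entries were touched along the way.
int seek_first_live(const std::vector<Entry>& run, int target, int& touched) {
  int lo = 0, hi = (int)run.size();
  while (lo < hi) {                       // binary search: O(log N)
    int mid = lo + (hi - lo) / 2;
    ++touched;
    if (run[mid].key < target) lo = mid + 1; else hi = mid;
  }
  while (lo < (int)run.size() && run[lo].tombstone) {
    ++touched;                            // linear skip over tombstones
    ++lo;
  }
  return lo;
}
```

With all-live data the seek touches a handful of entries; when the range is almost entirely tombstones (as after a PG's objects are deleted), the same seek touches nearly every entry in the range.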

So removing one of those two iterations will be significant, but we still need further optimization for the other remaining iteration if we're going to really solve the issue.

@cfsnyder (Contributor) commented Jan 10, 2023

My latest thought on this problem is that we might consider adding an optional deadline parameter to BlueStore::collection_list that passes through to the rocksdb::ReadOptions::deadline param on the underlying iterators. We'd also have to update the error-handling logic to bubble up a TIMEOUT error through collection_list. With that, we could set a reasonable deadline timeout on the final iteration within do_delete_work (maybe 1 second or so) to give us a sort of side-channel indication of whether our tombstones have been sufficiently compacted (when tombstones are completely compacted within our query range, we'd expect that collection_list op to take under 1 ms, since it would be just a single parallel binary search across SST tables). If we get a TIMEOUT error, we just continually reschedule the OSD op with a delay to give more time for flushes/compactions. As far as I know, adding a delay on the final removal of the PG would not be problematic. With effective use of @markhpc's Compact on Delete tunables, it shouldn't take too long for the tombstones to be cleaned up.
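The deadline idea can be sketched as follows. This is a hedged illustration, not Ceph's or RocksDB's API: a step budget stands in for rocksdb::ReadOptions::deadline wall-clock time, and `ListResult::TIMEOUT` stands in for the error the comment proposes bubbling up through collection_list.

```cpp
// Hedged sketch (invented names): abandon a listing once its budget is
// spent so the caller can reschedule instead of stalling past the OSD
// suicide timeout.
#include <cassert>
#include <vector>

enum class ListResult { OK, TIMEOUT };

// Scan a run where true = tombstone; live keys are appended to live_out.
ListResult bounded_list(const std::vector<bool>& run, int budget,
                        std::vector<int>* live_out) {
  for (int i = 0; i < (int)run.size(); ++i) {
    if (budget-- <= 0) return ListResult::TIMEOUT;
    if (!run[i]) live_out->push_back(i);
  }
  return ListResult::OK;
}
```

On TIMEOUT the caller would reschedule the final PG removal with a delay; once background compaction has dropped the tombstones, a later attempt completes well within the budget.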

If we were to go that route, then neither of these two collection_list invocations would be problematic, because neither would be excessively iterating over tombstones.

github-actions bot commented Feb 9, 2023

This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved.

github-actions bot commented Aug 2, 2023

This pull request has been automatically marked as stale because it has not had any activity for 60 days. It will be closed if no further activity occurs for another 30 days.
If you are a maintainer or core committer, please follow up on this pull request to identify what steps should be taken by the author to move this proposed change forward.
If you are the author of this pull request, thank you for your proposed contribution. If you believe this change is still appropriate, please ensure that any feedback has been addressed and ask for a code review.

@github-actions github-actions bot added the stale label Aug 2, 2023
github-actions bot commented Sep 2, 2023

This pull request has been automatically closed because there has been no activity for 90 days. Please feel free to reopen this pull request (or open a new one) if the proposed change is still appropriate. Thank you for your contribution!

@github-actions github-actions bot closed this Sep 2, 2023
