s3queue: fix logical error "Cannot unregister: table uuid is not registered"#78541
s3queue: fix logical error "Cannot unregister: table uuid is not registered"#78541
Conversation
de39b76 to
bb14ee2
Compare
| if (supports_remove_recursive) | ||
| { | ||
| requests.push_back(zkutil::makeCheckRequest(registry_path, stat.version)); | ||
| requests.push_back(zkutil::makeRemoveRecursiveRequest(zookeeper_path, -1)); |
There was a problem hiding this comment.
The main fix of this PR is here: makeRemoveRecursiveRequest which removes zookeeper_path (e.g. all s3queue metadata in keeper) in the same keeper transaction with check for registry_path's version. Because another table could try to register concurrently and before this change there would be race between checking metadata count (ref count), seeing it is equal to 0 and removing all metadata recursively while some other table could have registered in the middle.
There was a problem hiding this comment.
The 2nd parameter is unsigned:
Coordination::RequestPtr makeRemoveRecursiveRequest(const std::string & path, uint32_t remove_nodes_limit)
Maybe let's avoid the implicit conversion by passing std::numeric_limits<uint32_t>::max()?
| try | ||
| { | ||
| LOG_WARNING(log, "Cannot unregister: registry does not exist"); | ||
| chassert(false); | ||
| return 0; | ||
| } | ||
| zk_client = getZooKeeper(); |
There was a problem hiding this comment.
Here part of old code was just moved inside try-catch, because trySet/tryGet actually throw an exception in case of "keeper session expired error" instead of returning the error code like for other error types, therefore this error was not retried before.
jkartseva
left a comment
There was a problem hiding this comment.
Looks good. I left a couple of questions.
| if (supports_remove_recursive) | ||
| { | ||
| requests.push_back(zkutil::makeCheckRequest(registry_path, stat.version)); | ||
| requests.push_back(zkutil::makeRemoveRecursiveRequest(zookeeper_path, -1)); |
There was a problem hiding this comment.
The 2nd parameter is unsigned:
Coordination::RequestPtr makeRemoveRecursiveRequest(const std::string & path, uint32_t remove_nodes_limit)
Maybe let's avoid the implicit conversion by passing std::numeric_limits<uint32_t>::max()?
| auto drop_lock = zkutil::EphemeralNodeHolder::existing(drop_lock_path, *zk_client); | ||
| try | ||
| { | ||
| zk_client->removeRecursive(zookeeper_path); | ||
| } | ||
| catch (const zkutil::KeeperMultiException & e) | ||
| { | ||
| if (Coordination::isHardwareError(e.code)) | ||
| { | ||
| LOG_TEST(log, "Lost connection to zookeeper, will retry"); | ||
| continue; |
There was a problem hiding this comment.
Is the idea that the drop_lock_path node is being removed in the EphemeralNodeHolder() destructor when drop_lock goes out of scope? I don't usually work with Zookeeper, so I'm asking.
If an exception occurs during zk_client->removeRecursive(zookeeper_path);, and the loop is retried, do we still want to remove the ephemeral node in the error path? There is setAlreadyRemoved() method that makes it no-op.
There was a problem hiding this comment.
Is the idea that the drop_lock_path node is being removed in the EphemeralNodeHolder() destructor when drop_lock goes out of scope?
yes
If an exception occurs during zk_client->removeRecursive(zookeeper_path);, and the loop is retried, do we still want to remove the ephemeral node in the error path?
I think yes, in case of exception it is better to retry everything from scratch. It is possible to optimize and not remove the ephemeral node, but it will overcomplicate the code, which I would like to avoid, because optimization here is not that important.
f7e2691 to
f5a876b
Compare
03229_query_condition_cache_profile_events |
Cherry pick #78541 to 25.3: s3queue: fix logical error "Cannot unregister: table uuid is not registered"
…r: table uuid is not registered"
Backport #78541 to 25.3: s3queue: fix logical error "Cannot unregister: table uuid is not registered"
Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):
In storage S3Queue fix logical error "Cannot unregister: table uuid is not registered". Closes #78285.
Documentation entry for user-facing changes