Skip to content

s3queue: fix logical error "Cannot unregister: table uuid is not registered"#78541

Merged
kssenii merged 5 commits intomasterfrom
s3queue-fix-logical-error-cannot-unregister
Apr 8, 2025
Merged

s3queue: fix logical error "Cannot unregister: table uuid is not registered"#78541
kssenii merged 5 commits intomasterfrom
s3queue-fix-logical-error-cannot-unregister

Conversation

@kssenii
Copy link
Copy Markdown
Member

@kssenii kssenii commented Apr 1, 2025

Changelog category (leave one):

  • Bug Fix (user-visible misbehavior in an official stable release)

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):

In storage S3Queue fix logical error "Cannot unregister: table uuid is not registered". Closes #78285.

Documentation entry for user-facing changes

  • Documentation is written (mandatory for new features)

@clickhouse-gh
Copy link
Copy Markdown
Contributor

clickhouse-gh bot commented Apr 1, 2025

Workflow [PR], commit [f5a876b]

@clickhouse-gh clickhouse-gh bot added the pr-bugfix Pull request with bugfix, not backported by default label Apr 1, 2025
@jkartseva jkartseva self-assigned this Apr 1, 2025
@kssenii kssenii marked this pull request as draft April 1, 2025 20:59
@kssenii kssenii force-pushed the s3queue-fix-logical-error-cannot-unregister branch from de39b76 to bb14ee2 Compare April 7, 2025 15:15
Comment on lines +748 to +763
if (supports_remove_recursive)
{
requests.push_back(zkutil::makeCheckRequest(registry_path, stat.version));
requests.push_back(zkutil::makeRemoveRecursiveRequest(zookeeper_path, -1));
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The main fix of this PR is here: makeRemoveRecursiveRequest which removes zookeeper_path (e.g. all s3queue metadata in keeper) in the same keeper transaction with check for registry_path's version. Because another table could try to register concurrently and before this change there would be race between checking metadata count (ref count), seeing it is equal to 0 and removing all metadata recursively while some other table could have registered in the middle.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The 2nd parameter is unsigned:

Coordination::RequestPtr makeRemoveRecursiveRequest(const std::string & path, uint32_t remove_nodes_limit)

Maybe let's avoid the implicit conversion by passing std::numeric_limits<uint32_t>::max()?

Comment on lines +706 to +720
try
{
LOG_WARNING(log, "Cannot unregister: registry does not exist");
chassert(false);
return 0;
}
zk_client = getZooKeeper();
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here part of old code was just moved inside try-catch, because trySet/tryGet actually throw an exception in case of "keeper session expired error" instead of returning the error code like for other error types, therefore this error was not retried before.

@kssenii kssenii marked this pull request as ready for review April 7, 2025 15:18
Copy link
Copy Markdown
Member

@jkartseva jkartseva left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. I left a couple of questions.

Comment on lines +748 to +763
if (supports_remove_recursive)
{
requests.push_back(zkutil::makeCheckRequest(registry_path, stat.version));
requests.push_back(zkutil::makeRemoveRecursiveRequest(zookeeper_path, -1));
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The 2nd parameter is unsigned:

Coordination::RequestPtr makeRemoveRecursiveRequest(const std::string & path, uint32_t remove_nodes_limit)

Maybe let's avoid the implicit conversion by passing std::numeric_limits<uint32_t>::max()?

Comment on lines +805 to +815
auto drop_lock = zkutil::EphemeralNodeHolder::existing(drop_lock_path, *zk_client);
try
{
zk_client->removeRecursive(zookeeper_path);
}
catch (const zkutil::KeeperMultiException & e)
{
if (Coordination::isHardwareError(e.code))
{
LOG_TEST(log, "Lost connection to zookeeper, will retry");
continue;
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the idea that the drop_lock_path node is being removed in the EphemeralNodeHolder() destructor when drop_lock goes out of scope? I don't usually work with Zookeeper, so I'm asking.
If an exception occurs during zk_client->removeRecursive(zookeeper_path);, and the loop is retried, do we still want to remove the ephemeral node in the error path? There is setAlreadyRemoved() method that makes it no-op.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the idea that the drop_lock_path node is being removed in the EphemeralNodeHolder() destructor when drop_lock goes out of scope?

yes

If an exception occurs during zk_client->removeRecursive(zookeeper_path);, and the loop is retried, do we still want to remove the ephemeral node in the error path?

I think yes, in case of exception it is better to retry everything from scratch. It is possible to optimize and not remove the ephemeral node, but it will overcomplicate the code, which I would like to avoid, because optimization here is not that important.

@CheSema CheSema mentioned this pull request Apr 8, 2025
1 task
@kssenii kssenii force-pushed the s3queue-fix-logical-error-cannot-unregister branch from f7e2691 to f5a876b Compare April 8, 2025 13:55
@kssenii kssenii added the pr-must-backport Pull request should be backported intentionally. Use this label with great care! label Apr 8, 2025
@kssenii
Copy link
Copy Markdown
Member Author

kssenii commented Apr 8, 2025

Stateless tests (release, ParallelReplicas, s3 storage)
Stateless tests (release, ParallelReplicas, s3 storage) — fail: 1

03229_query_condition_cache_profile_events

@kssenii kssenii added this pull request to the merge queue Apr 8, 2025
Merged via the queue into master with commit dff33f5 Apr 8, 2025
118 of 120 checks passed
@kssenii kssenii deleted the s3queue-fix-logical-error-cannot-unregister branch April 8, 2025 21:21
@robot-ch-test-poll robot-ch-test-poll added the pr-synced-to-cloud The PR is synced to the cloud repo label Apr 8, 2025
@robot-ch-test-poll3 robot-ch-test-poll3 added the pr-backports-created-cloud deprecated label, NOOP label Apr 28, 2025
robot-ch-test-poll2 added a commit that referenced this pull request May 2, 2025
Cherry pick #78541 to 25.3: s3queue: fix logical error "Cannot unregister: table uuid is not registered"
robot-clickhouse added a commit that referenced this pull request May 2, 2025
@robot-ch-test-poll4 robot-ch-test-poll4 added the pr-backports-created Backport PRs are successfully created, it won't be processed by CI script anymore label May 2, 2025
alexey-milovidov added a commit that referenced this pull request May 3, 2025
Backport #78541 to 25.3: s3queue: fix logical error "Cannot unregister: table uuid is not registered"
@robot-clickhouse robot-clickhouse added the pr-must-backport-synced The `*-must-backport` labels are synced into the cloud Sync PR label Jul 2, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

pr-backports-created Backport PRs are successfully created, it won't be processed by CI script anymore pr-backports-created-cloud deprecated label, NOOP pr-bugfix Pull request with bugfix, not backported by default pr-must-backport Pull request should be backported intentionally. Use this label with great care! pr-must-backport-synced The `*-must-backport` labels are synced into the cloud Sync PR pr-synced-to-cloud The PR is synced to the cloud repo

Projects

None yet

Development

Successfully merging this pull request may close these issues.

DB::Exception: Cannot unregister: table 'UUID' is not registered. (LOGICAL_ERROR)

7 participants