Skip to content

ZK watch leak in DDLWorker #26036

@nvartolomei

Description

@nvartolomei

DDLWorker thread checks in a loop for new tasks by calling ZooKeeper GetChildren with a watch callback.

Strings queue_nodes = zookeeper->getChildren(queue_dir, nullptr, queue_updated_event);

Due to a recent change from #25373 instead of blocking on new events we retry the loop every 10 seconds instead of when watch callback is triggered.

/// FIXME It may hang for unknown reason. Timeout is just a hotfix.
constexpr int queue_wait_timeout_ms = 10000;
queue_updated_event->tryWait(queue_wait_timeout_ms);

This causes a zookeeper watch callback leak. It is probably not very important as all that happens is that a new std::function is added to the watches list for the ddl worker path every 10 seconds. What is 32 bytes (sizeof(std::function)?) every 10 seconds?

However, it is annoying to see that on metrics. The metric is somewhat useful for catching bugs and this isn't an important one.

I understand that we could probably fix this "temporary fix" in DDLWorker and remove the tryWait + retrying, but that pattern was seen in other places and is useful for implementing a non-blocking watch, also used in waitForTableReplicaToProcessLogEntry.

Is there a laconic fix for that? Maybe we could somehow remove the callback when not needed/before retry? Maybe we could change the behaviour slightly and add watchcallback only if it wasn't already added (this wouldn't work for getChildren which wraps a Poco Event in a watch)? Anything else beside just ignoring this?


I understand that all the callbacks will be cleared when the event is triggered.

Metadata

Metadata

Assignees

Labels

Type

No type
No fields configured for issues without a type.

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions