-
Notifications
You must be signed in to change notification settings - Fork 8.3k
ZK watch leak in DDLWorker #26036
Description
DDLWorker thread checks in a loop for new tasks by calling ZooKeeper GetChildren with a watch callback.
ClickHouse/src/Interpreters/DDLWorker.cpp
Line 374 in 5b0bc8a
| Strings queue_nodes = zookeeper->getChildren(queue_dir, nullptr, queue_updated_event); |
Due to a recent change from #25373 instead of blocking on new events we retry the loop every 10 seconds instead of when watch callback is triggered.
ClickHouse/src/Interpreters/DDLWorker.cpp
Lines 1140 to 1142 in 5b0bc8a
| /// FIXME It may hang for unknown reason. Timeout is just a hotfix. | |
| constexpr int queue_wait_timeout_ms = 10000; | |
| queue_updated_event->tryWait(queue_wait_timeout_ms); |
This causes a zookeeper watch callback leak. It is probably not very important as all that happens is that a new std::function is added to the watches list for the ddl worker path every 10 seconds. What is 32 bytes (sizeof(std::function)?) every 10 seconds?
However, it is annoying to see that on metrics. The metric is somewhat useful for catching bugs and this isn't an important one.
I understand that we could probably fix this "temporary fix" in DDLWorker and remove the tryWait + retrying, but that pattern was seen in other places and is useful for implementing a non-blocking watch, also used in waitForTableReplicaToProcessLogEntry.
Is there a laconic fix for that? Maybe we could somehow remove the callback when not needed/before retry? Maybe we could change the behaviour slightly and add watchcallback only if it wasn't already added (this wouldn't work for getChildren which wraps a Poco Event in a watch)? Anything else beside just ignoring this?
I understand that all the callbacks will be cleared when the event is triggered.