Skip to content

ddl requests hang for engine=kafka() table #10740

@Brutalis64

Description

@Brutalis64
  • There is table with engine=kafka() that watches to kafka topic with ~ 10K events per second. Actually, it happens on different topics with different eps, but they must be "heavy". Also, there is 3 more tables: Materialized view for reading and inserting data, table with data on disk and distributed table, for cluster reason.
  • After some time(i can't get exactly how much, some times hour, sometimes more) when enough data was inserted, i had detached MV table on one random cluster node. And it succedded.
  • Then i try to detach or drop table with kafka engine. Request fails with timeout error:
#> drop table kafka.kafka_engine_table
[2020-05-07 16:27:01] Read timed out

In logs i sow something like that:

2020.05.07 16:22:01.745935 [ 256078 ] {7b2fe9a4-5a7b-4853-a50b-cc146fffc75d} <Debug> executeQuery: (from x.x.x.x:58480, user: admin) select timezone() FORMAT TabSeparatedWithNamesAndTypes;
2020.05.07 16:22:01.746132 [ 256078 ] {7b2fe9a4-5a7b-4853-a50b-cc146fffc75d} <Trace> AccessRightsContext (admin): Access granted: SELECT(dummy) ON system.one
2020.05.07 16:22:01.746176 [ 256078 ] {7b2fe9a4-5a7b-4853-a50b-cc146fffc75d} <Trace> InterpreterSelectQuery: FetchColumns -> Complete
2020.05.07 16:22:01.746319 [ 256078 ] {7b2fe9a4-5a7b-4853-a50b-cc146fffc75d} <Information> executeQuery: Read 1 rows, 1.00 B in 0.000 sec., 2839 rows/sec., 2.77 KiB/sec.
2020.05.07 16:22:01.746395 [ 256078 ] {7b2fe9a4-5a7b-4853-a50b-cc146fffc75d} <Debug> MemoryTracker: Peak memory usage (for query): 4.01 MiB.
2020.05.07 16:22:01.751176 [ 256078 ] {7b2fe9a4-5a7b-4853-a50b-cc146fffc75d} <Debug> executeQuery: (from  x.x.x.x:58480, user: admin) select version() FORMAT TabSeparatedWithNamesAndTypes;
2020.05.07 16:22:01.751272 [ 256078 ] {7b2fe9a4-5a7b-4853-a50b-cc146fffc75d} <Trace> AccessRightsContext (admin): Access granted: SELECT(dummy) ON system.one
2020.05.07 16:22:01.751293 [ 256078 ] {7b2fe9a4-5a7b-4853-a50b-cc146fffc75d} <Trace> InterpreterSelectQuery: FetchColumns -> Complete
2020.05.07 16:22:01.751383 [ 256078 ] {7b2fe9a4-5a7b-4853-a50b-cc146fffc75d} <Information> executeQuery: Read 1 rows, 1.00 B in 0.000 sec., 5511 rows/sec., 5.38 KiB/sec.
2020.05.07 16:22:01.751440 [ 256078 ] {7b2fe9a4-5a7b-4853-a50b-cc146fffc75d} <Debug> MemoryTracker: Peak memory usage (for query): 4.00 MiB.
2020.05.07 16:22:01.779272 [ 134009 ] {7b2fe9a4-5a7b-4853-a50b-cc146fffc75d} <Debug> executeQuery: (from  x.x.x.x:58480, user: admin) SELECT 1 FORMAT TabSeparatedWithNamesAndTypes;
2020.05.07 16:22:01.779425 [ 134009 ] {7b2fe9a4-5a7b-4853-a50b-cc146fffc75d} <Trace> AccessRightsContext (admin): Access granted: SELECT(dummy) ON system.one
2020.05.07 16:22:01.779467 [ 134009 ] {7b2fe9a4-5a7b-4853-a50b-cc146fffc75d} <Trace> InterpreterSelectQuery: FetchColumns -> Complete
2020.05.07 16:22:01.779654 [ 134009 ] {7b2fe9a4-5a7b-4853-a50b-cc146fffc75d} <Information> executeQuery: Read 1 rows, 1.00 B in 0.000 sec., 2922 rows/sec., 2.85 KiB/sec.
2020.05.07 16:22:01.779759 [ 134009 ] {7b2fe9a4-5a7b-4853-a50b-cc146fffc75d} <Debug> MemoryTracker: Peak memory usage (for query): 4.00 MiB.
2020.05.07 16:22:01.788321 [ 256078 ] {7b2fe9a4-5a7b-4853-a50b-cc146fffc75d} <Debug> executeQuery: (from  x.x.x.x:58480, user: admin) drop table kafka.kafka_engine_table

And nothing more...
strace for thread shows loop like that:

--- SIGUSR1 {si_signo=SIGUSR1, si_code=SI_TIMER, si_timerid=0x3efda, ```
si_overrun=0, si_value={int=0, ptr=0}} ---
mincore(0x7f5b568ec000, 1, [1])         = 0
mincore(0x7f5b568ec000, 1, [1])         = 0
mincore(0x7f5b568ec000, 1, [1])         = 0
mincore(0x7f5b568ed000, 1, [1])         = 0
mincore(0x7f5b568ed000, 1, [1])         = 0
mincore(0x7f5b568ed000, 1, [1])         = 0
mincore(0x7f5b568ed000, 1, [1])         = 0
mincore(0x7f5b568ed000, 1, [1])         = 0
mincore(0x7f5b568ed000, 1, [1])         = 0
mincore(0x7f5b568ed000, 1, [1])         = 0
mincore(0x7f5b568ed000, 1, [1])         = 0
mincore(0x7f5b568ee000, 1, [1])         = 0
mincore(0x7f5b568ef000, 1, [1])         = 0
mincore(0x7f5b568ef000, 1, [1])         = 0
mincore(0x7f5b568f0000, 1, [1])         = 0
mincore(0x7f5b568f1000, 1, [1])         = 0
mincore(0x7f5b568f1000, 1, [1])         = 0
mincore(0x7f5b568f1000, 1, [1])         = 0
mincore(0x7f5b568f1000, 1, [1])         = 0
mincore(0x7f5b568f1000, 1, [1])         = 0
mincore(0x7f5b568f1000, 1, [1])         = 0
mincore(0x7f5b568f1000, 1, [1])         = 0
mincore(0x7f5b568f1000, 1, [1])         = 0
mincore(0x7f5b568f1000, 1, [1])         = 0
write(3254, "\0$7b2fe9a4-5a7b-4853-a50b-cc146f"..., 216) = 216
rt_sigreturn({mask=[PIPE]})             = 202
futex(0x7f63c882c5b0, FUTEX_WAIT_PRIVATE, 2, NULL) = ? ERESTARTSYS (To be restarted if SA_RESTART is set)

But this issue started with more complicated case with cluster DDL. When i detached MV tables on all cluster nodes. Then i tried to drop kafka tables by "on cluster" query and request failed with timeout:

[2020-05-07 14:43:08] Code: 159, e.displayText() = DB::Exception: Watching task /clickhouse/task_queue/ddl/query-0000008894 is executing longer than distributed_ddl_task_timeout (=180) seconds. There are 10 unfinished hosts (0 of them are currently active), they are going to execute the query in background (version 20.3.8.53 (official build))

And after that no one ddl on cluster had not been worked. In zookeeper for query 0000008894 i sow 10 nodes in status active. And for futher queries nor active or finished nodes was created in zookeeper.

Version: version 20.3.8.53 but it happened in 19th branch it happened too.

Metadata

Metadata

Assignees

Labels

bugConfirmed user-visible misbehaviour in official releasecomp-message-queuesMessage queue integrations (Kafka, RabbitMQ, NATS table engines for stream ingestion/egress).

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions