Overview
As part of the work to roll out replicated ClickHouse, we need long-running tests to verify the stability of the replicated cluster. Specifically, we want to know how stable the cluster is when left alone under load for an extended period (e.g., days or weeks).
We'll need to monitor the system and periodically answer the following questions over a long period (a month or so?):
- Is data consistent across all replicas?
- Is query performance acceptable under load?
- Are the queue lengths acceptable under load?
- Do queue lengths grow over time or are they consistent depending on the load?
- <more?>
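As a rough illustration of the first question, the sketch below flags replicas whose data disagrees with the majority. It assumes each replica has already been reduced to a fingerprint, for example a row count plus a hash aggregate over the same table. How those values are collected is left out, and the function and replica names are hypothetical. Because replication is asynchronous, a single mismatch may only mean a replica is lagging, so a check like this is probably most useful when repeated over time.

```python
from collections import Counter


def find_inconsistent_replicas(fingerprints):
    """Return the replicas whose fingerprint differs from the majority.

    `fingerprints` maps a replica name to any hashable summary of its data,
    e.g. a (row_count, checksum) tuple. Both the shape of the summary and
    the replica names are assumptions made for this sketch.
    """
    if not fingerprints:
        return []
    # The most common fingerprint is treated as the reference value.
    majority, _ = Counter(fingerprints.values()).most_common(1)[0]
    return sorted(name for name, fp in fingerprints.items() if fp != majority)


if __name__ == "__main__":
    observed = {
        "replica-1": (1000, "abc"),
        "replica-2": (1000, "abc"),
        "replica-3": (998, "def"),
    }
    print(find_inconsistent_replicas(observed))
```

Comparing against the majority rather than a designated "primary" keeps the check symmetric, which seems to fit a setup where no replica is authoritative.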
Implementation
We'll probably want to use clickhouse-admin to extract information from the system. A clickhouse-admin binary is already installed on each of the clickhouse-{server|keeper} nodes.
To retrieve the information we need, we can leverage the following native ClickHouse tooling:
- system.opentelemetry_span_log
- system.replication_queue and system.distributed_ddl_queue
- system.user_processes
Altinity has a pretty cool ClickHouse stress test suite. We can probably use it for running stress tests, or take inspiration from it to create our own stress tests.
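To judge whether queue lengths grow over time, one option is to sample the size of system.replication_queue periodically and fit a trend line to the samples. The following is a minimal sketch under that assumption: the query is a plain row count, the function name and sampling scheme are hypothetical, and how the query is executed (clickhouse-admin, a client, or something else) is deliberately not shown.

```python
# Assumed query: counts the entries currently in the replication queue.
QUEUE_LENGTH_QUERY = "SELECT count() AS queue_length FROM system.replication_queue"


def queue_growth_per_hour(samples):
    """Least-squares slope of queue length over time, in entries per hour.

    `samples` is a list of (unix_seconds, queue_length) pairs collected by
    running QUEUE_LENGTH_QUERY at intervals. A slope near zero suggests the
    queue is stable; a persistently positive slope suggests it is growing.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_q = sum(q for _, q in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        raise ValueError("samples must span more than one timestamp")
    cov = sum((t - mean_t) * (q - mean_q) for t, q in samples)
    # Slope is in entries per second; convert to entries per hour.
    return cov / var_t * 3600.0


if __name__ == "__main__":
    hourly = [(0, 0), (3600, 10), (7200, 20)]
    print(f"{queue_growth_per_hour(hourly):.1f} entries/hour")
```

A fitted slope is less sensitive to short bursts than comparing the first and last sample, which matters when load varies during the test window.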
Relevant links
Tasks
- system.metric_log and system.asynchronous_metric_log tables #7100