[clickhouse] Clickana monitoring dashboard tool #7207
```diff
 #[serde(rename_all = "snake_case")]
 pub struct SystemTimeSeries {
-    pub time: Timestamp,
+    pub time: String,
```
This is seriously doing my head in. Since Timestamp is an untagged enum, serde is having a hard time deserializing it. My custom deserializer didn't work, but I'll see if I can find a way around it.
```diff
 // The ClickHouse client connects via the TCP port
 let ch_address = {
     let mut addr = *address;
     addr.set_port(CLICKHOUSE_TCP_PORT);
     addr.to_string()
 };
 
 let clickhouse_admin_config =
     PropertyGroupBuilder::new("config")
         .add_property("http_address", "astring", admin_address)
         .add_property(
             "ch_address",
             "astring",
-            address.to_string(),
+            ch_address.to_string(),
```
Not really sure when this broke, but it wouldn't have been caught, since nothing was calling clickhouse_cli (a wrapper around the clickhouse client command) yet.
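A minimal stdlib-only sketch of what the ch_address block does: copy the SocketAddr (it is Copy, so the original is untouched) and replace only its port, keeping the IP. The helper name clickhouse_tcp_address is hypothetical, and the value 9000 for CLICKHOUSE_TCP_PORT is an assumption (it is ClickHouse's default native-protocol port; the constant's actual value isn't shown in this diff).

```rust
use std::net::SocketAddr;

/// Assumption: ClickHouse's default native (TCP) protocol port.
/// The real CLICKHOUSE_TCP_PORT constant may differ.
const CLICKHOUSE_TCP_PORT: u16 = 9000;

/// Hypothetical helper mirroring the ch_address block: copy the address,
/// swap in the TCP port, and render it as "ip:port" ("[ip]:port" for IPv6).
fn clickhouse_tcp_address(address: &SocketAddr) -> String {
    let mut addr = *address; // SocketAddr is Copy, so `address` is unchanged
    addr.set_port(CLICKHOUSE_TCP_PORT);
    addr.to_string()
}
```

Passing address.to_string() directly, as the old line did, would hand the clickhouse client whatever port address already carried rather than the TCP port.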
Just a clarification: the timeseries rendered in each dashboard are per ClickHouse server, not aggregated across the whole cluster into a single view. I misremembered the system tables' engines: they're MergeTree (not ReplicatedMergeTree, as I thought), so their data is unique to each server.
OK, so the docs aren't super great, but I've been experimenting with this to confirm: they are definitely per server.

```
oximeter_cluster-1 :) SELECT toStartOfInterval(event_time, INTERVAL 60 SECOND) AS t, avg(ProfileEvent_Query)
FROM system.metric_log
WHERE event_date >= toDate(now() - 86400) AND event_time >= now() - 86400
GROUP BY t
ORDER BY t WITH FILL STEP 60
SETTINGS date_time_output_format = 'iso'

SELECT
    toStartOfInterval(event_time, toIntervalSecond(60)) AS t,
    avg(ProfileEvent_Query)
FROM system.metric_log
WHERE (event_date >= toDate(now() - 86400)) AND (event_time >= (now() - 86400))
GROUP BY t
ORDER BY t ASC WITH FILL STEP 60
SETTINGS date_time_output_format = 'iso'

Query id: a9eed161-d54c-4d11-b23a-21aeca60ef28

┌────────────────────t─┬─avg(ProfileEvent_Query)─┐
│ 2024-12-18T07:02:00Z │                       0 │
│ 2024-12-18T07:03:00Z │                       0 │
│ 2024-12-18T07:04:00Z │      0.2833333333333333 │
│ 2024-12-18T07:05:00Z │      0.5666666666666667 │
│ 2024-12-18T07:06:00Z │      0.6666666666666666 │
│ 2024-12-18T07:07:00Z │                    0.45 │
│ 2024-12-18T07:08:00Z │      0.2833333333333333 │
│ 2024-12-18T07:09:00Z │                       0 │
│ 2024-12-18T07:10:00Z │                       0 │
│ 2024-12-18T07:11:00Z │                       0 │
│ 2024-12-18T07:12:00Z │     0.03333333333333333 │
│ 2024-12-18T07:13:00Z │    0.016666666666666666 │
│ 2024-12-18T07:14:00Z │                       0 │
│ 2024-12-18T07:15:00Z │                       0 │
│ 2024-12-18T07:16:00Z │     0.18333333333333332 │
│ 2024-12-18T07:17:00Z │    0.016666666666666666 │
│ 2024-12-18T07:18:00Z │     0.43333333333333335 │
│ 2024-12-18T07:19:00Z │    0.016666666666666666 │
│ 2024-12-18T07:20:00Z │                       0 │
│ 2024-12-18T07:21:00Z │                       0 │
│ 2024-12-18T07:22:00Z │                       0 │
│ 2024-12-18T07:23:00Z │                       0 │
│ 2024-12-18T07:24:00Z │    0.016666666666666666 │
│ 2024-12-18T07:25:00Z │    0.016666666666666666 │
│ 2024-12-18T07:26:00Z │      0.6333333333333333 │
│ 2024-12-18T07:27:00Z │                       0 │
│ 2024-12-18T07:28:00Z │                       0 │
└──────────────────────┴─────────────────────────┘

27 rows in set. Elapsed: 0.009 sec. Processed 1.57 thousand rows, 10.60 KB (176.72 thousand rows/s., 1.19 MB/s.)
Peak memory usage: 51.12 KiB.
```
```
oximeter_cluster-2 :) SELECT toStartOfInterval(event_time, INTERVAL 60 SECOND) AS t, avg(ProfileEvent_Query)
FROM system.metric_log
WHERE event_date >= toDate(now() - 86400) AND event_time >= now() - 86400
GROUP BY t
ORDER BY t WITH FILL STEP 60
SETTINGS date_time_output_format = 'iso'

SELECT
    toStartOfInterval(event_time, toIntervalSecond(60)) AS t,
    avg(ProfileEvent_Query)
FROM system.metric_log
WHERE (event_date >= toDate(now() - 86400)) AND (event_time >= (now() - 86400))
GROUP BY t
ORDER BY t ASC WITH FILL STEP 60
SETTINGS date_time_output_format = 'iso'

Query id: 37a3b84f-8844-40b5-ac3a-df4b60ba7b1a

┌────────────────────t─┬─avg(ProfileEvent_Query)─┐
│ 2024-12-18T07:02:00Z │                       0 │
│ 2024-12-18T07:03:00Z │                       0 │
│ 2024-12-18T07:04:00Z │      0.2833333333333333 │
│ 2024-12-18T07:05:00Z │      0.5666666666666667 │
│ 2024-12-18T07:06:00Z │      0.5666666666666667 │
│ 2024-12-18T07:07:00Z │      0.2833333333333333 │
│ 2024-12-18T07:08:00Z │      0.2833333333333333 │
│ 2024-12-18T07:09:00Z │                       0 │
│ 2024-12-18T07:10:00Z │                       0 │
│ 2024-12-18T07:11:00Z │                       0 │
│ 2024-12-18T07:12:00Z │     0.03333333333333333 │
│ 2024-12-18T07:13:00Z │    0.016666666666666666 │
│ 2024-12-18T07:14:00Z │                       0 │
│ 2024-12-18T07:15:00Z │                       0 │
│ 2024-12-18T07:16:00Z │    0.016666666666666666 │
│ 2024-12-18T07:17:00Z │                       0 │
│ 2024-12-18T07:18:00Z │     0.03333333333333333 │
│ 2024-12-18T07:19:00Z │    0.016666666666666666 │
│ 2024-12-18T07:20:00Z │                       0 │
│ 2024-12-18T07:21:00Z │                       0 │
│ 2024-12-18T07:22:00Z │                       0 │
│ 2024-12-18T07:23:00Z │                       0 │
│ 2024-12-18T07:24:00Z │    0.016666666666666666 │
│ 2024-12-18T07:25:00Z │    0.016666666666666666 │
│ 2024-12-18T07:26:00Z │                       0 │
│ 2024-12-18T07:27:00Z │                       0 │
│ 2024-12-18T07:28:00Z │                       0 │
└──────────────────────┴─────────────────────────┘

27 rows in set. Elapsed: 0.021 sec. Processed 2.75 thousand rows, 16.81 KB (134.15 thousand rows/s., 818.93 KB/s.)
Peak memory usage: 64.65 KiB.
```
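To make the query's semantics concrete: toStartOfInterval(event_time, INTERVAL 60 SECOND) floors each row's timestamp to the start of its 60-second bucket, avg(...) averages the rows in each bucket, and ORDER BY t WITH FILL STEP 60 inserts rows for empty buckets between the first and last one, giving the value column its default (0 for numbers). The following is a rough stdlib-only Rust sketch of that computation over (unix_seconds, value) samples. The function name is hypothetical, and it assumes buckets align to multiples of the step since the Unix epoch and ignores the WHERE clause.

```rust
use std::collections::BTreeMap;

/// Hypothetical helper: groups (unix_seconds, value) samples into `step`-second
/// buckets, averages each bucket, and fills empty buckets between the first and
/// last populated one with 0.0, roughly like
/// GROUP BY toStartOfInterval(...) ... ORDER BY t WITH FILL STEP step.
fn bucket_avg_with_fill(samples: &[(u64, f64)], step: u64) -> Vec<(u64, f64)> {
    assert!(step > 0, "step must be non-zero");
    // Bucket start -> (sum of values, number of samples), kept in time order.
    let mut buckets: BTreeMap<u64, (f64, u32)> = BTreeMap::new();
    for &(t, v) in samples {
        let start = t - t % step; // floor to the start of the bucket
        let entry = buckets.entry(start).or_insert((0.0, 0));
        entry.0 += v;
        entry.1 += 1;
    }
    let (first, last) = match (buckets.keys().next(), buckets.keys().next_back()) {
        (Some(&first), Some(&last)) => (first, last),
        _ => return Vec::new(), // no samples, no rows
    };
    // Walk every bucket from first to last, emitting the average where there
    // is data and 0.0 for the gaps (the WITH FILL rows).
    let mut rows = Vec::new();
    let mut t = first;
    while t <= last {
        let avg = buckets
            .get(&t)
            .map(|&(sum, n)| sum / f64::from(n))
            .unwrap_or(0.0);
        rows.push((t, avg));
        t += step;
    }
    rows
}
```

Each server runs this aggregation over its own system.metric_log, which is why the two result sets differ (for example, at 07:06 and 07:18).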
andrewjstone left a comment
Looks good, @karencfv. Just some relatively minor suggestions from me.
```rust
results.len()
);
}
// TODO: Eventually we may want to not have a set amount of charts and make the
```
For future PRs: I think it would be cool to have a little menu of charts on the side of the pane, where you can scroll and select which ones to show without having to restart the app or mess with a TOML file.
You could also allow toggling between a set of predefined layouts so it always looks nice. For example, you could show 1, 2, 4, 6, or 8 charts and allow selecting which ones to show in each view. You could even remember which charts to show in each layout, so you could toggle back and forth between different layouts and see all the charts, some in more detail.
oooooohhhhh nice!! I like that idea. I'll add your comment to the TODO.
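Here is a rough, purely illustrative sketch of the layout idea. The Layout enum, the ChartSelection type, and the chart indices are hypothetical and not part of Clickana, and no Ratatui rendering is shown. The sketch cycles through a fixed set of layouts and remembers the chart selection separately for each layout, so toggling back restores it.

```rust
use std::collections::HashMap;

/// Hypothetical predefined layouts, named by how many charts they show.
#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
enum Layout {
    One,
    Two,
    Four,
    Six,
    Eight,
}

impl Layout {
    const ALL: [Layout; 5] = [Layout::One, Layout::Two, Layout::Four, Layout::Six, Layout::Eight];

    fn chart_count(self) -> usize {
        match self {
            Layout::One => 1,
            Layout::Two => 2,
            Layout::Four => 4,
            Layout::Six => 6,
            Layout::Eight => 8,
        }
    }

    /// Next layout, wrapping around (e.g. bound to a key press).
    fn next(self) -> Layout {
        let i = Layout::ALL.iter().position(|&l| l == self).unwrap();
        Layout::ALL[(i + 1) % Layout::ALL.len()]
    }
}

/// Remembers which charts (hypothetical indices into a chart menu) each
/// layout shows, so switching layouts never loses a selection.
struct ChartSelection {
    current: Layout,
    selected: HashMap<Layout, Vec<usize>>,
}

impl ChartSelection {
    fn new(available_charts: usize) -> Self {
        // Default: each layout shows the first N charts from the menu.
        let selected: HashMap<Layout, Vec<usize>> = Layout::ALL
            .iter()
            .map(|&l| (l, (0..l.chart_count().min(available_charts)).collect::<Vec<usize>>()))
            .collect();
        ChartSelection { current: Layout::One, selected }
    }

    fn toggle_layout(&mut self) {
        self.current = self.current.next();
    }

    /// Replace the chart shown in `slot` of the current layout.
    fn select(&mut self, slot: usize, chart: usize) {
        if let Some(charts) = self.selected.get_mut(&self.current) {
            if slot < charts.len() {
                charts[slot] = chart;
            }
        }
    }

    fn visible(&self) -> &[usize] {
        &self.selected[&self.current]
    }
}
```

Keying the selection by layout lets you flip between a detailed single-chart view and an overview without losing either configuration.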
```rust
let s = self.clone();
let c = client.clone();

let task = tokio::spawn(async move {
```
While this works, it seems somewhat heavy-handed to spawn a task per chart to fetch API data in parallel and then immediately join on them all. Spawn is typically used for longer-running tasks that stick around.
When you want concurrency but don't need to leave the current task, a more common approach is to use FuturesUnordered. https://betterprogramming.pub/futuresunordered-an-efficient-way-to-manage-multiple-futures-in-rust-a24520abc3f6 has a pretty good overview.
Using FuturesUnordered would also remove the need to clone self and client, as they can just be borrowed immutably.
Nice! Thanks for the tip
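FuturesUnordered lives in the futures crate, so it is not reproduced in this stdlib-only sketch. The borrowing point has a standard-library analogue, but it uses a different technique: scoped OS threads rather than futures polled on one task. With std::thread::scope, each concurrent job borrows shared state instead of cloning it, and all jobs are joined before the scope returns. The Client type and its fetch_chart method are hypothetical stand-ins for ClickhouseServerClient and its API calls.

```rust
use std::thread;

/// Hypothetical stand-in for an API client; not ClickhouseServerClient.
struct Client {
    base_url: String,
}

impl Client {
    /// Hypothetical blocking call returning a fake payload for one chart.
    fn fetch_chart(&self, chart: &str) -> String {
        format!("{}/timeseries/{}", self.base_url, chart)
    }
}

/// Fetch every chart concurrently. Scoped threads may borrow `client` and
/// `charts` immutably, so nothing has to be cloned into 'static closures,
/// and every thread is joined before `thread::scope` returns.
fn fetch_all(client: &Client, charts: &[&str]) -> Vec<String> {
    thread::scope(|s| {
        let handles: Vec<_> = charts
            .iter()
            .map(|&chart| s.spawn(move || client.fetch_chart(chart)))
            .collect();
        // Join in spawn order so results line up with `charts`.
        handles
            .into_iter()
            .map(|h| h.join().expect("fetch thread panicked"))
            .collect()
    })
}
```

In async code, FuturesUnordered plays the analogous role inside a single task: push one future per chart and drain them with StreamExt::next, borrowing self and client the same way. Note that it yields results in completion order, not push order.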
```rust
let log = self.new_logger()?;
let client = ClickhouseServerClient::new(&admin_url, log.clone());

let tick_rate = Duration::from_secs(self.refresh_interval);
```
This is not a "rate" but a duration. I'd suggest renaming it to tick_interval.
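The distinction is that a rate counts ticks per unit of time, while Duration::from_secs(self.refresh_interval) is the time between ticks. The sketch below shows how such an interval is commonly used in a terminal UI loop: it computes how long the loop may wait for input before the next refresh is due, and how the corresponding rate would be derived. Both function names are hypothetical.

```rust
use std::time::{Duration, Instant};

/// Hypothetical helper: how long the event loop may block waiting for input
/// before the next refresh is due. Returns Duration::ZERO once overdue.
fn time_until_next_tick(last_tick: Instant, tick_interval: Duration) -> Duration {
    tick_interval.saturating_sub(last_tick.elapsed())
}

/// The corresponding rate, for comparison: refreshes per second.
fn ticks_per_second(tick_interval: Duration) -> f64 {
    1.0 / tick_interval.as_secs_f64()
}
```

The first value is the kind of timeout one would pass to an input-polling call such as crossterm's event::poll. Refresh when the poll times out, then reset last_tick.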
```rust
};
use std::fmt::Display;

const GIBIBYTE_F64: f64 = 1073741824.0;
```
It seems really odd to me to represent numbers of bytes as floats, as they are always whole numbers.
I realize that ClickHouse returns floats for timeseries, but I think that, for types where it makes sense, we should normalize those values to integers rather than bending our data and computations to fit the raw data.
In a dataset, Ratatui requires the data points to be f64, so I think we're sadly stuck with f64.
Ah, got it. That makes sense. Feel free to ignore these comments then :)
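One way to reconcile both points, sketched under assumptions: keep byte counts as integers internally and convert to f64 only when building the (x, y) points that a Ratatui Dataset takes. GIBIBYTE_F64 has the same value as the constant in the diff (2^30). The helpers bytes_to_gib and to_points are hypothetical.

```rust
/// 2^30 bytes, same value as the GIBIBYTE_F64 constant in the diff.
const GIBIBYTE_F64: f64 = 1073741824.0;

/// Hypothetical helper: convert an integer byte count to GiB for plotting.
/// Byte counts up to 2^53 convert to f64 exactly, and dividing by a power
/// of two is exact as well, so no precision is lost for realistic sizes.
fn bytes_to_gib(bytes: u64) -> f64 {
    bytes as f64 / GIBIBYTE_F64
}

/// Build (x, y) points of the shape a Ratatui Dataset expects, keeping the
/// raw samples as integers until this last step.
fn to_points(samples: &[(u64, u64)]) -> Vec<(f64, f64)> {
    samples
        .iter()
        .map(|&(time, bytes)| (time as f64, bytes_to_gib(bytes)))
        .collect()
}
```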
```rust
let mid_label_as_unit =
    values.avg(lower_label_as_unit, upper_label_as_unit);

// To nicely display the mid value label for the Y axis, we do the following:
```
I think you can get rid of this parsing if you just convert all the values to integers at ingestion time.
The only reason I guess you wouldn't want to do this is if there are metrics where there are fractions we actually care about.
I get that this looks super weird 😄, but I added it for cases where there is very little variance between points and the bounds end up very close to each other. Rounding made the midpoint not be in the middle at all, and the data didn't really match up with the labels anymore.
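Here is a sketch of one way to tackle the same problem. It is not necessarily what the PR does, since the PR's steps aren't shown here. The approach computes the midpoint from the unrounded bounds, then picks the smallest number of decimal places at which the lower, mid, and upper labels are all distinct. Close bounds therefore get more precision instead of collapsing onto the same rounded value. The function names and the cap of 6 decimals are hypothetical.

```rust
/// Hypothetical helper: smallest number of decimals (up to `max_decimals`)
/// at which the lower, mid, and upper Y-axis labels all render differently.
/// The midpoint is computed from the unrounded bounds, so it stays centred.
fn label_decimals(lower: f64, upper: f64, max_decimals: usize) -> usize {
    let mid = (lower + upper) / 2.0;
    for decimals in 0..=max_decimals {
        let lo = format!("{:.*}", decimals, lower);
        let md = format!("{:.*}", decimals, mid);
        let hi = format!("{:.*}", decimals, upper);
        if lo != md && md != hi {
            return decimals;
        }
    }
    max_decimals
}

/// Format the three labels at the chosen precision.
fn axis_labels(lower: f64, upper: f64) -> [String; 3] {
    let d = label_decimals(lower, upper, 6);
    let mid = (lower + upper) / 2.0;
    [
        format!("{:.*}", d, lower),
        format!("{:.*}", d, mid),
        format!("{:.*}", d, upper),
    ]
}
```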
```rust
.iter()
.map(|ts| {
(
ts.time.trim_matches('"').parse::<f64>().unwrap_or_else(
```
Why do we need to parse a timestamp into an f64? Can't we use an actual time type instead?
Same as above: the data points in the dataset need to be f64 for Ratatui to render them.
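The conversion in the diff boils down to the sketch below. It assumes, as the parse::<f64>() call implies, that the API returns each time as a possibly quoted number of seconds since the Unix epoch. Both helpers are hypothetical. The second one shows how a typed SystemTime could be kept until the last step and only then turned into the f64 x-coordinate that Ratatui needs.

```rust
use std::time::{Duration, SystemTime, UNIX_EPOCH};

/// Hypothetical helper mirroring ts.time.trim_matches('"').parse::<f64>():
/// strip surrounding double quotes, then parse seconds since the epoch.
fn time_to_x(raw: &str) -> Option<f64> {
    raw.trim_matches('"').parse::<f64>().ok()
}

/// Hypothetical alternative: hold a real time type and convert at the edge.
/// Times before the epoch are clamped to 0.0.
fn system_time_to_x(t: SystemTime) -> f64 {
    t.duration_since(UNIX_EPOCH)
        .unwrap_or(Duration::ZERO)
        .as_secs_f64()
}
```

An ISO-formatted string, like the date_time_output_format = 'iso' output in the manual query, would not parse this way.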
karencfv left a comment
Thanks for the review, @andrewjstone!

Overview
As part of Stage 1 of RFD468, we'll be observing how a ClickHouse cluster behaves compared with a single-node server. This commit introduces a basic tool that lets us visualize internal ClickHouse metric information.
As a starting point, Clickana only has 4 charts, and the user cannot choose which ones are shown. Additionally, it can only render data fetched through API calls. I'd like to make the tool more flexible; other capabilities will be added in follow-up PRs.
Usage
Manual Testing
Next Steps
Related: #6953