[clickhouse] Clickana monitoring dashboard tool by karencfv · Pull Request #7207 · oxidecomputer/omicron

karencfv · 2024-12-05T06:44:39Z

Overview

As part of Stage 1 of RFD468 we'll be observing how a ClickHouse cluster behaves in comparison with a single-node server. This commit introduces a basic tool that lets us visualize internal ClickHouse metric information.

As a starting point, Clickana only has 4 charts, and the user cannot choose what these are. Additionally, it can only render data by making API calls. I'd like to make the tool more flexible; other capabilities will be added in follow-up PRs.

Usage

clickana --help                                    
Usage: clickana [OPTIONS] --clickhouse-addr <CLICKHOUSE_ADDR>

Options:
  -l, --log-path <LOG_PATH>                    Path to the log file [env: CLICKANA_LOG_PATH=] [default: /tmp/clickana.log]
  -a, --clickhouse-addr <CLICKHOUSE_ADDR>      Address where a clickhouse admin server is listening on
  -s, --sampling-interval <SAMPLING_INTERVAL>  The interval to collect monitoring data in seconds [default: 60]
  -t, --time-range <TIME_RANGE>                Range of time to collect monitoring data in seconds [default: 3600]
  -r, --refresh-interval <REFRESH_INTERVAL>    The interval at which the dashboards will refresh [default: 60]
  -h, --help                                   Print help

Manual Testing

root@oxz_clickhouse_015f9c34:~# /opt/oxide/clickana/bin/clickana -a [fd00:1122:3344:101::e]:8888

Next Steps

Let the user set which metrics they would like to visualise in each chart. This may be nice to do through a TOML file or something. We could let them choose which unit to represent them in as well perhaps.
Have more metrics available.
It'd be nice to have the ability to take the timeseries as JSON instead of calling the API as well. This could be useful in the future to have some insight into our customers' racks for debugging purposes. We could include ClickHouse internal metric timeseries as part of the support bundles and they could be visualised via Clickana. WDYT @smklein ?

Related: #6953

karencfv · 2024-12-05T06:49:36Z

 #[serde(rename_all = "snake_case")]
 pub struct SystemTimeSeries {
-    pub time: Timestamp,
+    pub time: String,


This is seriously doing my head in. Since Timestamp is an untagged enum, serde is having a hard time deserializing it. My custom deserializer didn't work, but I'll see if I can find a way.

karencfv · 2024-12-10T08:03:43Z

Almost there...

karencfv · 2024-12-11T08:43:14Z

+                // The ClickHouse client connects via the TCP port
+                let ch_address = {
+                    let mut addr = *address;
+                    addr.set_port(CLICKHOUSE_TCP_PORT);
+                    addr.to_string()
+                };
+
                let clickhouse_admin_config =
                    PropertyGroupBuilder::new("config")
                        .add_property("http_address", "astring", admin_address)
                        .add_property(
                            "ch_address",
                            "astring",
-                            address.to_string(),
+                            ch_address.to_string(),


Not really sure when this broke, but it wouldn't have been caught as nothing was calling clickhouse_cli (a wrapper around the clickhouse client command) yet.
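The port swap in the hunk above can be sketched in isolation as follows. This is a minimal sketch that assumes the address is an IPv6 socket address and that CLICKHOUSE_TCP_PORT is 9000 (ClickHouse's default native TCP port); the real constant and the address type are defined elsewhere in the codebase and are not shown in this thread.

```rust
use std::net::SocketAddrV6;

// Assumed value: 9000 is ClickHouse's default native TCP port. The real
// CLICKHOUSE_TCP_PORT constant is not shown in this thread.
const CLICKHOUSE_TCP_PORT: u16 = 9000;

/// Returns the same host as `address`, but pointing at the native TCP port.
fn ch_address(address: &SocketAddrV6) -> String {
    // SocketAddrV6 is Copy, so the caller's address is left untouched.
    let mut addr = *address;
    addr.set_port(CLICKHOUSE_TCP_PORT);
    addr.to_string()
}
```

Only the port changes; the host part of the address is carried over as-is.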

karencfv · 2024-12-18T04:00:34Z

Just want to make a clarification. The timeseries rendered in each dashboard are per clickhouse server, not of the whole cluster aggregated into a single view. I misremembered the system tables' engines. They're MergeTree (not ReplicatedMergeTree as I thought), so unique to each server.
Frankly, it's probably more useful this way. If there are unexpected discrepancies between the nodes, we'll be able to identify them more easily.

karencfv · 2024-12-18T07:30:43Z

Ok, so the docs aren't super great, but I've been playing around with this to confirm. They are definitely per server.

oximeter_cluster-1 :) SELECT toStartOfInterval(event_time, INTERVAL 60 SECOND) AS t, avg(ProfileEvent_Query)
FROM system.metric_log
WHERE event_date >= toDate(now() - 86400) AND event_time >= now() - 86400
GROUP BY t
ORDER BY t WITH FILL STEP 60
SETTINGS date_time_output_format = 'iso'

SELECT
    toStartOfInterval(event_time, toIntervalSecond(60)) AS t,
    avg(ProfileEvent_Query)
FROM system.metric_log
WHERE (event_date >= toDate(now() - 86400)) AND (event_time >= (now() - 86400))
GROUP BY t
ORDER BY t ASC WITH FILL STEP 60
SETTINGS date_time_output_format = 'iso'

Query id: a9eed161-d54c-4d11-b23a-21aeca60ef28

┌────────────────────t─┬─avg(ProfileEvent_Query)─┐
│ 2024-12-18T07:02:00Z │                       0 │
│ 2024-12-18T07:03:00Z │                       0 │
│ 2024-12-18T07:04:00Z │      0.2833333333333333 │
│ 2024-12-18T07:05:00Z │      0.5666666666666667 │
│ 2024-12-18T07:06:00Z │      0.6666666666666666 │
│ 2024-12-18T07:07:00Z │                    0.45 │
│ 2024-12-18T07:08:00Z │      0.2833333333333333 │
│ 2024-12-18T07:09:00Z │                       0 │
│ 2024-12-18T07:10:00Z │                       0 │
│ 2024-12-18T07:11:00Z │                       0 │
│ 2024-12-18T07:12:00Z │     0.03333333333333333 │
│ 2024-12-18T07:13:00Z │    0.016666666666666666 │
│ 2024-12-18T07:14:00Z │                       0 │
│ 2024-12-18T07:15:00Z │                       0 │
│ 2024-12-18T07:16:00Z │     0.18333333333333332 │
│ 2024-12-18T07:17:00Z │    0.016666666666666666 │
│ 2024-12-18T07:18:00Z │     0.43333333333333335 │
│ 2024-12-18T07:19:00Z │    0.016666666666666666 │
│ 2024-12-18T07:20:00Z │                       0 │
│ 2024-12-18T07:21:00Z │                       0 │
│ 2024-12-18T07:22:00Z │                       0 │
│ 2024-12-18T07:23:00Z │                       0 │
│ 2024-12-18T07:24:00Z │    0.016666666666666666 │
│ 2024-12-18T07:25:00Z │    0.016666666666666666 │
│ 2024-12-18T07:26:00Z │      0.6333333333333333 │
│ 2024-12-18T07:27:00Z │                       0 │
│ 2024-12-18T07:28:00Z │                       0 │
└──────────────────────┴─────────────────────────┘

27 rows in set. Elapsed: 0.009 sec. Processed 1.57 thousand rows, 10.60 KB (176.72 thousand rows/s., 1.19 MB/s.)
Peak memory usage: 51.12 KiB.

oximeter_cluster-2 :) SELECT toStartOfInterval(event_time, INTERVAL 60 SECOND) AS t, avg(ProfileEvent_Query)
FROM system.metric_log
WHERE event_date >= toDate(now() - 86400) AND event_time >= now() - 86400
GROUP BY t
ORDER BY t WITH FILL STEP 60
SETTINGS date_time_output_format = 'iso'

SELECT
    toStartOfInterval(event_time, toIntervalSecond(60)) AS t,
    avg(ProfileEvent_Query)
FROM system.metric_log
WHERE (event_date >= toDate(now() - 86400)) AND (event_time >= (now() - 86400))
GROUP BY t
ORDER BY t ASC WITH FILL STEP 60
SETTINGS date_time_output_format = 'iso'

Query id: 37a3b84f-8844-40b5-ac3a-df4b60ba7b1a

┌────────────────────t─┬─avg(ProfileEvent_Query)─┐
│ 2024-12-18T07:02:00Z │                       0 │
│ 2024-12-18T07:03:00Z │                       0 │
│ 2024-12-18T07:04:00Z │      0.2833333333333333 │
│ 2024-12-18T07:05:00Z │      0.5666666666666667 │
│ 2024-12-18T07:06:00Z │      0.5666666666666667 │
│ 2024-12-18T07:07:00Z │      0.2833333333333333 │
│ 2024-12-18T07:08:00Z │      0.2833333333333333 │
│ 2024-12-18T07:09:00Z │                       0 │
│ 2024-12-18T07:10:00Z │                       0 │
│ 2024-12-18T07:11:00Z │                       0 │
│ 2024-12-18T07:12:00Z │     0.03333333333333333 │
│ 2024-12-18T07:13:00Z │    0.016666666666666666 │
│ 2024-12-18T07:14:00Z │                       0 │
│ 2024-12-18T07:15:00Z │                       0 │
│ 2024-12-18T07:16:00Z │    0.016666666666666666 │
│ 2024-12-18T07:17:00Z │                       0 │
│ 2024-12-18T07:18:00Z │     0.03333333333333333 │
│ 2024-12-18T07:19:00Z │    0.016666666666666666 │
│ 2024-12-18T07:20:00Z │                       0 │
│ 2024-12-18T07:21:00Z │                       0 │
│ 2024-12-18T07:22:00Z │                       0 │
│ 2024-12-18T07:23:00Z │                       0 │
│ 2024-12-18T07:24:00Z │    0.016666666666666666 │
│ 2024-12-18T07:25:00Z │    0.016666666666666666 │
│ 2024-12-18T07:26:00Z │                       0 │
│ 2024-12-18T07:27:00Z │                       0 │
│ 2024-12-18T07:28:00Z │                       0 │
└──────────────────────┴─────────────────────────┘

27 rows in set. Elapsed: 0.021 sec. Processed 2.75 thousand rows, 16.81 KB (134.15 thousand rows/s., 818.93 KB/s.)
Peak memory usage: 64.65 KiB.
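For reference, the query used in both sessions above could be assembled from a sampling interval and a time range roughly like this. It is only a sketch: the function name and parameters are hypothetical, and how Clickana itself builds its queries is not shown in this thread.

```rust
/// Builds the per-server averaging query over `system.metric_log`.
/// Hypothetical helper: `metric` is interpolated as-is, so it should only
/// ever come from a fixed, trusted list of column names.
fn metric_log_query(metric: &str, interval_secs: u64, time_range_secs: u64) -> String {
    format!(
        "SELECT toStartOfInterval(event_time, INTERVAL {interval_secs} SECOND) AS t, avg({metric}) \
         FROM system.metric_log \
         WHERE event_date >= toDate(now() - {time_range_secs}) AND event_time >= now() - {time_range_secs} \
         GROUP BY t \
         ORDER BY t WITH FILL STEP {interval_secs} \
         SETTINGS date_time_output_format = 'iso'"
    )
}
```

Calling `metric_log_query("ProfileEvent_Query", 60, 86400)` yields the same statement as above on a single line. Because system.metric_log is local to each server, the same query returns different rows depending on which node it is sent to.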

andrewjstone

Looks good @karencfv.

Just some relatively minor suggestions from me.

andrewjstone · 2024-12-18T19:38:22Z

+                    results.len()
+                );
+            }
+            // TODO: Eventually we may want to not have a set amount of charts and make the


For future PRs: I think it would be cool to have a little menu of charts on the side of the pane, so you can scroll and select which ones to show without having to restart the app or mess with a TOML file.

You could also allow toggling between a set of predefined layouts to make it always look nice. So you could show 1, 2, 4, 6, or 8 charts or something, and allow selecting which to show in each view. You could even remember which charts to show in each layout, so you could toggle back and forth between different layouts and see all the charts, some with more detail.

karencfv · 2024-12-18T20:51:49Z

oooooohhhhh nice!! I like that idea. I'll add your comment in the TODO

andrewjstone · 2024-12-18T19:45:03Z

+                let s = self.clone();
+                let c = client.clone();
+
+                let task = tokio::spawn(async move {


While this works, it seems somewhat heavy-handed to spawn a task to get API data in parallel for each chart and then immediately join to wait for them all. Spawn is typically used for longer-running tasks that stay around.

A more common way to do this, when you want concurrency but don't need to leave the current thread, is to use FuturesUnordered. https://betterprogramming.pub/futuresunordered-an-efficient-way-to-manage-multiple-futures-in-rust-a24520abc3f6 has a pretty good overview.

Using FuturesUnordered would also remove the need to clone self and client as they can just be borrowed immutably.

karencfv · 2024-12-18T20:49:18Z

Nice! Thanks for the tip

andrewjstone · 2024-12-18T19:48:08Z

+        let log = self.new_logger()?;
+        let client = ClickhouseServerClient::new(&admin_url, log.clone());
+
+        let tick_rate = Duration::from_secs(self.refresh_interval);


This is not a "rate", but a duration. I'd suggest renaming it to tick_interval.

andrewjstone · 2024-12-18T19:58:14Z

+};
+use std::fmt::Display;
+
+const GIBIBYTE_F64: f64 = 1073741824.0;


It seems really odd to me to represent a number of bytes as a float, as byte counts are always whole numbers.

I realize that clickhouse returns floats for timeseries, but I think for types where it makes sense we should instead normalize those to integers rather than normalizing our data and computations to fit the raw data.

karencfv · 2024-12-18T20:43:24Z

In a dataset, Ratatui requires the data points to be f64, so I think we're stuck with f64, sadly.

andrewjstone

Ah, got it. That makes sense. Feel free to ignore these comments then :)
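To make the constraint concrete: a Ratatui chart dataset takes its points as a slice of (f64, f64) pairs, so byte counts end up as floats by the time they are plotted. The sketch below shows one way the conversion to GiB could look; the function name and the sample layout are hypothetical and not taken from the PR.

```rust
// 2^30 bytes, same value as the constant in the hunk above.
const GIBIBYTE_F64: f64 = 1073741824.0;

/// Converts (x, bytes) samples into (x, GiB) points suitable for plotting.
/// Hypothetical helper, for illustration only.
fn bytes_to_gib_points(samples: &[(f64, f64)]) -> Vec<(f64, f64)> {
    samples
        .iter()
        .map(|&(x, bytes)| (x, bytes / GIBIBYTE_F64))
        .collect()
}
```

Since the chart needs f64 on both axes anyway, keeping the unit constant as an f64 avoids a cast at every data point.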

andrewjstone · 2024-12-18T20:00:27Z

+        let mid_label_as_unit =
+            values.avg(lower_label_as_unit, upper_label_as_unit);
+
+        // To nicely display the mid value label for the Y axis, we do the following:


I think you can get rid of this parsing if you just convert all the values to integers at ingestion time.

The only reason I guess you wouldn't want to do this is if there are metrics where there are fractions we actually care about.

karencfv · 2024-12-18T20:46:31Z

I get that this looks super weird 😄, but I added it for the cases when there is very little variance between each point, and the bounds end up being very close to each other. The rounding made the mid point not be mid at all, and the data didn't really match up with the labels anymore
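The rounding problem described here can be illustrated with a small sketch. With bounds of 10 and 11, the midpoint is 10.5, and printing it as a whole number would repeat one of the bounds. The function below is a hypothetical illustration of keeping decimals when they are needed; it is not the logic used in the PR.

```rust
/// Returns the lower, middle and upper Y-axis labels as strings.
/// Hypothetical rule: print whole numbers only when all three values are
/// whole, otherwise keep two decimals so the middle label stays in the middle.
fn y_axis_labels(lower: f64, upper: f64) -> [String; 3] {
    let mid = (lower + upper) / 2.0;
    let all_whole = [lower, mid, upper].iter().all(|v| v.fract() == 0.0);
    let decimals: usize = if all_whole { 0 } else { 2 };
    // `{:.*}` takes the precision first, then the value to format.
    [lower, mid, upper].map(|v| format!("{:.*}", decimals, v))
}
```

With this rule, bounds of 0 and 100 are labelled as whole numbers, while bounds of 10 and 11 keep two decimals so the middle label reads 10.50.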

andrewjstone · 2024-12-18T20:04:47Z

+            .iter()
+            .map(|ts| {
+                (
+                    ts.time.trim_matches('"').parse::<f64>().unwrap_or_else(


Why do we need to parse a timestamp into an f64? Can't we use an actual time type instead?

karencfv · 2024-12-18T20:47:31Z

Same as above: the data points in the dataset need to be f64 for Ratatui to render them.
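A rough sketch of that conversion, assuming the timestamps arrive as numeric strings (for example Unix seconds) that may still be wrapped in quotes. The fallback in the original hunk is cut off, so this version takes a different approach and simply skips samples that fail to parse; the function name and input shape are hypothetical.

```rust
/// Turns (timestamp, value) pairs into the (f64, f64) points a chart needs.
/// Assumption: timestamps are numeric strings, possibly wrapped in quotes.
/// Samples whose timestamp cannot be parsed are dropped (a choice made for
/// this sketch, not necessarily what the PR does).
fn to_points(series: &[(String, f64)]) -> Vec<(f64, f64)> {
    series
        .iter()
        .filter_map(|(time, value)| {
            let secs = time.trim_matches('"').parse::<f64>().ok()?;
            Some((secs, *value))
        })
        .collect()
}
```

A proper time type could still be used up to this boundary, with the conversion to f64 happening only when the dataset is built.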

karencfv

Thanks for the review @andrewjstone !


karencfv added 14 commits December 3, 2024 17:30

poc

a986ca6

notes

0c2e4ee

clean up

db1e6c6

simplify

51baaa4

move file to devtools

707021d

Create dashboard data struct

1e07b65

use full UTC date/time as label

0e83157

clean up

98b9b32

adjust upper and lower Y axis bounds and labels

99b5f78

Some more clean up

13d4436

retrieve settings from CLI

e147c5a

Make room to add more charts

77fc165

Set up to generate charts from several metrics

bcc51cc

clean up

a44c7b6

karencfv commented Dec 5, 2024

Comment thread dev-tools/clickana/src/clickana.rs Outdated

karencfv commented Dec 5, 2024

karencfv added 14 commits December 6, 2024 12:54

Restructure and add support for other charts

e552d71

Restructure and add support for other charts

06d9e61

strat breaking up functions

af16dd9

better label calculation

7957e0c

extract calculations into functions

1d0d5fe

No need to use u64

8c1bd2a

Clean up value handling

7ada5ae

Clean up timestamp handling

1d4ec78

fmt

317b95c

restructure methods as standalone functions

54b86f5

restructure bounds and labels

7896ed8

separate chart into another file

dee3179

Add other charts

17fc96d

Add dashboard title

ae5819c

karencfv added 2 commits December 10, 2024 19:53

show time range in title bar

7822a8a

fix mid value label for y axis

ed8651f

karencfv added 5 commits December 11, 2024 12:18

clean up

1d7521c

make API calls concurrent

2c53dbf

simplify run method

e41c0be

clean up

a0cc34f

include in clickhouse and clickhouse-server zones

a176a5f

karencfv commented Dec 11, 2024

karencfv added 4 commits December 12, 2024 13:57

add some tests

7b6cd44

clean up

a525cbe

More tests

c8d22cd

fmt

21b793c

karencfv marked this pull request as ready for review December 12, 2024 03:40

karencfv requested a review from andrewjstone December 12, 2024 03:41

andrewjstone approved these changes Dec 18, 2024

karencfv commented Dec 18, 2024

address comments

9bb0de7

karencfv enabled auto-merge (squash) December 19, 2024 00:03

how the hell did this happen?

b1dfbe7

karencfv merged commit d337333 into oxidecomputer:main Dec 19, 2024

karencfv deleted the clickana branch December 19, 2024 02:06

karencfv mentioned this pull request Dec 19, 2024

[clickhouse] Long running QA tests for replicated cluster #6953

Closed

8 tasks
