Add new TimeSeries table engine to handle Prometheus protocols #64183
nikitamikhaylov merged 25 commits into ClickHouse:master
Conversation
src/Storages/StorageTimeSeries.cpp
Outdated
Those tables are customizable in two ways:
- The columns of an inner target table can be specified explicitly as columns of a TimeSeries table, and the table engine can be changed too:
CREATE TABLE ts (id UInt128 CODEC(ZSTD(3))) ENGINE = TimeSeries() DATA ENGINE = ReplicatedMergeTree('zkpath', 'replica') ORDER BY (id, timestamp) PARTITION BY toYYYYMM(timestamp)
- You can create the target table as you wish and then make a TimeSeries table use it:
CREATE TABLE my_data(id UInt128, timestamp DateTime64(3), value Float64) ENGINE=MergeTree ORDER BY id, timestamp;
CREATE TABLE ts ENGINE=TimeSeries() DATA=my_data;
src/Storages/StorageTimeSeries.cpp
Outdated
If it's expected that metrics will often be filtered by tags, why not store them in EAV format?
id UInt128,
metric_name LowCardinality(String),
tag_name LowCardinality(String),
tag_value String
ORDER BY (metric_name, tag_name, id)
id is a hash of the metric_name combined with a sorted list of {tag_name, tag_value} pairs
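The hashing scheme described above can be sketched in Python. This is an illustrative sketch, not the engine's actual implementation; the function name and the choice of MD5 are assumptions for the example.

```python
import hashlib

def series_id(metric_name, tags):
    # Sort the tag pairs so that map ordering never changes the resulting id.
    parts = [metric_name] + [f"{k}={v}" for k, v in sorted(tags.items())]
    digest = hashlib.md5("\x00".join(parts).encode()).digest()
    # The 16-byte digest is interpreted as a UInt128-style integer.
    return int.from_bytes(digest, "big")

a = series_id("http_requests_total", {"job": "api", "instance": "host1"})
b = series_id("http_requests_total", {"instance": "host1", "job": "api"})
assert a == b  # the same tag combination always maps to the same id
```

The key property is that the id is a pure function of the metric name and the tag set, so two writers computing it independently agree without coordination.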
metric_name LowCardinality(String),
tag_name LowCardinality(String),
tag_value String
Why not tag_value LowCardinality(String) then?
I understand where it's coming from, but two things concern me:
- Heavily unbalanced columns: id ~16 bytes, metric_name ~10-20 bytes, tags potentially hundreds of bytes
- How filtering by tags will work
Why not tag_value LowCardinality(String) then?
We assume that tag values will be much more unique, because two metrics cannot exist with the exact same combination of tags.
I thought about the structure of the tags table more. For supporting the Prometheus read protocol it's actually more convenient to have a tags table structure where all the tags related to a specific ID are described in one row. So
id UInt128,
metric_name LowCardinality(String),
tag_name LowCardinality(String),
tag_value String
is not really suitable.
The following structure
id UInt128,
metric_name String,
tags Map(String, String)
should be OK, and I've also added support for putting some tags into separate columns of the tags table. For example, the following statement
CREATE TABLE ts ENGINE=TimeSeries() SETTINGS tags_to_columns={'job': 'job', 'instance':'instance'}
will create the tags table with columns
id UInt128,
metric_name String,
job String,
instance String,
tags Map(String, String)
The types of such columns can be adjusted, for example
CREATE TABLE ts (metric_name LowCardinality(String), job LowCardinality(Nullable(String))) ENGINE=TimeSeries() SETTINGS tags_to_columns={'job': 'job', 'instance':'instance'}
will create the tags table with columns
id UInt128,
metric_name LowCardinality(String),
job LowCardinality(Nullable(String)),
instance String,
tags Map(String, String)
What if we made buckets for the map column? (i.e. have X map columns and distribute tags across them using a hash function), since a bucketed map type hasn't been implemented yet.
It's better to wait until #47045 is ready.
It looks kinda dead, but OK :)
Anyway, explicit materialization of certain columns is OK.
How will the query builder work with them?
Will it be a specific implementation for the Prometheus protocol, or something general, like "finish constraint optimization in production" #33544?
src/Storages/StorageTimeSeries.cpp
Outdated
UInt128 looks really generous for a potentially trillion-row dataset:
16 bytes for ID
8 bytes for Timestamp
8 bytes for Value
I'm not sure, maybe UInt64 for ID is enough
AFAIK, due to the nature of running a Prometheus collector, those ids will rotate pretty frequently (actually the exact set of tags rotates; the ids are a byproduct of that set of tags). It also means that we are not really interested in data for a particular ID; we are interested in data for a particular tag combination and metric name. Commonly a few "ids" will match them (an old and a new pod, for example).
So, I have a few ideas in mind: use something like snowflake for the ids.
Or something stranger: for each metric name, reserve a range of 10000 incremental ids and fill this range with data for that particular metric name and different tags, just in an attempt to get better data locality.
Or maybe UInt128 is not too big, because this table is always stored ordered by ID first - so the id column will contain sequences of the same value, and that should compress pretty well.
i have few ideas in mind: use something like snowflake for id's
What do you mean?
Or something stranger: for each metric name, reserve a range of 10000 incremental ids and fill this range with data for that particular metric name and different tags, just in an attempt to get better data locality.
Incremental IDs aren't suitable for the implementation, because if we used them then an INSERT into a TimeSeries table would require an internal SELECT to understand which ID is next. It would also cause issues with concurrency and when internal tables are replicated. So it's better to use some kind of hash for the ID.
Perhaps the ID can be split into two parts: the first 32 or 64 bits would be used for a hash of the metric name, and the remaining 32 or 64 bits for a hash of all the tags' names and values. Then the data for the same metric but different tags will be stored next to each other.
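The two-part split could look like the following Python sketch. It's purely illustrative (the engine does not work this way as written); the function names and the use of SHA-256 truncated to 64 bits are assumptions for the example.

```python
import hashlib

def hash64(data: str) -> int:
    # Any stable 64-bit hash works; SHA-256 truncated to 8 bytes is used here
    # only for illustration.
    return int.from_bytes(hashlib.sha256(data.encode()).digest()[:8], "big")

def split_id(metric_name, tags):
    metric_part = hash64(metric_name)                # high 64 bits
    tags_part = hash64(repr(sorted(tags.items())))   # low 64 bits
    return (metric_part << 64) | tags_part           # 128-bit id

x = split_id("cpu_usage", {"host": "a"})
y = split_id("cpu_usage", {"host": "b"})
assert x >> 64 == y >> 64  # same metric -> same high bits -> adjacent storage
```

Because the high bits are identical for all series of one metric, a table ordered by id keeps that metric's series physically close together.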
Or maybe UInt128 is not too big, because this table is always stored ordered by ID first - so the id column will contain sequences of the same value, and that should compress pretty well.
It's still not good, because more data needs to be compressed/decompressed and also compared during queries (there's an equality condition on that column; also assume that comparison of an 8-byte field is implemented natively).
What do you mean?
Just generate a snowflake ID for each new tag-value combination.
table would require an internal SELECT to understand which ID is next.
Well, technically it can be solved by a kind of in-memory dictionary which maps hashes of tag-value combinations to ids.
And it would also cause issues with concurrency and when internal tables are replicated.
It's not like we care about the consistency of those IDs; we only care that we can map a tag combination to a particular ID (or a list of IDs) on this node. Users will filter by tags, not by the IDs themselves.
Perhaps ID can be split into two parts: the first 32 or 64 bits would be used for a hash of the metric name, and the remaining 32 or 64 bits would be used for a hash of all the tags' names and values.
It's an ~OK approach, but it would be really great to avoid using 128-bit ids, and if we go with lower granularity, splitting 64 bits to fit the data will be more complicated due to a higher chance of collisions.
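The in-memory dictionary idea mentioned above could be sketched as follows. This is a hypothetical illustration of the proposal, not anything the PR implements (the PR ultimately uses hash-based ids); the class and method names are invented for the example.

```python
import hashlib
from itertools import count

class SeriesIdMap:
    """Assign incremental ids per node without an internal SELECT."""

    def __init__(self):
        self._ids = {}
        self._next = count(1)

    def get_id(self, metric_name, tags):
        # Key on a hash of the metric name plus the sorted tag pairs.
        key = hashlib.sha256(
            repr((metric_name, sorted(tags.items()))).encode()
        ).digest()
        # Only unseen tag combinations consume a new incremental id.
        if key not in self._ids:
            self._ids[key] = next(self._next)
        return self._ids[key]

m = SeriesIdMap()
a = m.get_id("up", {"job": "api"})
b = m.get_id("up", {"job": "api"})
c = m.get_id("up", {"job": "db"})
assert a == b and a != c
```

As the thread notes, ids assigned this way are only meaningful on the node that assigned them, which is acceptable if users always filter by tags rather than by raw ids.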
I decided to make it customizable. The TimeSeries engine now supports settings and also allows describing columns to change their default types, for example:
CREATE TABLE ts(id UInt64) ENGINE=TimeSeries() SETTINGS id_algorithm='SipHash_MetricNameLow32_And_TagsHigh32'
Maybe the distribution can be shifted a bit, like 24 bits for MetricName and 40 for Tags.
Another thought: it may make sense to have some tags before the metric name in the number generator (like ENV or tenant).
What do you think about allowing id_algorithm to be defined as a UDF name?
CREATE TABLE ts(id UInt64) ENGINE=TimeSeries() SETTINGS id_algorithm='generatorId'
CREATE FUNCTION generatorId AS (metric_name, tags) -> sipHash64(metric_name, tags);
CREATE FUNCTION generatorId AS (metric_name, tags) -> bitShiftLeft(sipHash64(tags['env']), 56) + bitShiftLeft(sipHash64(metric_name)::UInt16, 40) + bitShiftRight(sipHash64(tags), 24); -- it's not correct, but the idea is clear, I hope.
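A corrected version of the bit layout hinted at above can be sketched in Python. The exact field widths (8 bits env, 16 bits metric name, 40 bits tags) are assumptions chosen for illustration, not anything the PR specifies.

```python
import hashlib

def hbits(data: str, bits: int) -> int:
    # Stable hash truncated to the requested number of low bits.
    full = int.from_bytes(hashlib.sha256(data.encode()).digest()[:8], "big")
    return full & ((1 << bits) - 1)

def generator_id(metric_name, tags):
    env_part = hbits(tags.get("env", ""), 8)           # bits 56..63
    metric_part = hbits(metric_name, 16)               # bits 40..55
    tags_part = hbits(repr(sorted(tags.items())), 40)  # bits 0..39
    return (env_part << 56) | (metric_part << 40) | tags_part

i = generator_id("latency", {"env": "prod", "pod": "a"})
assert 0 <= i < 2**64
```

Putting the env hash in the topmost bits means an id-ordered table clusters first by environment, then by metric, matching the "some tags before the metric name" idea.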
I've implemented something like that, but using the DEFAULT expression for the id column instead of the generatorId function:
CREATE TABLE prometheus
(
id FixedString(16) DEFAULT murmurHash3_128(metric_name, all_tags),
metric_name LowCardinality(String),
all_tags Map(String, String)
) ENGINE=TimeSeries
So the DEFAULT expression for id can do bitShiftLeft or even access a dictionary.
Do you intend to expose additional interfaces for receiving/transforming data into the same engine? For example, the Influx line protocol (https://docs.influxdata.com/influxdb/cloud/reference/syntax/line-protocol/).
The Influx line protocol is compatible with current formats: here is an example #62924
I was referring to whether or not there is a plan to expose another HTTP endpoint, similar to what is being done for Prometheus remote write.
This is great news!
I think that can be implemented later as a new format.
Very nice finding. @antonio2368, any ideas?
Can we now use PromQL to query ClickHouse data? Could you please provide an example? And are there any performance concerns? @vitlibar
No, not yet. Please subscribe to #57545 to stay in the loop.
Referring to https://clickhouse.com/docs/en/engines/table-engines/special/time_series, it should be possible to read data from ClickHouse using Prometheus, which would be equivalent to using PromQL to read from ClickHouse. Is that right? @joschi
@jessechencx No.
Why not? I would like to use ClickHouse as the backend storage for Prometheus and then query the data in ClickHouse through Prometheus, using time_series as the table engine. @joschi
@jessechencx, reading from ClickHouse using Prometheus is done via the Prometheus remote read API, not the PromQL query API. You can use it to ship data between Prometheus and ClickHouse, but not to query that data with PromQL.
Changelog category:
Changelog entry:
Add new TimeSeries table engine. This table can then be used to support Prometheus protocols (remote write, remote read, query) after enabling them in the server configuration.
Documentation entry for user-facing changes
CI Settings (Only check the boxes if you know what you are doing):