Company or project name
No response
Use case
I want to be able to sample granules
Describe the solution you'd like
I propose a _granule or _mark column which gives a granule number for the part. I want to use this for data sampling e.g.
WHERE _granule % 10 = 0
would sample approx 10% of the data.
I'm not clear if this should be applied before or after other filters - it shouldn't make a difference on the result set (where its possible you get no results) but might performance - i defer to experts.
The primary use case is for sampling full table scans - first iteration should defer correction/accuracy of aggs e.g. sums to the user.
Describe alternatives you've considered
Tried sampling using _part and _part_offset but this gives poor sampling - for a number of reasons:
IN on tuples matches the tuple columns independently e.g following matches all parts returned in the sample and all offsets vs part,offset pair matching. Result is over sampling.
- The list gets very large and likely high mem overhead on large tables
(
SELECT
part_name,
mark_number,
rows_in_granule,
sum(rows_in_granule) OVER (PARTITION BY part_name ORDER BY mark_number ASC ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS cumulative_rows,
cumulative_rows + 1 AS row_sample
FROM mergeTreeIndex('otel', 'otel_traces')
WHERE (mark_number % 10) = 0
)
SELECT
ResourceAttributes['host.name'] AS host_name,
avg(Duration) AS duration
FROM otel.otel_traces
WHERE indexHint((_part, _part_offset) IN (
SELECT
part_name,
row_sample
FROM sample
))
GROUP BY host_name
ORDER BY host_name ASC
Additional context
No response
Company or project name
No response
Use case
I want to be able to sample granules
Describe the solution you'd like
I propose a _granule or _mark column which gives a granule number for the part. I want to use this for data sampling e.g.
WHERE _granule % 10 = 0would sample approx 10% of the data.
I'm not clear if this should be applied before or after other filters - it shouldn't make a difference on the result set (where its possible you get no results) but might performance - i defer to experts.
The primary use case is for sampling full table scans - first iteration should defer correction/accuracy of aggs e.g.
sumsto the user.Describe alternatives you've considered
Tried sampling using
_partand_part_offsetbut this gives poor sampling - for a number of reasons:INon tuples matches the tuple columns independently e.g following matches all parts returned in thesampleand all offsets vs part,offset pair matching. Result is over sampling.Additional context
No response