-
Notifications
You must be signed in to change notification settings - Fork 8.3k
_granule meta column for sampling #79572
Copy link
Copy link
Open
Labels
Description
Company or project name
No response
Use case
I want to be able to sample granules
Describe the solution you'd like
I propose a _granule or _mark column which gives a granule number for the part. I want to use this for data sampling e.g.
WHERE _granule % 10 = 0
would sample approx 10% of the data.
I'm not clear if this should be applied before or after other filters - it shouldn't make a difference on the result set (where its possible you get no results) but might performance - i defer to experts.
The primary use case is for sampling full table scans - first iteration should defer correction/accuracy of aggs e.g. sums to the user.
Describe alternatives you've considered
Tried sampling using _part and _part_offset but this gives poor sampling - for a number of reasons:
INon tuples matches the tuple columns independently e.g following matches all parts returned in thesampleand all offsets vs part,offset pair matching. Result is over sampling.- The list gets very large and likely high mem overhead on large tables
(
SELECT
part_name,
mark_number,
rows_in_granule,
sum(rows_in_granule) OVER (PARTITION BY part_name ORDER BY mark_number ASC ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS cumulative_rows,
cumulative_rows + 1 AS row_sample
FROM mergeTreeIndex('otel', 'otel_traces')
WHERE (mark_number % 10) = 0
)
SELECT
ResourceAttributes['host.name'] AS host_name,
avg(Duration) AS duration
FROM otel.otel_traces
WHERE indexHint((_part, _part_offset) IN (
SELECT
part_name,
row_sample
FROM sample
))
GROUP BY host_name
ORDER BY host_name ASC
Additional context
No response
Reactions are currently unavailable