_granule meta column for sampling

### Company or project name

_No response_

### Use case

I want to be able to sample data at granule granularity.

### Describe the solution you'd like

I propose a `_granule` or `_mark` column that gives the granule number of each row within its part. I want to use this for data sampling, e.g.

`WHERE _granule % 10 = 0`

would sample approximately 10% of the data.
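To sanity-check how close "every 10th granule" comes to 10% on a real table, the fractions can be estimated today from `mergeTreeIndex`. This is only a sketch: it uses the `otel.otel_traces` table from the query further down in this issue, and it assumes the proposed `_granule` would be numbered per part in the same way as `mark_number`.

```sql
-- Fraction of granules and of rows that a `% 10 = 0` filter would keep.
-- Assumption: the proposed `_granule` is numbered like `mark_number` (per part, starting at 0).
SELECT
    countIf(mark_number % 10 = 0) / count() AS sampled_granule_fraction,
    sumIf(rows_in_granule, mark_number % 10 = 0) / sum(rows_in_granule) AS sampled_row_fraction
FROM mergeTreeIndex('otel', 'otel_traces')
```

Because mark 0 of every part passes the filter, tables with many small parts would likely land somewhat above 10%.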

I'm not sure whether this should be applied before or after other filters. It shouldn't make a difference to the result set (where it's possible you get no results), but it might affect performance. I defer to the experts.

The primary use case is sampling full table scans. The first iteration should leave correcting the accuracy of aggregates, e.g. `sum`, to the user.
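As an illustration of what leaving the correction to the user could look like, here is a hypothetical query. It cannot run today because `_granule` is the column proposed in this issue, and the factor of 10 assumes the sample really is about one tenth of the rows.

```sql
-- Hypothetical: `_granule` does not exist in ClickHouse; it is the column proposed here.
SELECT
    count() * 10 AS estimated_rows,                  -- scaled by the inverse sampling rate
    sum(Duration) * 10 AS estimated_total_duration,  -- scaled by the inverse sampling rate
    avg(Duration) AS estimated_avg_duration          -- averages need no scaling
FROM otel.otel_traces
WHERE _granule % 10 = 0
```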


### Describe alternatives you've considered

I tried sampling using `_part` and `_part_offset`, but this gives poor sampling for a number of reasons:

1. `IN` on tuples matches the tuple columns independently, e.g. the following query matches all parts returned in `sample` and all offsets, rather than matching (part, offset) pairs. The result is oversampling.
2. The list gets very large, which likely means high memory overhead on large tables.


```sql
WITH sample AS
    (
        SELECT
            part_name,
            mark_number,
            rows_in_granule,
            sum(rows_in_granule) OVER (PARTITION BY part_name ORDER BY mark_number ASC ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS cumulative_rows,
            cumulative_rows + 1 AS row_sample
        FROM mergeTreeIndex('otel', 'otel_traces')
        WHERE (mark_number % 10) = 0
    )
SELECT
    ResourceAttributes['host.name'] AS host_name,
    avg(Duration) AS duration
FROM otel.otel_traces
WHERE indexHint((_part, _part_offset) IN (
    SELECT
        part_name,
        row_sample
    FROM sample
))
GROUP BY host_name
ORDER BY host_name ASC
```

### Additional context

_No response_

_granule meta column for sampling #79572
