Support MinMax index for iceberg #77638

@alesapin

According to the Iceberg specification, writers may store min/max values for each column of each data file in the manifest metadata: https://iceberg.apache.org/spec/#manifest-lists.

For example, Spark stores them by default, and this information can be very useful for speeding up a wide range of queries. This is what it looks like:

    "lower_bounds": {
      "array": [
        {
          "key": 1,
          "value": "\u0001\u0000\u0000\u0000\u0000\u0000\u0000\u0000"
        },
        {
          "key": 2,
          "value": "vasya"
        },
        {
          "key": 3,
          "value": "Ò\u0002\u0096I\u0000\u0000\u0000\u0000"
        },
        {
          "key": 4,
          "value": "½N\u0000\u0000"
        },
        {
          "key": 5,
          "value": "'B"
        }
      ]
    },
    "upper_bounds": {
      "array": [
        {
          "key": 1,
          "value": "\u0001\u0000\u0000\u0000\u0000\u0000\u0000\u0000"
        },
        {
          "key": 2,
          "value": "vasya"
        },
        {
          "key": 3,
          "value": "Ò\u0002\u0096I\u0000\u0000\u0000\u0000"
        },
        {
          "key": 4,
          "value": "½N\u0000\u0000"
        },
        {
          "key": 5,
          "value": "'B"
        }
      ]
    },
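
The bound values above are raw bytes rendered as JSON strings. Below is a minimal Python sketch of how they could be decoded, under these assumptions: the dump maps each byte to one character (as the Avro JSON encoding does for `bytes`), and fields 1 and 3 are `long`, field 2 is `string`, and field 4 is `int`. The column types are not stated in this issue, so `assumed_types` is a guess for illustration only; field 5 is left out because its type is unclear. The helper names are hypothetical and are not part of ClickHouse or any Iceberg library.

```python
import struct

def bytes_from_json_string(s: str) -> bytes:
    # Assumption: each character of the dumped string is one raw byte (0-255).
    return s.encode("latin-1")

def decode_bound(raw: bytes, iceberg_type: str):
    # Iceberg single-value serialization: int/long are little-endian,
    # string is UTF-8. Other types are omitted from this sketch.
    if iceberg_type == "long":
        return struct.unpack("<q", raw)[0]
    if iceberg_type == "int":
        return struct.unpack("<i", raw)[0]
    if iceberg_type == "string":
        return raw.decode("utf-8")
    raise ValueError(f"type not handled in this sketch: {iceberg_type}")

# Values copied from the "lower_bounds" map above, keyed by field id.
lower_bounds = {
    1: "\u0001\u0000\u0000\u0000\u0000\u0000\u0000\u0000",
    2: "vasya",
    3: "\u00d2\u0002\u0096I\u0000\u0000\u0000\u0000",
    4: "\u00bdN\u0000\u0000",
}

# Hypothetical field-id -> type mapping; the real one comes from the table schema.
assumed_types = {1: "long", 2: "string", 3: "long", 4: "int"}

decoded = {
    field_id: decode_bound(bytes_from_json_string(value), assumed_types[field_id])
    for field_id, value in lower_bounds.items()
}
print(decoded)
```

A real implementation would take the type of each field id from the table schema instead of hard-coding it.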

MergeTree MinMax indices can be reused to implement this feature: https://github.com/ClickHouse/ClickHouse/blob/master/src/Storages/MergeTree/MergeTreeIndexMinMax.h
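
To illustrate the idea, here is a rough Python sketch of min/max pruning over data files. It does not use or mirror the `MergeTreeIndexMinMax` API; `DataFile`, `may_match`, and `prune` are hypothetical names, and the bounds are assumed to be already decoded into comparable values.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

@dataclass
class DataFile:
    path: str
    # field id -> (lower bound, upper bound), already decoded
    bounds: Dict[int, Tuple[int, int]] = field(default_factory=dict)

def may_match(f: DataFile, field_id: int,
              lo: Optional[int], hi: Optional[int]) -> bool:
    """Return False only if the file provably has no value in [lo, hi]."""
    b = f.bounds.get(field_id)
    if b is None:
        return True  # no statistics for this column: the file must be read
    file_min, file_max = b
    if lo is not None and file_max < lo:
        return False
    if hi is not None and file_min > hi:
        return False
    return True

def prune(files: List[DataFile], field_id: int,
          lo: Optional[int] = None, hi: Optional[int] = None) -> List[DataFile]:
    return [f for f in files if may_match(f, field_id, lo, hi)]

files = [
    DataFile("a.parquet", {1: (1, 10)}),
    DataFile("b.parquet", {1: (11, 20)}),
    DataFile("c.parquet"),  # writer stored no bounds
]

# Condition on field 1: 15 <= value <= 30
selected = [f.path for f in prune(files, field_id=1, lo=15, hi=30)]
print(selected)
```

Files without bounds are kept, because missing statistics cannot prove that a file holds no matching rows.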
