Proposal: Distinct Value Cache

_See related epic: https://github.com/influxdata/influxdb/issues/25539_

## Context

It is very common for `influxdb` users to want to quickly return the unique values for a given column or to return the unique values for a given column given the value of some other column. For example, the unique region names, or unique hostnames, or given the region ‘us-west’ the unique hostnames within that region.

These kinds of lookups are typically used in selectors at the top of dashboards and are frequently accessed. Performing these lookups in tens of milliseconds, rather than hundreds of milliseconds represents a significant improvement in user experience.

The metadata cache will provide a feature to make such queries fast.

## Requirements

1. Users should be able to specify a table and one or more columns to cache.
2. If given more than one column, the order is important and hierarchical (like the region, host example).
3. The API to read from the cache is accessible through the `query` interface, i.e., via queries through a FlightSQL connection or through the `/api/v3/query_sql` API, to a user-defined function, similar to the `last_cache`.
4. The API to configure the metadata cache can be accessed through the REST API.
5. There is a system table that summarizes the caches configured for a database.
6. Eviction of entries from the cache is configurable (see below).

## Querying the `meta_cache`

Queries to the metadata cache will be fulfilled via a user-defined function called `meta_cache` and invoked like so:

```sql
SELECT host FROM meta_cache('cpu')
```

Here, `host` is a column on the `cpu` table. This query will return the distinct/unique values that have been seen for the `host` column in incoming writes to the database.

The entries returned should be sorted in ascending order by default.

If a cache is configured for multiple columns, one could select from both:

```sql
SELECT region, host FROM meta_cache('cpu')
```

Queries to the cache support `LIMIT` and `WHERE` clauses, e.g.,

```sql
SELECT host FROM meta_cache('cpu') LIMIT 10
SELECT host FROM meta_cache('cpu') WHERE region = 'us-west'
```
In the latter example, the cache would need to be configured on both the `region` and `host` columns.

Each cache has a unique name, such that if there are multiple caches configured on a given table, they can be queried by their name:

```sql
SELECT host FROM meta_cache('cpu', '<cache_name>')
```

## System Table

There will be a system table that can be queried via:

```sql
SELECT * FROM system.meta_caches WHERE table = 'cpu'
```

The `table` predicate is required. This will return results with columns for:
* `table` the cache's target table name
* `name` the name of the cache
* `column_ids` as a list of `ColumnId`s, that the cache is configured on
* `column_names` as a list of the column's names, that the cache is configured on
* `max_age` or null if it is not configured for the cache

## Configuration API

Metadata caches can be created via the following API:
```
POST /api/v3/configure/meta_cache
```
with body parameters:
* `db`: database the cache is configured on
* `table`: table the cache is configured on
* `name`: (optional) name of the cache, must be unique within the db/table, will be generated if not provided
* `columns`: list of ordered columns to construct the cache
* `max_age`: (optional) values that have not been seen for this amount of time will be evicted from the cache

and deleted via:
```
DELETE /api/v3/configure/meta_cache
```
with body _or URL_ parameters:
* `db`: database the cache is configured on
* `table`: table the cache is configured on
* `name`: the name of the cache

## Hierarchical Cache Structure

The metadata cache, like the last-n-value cache, should have a hierarchical structure. For example, for a cache configured on two columns, `region` and `host`, _in that order_, the values will be stored in a hierarchy:
<p align="center"><img width="512" alt="Screenshot 2024-11-12 at 14 03 38" src="https://github.com/user-attachments/assets/e4c640d2-2bf0-4ee7-988c-c1d4d3314279"></p>

## Eviction

Entries in the cache will be evicted after they have reached the `max_age` configured on the cache. If the cache has a configured `max_age`, an eviction process will need to be run to prune entries that are older than the `max_age`.

## Limitations

### Cache Population

The cache is populated strictly by incoming writes. So, values that were written to the database before the cache was created - or in the event of a restart, values that were written to the database before the restart - will not be cached.

There are plans to have caches be pre-populated on cache creation and server restart in the Pro version of InfluxDB 3.

## Questions

1. Can / How do we support `OFFSET` clauses when the contents of the cache could change between queries?
2. Do we restrict columns cached to tag columns?
3. Allow configuring a `max_size` for a cache? and if so, is it size in memory usage, or number of elements.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Proposal: Distinct Value Cache #25538

Context

Requirements

Querying the `meta_cache`

System Table

Configuration API

Hierarchical Cache Structure

Eviction

Limitations

Cache Population

Questions

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Proposal: Distinct Value Cache #25538

Description

Context

Requirements

Querying the meta_cache

System Table

Configuration API

Hierarchical Cache Structure

Eviction

Limitations

Cache Population

Questions

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions

Querying the `meta_cache`