Skip to content

Proposal: Distinct Value Cache #25538

@hiltontj

Description

@hiltontj

See related epic: #25539

Context

It is very common for influxdb users to want to quickly return the unique values for a given column or to return the unique values for a given column given the value of some other column. For example, the unique region names, or unique hostnames, or given the region ‘us-west’ the unique hostnames within that region.

These kinds of lookups are typically used in selectors at the top of dashboards and are frequently accessed. Performing these lookups in tens of milliseconds, rather than hundreds of milliseconds represents a significant improvement in user experience.

The metadata cache will provide a feature to make such queries fast.

Requirements

  1. Users should be able to specify a table and one or more columns to cache.
  2. If given more than one column, the order is important and hierarchical (like the region, host example).
  3. The API to read from the cache is accessible through the query interface, i.e., via queries through a FlightSQL connection or through the /api/v3/query_sql API, to a user-defined function, similar to the last_cache.
  4. The API to configure the metadata cache can be accessed through the REST API.
  5. There is a system table that summarizes the caches configured for a database.
  6. Eviction of entries from the cache is configurable (see below).

Querying the meta_cache

Queries to the metadata cache will be fulfilled via a user-defined function called meta_cache and invoked like so:

SELECT host FROM meta_cache('cpu')

Here, host is a column on the cpu table. This query will return the distinct/unique values that have been seen for the host column in incoming writes to the database.

The entries returned should be sorted in ascending order by default.

If a cache is configured for multiple columns, one could select from both:

SELECT region, host FROM meta_cache('cpu')

Queries to the cache support LIMIT and WHERE clauses, e.g.,

SELECT host FROM meta_cache('cpu') LIMIT 10
SELECT host FROM meta_cache('cpu') WHERE region = 'us-west'

In the latter example, the cache would need to be configured on both the region and host columns.

Each cache has a unique name, such that if there are multiple caches configured on a given table, they can be queried by their name:

SELECT host FROM meta_cache('cpu', '<cache_name>')

System Table

There will be a system table that can be queried via:

SELECT * FROM system.meta_caches WHERE table = 'cpu'

The table predicate is required. This will return results with columns for:

  • table the cache's target table name
  • name the name of the cache
  • column_ids as a list of ColumnIds, that the cache is configured on
  • column_names as a list of the column's names, that the cache is configured on
  • max_age or null if it is not configured for the cache

Configuration API

Metadata caches can be created via the following API:

POST /api/v3/configure/meta_cache

with body parameters:

  • db: database the cache is configured on
  • table: table the cache is configured on
  • name: (optional) name of the cache, must be unique within the db/table, will be generated if not provided
  • columns: list of ordered columns to construct the cache
  • max_age: (optional) values that have not been seen for this amount of time will be evicted from the cache

and deleted via:

DELETE /api/v3/configure/meta_cache

with body or URL parameters:

  • db: database the cache is configured on
  • table: table the cache is configured on
  • name: the name of the cache

Hierarchical Cache Structure

The metadata cache, like the last-n-value cache, should have a hierarchical structure. For example, for a cache configured on two columns, region and host, in that order, the values will be stored in a hierarchy:

Screenshot 2024-11-12 at 14 03 38

Eviction

Entries in the cache will be evicted after they have reached the max_age configured on the cache. If the cache has a configured max_age, an eviction process will need to be run to prune entries that are older than the max_age.

Limitations

Cache Population

The cache is populated strictly by incoming writes. So, values that were written to the database before the cache was created - or in the event of a restart, values that were written to the database before the restart - will not be cached.

There are plans to have caches be pre-populated on cache creation and server restart in the Pro version of InfluxDB 3.

Questions

  1. Can / How do we support OFFSET clauses when the contents of the cache could change between queries?
  2. Do we restrict columns cached to tag columns?
  3. Allow configuring a max_size for a cache? and if so, is it size in memory usage, or number of elements.

Metadata

Metadata

Assignees

No one assigned

    Labels

    proposalProposal for new feature; for discussion and refining requirementsv3

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions