-
Notifications
You must be signed in to change notification settings - Fork 3.7k
Description
See related epic: #25539
Context
It is very common for influxdb users to want to quickly return the unique values for a given column or to return the unique values for a given column given the value of some other column. For example, the unique region names, or unique hostnames, or given the region ‘us-west’ the unique hostnames within that region.
These kinds of lookups are typically used in selectors at the top of dashboards and are frequently accessed. Performing these lookups in tens of milliseconds, rather than hundreds of milliseconds represents a significant improvement in user experience.
The metadata cache will provide a feature to make such queries fast.
Requirements
- Users should be able to specify a table and one or more columns to cache.
- If given more than one column, the order is important and hierarchical (like the region, host example).
- The API to read from the cache is accessible through the
queryinterface, i.e., via queries through a FlightSQL connection or through the/api/v3/query_sqlAPI, to a user-defined function, similar to thelast_cache. - The API to configure the metadata cache can be accessed through the REST API.
- There is a system table that summarizes the caches configured for a database.
- Eviction of entries from the cache is configurable (see below).
Querying the meta_cache
Queries to the metadata cache will be fulfilled via a user-defined function called meta_cache and invoked like so:
SELECT host FROM meta_cache('cpu')Here, host is a column on the cpu table. This query will return the distinct/unique values that have been seen for the host column in incoming writes to the database.
The entries returned should be sorted in ascending order by default.
If a cache is configured for multiple columns, one could select from both:
SELECT region, host FROM meta_cache('cpu')Queries to the cache support LIMIT and WHERE clauses, e.g.,
SELECT host FROM meta_cache('cpu') LIMIT 10
SELECT host FROM meta_cache('cpu') WHERE region = 'us-west'In the latter example, the cache would need to be configured on both the region and host columns.
Each cache has a unique name, such that if there are multiple caches configured on a given table, they can be queried by their name:
SELECT host FROM meta_cache('cpu', '<cache_name>')System Table
There will be a system table that can be queried via:
SELECT * FROM system.meta_caches WHERE table = 'cpu'The table predicate is required. This will return results with columns for:
tablethe cache's target table namenamethe name of the cachecolumn_idsas a list ofColumnIds, that the cache is configured oncolumn_namesas a list of the column's names, that the cache is configured onmax_ageor null if it is not configured for the cache
Configuration API
Metadata caches can be created via the following API:
POST /api/v3/configure/meta_cache
with body parameters:
db: database the cache is configured ontable: table the cache is configured onname: (optional) name of the cache, must be unique within the db/table, will be generated if not providedcolumns: list of ordered columns to construct the cachemax_age: (optional) values that have not been seen for this amount of time will be evicted from the cache
and deleted via:
DELETE /api/v3/configure/meta_cache
with body or URL parameters:
db: database the cache is configured ontable: table the cache is configured onname: the name of the cache
Hierarchical Cache Structure
The metadata cache, like the last-n-value cache, should have a hierarchical structure. For example, for a cache configured on two columns, region and host, in that order, the values will be stored in a hierarchy:
Eviction
Entries in the cache will be evicted after they have reached the max_age configured on the cache. If the cache has a configured max_age, an eviction process will need to be run to prune entries that are older than the max_age.
Limitations
Cache Population
The cache is populated strictly by incoming writes. So, values that were written to the database before the cache was created - or in the event of a restart, values that were written to the database before the restart - will not be cached.
There are plans to have caches be pre-populated on cache creation and server restart in the Pro version of InfluxDB 3.
Questions
- Can / How do we support
OFFSETclauses when the contents of the cache could change between queries? - Do we restrict columns cached to tag columns?
- Allow configuring a
max_sizefor a cache? and if so, is it size in memory usage, or number of elements.
