Skip to content

feat: API to create last caches#25147

Merged
hiltontj merged 58 commits intomainfrom
hiltontj/lastcache-create-api
Jul 16, 2024
Merged

feat: API to create last caches#25147
hiltontj merged 58 commits intomainfrom
hiltontj/lastcache-create-api

Conversation

@hiltontj
Copy link
Copy Markdown
Contributor

@hiltontj hiltontj commented Jul 12, 2024

Closes #25096

  • Adds a new HTTP API that allows the creation of a last cache, see the issue for details
  • An E2E test was added to check success/failure behaviour of the API
  • Adds the mime crate, for parsing request MIME types, but this is only used in the code I added - we may adopt it in other APIs / parts of the HTTP server in future PRs

hiltontj added 30 commits June 27, 2024 12:01
Each last cache holds a ring buffer for each column in an index map, which
preserves the insertion order for faster record batch production.

The ring buffer uses a custom type to handle the different supported
data types that we can have in the system.
LastCacheProvider is the API used to create last caches and write
table batches to them. It uses a two-layer RwLock/HashMap: the first for
the database, and the second layer for the table within the database.

This allows for table-level locks when writing in buffered data, and only
gets a database-level lock when creating a cache (and in future, when
removing them as well).
Added basic APIs on the write buffer to access the last cache and then a
test to the last_cache module to see that it works with a simple example
Addressed three parts of PR feedback:

1. Remove double-lock on cache map
2. Re-order the get when writing to the cache to be outside the loop
3. Move the time check into the cache itself
This refactors the last cache to use a nested caching structure, where
the key columns for a given cache are used to create a hierarchy of
nested maps, terminating in the actual store for the values in the cache.

Access to the cache is done via a set of predicates which can optionally
specify the key column values at any level in the cache hierarchy to only
gather record batches from children of that node in the cache.

Some todos:
- Need to handle the TTL
- Need to move the TableProvider impl up to the LastCache type
This re-writes the datafusion TableProvider implementation on the correct
type, i.e., the LastCache, and adds conversion from the filter Expr's to
the Predicate type for the cache.
Last caches will have expired entries walked when writes come in.
Changed key columns so that they do not accept null values, i.e., rows
that are pushed that are missing key column values will be ignored.

When producing record batches for a cache, if not all key columns are
used in the predicate, then this change makes it so that the non-predicate
key columns are produced as columns in the outputted record batches.

A test with a few cases showing this was added.
Ensure key columns in the last cache that are not included in the
predicate are emitted in the RecordBatches as a column.

Cleaned up and added comments to the new test.
Added two tests, as per commit title. Also moved the eviction process
to a separate function so that it was not being done on every write to
the cache, which could be expensive, and this ensures that entries are
evicted regardless of whether writes are coming in or not.
CacheAlreadyExists errors were only being based on the database and
table names, and not including the cache names, which was not
correct.
This also adds explicit support for series key columns to distinguish
them from normal tags in terms of nullability

A test was added to check nulls work
Support the addition of new fields to the last cache, for caches that do
not have a specified set of value columns.

A test was added along with the changes.
hiltontj added 11 commits July 9, 2024 16:40
Created a new module for the DataFusion table function implementations.
The TableProvider impl for LastCache was moved there, and new code that
implements the TableFunctionImpl trait to make the last cache queryable
was also written.

The LastCacheProvider and LastCache were augmented to make this work:
- The provider stores an Arc<LastCache> instead of a LastCache
- The LastCache uses interior mutability via an RwLock, to make the above
  possible.
The server used to accept a socket address and bind it directly, returning
error if the bind fails.

This commit changes that so the ServerBuilder accepts a TcpListener. The
behaviour is essentially the same, but this allows us to bind the address
from tests when instantiating the server, so we can easily assign unused
ports.

Tests in the influxdb3_server were updated to exploit this in order to
use port 0 auto assignment and stop flaky test failures.

A new, failing, test was also added to that module for the last cache.
Committing here as the last cache is in a working state, but it is naively
implemented as it just stores all key columns again (still with the hierarchy)
@hiltontj hiltontj added the v3 label Jul 12, 2024
@hiltontj hiltontj self-assigned this Jul 12, 2024
@hiltontj hiltontj marked this pull request as ready for review July 12, 2024 20:28
@hiltontj hiltontj requested review from mgattozzi and pauldix July 12, 2024 20:28
@hiltontj
Copy link
Copy Markdown
Contributor Author

I realized that there is one bit of behaviour that does not make sense (I commented on this here):

If a cache is created with identical parameters to another, with the exception that it has a different name, then the result will be two identical caches (that have different names). This should instead be an error.

I will open up a follow-on PR to address this.

Copy link
Copy Markdown
Member

@pauldix pauldix left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, love the integration tests!

Base automatically changed from hiltontj/lastcache-datafusion to main July 16, 2024 14:10
@hiltontj hiltontj merged commit 5648859 into main Jul 16, 2024
@hiltontj hiltontj deleted the hiltontj/lastcache-create-api branch July 16, 2024 14:32
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

API for creation of last_caches

2 participants