feat: support addition of newly written columns to last cache#25125
Merged
feat: support addition of newly written columns to last cache#25125
Conversation
Each last cache holds a ring buffer for each column in an index map, which preserves the insertion order for faster record batch production. The ring buffer uses a custom type to handle the different supported data types that we can have in the system.
LastCacheProvider is the API used to create last caches and write table batches to them. It uses a two-layer RwLock/HashMap: the first for the database, and the second layer for the table within the database. This allows for table-level locks when writing in buffered data, and only gets a database-level lock when creating a cache (and in future, when removing them as well).
Added basic APIs on the write buffer to access the last cache and then a test to the last_cache module to see that it works with a simple example
Addressed three parts of PR feedback: 1. Remove double-lock on cache map 2. Re-order the get when writing to the cache to be outside the loop 3. Move the time check into the cache itself
This refactors the last cache to use a nested caching structure, where the key columns for a given cache are used to create a hierarchy of nested maps, terminating in the actual store for the values in the cache. Access to the cache is done via a set of predicates which can optionally specify the key column values at any level in the cache hierarchy to only gather record batches from children of that node in the cache. Some todos: - Need to handle the TTL - Need to move the TableProvider impl up to the LastCache type
This re-writes the datafusion TableProvider implementation on the correct type, i.e., the LastCache, and adds conversion from the filter Expr's to the Predicate type for the cache.
Last caches will have expired entries walked when writes come in.
Changed key columns so that they do not accept null values, i.e., rows that are pushed that are missing key column values will be ignored. When producing record batches for a cache, if not all key columns are used in the predicate, then this change makes it so that the non-predicate key columns are produced as columns in the outputted record batches. A test with a few cases showing this was added.
Ensure key columns in the last cache that are not included in the predicate are emitted in the RecordBatches as a column. Cleaned up and added comments to the new test.
Added two tests, as per commit title. Also moved the eviction process to a separate function so that it was not being done on every write to the cache, which could be expensive, and this ensures that entries are evicted regardless of whether writes are coming in or not.
CacheAlreadyExists errors were only being based on the database and table names, and not including the cache names, which was not correct.
This also adds explicit support for series key columns to distinguish them from normal tags in terms of nullability A test was added to check nulls work
Support the addition of new fields to the last cache, for caches that do not have a specified set of value columns. A test was added along with the changes.
hiltontj
commented
Jul 8, 2024
Contributor
Author
hiltontj
left a comment
There was a problem hiding this comment.
Left comments for myself to address. 🧹
2917fc1 to
387b687
Compare
Enabling addition of new fields to a last cache made the insertion order guarantee of the IndexMap break down. It could not be relied upon anymore so this commit removes reference to that fact, despite still using the IndexMap type, and strips out the schema from the inner LastCacheStore type of the LastCache. Now, the outer LastCache schema is relied on for producing RecordBatches, which requires a lookup to the inner LastCacheStore's HashMap for each field in the schema. This may not be as convenient as iterating over the map as before, but trying to manage the disparate schema, and maintaining the map ordering was making the code too complicated. This seems like a reasonable compromise for now, until we see the need to optimize. The IndexMap is still used for its fast iteration and lookup characteristics. The test that checks for new field ordering behaviour was modified to be correct.
387b687 to
5beb47c
Compare
Some renaming of variables was done to clarify meaning as well.
pauldix
approved these changes
Jul 9, 2024
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Closes #25124
Follows #25109
This will make caches that are initialized with the default value set, i.e., all non-key columns, capture newly written fields that were not present in the table when those caches were initialized.
At a high level, there are two notable changes in this PR:
LastCacheStoreno longer holds its own arrow schema - now, the outerLastCache's schema is relied on for producingRecordBatchesIndexMapused in theLastCacheStoreis no longer relied upon; instead, when producingRecordBatches, it is done by iterating through the fields in the outerLastCacheschema, and doing a lookup in the innerLastCacheStore(fields that are not in the store are substituted with a null array).Two tests were added to check: 1) that new fields get added, and 2) that the cache behaves when new fields are added in different orders to different key column values within the cache.