Skip to content

feat: support addition of newly written columns to last cache#25125

Merged
hiltontj merged 36 commits intomainfrom
hiltontj/lastcache-new-vals
Jul 9, 2024
Merged

feat: support addition of newly written columns to last cache#25125
hiltontj merged 36 commits intomainfrom
hiltontj/lastcache-new-vals

Conversation

@hiltontj
Copy link
Copy Markdown
Contributor

@hiltontj hiltontj commented Jul 5, 2024

Closes #25124

Follows #25109

This will make caches that are initialized with the default value set, i.e., all non-key columns, capture newly written fields that were not present in the table when those caches were initialized.

At a high level, there are two notable changes in this PR:

  • The LastCacheStore no longer holds its own arrow schema - now, the outer LastCache's schema is relied on for producing RecordBatches
  • The insertion order guarantee of the IndexMap used in the LastCacheStore is no longer relied upon; instead, when producing RecordBatches, it is done by iterating through the fields in the outer LastCache schema, and doing a lookup in the inner LastCacheStore (fields that are not in the store are substituted with a null array).

Two tests were added to check: 1) that new fields get added, and 2) that the cache behaves when new fields are added in different orders to different key column values within the cache.

hiltontj added 30 commits June 27, 2024 12:01
Each last cache holds a ring buffer for each column in an index map, which
preserves the insertion order for faster record batch production.

The ring buffer uses a custom type to handle the different supported
data types that we can have in the system.
LastCacheProvider is the API used to create last caches and write
table batches to them. It uses a two-layer RwLock/HashMap: the first for
the database, and the second layer for the table within the database.

This allows for table-level locks when writing in buffered data, and only
gets a database-level lock when creating a cache (and in future, when
removing them as well).
Added basic APIs on the write buffer to access the last cache and then a
test to the last_cache module to see that it works with a simple example
Addressed three parts of PR feedback:

1. Remove double-lock on cache map
2. Re-order the get when writing to the cache to be outside the loop
3. Move the time check into the cache itself
This refactors the last cache to use a nested caching structure, where
the key columns for a given cache are used to create a hierarchy of
nested maps, terminating in the actual store for the values in the cache.

Access to the cache is done via a set of predicates which can optionally
specify the key column values at any level in the cache hierarchy to only
gather record batches from children of that node in the cache.

Some todos:
- Need to handle the TTL
- Need to move the TableProvider impl up to the LastCache type
This re-writes the datafusion TableProvider implementation on the correct
type, i.e., the LastCache, and adds conversion from the filter Expr's to
the Predicate type for the cache.
Last caches will have expired entries walked when writes come in.
Changed key columns so that they do not accept null values, i.e., rows
that are pushed that are missing key column values will be ignored.

When producing record batches for a cache, if not all key columns are
used in the predicate, then this change makes it so that the non-predicate
key columns are produced as columns in the outputted record batches.

A test with a few cases showing this was added.
Ensure key columns in the last cache that are not included in the
predicate are emitted in the RecordBatches as a column.

Cleaned up and added comments to the new test.
Added two tests, as per commit title. Also moved the eviction process
to a separate function so that it was not being done on every write to
the cache, which could be expensive, and this ensures that entries are
evicted regardless of whether writes are coming in or not.
CacheAlreadyExists errors were only being based on the database and
table names, and not including the cache names, which was not
correct.
This also adds explicit support for series key columns to distinguish
them from normal tags in terms of nullability

A test was added to check nulls work
Support the addition of new fields to the last cache, for caches that do
not have a specified set of value columns.

A test was added along with the changes.
@hiltontj hiltontj added the v3 label Jul 5, 2024
@hiltontj hiltontj self-assigned this Jul 5, 2024
Copy link
Copy Markdown
Contributor Author

@hiltontj hiltontj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left comments for myself to address. 🧹

@hiltontj hiltontj force-pushed the hiltontj/lastcache-new-vals branch from 2917fc1 to 387b687 Compare July 9, 2024 14:02
Enabling addition of new fields to a last cache made the insertion order
guarantee of the IndexMap break down. It could not be relied upon anymore
so this commit removes reference to that fact, despite still using the
IndexMap type, and strips out the schema from the inner LastCacheStore
type of the LastCache.

Now, the outer LastCache schema is relied on for producing RecordBatches,
which requires a lookup to the inner LastCacheStore's HashMap for each
field in the schema. This may not be as convenient as iterating over the
map as before, but trying to manage the disparate schema, and maintaining
the map ordering was making the code too complicated. This seems like a
reasonable compromise for now, until we see the need to optimize.

The IndexMap is still used for its fast iteration and lookup
characteristics.

The test that checks for new field ordering behaviour was modified to be
correct.
@hiltontj hiltontj force-pushed the hiltontj/lastcache-new-vals branch from 387b687 to 5beb47c Compare July 9, 2024 14:06
Some renaming of variables was done to clarify meaning as well.
@hiltontj hiltontj marked this pull request as ready for review July 9, 2024 14:54
@hiltontj hiltontj requested review from mgattozzi and pauldix July 9, 2024 14:54
Base automatically changed from hiltontj/lastcache-impl to main July 9, 2024 19:22
@hiltontj hiltontj merged commit 2609b59 into main Jul 9, 2024
@hiltontj hiltontj deleted the hiltontj/lastcache-new-vals branch July 9, 2024 20:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Support adding new fields to a cache

2 participants