
re_datastore: introduce cluster keys #593

Merged
teh-cmc merged 90 commits into main from cmc/datastore/clustering_keys
Dec 22, 2022
Conversation

@teh-cmc (Contributor) commented Dec 18, 2022

This PR adds support for cluster keys.

A cluster key specifies a column/component that is guaranteed to always be present in every single row of data within the store.
It is a property of the DataStore itself.

In addition to always being present, the payload of the cluster key:

  • is always sorted in increasing order,
  • is always dense (no validity bitmap),
  • and never contains duplicate entries.

This makes the cluster key a perfect candidate for joining query results together, and doing so as efficiently as possible.
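Those three invariants can be sketched as a single check (a hypothetical helper for illustration, not the store's actual API):

```rust
// Hypothetical helper (not the store's actual code) checking the cluster-key
// invariants over a plain slice of instance ids. "Strictly increasing" covers
// both "sorted" and "no duplicates"; density (no validity bitmap) is modeled
// here by using u64 rather than Option<u64>.
fn is_valid_cluster_key(col: &[u64]) -> bool {
    col.windows(2).all(|w| w[0] < w[1])
}
```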

Basically, this brings our datastore much closer to our point-of-view-based query model, which very much lends itself to doing a lot of joins across time and space.
The store simply cannot end up in a state where data from different sources can't be joined one way or another, making a lot of things much simpler (especially when you get to range queries, where suddenly everything turns into streaming-join iterators... a topic for another day).
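To see why the guarantees pay off on the read path, here is a hedged sketch (a made-up function, not re_datastore code) of joining two component columns on the cluster key in one linear merge pass, which is only valid because both key columns are sorted and duplicate-free:

```rust
// Sketch of a merge join on the cluster key. Both inputs are (key, value)
// pairs with strictly increasing keys, so a single forward pass suffices:
// no hashing, no sorting, no re-materialization.
fn merge_join<A: Copy, B: Copy>(
    left: &[(u64, A)],
    right: &[(u64, B)],
) -> Vec<(u64, A, B)> {
    let (mut i, mut j) = (0, 0);
    let mut out = Vec::new();
    while i < left.len() && j < right.len() {
        match left[i].0.cmp(&right[j].0) {
            std::cmp::Ordering::Less => i += 1,
            std::cmp::Ordering::Greater => j += 1,
            std::cmp::Ordering::Equal => {
                // Keys match: emit the joined row and advance both sides.
                out.push((left[i].0, left[i].1, right[j].1));
                i += 1;
                j += 1;
            }
        }
    }
    out
}
```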

Effectively, this trades some performance on the write path for some performance on the read path, as can be seen in the benchmarks below.


In this PR:

  • Cluster keys are inserted, auto-generated and deduplicated as necessary.
  • The previous, complicated test suite, which checked a bunch of things that are simply not permitted anymore, has been replaced by a much simpler, dataframe-based one.
  • A polars feature has been added, which includes a bunch of helpers for working efficiently with dataframes, since they are now extremely nice to work with.
  • In re_query, the code that checked instances are sorted and/or generated them has been removed, since the store now guarantees this.
  • Typed errors (re_datastore: replace anyhow::Error usage with a thiserror derived Error type #527) make their appearance on the write path.
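The "auto-generated" path can be illustrated like this (made-up names, not the store's API):

```rust
// Hypothetical illustration: when a row arrives without a cluster-key column,
// the store can synthesize one as 0..num_instances, which is trivially sorted,
// dense and duplicate-free. Every N-instance row gets the same generated key,
// so those keys can also be shared (deduplicated) across rows.
fn autogen_cluster_key(num_instances: u64) -> Vec<u64> {
    (0..num_instances).collect()
}
```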

Things not in this PR:

  • Any kind of splat support.
  • Sorting the rows of the MsgBundle on behalf of the user if they aren't already sorted (and I don't think we'd ever want to).

[screenshot: benchmark results]


Fixes #559

@teh-cmc teh-cmc changed the title re_datastore: introduce clustering keys re_datastore: introduce cluster keys Dec 20, 2022
@teh-cmc teh-cmc marked this pull request as ready for review December 20, 2022 20:49
@jleibs (Contributor) left a comment

Looks awesome. This will clean up a bunch of stuff on the query side!

Couple of small perf-related issues.

/// ```
//
// TODO(cmc): can this really fail though?
pub fn latest_component(
@jleibs (Contributor):

@teh-cmc (Contributor, Author) replied:

I'm not sure what you mean by consolidate here.

Obviously these tools give very similar results as re_query, but there are a bunch of reasons why I think they're very different and should live separately.
It's almost standup, so let's bring that up then I guess!

pub enum WriteError {
    // Batches
    #[error("Cannot insert more than 1 row at a time, got {0}")]
    BadBatchLength(usize),
Contributor:
Maybe BadRowLength? Batch length makes me think of, e.g. InstanceArray.len()

@teh-cmc (Contributor, Author) replied:

BadRowLength makes me think of InstanceArray.len().. 😄

Let's go for something drastic: MoreThanOneRow 😬

trace!(
    kind = "insert",
    id = self.insert_id,
    cluster_key = %self.cluster_key,
Contributor:

I've meant to ask about this before: what does % do here? I had trouble googling for the syntax in this context.

@teh-cmc (Contributor, Author) replied:

It's tracing syntax, which has pretty much grown into a de-facto standard for this kind of structured-logging macro these days.

  • % means "use the Display implementation"
  • ? means "use the Debug implementation"

See https://docs.rs/tracing/latest/tracing/index.html#using-the-macros
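In plain std terms, the two sigils map onto Rust's two formatting traits; a minimal illustration with a made-up type (not rerun code):

```rust
use std::fmt;

// `%value` formats a field with its Display impl, `?value` with its Debug
// impl: the same split as format!("{}", v) vs format!("{:?}", v).
#[derive(Debug)]
struct ClusterKey(&'static str);

impl fmt::Display for ClusterKey {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "{}", self.0)
    }
}
```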

.downcast_ref::<ListArray<i32>>()
.unwrap()
.value(row_nr);
let nb_instances = row.len();
@jleibs (Contributor) commented Dec 20, 2022:

Using .value to get the length is going to be expensive because it will materialize a full Box<dyn Array>.

I believe you can pull out the inner length for the row directly from the ListArray using:

let (start, end) = rows_single
    .as_any()
    .downcast_ref::<ListArray<i32>>()
    .unwrap()
    .offsets()
    .start_end_unchecked(row_nr);
let nb_instances = end - start;
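The trick behind the suggestion, sketched with a plain offsets slice instead of arrow2 types (a hypothetical helper):

```rust
// A list array stores its child values once, plus an offsets buffer; the
// length of row i is just offsets[i + 1] - offsets[i], so no row needs to
// be materialized merely to measure it.
fn list_row_len(offsets: &[i32], row_nr: usize) -> i32 {
    offsets[row_nr + 1] - offsets[row_nr]
}
```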

@teh-cmc (Contributor, Author) replied Dec 21, 2022:

I did something similar, though slightly different, because offsets() works differently in arrow 0.14 vs 0.15...

Which makes me think, I should try those benchmarks with 0.15.

@teh-cmc teh-cmc requested a review from jleibs December 21, 2022 11:29
@teh-cmc (Contributor, Author) commented Dec 21, 2022:

Re-ran the benchmarks with your suggestions applied: it looks way more in line with what one would expect, and the flakiness seems to be gone 👍 (see screenshot in top post).

I'm curious to see how this all behaves with arrow 0.15, which seems to address most of the issues we're seeing with ListArrays... but unfortunately we're blocked by arrow-convert at the moment.

@jleibs (Contributor) left a comment

Looks good!

@teh-cmc teh-cmc merged commit 42911ed into main Dec 22, 2022
@teh-cmc teh-cmc deleted the cmc/datastore/clustering_keys branch December 22, 2022 10:19
Successfully merging this pull request may close: re_datastore: introduce clustering keys/components

2 participants