Conversation
jleibs
left a comment
Looks awesome. This will clean up a bunch of stuff on the query side!
Couple of small perf-related issues.
```rust
/// ```
//
// TODO(cmc): can this really fail though?
pub fn latest_component(
```
We should consolidate this with:
I'm not sure what you mean by consolidate here.
Obviously these tools give very similar results as re_query, but there are a bunch of reasons why I think they're very different and should live separately.
It's almost standup, so let's bring that up then I guess!
```rust
pub enum WriteError {
    // Batches
    #[error("Cannot insert more than 1 row at a time, got {0}")]
    BadBatchLength(usize),
```
Maybe `BadRowLength`? Batch length makes me think of, e.g., `InstanceArray.len()`.
`BadRowLength` makes me think of `InstanceArray.len()`... 😄
Let's go for something drastic: `MoreThanOneRow` 😬
```rust
trace!(
    kind = "insert",
    id = self.insert_id,
    cluster_key = %self.cluster_key,
```
I've been meaning to ask about this: what does `%` do here? I had trouble googling for the syntax in this context.
It's tracing syntax, which pretty much has grown into a de-facto standard for all things macros these days.
`%` means "use the `Display` implementation", `?` means "use the `Debug` implementation".
See https://docs.rs/tracing/latest/tracing/index.html#using-the-macros
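For readers who want the distinction without pulling in `tracing`: the `%`/`?` sigils map onto the same `Display`-vs-`Debug` split as `{}` vs `{:?}` in ordinary format strings. A minimal std-only sketch:

```rust
fn main() {
    let key = String::from("instance");

    // `%key` in a tracing macro records the field via Display,
    // like `{}` here:
    let display_form = format!("{key}");
    // `?key` records it via Debug, like `{:?}` here (note the quotes):
    let debug_form = format!("{key:?}");

    assert_eq!(display_form, "instance");
    assert_eq!(debug_form, "\"instance\"");
    println!("{display_form} | {debug_form}"); // instance | "instance"
}
```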
```rust
    .downcast_ref::<ListArray<i32>>()
    .unwrap()
    .value(row_nr);
let nb_instances = row.len();
```
Using `.value` to get the length is going to be expensive, because it will materialize a full `Box<dyn Array>`.
I believe you can pull the inner length for the row directly out of the `ListArray`'s offsets:
```rust
let (start, end) = rows_single
    .as_any()
    .downcast_ref::<ListArray<i32>>()
    .unwrap()
    .offsets()
    .start_end_unchecked(row_nr);
nb_instances = end - start;
```
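For context on why this is cheaper: a list array stores one flat child buffer plus an offsets buffer, so row `i`'s length is just the difference of two adjacent offsets and requires no allocation. A std-only sketch of that layout (hypothetical data, not the store's actual schema):

```rust
fn main() {
    // Hypothetical list array representing [[10, 20, 30], [], [40, 50]]:
    // one flat child buffer, plus offsets delimiting each row's slice.
    let _child: Vec<i32> = vec![10, 20, 30, 40, 50];
    let offsets: Vec<i32> = vec![0, 3, 3, 5];

    let row_nr = 2;
    // Row `row_nr` is child[offsets[row_nr]..offsets[row_nr + 1]];
    // its length falls out of the offsets alone, no row materialization.
    let (start, end) = (offsets[row_nr], offsets[row_nr + 1]);
    let nb_instances = (end - start) as usize;

    assert_eq!(nb_instances, 2);
    println!("{nb_instances}"); // 2
}
```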
I did something similar, though slightly different, because `offsets()` works differently in arrow 0.14 vs 0.15...
Which makes me think, I should try those benchmarks with 0.15.
Re-ran the benchmarks with your suggestions applied: it looks way more in line with what one would expect, and the flakiness seems to be gone 👍 (see screenshot in top post). I'm curious to see how this all behaves with arrow 0.15, which seems to address most of the issues we're seeing.
This PR adds support for cluster keys.
A cluster key specifies a column/component that is guaranteed to always be present for every single row of data within the store.
It is a property of the `DataStore` itself. In addition to always being present, the payload of the cluster key…
This makes the cluster key a perfect candidate for joining query results together, and doing so as efficiently as possible.
Basically, this brings our datastore much closer to our point-of-view-based query model, which very much lends itself to doing a lot of joins across time and space.
The store simply cannot end up in a state where data from different sources can't be joined one way or another, making a lot of things much simpler (especially when you get to range queries, where suddenly everything turns into streaming-join iterators... a topic for another day).
Effectively, this trades some performance on the write path for some performance on the read path, as can be seen on the benchmarks below.
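To make the joining argument concrete, here's a minimal std-only sketch of joining two component columns on a shared instance key (the names `positions`/`colors` and the tuple encoding are illustrative, not the store's actual schema):

```rust
use std::collections::HashMap;

fn main() {
    // Two component columns keyed by the same cluster key (instance ids),
    // which the store guarantees is present in every row:
    let positions: Vec<(u64, (f32, f32))> =
        vec![(0, (0.0, 0.0)), (1, (1.0, 1.0)), (2, (2.0, 2.0))];
    let colors: Vec<(u64, u32)> = vec![(0, 0xff0000), (2, 0x0000ff)];

    // Left join positions with colors on the cluster key:
    let color_by_instance: HashMap<u64, u32> = colors.into_iter().collect();
    let joined: Vec<(u64, (f32, f32), Option<u32>)> = positions
        .into_iter()
        .map(|(id, pos)| (id, pos, color_by_instance.get(&id).copied()))
        .collect();

    assert_eq!(joined.len(), 3);
    assert_eq!(joined[1].2, None); // instance 1 has no color
    println!("{joined:?}");
}
```

Since the cluster key is also guaranteed sorted, a streaming merge join would work here as well, without building the hash map.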
In this PR:
- A `polars` feature has been added, which includes a bunch of helpers to work efficiently with dataframes, because it is now extremely nice to work with those.
- In `re_query`, the code checking that instances are sorted and/or generating ones has been removed, since this is now guaranteed by the store.

Things not in this PR:
- …`MsgBundle` on behalf of the user if they aren't already (and I don't think we'd ever want to).

Fixes #559