
re_datastore: introduce cluster keys #593

Merged
teh-cmc merged 90 commits into main from cmc/datastore/clustering_keys
Dec 22, 2022
Conversation

@teh-cmc (Contributor) commented Dec 18, 2022

This PR adds support for cluster keys.

A cluster key specifies a column/component that is guaranteed to always be present in every single row of data within the store.
It is a property of the DataStore itself.

In addition to always being present, the payload of the cluster key:

  • is always sorted in increasing order,
  • is always dense (no validity bitmap),
  • and never contains duplicate entries.

This makes the cluster key a perfect candidate for joining query results together, and doing so as efficiently as possible.
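Those three invariants can be sketched as a single check (a hypothetical helper for illustration, not the store's actual API):

```rust
// Hypothetical helper (not the store's actual code) checking the cluster-key
// invariants over a plain slice of instance ids. "Strictly increasing" covers
// both "sorted" and "no duplicates"; density (no validity bitmap) is modeled
// here by using u64 rather than Option<u64>.
fn is_valid_cluster_key(col: &[u64]) -> bool {
    col.windows(2).all(|w| w[0] < w[1])
}
```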

Basically, this brings our datastore much closer to our point-of-view-based query model, which very much lends itself to doing a lot of joins across time and space.
The store simply cannot end up in a state where data from different sources can't be joined one way or another, making a lot of things much simpler (especially when you get to range queries, where suddenly everything turns into streaming-join iterators... a topic for another day).
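To see why the guarantees pay off on the read path, here is a hedged sketch (a made-up function, not re_datastore code) of joining two component columns on the cluster key in one linear merge pass, which is only valid because both key columns are sorted and duplicate-free:

```rust
// Sketch of a merge join on the cluster key. Both inputs are (key, value)
// pairs with strictly increasing keys, so a single forward pass suffices:
// no hashing, no sorting, no re-materialization.
fn merge_join<A: Copy, B: Copy>(
    left: &[(u64, A)],
    right: &[(u64, B)],
) -> Vec<(u64, A, B)> {
    let (mut i, mut j) = (0, 0);
    let mut out = Vec::new();
    while i < left.len() && j < right.len() {
        match left[i].0.cmp(&right[j].0) {
            std::cmp::Ordering::Less => i += 1,
            std::cmp::Ordering::Greater => j += 1,
            std::cmp::Ordering::Equal => {
                // Keys match: emit the joined row and advance both sides.
                out.push((left[i].0, left[i].1, right[j].1));
                i += 1;
                j += 1;
            }
        }
    }
    out
}
```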

Effectively, this trades some performance on the write path for some performance on the read path, as can be seen in the benchmarks below.


In this PR:

  • Cluster keys are inserted, auto-generated and deduplicated as necessary.
  • The previous, complicated test suite, which checked a bunch of things that are simply not permitted anymore, has been replaced by a much simpler, dataframe-based one.
  • A polars feature has been added, which includes a bunch of helpers for working efficiently with dataframes, since they are now extremely nice to work with.
  • In re_query, the code that checked instances are sorted and/or generated them has been removed, since the store now guarantees this.
  • Typed errors (re_datastore: replace anyhow::Error usage with a thiserror derived Error type #527) make their appearance on the write path.
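The "auto-generated" path can be illustrated like this (made-up names, not the store's API):

```rust
// Hypothetical illustration: when a row arrives without a cluster-key column,
// the store can synthesize one as 0..num_instances, which is trivially sorted,
// dense and duplicate-free. Every N-instance row gets the same generated key,
// so those keys can also be shared (deduplicated) across rows.
fn autogen_cluster_key(num_instances: u64) -> Vec<u64> {
    (0..num_instances).collect()
}
```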

Things not in this PR:

  • Any kind of splat support.
  • Sorting the rows of the MsgBundle on behalf of the user if they aren't already sorted (and I don't think we'd ever want to).

[screenshot: benchmark results]


Fixes #559

@teh-cmc teh-cmc changed the title re_datastore: introduce clustering keys re_datastore: introduce cluster keys Dec 20, 2022
@teh-cmc teh-cmc marked this pull request as ready for review December 20, 2022 20:49
@jleibs (Contributor) left a comment

Looks awesome. This will clean up a bunch of stuff on the query side!

Couple of small perf-related issues.

/// ```
//
// TODO(cmc): can this really fail though?
pub fn latest_component(
@jleibs (Contributor):

@teh-cmc (Contributor, Author) replied:

I'm not sure what you mean by consolidate here.

Obviously these tools give very similar results as re_query, but there are a bunch of reasons why I think they're very different and should live separately.
It's almost standup, so let's bring that up then I guess!

pub enum WriteError {
    // Batches
    #[error("Cannot insert more than 1 row at a time, got {0}")]
    BadBatchLength(usize),
Contributor:
Maybe BadRowLength? Batch length makes me think of, e.g. InstanceArray.len()

@teh-cmc (Contributor, Author) replied:

BadRowLength makes me think of InstanceArray.len().. 😄

Let's go for something drastic: MoreThanOneRow 😬

trace!(
    kind = "insert",
    id = self.insert_id,
    cluster_key = %self.cluster_key,
Contributor:

I've meant to ask about this before: what does % do here? I had trouble googling for the syntax in this context.

@teh-cmc (Contributor, Author) replied:

It's tracing syntax, which has pretty much grown into a de-facto standard for this kind of structured-logging macro these days.

  • % means "use the Display implementation"
  • ? means "use the Debug implementation"

See https://docs.rs/tracing/latest/tracing/index.html#using-the-macros
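In plain std terms, the two sigils map onto Rust's two formatting traits; a minimal illustration with a made-up type (not rerun code):

```rust
use std::fmt;

// `%value` formats a field with its Display impl, `?value` with its Debug
// impl: the same split as format!("{}", v) vs format!("{:?}", v).
#[derive(Debug)]
struct ClusterKey(&'static str);

impl fmt::Display for ClusterKey {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "{}", self.0)
    }
}
```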

.downcast_ref::<ListArray<i32>>()
.unwrap()
.value(row_nr);
let nb_instances = row.len();
@jleibs (Contributor) commented Dec 20, 2022:

Using .value to get the length is going to be expensive because it will materialize a full Box<dyn Array>.

I believe you can pull out the inner length for the row directly from the ListArray using:

let (start, end) = rows_single
    .as_any()
    .downcast_ref::<ListArray<i32>>()
    .unwrap()
    .offsets()
    .start_end_unchecked(row_nr);
let nb_instances = end - start;
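The trick behind the suggestion, sketched with a plain offsets slice instead of arrow2 types (a hypothetical helper):

```rust
// A list array stores its child values once, plus an offsets buffer; the
// length of row i is just offsets[i + 1] - offsets[i], so no row needs to
// be materialized merely to measure it.
fn list_row_len(offsets: &[i32], row_nr: usize) -> i32 {
    offsets[row_nr + 1] - offsets[row_nr]
}
```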

@teh-cmc (Contributor, Author) replied Dec 21, 2022:

I did something similar, though slightly different, because offsets() works differently in arrow 0.14 vs 0.15...

Which makes me think, I should try those benchmarks with 0.15.

@teh-cmc teh-cmc requested a review from jleibs December 21, 2022 11:29
@teh-cmc (Contributor, Author) commented Dec 21, 2022:

Re-ran the benchmarks with your suggestions applied: it looks way more in line with what one would expect, and the flakiness seems to be gone 👍 (see screenshot in top post).

I'm curious to see how this all behaves with arrow 0.15, which seems to address most of the issues we're seeing with ListArrays... but unfortunately we're blocked by arrow-convert at the moment.

@jleibs (Contributor) left a comment

Looks good!

@teh-cmc teh-cmc merged commit 42911ed into main Dec 22, 2022
@teh-cmc teh-cmc deleted the cmc/datastore/clustering_keys branch December 22, 2022 10:19
Successfully merging this pull request may close: re_datastore: introduce clustering keys/components

2 participants