Skip to content

re_datastore: range queries#609

Merged
teh-cmc merged 139 commits intomainfrom
cmc/datastore/range_queries2
Jan 2, 2023
Merged

re_datastore: range queries#609
teh-cmc merged 139 commits intomainfrom
cmc/datastore/range_queries2

Conversation

@teh-cmc
Copy link
Copy Markdown
Contributor

@teh-cmc teh-cmc commented Dec 20, 2022

This PR implements range queries.

It focuses solely on the datastore part of the matter: work still needs to be done on the re_query side.
I.e, similar to how re_query implements an efficient point-in-time multi-PoV join with semantics close to those of the very slow polars_util::latest_components, we now need for re_query to provide an efficient time-ranged multi-PoV streaming-join with semantics akin to those of the (yet again) very slow polars_util::range_components.
See #641.


The design space has turned out to be painfully vast, the tradeoffs numerous, and the semantics all too subtle if not straight vicious at times.

Here's the kind of questions that pop up when talking about range queries:

  • How close should the semantics of DataStore::latest_at and DataStore::range be?
  • Should DataStore::range handle returning the latest-at state, or should that be the job of a higher-level abstraction?
  • What semantics when reconcialiting the latest-state with the state of the streaming-join?
  • Should DataStore::range directly handle the multi-PoV join inline, or should that be the job of a higher-level abstraction?
  • What are the semantics of the streaming join? How fine-grained is it: at the timepoint level, or even finer than that?
  • Are there tiebreakers? If so, what are they? Is tiebreaking handled by DataStore::range, or something else higher in the stack?
  • When do we yield empty frames vs. don't yield anything? Do we ever return errors?
  • It goes on and on...

At this point it feels like I've iterated through pretty much the entire matrix of possibilities, and I've always been unhappy with the result... until now!

With this PR, latest-at queries and range queries end up with very close semantics, barring for two major differences:

  • While latest-at queries walk backwards in time, range queries iterate forwards through rows.
    This might sounds obvious, but has very subtle repercussions on how and when multiple components from multiple point-of-views are joined with each other.
  • While latest-at queries only yield the latest non-null row index for a given component/timepoint pair, range queries will happily yield many row indices for the same component/timepoint pair.
    That's one of the major feature of range queries, allowing for things such as text logs.

I've tried to document all of these decisions as best as I could, but most of the time I've been completely unable to properly convey these things in words.
The test suite really is the ground truth here and is the best place to find answers.


This PR:

  • Implements support for range queries: both low-level DataStore APIs and high-level polars utilities.
  • Makes a very clear distinction between latest-at and range queries by introducing two new types (LastestAtQuery & RangeQuery) and two new methods (DataStore::latest_at, DataStore::range).
  • Fixes a lot of places where TimeInts were unjustifiably erased into i64s.
  • Makes TimeInt impl nohash_hasher::IsEnabled!

Fixes #450


datastore/insert/batch/rects/insert
                        time:   [110.40 µs 110.52 µs 110.68 µs]
                        thrpt:  [90.350 Melem/s 90.485 Melem/s 90.580 Melem/s]
                 change:
                        time:   [-36.045% -35.912% -35.791%] (p = 0.00 < 0.05)
                        thrpt:  [+55.740% +56.035% +56.359%]
                        Performance has improved.

datastore/latest_at/batch/rects/query
                        time:   [392.17 ns 393.18 ns 394.54 ns]
                        thrpt:  [253.46 Melem/s 254.34 Melem/s 254.99 Melem/s]
                 change:
                        time:   [-2.8212% -2.3384% -1.9017%] (p = 0.00 < 0.05)
                        thrpt:  [+1.9385% +2.3943% +2.9031%]
                        Performance has improved.

datastore/latest_at/missing_components/primary
                        time:   [139.47 ns 140.49 ns 142.36 ns]
                        thrpt:  [702.45 Melem/s 711.81 Melem/s 717.01 Melem/s]
                 change:
                        time:   [-5.0274% -3.9278% -2.9135%] (p = 0.00 < 0.05)
                        thrpt:  [+3.0009% +4.0884% +5.2936%]
                        Performance has improved.

datastore/latest_at/missing_components/secondaries
                        time:   [170.24 ns 170.33 ns 170.43 ns]
                        thrpt:  [586.77 Melem/s 587.10 Melem/s 587.40 Melem/s]
                 change:
                        time:   [-0.4024% -0.0165% +0.4071%] (p = 0.93 > 0.05)
                        thrpt:  [-0.4055% +0.0165% +0.4041%]
                        No change in performance detected.

datastore/range/batch/rects/query
                        time:   [28.068 µs 28.146 µs 28.236 µs]
                        thrpt:  [354.16 Melem/s 355.29 Melem/s 356.27 Melem/s]

@teh-cmc teh-cmc linked an issue Dec 26, 2022 that may be closed by this pull request
///
/// ⚠ The semantics are subtle! Study carefully the example below.
///
/// Usage:
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note: this has been replaced by a standalone example in #645

///
/// ⚠ The semantics are subtle! Study carefully the example below.
///
/// Usage:
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note: this has been replaced by a standalone example in #645

@teh-cmc
Copy link
Copy Markdown
Contributor Author

teh-cmc commented Dec 30, 2022

I think it'll be much easier if I answer all of your comments regarding primary keys and when-to-yield-data here rather than all over the place @jleibs, so here we go.

So, just to recap, there are two APIs at play here: the "low-level" one (DataStore::range) and the "high-level" one (polars_util::range_component{s}), where the high-level one exists solely for exploration/debugging and to demonstrate what we expect the real high-level API (re_query) will look like (and indeed, that's how it's implemented in #653).


So, regarding your 3 comments about the low-level API / DataStore::range:

Continuing to wrap my head around the PR, but I keep getting hung up on this aspect of the behavior.

In part there's some additional confusion here since we've overloaded the usage of primary

* In the `re_query` code we use primary to refer to the component that we will use for the left-side of the join

* In the `re_arrow_store` code we use primary to refer to the component which determines row-selections.

The re_query code does a sequence of queries, one-per-component, using each component as the "re-arrow-primary" and then joins them together using the "re-query-primary".

The reason we needed to do this was to get a proper view of the Instance Keys at the time of the component-update in order to do our joins correctly. However, in the case of a range-query, it doesn't seem like we even need a concept of the arrow-store-primary. Since we get every row, instance-keys will always be present within that row. Re-query can still implement its join logic using gits own concept of primary, but having the arrow-store do additional filtering here just seems to unnecessarily constraint what we can do on the query side.

I'm increasingly convinced we don't care about a primary here. I think we just want every row that contains an update to any of the components that we're interested in. This is necessary so these values can be joined into some version of the primary, but how we do that is probably going to be up to some kind of join-policy in re_query.

The one exception to this is the logic for tie-break-on-primary which does have some nice properties. However, that could still be achieved with a similar policy like tie-break-on-component-order.

I agree with pretty much all of the above.

Broadly, there are 2 main questions here from what I can tell:

  • Why are we yielding data only if/when the primary component changes?
  • Why are there point-of-views involved at all for range queries?

The answer is the same for both: I just wanted to make sure that the first datastore API proposed by this PR was semantically as close as possible to both the legacy range API (where "yield-only-for-primary" comes from) and the already existing DataStore::latest_at (where point-of-views come from), before if/when we decide to diverge away from those (looks like we just did!).

I'm definitely on board to get rid of both of those in the low-level API: it should be pretty trivial to do now that everything is in place, and should even make the implementation of the high-level layer a tad simpler.


Speaking of the high-level API, you say..:

This behavior is unexpected / confusing to me.

This I disagree with OTOH; though I'm not entirely sure what feels unexpected and/or confusing to you in this case..?

In the context of the high-level API, these are the semantics I would expect (and they happen to match the semantics of the legacy range API, afaict).

Consider a store in that state:

┌──────────┬─────────────┬──────────────────┬─────────────────────────────────────┬──────────────────────┐
│ frame_nr ┆ entity      ┆ rerun.instance   ┆ rerun.text_entry                    ┆ rerun.colorrgba      │
╞══════════╪═════════════╪══════════════════╪═════════════════════════════════════╪══════════════════════╡
│ 4        ┆ this/that   ┆ [0]              ┆ ["hello"]                           ┆ [RED]                │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 4        ┆ this/that   ┆ [0]              ┆ null                                ┆ [GREEN]              │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 4        ┆ this/that   ┆ [0]              ┆ null                                ┆ [BLUE]               │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 4        ┆ this/that   ┆ [0]              ┆ ["hello"]                           ┆ [YELLOW]             │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 4        ┆ this/that   ┆ [0]              ┆ null                                ┆ [PURPLE]             │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 4        ┆ this/that   ┆ [0]              ┆ ["world"]                           ┆ null                 │
└──────────┴─────────────┴──────────────────┴─────────────────────────────────────┴──────────────────────┘

When e.g. building a scene for a text view, I expect this to yield [("hello", RED), ("hello", YELLOW), ("world", PURPLE)], not [("hello", RED), ("hello", GREEN), ("hello", BLUE), ("hello", YELLOW), ("hello", PURPLE), ("world", PURPLE)].

So, having a notion of a primary component and a yield-only-for-primary behaviour in the high-level layer seems like the right call to me? (Though, again, the implementation could be made simpler if we dropped all the rules on the low-level layer!)

@teh-cmc teh-cmc force-pushed the cmc/datastore/range_queries2 branch from db9950f to c4fbda7 Compare December 30, 2022 16:40
@teh-cmc
Copy link
Copy Markdown
Contributor Author

teh-cmc commented Dec 30, 2022

Alright, to help with the discussion going forward, I've updated the code to reflect what a PoV-less/primary-less/always-yield lower-level API would look like (also updated the example so that rows actually share instance keys once in a while... 😅).

I.e. DataStore::range does not take a primary component as an argument anymore, and will happily yield any row that contains data for at least one of the components we're interested in (not much point in yielding an array of None values).

The semantics of the high-level API, on the other hand, are left unchanged (as demonstrated by the test suite not being impacted at all): polars_util::range_components still takes a primary, and will only yield new dataframes when the primary component has been updated. Again, this (afaict) matches the behaviour of the legacy API, as well as my expectations.
As expected, the implementation of the high-level API is made much simpler thanks to the lack of PoV concerns: i.e. there is now a single iterator to deal with.

I'm writing all of this in a bit of a hurry, so not sure it will all make sense... but at the very least, I hope the example below will!

Consider this snippet:

use polars_core::prelude::JoinType;
use re_arrow_store::{polars_util, test_bundle, DataStore, RangeQuery, TimeRange};
use re_log_types::{
    datagen::{build_frame_nr, build_some_point2d, build_some_rects},
    field_types::{Instance, Point2D, Rect2D},
    msg_bundle::Component as _,
    ObjPath as EntityPath, TimeType, Timeline,
};

let mut store = DataStore::new(Instance::name(), Default::default());

let ent_path = EntityPath::from("this/that");

let frame1 = 1.into();
let frame2 = 2.into();
let frame3 = 3.into();
let frame4 = 4.into();

let bundle = test_bundle!(ent_path @ [build_frame_nr(frame1)] => [build_some_rects(2)]);
store.insert(&bundle).unwrap();

let bundle = test_bundle!(ent_path @ [build_frame_nr(frame2)] => [build_some_point2d(2)]);
store.insert(&bundle).unwrap();

let bundle = test_bundle!(ent_path @ [build_frame_nr(frame3)] => [build_some_point2d(4)]);
store.insert(&bundle).unwrap();

let bundle = test_bundle!(ent_path @ [build_frame_nr(frame4)] => [build_some_rects(3)]);
store.insert(&bundle).unwrap();

let bundle = test_bundle!(ent_path @ [build_frame_nr(frame4)] => [build_some_point2d(1)]);
store.insert(&bundle).unwrap();

let bundle = test_bundle!(ent_path @ [build_frame_nr(frame4)] => [build_some_rects(3)]);
store.insert(&bundle).unwrap();

let bundle = test_bundle!(ent_path @ [build_frame_nr(frame4)] => [build_some_point2d(3)]);
store.insert(&bundle).unwrap();

let timeline_frame_nr = Timeline::new("frame_nr", TimeType::Sequence);
let query = RangeQuery {
    timeline: timeline_frame_nr,
    range: TimeRange::new(2.into(), 4.into()),
};

which leaves us with a store that looks like this:

┌─────────────────┬──────────┬───────────┬────────────────┬─────────────────────────────────────┬─────────────────────────────────────┐
│ rerun.insert_id ┆ frame_nr ┆ entity    ┆ rerun.instance ┆ rerun.point2d                       ┆ rerun.rect2d                        │
│ ---             ┆ ---      ┆ ---       ┆ ---            ┆ ---                                 ┆ ---                                 │
│ u64             ┆ i64      ┆ str       ┆ list[u64]      ┆ list[struct[2]]                     ┆ list[struct[4]]                     │
╞═════════════════╪══════════╪═══════════╪════════════════╪═════════════════════════════════════╪═════════════════════════════════════╡
│ 1               ┆ 1        ┆ this/that ┆ [0, 1]         ┆ null                                ┆ [{0.0,0.0,0.0,0.0}, {1.0,1.0,0.0... │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2               ┆ 2        ┆ this/that ┆ [0, 1]         ┆ [{10.0,10.0}, {11.0,11.0}]          ┆ null                                │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 3               ┆ 3        ┆ this/that ┆ [0, 1, ... 3]  ┆ [{20.0,20.0}, {21.0,21.0}, ... {... ┆ null                                │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 4               ┆ 4        ┆ this/that ┆ [0, 1, 2]      ┆ null                                ┆ [{0.0,0.0,0.0,0.0}, {1.0,1.0,0.0... │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 5               ┆ 4        ┆ this/that ┆ [0]            ┆ [{30.0,30.0}]                       ┆ null                                │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 6               ┆ 4        ┆ this/that ┆ [0, 1, 2]      ┆ null                                ┆ [{0.0,0.0,0.0,0.0}, {1.0,1.0,0.0... │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 7               ┆ 4        ┆ this/that ┆ [0, 1, 2]      ┆ [{40.0,40.0}, {41.0,41.0}, {42.0... ┆ null                                │
└─────────────────┴──────────┴───────────┴────────────────┴─────────────────────────────────────┴─────────────────────────────────────┘

Querying from rect2d's PoV yields the following:

let dfs = polars_util::range_components(
    &store,
    &query,
    &ent_path,
    Rect2D::name(),
    [Instance::name(), Rect2D::name(), Point2D::name()],
    &JoinType::Outer,
);
Found data at time #1 from rerun.rect2d's PoV (outer-joining):
shape: (2, 2)
┌────────────────┬───────────────────┐
│ rerun.instance ┆ rerun.rect2d      │
│ ---            ┆ ---               │
│ u64            ┆ struct[4]         │
╞════════════════╪═══════════════════╡
│ 0              ┆ {0.0,0.0,0.0,0.0} │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1              ┆ {1.0,1.0,0.0,0.0} │
└────────────────┴───────────────────┘

Found data at time #4 from rerun.rect2d's PoV (outer-joining):
shape: (4, 3)
┌────────────────┬───────────────────────┬───────────────┐
│ rerun.instance ┆ rerun.rect2d          ┆ rerun.point2d │
│ ---            ┆ ---                   ┆ ---           │
│ u64            ┆ struct[4]             ┆ struct[2]     │
╞════════════════╪═══════════════════════╪═══════════════╡
│ 0              ┆ {0.0,0.0,0.0,0.0}     ┆ {20.0,20.0}   │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1              ┆ {1.0,1.0,0.0,0.0}     ┆ {21.0,21.0}   │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2              ┆ {2.0,2.0,1.0,1.0}     ┆ {22.0,22.0}   │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 3              ┆ {null,null,null,null} ┆ {23.0,23.0}   │
└────────────────┴───────────────────────┴───────────────┘

Found data at time #4 from rerun.rect2d's PoV (outer-joining):
shape: (3, 3)
┌────────────────┬───────────────────┬───────────────┐
│ rerun.instance ┆ rerun.rect2d      ┆ rerun.point2d │
│ ---            ┆ ---               ┆ ---           │
│ u64            ┆ struct[4]         ┆ struct[2]     │
╞════════════════╪═══════════════════╪═══════════════╡
│ 0              ┆ {0.0,0.0,0.0,0.0} ┆ {30.0,30.0}   │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1              ┆ {1.0,1.0,0.0,0.0} ┆ {null,null}   │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2              ┆ {2.0,2.0,1.0,1.0} ┆ {null,null}   │
└────────────────┴───────────────────┴───────────────┘

Querying from point2d's PoV yields the following:

let dfs = polars_util::range_components(
    &store,
    &query,
    &ent_path,
    Point2D::name(),
    [Instance::name(), Rect2D::name(), Point2D::name()],
    &JoinType::Outer,
);
Found data at time #2 from rerun.point2d's PoV (outer-joining):
shape: (2, 2)
┌────────────────┬───────────────┐
│ rerun.instance ┆ rerun.point2d │
│ ---            ┆ ---           │
│ u64            ┆ struct[2]     │
╞════════════════╪═══════════════╡
│ 0              ┆ {10.0,10.0}   │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1              ┆ {11.0,11.0}   │
└────────────────┴───────────────┘

Found data at time #3 from rerun.point2d's PoV (outer-joining):
shape: (4, 2)
┌────────────────┬───────────────┐
│ rerun.instance ┆ rerun.point2d │
│ ---            ┆ ---           │
│ u64            ┆ struct[2]     │
╞════════════════╪═══════════════╡
│ 0              ┆ {20.0,20.0}   │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1              ┆ {21.0,21.0}   │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2              ┆ {22.0,22.0}   │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 3              ┆ {23.0,23.0}   │
└────────────────┴───────────────┘

Found data at time #4 from rerun.point2d's PoV (outer-joining):
shape: (3, 3)
┌────────────────┬───────────────┬───────────────────┐
│ rerun.instance ┆ rerun.point2d ┆ rerun.rect2d      │
│ ---            ┆ ---           ┆ ---               │
│ u64            ┆ struct[2]     ┆ struct[4]         │
╞════════════════╪═══════════════╪═══════════════════╡
│ 0              ┆ {30.0,30.0}   ┆ {0.0,0.0,0.0,0.0} │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1              ┆ {null,null}   ┆ {1.0,1.0,0.0,0.0} │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2              ┆ {null,null}   ┆ {2.0,2.0,1.0,1.0} │
└────────────────┴───────────────┴───────────────────┘

Found data at time #4 from rerun.point2d's PoV (outer-joining):
shape: (3, 3)
┌────────────────┬───────────────┬───────────────────┐
│ rerun.instance ┆ rerun.point2d ┆ rerun.rect2d      │
│ ---            ┆ ---           ┆ ---               │
│ u64            ┆ struct[2]     ┆ struct[4]         │
╞════════════════╪═══════════════╪═══════════════════╡
│ 0              ┆ {40.0,40.0}   ┆ {0.0,0.0,0.0,0.0} │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1              ┆ {41.0,41.0}   ┆ {1.0,1.0,0.0,0.0} │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2              ┆ {42.0,42.0}   ┆ {2.0,2.0,1.0,1.0} │
└────────────────┴───────────────┴───────────────────┘

@jleibs
Copy link
Copy Markdown
Contributor

jleibs commented Dec 30, 2022

Thanks for the summaries! In short: I think we're on the same page. Will finish re-reviewing all of this (and the dependent PRs) today.

Sticking close to the legacy range (for now) makes total sense, though I definitely think we'll want to talk about modifications to those semantics.

My interpretation of those legacy APIs was that we get the equivalent of a latest-at for each time where the primary has changed, and my concern was that not yielding all rows meant we wouldn't be able to construct the proper latest-at view (using only a single range-query) since we would be missing rows where intermediate non-primary rows changed.

However, reading over #653 it looks like you worked around this by using separate range-queries for each component, directly matching the pattern in established in latest-at.

   // Now let's create the actual range iterators, one for each component / point-of-view.
    for (i, component) in std::iter::once(primary)
        .chain(components.iter().copied())
        .enumerate()
    {
        let components = [cluster_key, component];

        let it = store.range(query, ent_path, component, components).map(

I probably should have finished reading that PR before commenting here in the first place.

Which I think gives us the higher-level behavior I was expecting. I also imagine we can reduce that to a single range-query with the refactor from your latest push to remove primary from the lower-level API. I agree with your current choice for the higher-level behavior.

Copy link
Copy Markdown
Contributor

@jleibs jleibs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please take a look at the questions around find_bucket -- the other comments are fairly minor.

// This cannot fail, `iter_bucket` is guaranteed to always yield at least one bucket,
// since index tables always spawn with a default bucket that covers [-∞;+∞].
self.iter_bucket(time).next().unwrap()
self.range_buckets(..=time).next().unwrap()
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Unless I'm missing something, this seems like it should be using range_buckets_rev

However, if this is in fact a bug, the fact that this is still passing unit-tests is concerning.

Copy link
Copy Markdown
Contributor Author

@teh-cmc teh-cmc Jan 2, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

warning Unless I'm missing something, this seems like it should be using range_buckets_rev

It certainly should.

However, if this is in fact a bug, the fact that this is still passing unit-tests is concerning.

find_bucket (not to be confused with find_bucket_mut... 😅) is only ever used by the recently added latest_components_at helper ("Retrieve all the ComponentNames that have been written to for a given EntityPath"), which we don't have tests for apparently :( I'll add one (UPDATE: in another PR).

Copy link
Copy Markdown
Contributor

@jleibs jleibs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

aside from the lints this looks great!

@teh-cmc teh-cmc merged commit 853b714 into main Jan 2, 2023
@teh-cmc teh-cmc deleted the cmc/datastore/range_queries2 branch January 2, 2023 16:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

re_datastore: implement range queries

3 participants