Conversation
| /// | ||
| /// ⚠ The semantics are subtle! Study carefully the example below. | ||
| /// | ||
| /// Usage: |
There was a problem hiding this comment.
Note: this has been replaced by a standalone example in #645
| /// | ||
| /// ⚠ The semantics are subtle! Study carefully the example below. | ||
| /// | ||
| /// Usage: |
There was a problem hiding this comment.
Note: this has been replaced by a standalone example in #645
|
I think it'll be much easier if I answer all of your comments regarding primary keys and when-to-yield-data here rather than all over the place @jleibs, so here we go. So, just to recap, there are two APIs at play here: the "low-level" one ( So, regarding your 3 comments about the low-level API /
I agree with pretty much all of the above. Broadly, there are 2 main questions here from what I can tell:
The answer is the same for both: I just wanted to make sure that the first datastore API proposed by this PR was semantically as close as possible to both the legacy range API (where "yield-only-for-primary" comes from) and the already existing I'm definitely on board to get rid of both of those in the low-level API: it should be pretty trivial to do now that everything is in place, and should even make the implementation of the high-level layer a tad simpler. Speaking of the high-level API, you say..:
This I disagree with OTOH; though I'm not entirely sure what feels unexpected and/or confusing to you in this case..? In the context of the high-level API, these are the semantics I would expect (and they happen to match the semantics of the legacy range API, afaict). Consider a store in that state: When e.g. building a scene for a text view, I expect this to yield So, having a notion of a primary component and a yield-only-for-primary behaviour in the high-level layer seems like the right call to me? (Though, again, the implementation could be made simpler if we dropped all the rules on the low-level layer!) |
db9950f to
c4fbda7
Compare
|
Alright, to help with the discussion going forward, I've updated the code to reflect what a PoV-less/primary-less/always-yield lower-level API would look like (also updated the example so that rows actually share instance keys once in a while... 😅). I.e. The semantics of the high-level API, on the other hand, are left unchanged (as demonstrated by the test suite not being impacted at all): I'm writing all of this in a bit of a hurry, so not sure it will all make sense... but at the very least, I hope the example below will! Consider this snippet: use polars_core::prelude::JoinType;
use re_arrow_store::{polars_util, test_bundle, DataStore, RangeQuery, TimeRange};
use re_log_types::{
datagen::{build_frame_nr, build_some_point2d, build_some_rects},
field_types::{Instance, Point2D, Rect2D},
msg_bundle::Component as _,
ObjPath as EntityPath, TimeType, Timeline,
};
let mut store = DataStore::new(Instance::name(), Default::default());
let ent_path = EntityPath::from("this/that");
let frame1 = 1.into();
let frame2 = 2.into();
let frame3 = 3.into();
let frame4 = 4.into();
let bundle = test_bundle!(ent_path @ [build_frame_nr(frame1)] => [build_some_rects(2)]);
store.insert(&bundle).unwrap();
let bundle = test_bundle!(ent_path @ [build_frame_nr(frame2)] => [build_some_point2d(2)]);
store.insert(&bundle).unwrap();
let bundle = test_bundle!(ent_path @ [build_frame_nr(frame3)] => [build_some_point2d(4)]);
store.insert(&bundle).unwrap();
let bundle = test_bundle!(ent_path @ [build_frame_nr(frame4)] => [build_some_rects(3)]);
store.insert(&bundle).unwrap();
let bundle = test_bundle!(ent_path @ [build_frame_nr(frame4)] => [build_some_point2d(1)]);
store.insert(&bundle).unwrap();
let bundle = test_bundle!(ent_path @ [build_frame_nr(frame4)] => [build_some_rects(3)]);
store.insert(&bundle).unwrap();
let bundle = test_bundle!(ent_path @ [build_frame_nr(frame4)] => [build_some_point2d(3)]);
store.insert(&bundle).unwrap();
let timeline_frame_nr = Timeline::new("frame_nr", TimeType::Sequence);
let query = RangeQuery {
timeline: timeline_frame_nr,
range: TimeRange::new(2.into(), 4.into()),
};which leaves us with a store that looks like this: Querying from let dfs = polars_util::range_components(
&store,
&query,
&ent_path,
Rect2D::name(),
[Instance::name(), Rect2D::name(), Point2D::name()],
&JoinType::Outer,
);Querying from let dfs = polars_util::range_components(
&store,
&query,
&ent_path,
Point2D::name(),
[Instance::name(), Rect2D::name(), Point2D::name()],
&JoinType::Outer,
); |
|
Thanks for the summaries! In short: I think we're on the same page. Will finish re-reviewing all of this (and the dependent PRs) today. Sticking close to the legacy range (for now) makes total sense, though I definitely think we'll want to talk about modifications to those semantics. My interpretation of those legacy APIs was that we get the equivalent of a latest-at for each time where the primary has changed, and my concern was that not yielding all rows meant we wouldn't be able to construct the proper latest-at view (using only a single range-query) since we would be missing rows where intermediate non-primary rows changed. However, reading over #653 it looks like you worked around this by using separate range-queries for each component, directly matching the pattern in established in latest-at. I probably should have finished reading that PR before commenting here in the first place. Which I think gives us the higher-level behavior I was expecting. I also imagine we can reduce that to a single range-query with the refactor from your latest push to remove primary from the lower-level API. I agree with your current choice for the higher-level behavior. |
jleibs
left a comment
There was a problem hiding this comment.
Please take a look at the questions around find_bucket -- the other comments are fairly minor.
| // This cannot fail, `iter_bucket` is guaranteed to always yield at least one bucket, | ||
| // since index tables always spawn with a default bucket that covers [-∞;+∞]. | ||
| self.iter_bucket(time).next().unwrap() | ||
| self.range_buckets(..=time).next().unwrap() |
There was a problem hiding this comment.
range_buckets_rev
However, if this is in fact a bug, the fact that this is still passing unit-tests is concerning.
There was a problem hiding this comment.
warning Unless I'm missing something, this seems like it should be using
range_buckets_rev
It certainly should.
However, if this is in fact a bug, the fact that this is still passing unit-tests is concerning.
find_bucket (not to be confused with find_bucket_mut... 😅) is only ever used by the recently added latest_components_at helper ("Retrieve all the ComponentNames that have been written to for a given EntityPath"), which we don't have tests for apparently :( I'll add one (UPDATE: in another PR).
jleibs
left a comment
There was a problem hiding this comment.
aside from the lints this looks great!
This PR implements range queries.
It focuses solely on the datastore part of the matter: work still needs to be done on the
re_queryside.I.e, similar to how
re_queryimplements an efficient point-in-time multi-PoV join with semantics close to those of the very slowpolars_util::latest_components, we now need forre_queryto provide an efficient time-ranged multi-PoV streaming-join with semantics akin to those of the (yet again) very slowpolars_util::range_components.See #641.
The design space has turned out to be painfully vast, the tradeoffs numerous, and the semantics all too subtle if not straight vicious at times.
Here's the kind of questions that pop up when talking about range queries:
DataStore::latest_atandDataStore::rangebe?DataStore::rangehandle returning the latest-at state, or should that be the job of a higher-level abstraction?DataStore::rangedirectly handle the multi-PoV join inline, or should that be the job of a higher-level abstraction?DataStore::range, or something else higher in the stack?At this point it feels like I've iterated through pretty much the entire matrix of possibilities, and I've always been unhappy with the result... until now!
With this PR, latest-at queries and range queries end up with very close semantics, barring for two major differences:
This might sounds obvious, but has very subtle repercussions on how and when multiple components from multiple point-of-views are joined with each other.
That's one of the major feature of range queries, allowing for things such as text logs.
I've tried to document all of these decisions as best as I could, but most of the time I've been completely unable to properly convey these things in words.
The test suite really is the ground truth here and is the best place to find answers.
This PR:
rangequeries: both low-levelDataStoreAPIs and high-levelpolarsutilities.LastestAtQuery&RangeQuery) and two new methods (DataStore::latest_at,DataStore::range).TimeInts were unjustifiably erased intoi64s.TimeIntimplnohash_hasher::IsEnabled!Fixes #450