Dataframe API v2 p2: MVP implementation #7560
Conversation
```rust
let cur_index_value = streaming_state_per_component
    .values()
    // NOTE: We're purposefully ignoring RowId-related semantics here: we just want to know
    // the value we're looking for on the "main" index (dedupe semantics).
    .min_by_key(|streaming_state| streaming_state.index_value)
    .map(|streaming_state| streaming_state.index_value)?;
```
Rather than doing this on every call to `next_row`, I suspect it might be clearer to do this whole thing as a two-phase process.
First, work just with the Timeline data from every view-relevant chunk to materialize a new column of sorted/unique `TimeInt` values (note, as an added benefit this is the same input you'll want to be able to feed into `sample_index_values()` anyways). This could still be done incrementally, "batch-wise", by only looking at overlapping chunks on some horizon.
Then, once we have the ability to iterate over batches of `TimeInt`s, we iterate through them incrementally and look for the matching values from the relevant chunks, as you're doing below, which now becomes a common code path between this implementation and `sampled_index_values()`.
Additionally, my gut is that having batches of unique `TimeInt`s in advance sets us up nicely for some future optimizations.
- It lets us fairly easily parallelize the per-selected-column work. Each worker can independently yield a sequence of rows matching the requested sequence of `TimeInt`s.
- It lets us look ahead to check for matching runs in the given columns. Any time we have a matching run in a range with a single column (happy path), we can directly yield a slice of multiple rows from our column generator.
- Similarly, null runs can quickly be identified and generated when the last `TimeInt` in the requested batch is less than the next available `TimeInt` for the column.
- The aggregator consuming from each of the parallel column generators can then yield `RecordBatch`es based on overlapping row-runs from the separate columns, which means in the happy path of dense non-overlapping chunks we return to getting nice contiguous slices again.
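The first phase described above is essentially a k-way merge of already-sorted time columns. A minimal, hedged sketch of that idea, where plain `i64` stands in for `TimeInt` and `merge_unique_index_values` / `chunk_time_columns` are illustrative names rather than actual Rerun APIs:

```rust
type TimeInt = i64; // stand-in for the real index type

/// Merge several sorted time columns into one sorted, deduplicated column,
/// using a min-heap as a k-way merge.
fn merge_unique_index_values(chunk_time_columns: &[Vec<TimeInt>]) -> Vec<TimeInt> {
    use std::cmp::Reverse;
    use std::collections::BinaryHeap;

    // Min-heap of (next value, chunk index, offset within that chunk).
    let mut heap: BinaryHeap<Reverse<(TimeInt, usize, usize)>> = chunk_time_columns
        .iter()
        .enumerate()
        .filter_map(|(i, col)| col.first().map(|&t| Reverse((t, i, 0))))
        .collect();

    let mut out = Vec::new();
    while let Some(Reverse((t, chunk, offset))) = heap.pop() {
        // Dedupe: only emit a value the first time we see it.
        if out.last() != Some(&t) {
            out.push(t);
        }
        // Refill the heap with the next value from the same chunk, if any.
        if let Some(&next) = chunk_time_columns[chunk].get(offset + 1) {
            heap.push(Reverse((next, chunk, offset + 1)));
        }
    }
    out
}
```

The same loop could naturally be chunked to yield the batches of `TimeInt`s mentioned above instead of one flat column.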
We can do these improvements in follow-up PRs; let's focus on landing all the semantics first.
Agreed -- not a requested change. Just an observation about the structure to keep in mind as you refactor in the direction of supporting `sampled_index_values()`.
We've integrated all of this in @abey79's work-in-progress dataframe-view -- everything works semantics-wise. Next steps (future PRs):
A first implementation of the new dataframe APIs.
The name is now very misleading though: there isn't anything dataframe-y left in here; it is a row-based iterator with Rerun semantics baked in, driven by a sorted streaming join.
It is rather slow (related: #7558 (comment)), lacks many features and is full of edge cases, but it works.
It does support dedupe-latest semantics (slowly), view contents and selections, chunk overlaps, and pagination (horribly, by virtue of implementing `Iterator`).
It does not support `Clear`s, nor latest-at sparse-filling, nor PoVs, nor index sampling. Yet.
Upcoming PRs will be all about fixing these shortcomings one by one.
It should look somewhat familiar:
No tests until we have the guarantee that these are the semantics we will commit to.
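For illustration, the core of the sorted streaming join described above can be sketched as follows -- all names and types here (`StreamingState`, `next_row`, `i64` indices) are hypothetical stand-ins, not the actual implementation:

```rust
type TimeInt = i64; // stand-in for the real index type

/// One per-component cursor into a stream of index values, pre-sorted on the index.
struct StreamingState {
    index_values: Vec<TimeInt>, // sorted
    cursor: usize,
}

impl StreamingState {
    fn peek(&self) -> Option<TimeInt> {
        self.index_values.get(self.cursor).copied()
    }
}

/// Yield one joined row per unique index value, smallest first.
/// Returns the index value and, per stream, whether it contributed to this row.
fn next_row(states: &mut [StreamingState]) -> Option<(TimeInt, Vec<bool>)> {
    // The next row lives at the smallest index value any stream still has.
    let cur = states.iter().filter_map(|s| s.peek()).min()?;
    // A stream contributes iff its next value matches; advance the matching cursors.
    let hits = states
        .iter_mut()
        .map(|s| {
            if s.peek() == Some(cur) {
                s.cursor += 1;
                true
            } else {
                false
            }
        })
        .collect();
    Some((cur, hits))
}
```

Each call finds the smallest pending index value across all component streams and advances exactly the cursors that match it, which is what produces one sorted, deduplicated row per index value.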
Checklist
- [ ] `main` build: rerun.io/viewer
- [ ] `nightly` build: rerun.io/viewer
- [ ] `CHANGELOG.md` and the migration guide

To run all checks from `main`, comment on the PR with `@rerun-bot full-check`.