Dataframe API v2 p0: Chunk support for dedupe-latest semantics by teh-cmc · Pull Request #7558 · rerun-io/rerun

teh-cmc · 2024-10-01T15:02:53Z

Implements support for the new dedupe-latest ™️ semantics on Chunk. This is one of the fundamental primitives required for the new upcoming dataframe APIs.

This requires the use of the take Arrow kernel.
Unfortunately I made the mistake of testing that new kernel, which revealed that it is allocating a lot of data when it shouldn't, so I'll have to fix it at some point in a future PR.
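For readers who have not seen the pattern, here is a minimal sketch of how dedupe-latest can be expressed as "compute the indices to keep, then `take` them". It uses plain slices rather than Arrow arrays, and the names `dedupe_latest_indices` and `take` are hypothetical stand-ins, not the functions this PR adds to `re_chunk`. It also assumes the index column is already sorted and that the last row of each run is the one to keep.

```rust
/// Returns the row indices to keep so that each distinct index value
/// (e.g. a timestamp) appears exactly once, keeping the last row of each run.
/// Assumption: `times` is already sorted in ascending order.
fn dedupe_latest_indices(times: &[i64]) -> Vec<usize> {
    let mut keep = Vec::new();
    for (i, t) in times.iter().enumerate() {
        // A row is the latest for its time if it is the final row overall,
        // or if the next row carries a different time.
        let is_last_of_run = times.get(i + 1).map_or(true, |next| next != t);
        if is_last_of_run {
            keep.push(i);
        }
    }
    keep
}

/// A naive `take`: gathers the rows at `indices` into a new vector.
/// Like the kernel discussed in this PR, it always copies.
fn take<T: Clone>(values: &[T], indices: &[usize]) -> Vec<T> {
    indices.iter().map(|&i| values[i].clone()).collect()
}
```

The same `keep` indices can then be applied to every column of the chunk (index columns and component columns alike), which is why a `take` kernel is the natural building block here.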

Part of Expand rust APIs to support new Data API concepts #7495

Checklist

I have read and agree to Contributor Guide and the Code of Conduct
I've included a screenshot or gif (if applicable)
I have tested the web demo (if applicable):
- Using examples from latest main build: rerun.io/viewer
- Using full set of examples from nightly build: rerun.io/viewer
The PR title and labels are set such as to maximize their usefulness for the next release's CHANGELOG
If applicable, add a new check to the release checklist!
I have noted any breaking changes to the log API in CHANGELOG.md and the migration guide

To run all checks from main, comment on the PR with @rerun-bot full-check.

teh-cmc · 2024-10-01T16:33:17Z

> Unfortunately I made the mistake of testing that new kernel, which revealed that it is allocating a lot of data when it shouldn't, so I'll have to fix it at some point in a future PR.

That's just me being dumb: I keep forgetting that these things are not expressible given how ListArray is encoded... We need ListView once again.
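To make the encoding issue concrete, here is a rough sketch using plain vectors in place of Arrow buffers; `take_list` and `take_list_view` are hypothetical helpers, not Arrow APIs. A `ListArray` stores `n + 1` monotonic offsets into one contiguous values buffer, so reordering or skipping rows forces the child values to be rewritten. A `ListView` stores an independent `(offset, size)` pair per row, so a take only has to gather those pairs.

```rust
/// Take on a `ListArray`-style layout: row `i` spans `values[offsets[i]..offsets[i + 1]]`.
/// Offsets must stay contiguous and monotonic, so the selected child values
/// have to be copied into a fresh buffer.
fn take_list(offsets: &[usize], values: &[f32], indices: &[usize]) -> (Vec<usize>, Vec<f32>) {
    let mut new_offsets = vec![0];
    let mut new_values = Vec::new();
    for &i in indices {
        new_values.extend_from_slice(&values[offsets[i]..offsets[i + 1]]);
        new_offsets.push(new_values.len());
    }
    (new_offsets, new_values)
}

/// Take on a `ListView`-style layout: row `i` spans `offsets[i]..offsets[i] + sizes[i]`.
/// Only the per-row pairs are gathered; the values buffer is left untouched
/// and can keep being shared with the source array.
fn take_list_view(
    offsets: &[usize],
    sizes: &[usize],
    indices: &[usize],
) -> (Vec<usize>, Vec<usize>) {
    let new_offsets = indices.iter().map(|&i| offsets[i]).collect();
    let new_sizes = indices.iter().map(|&i| sizes[i]).collect();
    (new_offsets, new_sizes)
}
```

In the first function the amount of data copied grows with the size of the selected lists; in the second it grows only with the number of selected rows.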

Unfortunately that does mean that my current, fairly readable approach of index-deduping all relevant chunks and then doing a somewhat straightforward streaming-join, sucks even more than I thought, performance-wise.

That's fine, we can always make it fast in a follow-up PR by adding even more cursor shenanigans instead... but let's integrate it all and cement the semantics first before trying to go there.
(Also, since this is purely a private implementation detail, maybe we could simply have a custom take kernel implementation that returns a DictionaryArray?)
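One way to picture the `DictionaryArray` idea: the result of the take keeps a reference to the original values and stores the requested indices as dictionary keys, so no value is copied. The sketch below is only an illustration with a hypothetical `DictTaken` type built on `Rc`; it is not an Arrow type and not something this PR implements.

```rust
use std::rc::Rc;

/// Hypothetical dictionary-encoded result of a `take`: `keys[row]` points into
/// the original, shared `values` buffer instead of holding a copy of the data.
struct DictTaken<T> {
    keys: Vec<usize>,
    values: Rc<[T]>,
}

impl<T> DictTaken<T> {
    /// Number of rows in the taken result.
    fn len(&self) -> usize {
        self.keys.len()
    }

    /// Resolves a row through its key into the shared values buffer.
    fn get(&self, row: usize) -> &T {
        &self.values[self.keys[row]]
    }
}

/// "Takes" `indices` from `values` by recording them as keys.
/// Only the indices are copied; the values buffer is shared via `Rc`.
fn take_as_dictionary<T>(values: &Rc<[T]>, indices: &[usize]) -> DictTaken<T> {
    DictTaken {
        keys: indices.to_vec(),
        values: Rc::clone(values),
    }
}
```

The trade-off is that every downstream reader pays an extra indirection per row, which is presumably acceptable for a private implementation detail.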

I still want to keep this implementation around as it materializes explicitly the semantics we expect, and can in fact be made pretty fast once we actually have a ListView at our disposal.

jleibs · 2024-10-01T17:11:50Z

> We need ListView once again.

Yeah, this definitely seems like an area where that would be nice. At least it looks like the relevant PRs are still moving forward on that one.

> we can always make it fast in a follow-up PR by adding even more cursor shenanigans instead... but let's integrate it all and cement the semantics first before trying to go there.

Strongly agree.

At least at this phase no individual value should be copied more than once.

crates/store/re_chunk/src/slice.rs

jleibs · 2024-10-01T17:17:39Z

crates/store/re_chunk/tests/memory_test.rs

```diff
+            let indices = ArrowPrimitiveArray::from_vec(
+                (0..untaken.0.len() as i32)
+                    .filter(|i| i % 2 == 0)
+                    .collect_vec(),
+            );
```


What happens if we take a contiguous slice? Is there internal optimization for slicing when possible, or does it always copy blindly?

Always copies.
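For reference, the optimization the question hints at could look roughly like the sketch below: detect that the requested indices form a contiguous ascending run and borrow a slice instead of copying. This is a hypothetical illustration on plain slices (`as_contiguous_range` and `take_or_slice` are made-up names); the kernel used in this PR does not do this.

```rust
use std::borrow::Cow;
use std::ops::Range;

/// Returns `Some(start..end)` if `indices` is a non-empty run of consecutive
/// ascending values (e.g. `[3, 4, 5]`), `None` otherwise.
fn as_contiguous_range(indices: &[usize]) -> Option<Range<usize>> {
    let first = *indices.first()?;
    let contiguous = indices.iter().enumerate().all(|(k, &i)| i == first + k);
    contiguous.then(|| first..first + indices.len())
}

/// Borrows a slice when the indices are contiguous, and falls back to a
/// copying gather otherwise.
fn take_or_slice<'a, T: Clone>(values: &'a [T], indices: &[usize]) -> Cow<'a, [T]> {
    match as_contiguous_range(indices) {
        Some(range) => Cow::Borrowed(&values[range]),
        None => Cow::Owned(indices.iter().map(|&i| values[i].clone()).collect()),
    }
}
```

With the every-other-row indices used in the memory test above, this would still take the copying path, so the test's allocation numbers would not change.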


The new public API definition and nothing else. Speak now.

* Part of #7495
* Requires #7558

---------

Co-authored-by: Jeremy Leibs <jeremy@rerun.io>

A first implementation of the new dataframe APIs.
The name is now very misleading though: there isn't anything dataframe-y left in here, it is a row-based iterator with Rerun semantics baked in, driven by a sorted streaming join.

It is rather slow (related: #7558 (comment)), lacks many features and is full of edge cases, but it works.

It does support dedupe-latest semantics (slowly), view contents and selections, chunk overlaps, and pagination (horribly, by virtue of implementing `Iterator`).
It does _not_ support `Clear`s, nor `latest-at` sparse-filling, nor PoVs, nor index sampling. Yet.

Upcoming PRs will be all about fixing these shortcomings one by one.

It should look somewhat familiar:

```rust
let query_cache = QueryCache::new(store);
let query_engine = QueryEngine {
    store,
    cache: &query_cache,
};

let mut query = QueryExpression2::new(timeline);
query.view_contents = Some(
    query_engine
        .iter_entity_paths(&entity_path_filter)
        .map(|entity_path| (entity_path, None))
        .collect(),
);
query.filtered_index_range = Some(ResolvedTimeRange::new(time_from, time_to));
eprintln!("{query:#?}:");

let query_handle = query_engine.query(query.clone());
// eprintln!("{:#?}", query_handle.selected_contents());
for batch in query_handle.into_batch_iter().skip(offset).take(len) {
    eprintln!("{batch}");
}
```

No tests until we have the guarantee that these are the semantics we will commit to.

* Part of #7495
* Requires #7559
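The "sorted streaming join" mentioned above can be pictured with a small sketch. It assumes each column is a stream of `(time, value)` pairs that is already sorted by time and deduplicated, keeps one cursor per column, and emits one row per distinct time. The function name and the use of `&'static str` values are illustrative only; the real implementation works on chunks and is not shown in this thread.

```rust
/// Joins per-column streams of `(time, value)` pairs into rows.
/// Assumptions: every column is sorted by time and holds at most one value
/// per time (i.e. dedupe-latest has already been applied).
fn streaming_join(
    columns: &[Vec<(i64, &'static str)>],
) -> Vec<(i64, Vec<Option<&'static str>>)> {
    // One cursor per column, each pointing at that column's next unread entry.
    let mut cursors = vec![0usize; columns.len()];
    let mut rows = Vec::new();
    loop {
        // The next row's time is the smallest time under any cursor.
        let next_time = columns
            .iter()
            .zip(&cursors)
            .filter_map(|(col, &c)| col.get(c).map(|(t, _)| *t))
            .min();
        let Some(time) = next_time else { break };

        let mut row = Vec::with_capacity(columns.len());
        for (col, cursor) in columns.iter().zip(cursors.iter_mut()) {
            match col.get(*cursor) {
                // This column has a value at `time`: emit it and advance.
                Some(&(t, v)) if t == time => {
                    row.push(Some(v));
                    *cursor += 1;
                }
                // No value at `time`: leave the cell empty (no sparse-filling).
                _ => row.push(None),
            }
        }
        rows.push((time, row));
    }
    rows
}
```

Cells with no value at a given time stay `None`, which matches the stated lack of `latest-at` sparse-filling in this first version.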

teh-cmc added 2 commits October 1, 2024 16:42

re_chunk: support for 'take' kernel and very sad test suite

d6ebcfb

re_chunk: implement support for dedupe-latest semantics

3997145

teh-cmc added the 🏹 arrow (Apache Arrow), 🔍 re_query (affects re_query itself), and include in changelog labels Oct 1, 2024

teh-cmc changed the title ~~Dataframe API v2: Chunk support for dedupe-latest semantics~~ Dataframe API v2 #0: Chunk support for dedupe-latest semantics Oct 1, 2024

teh-cmc added the do-not-merge (Do not merge this PR) label Oct 1, 2024

teh-cmc mentioned this pull request Oct 1, 2024

Dataframe API v2 p1: API definitions #7559

Merged

6 tasks

teh-cmc marked this pull request as ready for review October 1, 2024 15:21

no alloc is _not_ happening

5027525

teh-cmc mentioned this pull request Oct 1, 2024

Dataframe API v2 p2: MVP implementation #7560

Merged

6 tasks

jleibs approved these changes Oct 1, 2024


Update crates/store/re_chunk/src/slice.rs

bc068b6

Co-authored-by: Jeremy Leibs <jeremy@rerun.io>

teh-cmc removed the do-not-merge (Do not merge this PR) label Oct 2, 2024

teh-cmc merged commit e5ae198 into main Oct 2, 2024

teh-cmc deleted the cmc/dataframev2_0_chunk_stuff branch October 2, 2024 09:53

teh-cmc added a commit that referenced this pull request Oct 2, 2024

Dataframe API v2 #1: API definitions (#7559)

3581ca4


teh-cmc changed the title ~~Dataframe API v2 #0: Chunk support for dedupe-latest semantics~~ Dataframe API v2 p0: Chunk support for dedupe-latest semantics Oct 3, 2024

emilk removed the include in changelog label Oct 16, 2024

teh-cmc merged 4 commits into main from cmc/dataframev2_0_chunk_stuff
