Client-side chunks 1: introduce `Chunk` and its shuffle/sort routines #6438
```rust
// TODO(cmc): maybe this would be better as raw i64s so getting time columns in and out of
// chunks is just a blind memcpy… it's probably not worth the hassle for now though.
// We'll see how things evolve as we start putting chunks in the backend.
pub(crate) times: Vec<TimeInt>,
```
Depending on how the backend side goes, I might actually end up not deserializing these at all, if I can afford it. That would be sweet.
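A minimal sketch of the idea in the TODO above, assuming `TimeInt` is (or becomes) a `#[repr(transparent)]` wrapper around `i64`; the real Rerun type may differ:

```rust
// Hypothetical stand-in for the real TimeInt; #[repr(transparent)]
// guarantees it has the exact same layout as a bare i64.
#[repr(transparent)]
#[derive(Debug, Clone, Copy, PartialEq)]
struct TimeInt(i64);

/// Reinterpret a slice of `TimeInt` as raw i64s, without copying.
fn as_raw_i64s(times: &[TimeInt]) -> &[i64] {
    // SAFETY: #[repr(transparent)] makes TimeInt layout-identical to i64.
    unsafe { std::slice::from_raw_parts(times.as_ptr().cast::<i64>(), times.len()) }
}

fn main() {
    let times = vec![TimeInt(10), TimeInt(20), TimeInt(30)];
    assert_eq!(as_raw_i64s(&times), &[10, 20, 30]);
}
```

With such a layout, moving a time column in or out of a chunk is effectively the "blind memcpy" the TODO mentions.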
```rust
/// data within.
#[derive(Debug, Clone)]
pub struct Chunk {
    pub(crate) id: ChunkId,
```
I'm exploring the possibility of always making sure that the ID of a chunk is the same as the ID of its first row (in sorted order).
That would be way more useful than a random ID generated post-micro-batching, and would give way more meaning to sorting chunks based on their IDs.
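A sketch of that idea, using plain `u128` as a stand-in for the real `RowId`/`ChunkId` types: the chunk's ID becomes the ID of its first row in sorted order, i.e. the smallest `RowId` it contains.

```rust
// Hypothetical: derive the chunk ID from the smallest RowId in the chunk,
// rather than generating a fresh random ID post-micro-batching.
fn chunk_id_from_rows(row_ids: &[u128]) -> Option<u128> {
    row_ids.iter().copied().min()
}

fn main() {
    assert_eq!(chunk_id_from_rows(&[7, 3, 9]), Some(3));
    assert_eq!(chunk_id_from_rows(&[]), None);
}
```

Since row IDs are time-ordered, sorting chunks by such IDs would then roughly order them by ingestion time.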
```rust
    ///
    /// Iff you know for sure whether the data is already appropriately sorted or not, specify `is_sorted`.
    /// When left unspecified (`None`), it will be computed in O(n) time.
    pub fn new(
```
TODO in this PR or another: when creating a chunk of static data, there is no reason to keep anything but the last row (in sorted row-id order).
The backend will have to support multi-rows static chunks anyhow since clients can send anything, which both the query engine and compaction will know how to take care of, but it's a nice little optimization on the standard path.
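A hypothetical sketch of that optimization: for a static chunk, only the row with the greatest `RowId` survives. `u64` stands in for the real `RowId` type and `String` for the row payload.

```rust
// Hypothetical: on the standard path, a static chunk can be reduced to its
// single winning row (the one with the greatest RowId) before sending.
fn keep_last_static_row(rows: Vec<(u64, String)>) -> Vec<(u64, String)> {
    rows.into_iter()
        .max_by_key(|(row_id, _)| *row_id)
        .into_iter()
        .collect()
}

fn main() {
    let rows = vec![(1, "a".to_owned()), (3, "c".to_owned()), (2, "b".to_owned())];
    assert_eq!(keep_last_static_row(rows), vec![(3, "c".to_owned())]);
}
```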
```rust
    /// Empty if this is a static chunk.
    pub(crate) timelines: BTreeMap<Timeline, ChunkTimeline>,

    /// A sparse `ListArray` for each component.
```
To my knowledge arrow doesn't have a spec for "sparse" listarray.
Do you mean nullable listarray?
Also, worth thinking about. Arrow now supports a ListView: https://arrow.apache.org/docs/format/Columnar.html#listview-layout
This could give us a mechanism to shuffle just the offsets in cases where we don't want to pay the full cost of rearranging the child buffer.
> To my knowledge arrow doesn't have a spec for "sparse" listarray.
> Do you mean nullable listarray?
I just find the "official" terminology extremely confusing: what's a nullable listarray exactly? a listarray that can be null? a listarray that can contain null values? both?
```rust
#[allow(clippy::collapsible_if)] // readability
if cfg!(debug_assertions) {
    for &time in times {
        if time < time_range.min() || time > time_range.max() {
```
Is `time_range` allowed to be conservative, or should we also be sanity-checking that this is a tight bound?
Tighter checks definitely cannot hurt
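A sketch of the tighter check being discussed: every time must fall within the range, and both bounds must actually be attained by some value, i.e. the range is tight rather than merely conservative. Raw `i64`s stand in for the real time types.

```rust
// Hypothetical tight-bound sanity check: the range must be both a valid
// bound for every time *and* exactly attained at both ends.
fn time_range_is_tight(times: &[i64], min: i64, max: i64) -> bool {
    !times.is_empty()
        && times.iter().all(|&t| min <= t && t <= max)
        && times.contains(&min)
        && times.contains(&max)
}

fn main() {
    assert!(time_range_is_tight(&[1, 5, 3], 1, 5));
    assert!(!time_range_is_tight(&[2, 5, 3], 1, 5)); // min bound is conservative
}
```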
`crates/re_chunk/src/shuffle.rs` (outdated)
```rust
    ///
    /// If `make_contiguous` is `true`, the underlying arrow data will be copied and shuffled in
    /// memory in order to make it contiguous.
    /// Otherwise, only the offsets will be shuffled.
```
> Otherwise, only the offsets will be shuffled.

I don't believe this is allowed for `ListArray`. Offsets must be monotonically increasing and dense: the length of each array is `offsets[n+1] - offsets[n]`.
We could, however, do this with ListView instead.
Oh yeah, nice catch. No idea why arrow2 allows it :|
We're not going to get ListView into arrow2 any time soon obviously, so I'll just remove the non-contiguous path and leave a TODO that links to our arrow-rs migration ticket.
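An illustration of the invariant at the heart of this thread: `ListArray` offsets must be monotonically increasing, with the length of the n-th list given by `offsets[n + 1] - offsets[n]`, so a shuffled offsets buffer is simply not a valid `ListArray`.

```rust
// Compute per-list lengths from a ListArray-style offsets buffer, rejecting
// buffers that violate the monotonicity invariant (e.g. shuffled offsets).
fn list_lengths(offsets: &[i32]) -> Option<Vec<i32>> {
    if offsets.windows(2).any(|w| w[1] < w[0]) {
        return None; // not monotonically increasing: invalid ListArray
    }
    Some(offsets.windows(2).map(|w| w[1] - w[0]).collect())
}

fn main() {
    // Valid: three lists of lengths 2, 0 and 3.
    assert_eq!(list_lengths(&[0, 2, 2, 5]), Some(vec![2, 0, 3]));
    // Invalid: a "shuffled" offsets buffer.
    assert_eq!(list_lengths(&[0, 3, 1]), None);
}
```

`ListView`, by contrast, stores independent offset/size pairs per list, which is what would make the offsets-only shuffle legal.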
This new and improved `re_format_arrow` ™️ brings two major improvements:
- It is now designed to format standard Arrow dataframes (aka chunks or batches), i.e. a `Schema` and a `Chunk`. In particular, chunk-level and field-level schema metadata will now be rendered properly with the rest of the table.
- Tables larger than your terminal will now do their best to fit in, while making sure to still show just enough data.

E.g. here's an excerpt of a real-world Rerun dataframe from our `helix` example:

```
cargo r -p rerun-cli --no-default-features --features native_viewer -- print helix.rrd --verbose
```

(Before (`main`) and after screenshots omitted.)

---

Part of a PR series to implement our new chunk-based data model on the client-side (SDKs):
- #6437
- #6438
- #6439
- #6440
- #6441
A `TransportChunk` is a `Chunk` that is ready for transport and/or storage. It is very cheap to go from a `Chunk` to a `TransportChunk` and vice-versa.

A `TransportChunk` maps 1:1 to a native Arrow `RecordBatch`. It has a stable ABI, and can be cheaply sent across process boundaries. (`arrow2` has no `RecordBatch` type; we will get one once we migrate to `arrow-rs`.)

A `TransportChunk` is self-describing: it contains all the data _and_ metadata needed to index it into storage.

We rely heavily on chunk-level and field-level metadata to communicate Rerun-specific semantics over the wire, e.g. whether some columns are already properly sorted. The Arrow metadata system is fairly limited (it's all untyped strings), but for now that seems good enough. It will be trivial to switch to something else later, if need be.

- Fixes #1760
- Fixes #1692
- Fixes #3360
- Fixes #1696

---

Part of a PR series to implement our new chunk-based data model on the client-side (SDKs):
- #6437
- #6438
- #6439
- #6440
- #6441
This is a fork of the old `DataTable` batcher, and works very similarly. Like before, this batcher will micro-batch using both space and time thresholds. There are two main differences:
- This batcher maintains a dataframe per entity, as opposed to the old one which worked globally.
- Once a threshold is reached, this batcher further splits the incoming batch in order to fulfill these invariants:

```rust
/// In particular, a [`Chunk`] cannot:
/// * contain data for more than one entity path
/// * contain rows with different sets of timelines
/// * use more than one datatype for a given component
/// * contain more rows than a pre-configured threshold if one or more timelines are unsorted
```

Most of the code is the same; the really interesting piece is `PendingRow::many_into_chunks`, as well as the newly added tests.

- Fixes #4431

---

Part of a PR series to implement our new chunk-based data model on the client-side (SDKs):
- #6437
- #6438
- #6439
- #6440
- #6441
Integrate the new chunk batcher in all SDKs, and get rid of the old one. On the backend, we make sure to deserialize incoming chunks into the old `DataTable`s, so business can continue as usual. Although the new batcher has a much more complicated task with all these sub-splits to manage, it is somehow already more performant than the old one 🤷♂️: ```bash # this branch cargo b -p log_benchmark --release && hyperfine --runs 15 './target/release/log_benchmark --benchmarks points3d_many_individual' Benchmark 1: ./target/release/log_benchmark --benchmarks points3d_many_individual Time (mean ± σ): 4.499 s ± 0.117 s [User: 5.544 s, System: 1.836 s] Range (min … max): 4.226 s … 4.640 s 15 runs # main cargo b -p log_benchmark --release && hyperfine --runs 15 './target/release/log_benchmark --benchmarks points3d_many_individual' Benchmark 1: ./target/release/log_benchmark --benchmarks points3d_many_individual Time (mean ± σ): 4.407 s ± 0.773 s [User: 8.423 s, System: 0.880 s] Range (min … max): 2.997 s … 6.148 s 15 runs ``` Notice the massive difference in user time. --- Part of a PR series to implement our new chunk-based data model on the client-side (SDKs): - #6437 - #6438 - #6439 - #6440 - #6441
Introduces the new `re_chunk` crate.

Specifically, it introduces the `Chunk` type itself, and all methods and helpers related to sorting.

A `Chunk` is self-describing: it contains all the data and metadata needed to index it into storage.

There are a lot of things that need to be sorted within a `Chunk`, and as such we must make sure to keep track of what is or isn't sorted at all times, to avoid needlessly re-sorting things every time a chunk changes hands. This necessitates a bunch of sanity checking all over the place to make sure we never end up in undefined states.

`Chunk` is not about transport; it's about providing a nice-to-work-with representation when manipulating a chunk in memory. Transporting a `Chunk` happens in the next PR.

- `DataTable::sort` shared with `DataStore` #1981

Part of a PR series to implement our new chunk-based data model on the client-side (SDKs):
- `Chunk` and its shuffle/sort routines #6438
- `TransportChunk` #6439

Checklist
- `main` build: rerun.io/viewer
- `nightly` build: rerun.io/viewer

To run all checks from `main`, comment on the PR with `@rerun-bot full-check`.