
The 1M index entries problem #5967

@teh-cmc

Description


Context

Every log call in Rerun results in a DataRow: Rerun's atomic unit of change.

DataRows are indexed by the datastore, and that index ultimately powers all the row/time-based features in Rerun, such as latest-at & (visible-time-)range queries, change detection, the time panel and time cursor, garbage collection, etc.

Each entry in this index (i.e. each row) carries a bunch of metadata:

  • A RowId, which uniquely identifies the associated row and defines its global order (!).
  • A TimePoint: a collection of timeline names and timestamps (typically around 3 of each).
  • A bunch of DataCells, each of which carries some metadata of its own:
    • A schema (as part of the underlying arrow Array), so that we can know whether the cell is:
      • the expected datatype (we're done)
      • a datatype that needs to be converted to the expected datatype (soft promise)
      • a promise payload that will have to be resolved (hard promise)
    • Offsets into the shared arrow buffers.
  • Other stuff that I'm omitting because it's going away as part of Remove instance keys (Archetype-less, PoV-less, join-less queries) #5303.
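To make the shape of an index entry concrete, here is an illustrative sketch. These names mirror the concepts above (RowId, TimePoint, DataCell) but are NOT Rerun's actual types, and the fields are assumptions based purely on the description above:

```python
from dataclasses import dataclass, field

@dataclass(frozen=True, order=True)
class RowId:
    # An id whose ordering defines the global order of rows.
    value: int

@dataclass
class DataCell:
    # Schema info tells us whether the cell holds the expected
    # datatype, a convertible one (soft promise), or an unresolved
    # promise payload (hard promise).
    datatype: str
    # Offsets into shared arrow buffers rather than owned data.
    offset: int
    length: int

@dataclass
class IndexEntry:
    row_id: RowId
    # TimePoint: timeline name -> timestamp (commonly ~3 timelines).
    timepoint: dict[str, int] = field(default_factory=dict)
    cells: list[DataCell] = field(default_factory=list)

entry = IndexEntry(
    row_id=RowId(1),
    timepoint={"log_time": 1_700_000_000, "log_tick": 42, "frame": 7},
    cells=[DataCell(datatype="Float64", offset=0, length=1)],
)
assert RowId(2) > RowId(1)  # RowIds are globally ordered
```

Even in this stripped-down form, every entry drags along an id, a timeline map, and per-cell schema/offset bookkeeping, regardless of how small the payload is.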

We also maintain a smaller, but still significant, reverse index, for when we need to fetch information about a specific RowId.
Finally, in addition to these primary indices maintained by the datastore itself, many pieces of the app also maintain their own secondary indices via our change detection system (e.g. all the time panel stuff).

All of this easily adds up to a few hundred bytes of overhead per index entry.
Whether that becomes a problem or not from a memory standpoint depends on the amount of data that the user actually stores in each of these index entries.

For something like e.g. a point cloud or an image, the indexing overhead is dwarfed by the actual data, and everything is fine (for the most part).
For something like a scalar, the indexing overhead is orders of magnitude larger than the data itself, which leads to situations like this one (#5904), where 20MiB of scalar data in a parquet file somehow ends up with a 20GiB memory footprint once logged.
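A back-of-envelope sketch of the scalar case: the 300-byte figure below is an assumed stand-in for the "few hundred bytes" of primary-index overhead mentioned above, so this only models part of the blow-up — secondary indices and allocator overhead (not modeled here) account for the rest of the gap to the 20GiB observed in #5904:

```python
PAYLOAD_BYTES = 8      # one f64 scalar
OVERHEAD_BYTES = 300   # assumed per-entry index overhead ("a few hundred bytes")

# ~20 MiB of raw scalar data, one index entry per scalar.
n_scalars = 20 * 1024**2 // PAYLOAD_BYTES
payload = n_scalars * PAYLOAD_BYTES
total = n_scalars * (PAYLOAD_BYTES + OVERHEAD_BYTES)

# Primary-index overhead alone multiplies the footprint ~38x
# under these assumptions.
print(total / payload)
```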

Unfortunately the overhead issues don't stop at the memory footprint.

All row/time-based subsystems in the app must spend compute time dealing with these indices, which slows down all aspects of the viewer (including ingestion!), e.g.:

  • The time panel has to spend CPU time every frame to render all these points on the timeline.
  • The datastore and all the secondary indices have to be updated that much more frequently.
  • The GC has to go through that many more rows.
  • Etc.

And, of course, the infamous worst of all: range queries that span a lot of indices will have to unpack as many rows, which quickly becomes prohibitively expensive and therefore requires a lot of complex caching machinery to pull off (including the dreaded deserialization cache!).
Things once again get even worse from an overhead perspective if it turns out that the contents of the row were actually one single float value all along...

Having too many index entries / rows wrecks everything: from logging, to ingestion, to visualization and all the way to the memory footprint.
Given this, here's the $1M question: how can Rerun help users index fewer rows while still somehow meeting all their needs, without any (or very little) added complexity?

Whichever solution we come up with, it should work both with static known-ahead-of-time datasets (e.g. logging a scalar plot from a parquet file) and real-time datasets (e.g. a high-frequency IMU).

Things that either wouldn't help, or not enough

Temporal batches

Temporal batches provide a huge DX and performance gain on the logging side, but once the data gets exploded back into rows and indexed into the store, all the problems stated above still apply.
In fact, temporal batches are just a way to make it easier to create more index entries!

Temporal batches also require the user to know their dataset ahead of time, while we want a solution that works both for AOT and RT use cases.

We do want temporal batches for other reasons, but they won't help with this issue.
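The core objection can be sketched in a few lines. This is a hypothetical model of batching, not Rerun's actual API: a temporal batch logs many (time, value) pairs in one call, but the store still explodes it into one index entry per timestamp.

```python
# Hypothetical sketch: the batch amortizes the logging call,
# not the indexing.
def explode_temporal_batch(times, values):
    # One row -- and thus one index entry -- per timestamp.
    return [{"time": t, "value": v} for t, v in zip(times, values)]

rows = explode_temporal_batch(range(1_000), [0.5] * 1_000)
assert len(rows) == 1_000  # 1000 batched scalars -> 1000 index entries
```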

Optimizing the memory footprint of index entries

Some possible avenues on that front:

  • Get rid of the log_tick timeline.
  • Try and optimize the way schema information is stored and shared across rows (custom implementation of arrow::Array?).
  • Try and optimize the memory footprint of RowId (varints?!).
  • Others?

While those would definitely be welcome improvements, I don't see that kind of optimization getting us anywhere near a viable memory footprint for scalar use cases.
This also only helps with the memory footprint; compute is still as much of an issue as before.
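On the RowId-varint idea specifically, a quick LEB128-style sketch shows why the savings are bounded. The 128-bit width is an assumption (the text above only says RowIds are globally ordered), and this encoder is illustrative, not Rerun code:

```python
def encode_varint(n: int) -> bytes:
    # LEB128-style: 7 payload bits per byte, high bit = continuation.
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)

# Small values compress well...
assert len(encode_varint(300)) == 2
# ...but a full 128-bit value needs ceil(128 / 7) = 19 bytes, slightly
# MORE than the 16 raw bytes, so varints only pay off when the encoded
# values are numerically small.
assert len(encode_varint((1 << 128) - 1)) == 19
```

Constant-factor wins like this shave bytes per entry; they don't change the fact that the number of entries is the real problem.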

Others?

???


Related:
