Promises: bigger-than-RAM #5247

@emilk

Goals

Support some forms of “bigger-than-RAM” recordings, as soon as possible

Background

Small-index vs Big-index

Table index: row ids and time points.

Does the table index fit in RAM?

Hypothesis: most “bigger-than-RAM” problems have smallish indices.

Big index

Example: 100GB of scalar plots

We need a hierarchical index file on disk, with seeking, and store-subscribers that are aware of it, etc. Difficult!

Small index

Example: thousands of uncompressed 4k images, or big point clouds, meshes, …

We “just” need to figure out how to load blobs from disk on-demand. Easier!

Promises: a solution to small-index

We replace large blobs with promises that refer to the external data.

A promise could be a file path with optional byte offset, a URL, …

When a query results in a Promise, we (try to) resolve it.

Example: we go through a huge MCAP file and log it to Rerun, but replace big blobs by a Promise referring to a byte-offset in the MCAP.
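A minimal sketch of what such a file promise could look like, assuming a `file://` URI with a `bytes=<start>-<end>` query parameter. The exact URI syntax is not fixed in this issue, and `make_file_promise` / `resolve_file_promise` are hypothetical names used only for illustration:

```python
from pathlib import Path
from urllib.parse import parse_qs, urlparse
from urllib.request import url2pathname


def make_file_promise(path: str, offset: int, length: int) -> str:
    """Build a promise URI for a byte range of a local file.

    The `bytes=start-end` query syntax (end exclusive) is an assumption.
    """
    return f"{Path(path).resolve().as_uri()}?bytes={offset}-{offset + length}"


def resolve_file_promise(uri: str) -> bytes:
    """Resolve a file promise by seeking to the byte range and reading it."""
    parsed = urlparse(uri)
    if parsed.scheme != "file":
        raise ValueError(f"unsupported promise scheme: {parsed.scheme!r}")
    path = url2pathname(parsed.path)
    byte_range = parse_qs(parsed.query).get("bytes")
    with open(path, "rb") as f:
        if byte_range is None:
            # No offset given: the promise refers to the whole file.
            return f.read()
        start, end = (int(part) for part in byte_range[0].split("-"))
        f.seek(start)
        return f.read(end - start)
```

When walking a large file, the logger would keep the small fields inline and store only the URI returned by `make_file_promise` in place of each big blob.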

User stories

  • Logging a file reference
    • rr.log("image", rr.Image(data=rr.Promise.file_path("foo.jpg")))
  • VRS
    • file://recording.vrs?stream=video&time=42
  • Log a video file
    for i, frame in enumerate(video):
        rr.set_time_point("frame", i)
        rr.log("video", rr.Image(data=rr.Promise.file_path(f"foo.mp4?frame={i}")))

Design

A Promise is a datatype that can be used for any component.
So a component.Point3D can be represented by a datatype.Promise.
A promise contains a single URI string.

A promise resolves to some IPC Arrow data (or an error, or pending).

The promise is resolved late, after primary caches, close to the UI/visualizer.

/// The data of a component. Is `ComponentResult` a better name?
enum ComponentResult<'data, T> {
    /// The entity doesn't have this component
    None,
    
    /// Wait for it - it is being loaded in the background
    Pending,
    
    /// Failed to load.
    Error(String),
    
    /// The data is decoded and ready.
    /// A slice into the secondary promise cache (if it was a promise)
    Data(&'data [T]),
}

impl<'data, T> ComponentResult<'data, T> {
    fn map() -> …
}
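To make the four states concrete, here is a rough Python rendering of the enum above. The signature of `map` is left open in this issue, so the version here is only one plausible reading (apply a function to the data if it is ready, otherwise pass the state through unchanged); the factory names `ready` and `failed` are invented for the sketch:

```python
from dataclasses import dataclass
from typing import Callable, Generic, Optional, Sequence, TypeVar

T = TypeVar("T")
U = TypeVar("U")


@dataclass(frozen=True)
class ComponentResult(Generic[T]):
    """Python stand-in for the Rust enum: none, pending, error or data."""

    state: str  # one of "none", "pending", "error", "data"
    data: Optional[Sequence[T]] = None
    error: Optional[str] = None

    @classmethod
    def none(cls) -> "ComponentResult[T]":
        return cls("none")

    @classmethod
    def pending(cls) -> "ComponentResult[T]":
        return cls("pending")

    @classmethod
    def failed(cls, message: str) -> "ComponentResult[T]":
        return cls("error", error=message)

    @classmethod
    def ready(cls, data: Sequence[T]) -> "ComponentResult[T]":
        return cls("data", data=data)

    def map(self, f: Callable[[Sequence[T]], Sequence[U]]) -> "ComponentResult[U]":
        """Assumed semantics: transform ready data, keep every other state as-is."""
        if self.state == "data":
            return ComponentResult("data", data=f(self.data))
        return ComponentResult(self.state, error=self.error)
```

A visualizer could then branch on `state` once, for example drawing a spinner for "pending" and a warning for "error", without caring whether the component was originally a promise.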

MVP

  • log huge files, index them after, then open the small index
  • Shortcomings:
    • Some stalling when time-scrubbing
    • No web support
    • Local files only

Steps

  • Add a PromiseCache returning ComponentResult<'a, T>
  • entity_iterator should either
    • return a MaybePromise<T> for each component (leaving it to the user to resolve)
    • or a ComponentResult<'a, T> for each component
  • Put datatype-name in the meta-data of each DataCell
  • Built-in resolver for file://…?bytes=…
    • Immediate, fseek
    • IPC Arrow data at a byte offset, or ArrowMsg at offset + index in it
  • rerun index huge.rrd > indexed.rrd
    • creates an “indexed” version of the rrd that replaces components with promises and puts the raw blobs elsewhere in the file
    • two files as alternative, but single file preferred
    • “self” uri, for referring to the same file
  • gc PromiseCache
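The cache and gc steps above could be sketched roughly as follows. This assumes a synchronous resolver (matching the "immediate" MVP resolver), a simple byte budget, and least-recently-used eviction; none of these policies are decided in this issue, and the `(state, payload)` tuples are a stand-in for `ComponentResult`:

```python
from collections import OrderedDict
from typing import Callable, List, Tuple


class PromiseCache:
    """Hypothetical promise cache: resolves URIs once and evicts old blobs."""

    def __init__(self, resolver: Callable[[str], bytes], max_bytes: int) -> None:
        self._resolver = resolver
        self._max_bytes = max_bytes
        self._entries: "OrderedDict[str, bytes]" = OrderedDict()
        self._used_bytes = 0

    def get(self, uri: str) -> Tuple[str, object]:
        """Return ("data", blob) or ("error", message) for a promise URI."""
        if uri in self._entries:
            self._entries.move_to_end(uri)  # mark as most recently used
            return ("data", self._entries[uri])
        try:
            blob = self._resolver(uri)
        except Exception as err:  # errors are reported, not cached
            return ("error", str(err))
        self._entries[uri] = blob
        self._used_bytes += len(blob)
        self.gc()
        return ("data", blob)

    def gc(self) -> None:
        """Evict least-recently-used blobs until we fit in the byte budget.

        The newest entry is always kept, even if it alone exceeds the budget.
        """
        while self._used_bytes > self._max_bytes and len(self._entries) > 1:
            _, evicted = self._entries.popitem(last=False)
            self._used_bytes -= len(evicted)

    def cached_uris(self) -> List[str]:
        return list(self._entries)
```

Errors are deliberately not cached in this sketch, so a promise that failed once is retried on the next query; whether that is desirable is an open design question.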

Post-MVP

Latency-aware

  • Start using ComponentResult in visualizers
  • make resolver async
  • Some latency resolver strategy
    • experiment with simulated latency etc.
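One way the async step might look, as a sketch only: resolution runs on a background thread pool, and each query polls for the current state instead of blocking. `AsyncPromiseResolver` and `with_simulated_latency` are hypothetical names; the latter is just a wrapper for experimenting with artificial delays:

```python
import time
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Callable, Dict, Optional, Tuple


def with_simulated_latency(
    resolver: Callable[[str], bytes], seconds: float
) -> Callable[[str], bytes]:
    """Wrap a resolver so every call sleeps first (for latency experiments)."""

    def slow_resolver(uri: str) -> bytes:
        time.sleep(seconds)
        return resolver(uri)

    return slow_resolver


class AsyncPromiseResolver:
    """Resolves promises in the background; `poll` never blocks."""

    def __init__(self, resolver: Callable[[str], bytes], max_workers: int = 4) -> None:
        self._resolver = resolver
        self._pool = ThreadPoolExecutor(max_workers=max_workers)
        self._futures: Dict[str, Future] = {}

    def poll(self, uri: str) -> Tuple[str, Optional[object]]:
        """Return ("pending", None), ("error", message) or ("data", blob)."""
        future = self._futures.get(uri)
        if future is None:
            # First time we see this promise: start loading it.
            future = self._pool.submit(self._resolver, uri)
            self._futures[uri] = future
        if not future.done():
            return ("pending", None)
        err = future.exception()
        if err is not None:
            return ("error", str(err))
        return ("data", future.result())

    def shutdown(self) -> None:
        """Wait for all in-flight resolutions to finish."""
        self._pool.shutdown(wait=True)
```

The viewer would call `poll` every frame: the first call kicks off the load and reports "pending", and a later frame picks up the data once it has arrived.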

Promise resolvers

  • Custom HTTP(S) resolver
  • VRS resolver

SDK-aware

Each of these adds additional abilities:

  • Auto-promisify sink in the SDK
  • log promise components directly: rr.log("mypoints", rr.Promise(Position3D.name, uri))
  • Support promises for all archetypes
    • Rust: replace Option<Vec<Position3D>> with MaybePromise<Vec<Position3D>>
    • Python: isinstance
    • C++: enhance or wrap Collection type
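For the Python side, the `isinstance` idea could be sketched like this. The `Promise` class, the `encode_positions` helper and the dictionary layout are all invented for illustration and do not correspond to an existing SDK API:

```python
from dataclasses import dataclass
from typing import Sequence, Tuple, Union


@dataclass(frozen=True)
class Promise:
    """Hypothetical SDK-side promise: a component name plus a single URI."""

    component_name: str
    uri: str


Points = Sequence[Tuple[float, float, float]]


def encode_positions(positions: Union[Points, Promise]) -> dict:
    """Accept either real data or a promise for the same archetype field."""
    if isinstance(positions, Promise):
        # Nothing to serialize: only the reference is logged.
        return {
            "datatype": "Promise",
            "component": positions.component_name,
            "uri": positions.uri,
        }
    # Regular path: normalize the values and log them inline.
    return {
        "datatype": "Position3D",
        "values": [tuple(float(v) for v in point) for point in positions],
    }
```

Every archetype field that accepts data would need the same branch, which is why auto-generating it alongside the Rust `MaybePromise` and C++ `Collection` changes may be preferable to writing it by hand.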
