Skip to content

RecordingStream's brittle shutdown/flush behavior #5335

@teh-cmc

Description

@teh-cmc

Context

Facts:

  • RecordingStreams are ref-counted, Send & Sync.
  • When they get Dropped, they disconnect(), causing them to flush pending batches, join pending threads, flush sinks, etc.

These two facts combined together make for a pretty brittle shutdown behavior.

It's very easy to clone and send RecordingStreams in a bunch of background threads. If any of these threads were to outlive the main thread, then the Drop implementation of RecordingStream will turn into a noop (because strong_count() > 0), and user data will be lost.
A very nasty manifestation of this is e.g. sending a RecordingStream into an ephemeral thread that is meant to compute something heavy and then log it (e.g. calling into a dataloader).

Another example is RecordingStreams being stored in process- and thread- locals.

Python

In Python, the issue is mitigated by the fact that the SDK is in itself a kind of entity with its very own lifetime, and the recordings' respective lifetimes are themselves tied to the SDK's lifetime.

When the SDK shut downs, all the recordings tied to it are shut down too, triggering the all the flushing mechanisms.

Python SDK's shutdown logic:

#[pyfunction]
fn shutdown(py: Python<'_>) {
    re_log::debug!("Shutting down the Rerun SDK");
    // Release the GIL in case any flushing behavior needs to cleanup a python object.
    py.allow_threads(|| {
        for (_, recording) in all_recordings().drain() {
            recording.disconnect();
        }
        flush_garbage_queue();
    });
}

C++

In C and C++, the issue is actually made somewhat worse since the SDK keeps its own refcount of recording streams, and implements a destructor to automatically flush them.

I.e. C++ users have to worry about the refcount issues on both the Rust and C++ side.

C++ SDK's destructor:

RecordingStream::~RecordingStream() {
    if (_id != RR_REC_STREAM_CURRENT_RECORDING && _id != RR_REC_STREAM_CURRENT_BLUEPRINT) {
        rr_recording_stream_free(this->_id);
    }
}

Proposal

The SDK should always be an actual thing that gets instantiated (whether directly by the user or automagically behind the scenes), and all recordings created with that SDK should have their lifetimes tied to it.

This SDK object can never outlive main.
All SDKs rely on this object via FFI, so lifetime management is only implemented once on the Rust side.

On shutdown, all recordings tied to that SDK are flushed.
In the future, we might want to make it configurable (NO_WAIT, WAIT_FOR_FLUSH, etc).

Metadata

Metadata

Assignees

No one assigned

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions