Skip to content

MsgId-as-a-component#655

Merged
teh-cmc merged 165 commits intomainfrom
cmc/datastore/msgid_as_component
Jan 3, 2023
Merged

MsgId-as-a-component#655
teh-cmc merged 165 commits intomainfrom
cmc/datastore/msgid_as_component

Conversation

@teh-cmc
Copy link
Copy Markdown
Contributor

@teh-cmc teh-cmc commented Dec 29, 2022

This PR elevates MsgId to the rank of Component, and makes sure that it always gets appended to every row of every message bundle!

This is required so that we can keep track of per-message data on the backend, e.g. the complete list of timepoints associated with a given row (the store doesn't have this information, since the different timelines of a timepoint live in completely separate indices!).

Note that this is not granular enough: we need an ID for each row (i.e. each event) in addition to an ID for each message!
As of today, though, a message is always 1 row, so it'll do for now.

This has to be done on the client-side / as early as possible! Trying to craft arrow payloads directly from within the datastore in the hot write path would tank the write performance down to pretty much zero.

Another nice side-effect of this is that it gets us closer to the flat-table ingestion format we've been moving towards.

Fixes #541
Unblocks #654


The benchmarks are the same before, except for the insert one of course since they literally insert 50% more components now ([Instance, Rect2D] => [Instance, Rect2D, MsgId]):

datastore/insert/batch/rects/insert
                        time:   [170.50 µs 170.68 µs 170.89 µs]
                        thrpt:  [58.517 Melem/s 58.590 Melem/s 58.653 Melem/s]
                 change:
                        time:   [+30.782% +36.733% +42.520%] (p = 0.00 < 0.05)
                        thrpt:  [-29.834% -26.865% -23.537%]
                        Performance has regressed.

datastore/insert/batch/rects/insert
            time:   [180.84 µs 184.42 µs 188.96 µs]
            thrpt:  [52.920 Melem/s 54.225 Melem/s 55.296 Melem/s]
     change:
            time:   [+1.0072% +2.1236% +3.3206%] (p = 0.00 < 0.05)
            thrpt:  [-3.2139% -2.0795% -0.9972%]
@teh-cmc teh-cmc changed the base branch from main to cmc/query/range_support December 29, 2022 14:20
Base automatically changed from cmc/query/range_support to main January 2, 2023 17:50
@teh-cmc teh-cmc marked this pull request as ready for review January 2, 2023 18:03
mod arrow;

#[derive(Clone, Copy, Debug, PartialEq, Eq, PartialOrd, Ord, Hash)]
#[derive(Clone, Copy, Debug, PartialEq, Eq, PartialOrd, Ord, Hash, arrow2_convert::ArrowField)]
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did something change that caused the derive macro to work in this context now but not previously?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not that I know of; @jondo2010 would know better than I though

Copy link
Copy Markdown
Contributor

@jleibs jleibs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the intent is for this to be similar to the cluster_key in terms of existence-guarantees, seems like we should have something in re_arrow_store that checks for it and complains loudly if it's missing on insert.

@teh-cmc
Copy link
Copy Markdown
Contributor Author

teh-cmc commented Jan 2, 2023

If the intent is for this to be similar to the cluster_key in terms of existence-guarantees, seems like we should have something in re_arrow_store that checks for it and complains loudly if it's missing on insert.

If the intent is for this to be similar to the cluster_key in terms of existence-guarantees, seems like we should have something in re_arrow_store that checks for it and complains loudly if it's missing on insert.

Hmmm, interesting observation...

Personally I'd say this is not quite on the same level as cluster keys.
Cluster keys are what make the core query features of the datastore itself actually useful in practice, and in general code within re_arrow_store can and will assume that this data in present, and won't hesitate to panic otherwise.

The lack of MsgIds on the other hand is completely transparent to the store and won't result in any issue besides some application-level features going missing (e.g. the text view showing only the timestamp of the currently selected timeline rather than all timestamps from all timelines).
So it's really more of an application-level concern I'd say, and in that sense I feel like enforcing it at the MsgBundle barrier (i.e. when exiting the SDK and/or entering the server) is the right place for this.

Now, whether others agree with me... 😶

.iter()
.map(|(component, bundle)| {
let instances = if bundle.components.len() == 1 {
let instances = if bundle.components.len() < 3 {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The relationship between the 3 here and the knowledge that MsgBundle populates a MsgId component is a bit too "magic" for my taste. Probably worth adding a comment, but in general I wonder if this check would be better done by adding a helper to MsgBundle like: has_component(&ComponentName)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah this definitely got outta hands 😄 I'll make another PR, there's a couple of helpers I'd like to see on MsgBundle actually

@teh-cmc teh-cmc merged commit f96b395 into main Jan 3, 2023
@teh-cmc teh-cmc deleted the cmc/datastore/msgid_as_component branch January 3, 2023 07:57
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

re_datastore: always-on MsgId component

2 participants