re_datastore: dump to flat dataframe #645

Merged
teh-cmc merged 156 commits into main from cmc/datastore/dump_flat_df
Jan 2, 2023

@teh-cmc teh-cmc commented Dec 26, 2022

Fixes #640
Requires #609 (to keep things linear, don't want to have to deal with a multiverse situation...)

This also adds a new datastore config: store_insert_ids.
When enabled (default in debug), the ID of a write request will be stored alongside the data that was written.
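
A minimal sketch of how such a debug-only default could look. Only the `store_insert_ids` name comes from this PR; `StoreConfig`, `StoredRow`, and `store_row` are hypothetical stand-ins for illustration and are not the actual re_datastore types.

```rust
/// Hypothetical stand-in for the datastore configuration.
/// Only the `store_insert_ids` field name is taken from the PR description.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct StoreConfig {
    /// If true, the ID of each write request is kept next to the written data.
    pub store_insert_ids: bool,
}

impl Default for StoreConfig {
    fn default() -> Self {
        Self {
            // `cfg!(debug_assertions)` evaluates to true in debug builds
            // and to false in release builds.
            store_insert_ids: cfg!(debug_assertions),
        }
    }
}

/// Hypothetical stored row: the payload plus an optional insert ID.
#[derive(Debug, PartialEq, Eq)]
pub struct StoredRow {
    pub insert_id: Option<u64>,
    pub payload: String,
}

/// Stores `payload`, recording `insert_id` only when the config asks for it.
pub fn store_row(config: &StoreConfig, insert_id: u64, payload: &str) -> StoredRow {
    StoredRow {
        insert_id: config.store_insert_ids.then_some(insert_id),
        payload: payload.to_owned(),
    }
}
```

With the flag off, the extra bookkeeping is skipped entirely, which is presumably why it defaults to on only in debug builds.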

This replaces the doc-comment-examples of all dataframe APIs with actual standalone examples.


Example:

let mut store = DataStore::new(Instance::name(), Default::default());

let ent_paths = [
    EntityPath::from("this/that"),
    EntityPath::from("and/this/other/thing"),
];

for ent_path in &ent_paths {
    let bundle1 = test_bundle!(ent_path @ [
        build_frame_nr(1.into()), build_log_time(Time::now()),
    ] => [build_some_instances(2), build_some_rects(2)]);
    store.insert(&bundle1).unwrap();
}

for ent_path in &ent_paths {
    let bundle2 = test_bundle!(ent_path @ [
        build_frame_nr(2.into())
    ] => [build_some_instances(2), build_some_point2d(2)]);
    store.insert(&bundle2).unwrap();

    let bundle3 = test_bundle!(ent_path @ [
        build_frame_nr(3.into()), build_log_time(Time::now()),
    ] => [build_some_instances_from(25..29), build_some_point2d(4)]);
    store.insert(&bundle3).unwrap();
}

for ent_path in &ent_paths {
    let bundle4_1 = test_bundle!(ent_path @ [
        build_frame_nr(4.into()), build_log_time(Time::now()),
    ] => [build_some_instances_from(20..23), build_some_rects(3)]);
    store.insert(&bundle4_1).unwrap();

    let bundle4_15 = test_bundle!(ent_path @ [
        build_frame_nr(4.into()),
    ] => [build_some_instances_from(20..23), build_some_point2d(3)]);
    store.insert(&bundle4_15).unwrap();

    let bundle4_2 = test_bundle!(ent_path @ [
        build_frame_nr(4.into()), build_log_time(Time::now()),
    ] => [build_some_instances_from(25..28), build_some_rects(3)]);
    store.insert(&bundle4_2).unwrap();

    let bundle4_25 = test_bundle!(ent_path @ [
        build_frame_nr(4.into()), build_log_time(Time::now()),
    ] => [build_some_instances_from(25..28), build_some_point2d(3)]);
    store.insert(&bundle4_25).unwrap();
}

let df = store.to_dataframe();
eprintln!("{df}");

Outputs:
[image: the printed dataframe]

@teh-cmc teh-cmc changed the base branch from main to cmc/datastore/range_queries2 December 26, 2022 10:51
@teh-cmc teh-cmc marked this pull request as ready for review December 26, 2022 11:17

@jleibs jleibs left a comment

Very helpful way to view the data for debugging.

Comment on lines +144 to +163
let comp_values: Vec<_> = comp_rows
    .into_iter()
    .map(|row| row.unwrap_or_else(|| new_empty_array(comp_table.datatype.clone())))
    .collect();
let comp_values: Vec<_> = comp_values.iter().map(|arr| &**arr).collect();

// Each cell is actually a list, so we need to compute offsets one cell at a time.
let mut offset = 0i32;
let comp_offsets: Vec<_> = comp_values
    .iter()
    .map(|row| {
        let ret = offset;
        offset += row.len() as i32;
        ret
    })
    // don't forget the last element!
    .chain(std::iter::once(
        comp_values.iter().map(|row| row.len() as i32).sum(),
    ))
    .collect();

The empty arrays don't actually need to be created since they take up zero space in the value-array -- you just need to book-keep them in the offset array.

Suggested change
// Each cell is actually a list, so we need to compute offsets one cell at a time.
let mut offset = 0i32;
let comp_offsets: Vec<_> = std::iter::once(0)
    .chain(comp_rows.iter().map(|row| {
        offset += row.as_ref().map_or(0, |row| row.len()) as i32;
        offset
    }))
    .collect();
let comp_values: Vec<_> = comp_rows.iter().flatten().map(|row| row.as_ref()).collect();
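
To illustrate the bookkeeping described in the review comment, here is a dependency-free sketch of the same offset computation. It assumes plain `Option<Vec<T>>` rows instead of Arrow arrays, and `flatten_rows` is a hypothetical helper name, not part of re_datastore.

```rust
/// Builds list offsets and a flattened value buffer from optional rows.
///
/// A missing row (`None`) contributes no values; it shows up only as a
/// repeated offset, i.e. an empty list cell.
pub fn flatten_rows<T: Clone>(rows: &[Option<Vec<T>>]) -> (Vec<i32>, Vec<T>) {
    let mut offset = 0i32;
    // Offsets start at 0, followed by the running total after each row.
    let offsets: Vec<i32> = std::iter::once(0)
        .chain(rows.iter().map(|row| {
            offset += row.as_ref().map_or(0, |row| row.len()) as i32;
            offset
        }))
        .collect();
    // The first `flatten` skips `None` rows, the second walks each row's values.
    let values: Vec<T> = rows.iter().flatten().flatten().cloned().collect();
    (offsets, values)
}
```

Row `i` spans `values[offsets[i]..offsets[i + 1]]`, so for three rows where the middle one is missing, the two middle offsets are equal and no placeholder array is ever allocated.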

Contributor Author

jesus, nevermind the empty arrays, none of that code made any sense 😂

Base automatically changed from cmc/datastore/range_queries2 to main January 2, 2023 16:28
@teh-cmc teh-cmc merged commit 78b6486 into main Jan 2, 2023
@teh-cmc teh-cmc deleted the cmc/datastore/dump_flat_df branch January 2, 2023 17:04

Successfully merging this pull request may close these issues.

re_datastore: polars_util::dump_to_flat_df

2 participants