Dataframe API v2 p1: API definitions #7559

Merged
teh-cmc merged 4 commits into main from cmc/dataframev2_1_api_def
Oct 2, 2024

@teh-cmc teh-cmc commented Oct 1, 2024

The new public API definition and nothing else. Speak now.

Checklist

  • I have read and agree to Contributor Guide and the Code of Conduct
  • I've included a screenshot or gif (if applicable)
  • I have tested the web demo (if applicable):
  • The PR title and labels are set such as to maximize their usefulness for the next release's CHANGELOG
  • If applicable, add a new check to the release checklist!
  • I have noted any breaking changes to the log API in CHANGELOG.md and the migration guide

To run all checks from main, comment on the PR with @rerun-bot full-check.

@teh-cmc teh-cmc added the labels 🔍 re_query (affects re_query itself), do-not-merge (Do not merge this PR), and include in changelog Oct 1, 2024
@teh-cmc teh-cmc marked this pull request as ready for review October 1, 2024 15:24
@teh-cmc teh-cmc force-pushed the cmc/dataframev2_1_api_def branch from 18886e3 to a989ac2 Compare October 1, 2024 16:55

/// How the data will be joined into the resulting `RecordBatch`.
//
// TODO(cmc): remove with the old re_dataframe.
Contributor

Why? I believe we still need this.

Contributor Author

Right now the API is a row iterator, there's nothing to join 🤷

pub struct ControlColumnSelector {
/// Name of the control column.
//
// TODO(cmc): this should be `component_name`.
Contributor

That seems unnecessarily verbose. We generally refer to components by name. For example, we write things like "The Point3D component" and consider it synonymous with "The component named Point3D".

Contributor Author

Right now the convention across the codebase is that `component` is the actual component data (`component: impl Component`), whereas `component_name` is just a name (`component_name: ComponentName`).
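
To make that naming convention concrete, here is a minimal sketch. The `ComponentName` struct, the `Component` trait, and `Position3D` below are simplified, hypothetical stand-ins written for this illustration; they are not the real definitions from the codebase, and `describe_data` / `describe_name` are made-up helpers.

```rust
use std::fmt::Debug;

/// Hypothetical stand-in for a component's name (illustration only).
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct ComponentName(pub String);

/// Hypothetical stand-in for the `Component` trait (illustration only).
pub trait Component {
    /// The name under which this component type is known.
    fn name() -> ComponentName;
}

/// A made-up component type carrying actual data.
#[derive(Debug, Clone, PartialEq)]
pub struct Position3D(pub [f32; 3]);

impl Component for Position3D {
    fn name() -> ComponentName {
        ComponentName("Position3D".to_owned())
    }
}

/// A parameter called `component` carries the component data itself.
pub fn describe_data(component: &(impl Component + Debug)) -> String {
    format!("{component:?}")
}

/// A parameter called `component_name` carries only the name, never the data.
pub fn describe_name(component_name: &ComponentName) -> String {
    component_name.0.clone()
}
```

Under this reading, a selector field that only stores a name would be called `component_name`, which is what the TODO in the diff suggests.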

/// view contents: it is possible to end up with values from outside the view!
LatestAtGlobal,
//
// TODO(cmc): `LatestAtView`?
Contributor

👍 or more importantly, LatestAtWindowed(...)

Base automatically changed from cmc/dataframev2_0_chunk_stuff to main October 2, 2024 09:53
@teh-cmc teh-cmc force-pushed the cmc/dataframev2_1_api_def branch from 7d1cb72 to 39cfb1a Compare October 2, 2024 10:02
Co-authored-by: Jeremy Leibs <jeremy@rerun.io>
@teh-cmc teh-cmc removed the do-not-merge (Do not merge this PR) label Oct 2, 2024
@teh-cmc teh-cmc merged commit 3581ca4 into main Oct 2, 2024
@teh-cmc teh-cmc deleted the cmc/dataframev2_1_api_def branch October 2, 2024 10:07
teh-cmc added a commit that referenced this pull request Oct 2, 2024
A first implementation of the new dataframe APIs.
The name is now very misleading though: there isn't anything dataframe-y
left in here, it is a row-based iterator with Rerun semantics baked in,
driven by a sorted streaming join.

It is rather slow (related:
#7558 (comment)),
lacks many features and is full of edge cases, but it works.
It does support dedupe-latest semantics (slowly), view contents and
selections, chunk overlaps, and pagination (horribly, by virtue of
implementing `Iterator`).
It does _not_ support `Clear`s, nor `latest-at` sparse-filling, nor
PoVs, nor index sampling. Yet.

Upcoming PRs will be all about fixing these shortcomings one by one.

It should look somewhat familiar:
```rust
let query_cache = QueryCache::new(store);
let query_engine = QueryEngine {
    store,
    cache: &query_cache,
};

let mut query = QueryExpression2::new(timeline);
query.view_contents = Some(
    query_engine
        .iter_entity_paths(&entity_path_filter)
        .map(|entity_path| (entity_path, None))
        .collect(),
);
query.filtered_index_range = Some(ResolvedTimeRange::new(time_from, time_to));
eprintln!("{query:#?}:");

let query_handle = query_engine.query(query.clone());
// eprintln!("{:#?}", query_handle.selected_contents());
for batch in query_handle.into_batch_iter().skip(offset).take(len) {
    eprintln!("{batch}");
}
```

No tests until we have the guarantee that these are the semantics we
will commit to.

* Part of #7495 
* Requires #7559
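
The commit message above describes a row-based iterator driven by a sorted streaming join, with dedupe-latest semantics and pagination through `Iterator`. The following is a toy sketch of that general idea only, not the code from the commit: `SortedJoin`, `Row`, and `take_latest` are hypothetical names, it handles exactly two columns, and it assumes "dedupe-latest" means that the last value logged at a given index wins. It models none of the Rerun-specific features mentioned above (`Clear`s, `latest-at` sparse-filling, PoVs, index sampling).

```rust
use std::iter::Peekable;
use std::vec::IntoIter;

/// One output row: the index value plus one optional cell per column.
pub type Row = (i64, Option<&'static str>, Option<&'static str>);

/// A column of `(index, value)` pairs, consumed front to back.
type Column = Peekable<IntoIter<(i64, &'static str)>>;

/// Hypothetical row iterator that joins two index-sorted columns.
pub struct SortedJoin {
    left: Column,
    right: Column,
}

impl SortedJoin {
    /// Both inputs must already be sorted by index, in ascending order.
    pub fn new(left: Vec<(i64, &'static str)>, right: Vec<(i64, &'static str)>) -> Self {
        Self {
            left: left.into_iter().peekable(),
            right: right.into_iter().peekable(),
        }
    }
}

/// Consumes every entry at `index` from `column`, keeping only the last one.
fn take_latest(column: &mut Column, index: i64) -> Option<&'static str> {
    let mut latest = None;
    while let Some((_, value)) = column.next_if(|(i, _)| *i == index) {
        latest = Some(value);
    }
    latest
}

impl Iterator for SortedJoin {
    type Item = Row;

    fn next(&mut self) -> Option<Row> {
        // The next row sits at the smallest index still pending in either column.
        let left_index = self.left.peek().map(|(i, _)| *i);
        let right_index = self.right.peek().map(|(i, _)| *i);
        let index = match (left_index, right_index) {
            (Some(l), Some(r)) => l.min(r),
            (Some(i), None) | (None, Some(i)) => i,
            (None, None) => return None,
        };

        // A column with no entry at `index` contributes an empty cell.
        let left = take_latest(&mut self.left, index);
        let right = take_latest(&mut self.right, index);
        Some((index, left, right))
    }
}
```

Because the join is an ordinary `Iterator`, pagination falls out of `skip` and `take`, as in the `skip(offset).take(len)` call above. The cost is that every skipped row is still computed before being discarded, which is one plausible reading of why the commit message calls this approach to pagination horrible.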
@teh-cmc teh-cmc changed the title from "Dataframe API v2 #1: API definitions" to "Dataframe API v2 1: API definitions" Oct 3, 2024
@teh-cmc teh-cmc changed the title from "Dataframe API v2 1: API definitions" to "Dataframe API v2 p1: API definitions" Oct 3, 2024
Labels

🔍 re_query affects re_query itself

3 participants