Skip to content

coherent interface for data_processing (panel conversions) module#1061

Merged
fkiraly merged 12 commits intomainfrom
panel-converters
Jul 17, 2021
Merged

coherent interface for data_processing (panel conversions) module#1061
fkiraly merged 12 commits intomainfrom
panel-converters

Conversation

@fkiraly
Copy link
Copy Markdown
Collaborator

@fkiraly fkiraly commented Jun 26, 2021

This PR introduces a clean interface for the panel data types conversions module.

This is fully downwards compatible, but adds the convert(from, to, scitype)(what) syntax on top of the utils.data_processing module.

This should provide users with an easy way to convert their data into the sktime required format for TSC without having to sift through the code to find the right converter utility.

I would also be planning to put this in a future section 1 for "data loading" in the TSC introduction notebook.

A number of conversions are missing, and some are lossy, so this may be a good start for future "good first issues".

@fkiraly fkiraly added API design API design & software architecture implementing framework Implementing or improving framework for learning tasks, e.g., base class functionality module:classification classification module: time series classification labels Jun 26, 2021
Copy link
Copy Markdown
Contributor

@TonyBagnall TonyBagnall left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not looked at the code yet, but we so need this, it's great you have taken it on. I'll look in detail on monday

@fkiraly
Copy link
Copy Markdown
Collaborator Author

fkiraly commented Jul 1, 2021

@TonyBagnall - which Monday?

@TonyBagnall
Copy link
Copy Markdown
Contributor

That's a bit sarcastic. I've been busy this week, and thought we should resolve 980 first in any case. I'm not sure why, after months of inaction, everything now has to be rushed through to your agenda. We are all volunteers here.

@fkiraly
Copy link
Copy Markdown
Collaborator Author

fkiraly commented Jul 1, 2021

That's a bit sarcastic.

Hah, indeed 😃

I've been busy this week, and thought we should resolve 980 first in any case.

No, don't worry! Btw, I think this is independent of #980 - it just provides a nicer interface to the existing converters, that's all.

I'm not sure why, after months of inaction, everything now has to be rushed through to your agenda.

Well, we had the dev sprint and a larger number of PR are open. I'm just trying to avoid that PR get forgotten just because no one looks at then, especially the ones that are fresh. It's more difficult if they get older and PR pile on and on.

You are doing the same with the old PR.

We are all volunteers here.

Not since you've clicked "I agree" when you signed up for slack. Did you read the fine print?

Copy link
Copy Markdown
Contributor

@TonyBagnall TonyBagnall left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure on the terminology used here. "one column per variable" doesn't mean much to me, not sure what wide-format is and as I understand it there is no standard definition for long format? Could we link to an example of each either in file or through a loader from ts format? Also n_instances and n_timepoints are clear to me, but n_columns means nothing. I would prefer n_dimensions if this is for TSC.
MTYPE_REGISTER_PANEL = [
("nested_univ", "pd.DataFrame with one column per variable, pd.Series in cells"),
("numpy3D", "3D np.array of format (n_instances, n_columns, n_timepoints)"),
("numpyflat", "2D np.array of format (n_instances, n_columnsn_timepoints)"),
("pd-multiindex", "pd.DataFrame with multi-index (instance, time point)"),
("pd-wide", "pd.DataFrame in wide format, cols = (instance
time point)"),
("pd-long", "pd.DataFrame in long format, cols = (index, time_index, column)"),
]

@fkiraly
Copy link
Copy Markdown
Collaborator Author

fkiraly commented Jul 5, 2021

Also n_instances and n_timepoints are clear to me, but n_columns means nothing

This is how it is in the docstrings of the existing converters, I thought you would recognize it in the module that you have been curating, @TonyBagnall 😃

But I agree, this entire business is in dire need of:

  • a document where all the default formats are listed
  • with examples for the format, optimally the same "scientific data" in different machine types

Could we link to an example of each either in file or through a loader from ts format?

I would very much hope so, but that would be a different scope imo. This is just cleaning up the existing module with existing docstrings.

not sure what wide-format is and as I understand it there is no standard definition for long format?

I think it's relatively standard, at least in the medical world? Not sure whether it's R specific terminology though.

Copy link
Copy Markdown
Contributor

@TonyBagnall TonyBagnall left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

based on discussion I'm happy to approve this, its certainly an improvement

@TonyBagnall
Copy link
Copy Markdown
Contributor

I'll put this in unless @mloning wants to review it? No rush if you do want a review, but also I'm doing some housekeeping, good to get things done, and this seems pretty uncontentious

@fkiraly fkiraly merged commit 9ca7587 into main Jul 17, 2021
@fkiraly fkiraly deleted the panel-converters branch July 17, 2021 08:35
fkiraly added a commit that referenced this pull request Aug 26, 2021
…rs (#1225)

This PR further consolidates the datatypes module introduced in #1061 and #1201:

* adding example fixtures for most important panel data containers
* consolidated checks for "is (some type)" in module, added some missing ones
* adding registry constants for panel data containers
* renaming leftover "what" variables to "obj"
* bugfix in converter from nested to multi-index
* adding converters for list-of-data-frames panel type used in the distance module
* added tests for checking functionality
* adding tests for conversion functionality, testing converters against fixtures
* adding docstrings where they were missing
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

API design API design & software architecture implementing framework Implementing or improving framework for learning tasks, e.g., base class functionality module:classification classification module: time series classification

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants