coherent interface for data_processing (panel conversions) module#1061
Conversation
TonyBagnall
left a comment
Not looked at the code yet, but we so need this; it's great you have taken it on. I'll look in detail on Monday.
@TonyBagnall - which Monday?
That's a bit sarcastic. I've been busy this week, and thought we should resolve 980 first in any case. I'm not sure why, after months of inaction, everything now has to be rushed through to your agenda. We are all volunteers here.
Hah, indeed 😃
No, don't worry! Btw, I think this is independent of #980 - it just provides a nicer interface to the existing converters, that's all.
Well, we had the dev sprint and a large number of PRs are open. I'm just trying to avoid PRs getting forgotten just because no one looks at them, especially the ones that are fresh. It's more difficult once they get older and more PRs pile up. You are doing the same with the old PRs.
Not since you've clicked "I agree" when you signed up for slack. Did you read the fine print?
TonyBagnall
left a comment
I'm not sure about the terminology used here. "one column per variable" doesn't mean much to me, I'm not sure what wide-format is, and as I understand it there is no standard definition for long format? Could we link to an example of each, either in a file or through a loader from ts format? Also, n_instances and n_timepoints are clear to me, but n_columns means nothing. I would prefer n_dimensions if this is for TSC.
MTYPE_REGISTER_PANEL = [
("nested_univ", "pd.DataFrame with one column per variable, pd.Series in cells"),
("numpy3D", "3D np.array of format (n_instances, n_columns, n_timepoints)"),
("numpyflat", "2D np.array of format (n_instances, n_columns*n_timepoints)"),
("pd-multiindex", "pd.DataFrame with multi-index (instance, time point)"),
("pd-wide", "pd.DataFrame in wide format, cols = (instance*time point)"),
("pd-long", "pd.DataFrame in long format, cols = (index, time_index, column)"),
]
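To make the register entries above concrete, here is a minimal sketch (plain numpy/pandas, not the actual sktime converters; names like `var_0` are made up for the example) of what the `numpy3D` and `nested_univ` layouts look like for the same data:

```python
import numpy as np
import pandas as pd

# numpy3D: a 3D array of shape (n_instances, n_columns, n_timepoints),
# i.e. one variable ("column"/"dimension") per axis-1 slice
n_instances, n_columns, n_timepoints = 2, 3, 4
X_3d = np.arange(
    n_instances * n_columns * n_timepoints, dtype=float
).reshape(n_instances, n_columns, n_timepoints)

# nested_univ: a DataFrame with one row per instance and one column per
# variable, where each cell holds a whole univariate series as a pd.Series
X_nested = pd.DataFrame(
    {
        f"var_{j}": [pd.Series(X_3d[i, j, :]) for i in range(n_instances)]
        for j in range(n_columns)
    }
)

print(X_nested.shape)  # (2, 3): 2 instances, 3 variables; time lives inside cells
```

Note that in both layouts the same `(instance, variable, time)` information is present; only the container differs, which is why lossless conversion between them is possible.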
This is how it is in the docstrings of the existing converters, I thought you would recognize it in the module that you have been curating, @TonyBagnall 😃 But I agree, this entire business is in dire need of:
I would very much hope so, but that would be a different scope imo. This is just cleaning up the existing module with existing docstrings.
I think it's relatively standard, at least in the medical world? Not sure whether it's R-specific terminology though.
TonyBagnall
left a comment
Based on the discussion I'm happy to approve this, it's certainly an improvement.
I'll put this in unless @mloning wants to review it? No rush if you do want a review, but also I'm doing some housekeeping, good to get things done, and this seems pretty uncontentious.
…rs (#1225) This PR further consolidates the datatypes module introduced in #1061 and #1201:
* adding example fixtures for the most important panel data containers
* consolidated checks for "is (some type)" in module, added some missing ones
* adding registry constants for panel data containers
* renaming leftover "what" variables to "obj"
* bugfix in converter from nested to multi-index
* adding converters for the list-of-data-frames panel type used in the distance module
* added tests for checking functionality
* adding tests for conversion functionality, testing converters against fixtures
* adding docstrings where they were missing
This PR introduces a clean interface for the panel data types conversions module.
This is fully downwards compatible, but adds the `convert(from, to, scitype)(what)` syntax on top of the `utils.data_processing` module. This should provide users with an easy way to convert their data into the format `sktime` requires for TSC, without having to sift through the code to find the right converter utility. I am also planning to put this in a future section 1 on "data loading" in the TSC introduction notebook.
A number of conversions are missing, and some are lossy, so this may be a good start for future "good first issues".