
Transform typing #420

Merged
mloning merged 20 commits into master from transform-types on Oct 26, 2020

Conversation

@mloning
Contributor

@mloning mloning commented Oct 9, 2020

Reference Issues/PRs

STEP03 on transform typing

What does this implement/fix? Explain your changes.

Major changes

There are some backward-incompatible changes here and I'm planning to make this release v0.5.0 to indicate this.

  • introduced types for transformers (series-to-primitives, series-to-series, panel-to-tabular, panel-to-panel)
  • added estimator tags
  • added more comprehensive unit testing framework for all transformer types
  • reworked RowTransformer based on the new transformer types; it no longer handles multivariate unequal-length data, so we cannot use it together with multiple random intervals. Doing this properly requires better support for unequal-length data, so this effectively removes the most modular version of the time series forest
  • reworked RandomIntervalFeatureExtractionTransformer to no longer inherit from but instead wrap RandomIntervalSegmenter (avoids changing the transform type through inheritance)
  • changed the forecaster fit API from fit(y, fh=None, X=None) to fit(y, X=None, fh=None) to make it easier to pass exogenous data via fit(y, X), as fh is rarely required in fit
  • renamed series_as_features to panel everywhere, including module/function names, as panel is shorter, easier to read, and more widely used; the exception is the sktime/series_as_features module, which refers exclusively to the classification/regression/clustering setting and not to other panel-data functionality
Minor changes
  • enhanced the all_estimators function to accept exclude and return_names keyword arguments
  • reworked time_series_slope to consistently handle multidimensional inputs, and combined it with the _lsq_fit function used in TSF
  • minor fixes
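For illustration, a least-squares slope that handles multidimensional input can be vectorized over an axis; the following is a minimal sketch of the idea, not sktime's actual time_series_slope implementation:

```python
import numpy as np

def time_series_slope(y, axis=0):
    """Slope of a least-squares line fit, vectorized over the given axis.

    Illustrative sketch only; the real sktime implementation may differ.
    """
    y = np.asarray(y, dtype=float)
    n = y.shape[axis]
    if n < 2:
        return np.zeros(np.delete(y.shape, axis))
    t = np.arange(n) - (n - 1) / 2  # centred time index, so mean(t) = 0
    shape = [1] * y.ndim
    shape[axis] = n
    t = t.reshape(shape)
    # slope = cov(t, y) / var(t); with t centred, this simplifies as below
    num = (t * (y - y.mean(axis=axis, keepdims=True))).sum(axis=axis)
    return num / (t ** 2).sum()
```

Centring the time index avoids subtracting its mean separately and keeps the computation a single pass over each row.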

Key files to review

  • sktime/transformers/base.py

@fkiraly
Collaborator

fkiraly commented Oct 19, 2020

This is a huge pull request...

Starting review with the proposed typing (from the STEP), questions/comments.

  • one thing that makes this hard to read: you don't comment on the "meaning" of arguments, i.e., what they represent mathematically, nor do you annotate what the "right dimension" or expected type should be. Sure, it is asserted to be np.ndarray, but with which parameters?
  • I'm not too familiar with the advanced functionality of the typing package, but I was wondering: for the pd.DataFrame- and np.ndarray-derived types, is it possible to define a type by an assert or checker property? E.g., insist that the array should have dimension 2 or 3?
  • given that we are re-working much of the scientific typing here: would it be worth considering use of abc, given that we already introduce typing?
  • is it going to bite us in the back if we do not properly distinguish sequences from series - series being indexed sequences?
  • why is type checking by checking inheritance preferable above a tag or property here?
  • why does BaseTransformer always expect X and y? That seems a bit superfluous and not necessary for all the types? It also may invite confusion with the supervised paradigm.
  • for psychological reasons, one could call the series capital-Y, and make X optional? Or, one could even use Z instead of X.
  • another option would be to do the following: X for exogenous variables that are known up until "present", W for exogenous variables that are known into the future, Y for the series? That would have to pull through the forecasting composition principles, but may make dealing with the featurizer easier.
  • series-to-primitives transformers are not stateful, but they could still be parametric! Imagine something like the choice of a cut-off or smoothing parameter. This is currently not exposed to tuning. I would prefer the option to inherit from BaseEstimator.
  • I also thought the sklearn FunctionTransformer operates on 2D input arrays X?
  • regarding the "featurizer", I prefer option 2, it seems cleaner regarding scitype. I would also suggest there should be a wrapper that turns a forecaster into a featurizer.
  • for series-as-features, while we're at it: what about the important case where we have a data set whose samples are tuples involving both series-as-features and primitives? E.g., the ECG of a patient plus gender/age. Would that sit in a pd.DataFrame? Is this already meant to be covered by the SeriesAsFeatures type?
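On the question of defining a type by an assert or checker property: static typing alone cannot express "np.ndarray of dimension 2", so a runtime check function is one pragmatic answer; a hedged sketch (not sktime's actual checker):

```python
import numpy as np

def check_ndim(X, ndim, name="X"):
    """Runtime check standing in for a 'dimension-typed' array.

    Hypothetical helper for illustration: typing cannot enforce array
    dimensions, so the check has to happen at runtime.
    """
    X = np.asarray(X)
    if X.ndim != ndim:
        raise ValueError(f"{name} must be {ndim}-dimensional, got ndim={X.ndim}")
    return X

X2 = check_ndim([[1, 2], [3, 4]], ndim=2)  # passes and returns the array
```

Such checkers can then be attached to named scitypes, which is roughly what the runtime checks in utils/validation do.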

@fkiraly
Collaborator

fkiraly commented Oct 19, 2020

Quick Q: would you be able to explain why the replacement for RowTransformer in the modular time series forest no longer supports unequal length random intervals?

If I understand correctly, the RowTransformer is now replaced by SeriesToPrimitivesRowTransformer and SeriesToSeriesRowTransformer? Is this because np.ndarray is assumed throughout, as opposed to flexibility in input/output type?

@fkiraly
Collaborator

fkiraly commented Oct 20, 2020

?

@fkiraly
Collaborator

fkiraly commented Oct 22, 2020

?? @mloning

@mloning
Contributor Author

mloning commented Oct 22, 2020

Yes @fkiraly, wanted to discuss your points in our next meeting, thanks for the feedback!

To answer your quick question: the reason is that unequal-length intervals give unequal-length panel data. Even though we can represent this in the nested pd.DataFrame, we don't really have functionality to deal with that yet; the previous implementation was very hacky, which is why I removed it.
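For illustration, the nested pd.DataFrame representation can hold unequal-length panel data, even though little functionality operates on it yet (a sketch, not sktime code):

```python
import pandas as pd

# nested representation: rows are instances, columns are variables,
# and each cell holds a whole pd.Series
X = pd.DataFrame({
    "dim_0": [
        pd.Series([1.0, 2.0, 3.0]),  # instance 0: length 3
        pd.Series([4.0, 5.0]),       # instance 1: length 2 (unequal!)
    ]
})
lengths = X["dim_0"].apply(len)  # per-instance series lengths
```

The representation is fine; the hacky part is that most downstream transformers assume equal lengths once data is converted to arrays.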

@fkiraly
Collaborator

fkiraly commented Oct 23, 2020

Note on the modular time series forest: the source of the problem is not being able to properly handle samples of multivariate series with unequally sampled time points and unequal lengths.
This was hacky in the original implementation, and seems to lead to problems with "apply" whichever representation you pick, e.g.,

  • series in data frames
  • data frames in data frames

while a numpy array is not possible due to the unequal-length issue.
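The numpy limitation is easy to see: ragged instances cannot form a rectangular array, only an object array of per-instance arrays, which loses vectorization (sketch):

```python
import numpy as np

rows = [np.array([1.0, 2.0, 3.0]), np.array([4.0, 5.0])]  # unequal lengths
try:
    X = np.array(rows)  # ragged: no rectangular 2-D array exists
except ValueError:
    # newer numpy refuses ragged input; fall back to an explicit object array
    X = np.empty(len(rows), dtype=object)
    X[:] = rows
```

Either way one ends up with dtype=object, i.e., a box of arrays rather than a true 2-D array, so nothing numerical can be applied column-wise.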

@fkiraly fkiraly self-requested a review October 24, 2020 21:28
@fkiraly
Collaborator

fkiraly commented Oct 24, 2020

I've looked at the code in greater detail now.
I've focused on the treatment of transformers and their types, as in:

  • transformers/base for the template
  • contents of the various sub-folders of transformers to understand the templates and type check conventions
  • utils/validation to understand the type checks

I agree with the general design, I think this is making things much clearer and cleaner.

Major comments:

  1. I think we very much need to write up template designs, type conventions, and extension conventions. This information is typically not suitably covered by an auto-generated API reference, as it is distributed across classes - the generic template, and the conventions implemented in the concrete descendants. Something like the sklearn developer guidelines could work, but I'd actually prefer to also have "formal" template definitions collected in one place. While the typing makes things clearer, it also increases the need to document the interface cleanly.
  2. I like the idea of using typing for the definition of templates, checking, and simultaneous treatment of input types. However, I think we are now introducing two conceptually disconnected systems, with the typing type checks and the separate runtime type checks in utils/validation - compare for example the VALID_DATA_TYPES in validation.series and the UnivariateSeries/MultivariateSeries types in transformers.base.
    I think we need to think carefully how we best streamline this, while avoiding duplication or design inconsistency.
    I'll discuss a potential way below.

Minor comments:

  • I found the rationale behind the choice of argument names (X or Z or what?) a bit odd. Not sure what the rule is, but it looked inconsistent?
  • not all panel transformers look like they are "full" panel transformers; some are row-wise. Are these correctly classified?
  • not sure whether we want to adopt the sklearn convention for random_state, but currently we don't adhere to it? Similarly for dealing with optional arguments to __init__ in the template(s), and args or kwargs.

@fkiraly
Collaborator

fkiraly commented Oct 24, 2020

Regarding the issue of simultaneous use of typing and runtime type checks:

I think the most elegant way would be to use the typing, and perhaps annotated types, to introduce generic type checks for function arguments.
This could work by using decorators for input/output checks on methods such as fit and transform, similar to what we had discussed earlier.

This can be covered by two parts:

  • Single-argument type checks, and input/output conversions between different machine types with the same scitype.
    This can be done by a generic decorator which is applicable to any such function.
    Why: the typing of the decorated function allows look-up of the scitype via get_type_hints in the decorator. Then, one can dispatch to type-specific checks on each argument.
    This can replace check_X and check_y and the like.
  • joint argument type checks, such as whether X and y are of equal length, or other compatibility. This would replace check_y_X and the like. Since the current such functions call check_X and check_y, they could easily be replaced by decoration of the first kind.

The VALID_DATA_TYPES could be rolled into union type definitions or scitypes.
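The decorator idea could look roughly like this: read the annotations of the wrapped callable via get_type_hints and dispatch each argument to a registered check. The registry and names here are hypothetical, purely to illustrate the mechanism:

```python
import functools
import inspect
from typing import get_type_hints

import numpy as np
import pandas as pd

# hypothetical registry mapping an annotated type to its runtime check
CHECKS = {
    pd.Series: lambda x: isinstance(x, pd.Series),
    np.ndarray: lambda x: isinstance(x, np.ndarray),
}

def check_inputs(func):
    """Decorator dispatching per-argument runtime checks from type hints."""
    hints = get_type_hints(func)
    sig = inspect.signature(func)

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        bound = sig.bind(*args, **kwargs)
        for name, value in bound.arguments.items():
            check = CHECKS.get(hints.get(name))
            if check is not None and not check(value):
                raise TypeError(f"{name} failed the check for {hints[name]}")
        return func(*args, **kwargs)

    return wrapper

@check_inputs
def transform(Z: pd.Series) -> pd.Series:
    # toy transform: demean the series
    return Z - Z.mean()
```

The single-argument checks then live in one registry, and joint checks (e.g. equal lengths of X and y) could be a second decorator layered on top.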

Collaborator

@fkiraly fkiraly left a comment


Questions for discussion above.
Won't approve/request since I think the typing and doc issue may be good to discuss.

@mloning
Contributor Author

mloning commented Oct 26, 2020

@fkiraly thanks again for the review, merged now, but happy to discuss this further and push new changes!

@fkiraly
Collaborator

fkiraly commented Oct 26, 2020

So, is the idea to have this as a "working version" and deal with the type checks later?
You haven't really replied to that, and you've pushed without a PR approval, using your admin privileges...
In my understanding of best PR practice, you should say why, whether you agree, and/or try to clarify outstanding questions.

@mloning
Contributor Author

mloning commented Oct 26, 2020

Sorry if that wasn't clear @fkiraly! As discussed on Friday, other work depends on this and I want to move forward. You also didn't request these changes, so I thought they were points for discussion rather than change requests.

Regarding your comments:

  • we're already using scikit-learn's check_random_state functions and their conventions for handling random states, have we missed some cases?
  • using X and y as standard argument names is even more confusing (consider the detrender, where X is y and y is X), hence I changed it to Z for the series and X for exogenous variables (that is also what you suggested independently in your first review above)
  • I agree about tidying up the different type checks/notations and the extension guidelines, but given the limited type support and the problems of subclassing numpy or pandas, this isn't straightforward, unfortunately
  • would you suggest another type distinction between full-panel and row-wise panel transformers?

@fkiraly
Collaborator

fkiraly commented Oct 26, 2020

we're already using scikit-learn's check_random_state functions and their conventions for handling random states, have we missed some cases?

But not all learners can be given random_state arguments? That would cause problems in composition, when "passing through" the random state.

Using X and y as standard argument names is even more confusing (consider the detrender where X is y and y is X), hence changed it to Z for series and X for exogenous (that was also what you suggest independently in your first review above)

I suspected this, but what is the "rule" in assigning letters here? Is it consistent across scitypes?

I agree about tidying up the different type checks/notations and extension guidelines but given the limited type support and problems of subclassing numpy or pandas, this isn't straightforward unfortunately

Well, at least the extension guidelines are straightforward.
I agree that the type checks need some thinking, but it's also not insurmountable. I feel if this is not done cleanly, we'll end up again with a major refactor or other technical debt that we could otherwise avoid.

would you suggest another type distinction between full panel and row wise panel transformers?

My intuition would tell me to get the "row wise panel transformers" from consistently applying a factory, decorator, or wrapper/reduction method to row wise transformers, to obtain panel versions.
(the caveat here is efficiency)

Perhaps add a tag that says a panel transformer is row-wise, rather than a full scitype, but not sure whether that's useful anywhere.


Labels

implementing framework: Implementing or improving framework for learning tasks, e.g., base class functionality
maintenance: Continuous integration, unit testing & package distribution
