
Transform typing #420

Merged
mloning merged 20 commits into master from transform-types on Oct 26, 2020

Conversation

@mloning
Contributor

@mloning mloning commented Oct 9, 2020

Reference Issues/PRs

STEP03 on transform typing

What does this implement/fix? Explain your changes.

Major changes

There are some backward-incompatible changes here and I'm planning to make this release v0.5.0 to indicate this.

  • introduced types for transformers (series-to-primitives, series-to-series, panel-to-tabular, panel-to-panel)
  • added estimator tags
  • added more comprehensive unit testing framework for all transformer types
  • reworked RowTransformer based on the new transformer types; it no longer handles multivariate unequal-length data, so we cannot use it together with multiple random intervals. Doing this properly requires better support for unequal-length data, so this effectively removes the most modular version of the time series forest
  • reworked RandomIntervalFeatureExtractionTransformer to no longer inherit from but instead wrap RandomIntervalSegmenter (avoids changing the transform type through inheritance)
  • changed the forecaster fit API from fit(y, fh=None, X=None) to fit(y, X=None, fh=None) to make it easier to pass exogenous data via fit(y, X), as fh is rarely required in fit
  • renamed series_as_features to panel everywhere, including module/function names, as panel is shorter, easier to read, and more widely used; the exception is the sktime/series_as_features module, which refers exclusively to the classification/regression/clustering setting and not to other panel-data functionality
Minor changes
  • enhanced the all_estimators function to accept exclude and return_names keyword arguments
  • reworked time_series_slope to consistently handle multidimensional inputs, and combined it with the _lsq_fit function used in TSF
  • minor fixes
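For illustration, a least-squares slope that handles multidimensional input can be vectorized over an axis; the following is a minimal sketch of the idea, not sktime's actual time_series_slope implementation:

```python
import numpy as np

def time_series_slope(y, axis=0):
    """Slope of a least-squares line fit, vectorized over the given axis.

    Illustrative sketch only; the real sktime implementation may differ.
    """
    y = np.asarray(y, dtype=float)
    n = y.shape[axis]
    if n < 2:
        return np.zeros(np.delete(y.shape, axis))
    t = np.arange(n) - (n - 1) / 2  # centred time index, so mean(t) = 0
    shape = [1] * y.ndim
    shape[axis] = n
    t = t.reshape(shape)
    # slope = cov(t, y) / var(t); with t centred, this simplifies as below
    num = (t * (y - y.mean(axis=axis, keepdims=True))).sum(axis=axis)
    return num / (t ** 2).sum()
```

Centring the time index avoids subtracting its mean separately and keeps the computation a single pass over each row.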

Key files to review

  • sktime/transformers/base.py

@fkiraly
Collaborator

fkiraly commented Oct 19, 2020

This is a huge pull request...

Starting review with the proposed typing (from the STEP), questions/comments.

  • one thing that makes this hard to read: you don't comment on the "meaning" of arguments, i.e., what they represent mathematically, nor do you annotate what the "right dimension" or expected type should be. Sure, it is asserted to be np.ndarray, but with which parameters?
  • I'm not too familiar with the advanced functionality of the typing package, but I was wondering: for the pd.DataFrame- and np.ndarray-derived types, is it possible to define a type by an assert or checker property? E.g., insist that the array should have dimension 2 or 3?
  • given that we are re-working much of the scientific typing here: would it be worth considering use of abc, given that we already introduce typing?
  • is it going to bite us in the back if we do not properly distinguish sequences from series - series being indexed sequences?
  • why is type checking by checking inheritance preferable above a tag or property here?
  • why does BaseTransformer always expect X and y? That seems a bit superfluous and not necessary for all the types? It also may invite confusion with the supervised paradigm.
  • for psychological reasons, one could call the series capital-Y, and make X optional? Or, one could even use Z instead of X.
  • another option would be to do the following: X for exogenous variables that are known up until "present", W for exogenous variables that are known into the future, Y for the series? That would have to pull through the forecasting composition principles, but may make dealing with the featurizer easier.
  • series-to-primitives transformers are not stateful, but they could still be parametric! Imagine something like the choice of a cut-off or smoothing parameter. This is currently not exposed to tuning. I would prefer the option to inherit from BaseEstimator.
  • I also thought the sklearn FunctionTransformer operates on 2D input arrays X?
  • regarding the "featurizer", I prefer option 2, it seems cleaner regarding scitype. I would also suggest there should be a wrapper that turns a forecaster into a featurizer.
  • for series-as-features, while we're at it: what about the important case where we have a data set whose samples are tuples involving both series-as-features and primitives? E.g., the ECG of a patient plus gender/age. Would that sit in a pd.DataFrame? Is this already meant to be covered by the SeriesAsFeatures type?
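On the question of defining a type by an assert or checker property: static typing alone cannot express "np.ndarray of dimension 2", so a runtime check function is one pragmatic answer; a hedged sketch (not sktime's actual checker):

```python
import numpy as np

def check_ndim(X, ndim, name="X"):
    """Runtime check standing in for a 'dimension-typed' array.

    Hypothetical helper for illustration: typing cannot enforce array
    dimensions, so the check has to happen at runtime.
    """
    X = np.asarray(X)
    if X.ndim != ndim:
        raise ValueError(f"{name} must be {ndim}-dimensional, got ndim={X.ndim}")
    return X

X2 = check_ndim([[1, 2], [3, 4]], ndim=2)  # passes and returns the array
```

Such checkers can then be attached to named scitypes, which is roughly what the runtime checks in utils/validation do.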

@fkiraly
Collaborator

fkiraly commented Oct 19, 2020

Quick Q: would you be able to explain why the replacement for RowTransformer in the modular time series forest no longer supports unequal length random intervals?

If I understand correctly, the RowTransformer is now replaced by SeriesToPrimitivesRowTransformer and SeriesToSeriesRowTransformer? Is this because np.ndarray is assumed throughout, as opposed to flexibility in input/output type?

@fkiraly
Collaborator

fkiraly commented Oct 20, 2020

?

@fkiraly
Collaborator

fkiraly commented Oct 22, 2020

?? @mloning

@mloning
Contributor Author

mloning commented Oct 22, 2020

Yes @fkiraly, wanted to discuss your points in our next meeting, thanks for the feedback!

To answer your quick question: the reason is that unequal-length intervals give unequal-length panel data. Even though we can represent this in the nested pd.DataFrame, we don't really have functionality to deal with that yet; the previous implementation was very hacky, which is why I removed it.
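For illustration, the nested pd.DataFrame representation can hold unequal-length panel data, even though little functionality operates on it yet (a sketch, not sktime code):

```python
import pandas as pd

# nested representation: rows are instances, columns are variables,
# and each cell holds a whole pd.Series
X = pd.DataFrame({
    "dim_0": [
        pd.Series([1.0, 2.0, 3.0]),  # instance 0: length 3
        pd.Series([4.0, 5.0]),       # instance 1: length 2 (unequal!)
    ]
})
lengths = X["dim_0"].apply(len)  # per-instance series lengths
```

The representation is fine; the hacky part is that most downstream transformers assume equal lengths once data is converted to arrays.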

@fkiraly
Collaborator

fkiraly commented Oct 23, 2020

Note on the modular time series forest: the source of the problem is not being able to properly handle samples of multivariate series with unequally sampled time points and unequal lengths.
This was hacky in the original implementation, and seems to lead to problems with "apply" whichever representation you pick, e.g.,

  • series in data frames
  • data frames in data frames

while a numpy array is not possible due to the unequal-length issue.
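The numpy limitation is easy to see: ragged instances cannot form a rectangular array, only an object array of per-instance arrays, which loses vectorization (sketch):

```python
import numpy as np

rows = [np.array([1.0, 2.0, 3.0]), np.array([4.0, 5.0])]  # unequal lengths
try:
    X = np.array(rows)  # ragged: no rectangular 2-D array exists
except ValueError:
    # newer numpy refuses ragged input; fall back to an explicit object array
    X = np.empty(len(rows), dtype=object)
    X[:] = rows
```

Either way one ends up with dtype=object, i.e., a box of arrays rather than a true 2-D array, so nothing numerical can be applied column-wise.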

@fkiraly fkiraly self-requested a review October 24, 2020 21:28
@fkiraly
Collaborator

fkiraly commented Oct 24, 2020

I've looked at the code in greater detail now.
I've focused on the treatment of transformers and their types, as in:

  • transformers/base for the template
  • contents of the various sub-folders of transformers to understand the templates and type check conventions
  • utils/validation to understand the type checks

I agree with the general design, I think this is making things much clearer and cleaner.

Major comments:

  1. I think we very much need to write up template designs, type conventions, and extension conventions. This information is typically not suitably covered by an auto-generated API reference, as it is distributed across classes - the generic template, and the conventions implemented in the concrete descendants. Something like the sklearn developer guidelines could work, but I'd actually prefer to also have "formal" template definitions collected in one place. While the typing makes things clearer, it also increases the need to document the interface cleanly.
  2. I like the idea of using typing for the definition of templates, checking, and simultaneous treatment of input types. However, I think we are now introducing two conceptually disconnected systems, with the typing type checks and the separate runtime type checks in utils/validation - compare for example the VALID_DATA_TYPES in validation.series and the UnivariateSeries/MultivariateSeries types in transformers.base.
    I think we need to think carefully how we best streamline this, while avoiding duplication or design inconsistency.
    I'll discuss a potential way below.

Minor comments:

  • I found the rationale behind the choice of argument names (X or Z or what?) a bit odd. Not sure what the rule is, but it looked inconsistent?
  • not all panel transformers look like they are "full" panel transformers; some are row-wise. Are these correctly classified?
  • not sure whether we want to adopt the sklearn convention for random_state, but currently we don't adhere to it? Similarly for dealing with optional arguments to __init__ in the template(s), and args or kwargs.

@fkiraly
Collaborator

fkiraly commented Oct 24, 2020

Regarding the issue of simultaneous use of typing and runtime type checks:

I think the most elegant way would be to use the typing, and perhaps annotated types, to introduce generic type checks for function arguments.
This could work by using decorators for input/output checks on methods such as fit and transform, similar to what we had discussed earlier.

This can be covered by two parts:

  • Single-argument type checks, and input/output conversions between different machine types with the same scitype.
    This can be done by a generic decorator which is applicable to any such function.
    Why: the typing of the decorated function allows look-up of the scitype via get_type_hints in the decorator. Then, one can dispatch to type-specific checks on each argument.
    This can replace check_X and check_y and the like.
  • joint argument type checks, such as whether X and y are of equal length, or other compatibility. This would replace check_y_X and the like. Since the current such functions call check_X and check_y, they could easily be replaced by decoration of the first kind.

The VALID_DATA_TYPES could be rolled into union type definitions or scitypes.
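The decorator idea could look roughly like this: read the annotations of the wrapped callable via get_type_hints and dispatch each argument to a registered check. The registry and names here are hypothetical, purely to illustrate the mechanism:

```python
import functools
import inspect
from typing import get_type_hints

import numpy as np
import pandas as pd

# hypothetical registry mapping an annotated type to its runtime check
CHECKS = {
    pd.Series: lambda x: isinstance(x, pd.Series),
    np.ndarray: lambda x: isinstance(x, np.ndarray),
}

def check_inputs(func):
    """Decorator dispatching per-argument runtime checks from type hints."""
    hints = get_type_hints(func)
    sig = inspect.signature(func)

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        bound = sig.bind(*args, **kwargs)
        for name, value in bound.arguments.items():
            check = CHECKS.get(hints.get(name))
            if check is not None and not check(value):
                raise TypeError(f"{name} failed the check for {hints[name]}")
        return func(*args, **kwargs)

    return wrapper

@check_inputs
def transform(Z: pd.Series) -> pd.Series:
    # toy transform: demean the series
    return Z - Z.mean()
```

The single-argument checks then live in one registry, and joint checks (e.g. equal lengths of X and y) could be a second decorator layered on top.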

Collaborator

@fkiraly fkiraly left a comment


Questions for discussion above.
Won't approve/request since I think the typing and doc issue may be good to discuss.

@mloning
Contributor Author

mloning commented Oct 26, 2020

@fkiraly thanks again for the review, merged now, but happy to discuss this further and push new changes!

@fkiraly
Collaborator

fkiraly commented Oct 26, 2020

So, is the idea to have this as a "working version" and deal with the type checks later?
You haven't really replied to that, and you've pushed without a PR approval, using your admin privileges...
In my understanding of best PR practice, you should say why, whether you agree, and/or try to clarify outstanding questions.

@mloning
Contributor Author

mloning commented Oct 26, 2020

Sorry if that wasn't clear @fkiraly! As discussed on Friday, other work depends on this and I want to move forward. You also didn't request these changes, so I thought they were points for discussion rather than change requests.

Regarding your comments:

  • we're already using scikit-learn's check_random_state functions and their conventions for handling random states, have we missed some cases?
  • using X and y as standard argument names is even more confusing (consider the detrender, where X is y and y is X), hence I changed it to Z for the series and X for exogenous variables (that is also what you suggested independently in your first review above)
  • I agree about tidying up the different type checks/notations and the extension guidelines, but given the limited type support and the problems of subclassing numpy or pandas, this isn't straightforward, unfortunately
  • would you suggest another type distinction between full-panel and row-wise panel transformers?

@fkiraly
Collaborator

fkiraly commented Oct 26, 2020

we're already using scikit-learn's check_random_state functions and their conventions for handling random states, have we missed some cases?

But not all learners can be given random_state arguments? That would cause problems in composition, when "passing through" the random state.

Using X and y as standard argument names is even more confusing (consider the detrender where X is y and y is X), hence changed it to Z for series and X for exogenous (that was also what you suggest independently in your first review above)

I suspected this, but what is the "rule" in assigning letters here? Is it consistent across scitypes?

I agree about tidying up the different type checks/notations and extension guidelines but given the limited type support and problems of subclassing numpy or pandas, this isn't straightforward unfortunately

Well, at least the extension guidelines are straightforward.
I agree that the type checks need some thinking, but it's also not insurmountable. I feel if this is not done cleanly, we'll end up again with a major refactor or other technical debt that we could otherwise avoid.

would you suggest another type distinction between full panel and row wise panel transformers?

My intuition would tell me to get the "row wise panel transformers" from consistently applying a factory, decorator, or wrapper/reduction method to row wise transformers, to obtain panel versions.
(the caveat here is efficiency)

Perhaps add a tag that says a panel transformer is row-wise, rather than a full scitype, but not sure whether that's useful anywhere.


Labels

implementing framework: Implementing or improving framework for learning tasks, e.g., base class functionality
maintenance: Continuous integration, unit testing & package distribution
