Update metric to handle y_train#858

Merged
mloning merged 27 commits into sktime:main from RNKuhns:metric-handle-y_train
Jun 3, 2021

Conversation

@RNKuhns
Contributor

@RNKuhns RNKuhns commented May 4, 2021

Reference Issues/PRs

This addresses functionality in #712.

What does this implement/fix? Explain your changes.

Add functionality to accept y_train in the __call__ method and pass it to the underlying function if that function requires y_train.

Also updated underlying metric classes to inherit from BaseMetric.
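As a sketch of the idea (hypothetical minimal code, not sktime's actual BaseMetric implementation), __call__ can inspect the wrapped function's signature and forward y_train only when it is needed:

```python
import inspect


class BaseMetric:
    """Minimal sketch (illustrative, not sktime's code): a metric class
    whose __call__ forwards y_train only to functions that accept it."""

    def __init__(self, func):
        self._func = func

    def __call__(self, y_true, y_pred, y_train=None, **kwargs):
        # Only pass y_train if the wrapped function's signature accepts it.
        if "y_train" in inspect.signature(self._func).parameters:
            kwargs["y_train"] = y_train
        return self._func(y_true, y_pred, **kwargs)


def mae(y_true, y_pred):
    # Plain mean absolute error; does not need y_train.
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def scaled_mae(y_true, y_pred, y_train):
    # Scale by the mean absolute first difference of the training series,
    # as MASE does with the naive one-step-ahead forecast.
    naive = sum(
        abs(a - b) for a, b in zip(y_train[1:], y_train[:-1])
    ) / (len(y_train) - 1)
    return mae(y_true, y_pred) / naive
```

With this pattern, both `BaseMetric(mae)(y_true, y_pred)` and `BaseMetric(scaled_mae)(y_true, y_pred, y_train=y_train)` work through the same __call__ signature.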

Does your contribution introduce a new dependency? If yes, which one?

What should a reviewer concentrate their feedback on?

Any other comments?

PR checklist

For all contributions
  • I've added myself to the list of contributors.
  • Optionally, I've updated sktime's CODEOWNERS to receive notifications about future changes to these files.
  • I've added unit tests and made sure they pass locally.
For new estimators
  • I've added the estimator to the online documentation.
  • I've updated the existing example notebooks or provided a new one to showcase how my estimator works.

@RNKuhns
Contributor Author

RNKuhns commented May 7, 2021

@mloning and @fkiraly the more I look at this while trying to simplify the unit tests for these metrics, the more sense it makes to me that the metric functions in _functions.py should also be updated to accept y_train via `**kwargs` so we have a uniform interface.

The difference for users passing info to the functions will be relatively minimal.

The benefit is that y_train would be passed to the function the same way as to the __call__ of the class version of the metric:

mean_absolute_scaled_error(y_true, y_pred, y_train=y_train, ... # Any other keyword args like multioutput)
mase = MeanAbsoluteScaledError()
mase(y_true, y_pred, y_train=y_train)

versus the current function setup, which accepts y_train as a positional arg while the class version of the metric accepts it as a keyword:

mean_absolute_scaled_error(y_true, y_pred, y_train, ... # Any other keyword args like multioutput)
mase = MeanAbsoluteScaledError()
mase(y_true, y_pred, y_train=y_train)

Note that I'll do whatever we decide for y_train for metrics that accept y_pred_benchmark.

I've got this coded up but wanted to get your feedback on the approach before committing and pushing to GitHub.
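For illustration, the proposed uniform function interface might look like this minimal sketch (the real sktime signature and validation logic differ):

```python
def mean_absolute_scaled_error(y_true, y_pred, **kwargs):
    """Sketch of the proposed uniform interface: y_train arrives via
    **kwargs rather than as a positional argument (illustrative only,
    not sktime's actual implementation)."""
    y_train = kwargs.pop("y_train", None)
    if y_train is None:
        raise ValueError("mean_absolute_scaled_error requires y_train")
    mae = sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)
    # Scale by the in-sample MAE of the naive one-step-ahead forecast.
    naive_mae = sum(
        abs(a - b) for a, b in zip(y_train[1:], y_train[:-1])
    ) / (len(y_train) - 1)
    return mae / naive_mae
```

Callers then always write `mean_absolute_scaled_error(y_true, y_pred, y_train=y_train)`, mirroring the class-based usage exactly.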

@fkiraly
Collaborator

fkiraly commented May 11, 2021

@mloning and @fkiraly the more I look at this while trying to simplify the unit tests for these metrics, the more sense it makes to me that the metric functions in _functions.py should also be updated to accept y_train via `**kwargs` so we have a uniform interface.

I've got this coded up but wanted to get your feedback on the approach before committing and pushing to Github.

Yes, very much agreed! I think this indeed makes unit tests easier, and any other exercise that requires a unified interface on the function level.

@RNKuhns
Contributor Author

RNKuhns commented May 12, 2021

@fkiraly that sounds great. I think this should basically be ready for review. But I see the PR is failing the manylinux build. @mloning the details look like there is an issue with numba versioning in the linux builds, but not entirely sure what is going on there.

@RNKuhns RNKuhns marked this pull request as ready for review May 12, 2021 00:29
@RNKuhns RNKuhns requested review from aiwalter and mloning as code owners May 12, 2021 00:29
Contributor

@mloning mloning left a comment


Thanks @RNKuhns - the manylinux CI will hopefully be fixed in #870.

I had a first look at the code and left a few minor comments below.

Contributor

@mloning mloning left a comment


@RNKuhns I had a look at the changes, I think we're almost there, just a few more comments.

@RNKuhns
Contributor Author

RNKuhns commented May 19, 2021

@mloning I think I've incorporated all your comments. As a bonus I've tweaked all the docstrings to better align with NumPy conventions and pass the pydocstyle checks.

I also opened an issue related to the creation of a BaseObject per our discussion above. I can start tackling that next.

@RNKuhns RNKuhns mentioned this pull request May 19, 2021
mloning
mloning previously approved these changes May 20, 2021
Contributor

@mloning mloning left a comment


Looks all good to me - I'll merge in the next few days in case anyone else wants to take a look!

Contributor

@mloning mloning left a comment


Hi @RNKuhns, we should also update evaluate in sktime.forecasting.model_evaluation to pass y_train as a keyword argument to the scoring: https://github.com/alan-turing-institute/sktime/blob/c611e6a3587d7b3a44cb7deefd4b6baa4897fc9b/sktime/forecasting/model_evaluation/_functions.py#L106

If we do that here, we should also add a metric that requires y_train to the current test cases here (perhaps add MASE and remove MSE which doesn't add much as an additional test case): https://github.com/alan-turing-institute/sktime/blob/c611e6a3587d7b3a44cb7deefd4b6baa4897fc9b/sktime/forecasting/model_evaluation/tests/test_evaluate.py#L80
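The suggested change to evaluate could be sketched like this (all names here are illustrative stand-ins, not sktime's actual evaluate code):

```python
class NaiveForecaster:
    """Toy last-value forecaster used only to make the sketch runnable."""

    def fit(self, y, fh=None):
        self._last = y[-1]
        return self

    def predict(self, fh):
        return [self._last] * len(fh)


def evaluate_fold(forecaster, y_train, y_test, fh, scoring):
    """Sketch of one CV fold: fit on y_train, forecast, then pass y_train
    to the metric as a keyword so y_train-dependent metrics like MASE work."""
    forecaster.fit(y_train, fh=fh)
    y_pred = forecaster.predict(fh)
    # Note the keyword argument: scoring(y_true, y_pred, y_train=...).
    return scoring(y_test, y_pred, y_train=y_train)


def mase(y_true, y_pred, y_train=None):
    # Minimal MASE for the example.
    mae = sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)
    naive = sum(
        abs(a - b) for a, b in zip(y_train[1:], y_train[:-1])
    ) / (len(y_train) - 1)
    return mae / naive
```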

@RNKuhns
Contributor Author

RNKuhns commented May 22, 2021

@mloning I'm working on update to evaluate.

Ran into two minor hitches. The first is that in the call to scoring within evaluate() the y_true and y_pred arguments were flipped (this doesn't really matter for most metrics, but I fixed it).

The bigger hitch is that the test cases include in-sample predictions. In MeanAbsoluteScaledError() we have a check to make sure that y_train comes before y_true. This makes sense in the context of forecast evaluation in general (and MASE's definition), but some of the test cases evaluate in-sample predictions, and in those cases the check fails; hence, the tests fail.

I can think of two options for approaching this:

  1. Keep a separate test function for MASE that only tests the common configs with out-of-sample forecasting horizons, or simply skip the evaluate tests for MASE when the forecasting horizon is negative
  2. Remove the check in mean_absolute_scaled_error that requires y_train to come before y_true

I am in the camp that MASE is designed to be applied when evaluating out-of-sample forecasts, so we could just include a line of code to exclude tests of evaluate for MASE when the forecasting horizon is negative (in-sample). That still ensures the MASE functionality works in a way that makes sense for users (evaluating in-sample forecasts is not a good gauge of a model's future predictive ability). But what do you think?

I could also add a quick check in evaluate that raises an informative error if the user tries to use a metric that requires y_train with a CV object that has a negative forecasting horizon (I'll also add a note in the docstring).

RNKuhns added 3 commits May 23, 2021 09:55
Test tune was creating scorer from scikit-learn
mean_squared_error. Using sktime metric class now.

Also updated documentation of
make_forecasting_scorer to make input function
signature clear.
@mloning
Contributor

mloning commented May 24, 2021

@RNKuhns I agree, but perhaps the check that y_train comes before y_true is a bit over-eager, and we make everything a bit simpler by not enforcing it. In principle MASE still works for in-sample predictions, so perhaps we shouldn't exclude that option. This moves the responsibility to users for making sure they're evaluating models on genuine forecasts, which I think is fine. But happy to follow your lead here @RNKuhns.

@fkiraly
Collaborator

fkiraly commented May 25, 2021

@RNKuhns, I'm with @mloning on this one - in my opinion, the metric should just model the "bare" mathematical/scientific object, not make any checks regarding a plausible use case. This is because of user expectations - they will expect the code object to behave like the scientific one - and because of the domain-driven design principle of following this mapping.

A secondary point - in my opinion not the main argument here, but to the same effect - is that it can in principle make sense for a metric to have training and test time points overlap, for example when you are trying to estimate the over-optimism of in-sample estimates. We shouldn't exclude a rare but sensible use case.

Contributor

@mloning mloning left a comment


Thanks @RNKuhns - looks all good to me now! Will merge in the next few days in case anyone else wants to take a look.

@mloning
Contributor

mloning commented Jun 3, 2021

Now merged - great work @RNKuhns 🎉
