Skip to content

[ENH] added feature_kind metadata in datatype checks#6490

Merged
yarnabrina merged 20 commits intosktime:mainfrom
Abhay-Lejith:feature_kind
Jul 19, 2024
Merged

[ENH] added feature_kind metadata in datatype checks#6490
yarnabrina merged 20 commits intosktime:mainfrom
Abhay-Lejith:feature_kind

Conversation

@Abhay-Lejith
Copy link
Copy Markdown
Contributor

@Abhay-Lejith Abhay-Lejith commented May 28, 2024

Reference Issues/PRs

Fixes #6470.

What does this implement/fix? Explain your changes.

Adding a feature_kind field to metadata to store info about datatypes of columns.

Does your contribution introduce a new dependency? If yes, which one?

No.

What should a reviewer concentrate their feedback on?

Did you add any tests for the change?

Yes.

Any other comments?

PR checklist

For all contributions
  • The PR title starts with either [ENH], [MNT], [DOC], or [BUG]. [BUG] - bugfix, [MNT] - CI, test framework, [ENH] - adding or improving code, [DOC] - writing or improving documentation or docstrings.

@Abhay-Lejith
Copy link
Copy Markdown
Contributor Author

@fkiraly @yarnabrina Few other tests are failing because of the changes I've made to the panel examples.
The metadata inference tests are successful as I checked locally. Other tests which use the same examples are failing.

I think it is because I've set the numpy3D and numpyflat fixtures to None in example 0 and example 1.

What I've done is I've moved the numpy examples to fixtures 4 and 5.
The reason I had to do this was because in the first few fixtures, the dtype of the input is int , so the expected feature_kind is DtypeKind.INT. But numpy is an exception where expected feature_kind is DtypeKind.FLOAT.

One solution is to leave the tests as it was earlier and change the input dtype to float so that the expected output is same for all mtypes and I don't have to separate the numpy examples.
Any other ideas?

@Abhay-Lejith Abhay-Lejith changed the title [ENH] added feature_kind metadata in series check [ENH] added feature_kind metadata in datatype checks Jun 3, 2024
@fkiraly
Copy link
Copy Markdown
Collaborator

fkiraly commented Jun 5, 2024

@fkiraly so to make it short, the current tests are structured in a way that assumes all the different mtype inputs(nested_univ, numpy3D, pd-MultiIndex etc) in an example have the same metadata. Which was the case so far. But now that feature_kind has been added ,it is not the case anymore.

I see, I still think we should not change the examples, because so far they were treated as objects with the same metadata, or more generall objects that were the same on abstract data level, and we should not change that, or it would be breaking a central piece of logic.

The fact that naive dtype inferences leads to different results between the examples raises a serious design question - namely, whether we want to support all of the dtype_kind on the level of abstract mtype metadata.

FYI @yarnabrina

My instinct would be, we coerce to only two internally: FLOAT and CATEGORICAL. There is currently no separate handling for INT even.

The simplest version is that only these two are returned, coercing the int anf float types to FLOAT, and string, bool, and categorical to CATEGORICAL. (we need to check how bool is currently coerced in pandas dataframe)

Mechanically, a way to not lose the more granular implementation, would be to have, say, feature_kind_raw vs feature_kind, and the latter is just mapped. Ignoring feature_kind_raw in the examples comparisons and only using feature_kind would resolve the need to change the examples.

This is also a very consequential decision, even if it seems small, because it defines the internal data model, so we should ensure it is flagged and discussed properly. We could table it at the next tech session, FYI @yarnabrina.

This is of course not a recommendation to stop working on this, feel free to explore different options in the code, e.g., restoring the examples. I would advise you could keep multiple branches locally if you atr trying multiple things.

@Abhay-Lejith
Copy link
Copy Markdown
Contributor Author

I see, I still think we should not change the examples, because so far they were treated as objects with the same metadata, or more generall objects that were the same on abstract data level, and we should not change that, or it would be breaking a central piece of logic.

I'd like to mention that this is still the case. Just that now all of them are of float dtype instead of int.
These issues were not faced in series examples because they were float already.

I have already explored other options such as separating the numpy examples into different fixtures at the end as example 4 and 5. But this breaks other tests ( for functions like convert_to) as they expect to see numpy examples in the first fixtures itself.

My instinct would be, we coerce to only two internally: FLOAT and CATEGORICAL
Mechanically, a way to not lose the more granular implementation, would be to have, say, feature_kind_raw vs feature_kind, and the latter is just mapped. Ignoring feature_kind_raw in the examples comparisons and only using feature_kind would resolve the need to change the examples.

Yes, this would solve the issue.
I think I'll proceed with this after we discuss it during the tech session.

@fkiraly
Copy link
Copy Markdown
Collaborator

fkiraly commented Jun 5, 2024

I'd like to mention that this is still the case. Just that now all of them are of float dtype instead of int.

Sorry if I disagree, we have to be really precise here formally: you are contradicting yourself in the same sentence. If we were to use the feature_dtype field as proposed and as the core column typing metadata field, we would be treating the float and int examples differently.

I also want to reaffirm, this is not your "fault" or anyone's, it's a design question we have overlooked and that we need to resolve, that we have seen only because you've implemented things according to the earlier design.

I think I'll proceed with this after we discuss it during the tech session.

ok, I think you have other things to work on until then?

I think it is important that @sktime/core-developers have an opportunity to reflect on this.

Could you also kindly send a message in the dev-chat discord to alert them to this design question?

@fkiraly fkiraly added the module:datatypes datatypes module: data containers, checkers & converters label Jun 5, 2024
@fkiraly
Copy link
Copy Markdown
Collaborator

fkiraly commented Jun 6, 2024

what are your thoughts, @Abhay-Lejith, on how to proceed?

@Abhay-Lejith
Copy link
Copy Markdown
Contributor Author

Abhay-Lejith commented Jun 6, 2024

Considering how we want the different mtype data in an example to have the same metadata, one option is to modify the examples so that the numpy based mtypes are separate from the pandas based mtypes. This would allow us to continue using the existing DtypeKind implementation as per the exchange protocol. But this would require a major rework of the testing module.

The other option ( and the better one I would say) is like you had suggested, create an extra layer on top of this implementation where we can group the int, uint, and float into a single NUMERIC category and the rest into a CATEGORICAL category. Then the expected metadata for both numpy and pandas based mtypes will be the same (since INT is no longer there) and there will be no need to modify the examples at all.

@Abhay-Lejith
Copy link
Copy Markdown
Contributor Author

@fkiraly @yarnabrina , I've reverted changes to examples and added the broader classification into just categorical and float.

@Abhay-Lejith Abhay-Lejith marked this pull request as ready for review June 12, 2024 13:11
@Abhay-Lejith Abhay-Lejith requested a review from fkiraly June 24, 2024 06:08
Copy link
Copy Markdown
Collaborator

@fkiraly fkiraly left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me!

One thing I noticed, we need to add tests for the new categorical type. This could be done by adding more examples, or separately.

@Abhay-Lejith
Copy link
Copy Markdown
Contributor Author

Abhay-Lejith commented Jun 27, 2024

@fkiraly adding examples with categorical does not work because even though the metadata is inferred correctly ,the current checks fail if the dataframe contains 'object' dtype.
Would it be better to add tests with categorical examples after we add categorical support and can remove this check?
Otherwise to test now, I will have to create a new file with separate examples and then test by calling the private feature_kind functions directly.
What do you suggest ?

To test now, I mean something like this :

def test_feature_kind_for_series():
    df = pd.DataFrame({"a": ["a", "b", "c", "d"], "b": [3, 7, 2, -3 / 7]})
    expected_feature_kind = [DtypeKind.CATEGORICAL, DtypeKind.FLOAT]

    dtype_kind = _get_series_dtypekind(df, pd.DataFrame)
    feature_kind = _get_feature_kind(dtype_kind)

    assert feature_kind == expected_feature_kind, (
        f"feature_kind was not correctly inferred, expected {expected_feature_kind} but"
        f"found {feature_kind}"
    )

@fkiraly
Copy link
Copy Markdown
Collaborator

fkiraly commented Jun 27, 2024

Yes, I think a separate test would be great for now, until we activate the entire machinery.

I would also advise that you create a branch for yourself where you have the update for the checks integrated, so you can test the full integrated change. Good practice for juggling multiple branches imo.

fkiraly
fkiraly previously approved these changes Jul 3, 2024
Copy link
Copy Markdown
Collaborator

@fkiraly fkiraly left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice! I think this is ready.

I added some suggestions on improving code quality of a private method.

@Abhay-Lejith
Copy link
Copy Markdown
Contributor Author

@fkiraly I noticed that I haven't added the metadata for the table scitype. So consider this still in progress for now.

@Abhay-Lejith
Copy link
Copy Markdown
Contributor Author

@fkiraly @yarnabrina , if possible, please run the full test suite on this pr.
I suspect the commits from this pr are causing some of the errors in the test all in #6732 .

@yarnabrina yarnabrina merged commit 3eed526 into sktime:main Jul 19, 2024
yarnabrina added a commit to Abhay-Lejith/sktime that referenced this pull request Jul 20, 2024
* upstream/main:
  [ENH] Allow object dtype in series (sktime#5886)
  [ENH] `check_pdmultiindex_panel` to return names of invalid `object` columns if there are any (sktime#6797)
  [ENH] added feature_kind metadata in datatype checks (sktime#6490)
  [ENH] Adding tag for categorical support in `X` (sktime#6704)
  [BUG] Fix bug when predicting segments from clasp change point annotator (sktime#6756)
  [ENH] Adding support for GluonTS' PandasDataset object (sktime#6668)
Abhay-Lejith added a commit to Abhay-Lejith/sktime that referenced this pull request Jul 31, 2024
commit 9f95d89
Merge: 02bd3fb 6caeb0d
Author: Abhay-Lejith <abhaylejith@gmail.com>
Date:   Tue Jul 30 22:01:03 2024 +0530

    Merge branch 'main' into step3-4_categorical_support

commit 02bd3fb
Merge: 53ee977 a21d97f
Author: Abhay-Lejith <abhaylejith@gmail.com>
Date:   Mon Jul 29 21:47:41 2024 +0530

    Merge branch 'main' into step3-4_categorical_support

commit 53ee977
Merge: 74fd3f8 9b066f2
Author: Franz Király <f.kiraly@ucl.ac.uk>
Date:   Sun Jul 28 15:19:13 2024 +0100

    Merge branch 'main' into pr/6732

commit 74fd3f8
Merge: e9f5e2b 116faa2
Author: Franz Király <f.kiraly@ucl.ac.uk>
Date:   Sun Jul 28 15:18:26 2024 +0100

    Merge branch 'main' into pr/6732

commit e9f5e2b
Author: Abhay-Lejith <Abhay-Lejith@users.noreply.github.com>
Date:   Sun Jul 28 10:32:15 2024 +0000

    [AUTOMATED] update CONTRIBUTORS.md

commit f3e212b
Merge: 95c213b 81d2d08
Author: Abhay-Lejith <abhaylejith@gmail.com>
Date:   Sun Jul 28 16:01:12 2024 +0530

    Merge branch 'main' into step3-4_categorical_support

commit 95c213b
Author: Abhay-Lejith <abhaylejith@gmail.com>
Date:   Sun Jul 28 15:59:02 2024 +0530

    added est.update to test_categorical

commit 8d14bde
Author: Abhay-Lejith <abhaylejith@gmail.com>
Date:   Sat Jul 27 09:35:51 2024 +0530

    typo, docs

commit 81b4862
Author: Abhay-Lejith <abhaylejith@gmail.com>
Date:   Thu Jul 25 22:27:00 2024 +0530

    separate test for categorical in y

commit 23b4279
Merge: a0d950d 3a55ca5
Author: Anirban Ray <39331844+yarnabrina@users.noreply.github.com>
Date:   Sat Jul 20 09:29:27 2024 +0530

    Merge remote-tracking branch 'upstream/main' into pr/Abhay-Lejith/6732

    * upstream/main:
      [ENH] Allow object dtype in series (sktime#5886)
      [ENH] `check_pdmultiindex_panel` to return names of invalid `object` columns if there are any (sktime#6797)
      [ENH] added feature_kind metadata in datatype checks (sktime#6490)
      [ENH] Adding tag for categorical support in `X` (sktime#6704)
      [BUG] Fix bug when predicting segments from clasp change point annotator (sktime#6756)
      [ENH] Adding support for GluonTS' PandasDataset object (sktime#6668)

commit a0d950d
Merge: 2d77fd6 1fa8cfc
Author: Abhay-Lejith <abhaylejith@gmail.com>
Date:   Wed Jul 17 22:03:27 2024 +0530

    Merge branch 'main' into step3-4_categorical_support

commit 2d77fd6
Author: Abhay-Lejith <abhaylejith@gmail.com>
Date:   Sat Jul 13 18:25:44 2024 +0530

    test error fix

commit 194ec28
Merge: 143a176 23bbb9b
Author: Abhay-Lejith <abhaylejith@gmail.com>
Date:   Sat Jul 13 17:16:00 2024 +0530

    Merge updated branch 'feature_kind' with timedelta dtype added

commit 143a176
Author: Abhay-Lejith <abhaylejith@gmail.com>
Date:   Sat Jul 13 17:11:59 2024 +0530

    added tests for categorical

commit 23bbb9b
Author: Abhay-Lejith <abhaylejith@gmail.com>
Date:   Fri Jul 12 12:31:52 2024 +0530

    added timedelta to DATETIME category

commit f926c58
Author: Abhay-Lejith <abhaylejith@gmail.com>
Date:   Fri Jul 12 10:40:29 2024 +0530

    added ignores-exog-X check to test for invalid X input

commit 460caba
Author: Abhay-Lejith <abhaylejith@gmail.com>
Date:   Fri Jul 12 09:32:09 2024 +0530

    added checks in transformers

commit 08aefa2
Author: Abhay-Lejith <abhaylejith@gmail.com>
Date:   Thu Jul 11 13:16:09 2024 +0530

    added categorical checks in other base classes

commit 0812220
Author: Abhay-Lejith <abhaylejith@gmail.com>
Date:   Tue Jul 9 13:29:22 2024 +0530

    linting blank line fix

commit 5f3be53
Author: Abhay-Lejith <abhaylejith@gmail.com>
Date:   Tue Jul 9 13:16:07 2024 +0530

    added feature_kind to req metadata in early classifiers

commit 1d7e443
Merge: 7377837 b1f6167
Author: Abhay-Lejith <abhaylejith@gmail.com>
Date:   Tue Jul 9 13:11:34 2024 +0530

    Merge branch updated 'feature_kind' with xarray and dask metadata support

commit b1f6167
Author: Abhay-Lejith <abhaylejith@gmail.com>
Date:   Tue Jul 9 13:00:36 2024 +0530

    added metadata for xarray and dask with tests

commit 0905bbb
Author: Abhay-Lejith <abhaylejith@gmail.com>
Date:   Mon Jul 8 13:23:36 2024 +0530

    dataloader test error workaround

commit 7377837
Merge: d36e05a 8652ba9
Author: Abhay-Lejith <abhaylejith@gmail.com>
Date:   Mon Jul 8 12:33:23 2024 +0530

    Merge branch updated 'feature_kind' with new bool mapping

commit 8652ba9
Author: Abhay-Lejith <abhaylejith@gmail.com>
Date:   Mon Jul 8 12:32:22 2024 +0530

    changed mapping of boolean dtypekind to float

commit d36e05a
Author: Abhay-Lejith <abhaylejith@gmail.com>
Date:   Sun Jul 7 16:56:21 2024 +0530

    moved handle_categorical call to after check_is_error

commit 403ee6e
Author: Abhay-Lejith <abhaylejith@gmail.com>
Date:   Sun Jul 7 16:52:39 2024 +0530

    indexing fix

commit a006004
Author: Abhay-Lejith <abhaylejith@gmail.com>
Date:   Sun Jul 7 16:52:39 2024 +0530

    indexing fix

commit 3696f7b
Author: Abhay-Lejith <Abhay-Lejith@users.noreply.github.com>
Date:   Sun Jul 7 10:17:26 2024 +0000

    [AUTOMATED] update CONTRIBUTORS.md

commit 05c452c
Author: Abhay-Lejith <abhaylejith@gmail.com>
Date:   Sun Jul 7 15:46:37 2024 +0530

    updated tag name

commit 4b557c6
Merge: 8cd48ed 978263e
Author: Abhay-Lejith <abhaylejith@gmail.com>
Date:   Sun Jul 7 15:43:14 2024 +0530

    Merge updated branch 'categorical_tag' with class for new tag

commit 978263e
Author: Abhay-Lejith <abhaylejith@gmail.com>
Date:   Sun Jul 7 15:41:19 2024 +0530

    added class for tag

commit 8cd48ed
Merge: 90c9f31 c1af183
Author: Abhay-Lejith <abhaylejith@gmail.com>
Date:   Sun Jul 7 13:19:34 2024 +0530

    Merge updated branch 'feature_kind' with table support

commit c1af183
Author: Abhay-Lejith <abhaylejith@gmail.com>
Date:   Sun Jul 7 13:07:56 2024 +0530

    extended to table scitype along with tests

commit 90c9f31
Author: Abhay-Lejith <abhaylejith@gmail.com>
Date:   Sun Jul 7 11:19:35 2024 +0530

    calling _handle_categorical in forecasting, classification, regression

commit 655f6e0
Author: Abhay-Lejith <abhaylejith@gmail.com>
Date:   Sun Jul 7 11:19:08 2024 +0530

    added categorical handling logic

commit db94559
Author: Abhay-Lejith <abhaylejith@gmail.com>
Date:   Thu Jul 4 10:29:48 2024 +0530

    added docstring and modified code acc to suggestion

commit 1321e1e
Merge: de6c92b 0eb9231
Author: Abhay-Lejith <abhaylejith@gmail.com>
Date:   Tue Jul 2 13:11:12 2024 +0530

    Merge pr sktime#5886 into step3-4_categorical_support

commit de6c92b
Merge: e376768 3f62d98
Author: Abhay-Lejith <abhaylejith@gmail.com>
Date:   Tue Jul 2 13:09:50 2024 +0530

    Merge branch 'feature_kind' into step3-4_categorical_support

commit e376768
Author: Abhay-Lejith <abhaylejith@gmail.com>
Date:   Tue Jul 2 11:06:07 2024 +0530

    added categorical tag

commit 3f62d98
Author: Abhay-Lejith <abhaylejith@gmail.com>
Date:   Mon Jul 1 11:08:47 2024 +0530

    added separate tests for testing with categorical data

commit e806cc1
Author: Abhay-Lejith <abhaylejith@gmail.com>
Date:   Mon Jun 24 11:09:08 2024 +0530

    renamed fields and rmeoved use of pandas func for numpy based mtypes

commit 7a58087
Author: Abhay-Lejith <abhaylejith@gmail.com>
Date:   Wed Jun 12 18:33:18 2024 +0530

    added simple_feature_kind and reverted changes to examples

commit ee0bc1c
Author: Abhay-Lejith <abhaylejith@gmail.com>
Date:   Mon Jun 3 12:02:30 2024 +0530

    changed dtype to float in panel examples which I missed earlier

commit 2f84b24
Merge: 1a64d2f e22d8dc
Author: Abhay-Lejith <abhaylejith@gmail.com>
Date:   Mon Jun 3 10:11:30 2024 +0530

    Merge branch 'main' into feature_kind

commit 1a64d2f
Author: Abhay-Lejith <abhaylejith@gmail.com>
Date:   Mon Jun 3 10:07:19 2024 +0530

    changed dtype to float in panel test examples

commit dd81851
Author: Abhay-Lejith <abhaylejith@gmail.com>
Date:   Sat Jun 1 16:20:17 2024 +0530

    added dtypekind metadata for panel and hierarchical

commit 83b6fe8
Author: Abhay-Lejith <abhaylejith@gmail.com>
Date:   Fri May 31 10:25:28 2024 +0530

    using pandas api functions to detect dtype

commit c0962f9
Author: Abhay-Lejith <abhaylejith@gmail.com>
Date:   Thu May 30 12:17:51 2024 +0530

    added function for getting dtypekind of series

commit 1e2110a
Merge: 898b839 6f42855
Author: Franz Király <f.kiraly@ucl.ac.uk>
Date:   Wed May 29 13:54:43 2024 +0100

    Merge branch 'main' into pr/6490

commit 898b839
Author: Abhay-Lejith <abhaylejith@gmail.com>
Date:   Wed May 29 11:04:28 2024 +0530

    using dtypekind enum values

commit ee8e0c4
Author: Abhay-Lejith <abhaylejith@gmail.com>
Date:   Tue May 28 14:50:43 2024 +0530

    added feature_kind metadata in series check

commit 0eb9231
Author: Anirban Ray <39331844+yarnabrina@users.noreply.github.com>
Date:   Fri Feb 16 23:37:36 2024 +0530

    @fkiraly suggestion: add rule to differentiate nested_univ and Panel

commit 0e28842
Author: Anirban Ray <39331844+yarnabrina@users.noreply.github.com>
Date:   Fri Feb 16 10:46:16 2024 +0530

    support "object" dtype in Panel

    (cherry picked from commit 1477012)

commit 25eb6f5
Author: Anirban Ray <39331844+yarnabrina@users.noreply.github.com>
Date:   Fri Feb 16 10:42:17 2024 +0530

    @fkiraly edit: add rule to differentiate nested_univ and pd.DataFrame

    Ref. sktime#5886 (review)

    (cherry picked from commit 2cecd6b)

commit a729843
Author: Anirban Ray <39331844+yarnabrina@users.noreply.github.com>
Date:   Sat Feb 3 18:29:15 2024 +0530

    added test

    (cherry picked from commit 2de7d8c)

commit c624e42
Author: Anirban Ray <39331844+yarnabrina@users.noreply.github.com>
Date:   Sat Feb 3 18:28:47 2024 +0530

    support "object" dtype in Series

    (cherry picked from commit cc3d056)
fkiraly pushed a commit that referenced this pull request Aug 6, 2024
#### Reference Issues/PRs

Reference umbrella issue comment
:#6109 (comment)
This is step 4 mentioned in that comment.

This PR **depends** on:
#6704 : Adding new categorical tag
#6490 : Feature_kind metadata

i.e this PR requires and contains commits of the above two PRs.

#### What does this implement/fix? Explain your changes.

Implementation of logic for handling categorical data and raising error
appropriately.
Raising error in yes/no case is only in forecasters. 
yes/no case - data has categorical , forecaster does not support.
In other modules(except classification), error is raised directly if
categorical is present.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

module:datatypes datatypes module: data containers, checkers & converters

Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

[ENH] categorical feature support: input checking - column type encoding by the __dataframe__ protocol

3 participants