Skip to content

[ENH] Allow object dtype in series#5886

Merged
yarnabrina merged 15 commits intosktime:mainfrom
yarnabrina:allow-object-dtype-in-series
Jul 20, 2024
Merged

[ENH] Allow object dtype in series#5886
yarnabrina merged 15 commits intosktime:mainfrom
yarnabrina:allow-object-dtype-in-series

Conversation

@yarnabrina
Copy link
Copy Markdown
Member

Reference: #5867 (comment)

@yarnabrina
Copy link
Copy Markdown
Member Author

@fkiraly I see a new failure already, but I don't know what part of my change is affecting this. Can you guide?

Also, I have a question. What is the difference between mtype and scitype?

@fkiraly
Copy link
Copy Markdown
Collaborator

fkiraly commented Feb 3, 2024

scitype = abstract data type, e.g., time series, or collection of time series
mtype = full python specification

@fkiraly
Copy link
Copy Markdown
Collaborator

fkiraly commented Feb 3, 2024

The failures are due to a secondary problem, namely that objects of mtype nested_univ now also pass the check for pd.DataFrame, in addition to the nested_univ check.

That means the two mtypes are no longer distinguishable by checks, which leads to further trouble as the data checking layer can no longer decide what mtype it shoud assume

To fix this, we need to ensure tha nested_univ does not pass the pd.DataFrame check.

An idea would be to check that, in object typed columns, the first row entry is not a pd.Series.

@fkiraly fkiraly added module:datatypes datatypes module: data containers, checkers & converters enhancement Adding new functionality labels Feb 3, 2024
@yarnabrina yarnabrina marked this pull request as draft February 4, 2024 05:37
@fkiraly
Copy link
Copy Markdown
Collaborator

fkiraly commented Feb 5, 2024

should I try to do that?

@yarnabrina
Copy link
Copy Markdown
Member Author

yarnabrina commented Feb 5, 2024 via email

@fkiraly
Copy link
Copy Markdown
Collaborator

fkiraly commented Feb 5, 2024

@yarnabrina, your remote is rejecting push. You need to check the box "allow maintainers to make changes".

To fix, add following lines starting line 66 of _check._series:

    # check to delineate from nested_univ mtype (Panel)
    # pd.DataFrame mtype allows object dtype,
    # but if we allow object dytpe with pd.Series entries,
    # the mtype becomes ambiguous, i.e., non-delineable from nested_univ
    if isinstance(obj.iloc[0, 0], (pd.Series, pd.DataFrame)):
        msg = f"{var_name} cannot contain nested pd.Series or pd.DataFrame"
        return ret(False, msg, None, return_metadata)

@yarnabrina
Copy link
Copy Markdown
Member Author

You need to check the box "allow maintainers to make changes".

Done, can you please check now?

@fkiraly
Copy link
Copy Markdown
Collaborator

fkiraly commented Feb 5, 2024

ok, works

@yarnabrina
Copy link
Copy Markdown
Member Author

@fkiraly I'll try to get back to this PR on the weekend. Can you give some suggestions about the new errors? I definitely did not expect IndexError to come after allowing a new dtype.

@fkiraly
Copy link
Copy Markdown
Collaborator

fkiraly commented Feb 8, 2024

Can you give some suggestions about the new errors? I definitely did not expect IndexError to come after allowing a new dtype.

tried to fix - failure is incomplete edge case treatment of empty data frames (zero rows or cols)

@yarnabrina yarnabrina marked this pull request as ready for review February 9, 2024 14:47
@yarnabrina
Copy link
Copy Markdown
Member Author

@fkiraly making it as ready, all tests pass after that additional check you added.

Copy link
Copy Markdown
Collaborator

@fkiraly fkiraly left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, works now and I'd be happy with it

The change I made was the additional check in now lines 66-72 of _series._check, which allows the system to delineate nested_univ from pd.DataFrame mtype, at least in all test cases. Empty pd.DataFrame count as the latter.

Copy link
Copy Markdown
Collaborator

@fkiraly fkiraly left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

actually, I noticed multiple things that imo we need to address before this:

  • Panel and Hierarchical types still forbid object
  • we are not systematically testing this input type - from now, users will not receive clear warning.
  • possibly, many estimators will break if given object dtype-d data - we need to think about what to do. Options: some kind of coercion; having estimator tags which tell the user which dtypes are supported, and an informative message is raised based on that.
    • none seems fully satisfactory, as this is very pandas specific.

@yarnabrina
Copy link
Copy Markdown
Member Author

Panel and Hierarchical types still forbid object

I actually skipped this one intentionally. I thought it may be better to allow separately for different types. Unless you think they are interlinked and can cause issues, I really have very little, if any, familiarity with how scitype/mtype/etc. works in sktime?

many estimators will break if given object dtype-d data

This one is sort of a dilemma - currently a lot can use object type and we don't let them use it, while allowing may break some which do support. I don't know how to choose. Also, there's a "category" type which is being more and more popular. pandas has recently planning to use even a dedicated "string" type, I think I saw in their latest release notes.

we are not systematically testing this input type - from now, users will not receive clear warning.

I didn't understnad what you meant by this one. What is happening now, and what will change to confuse users?

@fkiraly
Copy link
Copy Markdown
Collaborator

fkiraly commented Feb 10, 2024

Unless you think they are interlinked and can cause issues

Yes, they are linked through broadcasting.

One potential failure case that is not covered due to lack of examples for the contract expansion is as follows:

suppose we have a transformer that internally is implemented only for Panel (e.g., for efficiency reasons).
Upon receiving a Series, it will be upcast to Panel, and suddenly the object no longer satisfies the contract.

Similarly, there is now an inconsistency the other way round - even if an estimator would now accept an individual series, it would no longer accept it if it is found with others of similar type in a collection of series.

PS: I think for Panel, the trick with the iloc is not needed, as the delineation happens via the MultiIndex already.

@fkiraly
Copy link
Copy Markdown
Collaborator

fkiraly commented Feb 10, 2024

This one is sort of a dilemma - currently a lot can use object type and we don't let them use it, while allowing may break some which do support. I don't know how to choose. Also, there's a "category" type which is being more and more popular

I do think ew should support it, but "properly".
Here's what I suggest:

  • extend the types for Panel and Hierarchical as well
  • in scenarios_forecasting and scenarios_transformers add coverage for this
    • if not too much breaks, hooray - fix the few bugs etc
    • if a lot breaks, we need to consider how we make the contract tight. A way would be introducing tags for the supported types.

@fkiraly
Copy link
Copy Markdown
Collaborator

fkiraly commented Feb 10, 2024

we are not systematically testing this input type - from now, users will not receive clear warning.

I didn't understnad what you meant by this one. What is happening now, and what will change to confuse users?

Consider the situation of an estimator that does not support the type genuinely, i.e., would break internally of pd.DataFrame with object dtype is passed.

Suppose a user passes a pd.DataFrame with object dtype to that estimator. Currently, on main, this will fail with an informative error message, namely "object dtype is not supported" or similar. This is because the check catches it, and also returns an informative error message together with the check.

However, if we widen the contract without testing and adding catches where they fail, then in this estimator something random will break, and the error will be uninformative, e.g., a ValueError or type error or similar. This is of course not good.

From a higher poin of view, by widening the contract, we may now be creating uncovered bugs. So either we have to move the contract boundary "further in" by one of the measures proposed above (e.g., tag and catch), or we have to ensure all estimators satisfy the new contract. In any case, it has to be tested.

@yarnabrina yarnabrina marked this pull request as draft February 15, 2024 18:08
@fkiraly
Copy link
Copy Markdown
Collaborator

fkiraly commented Feb 16, 2024

well, that's strange. One of the notebooks fails, while the test suite is fine.

We need to identify what object is being passed to the checkers, and why it is ambiguous. Perhaps it has mixed columns?

@fkiraly
Copy link
Copy Markdown
Collaborator

fkiraly commented Feb 16, 2024

oh wait, the problem is actually Panel. I am reverting the change I made to the Series checks.

We just need to add the same trick in pd-multiindex checker.

For Hierarchical, this will not be needed.

@fkiraly
Copy link
Copy Markdown
Collaborator

fkiraly commented Feb 16, 2024

to avoid conflicts: do you want ot add the mini-check in Panel, or should I?

@yarnabrina
Copy link
Copy Markdown
Member Author

yarnabrina commented Feb 16, 2024

Please do. I've very little experience and confidence in this datatypes modules.

I just faced weird set of conflicts. Fixed locally, so let me make one last attempt.

Copy link
Copy Markdown
Member Author

@yarnabrina yarnabrina left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did a rebase from main and force push, so I think you will get conflicts. So, if possible can you delete your local branch and pull this?

Also, I do not know where are the other changes you said are necessary to do, so can you please apply that? I'll not modify this branch anymore to avoid future conflicts, deleting from my local.

@yarnabrina yarnabrina marked this pull request as draft July 7, 2024 16:38
@yarnabrina
Copy link
Copy Markdown
Member Author

FYI @fkiraly @Abhay-Lejith

I updated this branch with main branch (with rebase), but cancelled the CI to avoid affecting other release PR's. Also, converting this into draft as I believe we're going to wait for the series of PR plan Abhay listed in the umbrella issue to be ready.

(cherry picked from commit cc3d056)
(cherry picked from commit 8622582)
(cherry picked from commit 2de7d8c)
(cherry picked from commit 210ca04)
Ref. #5886 (review)

(cherry picked from commit 2cecd6b)
(cherry picked from commit d1368cf)
(cherry picked from commit 1477012)
(cherry picked from commit e68a72d)
@yarnabrina yarnabrina marked this pull request as ready for review July 19, 2024 17:05
@yarnabrina yarnabrina requested a review from fkiraly July 19, 2024 17:06
@yarnabrina
Copy link
Copy Markdown
Member Author

@fkiraly I merged #6704 and #6490 as per the discussion in Discord and #6109, and updated this PR with main. Assuming tests pass (no unexpected failure in @Abhay-Lejith's fork's test-all), can you please review/approve/merge this PR to unblock the next PR in sequence, viz. #6732?

Copy link
Copy Markdown
Collaborator

@fkiraly fkiraly left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure!

@yarnabrina yarnabrina merged commit 3a55ca5 into sktime:main Jul 20, 2024
@yarnabrina yarnabrina deleted the allow-object-dtype-in-series branch July 20, 2024 03:49
yarnabrina added a commit to Abhay-Lejith/sktime that referenced this pull request Jul 20, 2024
* upstream/main:
  [ENH] Allow object dtype in series (sktime#5886)
  [ENH] `check_pdmultiindex_panel` to return names of invalid `object` columns if there are any (sktime#6797)
  [ENH] added feature_kind metadata in datatype checks (sktime#6490)
  [ENH] Adding tag for categorical support in `X` (sktime#6704)
  [BUG] Fix bug when predicting segments from clasp change point annotator (sktime#6756)
  [ENH] Adding support for GluonTS' PandasDataset object (sktime#6668)
Abhay-Lejith added a commit to Abhay-Lejith/sktime that referenced this pull request Jul 31, 2024
commit 9f95d89
Merge: 02bd3fb 6caeb0d
Author: Abhay-Lejith <abhaylejith@gmail.com>
Date:   Tue Jul 30 22:01:03 2024 +0530

    Merge branch 'main' into step3-4_categorical_support

commit 02bd3fb
Merge: 53ee977 a21d97f
Author: Abhay-Lejith <abhaylejith@gmail.com>
Date:   Mon Jul 29 21:47:41 2024 +0530

    Merge branch 'main' into step3-4_categorical_support

commit 53ee977
Merge: 74fd3f8 9b066f2
Author: Franz Király <f.kiraly@ucl.ac.uk>
Date:   Sun Jul 28 15:19:13 2024 +0100

    Merge branch 'main' into pr/6732

commit 74fd3f8
Merge: e9f5e2b 116faa2
Author: Franz Király <f.kiraly@ucl.ac.uk>
Date:   Sun Jul 28 15:18:26 2024 +0100

    Merge branch 'main' into pr/6732

commit e9f5e2b
Author: Abhay-Lejith <Abhay-Lejith@users.noreply.github.com>
Date:   Sun Jul 28 10:32:15 2024 +0000

    [AUTOMATED] update CONTRIBUTORS.md

commit f3e212b
Merge: 95c213b 81d2d08
Author: Abhay-Lejith <abhaylejith@gmail.com>
Date:   Sun Jul 28 16:01:12 2024 +0530

    Merge branch 'main' into step3-4_categorical_support

commit 95c213b
Author: Abhay-Lejith <abhaylejith@gmail.com>
Date:   Sun Jul 28 15:59:02 2024 +0530

    added est.update to test_categorical

commit 8d14bde
Author: Abhay-Lejith <abhaylejith@gmail.com>
Date:   Sat Jul 27 09:35:51 2024 +0530

    typo, docs

commit 81b4862
Author: Abhay-Lejith <abhaylejith@gmail.com>
Date:   Thu Jul 25 22:27:00 2024 +0530

    separate test for categorical in y

commit 23b4279
Merge: a0d950d 3a55ca5
Author: Anirban Ray <39331844+yarnabrina@users.noreply.github.com>
Date:   Sat Jul 20 09:29:27 2024 +0530

    Merge remote-tracking branch 'upstream/main' into pr/Abhay-Lejith/6732

    * upstream/main:
      [ENH] Allow object dtype in series (sktime#5886)
      [ENH] `check_pdmultiindex_panel` to return names of invalid `object` columns if there are any (sktime#6797)
      [ENH] added feature_kind metadata in datatype checks (sktime#6490)
      [ENH] Adding tag for categorical support in `X` (sktime#6704)
      [BUG] Fix bug when predicting segments from clasp change point annotator (sktime#6756)
      [ENH] Adding support for GluonTS' PandasDataset object (sktime#6668)

commit a0d950d
Merge: 2d77fd6 1fa8cfc
Author: Abhay-Lejith <abhaylejith@gmail.com>
Date:   Wed Jul 17 22:03:27 2024 +0530

    Merge branch 'main' into step3-4_categorical_support

commit 2d77fd6
Author: Abhay-Lejith <abhaylejith@gmail.com>
Date:   Sat Jul 13 18:25:44 2024 +0530

    test error fix

commit 194ec28
Merge: 143a176 23bbb9b
Author: Abhay-Lejith <abhaylejith@gmail.com>
Date:   Sat Jul 13 17:16:00 2024 +0530

    Merge updated branch 'feature_kind' with timedelta dtype added

commit 143a176
Author: Abhay-Lejith <abhaylejith@gmail.com>
Date:   Sat Jul 13 17:11:59 2024 +0530

    added tests for categorical

commit 23bbb9b
Author: Abhay-Lejith <abhaylejith@gmail.com>
Date:   Fri Jul 12 12:31:52 2024 +0530

    added timedelta to DATETIME category

commit f926c58
Author: Abhay-Lejith <abhaylejith@gmail.com>
Date:   Fri Jul 12 10:40:29 2024 +0530

    added ignores-exog-X check to test for invalid X input

commit 460caba
Author: Abhay-Lejith <abhaylejith@gmail.com>
Date:   Fri Jul 12 09:32:09 2024 +0530

    added checks in transformers

commit 08aefa2
Author: Abhay-Lejith <abhaylejith@gmail.com>
Date:   Thu Jul 11 13:16:09 2024 +0530

    added categorical checks in other base classes

commit 0812220
Author: Abhay-Lejith <abhaylejith@gmail.com>
Date:   Tue Jul 9 13:29:22 2024 +0530

    linting blank line fix

commit 5f3be53
Author: Abhay-Lejith <abhaylejith@gmail.com>
Date:   Tue Jul 9 13:16:07 2024 +0530

    added feature_kind to req metadata in early classifiers

commit 1d7e443
Merge: 7377837 b1f6167
Author: Abhay-Lejith <abhaylejith@gmail.com>
Date:   Tue Jul 9 13:11:34 2024 +0530

    Merge branch updated 'feature_kind' with xarray and dask metadata support

commit b1f6167
Author: Abhay-Lejith <abhaylejith@gmail.com>
Date:   Tue Jul 9 13:00:36 2024 +0530

    added metadata for xarray and dask with tests

commit 0905bbb
Author: Abhay-Lejith <abhaylejith@gmail.com>
Date:   Mon Jul 8 13:23:36 2024 +0530

    dataloader test error workaround

commit 7377837
Merge: d36e05a 8652ba9
Author: Abhay-Lejith <abhaylejith@gmail.com>
Date:   Mon Jul 8 12:33:23 2024 +0530

    Merge branch updated 'feature_kind' with new bool mapping

commit 8652ba9
Author: Abhay-Lejith <abhaylejith@gmail.com>
Date:   Mon Jul 8 12:32:22 2024 +0530

    changed mapping of boolean dtypekind to float

commit d36e05a
Author: Abhay-Lejith <abhaylejith@gmail.com>
Date:   Sun Jul 7 16:56:21 2024 +0530

    moved handle_categorical call to after check_is_error

commit 403ee6e
Author: Abhay-Lejith <abhaylejith@gmail.com>
Date:   Sun Jul 7 16:52:39 2024 +0530

    indexing fix

commit a006004
Author: Abhay-Lejith <abhaylejith@gmail.com>
Date:   Sun Jul 7 16:52:39 2024 +0530

    indexing fix

commit 3696f7b
Author: Abhay-Lejith <Abhay-Lejith@users.noreply.github.com>
Date:   Sun Jul 7 10:17:26 2024 +0000

    [AUTOMATED] update CONTRIBUTORS.md

commit 05c452c
Author: Abhay-Lejith <abhaylejith@gmail.com>
Date:   Sun Jul 7 15:46:37 2024 +0530

    updated tag name

commit 4b557c6
Merge: 8cd48ed 978263e
Author: Abhay-Lejith <abhaylejith@gmail.com>
Date:   Sun Jul 7 15:43:14 2024 +0530

    Merge updated branch 'categorical_tag' with class for new tag

commit 978263e
Author: Abhay-Lejith <abhaylejith@gmail.com>
Date:   Sun Jul 7 15:41:19 2024 +0530

    added class for tag

commit 8cd48ed
Merge: 90c9f31 c1af183
Author: Abhay-Lejith <abhaylejith@gmail.com>
Date:   Sun Jul 7 13:19:34 2024 +0530

    Merge updated branch 'feature_kind' with table support

commit c1af183
Author: Abhay-Lejith <abhaylejith@gmail.com>
Date:   Sun Jul 7 13:07:56 2024 +0530

    extended to table scitype along with tests

commit 90c9f31
Author: Abhay-Lejith <abhaylejith@gmail.com>
Date:   Sun Jul 7 11:19:35 2024 +0530

    calling _handle_categorical in forecasting, classification, regression

commit 655f6e0
Author: Abhay-Lejith <abhaylejith@gmail.com>
Date:   Sun Jul 7 11:19:08 2024 +0530

    added categorical handling logic

commit db94559
Author: Abhay-Lejith <abhaylejith@gmail.com>
Date:   Thu Jul 4 10:29:48 2024 +0530

    added docstring and modified code acc to suggestion

commit 1321e1e
Merge: de6c92b 0eb9231
Author: Abhay-Lejith <abhaylejith@gmail.com>
Date:   Tue Jul 2 13:11:12 2024 +0530

    Merge pr sktime#5886 into step3-4_categorical_support

commit de6c92b
Merge: e376768 3f62d98
Author: Abhay-Lejith <abhaylejith@gmail.com>
Date:   Tue Jul 2 13:09:50 2024 +0530

    Merge branch 'feature_kind' into step3-4_categorical_support

commit e376768
Author: Abhay-Lejith <abhaylejith@gmail.com>
Date:   Tue Jul 2 11:06:07 2024 +0530

    added categorical tag

commit 3f62d98
Author: Abhay-Lejith <abhaylejith@gmail.com>
Date:   Mon Jul 1 11:08:47 2024 +0530

    added separate tests for testing with categorical data

commit e806cc1
Author: Abhay-Lejith <abhaylejith@gmail.com>
Date:   Mon Jun 24 11:09:08 2024 +0530

    renamed fields and rmeoved use of pandas func for numpy based mtypes

commit 7a58087
Author: Abhay-Lejith <abhaylejith@gmail.com>
Date:   Wed Jun 12 18:33:18 2024 +0530

    added simple_feature_kind and reverted changes to examples

commit ee0bc1c
Author: Abhay-Lejith <abhaylejith@gmail.com>
Date:   Mon Jun 3 12:02:30 2024 +0530

    changed dtype to float in panel examples which I missed earlier

commit 2f84b24
Merge: 1a64d2f e22d8dc
Author: Abhay-Lejith <abhaylejith@gmail.com>
Date:   Mon Jun 3 10:11:30 2024 +0530

    Merge branch 'main' into feature_kind

commit 1a64d2f
Author: Abhay-Lejith <abhaylejith@gmail.com>
Date:   Mon Jun 3 10:07:19 2024 +0530

    changed dtype to float in panel test examples

commit dd81851
Author: Abhay-Lejith <abhaylejith@gmail.com>
Date:   Sat Jun 1 16:20:17 2024 +0530

    added dtypekind metadata for panel and hierarchical

commit 83b6fe8
Author: Abhay-Lejith <abhaylejith@gmail.com>
Date:   Fri May 31 10:25:28 2024 +0530

    using pandas api functions to detect dtype

commit c0962f9
Author: Abhay-Lejith <abhaylejith@gmail.com>
Date:   Thu May 30 12:17:51 2024 +0530

    added function for getting dtypekind of series

commit 1e2110a
Merge: 898b839 6f42855
Author: Franz Király <f.kiraly@ucl.ac.uk>
Date:   Wed May 29 13:54:43 2024 +0100

    Merge branch 'main' into pr/6490

commit 898b839
Author: Abhay-Lejith <abhaylejith@gmail.com>
Date:   Wed May 29 11:04:28 2024 +0530

    using dtypekind enum values

commit ee8e0c4
Author: Abhay-Lejith <abhaylejith@gmail.com>
Date:   Tue May 28 14:50:43 2024 +0530

    added feature_kind metadata in series check

commit 0eb9231
Author: Anirban Ray <39331844+yarnabrina@users.noreply.github.com>
Date:   Fri Feb 16 23:37:36 2024 +0530

    @fkiraly suggestion: add rule to differentiate nested_univ and Panel

commit 0e28842
Author: Anirban Ray <39331844+yarnabrina@users.noreply.github.com>
Date:   Fri Feb 16 10:46:16 2024 +0530

    support "object" dtype in Panel

    (cherry picked from commit 1477012)

commit 25eb6f5
Author: Anirban Ray <39331844+yarnabrina@users.noreply.github.com>
Date:   Fri Feb 16 10:42:17 2024 +0530

    @fkiraly edit: add rule to differentiate nested_univ and pd.DataFrame

    Ref. sktime#5886 (review)

    (cherry picked from commit 2cecd6b)

commit a729843
Author: Anirban Ray <39331844+yarnabrina@users.noreply.github.com>
Date:   Sat Feb 3 18:29:15 2024 +0530

    added test

    (cherry picked from commit 2de7d8c)

commit c624e42
Author: Anirban Ray <39331844+yarnabrina@users.noreply.github.com>
Date:   Sat Feb 3 18:28:47 2024 +0530

    support "object" dtype in Series

    (cherry picked from commit cc3d056)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement Adding new functionality module:datatypes datatypes module: data containers, checkers & converters

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants