[ENH] Allow object dtype in series#5886
[ENH] Allow object dtype in series#5886yarnabrina merged 15 commits intosktime:mainfrom yarnabrina:allow-object-dtype-in-series
Conversation
|
@fkiraly I see a new failure already, but I don't know what part of my change is affecting this. Can you guide? Also, I have a question. What is the difference between |
|
scitype = abstract data type, e.g., time series, or collection of time series |
|
The failures are due to a secondary problem, namely that objects of mtype That means the two mtypes are no longer distinguishable by checks, which leads to further trouble as the data checking layer can no longer decide what mtype it shoud assume To fix this, we need to ensure tha An idea would be to check that, in |
|
should I try to do that? |
|
I'm unlikely to any time this week, so please do if you have time. I'll
attempt over weekend otherwise.
…On Mon, Feb 5, 2024, 21:36 Franz Király ***@***.***> wrote:
should I try to do that?
—
Reply to this email directly, view it on GitHub
<#5886 (comment)>, or
unsubscribe
<https://github.com/notifications/unsubscribe-auth/AJMCQBD5YMLVE4G26S2D4RDYSD7RVAVCNFSM6AAAAABCYAU4BCVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTSMRXGMZTQNRTGE>
.
You are receiving this because you authored the thread.Message ID:
***@***.***>
|
|
@yarnabrina, your remote is rejecting push. You need to check the box "allow maintainers to make changes". To fix, add following lines starting line 66 of # check to delineate from nested_univ mtype (Panel)
# pd.DataFrame mtype allows object dtype,
# but if we allow object dytpe with pd.Series entries,
# the mtype becomes ambiguous, i.e., non-delineable from nested_univ
if isinstance(obj.iloc[0, 0], (pd.Series, pd.DataFrame)):
msg = f"{var_name} cannot contain nested pd.Series or pd.DataFrame"
return ret(False, msg, None, return_metadata) |
Done, can you please check now? |
|
ok, works |
|
@fkiraly I'll try to get back to this PR on the weekend. Can you give some suggestions about the new errors? I definitely did not expect IndexError to come after allowing a new dtype. |
tried to fix - failure is incomplete edge case treatment of empty data frames (zero rows or cols) |
|
@fkiraly making it as ready, all tests pass after that additional check you added. |
There was a problem hiding this comment.
So, works now and I'd be happy with it
The change I made was the additional check in now lines 66-72 of _series._check, which allows the system to delineate nested_univ from pd.DataFrame mtype, at least in all test cases. Empty pd.DataFrame count as the latter.
fkiraly
left a comment
There was a problem hiding this comment.
actually, I noticed multiple things that imo we need to address before this:
PanelandHierarchicaltypes still forbidobject- we are not systematically testing this input type - from now, users will not receive clear warning.
- possibly, many estimators will break if given
objectdtype-d data - we need to think about what to do. Options: some kind of coercion; having estimator tags which tell the user whichdtypesare supported, and an informative message is raised based on that.- none seems fully satisfactory, as this is very
pandasspecific.
- none seems fully satisfactory, as this is very
I actually skipped this one intentionally. I thought it may be better to allow separately for different types. Unless you think they are interlinked and can cause issues, I really have very little, if any, familiarity with how scitype/mtype/etc. works in sktime?
This one is sort of a dilemma - currently a lot can use object type and we don't let them use it, while allowing may break some which do support. I don't know how to choose. Also, there's a "category" type which is being more and more popular.
I didn't understnad what you meant by this one. What is happening now, and what will change to confuse users? |
Yes, they are linked through broadcasting. One potential failure case that is not covered due to lack of examples for the contract expansion is as follows: suppose we have a transformer that internally is implemented only for Similarly, there is now an inconsistency the other way round - even if an estimator would now accept an individual series, it would no longer accept it if it is found with others of similar type in a collection of series. PS: I think for |
I do think ew should support it, but "properly".
|
Consider the situation of an estimator that does not support the type genuinely, i.e., would break internally of Suppose a user passes a However, if we widen the contract without testing and adding catches where they fail, then in this estimator something random will break, and the error will be uninformative, e.g., a From a higher poin of view, by widening the contract, we may now be creating uncovered bugs. So either we have to move the contract boundary "further in" by one of the measures proposed above (e.g., tag and catch), or we have to ensure all estimators satisfy the new contract. In any case, it has to be tested. |
|
well, that's strange. One of the notebooks fails, while the test suite is fine. We need to identify what object is being passed to the checkers, and why it is ambiguous. Perhaps it has mixed columns? |
|
oh wait, the problem is actually We just need to add the same trick in For |
|
to avoid conflicts: do you want ot add the mini-check in |
|
I just faced weird set of conflicts. Fixed locally, so let me make one last attempt. |
yarnabrina
left a comment
There was a problem hiding this comment.
I did a rebase from main and force push, so I think you will get conflicts. So, if possible can you delete your local branch and pull this?
Also, I do not know where are the other changes you said are necessary to do, so can you please apply that? I'll not modify this branch anymore to avoid future conflicts, deleting from my local.
(cherry picked from commit cc3d056)
(cherry picked from commit 2de7d8c)
Ref. #5886 (review) (cherry picked from commit 2cecd6b)
(cherry picked from commit 1477012)
|
I updated this branch with |
Ref. #5886 (review) (cherry picked from commit 2cecd6b) (cherry picked from commit d1368cf)
(cherry picked from commit e6ebe3c)
* upstream/main: [ENH] Allow object dtype in series (sktime#5886) [ENH] `check_pdmultiindex_panel` to return names of invalid `object` columns if there are any (sktime#6797) [ENH] added feature_kind metadata in datatype checks (sktime#6490) [ENH] Adding tag for categorical support in `X` (sktime#6704) [BUG] Fix bug when predicting segments from clasp change point annotator (sktime#6756) [ENH] Adding support for GluonTS' PandasDataset object (sktime#6668)
commit 9f95d89 Merge: 02bd3fb 6caeb0d Author: Abhay-Lejith <abhaylejith@gmail.com> Date: Tue Jul 30 22:01:03 2024 +0530 Merge branch 'main' into step3-4_categorical_support commit 02bd3fb Merge: 53ee977 a21d97f Author: Abhay-Lejith <abhaylejith@gmail.com> Date: Mon Jul 29 21:47:41 2024 +0530 Merge branch 'main' into step3-4_categorical_support commit 53ee977 Merge: 74fd3f8 9b066f2 Author: Franz Király <f.kiraly@ucl.ac.uk> Date: Sun Jul 28 15:19:13 2024 +0100 Merge branch 'main' into pr/6732 commit 74fd3f8 Merge: e9f5e2b 116faa2 Author: Franz Király <f.kiraly@ucl.ac.uk> Date: Sun Jul 28 15:18:26 2024 +0100 Merge branch 'main' into pr/6732 commit e9f5e2b Author: Abhay-Lejith <Abhay-Lejith@users.noreply.github.com> Date: Sun Jul 28 10:32:15 2024 +0000 [AUTOMATED] update CONTRIBUTORS.md commit f3e212b Merge: 95c213b 81d2d08 Author: Abhay-Lejith <abhaylejith@gmail.com> Date: Sun Jul 28 16:01:12 2024 +0530 Merge branch 'main' into step3-4_categorical_support commit 95c213b Author: Abhay-Lejith <abhaylejith@gmail.com> Date: Sun Jul 28 15:59:02 2024 +0530 added est.update to test_categorical commit 8d14bde Author: Abhay-Lejith <abhaylejith@gmail.com> Date: Sat Jul 27 09:35:51 2024 +0530 typo, docs commit 81b4862 Author: Abhay-Lejith <abhaylejith@gmail.com> Date: Thu Jul 25 22:27:00 2024 +0530 separate test for categorical in y commit 23b4279 Merge: a0d950d 3a55ca5 Author: Anirban Ray <39331844+yarnabrina@users.noreply.github.com> Date: Sat Jul 20 09:29:27 2024 +0530 Merge remote-tracking branch 'upstream/main' into pr/Abhay-Lejith/6732 * upstream/main: [ENH] Allow object dtype in series (sktime#5886) [ENH] `check_pdmultiindex_panel` to return names of invalid `object` columns if there are any (sktime#6797) [ENH] added feature_kind metadata in datatype checks (sktime#6490) [ENH] Adding tag for categorical support in `X` (sktime#6704) [BUG] Fix bug when predicting segments from clasp change point annotator (sktime#6756) [ENH] Adding support for GluonTS' PandasDataset object (sktime#6668) commit a0d950d Merge: 2d77fd6 1fa8cfc Author: Abhay-Lejith <abhaylejith@gmail.com> Date: Wed Jul 17 22:03:27 2024 +0530 Merge branch 'main' into step3-4_categorical_support commit 2d77fd6 Author: Abhay-Lejith <abhaylejith@gmail.com> Date: Sat Jul 13 18:25:44 2024 +0530 test error fix commit 194ec28 Merge: 143a176 23bbb9b Author: Abhay-Lejith <abhaylejith@gmail.com> Date: Sat Jul 13 17:16:00 2024 +0530 Merge updated branch 'feature_kind' with timedelta dtype added commit 143a176 Author: Abhay-Lejith <abhaylejith@gmail.com> Date: Sat Jul 13 17:11:59 2024 +0530 added tests for categorical commit 23bbb9b Author: Abhay-Lejith <abhaylejith@gmail.com> Date: Fri Jul 12 12:31:52 2024 +0530 added timedelta to DATETIME category commit f926c58 Author: Abhay-Lejith <abhaylejith@gmail.com> Date: Fri Jul 12 10:40:29 2024 +0530 added ignores-exog-X check to test for invalid X input commit 460caba Author: Abhay-Lejith <abhaylejith@gmail.com> Date: Fri Jul 12 09:32:09 2024 +0530 added checks in transformers commit 08aefa2 Author: Abhay-Lejith <abhaylejith@gmail.com> Date: Thu Jul 11 13:16:09 2024 +0530 added categorical checks in other base classes commit 0812220 Author: Abhay-Lejith <abhaylejith@gmail.com> Date: Tue Jul 9 13:29:22 2024 +0530 linting blank line fix commit 5f3be53 Author: Abhay-Lejith <abhaylejith@gmail.com> Date: Tue Jul 9 13:16:07 2024 +0530 added feature_kind to req metadata in early classifiers commit 1d7e443 Merge: 7377837 b1f6167 Author: Abhay-Lejith <abhaylejith@gmail.com> Date: Tue Jul 9 13:11:34 2024 +0530 Merge branch updated 'feature_kind' with xarray and dask metadata support commit b1f6167 Author: Abhay-Lejith <abhaylejith@gmail.com> Date: Tue Jul 9 13:00:36 2024 +0530 added metadata for xarray and dask with tests commit 0905bbb Author: Abhay-Lejith <abhaylejith@gmail.com> Date: Mon Jul 8 13:23:36 2024 +0530 dataloader test error workaround commit 7377837 Merge: d36e05a 8652ba9 Author: Abhay-Lejith <abhaylejith@gmail.com> Date: Mon Jul 8 12:33:23 2024 +0530 Merge branch updated 'feature_kind' with new bool mapping commit 8652ba9 Author: Abhay-Lejith <abhaylejith@gmail.com> Date: Mon Jul 8 12:32:22 2024 +0530 changed mapping of boolean dtypekind to float commit d36e05a Author: Abhay-Lejith <abhaylejith@gmail.com> Date: Sun Jul 7 16:56:21 2024 +0530 moved handle_categorical call to after check_is_error commit 403ee6e Author: Abhay-Lejith <abhaylejith@gmail.com> Date: Sun Jul 7 16:52:39 2024 +0530 indexing fix commit a006004 Author: Abhay-Lejith <abhaylejith@gmail.com> Date: Sun Jul 7 16:52:39 2024 +0530 indexing fix commit 3696f7b Author: Abhay-Lejith <Abhay-Lejith@users.noreply.github.com> Date: Sun Jul 7 10:17:26 2024 +0000 [AUTOMATED] update CONTRIBUTORS.md commit 05c452c Author: Abhay-Lejith <abhaylejith@gmail.com> Date: Sun Jul 7 15:46:37 2024 +0530 updated tag name commit 4b557c6 Merge: 8cd48ed 978263e Author: Abhay-Lejith <abhaylejith@gmail.com> Date: Sun Jul 7 15:43:14 2024 +0530 Merge updated branch 'categorical_tag' with class for new tag commit 978263e Author: Abhay-Lejith <abhaylejith@gmail.com> Date: Sun Jul 7 15:41:19 2024 +0530 added class for tag commit 8cd48ed Merge: 90c9f31 c1af183 Author: Abhay-Lejith <abhaylejith@gmail.com> Date: Sun Jul 7 13:19:34 2024 +0530 Merge updated branch 'feature_kind' with table support commit c1af183 Author: Abhay-Lejith <abhaylejith@gmail.com> Date: Sun Jul 7 13:07:56 2024 +0530 extended to table scitype along with tests commit 90c9f31 Author: Abhay-Lejith <abhaylejith@gmail.com> Date: Sun Jul 7 11:19:35 2024 +0530 calling _handle_categorical in forecasting, classification, regression commit 655f6e0 Author: Abhay-Lejith <abhaylejith@gmail.com> Date: Sun Jul 7 11:19:08 2024 +0530 added categorical handling logic commit db94559 Author: Abhay-Lejith <abhaylejith@gmail.com> Date: Thu Jul 4 10:29:48 2024 +0530 added docstring and modified code acc to suggestion commit 1321e1e Merge: de6c92b 0eb9231 Author: Abhay-Lejith <abhaylejith@gmail.com> Date: Tue Jul 2 13:11:12 2024 +0530 Merge pr sktime#5886 into step3-4_categorical_support commit de6c92b Merge: e376768 3f62d98 Author: Abhay-Lejith <abhaylejith@gmail.com> Date: Tue Jul 2 13:09:50 2024 +0530 Merge branch 'feature_kind' into step3-4_categorical_support commit e376768 Author: Abhay-Lejith <abhaylejith@gmail.com> Date: Tue Jul 2 11:06:07 2024 +0530 added categorical tag commit 3f62d98 Author: Abhay-Lejith <abhaylejith@gmail.com> Date: Mon Jul 1 11:08:47 2024 +0530 added separate tests for testing with categorical data commit e806cc1 Author: Abhay-Lejith <abhaylejith@gmail.com> Date: Mon Jun 24 11:09:08 2024 +0530 renamed fields and rmeoved use of pandas func for numpy based mtypes commit 7a58087 Author: Abhay-Lejith <abhaylejith@gmail.com> Date: Wed Jun 12 18:33:18 2024 +0530 added simple_feature_kind and reverted changes to examples commit ee0bc1c Author: Abhay-Lejith <abhaylejith@gmail.com> Date: Mon Jun 3 12:02:30 2024 +0530 changed dtype to float in panel examples which I missed earlier commit 2f84b24 Merge: 1a64d2f e22d8dc Author: Abhay-Lejith <abhaylejith@gmail.com> Date: Mon Jun 3 10:11:30 2024 +0530 Merge branch 'main' into feature_kind commit 1a64d2f Author: Abhay-Lejith <abhaylejith@gmail.com> Date: Mon Jun 3 10:07:19 2024 +0530 changed dtype to float in panel test examples commit dd81851 Author: Abhay-Lejith <abhaylejith@gmail.com> Date: Sat Jun 1 16:20:17 2024 +0530 added dtypekind metadata for panel and hierarchical commit 83b6fe8 Author: Abhay-Lejith <abhaylejith@gmail.com> Date: Fri May 31 10:25:28 2024 +0530 using pandas api functions to detect dtype commit c0962f9 Author: Abhay-Lejith <abhaylejith@gmail.com> Date: Thu May 30 12:17:51 2024 +0530 added function for getting dtypekind of series commit 1e2110a Merge: 898b839 6f42855 Author: Franz Király <f.kiraly@ucl.ac.uk> Date: Wed May 29 13:54:43 2024 +0100 Merge branch 'main' into pr/6490 commit 898b839 Author: Abhay-Lejith <abhaylejith@gmail.com> Date: Wed May 29 11:04:28 2024 +0530 using dtypekind enum values commit ee8e0c4 Author: Abhay-Lejith <abhaylejith@gmail.com> Date: Tue May 28 14:50:43 2024 +0530 added feature_kind metadata in series check commit 0eb9231 Author: Anirban Ray <39331844+yarnabrina@users.noreply.github.com> Date: Fri Feb 16 23:37:36 2024 +0530 @fkiraly suggestion: add rule to differentiate nested_univ and Panel commit 0e28842 Author: Anirban Ray <39331844+yarnabrina@users.noreply.github.com> Date: Fri Feb 16 10:46:16 2024 +0530 support "object" dtype in Panel (cherry picked from commit 1477012) commit 25eb6f5 Author: Anirban Ray <39331844+yarnabrina@users.noreply.github.com> Date: Fri Feb 16 10:42:17 2024 +0530 @fkiraly edit: add rule to differentiate nested_univ and pd.DataFrame Ref. sktime#5886 (review) (cherry picked from commit 2cecd6b) commit a729843 Author: Anirban Ray <39331844+yarnabrina@users.noreply.github.com> Date: Sat Feb 3 18:29:15 2024 +0530 added test (cherry picked from commit 2de7d8c) commit c624e42 Author: Anirban Ray <39331844+yarnabrina@users.noreply.github.com> Date: Sat Feb 3 18:28:47 2024 +0530 support "object" dtype in Series (cherry picked from commit cc3d056)
Reference: #5867 (comment)