[ENH] scitype and mtype typing via enum by chrisholder · Pull Request #1423 · sktime/sktime

chrisholder · 2021-09-16T14:27:02Z

Reference Issues/PRs

What does this implement/fix? Explain your changes.

This is a simple pr with the goal of making the datatypes modules easier to use. Currently the datatypes module requires str inputs for different datatypes. For example if I want to convert a given object to a numpy I would do something like:

convert_to(ts_obj, 'numpy3D')

This requires an extact string match which normally will require the developer to look up the possible strings in a notebook or at the documentation. This extra action of always having to look up the conversion can be solved very simply by allowing the functions in datatypes to accept enums like so:

convert_to(ts_obj, PanelMtype.NUMPY3D)

This has the benefit of the developer not needing to know the exact string and in addition allows for an ide to autocomplete and suggest other potential datatypes.

In addition with these enums I hope to implement a new testing utils that allows us to test over every potential type of ts input sktime supports and enums will make this much easier to achieve.

Does your contribution introduce a new dependency? If yes, which one?

No

What should a reviewer concentrate their feedback on?

While this isn't in stone and is currently only planned for use in this module, I've added a base registry enum class so other modules can potentially use enums as their registry I don't know if this is a good idea or where I did put this base class is an appropirate place for this to live.

Any other comments?

PR checklist

For all contributions

I've added myself to the list of contributors.
Optionally, I've updated sktime's CODEOWNERS to receive notifications about future changes to these files.
I've added unit tests and made sure they pass locally.

For new estimators

I've added the estimator to the online documentation.
I've updated the existing example notebooks or provided a new one to showcase how my estimator works.

fkiraly · 2021-09-17T12:48:49Z

Really great idea in general!

As discussed earlier, my main thoughts are around how we avoid multiplication of "sources of truth" a for the conversion strings/enums and increasing the extension burden, i.e., through necessity to update the registry in multiple places in parallel when we introduce a new mtype.

Also, it seems like this PR breaks some things?
Do you mind looking into it?

chrisholder · 2021-09-17T13:03:56Z

Fixing them now it is to do with me not checking for None - Added some extra tests to make sure this can't happen in the future as well since the modules tests didn't flag this to me

chrisholder · 2021-09-17T13:05:35Z

As discussed earlier, my main thoughts are around how we avoid multiplication of "sources of truth" a for the conversion strings/enums and increasing the extension burden, i.e., through necessity to update the registry in multiple places in parallel when we introduce a new mtype.

I made it so the registry reads from the enum so there is only one place you need to add a new mtype (or even scitype if there is ever a need for that)

sktime/datatypes/_convert.py

fkiraly

Interesting idea!
Also, I appreciate that this is downwards compatible, great!

Not directly a change request, since I'm not sure how to address this, but two problems that I think need to be addressed or at least discussed before we merge:

there is a difference in capitalization between the enum keys and the enum values. This can easily become confusing.
Naive use of the enums will result in something like convert_to(x, SeriesMtype.PD_DATAFRAME), this is quite clunky and seems confusing. Any way to improve this? I was thinking about convert_to(x, Series.df), but shouldn't Series be the type appearing in fit/predict etc which would give a clash.

fkiraly · 2021-09-19T20:53:15Z

sktime/datatypes/_registry.py

+SCITYPE_REGISTER = [tuple(scitype) for scitype in Scitype]
+
+
+def mtype_to_scitype(mtype: Union[PanelMtype, SeriesMtype, str]) -> str:


should this not just be Mtype?

Unfortunatly I can't import the types into this due to circular importing.

Argh - is there any way to resolve this, e.g., by splitting files?

That is, is the circularity coming through object imports, or just via module imports?

Module imports since the types are based off the enums defined in that file. I can move it but it was where you orginally defined mtype_to_scitpye where would you want me to put it instead?

really good question - perhaps the enums should move somewhere, e.g., types, which contains the enums and any logic that allows shorthand use as a typing type?

chrisholder · 2021-09-20T19:18:02Z

Interesting idea!
Also, I appreciate that this is downwards compatible, great!

Not directly a change request, since I'm not sure how to address this, but two problems that I think need to be addressed or at least discussed before we merge:

there is a difference in capitalization between the enum keys and the enum values. This can easily become confusing.

Naive use of the enums will result in something like convert_to(x, SeriesMtype.PD_DATAFRAME), this is quite clunky and seems confusing. Any way to improve this? I was thinking about convert_to(x, Series.df), but shouldn't Series be the type appearing in fit/predict etc which would give a clash.

So I've made some changes based on your feedback to the enums names. Not sure if they are perfect so now you would do

convert_to(x, SeriesMtype.pd_dataframe);

I can rename the SeriesMtype to just 'Series' but am concerned this could further confuse the scitype stuff. Another thought I had could be access it through a chain of enums i.e.

convert_to(x, SciType.Series.pd_dataframe)
convert_to(x, SciType.Panel.nested_univariate)

EDIT:
Reading what you said before I will note this doesn't match the strings? Did you mean you just wanted it to match the strings. Only reason they were upper case is simply because people tend to put enums in all caps as they are constants.

fkiraly · 2021-09-26T08:47:25Z

@chrisholder, I think this is excellent still, in terms of ideas.

However, I think we really need to be clear as to the intended target interface, since the devil is in the details when it comes to useability.

Would you kindly be able to write a short summary or STEP that explains the "end state" ways to address the types and conversions? Including "old" and "new" way, to compare? And, intended usage?

I really think it's not much, perhaps 10 or 20 lines or so - but it's important!

We need to ensure:

consistency of naming to avoid user confusion
immediacy/simplicity of the interface
we're close to "one obvious way" as per python zen (or at least, one quick way and one very configurable way)

Sorry for asking for this, but I feel we need to spend more time on design here before we go to implementation.
I'm not saying the features you implemented are not great, but there's the obvious conditionality where if we make even minor changes to the design (on which we have not aligned yet) it may create major work in the implementation. And, generally, it's a good idea to jot down the design before spending much time on implementing and refining it - that avoids frustration of the kind where you have already done all the work but we don't agree on function signatures etc.

fkiraly · 2021-09-26T08:51:13Z

Also, @ABostrom - could you look at this from the perspective of consistency with your typing extension, please?

fkiraly · 2021-10-03T09:44:26Z

?

chrisholder · 2021-10-07T02:11:55Z

@fkiraly sorry this has taken me a while to get back to you but based on one of your suggestions I've been trying to do something clever with type hinting that didn't work out - Ill talk to you another time about it.

However, what I have done is I've gone away, taken your suggestion, and made it so the enums are now nested. This means to access them you would do:

Mtype.Panel.np_flat
Mtype.Series.np_array

Which I agree is a much better solution.

In addition to all this because I changed some of the string names of the mtypes (and I knew you wouldn't be happy with some of them) I did a big refactor in the check and example files for panel and series so that if you're unhappy with a name or want to rename something then we only need to change it in one place rather than in every file. So if you're unhappy with any of the naming I've done then it quite literally a one line change to make it what you want it to be. These are the renaming (in places) of types I've done:

Series
Mtype.Series.pd_series
Mtype.Series.pd_dataframe
Mtype.Series.np_array

Panel
Mtype.Panel.pd_univariate_nested_dataframe
Mtype.Panel.np_3d_array
Mtype.Panel.np_flat
Mtype.Panel.pd_multi_index_dataframe
Mtype.Panel.pd_wide_dataframe
Mtype.Panel.pd_long_dataframe
Mtype.Panel.list_pd_dataframe

So what I've tried to do with renaming (and again if you want it reverted just ask literally one line changes now) is just make it more consistent as in your naming conventions you were prefixes some mtype with 'pd' or 'np' but not others, there were some random capitals and you were inconsistently seperating words, some using '-', some using'.', and others using '_'. All my change here hopes to do is make it more consistent and more clear to the user exactly what the datatype consists of.

In response to your question about the register and source of truth for the Mtypes the only source of truth IS the enums. So for examples lets say I want to add a new Series Mtype called list_series (that for the sake of argument lets just say it's a list of floats). To do this I would go to sktime/datatypes/_series/_registry.py and see an enum that looks like this:

class SeriesMtype(BaseRegistryEnum):
    """Enum class for series mtypes."""

    pd_series = (
        "pd.Series representation of a univariate series",
        "Series",
        pd.Series,
    )
    pd_dataframe = (
        "pd.DataFrame representation of a uni- or multivariate series",
        "Series",
        pd.DataFrame,
    )
    np_array = (
        "2D numpy.ndarray with rows=samples, cols=variables, index=integers",
        "Series",
        np.ndarray,
    )

To add a new Series mtype I would simply create a new variable (the variable name is the string name of the datatype), and then pass a tuple with the first argument being a description of the datatype, the second being the scitype it belongs to (this is fairly redundent but I need this piece of data to do some clever stuff under the hood) and finally the python 'type' it belongs to. Therefore to add out 'list_series', this would look like:

class SeriesMtype(BaseRegistryEnum):
    """Enum class for series mtypes."""

    pd_series = (
        "pd.Series representation of a univariate series",
        "Series",
        pd.Series,
    )
    pd_dataframe = (
        "pd.DataFrame representation of a uni- or multivariate series",
        "Series",
        pd.DataFrame,
    )
    np_array = (
        "2D numpy.ndarray with rows=samples, cols=variables, index=integers",
        "Series",
        np.ndarray,
    )
    list_series = (
        "List of floats where rows=samples, cols=variables, index=integers"
        "Series",
        list[Union[float, List]]
    )

The process to add a new Panel mtype is the exact same but would be in the enum in the file sktime/datatypes/_panel/_registry.py.

These enums at compile time are then read into a tuple in memory which is the same as the registry before. Just to re-emphasise while they are read at compile time to a registry, the enums are the only source of truth.

However, to add a new Scitype the solution I now have is a bit less 'elegant'. The reason is: in order to achieve the 'nested' aspect I've had to move the scitypes out from being and Enum and make a stand alone Mtype class with static variables and methods. This can be found in sktime/datatypes/_datatypes.py. This looks like the following:


class Datatypes:
    """Mtype class containing valid types.

    Attributes
    -----------
    Series: Enum
        Attribute containing the valid Series mtypes.
    Panel: Enum
        Attribute containing the valid Panel mtypes.
    """

    Series = SeriesMtype
    Panel = PanelMtype

    @staticmethod
    def list_scitypes() -> list:
        """Method to get all valid scitypes.

        Returns
        -------
        list
            List containing the valid scitypes
        """
        temp = []
        for attr in dir(Datatypes()):
            if not attr.startswith("__") and attr[0].isupper():
                temp.append(attr)
        return temp

    @staticmethod
    def list_mtypes() -> dict:
        """Get list of valid mtypes.

        Returns
        -------
        dict
            Dict where the scitype is the key and the value is a list of
            valid mtypes
        """
        scitypes = Datatypes.list_scitypes()
        temp = {}

        for type in scitypes:
            curr = [str(val) for val in getattr(Datatypes(), type)]
            temp[type] = curr
        return temp

The reason this had to be done is while nested enums are a thing, they only work at run time. This means that most (if not all) IDEs won't be able to do autocomplete and autosuggest (which was one of the main points of using enums) because it requires the code to be compiled to resolve the values. As such to get the 'nested' effect of Mtype.Panel. ... we have to use static variables. I've also added some utils methods to the class to get the mtypes and scitypes.

While this works great for code the main issue is it means in order to add a new scitype you need to add it in two places. The first is in the class above. So lets say im creating a 'Dictionary' Scitype, I would first need to define an enum that inherits from the BaseRegistryEnum located in sktime/base/_registry_enum.py like so:

class DictionaryMtype(BaseRegistryEnum):
    """Enum class for dictionary mtypes."""

    python_dict = (
        "Python dict storing my data",
        "Dictionary",
        dict,
    )


MTYPE_REGISTER_DICT = [tuple(mtype) for mtype in DictionaryMtype]

MTYPE_LIST_DICT = pd.DataFrame(MTYPE_REGISTER_DICT)[0].values

NOTE: The last two lines are just exporting the enum to the tuple registry.

Once defined I can then add the new scitype to the Mtype class like so:


class Datatypes:
    """Mtype class containing valid types.

    Attributes
    -----------
    Series: Enum
        Attribute containing the valid Series mtypes.
    Panel: Enum
        Attribute containing the valid Panel mtypes.
    DictionaryMtype: Enum
        Attribute containing the valid Dictionary mtypes.
    """

    Series = SeriesMtype
    Panel = PanelMtype
    Dictionary = DictionaryMtype
    
   ...etc.

However, as an additional step; as the Datatypes isn't an enum (and therefore can't be exported to the registry in the same way), an additional entry into the registry located in sktime/datatypes/_registry.py is needed. So this would result in:

SCITYPE_REGISTER = [
    ("Series", "uni- or multivariate time series"),
    ("Panel", "panel of uni- or multivariate time series"),
    ("Dictionary", "dict representation of time series"),
]

While this does mean there are 'two sources of truth' for scitypes, I figured we're much less likely to add a new scitype than mtype and therefore this being a bit clunky is probably acceptable.

Ill do a follow up comment on this tomorrow outlining the cool things you can actually do using the enums and the changes made (other than checking if something is a valid enum value).

chrisholder · 2021-10-07T15:35:57Z

Just an fyi with the changes to the forecasting module all it is is where you had the convert module things using "pd.Series", "np.ndarray" and "pd.Dataframe", I've just refactored them to instead use Datatypes.Series.pd_series, Datatypes.Series.np_array and Datatypes.Series.pd_dataframe. No logic has been touched and at runtime these are resolved back to the same strings.

fkiraly · 2021-10-08T11:27:42Z

Which I agree is a much better solution.

NICE!!!

I've also looked at the rest, and it looks really cool now!

One comment about the "scitype source of truth".

Could this not simply be solved as follows:

In Datatypes, make the values tuples rather than singletons.
E.g., sth like (SeriesMType, "Series", "uni- or multivariate time series"), or in the opposite order, that way it would be precisely consistent with the mtype enums?

Then anything in current SCITYPE_REGISTER would be in there, and you can either remove SCITYPE_REGISTER, or construct it from the Datatypes enum?

fkiraly · 2021-10-08T11:28:11Z

(obviously, there's something wrong with the CI tests, so I'll let you fix this before reviewing. Probably linting, since code-quality fails)

ABostrom · 2021-10-22T09:41:46Z

Thanks!

Yes, this looks quite convincing.

Can I ask some questions:

* IMPORTANT: can you outline the steps that a user would do if they were to add a new mtype; and where they add a new scitype? I.e., where would things have to be added? What probably doesn't change is the logic added to convert dicts, example dicts, and check dicts; but how does it change on the registry level? Currently you'd add one or more entries to the register.

* can you be explicit which existing objects are _removed_, _replaced_, or _duplicated_? I'm uncomfortable with the fact that we may be duplicating the register and there is not a clear, unique source of truth.

* I'm uncomfortable with the fact that a single registry has been split in two, by the third field (`Panel` or `Series`), so the information which scitype an mtype belongs to is no longer easy to obtain using the new structure, e.g., as in the function `mtype_to_scitype`.

* The consequene of this split also seems to be that a new enum needs to be added per scitype. Is there a good alternative way to keep this in one place? One that leads to a call `Mtype.Series.pd_dataframe` perhaps?

* is there a good way we can use the scitypes themselves as type annotations? Following @ABostrom's suggestion of type aliases.

A line of thought: say we move the entire mtype registry to a single enum, and remove (or deprecate), not duplicate the registry within two enums, resulting in two enums Mtype, and Scitype and no other registries. How would this look like? Would an access like Mtype.Series.pd_dataframe be possible?

Sorry I've been slow to reply too.

I'm not sure whether we can use these as type aliases. Although I need to do some experimenting. Type Aliases has fallen down my priority list. Lecturing is killing me this semester. My teaching should begin to wind down soon (end of November) then i'm going to do a big push on aliases to try and get it into a workable state.

fkiraly · 2022-04-25T07:20:52Z

@chrisholder, could we keep these open? I might be using this someday!

fkiraly · 2022-10-16T12:14:32Z

I'm reopening as a draft to have this as a basis for potential m/scitype type hinting and design discussion. No expectation for @chrisholder to work on this.

chrisholder added 4 commits September 13, 2021 05:08

Base registry enum class and SciType defined

def45b8

Added enum type for SeriesMtype

a5ffe47

Added enum type for PanelMtype

dfccfd2

datatypes supports enums

58bdefc

chrisholder requested a review from mloning as a code owner September 16, 2021 14:27

chrisholder added 3 commits September 16, 2021 15:30

Merge branch 'main' into convert-enum

ae99c3b

code qual

43dfb53

code qual

22a5c5b

chrisholder added 4 commits September 17, 2021 18:05

fixed bug and improved testing for the enum types

67a74a2

docstring fixes

9b02491

docstring fixes

525d2b5

docstring fixes

246bccd

fkiraly reviewed Sep 19, 2021

View reviewed changes

sktime/datatypes/_convert.py Outdated Show resolved Hide resolved

fkiraly reviewed Sep 19, 2021

View reviewed changes

sktime/datatypes/_convert.py Outdated Show resolved Hide resolved

fkiraly requested changes Sep 19, 2021

View reviewed changes

fkiraly reviewed Sep 19, 2021

View reviewed changes

resolved feedback

5915ded

chrisholder added 2 commits September 20, 2021 20:27

types renamed

fa70e2d

implemented more intelligent indexing for enum

c3b3fff

This was referenced Sep 25, 2021

Convert resolver #1459

Closed

datatypes.convert to infer indirect (chained) conversions when direct conversions are not implemented #1460

Closed

fkiraly requested review from ABostrom and MatthewMiddlehurst and removed request for MatthewMiddlehurst September 26, 2021 08:51

changed enums so now nested

492b031

chrisholder added 2 commits October 7, 2021 03:12

Merge remote-tracking branch 'origin' into convert-enum

65f0aca

docstring fixes

a1f4a7b

chrisholder force-pushed the convert-enum branch from fe52b97 to a1f4a7b Compare October 7, 2021 02:17

chrisholder added 2 commits October 7, 2021 15:14

renamed bad varaible name

7306ba6

refactor forecasting convert

b197c9d

chrisholder requested review from GuzalBulatova and aiwalter as code owners October 7, 2021 15:33

chrisholder closed this Oct 8, 2021

chrisholder reopened this Oct 8, 2021

This was referenced Oct 16, 2021

Transformers refactor #1365

Merged

[ENH] rename check_is to check_is_mtype #1533

Closed

Dynamic time warping - aligners and distances #1264

Merged

chrisholder added 3 commits October 19, 2021 08:10

Merge branch 'main' into convert-enum

3b1665d

docstring changes

7b21f59

sort order

05fde4e

chrisholder closed this Apr 25, 2022

fkiraly reopened this Oct 16, 2022

fkiraly marked this pull request as draft October 16, 2022 12:14

fkiraly added implementing framework Implementing or improving framework for learning tasks, e.g., base class functionality module:datatypes datatypes module: data containers, checkers & converters labels Oct 16, 2022

fkiraly changed the title ~~Datatypes enum~~ [ENH] scitype and mtype typing via enum Oct 16, 2022

		SCITYPE_REGISTER = [tuple(scitype) for scitype in Scitype]


		def mtype_to_scitype(mtype: Union[PanelMtype, SeriesMtype, str]) -> str:

Uh oh!

Conversation

chrisholder commented Sep 16, 2021

Reference Issues/PRs

What does this implement/fix? Explain your changes.

Does your contribution introduce a new dependency? If yes, which one?

What should a reviewer concentrate their feedback on?

Any other comments?

PR checklist

For all contributions

For new estimators

Uh oh!

fkiraly commented Sep 17, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

chrisholder commented Sep 17, 2021

Uh oh!

chrisholder commented Sep 17, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

Uh oh!

fkiraly left a comment

Choose a reason for hiding this comment

Uh oh!

fkiraly Sep 19, 2021

Choose a reason for hiding this comment

Uh oh!

chrisholder Sep 20, 2021

Choose a reason for hiding this comment

Uh oh!

fkiraly Sep 20, 2021

Choose a reason for hiding this comment

Uh oh!

chrisholder Sep 20, 2021

Choose a reason for hiding this comment

Uh oh!

fkiraly Sep 30, 2021

Choose a reason for hiding this comment

Uh oh!

chrisholder commented Sep 20, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

fkiraly commented Sep 26, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

fkiraly commented Sep 26, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

fkiraly commented Oct 3, 2021

Uh oh!

chrisholder commented Oct 7, 2021

Uh oh!

chrisholder commented Oct 7, 2021

Uh oh!

fkiraly commented Oct 8, 2021

Uh oh!

fkiraly commented Oct 8, 2021

Uh oh!

ABostrom commented Oct 22, 2021

Uh oh!

fkiraly commented Apr 25, 2022

Uh oh!

fkiraly commented Oct 16, 2022

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

fkiraly commented Sep 17, 2021 •

edited

Loading

chrisholder commented Sep 17, 2021 •

edited

Loading

chrisholder commented Sep 20, 2021 •

edited

Loading

fkiraly commented Sep 26, 2021 •

edited

Loading

fkiraly commented Sep 26, 2021 •

edited

Loading