Skip to content

[ENH] scitype and mtype typing via enum#1423

Draft
chrisholder wants to merge 22 commits intomainfrom
convert-enum
Draft

[ENH] scitype and mtype typing via enum#1423
chrisholder wants to merge 22 commits intomainfrom
convert-enum

Conversation

@chrisholder
Copy link
Copy Markdown
Collaborator

Reference Issues/PRs

What does this implement/fix? Explain your changes.

This is a simple pr with the goal of making the datatypes modules easier to use. Currently the datatypes module requires str inputs for different datatypes. For example if I want to convert a given object to a numpy I would do something like:

convert_to(ts_obj, 'numpy3D')

This requires an extact string match which normally will require the developer to look up the possible strings in a notebook or at the documentation. This extra action of always having to look up the conversion can be solved very simply by allowing the functions in datatypes to accept enums like so:

convert_to(ts_obj, PanelMtype.NUMPY3D)

This has the benefit of the developer not needing to know the exact string and in addition allows for an ide to autocomplete and suggest other potential datatypes.

In addition with these enums I hope to implement a new testing utils that allows us to test over every potential type of ts input sktime supports and enums will make this much easier to achieve.

Does your contribution introduce a new dependency? If yes, which one?

No

What should a reviewer concentrate their feedback on?

While this isn't in stone and is currently only planned for use in this module, I've added a base registry enum class so other modules can potentially use enums as their registry I don't know if this is a good idea or where I did put this base class is an appropirate place for this to live.

Any other comments?

PR checklist

For all contributions
  • I've added myself to the list of contributors.
  • Optionally, I've updated sktime's CODEOWNERS to receive notifications about future changes to these files.
  • I've added unit tests and made sure they pass locally.
For new estimators
  • I've added the estimator to the online documentation.
  • I've updated the existing example notebooks or provided a new one to showcase how my estimator works.

@fkiraly
Copy link
Copy Markdown
Collaborator

fkiraly commented Sep 17, 2021

Really great idea in general!

As discussed earlier, my main thoughts are around how we avoid multiplication of "sources of truth" a for the conversion strings/enums and increasing the extension burden, i.e., through necessity to update the registry in multiple places in parallel when we introduce a new mtype.

Also, it seems like this PR breaks some things?
Do you mind looking into it?

@chrisholder
Copy link
Copy Markdown
Collaborator Author

Fixing them now it is to do with me not checking for None - Added some extra tests to make sure this can't happen in the future as well since the modules tests didn't flag this to me

@chrisholder
Copy link
Copy Markdown
Collaborator Author

chrisholder commented Sep 17, 2021

As discussed earlier, my main thoughts are around how we avoid multiplication of "sources of truth" a for the conversion strings/enums and increasing the extension burden, i.e., through necessity to update the registry in multiple places in parallel when we introduce a new mtype.

I made it so the registry reads from the enum so there is only one place you need to add a new mtype (or even scitype if there is ever a need for that)

Copy link
Copy Markdown
Collaborator

@fkiraly fkiraly left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting idea!
Also, I appreciate that this is downwards compatible, great!

Not directly a change request, since I'm not sure how to address this, but two problems that I think need to be addressed or at least discussed before we merge:

  • there is a difference in capitalization between the enum keys and the enum values. This can easily become confusing.
  • Naive use of the enums will result in something like convert_to(x, SeriesMtype.PD_DATAFRAME), this is quite clunky and seems confusing. Any way to improve this? I was thinking about convert_to(x, Series.df), but shouldn't Series be the type appearing in fit/predict etc which would give a clash.

SCITYPE_REGISTER = [tuple(scitype) for scitype in Scitype]


def mtype_to_scitype(mtype: Union[PanelMtype, SeriesMtype, str]) -> str:
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should this not just be Mtype?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunatly I can't import the types into this due to circular importing.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Argh - is there any way to resolve this, e.g., by splitting files?

That is, is the circularity coming through object imports, or just via module imports?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Module imports since the types are based off the enums defined in that file. I can move it but it was where you orginally defined mtype_to_scitpye where would you want me to put it instead?

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

really good question - perhaps the enums should move somewhere, e.g., types, which contains the enums and any logic that allows shorthand use as a typing type?

@chrisholder
Copy link
Copy Markdown
Collaborator Author

chrisholder commented Sep 20, 2021

Interesting idea!
Also, I appreciate that this is downwards compatible, great!

Not directly a change request, since I'm not sure how to address this, but two problems that I think need to be addressed or at least discussed before we merge:

  • there is a difference in capitalization between the enum keys and the enum values. This can easily become confusing.
  • Naive use of the enums will result in something like convert_to(x, SeriesMtype.PD_DATAFRAME), this is quite clunky and seems confusing. Any way to improve this? I was thinking about convert_to(x, Series.df), but shouldn't Series be the type appearing in fit/predict etc which would give a clash.

So I've made some changes based on your feedback to the enums names. Not sure if they are perfect so now you would do

convert_to(x, SeriesMtype.pd_dataframe);

I can rename the SeriesMtype to just 'Series' but am concerned this could further confuse the scitype stuff. Another thought I had could be access it through a chain of enums i.e.

convert_to(x, SciType.Series.pd_dataframe)
convert_to(x, SciType.Panel.nested_univariate)

EDIT:
Reading what you said before I will note this doesn't match the strings? Did you mean you just wanted it to match the strings. Only reason they were upper case is simply because people tend to put enums in all caps as they are constants.

@fkiraly
Copy link
Copy Markdown
Collaborator

fkiraly commented Sep 26, 2021

@chrisholder, I think this is excellent still, in terms of ideas.

However, I think we really need to be clear as to the intended target interface, since the devil is in the details when it comes to useability.

Would you kindly be able to write a short summary or STEP that explains the "end state" ways to address the types and conversions? Including "old" and "new" way, to compare? And, intended usage?

I really think it's not much, perhaps 10 or 20 lines or so - but it's important!

We need to ensure:

  • consistency of naming to avoid user confusion
  • immediacy/simplicity of the interface
  • we're close to "one obvious way" as per python zen (or at least, one quick way and one very configurable way)

Sorry for asking for this, but I feel we need to spend more time on design here before we go to implementation.
I'm not saying the features you implemented are not great, but there's the obvious conditionality where if we make even minor changes to the design (on which we have not aligned yet) it may create major work in the implementation. And, generally, it's a good idea to jot down the design before spending much time on implementing and refining it - that avoids frustration of the kind where you have already done all the work but we don't agree on function signatures etc.

@fkiraly
Copy link
Copy Markdown
Collaborator

fkiraly commented Sep 26, 2021

Also, @ABostrom - could you look at this from the perspective of consistency with your typing extension, please?

@fkiraly fkiraly requested review from ABostrom and MatthewMiddlehurst and removed request for MatthewMiddlehurst September 26, 2021 08:51
@fkiraly
Copy link
Copy Markdown
Collaborator

fkiraly commented Oct 3, 2021

?

@chrisholder
Copy link
Copy Markdown
Collaborator Author

@fkiraly sorry this has taken me a while to get back to you but based on one of your suggestions I've been trying to do something clever with type hinting that didn't work out - Ill talk to you another time about it.

However, what I have done is I've gone away, taken your suggestion, and made it so the enums are now nested. This means to access them you would do:

Mtype.Panel.np_flat
Mtype.Series.np_array

Which I agree is a much better solution.

In addition to all this because I changed some of the string names of the mtypes (and I knew you wouldn't be happy with some of them) I did a big refactor in the check and example files for panel and series so that if you're unhappy with a name or want to rename something then we only need to change it in one place rather than in every file. So if you're unhappy with any of the naming I've done then it quite literally a one line change to make it what you want it to be. These are the renaming (in places) of types I've done:

Series
Mtype.Series.pd_series
Mtype.Series.pd_dataframe
Mtype.Series.np_array

Panel
Mtype.Panel.pd_univariate_nested_dataframe
Mtype.Panel.np_3d_array
Mtype.Panel.np_flat
Mtype.Panel.pd_multi_index_dataframe
Mtype.Panel.pd_wide_dataframe
Mtype.Panel.pd_long_dataframe
Mtype.Panel.list_pd_dataframe

So what I've tried to do with renaming (and again if you want it reverted just ask literally one line changes now) is just make it more consistent as in your naming conventions you were prefixes some mtype with 'pd' or 'np' but not others, there were some random capitals and you were inconsistently seperating words, some using '-', some using'.', and others using '_'. All my change here hopes to do is make it more consistent and more clear to the user exactly what the datatype consists of.

In response to your question about the register and source of truth for the Mtypes the only source of truth IS the enums. So for examples lets say I want to add a new Series Mtype called list_series (that for the sake of argument lets just say it's a list of floats). To do this I would go to sktime/datatypes/_series/_registry.py and see an enum that looks like this:

class SeriesMtype(BaseRegistryEnum):
    """Enum class for series mtypes."""

    pd_series = (
        "pd.Series representation of a univariate series",
        "Series",
        pd.Series,
    )
    pd_dataframe = (
        "pd.DataFrame representation of a uni- or multivariate series",
        "Series",
        pd.DataFrame,
    )
    np_array = (
        "2D numpy.ndarray with rows=samples, cols=variables, index=integers",
        "Series",
        np.ndarray,
    )

To add a new Series mtype I would simply create a new variable (the variable name is the string name of the datatype), and then pass a tuple with the first argument being a description of the datatype, the second being the scitype it belongs to (this is fairly redundent but I need this piece of data to do some clever stuff under the hood) and finally the python 'type' it belongs to. Therefore to add out 'list_series', this would look like:

class SeriesMtype(BaseRegistryEnum):
    """Enum class for series mtypes."""

    pd_series = (
        "pd.Series representation of a univariate series",
        "Series",
        pd.Series,
    )
    pd_dataframe = (
        "pd.DataFrame representation of a uni- or multivariate series",
        "Series",
        pd.DataFrame,
    )
    np_array = (
        "2D numpy.ndarray with rows=samples, cols=variables, index=integers",
        "Series",
        np.ndarray,
    )
    list_series = (
        "List of floats where rows=samples, cols=variables, index=integers"
        "Series",
        list[Union[float, List]]
    )

The process to add a new Panel mtype is the exact same but would be in the enum in the file sktime/datatypes/_panel/_registry.py.

These enums at compile time are then read into a tuple in memory which is the same as the registry before. Just to re-emphasise while they are read at compile time to a registry, the enums are the only source of truth.

However, to add a new Scitype the solution I now have is a bit less 'elegant'. The reason is: in order to achieve the 'nested' aspect I've had to move the scitypes out from being and Enum and make a stand alone Mtype class with static variables and methods. This can be found in sktime/datatypes/_datatypes.py. This looks like the following:


class Datatypes:
    """Mtype class containing valid types.

    Attributes
    -----------
    Series: Enum
        Attribute containing the valid Series mtypes.
    Panel: Enum
        Attribute containing the valid Panel mtypes.
    """

    Series = SeriesMtype
    Panel = PanelMtype

    @staticmethod
    def list_scitypes() -> list:
        """Method to get all valid scitypes.

        Returns
        -------
        list
            List containing the valid scitypes
        """
        temp = []
        for attr in dir(Datatypes()):
            if not attr.startswith("__") and attr[0].isupper():
                temp.append(attr)
        return temp

    @staticmethod
    def list_mtypes() -> dict:
        """Get list of valid mtypes.

        Returns
        -------
        dict
            Dict where the scitype is the key and the value is a list of
            valid mtypes
        """
        scitypes = Datatypes.list_scitypes()
        temp = {}

        for type in scitypes:
            curr = [str(val) for val in getattr(Datatypes(), type)]
            temp[type] = curr
        return temp

The reason this had to be done is while nested enums are a thing, they only work at run time. This means that most (if not all) IDEs won't be able to do autocomplete and autosuggest (which was one of the main points of using enums) because it requires the code to be compiled to resolve the values. As such to get the 'nested' effect of Mtype.Panel. ... we have to use static variables. I've also added some utils methods to the class to get the mtypes and scitypes.

While this works great for code the main issue is it means in order to add a new scitype you need to add it in two places. The first is in the class above. So lets say im creating a 'Dictionary' Scitype, I would first need to define an enum that inherits from the BaseRegistryEnum located in sktime/base/_registry_enum.py like so:

class DictionaryMtype(BaseRegistryEnum):
    """Enum class for dictionary mtypes."""

    python_dict = (
        "Python dict storing my data",
        "Dictionary",
        dict,
    )


MTYPE_REGISTER_DICT = [tuple(mtype) for mtype in DictionaryMtype]

MTYPE_LIST_DICT = pd.DataFrame(MTYPE_REGISTER_DICT)[0].values

NOTE: The last two lines are just exporting the enum to the tuple registry.

Once defined I can then add the new scitype to the Mtype class like so:


class Datatypes:
    """Mtype class containing valid types.

    Attributes
    -----------
    Series: Enum
        Attribute containing the valid Series mtypes.
    Panel: Enum
        Attribute containing the valid Panel mtypes.
    DictionaryMtype: Enum
        Attribute containing the valid Dictionary mtypes.
    """

    Series = SeriesMtype
    Panel = PanelMtype
    Dictionary = DictionaryMtype
    
   ...etc.

However, as an additional step; as the Datatypes isn't an enum (and therefore can't be exported to the registry in the same way), an additional entry into the registry located in sktime/datatypes/_registry.py is needed. So this would result in:

SCITYPE_REGISTER = [
    ("Series", "uni- or multivariate time series"),
    ("Panel", "panel of uni- or multivariate time series"),
    ("Dictionary", "dict representation of time series"),
]

While this does mean there are 'two sources of truth' for scitypes, I figured we're much less likely to add a new scitype than mtype and therefore this being a bit clunky is probably acceptable.

Ill do a follow up comment on this tomorrow outlining the cool things you can actually do using the enums and the changes made (other than checking if something is a valid enum value).

@chrisholder
Copy link
Copy Markdown
Collaborator Author

Just an fyi with the changes to the forecasting module all it is is where you had the convert module things using "pd.Series", "np.ndarray" and "pd.Dataframe", I've just refactored them to instead use Datatypes.Series.pd_series, Datatypes.Series.np_array and Datatypes.Series.pd_dataframe. No logic has been touched and at runtime these are resolved back to the same strings.

@fkiraly
Copy link
Copy Markdown
Collaborator

fkiraly commented Oct 8, 2021

Which I agree is a much better solution.

NICE!!!

I've also looked at the rest, and it looks really cool now!

One comment about the "scitype source of truth".

Could this not simply be solved as follows:

In Datatypes, make the values tuples rather than singletons.
E.g., sth like (SeriesMType, "Series", "uni- or multivariate time series"), or in the opposite order, that way it would be precisely consistent with the mtype enums?

Then anything in current SCITYPE_REGISTER would be in there, and you can either remove SCITYPE_REGISTER, or construct it from the Datatypes enum?

@fkiraly
Copy link
Copy Markdown
Collaborator

fkiraly commented Oct 8, 2021

(obviously, there's something wrong with the CI tests, so I'll let you fix this before reviewing. Probably linting, since code-quality fails)

@ABostrom
Copy link
Copy Markdown
Contributor

Thanks!

Yes, this looks quite convincing.

Can I ask some questions:

* IMPORTANT: can you outline the steps that a user would do if they were to add a new mtype; and where they add a new scitype? I.e., where would things have to be added? What probably doesn't change is the logic added to convert dicts, example dicts, and check dicts; but how does it change on the registry level? Currently you'd add one or more entries to the register.

* can you be explicit which existing objects are _removed_, _replaced_, or _duplicated_? I'm uncomfortable with the fact that we may be duplicating the register and there is not a clear, unique source of truth.

* I'm uncomfortable with the fact that a single registry has been split in two, by the third field (`Panel` or `Series`), so the information which scitype an mtype belongs to is no longer easy to obtain using the new structure, e.g., as in the function `mtype_to_scitype`.

* The consequene of this split also seems to be that a new enum needs to be added per scitype. Is there a good alternative way to keep this in one place? One that leads to a call `Mtype.Series.pd_dataframe` perhaps?

* is there a good way we can use the scitypes themselves as type annotations? Following @ABostrom's suggestion of type aliases.

A line of thought: say we move the entire mtype registry to a single enum, and remove (or deprecate), not duplicate the registry within two enums, resulting in two enums Mtype, and Scitype and no other registries. How would this look like? Would an access like Mtype.Series.pd_dataframe be possible?

Sorry I've been slow to reply too.

I'm not sure whether we can use these as type aliases. Although I need to do some experimenting. Type Aliases has fallen down my priority list. Lecturing is killing me this semester. My teaching should begin to wind down soon (end of November) then i'm going to do a big push on aliases to try and get it into a workable state.

@fkiraly
Copy link
Copy Markdown
Collaborator

fkiraly commented Apr 25, 2022

@chrisholder, could we keep these open? I might be using this someday!

@fkiraly
Copy link
Copy Markdown
Collaborator

fkiraly commented Oct 16, 2022

I'm reopening as a draft to have this as a basis for potential m/scitype type hinting and design discussion. No expectation for @chrisholder to work on this.

@fkiraly fkiraly reopened this Oct 16, 2022
@fkiraly fkiraly marked this pull request as draft October 16, 2022 12:14
@fkiraly fkiraly added implementing framework Implementing or improving framework for learning tasks, e.g., base class functionality module:datatypes datatypes module: data containers, checkers & converters labels Oct 16, 2022
@fkiraly fkiraly changed the title Datatypes enum [ENH] scitype and mtype typing via enum Oct 16, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

API design API design & software architecture implementing framework Implementing or improving framework for learning tasks, e.g., base class functionality module:datatypes datatypes module: data containers, checkers & converters

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants