Conversation
|
Really great idea in general! As discussed earlier, my main thoughts are around how we avoid multiplication of "sources of truth" a for the conversion strings/enums and increasing the extension burden, i.e., through necessity to update the registry in multiple places in parallel when we introduce a new mtype. Also, it seems like this PR breaks some things? |
|
Fixing them now it is to do with me not checking for None - Added some extra tests to make sure this can't happen in the future as well since the modules tests didn't flag this to me |
I made it so the registry reads from the enum so there is only one place you need to add a new mtype (or even scitype if there is ever a need for that) |
fkiraly
left a comment
There was a problem hiding this comment.
Interesting idea!
Also, I appreciate that this is downwards compatible, great!
Not directly a change request, since I'm not sure how to address this, but two problems that I think need to be addressed or at least discussed before we merge:
- there is a difference in capitalization between the enum keys and the enum values. This can easily become confusing.
- Naive use of the enums will result in something like
convert_to(x, SeriesMtype.PD_DATAFRAME), this is quite clunky and seems confusing. Any way to improve this? I was thinking aboutconvert_to(x, Series.df), but shouldn'tSeriesbe the type appearing infit/predictetc which would give a clash.
| SCITYPE_REGISTER = [tuple(scitype) for scitype in Scitype] | ||
|
|
||
|
|
||
| def mtype_to_scitype(mtype: Union[PanelMtype, SeriesMtype, str]) -> str: |
There was a problem hiding this comment.
should this not just be Mtype?
There was a problem hiding this comment.
Unfortunatly I can't import the types into this due to circular importing.
There was a problem hiding this comment.
Argh - is there any way to resolve this, e.g., by splitting files?
That is, is the circularity coming through object imports, or just via module imports?
There was a problem hiding this comment.
Module imports since the types are based off the enums defined in that file. I can move it but it was where you orginally defined mtype_to_scitpye where would you want me to put it instead?
There was a problem hiding this comment.
really good question - perhaps the enums should move somewhere, e.g., types, which contains the enums and any logic that allows shorthand use as a typing type?
So I've made some changes based on your feedback to the enums names. Not sure if they are perfect so now you would do convert_to(x, SeriesMtype.pd_dataframe); I can rename the SeriesMtype to just 'Series' but am concerned this could further confuse the scitype stuff. Another thought I had could be access it through a chain of enums i.e. convert_to(x, SciType.Series.pd_dataframe) EDIT: |
|
@chrisholder, I think this is excellent still, in terms of ideas. However, I think we really need to be clear as to the intended target interface, since the devil is in the details when it comes to useability. Would you kindly be able to write a short summary or STEP that explains the "end state" ways to address the types and conversions? Including "old" and "new" way, to compare? And, intended usage? I really think it's not much, perhaps 10 or 20 lines or so - but it's important! We need to ensure:
Sorry for asking for this, but I feel we need to spend more time on design here before we go to implementation. |
|
Also, @ABostrom - could you look at this from the perspective of consistency with your typing extension, please? |
|
? |
|
@fkiraly sorry this has taken me a while to get back to you but based on one of your suggestions I've been trying to do something clever with type hinting that didn't work out - Ill talk to you another time about it. However, what I have done is I've gone away, taken your suggestion, and made it so the enums are now nested. This means to access them you would do: Which I agree is a much better solution. In addition to all this because I changed some of the string names of the mtypes (and I knew you wouldn't be happy with some of them) I did a big refactor in the check and example files for panel and series so that if you're unhappy with a name or want to rename something then we only need to change it in one place rather than in every file. So if you're unhappy with any of the naming I've done then it quite literally a one line change to make it what you want it to be. These are the renaming (in places) of types I've done: Series Panel So what I've tried to do with renaming (and again if you want it reverted just ask literally one line changes now) is just make it more consistent as in your naming conventions you were prefixes some mtype with 'pd' or 'np' but not others, there were some random capitals and you were inconsistently seperating words, some using '-', some using'.', and others using '_'. All my change here hopes to do is make it more consistent and more clear to the user exactly what the datatype consists of. In response to your question about the register and source of truth for the Mtypes the only source of truth IS the enums. So for examples lets say I want to add a new Series Mtype called list_series (that for the sake of argument lets just say it's a list of floats). To do this I would go to sktime/datatypes/_series/_registry.py and see an enum that looks like this: To add a new Series mtype I would simply create a new variable (the variable name is the string name of the datatype), and then pass a tuple with the first argument being a description of the datatype, the second being the scitype it belongs to (this is fairly redundent but I need this piece of data to do some clever stuff under the hood) and finally the python 'type' it belongs to. Therefore to add out 'list_series', this would look like: The process to add a new Panel mtype is the exact same but would be in the enum in the file sktime/datatypes/_panel/_registry.py. These enums at compile time are then read into a tuple in memory which is the same as the registry before. Just to re-emphasise while they are read at compile time to a registry, the enums are the only source of truth. However, to add a new Scitype the solution I now have is a bit less 'elegant'. The reason is: in order to achieve the 'nested' aspect I've had to move the scitypes out from being and Enum and make a stand alone Mtype class with static variables and methods. This can be found in sktime/datatypes/_datatypes.py. This looks like the following: The reason this had to be done is while nested enums are a thing, they only work at run time. This means that most (if not all) IDEs won't be able to do autocomplete and autosuggest (which was one of the main points of using enums) because it requires the code to be compiled to resolve the values. As such to get the 'nested' effect of Mtype.Panel. ... we have to use static variables. I've also added some utils methods to the class to get the mtypes and scitypes. While this works great for code the main issue is it means in order to add a new scitype you need to add it in two places. The first is in the class above. So lets say im creating a 'Dictionary' Scitype, I would first need to define an enum that inherits from the BaseRegistryEnum located in sktime/base/_registry_enum.py like so: NOTE: The last two lines are just exporting the enum to the tuple registry. Once defined I can then add the new scitype to the Mtype class like so: However, as an additional step; as the Datatypes isn't an enum (and therefore can't be exported to the registry in the same way), an additional entry into the registry located in sktime/datatypes/_registry.py is needed. So this would result in: While this does mean there are 'two sources of truth' for scitypes, I figured we're much less likely to add a new scitype than mtype and therefore this being a bit clunky is probably acceptable. Ill do a follow up comment on this tomorrow outlining the cool things you can actually do using the enums and the changes made (other than checking if something is a valid enum value). |
fe52b97 to
a1f4a7b
Compare
|
Just an fyi with the changes to the forecasting module all it is is where you had the convert module things using "pd.Series", "np.ndarray" and "pd.Dataframe", I've just refactored them to instead use Datatypes.Series.pd_series, Datatypes.Series.np_array and Datatypes.Series.pd_dataframe. No logic has been touched and at runtime these are resolved back to the same strings. |
NICE!!! I've also looked at the rest, and it looks really cool now! One comment about the "scitype source of truth". Could this not simply be solved as follows: In Then anything in current |
|
(obviously, there's something wrong with the CI tests, so I'll let you fix this before reviewing. Probably linting, since |
Sorry I've been slow to reply too. I'm not sure whether we can use these as type aliases. Although I need to do some experimenting. Type Aliases has fallen down my priority list. Lecturing is killing me this semester. My teaching should begin to wind down soon (end of November) then i'm going to do a big push on aliases to try and get it into a workable state. |
|
@chrisholder, could we keep these open? I might be using this someday! |
|
I'm reopening as a draft to have this as a basis for potential m/scitype type hinting and design discussion. No expectation for @chrisholder to work on this. |
Reference Issues/PRs
What does this implement/fix? Explain your changes.
This is a simple pr with the goal of making the datatypes modules easier to use. Currently the datatypes module requires str inputs for different datatypes. For example if I want to convert a given object to a numpy I would do something like:
convert_to(ts_obj, 'numpy3D')
This requires an extact string match which normally will require the developer to look up the possible strings in a notebook or at the documentation. This extra action of always having to look up the conversion can be solved very simply by allowing the functions in datatypes to accept enums like so:
convert_to(ts_obj, PanelMtype.NUMPY3D)
This has the benefit of the developer not needing to know the exact string and in addition allows for an ide to autocomplete and suggest other potential datatypes.
In addition with these enums I hope to implement a new testing utils that allows us to test over every potential type of ts input sktime supports and enums will make this much easier to achieve.
Does your contribution introduce a new dependency? If yes, which one?
No
What should a reviewer concentrate their feedback on?
While this isn't in stone and is currently only planned for use in this module, I've added a base registry enum class so other modules can potentially use enums as their registry I don't know if this is a good idea or where I did put this base class is an appropirate place for this to live.
Any other comments?
PR checklist
For all contributions
For new estimators