Conversation
TonyBagnall
left a comment
In the broader scheme of things, I do not think `check_X` is the place for the coerce-to-numpy operation. I think `check_X` should just do the checking; coercion should be done in `_fit`, not `fit`, I think.
@TonyBagnall, any good ideas about the error?
It seems to be failing on an Azure test; I can't even find which one is failing. All I can see when I click on the above is "TypeError: object of type 'float' has no len()" in test_stat. I have no working knowledge of Azure, sorry. I'll take a look if you like.
The really odd thing about the failing test is that it's coming from the orchestration/benchmarking module, while all time series classifier related tests pass, as would be expected. @ViktorKaz, do you have an idea what is going on here?
The name of the test is
@fkiraly I cloned your PR and ran the failing test locally. I get the following error: `ModuleNotFoundError: No module named 'sktime.distances.elastic_cython'`
The likely cause of this is your C compilers not being up to date. If you are using Windows, make sure you have VS build tools installed. Check the tests if you want to see which one fails.
@fkiraly I am using Linux and generally don't have issues with Cython or the C compilers with other PRs. I tried to clone your PR in order to be able to debug it and got the error in question. Just by interpreting the test output and the code on GitHub, I can try to have a closer look if you want.
That's really strange. I assume you can access the test logs and see that they don't show the same error you describe? Thanks for offering to have a look, that would be great - the TSC refactor would unlock a number of interesting items to work on.
The problem I envisage with handling conversion in the base class comes when we adapt to handle unequal-length series. Suppose the classifier always wants to pull the data from pandas into a numpy array when it is given equal-length problems, due to the gross inefficiency of not doing so. However, if the data is not equal length, it cannot do this and must dictate how to convert the data into a compatible data structure. This data structure could be completely different for different classifiers, and may differ if, for example, dimensions have different lengths in addition to the series. This would not be a problem if we separated out the checking (done in `fit`) from the conversion (done in `_fit`, possibly bespoke).
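For illustration, a minimal sketch of that check/convert separation, assuming a nested-pandas panel where each cell holds a `pd.Series`. The names `check_X`, `SketchClassifier` and the attribute `X_inner_` are hypothetical, not the actual sktime API:

```python
import numpy as np
import pandas as pd

def check_X(X):
    """Check only: raise on unsupported input, never convert."""
    if not isinstance(X, (pd.DataFrame, np.ndarray)):
        raise TypeError(f"unsupported input type: {type(X).__name__}")
    return X

class SketchClassifier:
    """Checking lives in fit, bespoke conversion lives in _fit."""

    def fit(self, X, y):
        X = check_X(X)   # validation only, no coercion
        self._fit(X, y)  # conversion is delegated to the classifier
        return self

    def _fit(self, X, y):
        # this particular classifier wants equal-length data as a 3D
        # numpy array, so it pulls the nested pandas cells out itself
        if isinstance(X, pd.DataFrame):
            X = np.stack([
                np.stack([X.iloc[i, j].to_numpy() for j in range(X.shape[1])])
                for i in range(X.shape[0])
            ])
        self.X_inner_ = X
```

A classifier with different needs (e.g. ragged input) would simply override `_fit` differently, without touching the checking logic.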
Well, currently there is no conversion in this PR. This aims to be a 1:1 replication of current behaviour with the main change being purely internal: namely, generic input logic that's already there moving to the base class. My opinion on the "next bit", i.e., conversion and different input handling:
`check_X(X, coerce_to_numpy=coerce_to_numpy)` can convert X to numpy if `coerce_to_numpy` is true. Now, whether to do that depends not just on the type of X (convert if pandas) but also on the content of X (if it contains unequal-length series, it may need to do something bespoke). Having it here means that a classifier with this condition would have to override `fit()`, making it all more complex. As I said, if the check and the coercion were separated, it may not be an issue. This is future-proofing against planned changes, not questioning whether it would work now.
what would be your proposed change? Moving the
I would like to move all of the data conversion out of `check_X` and `check_X_y`. These should just check whether the classifier has the capability to deal with the input data. I would delegate all conversions to the classifiers in `_fit`, which could call standard converters or do bespoke operations. This seems to me to be an argument against the #980 model for classification; if the variation is internal to the data structure (equal/unequal length), I don't think #980 works?
or define a bespoke time series data structure ..... https://github.com/uea-machine-learning/tsml/tree/master/src/main/java/tsml/data_containers |
No, I am against that - users expect
I am against the first part; that would result in a lot of boilerplate code, which is precisely what the

I agree with the second part: have standard converters and a simple interface to access these.

To align: why not have some "default conversion" in
Agreed, it is an unexpected location, and a textbook violation of the single responsibility principle. |
I think it's inevitable that classifiers will handle the same data structures (so-called panel data, aka 3D arrays) in different ways depending on the contents. Without a bespoke data structure, it is inevitable that you have to delegate to the classifiers. If the data is in a pandas DataFrame and is equal length, most classifiers will pull it out into a 3D numpy array. If it is unequal length, it is classifier dependent: it may leave it in the DataFrame, extract it to a list of arrays, pad it with a transformer into equal length, or indeed use some ragged-array solution such as awkward arrays. The same classifier could have an option to do any one of these things, depending on the algorithm selected. None of this is defined, but will be considered soon. I think until these use cases are sorted out, we should not perform any data structure conversions in the base class.
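As a hedged sketch of that contents-dependent branching (the function name and the representations chosen are illustrative, not the actual sktime converters): a nested-pandas panel could be pulled into a dense 3D numpy array when all series are equal length, and left ragged otherwise.

```python
import numpy as np
import pandas as pd

def convert_panel(X: pd.DataFrame):
    """Pick an inner representation based on the panel's contents.

    Equal-length panels become a (n_instances, n_dims, length) numpy
    array; unequal-length panels stay ragged, as a list of lists of
    1D arrays. Purely illustrative, not sktime's actual machinery.
    """
    instances = [
        [X.iloc[i, j].to_numpy() for j in range(X.shape[1])]
        for i in range(X.shape[0])
    ]
    lengths = {len(s) for dims in instances for s in dims}
    if len(lengths) == 1:
        return np.array(instances)  # dense 3D array
    return instances                # ragged: the classifier decides what to do next
```

Padding with a transformer or wrapping in awkward arrays would just be further branches of the same decision, which is why delegating it to the classifier keeps the base class simple.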
@TonyBagnall, I think the thought of moving to a silver-bullet format/standard to solve problems with multiple appropriate but divergent formats/standards is very seductive but notoriously misleading...
What do you mean by "use cases" here?
@TonyBagnall, more thoughts about the data structure, and I agree more with you than I did above. With the

Unlike in

I now even think this may be the "end state" architecture we should be aiming for. What do you think of this architecture, @ninamiolane?

With #980, we wouldn't even need to implement it for all estimators at the same time, but could introduce support for the data wrapper step by step. This is because all types would be supported at all times, as long as all conversions have been implemented at the start, so no change in the user-facing interface would occur between updates and releases.
So, as a sketch, in the vanilla version of the idea, the `fit` would be:

```python
def fit(self, X, y):
    X = some_check(X)
    y = some_check(y)
    X = NinaDataWrapper(X)
    y = NinaDataWrapper(y)
    self._fit(X, y)
```

Combined with #980, we would have:

```python
def fit(self, X, y):
    X = some_check(X)
    y = some_check(y)
    X = convert(X, self.get_tag("inner_type_X"))
    y = convert(y, self.get_tag("inner_type_y"))
    self._fit(X, y)
```

where for the classifiers that have been "upgraded", the
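A minimal sketch of what such a `convert` dispatcher could look like, keyed on the input type and the estimator's inner-type tag. The type name "numpy3D", the registry layout, and the helper names are assumptions for illustration, not the actual sktime machinery:

```python
import numpy as np
import pandas as pd

def _nested_to_numpy3d(X):
    """Nested pandas panel (Series in cells) -> (n_instances, n_dims, length) array."""
    return np.stack([
        np.stack([X.iloc[i, j].to_numpy() for j in range(X.shape[1])])
        for i in range(X.shape[0])
    ])

# registry mapping (input type, target type tag) -> converter function
_CONVERTERS = {
    (pd.DataFrame, "numpy3D"): _nested_to_numpy3d,
    (np.ndarray, "numpy3D"): lambda X: X,  # already in the inner type
}

def convert(X, to_type):
    """Dispatch on the input type and the requested inner type."""
    key = (type(X), to_type)
    if key not in _CONVERTERS:
        raise TypeError(f"no conversion from {type(X).__name__} to {to_type}")
    return _CONVERTERS[key](X)
```

New inner types then only require registering new converter entries, which is what would let support be introduced estimator by estimator.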
@ViktorKaz, no - this looks like a problem with your
@TonyBagnall, it looks like the problem is coming from

I haven't tracked it down to debug it, but the earlier commit should give you code to reproduce it.
I'm a bit confused, there are long engineering conversations with some bug talk embedded in it: "I haven't tracked it down to debug it, but the earlier commit should give you code to reproduce it." Could you raise an issue describing the bug please?
As said, I have not isolated the issue, but it's in the interaction of the orchestration framework and the

There is a

Happy to raise a bug report.
@ViktorKaz, I noticed that my earlier comment was incorrect. The 44 failing tests were probably due to one of the tags not being correctly registered in the new registry. The registry was created after this PR, so merging from

To reproduce the bug that was here earlier, see my description in #1114. Sorry for the confusion.
Thanks, I have not been involved with this code at all, so I will try to find someone to look into it.
@fkiraly no problem, I see that you sorted out the failing test.
Well, I sorted it by switching the estimator causing the problem to one of the others, but it didn't really "solve" the problem. |
@TonyBagnall, would you mind re-reviewing this? The test bug is a separate thing, I'll add details in #1114. |
np, I'll look this afternoon |
TonyBagnall
left a comment
All looks good to me. Probably best not to use KNeighborsTimeSeriesClassifier in the orchestration, as it is slow; TimeSeriesForest is faster. I like the template idea.
This reverts commit 59db2e1.
I've started refactoring the time series classification base template according to #993.
This is just a start, but I would like to:
@TonyBagnall, we will be discussing tags today in the core dev sprint hours, with @mloning, @aiwalter and @thayeylolu - would be great if you could join.