Replies: 8 comments 6 replies
-
Hello everyone, good idea! Okay, here are some initial thoughts for discussion.

In the context of output types, I'm considering anomalies and segments as the same, except that segments must partition the index. Some specific examples that come to mind:

Example 1. A binary anomaly indicator

Question 1: would we consider a class of standardised outputs? We could have these both defined as standardised Outputs() with helper functions to map between them (inheriting the index information from somewhere).

Example 2. Multiple anomaly types

Example 3. Rolling windows
Some algorithms have rolling windows as inputs, or we can input a feature (like an EWMA) to any anomaly algorithm, so I guess with a window size.

Example 4. Reducing the outputs of the features
Some algorithms look at a multivariate time series and output a combined assessment of the system (like intrinsic dimensionality). Input would be

Any more examples? Thoughts on the above? Cheers!
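A minimal pandas sketch of Example 1 and the "segments partition the index" idea. The helper `anomalies_to_segments` is hypothetical, not an existing sktime function; it just illustrates how one standardised output could be mapped to the other while inheriting the index:

```python
import pandas as pd

X = pd.Series([1, 2, 3, 100, 2], index=pd.RangeIndex(5))

# Example 1: a binary anomaly indicator, sharing X's index
anomalies = pd.Series([False, False, False, True, False], index=X.index)

def anomalies_to_segments(anomalies: pd.Series) -> pd.Series:
    """Map a boolean anomaly indicator to integer segment labels.

    Hypothetical helper: each run of identical values becomes one
    segment, so the segments partition the index by construction.
    """
    change = anomalies.ne(anomalies.shift(fill_value=anomalies.iloc[0]))
    return change.cumsum()

segments = anomalies_to_segments(anomalies)
# segments: [0, 0, 0, 1, 2] -- three segments partitioning the index
```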
-
I think Katie's examples are great! Here are a few other cases I wanted to add in addition to hers:

One could imagine a similar case if you had k change points that you were trying to find, i.e. you could try to return a probability of where they were. Here is a one-change-point example of what that might look like:

In the multiple-change-point case this might of course lead to accidentally having overlapping probabilities for separate change points, something you might want to restrict in the algorithm itself. I wonder if it might be convenient to have most algorithms return something that looks like a pandas
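The one-change-point probability idea could be sketched like this, with made-up numbers purely for illustration (not the output of any real algorithm):

```python
import numpy as np
import pandas as pd

index = pd.RangeIndex(5)

# One change point: a probability, over the index, of where it lies.
# E.g. the algorithm is fairly confident the change happens at position 3.
cp_proba = pd.Series([0.02, 0.03, 0.15, 0.70, 0.10], index=index)
assert np.isclose(cp_proba.sum(), 1.0)  # mass over the index sums to one

# In the k-change-point case one might return a DataFrame with one
# column per change point; overlapping mass between columns is the
# issue mentioned above, which the algorithm might need to restrict.
cp_proba_k = pd.DataFrame(
    {"cp_0": cp_proba, "cp_1": [0.0, 0.0, 0.0, 0.1, 0.9]}
)
```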
-
Open Questions:
-
|
Current data signature of the GGS implementation is:
-
|
Similarly, the current implementation of HMM is:

Should be easy to extend to multivariate. I like the numpy array input because I have a lot of numpy math optimization under the hood (inspired by some work done by @conniesaur on a different project).
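This is not the actual implementation referenced above, just a sketch of the kind of vectorized numpy math a numpy-array input makes easy: computing Gaussian emission log-likelihoods for all states at once via broadcasting, with no Python-level loops. Extending to multivariate would mainly mean giving `X` a second (feature) axis:

```python
import numpy as np

X = np.array([1.0, 2.0, 3.0, 100.0, 102.0])   # (n_timepoints,)
means = np.array([2.0, 101.0])                 # (n_states,)
stds = np.array([1.0, 1.0])                    # (n_states,)

# Broadcasting gives an (n_timepoints, n_states) matrix in one expression.
log_lik = (
    -0.5 * ((X[:, None] - means[None, :]) / stds) ** 2
    - np.log(stds * np.sqrt(2 * np.pi))
)
states = log_lik.argmax(axis=1)  # most likely state per timepoint
# states: [0, 0, 0, 1, 1]
```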
-
In advance of the base class discussion beginning tomorrow, some thoughts/suggestions I had:

Looking forward to the discussion tomorrow!
-
That sounds really reasonable ^^

In the first case, it's clear to me how boolean/float values describe outliers/anomalies, e.g. X = [1, 2, 3, 100, 2] -> output = [False, False, False, True, False], or as floats (scores). But how would we describe segments? E.g. X = [1, 2, 3, 100, 102] -> output = [0, 0, 0, 1, 1] (incrementing labels, up to K?). (Lazy notation: I don't mean they are literally lists, imagine them as pd.DataFrame or np.array.)
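One way to make the dense-label segment output concrete, plus a hypothetical helper (assumed names, not an existing sktime API) converting it into an equivalent interval representation:

```python
import pandas as pd

X = pd.Series([1, 2, 3, 100, 102])
# Dense segment labels that partition the index, as in the example above
segments = pd.Series([0, 0, 0, 1, 1], index=X.index)

def segments_to_intervals(segments: pd.Series) -> pd.IntervalIndex:
    """Convert dense segment labels into left-closed intervals.

    Hypothetical helper: one interval per run of identical labels.
    """
    # True at the start of each new segment (NaN from shift() compares unequal)
    mask = segments.ne(segments.shift())
    starts = segments.index[mask]
    ends = list(starts[1:]) + [segments.index[-1] + 1]
    return pd.IntervalIndex.from_arrays(starts, ends, closed="left")

intervals = segments_to_intervals(segments)
# intervals: [[0, 3), [3, 5)] -- the two segments as index ranges
```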
-
(Copying my comment (and Mirae's reply) here because I used the incorrect account for my contribution.) I think the first thing to do is to pin down (in writing) what we want the input and output formats to be, and what learning tasks we intend to cover, before deciding on the nitty-gritty of implementation details and where input checks will live. My preferences for input/output types:
-
Hi All,
As we are ramping up our efforts around time series annotation, it might be worth opening up the discussion about proposed interfaces, data structures, and formats for the learning tasks in scope of annotation. Hopefully we will be able to converge on a proposal for a standard for sktime to adopt.
Please share freely any thoughts, examples, suggestions that might inform our discussion.
FYI @miraep8 @KatieBuc @fkiraly @Lovkush-A @NoaBenAmi @lielleravid