<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Open Forecasting</title>
	<atom:link href="https://openforecast.org/feed/" rel="self" type="application/rss+xml" />
	<link>https://openforecast.org/</link>
	<description>How to look into the future</description>
	<lastBuildDate>Thu, 09 Apr 2026 09:40:57 +0000</lastBuildDate>
	<language>en-GB</language>
	<sy:updatePeriod>
	hourly	</sy:updatePeriod>
	<sy:updateFrequency>
	1	</sy:updateFrequency>
	<generator>https://wordpress.org/?v=6.9.4</generator>

<image>
	<url>https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2015/08/cropped-usd-05-32x32.png&amp;nocache=1</url>
	<title>Open Forecasting</title>
	<link>https://openforecast.org/</link>
	<width>32</width>
	<height>32</height>
</image> 
	<item>
		<title>smooth forecasting with the smooth package in Python</title>
		<link>https://openforecast.org/2026/04/09/smooth-forecasting-with-the-smooth-package-in-python/</link>
					<comments>https://openforecast.org/2026/04/09/smooth-forecasting-with-the-smooth-package-in-python/#respond</comments>
		
		<dc:creator><![CDATA[Ivan Svetunkov]]></dc:creator>
		<pubDate>Thu, 09 Apr 2026 09:15:02 +0000</pubDate>
				<category><![CDATA[ETS]]></category>
		<category><![CDATA[Python]]></category>
		<category><![CDATA[Univariate models]]></category>
		<category><![CDATA[ADAM]]></category>
		<category><![CDATA[smooth]]></category>
		<guid isPermaLink="false">https://openforecast.org/?p=3961</guid>

					<description><![CDATA[<p>Here is another piece of news I have been hoping to deliver for quite some time now (since January 2026 actually). We have finally created the first release of the smooth package for Python and it is available on PyPI! Anyone interested? Read more! On this page: Why does &#8220;smooth&#8221; exist? A bit of history [&#8230;]</p>
<p>Message <a href="https://openforecast.org/2026/04/09/smooth-forecasting-with-the-smooth-package-in-python/">smooth forecasting with the smooth package in Python</a> first appeared on <a href="https://openforecast.org">Open Forecasting</a>.</p>
]]></description>
										<content:encoded><![CDATA[<p>Here is another piece of news I have been hoping to deliver for quite some time now (since January 2026 actually). We have finally created the first release of the smooth package for Python and it is available on PyPI! Anyone interested? Read more!</p>
<p>On this page:</p>
<ul>
<li><a href="#whySmooth">Why does &#8220;smooth&#8221; exist?</a></li>
<li><a href="#history">A bit of history</a></li>
<li><a href="#install">How to install</a></li>
<li><a href="#whatWorks">What works</a></li>
<li><a href="#example">An example</a></li>
<li><a href="#evaluation">Evaluation</a>
<ul>
<li><a href="#evaluationSetup">Setup</a></li>
<li><a href="https://github.com/config-i1/smooth/blob/master/python/tests/notebooks/benchmark_exponential_smoothing.ipynb">Jupyter notebook</a></li>
<li><a href="#sktime">sktime non-collaborative stance</a></li>
<li><a href="#evaluationResults">Results</a></li>
</ul>
</li>
<li><a href="#whatsNext">What&#8217;s next?</a></li>
<li><a href="#summary">Summary</a></li>
</ul>
<h2 id="whySmooth">Why does &#8220;smooth&#8221; exist?</h2>
<p>There are lots of implementations of ETS and ARIMA (dynamic models) out there, both in Python and in R (and also in Julia now, <a href="https://github.com/taf-society/Durbyn.jl">see Durbyn</a>). So, why bother creating yet another one?</p>
<p>The main philosophy of the smooth package in R is flexibility. It is not just an implementation of ETS or ARIMA &#8211; it gives you more control over what you can do with these models in different situations. The main function in the package is called <a href="https://openforecast.org/adam/">&#8220;ADAM&#8221; &#8211; the Augmented Dynamic Adaptive Model</a>. It is a Single Source of Error state space model that unites ETS, ARIMA, and regression, and supports the following features (taken from the Introduction in <a href="https://openforecast.org/adam/">the book</a>):</p>
<ol>
<li>ETS;</li>
<li>ARIMA;</li>
<li>Regression;</li>
<li>TVP regression;</li>
<li>A combination of (1), (2), and either (3) or (4), i.e. ETSX/ARIMAX;</li>
<li>Automatic selection/combination of states for ETS;</li>
<li>Automatic orders selection for ARIMA;</li>
<li>Variables selection for regression;</li>
<li>Normal and non-normal distributions;</li>
<li>Automatic selection of the most suitable distribution;</li>
<li>Multiple seasonality;</li>
<li>Occurrence part of the model to handle zeroes in data (intermittent demand);</li>
<li>Modelling scale of distribution (GARCH and beyond);</li>
<li>A variety of ways to produce forecasts for different situations;</li>
<li>Advanced loss functions for model estimation;</li>
<li>&#8230;</li>
</ol>
<p>All of these features come with the ability to fine-tune the optimiser (i.e. how the parameters are estimated) and to manually adjust any model parameters you want. This allows, for example, fitting an iETSX(M,N,M) model with multiple frequencies and the Gamma distribution to better model <a href="/2023/05/10/story-of-probabilistic-forecasting-of-hourly-emergency-department-arrivals/">hourly emergency department arrivals (intermittent demand) and thus produce more accurate forecasts</a>. No other package in either R or Python gives users this degree of flexibility with dynamic models.</p>
<p>Over the years, I have also managed to iron out the R functions to the point where they handle almost any real-life situation without breaking. At the same time, they work fast and produce accurate forecasts, <a href="/2026/02/09/smooth-v4-4-0/#evaluationResults">sometimes outperforming the other existing R implementations</a>.</p>
<p>So, here we are. We want to bring this flexibility, robustness, and speed to <strong>Python</strong>.</p>
<h2 id="history">A bit of history</h2>
<p>The smooth package for R was first released on CRAN in 2016, when I finished my PhD. It went from v1.4.3 to v4.4.0 over the last 10 years. It saw a rise in popularity, but then an inevitable decline due to the decreasing number of R users in business. So, back in 2021, Rebecca Killick and I applied for an EPSRC grant to develop Python packages for forecasting and time series analysis. The idea was to translate what we have in R (including greybox, smooth, forecast, and changepoint detection packages) to Python with the help of professional programmers. Unfortunately, we did not receive the funding (it went to sktime for a good reason &#8211; they already had an existing codebase in Python).</p>
<p>At the beginning of 2023, Leonidas Tsaprounis got in touch with me, offering to help with the development and translation of the smooth package to Python. The idea was to use the existing C++ core and simply create a wrapper in Python. &#8220;Simply&#8221; is actually an oversimplification here, because I am not a programmer, so my functions are messy and hard to read. Nonetheless, we started cooking. Leo helped in setting up pybind11 and carma, and creating the necessary files for the compilation of the C++ code. Just to test whether that worked, we managed to create a basic function for the simple moving average based on the <code>sma()</code> from the R version. Our progress was a bit slow, because we were both busy with other projects. All of that changed when in July 2023 Filotas Theodosiou joined our small team and started working on the translation. We decided to implement the hardest thing first &#8211; <a href="/adam/">ADAM</a>.</p>
<p>What Fil did was use LLMs to translate the code from R to Python. Fil will write about this work at some point; the only thing I can say here is that it was not an easy process, because thousands of lines of R code needed to be translated to Python and then refactored. I only helped with suggestions and explanations of what is happening inside, and Leo provided guidance regarding the coding philosophy. It was Fil and his AI tools that did the main heavy lifting. By Summer 2025, we had a basic working ADAM() function, but it worked slightly differently from the one in R due to differences in initialisation and optimisation. He presented his work at the <a href="/2025/06/30/iif-open-source-forecasting-software-workshop-and-smooth/">IIF Open Source Forecasting Software Workshop</a>, explaining his experience with LLMs and code translation. Because everyone was pretty busy, it took a bit more time to reach the first proper release of smooth in Python.</p>
<p>In December 2025, I bought a Claude AI subscription and started vibe coding my way through the existing Python code. Between the three of us, we managed to progress the project, and finally in January 2026 we reached v1.0.0 of smooth in Python. Now ADAM works in exactly the same way in both R and Python: if you give it the same time series, it will select the same model and produce the same parameter estimates across languages. It took us time and effort to reach this, but we feel it is a critically important step &#8211; ensuring that users working in different languages have the same experience.</p>
<h2 id="install">How to install</h2>
<p>We are extremely grateful to Gustavo Niemeyer, who gave us the name <code>smooth</code> on PyPI. It had belonged to him since 2020, but the project was abandoned, and he agreed to transfer it to us. So, now you can install smooth simply by running:</p>
<pre class="decode">pip install smooth</pre>
<p>There is also a development version of the package, which you can install by following the instructions in our <a href="https://github.com/config-i1/smooth/wiki/Installation#python">Installation wiki</a> on GitHub.</p>
<h2 id="whatWorks">What works</h2>
<p>The package does not yet have the full functionality of its R counterpart, but quite a few things already work:</p>
<ol>
<li><code>ADAM()</code> &#8211; the main function that supports ETS with:
<ul>
<li>components selection via <code>model="ZXZ"</code> or other pools (see <a href="https://github.com/config-i1/smooth/wiki/ADAM">wiki on GitHub for details</a>);</li>
<li>forecast combination using AIC weights via <code>model="CCC"</code> or other pools (again, explained in the wiki);</li>
<li>multiple seasonalities via <code>lags=[24,168]</code>;</li>
<li>ability to provide some smoothing parameters or initial values (e.g. only alpha), letting the function estimate the rest;</li>
<li>different distributions;</li>
<li>advanced <a href="https://github.com/config-i1/smooth/wiki/Loss-Functions">loss functions</a>;</li>
<li>several options for <a href="https://github.com/config-i1/smooth/wiki/Initialisation">model initialisation</a>;</li>
<li>fine-tuning of the optimiser via <code>nlopt_kargs</code> (<a href="https://github.com/config-i1/smooth/wiki/Model-Estimation">read more in the wiki</a>);</li>
</ul>
</li>
<li><code>ES()</code> &#8211; a wrapper of <code>ADAM()</code> that assumes the normal distribution. It supports the same functionality, but is a simplification of ADAM.</li>
</ol>
<p>We also have the standard methods for fitting and forecasting, and many attributes that allow extracting information from the model, all of which are <a href="https://github.com/config-i1/smooth/wiki/Fitted-Values-and-Forecasts">explained in the wiki of the project</a>.</p>
<h2 id="example">An example</h2>
<p>Here is an example of how to work with ADAM in Python. For this example to work, you will need to install the <code>fcompdata</code> package from pip:<br />
<code>pip install fcompdata</code></p>
<p>The example:</p>
<pre class="decode">from smooth import ADAM
from fcompdata import M3

model = ADAM(model="ZXZ", lags=12)
model.fit(M3[2568].x)

print(model)</pre>
<p>This is what you should see as the result of print:</p>
<pre>Time elapsed: 0.21 seconds
Model estimated using ADAM() function: ETS(MAM)
With backcasting initialisation
Distribution assumed in the model: Gamma
Loss function type: likelihood; Loss function value: 868.7085
Persistence vector g:
 alpha   beta  gamma
0.0205 0.0203 0.1568
Damping parameter: 1.0000
Sample size: 116
Number of estimated parameters: 4
Number of degrees of freedom: 112
Information criteria:
      AIC      AICc       BIC      BICc
1745.4170 1745.7774 1756.4314 1757.2879</pre>
<p>which is exactly the same output as in R (see, for example, some explanations <a href="https://openforecast.org/adam/ADAMETSPureAdditiveExamples.html">here</a>). We can then produce a forecast from this model:</p>
<pre class="decode">model.predict(h=18, interval="prediction", level=[0.9,0.95])</pre>
<p>The predict method currently supports analytical (aka &#8220;parametric&#8221;/&#8220;approximate&#8221;) and simulated prediction intervals. Setting <code>interval="prediction"</code> tells the function to choose between the two depending on the type of model (multiplicative ETS models do not have analytical formulae for the multistep conditional variance and, as a result, do not have proper analytical prediction intervals). The <code>level</code> parameter accepts either a vector (which will produce several quantiles) or a scalar. What I get after running this is:</p>
<pre>             mean    lower_0.05   lower_0.025    upper_0.95   upper_0.975
116  11234.592643  10061.150811   9853.119370  12443.904763  12686.143870
117   8050.810544   7228.709864   7064.356429   8896.078713   9080.867866
118   7658.608163   6886.165498   6746.881596   8475.545469   8633.149796
119  10552.382933   9452.306261   9236.493206  11679.816814  11892.381046
120  10889.816551   9768.665327   9559.580233  12066.628218  12313.872205
121   7409.545388   6643.378080   6495.237349   8232.558913   8363.550598
122   7591.183726   6800.878319   6650.514904   8425.167556   8576.257297
123  14648.452226  13089.346824  12793.226771  16263.206997  16582.786939
124   6953.045603   6206.829418   6079.126523   7730.260301   7892.917098
125  11938.941882  10650.759513  10427.172989  13307.563498  13579.662925
126   8299.626845   7379.550080   7200.075336   9280.095498   9468.327654
127   8508.558884   7530.987557   7367.541698   9534.117611   9734.218257
128  11552.654541  10162.615284   9907.770755  13039.710057  13313.534412
129   8286.727505   7273.279560   7100.450038   9401.374461   9627.922116
130   7889.999721   6879.771040   6710.494741   8948.052817   9186.279622
131  10860.671447   9438.536335   9174.365147  12353.444279  12664.983138
132  11218.330395   9690.780430   9453.114931  12849.545829  13201.679610
133   7620.922782   6564.497975   6391.080974   8748.561119   8977.120673</pre>
<p>The separate wiki on <a href="https://github.com/config-i1/smooth/wiki/Fitted-Values-and-Forecasts#level-parameter">Fitted Values and Forecasts</a> explains all the parameters accepted by the predict method and what is returned by it.</p>
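<p>To illustrate the mechanism behind the simulated intervals, here is a minimal pure-Python sketch for the simplest ETS(A,N,N) model: simulate many sample paths from the state space recursion and take empirical quantiles at each horizon. This is only an illustration of the idea; the function name and its arguments are mine, not part of the smooth package:</p>

```python
import random

def simulate_ses_intervals(level, alpha, sigma, h, levels=(0.9, 0.95),
                           n_sim=10000, seed=42):
    """Illustrative simulated prediction intervals for ETS(A,N,N):
    simulate sample paths and take empirical quantiles per horizon."""
    rng = random.Random(seed)
    paths = []
    for _ in range(n_sim):
        l, path = level, []
        for _ in range(h):
            eps = rng.gauss(0, sigma)
            path.append(l + eps)   # y_{t+j} = l_{t+j-1} + eps_j
            l = l + alpha * eps    # state update carries uncertainty forward
        paths.append(path)
    intervals = []
    for j in range(h):
        draws = sorted(p[j] for p in paths)
        bounds = {}
        for lv in levels:
            lo = draws[int((1 - lv) / 2 * n_sim)]
            hi = draws[int((1 + lv) / 2 * n_sim) - 1]
            bounds[lv] = (lo, hi)
        intervals.append(bounds)
    return intervals
```

<p>Because the smoothing parameter carries each error into the next state, the simulated intervals widen with the horizon, which is exactly the behaviour the analytical formulae fail to capture for multiplicative models.</p>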
<h2 id="evaluation">Evaluation</h2>
<p>To see how the developed function works, I decided to conduct exactly the same evaluation that <a href="/2026/02/09/smooth-v4-4-0/">I did for the recent R release of the smooth package</a>, running the functions on the M1, M3, and Tourism competition data (5,315 time series) using the <a href="/2026/01/26/forecasting-competitions-datasets-in-python/">fcompdata</a> package.</p>
<h3 id="evaluationSetup">Setup</h3>
<p>I have selected the same set of models for Python as I did in R. Here are several options for the ADAM <code>model</code> parameter to see how the specific pools impact accuracy (this is discussed in detail in <a href="https://openforecast.org/adam/ETSSelection.html">Section 15.1 of ADAM</a>):</p>
<ol>
<li>XXX &#8211; select between pure additive ETS models only;</li>
<li>ZZZ &#8211; select from the pool of all 30 models, but use branch-and-bound to remove the less suitable models;</li>
<li>ZXZ &#8211; same as (2), but without the multiplicative trend models. This is used in the <code>smooth</code> functions <strong>by default</strong>;</li>
<li>FFF &#8211; select from the pool of all 30 models (exhaustive search);</li>
<li>SXS &#8211; the pool of models used by default in <code>ets()</code> from the <code>forecast</code> package in R.</li>
</ol>
<p>I also tested three types of ETS initialisation (read more about them <a href="https://github.com/config-i1/smooth/wiki/Initialisation">here</a>):</p>
<ol>
<li>Back &#8211; <code>initial="backcasting"</code> &#8211; this is the default initialisation method;</li>
<li>Opt &#8211; <code>initial="optimal"</code>;</li>
<li>Two &#8211; <code>initial="two-stage"</code>.</li>
</ol>
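<p>To give an idea of what backcasting means, here is a pure-Python sketch for the simplest case, ETS(A,N,N) (simple exponential smoothing): the recursion is run backwards through the series so that the initial level is implied by the data itself, instead of being estimated as an extra parameter. This is an illustration of the principle, not the actual smooth implementation:</p>

```python
def ses_backcast_initial(y, alpha, n_iter=3):
    """Obtain the initial level for SES via backcasting: run the
    smoothing recursion backwards through the series, then forwards,
    a few times, so the starting level is implied by the data."""
    level = y[0]  # crude starting point
    for _ in range(n_iter):
        # backward pass: smooth from the end of the series to its start
        for t in range(len(y) - 1, -1, -1):
            level = level + alpha * (y[t] - level)
        initial = level  # the backcast initial level
        # forward pass: refit the model from the backcast initial value
        level = initial
        for t in range(len(y)):
            level = level + alpha * (y[t] - level)
    return initial
```

<p>Because no initial states need to be optimised, backcasting keeps the number of estimated parameters small, which is one reason it is both the fastest and, on these datasets, the most accurate option.</p>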
<p>I have also found the following implementations of ETS in Python and included them in my evaluation:</p>
<ol>
<li><a href="https://www.sktime.net/en/stable/api_reference/auto_generated/sktime.forecasting.ets.AutoETS.html">sktime AutoETS</a></li>
<li><a href="https://skforecast.org/0.20.0/api/stats#skforecast.stats._ets.Ets">skforecast AutoETS</a></li>
<li><a href="https://nixtlaverse.nixtla.io/statsforecast/docs/models/autoets.html">statsforecast AutoETS</a></li>
</ol>
<p>There is also a <a href="https://unit8co.github.io/darts/generated_api/darts.models.forecasting.sf_auto_ets.html">darts implementation of AutoETS</a>, which is actually a wrapper of the statsforecast one. So I ran it just to check how it works, and found that it failed in 1,518 cases. <a href="https://github.com/unit8co/darts/issues/3001">I filed the issue</a>, and it turned out that their implementation does not deal with short time series (10 observations or fewer), which is their design decision. They are now considering what to do about that, if anything.</p>
<p>I used RMSSE (<a href="https://www.doi.org/10.1016/j.ijforecast.2021.11.013">M5 competition</a>, motivated by <a href="https://www.doi.org/10.1016/j.ijforecast.2022.08.003">Athanasopoulos &#038; Kourentzes (2023)</a>) and the SAME error measure, together with the computational time for each time series:<br />
\begin{equation*}<br />
\mathrm{RMSSE} = \frac{1}{\sqrt{\frac{1}{T-1} \sum_{t=1}^{T-1} \Delta_t^2}} \mathrm{RMSE},<br />
\end{equation*}<br />
where \(\mathrm{RMSE} = \sqrt{\frac{1}{h} \sum_{j=1}^h e^2_{t+j}}\) is the Root Mean Squared Error of the point forecasts, and \(\Delta_t\) is the first difference of the in-sample actual values.</p>
<p>\begin{equation*}<br />
\mathrm{SAME} = \frac{1}{\frac{1}{T-1} \sum_{t=1}^{T-1} |\Delta_t|} \mathrm{AME},<br />
\end{equation*}<br />
where \(\mathrm{AME}= \left| \frac{1}{h} \sum_{j=1}^h e_{t+j} \right|\).</p>
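<p>For clarity, both measures can be computed in a few lines of plain Python (a sketch using only the standard library; the function names are mine, not from any of the evaluated packages):</p>

```python
import math

def rmsse(actuals, forecasts, insample):
    """RMSSE: RMSE of the forecast errors scaled by the root mean
    square of the first differences of the in-sample actuals."""
    h = len(forecasts)
    rmse = math.sqrt(sum((a - f) ** 2 for a, f in zip(actuals, forecasts)) / h)
    diffs = [insample[t + 1] - insample[t] for t in range(len(insample) - 1)]
    scale = math.sqrt(sum(d ** 2 for d in diffs) / len(diffs))
    return rmse / scale

def same(actuals, forecasts, insample):
    """SAME: absolute mean error scaled by the mean absolute
    first difference of the in-sample actuals."""
    h = len(forecasts)
    ame = abs(sum(a - f for a, f in zip(actuals, forecasts)) / h)
    diffs = [insample[t + 1] - insample[t] for t in range(len(insample) - 1)]
    scale = sum(abs(d) for d in diffs) / len(diffs)
    return ame / scale
```

<p>RMSSE penalises the variance of the errors, while SAME captures the bias of the forecasts; scaling both by the in-sample first differences makes them comparable across series of different levels.</p>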
<p>All of this was implemented in a Jupyter notebook, which is <a href="https://github.com/config-i1/smooth/blob/master/python/tests/notebooks/benchmark_exponential_smoothing.ipynb">available here</a> in case you want to reproduce the results.</p>
<h3 id="sktime">sktime non-collaborative stance</h3>
<p>In the first run of this (on 28th January 2026), I encountered several errors in AutoETS in sktime: it took an extremely long time to compute (see the table below &#8211; on average around 30 seconds per time series) and produced ridiculous forecasts (mean RMSSE was 106,951). I <a href="https://github.com/sktime/sktime/issues/9291">filed an issue</a> in their repo and sent a courtesy message to Franz Kiraly on LinkedIn the same day, saying that I would be happy to rerun the results if this was fixed. I then received an insulting email from him, blaming me for not collaborating and trying to diminish sktime. They then closed the issue, claiming that it was fixed. I reran the experiment with their development version from GitHub on 21st March (two months later!), only to get exactly the same results. I do not think their fix is working, but given Franz&#8217;s toxic behaviour, I am not going to rerun this any further or help him in any way. The Jupyter notebook with the experiment is <a href="https://github.com/config-i1/smooth/blob/master/python/tests/notebooks/benchmark_exponential_smoothing.ipynb">here</a>, so he can investigate on his own.</p>
<h3 id="evaluationResults">Results</h3>
<p>So, here are the summary results for the tested models:</p>
<pre>====================================================================================
EVALUATION RESULTS for RMSSE (All Series)
====================================================================================
               Method       Min        Q1       Med        Q3       Max      Mean
        ADAM ETS Back  0.018252  <strong>0.663358</strong>  <strong>1.161473</strong>  <strong>2.301861</strong>  <strong>50.25854</strong>  1.928086
         ADAM ETS Opt  0.024155  0.670682  1.185932  2.365498  51.61599  1.943729
         ADAM ETS Two  0.024599  0.669522  1.182516  2.342385  51.61599  1.947715
              ES Back  0.018252  0.667225  1.160971  2.313932  <strong>50.25854</strong>  1.927436
               ES Opt  0.024155  0.673575  1.185756  2.364915  51.61599  1.947180
               ES Two  0.024467  0.671771  1.187368  2.346343  51.61599  1.955076
               ES XXX  0.018252  0.677717  1.170823  2.306197  <strong>50.25854</strong>  1.961318
               ES ZZZ  <strong>0.011386</strong>  0.670211  1.179916  2.353334  115.5442  2.053459
               ES FFF  <strong>0.011386</strong>  0.680956  1.211736  2.449626  115.5442  2.100899
               ES SXS  0.018252  0.674537  1.169187  2.353334  <strong>50.25854</strong>  1.939847
statsforecast AutoETS  0.024468  0.673157  1.189209  2.326650  51.61597  <strong>1.923925</strong>
   skforecast AutoETS  0.074744  0.747200  1.344916  2.721083  50.54339  2.273724
       sktime AutoETS  0.024467  0.676191  1.190093  2.456184 565753200  106951.7</pre>
<p>Things to note:</p>
<ol>
<li>The best performing ETS on average is from Nixtla&#8217;s statsforecast package. The second best is our implementation (ADAM/ES) with backcasting;</li>
<li>Given that I <a href="/2026/02/09/smooth-v4-4-0/">ran exactly the same experiment for the R packages</a>, we can conclude that Nixtla&#8217;s implementation is even better than the one in the forecast package in R;</li>
<li>In terms of median RMSSE, ES with backcasting outperforms all other implementations;</li>
<li>ADAM ETS is the best in terms of the first and third quartiles of RMSSE;</li>
<li>ADAM ETS and ES perform quite similarly. This is expected, because ES is a wrapper of ADAM ETS that assumes normality for the error term, while ADAM ETS itself switches between the Normal and Gamma distributions depending on the type of error term;</li>
<li>Backcasting leads to the most accurate forecasts on these datasets. This does not mean it is a universal rule, and I am sure the situation will change for other datasets;</li>
<li>ES XXX gives exactly the same results as the one implemented in <a href="/2026/02/09/smooth-v4-4-0/#evaluationResults">the R version of the package</a>. This is important because we were aiming to reproduce results between R and Python with 100% precision, and we did. The reason why other ETS flavours differ between R and Python is that since smooth 4.4.0 for R, we changed how point forecasts are calculated for multiplicative component models: previously, they relied on simulations; now we simply use point forecasts. While this is not entirely statistically accurate, it is pragmatic because it avoids explosive trajectories.</li>
</ol>
<p>The results for SAME are qualitatively similar to those for RMSSE:</p>
<pre>===================================================================================
EVALUATION RESULTS for SAME (All Series)
===================================================================================
               Method       Min        Q1       Med        Q3       Max      Mean
        ADAM ETS Back  0.001142  0.374466  0.995272  <strong>2.402342</strong>  52.34177  1.951070
         ADAM ETS Opt  0.000106  0.373551  1.021661  2.485596  55.10179  1.962040
         ADAM ETS Two  0.000782  0.380398  1.029629  2.451008  55.10186  1.970422
              ES Back  0.001142  0.372777  <strong>0.994547</strong>  2.412503  53.45041  1.946517
               ES Opt  0.000217  <strong>0.372725</strong>  1.024666  2.478435  54.68603  1.967217
               ES Two  0.000095  0.384795  1.028561  2.454543  54.68558  1.982261
               ES XXX  0.000094  0.373315  1.005006  2.425682  <strong>53.16973</strong>  1.992656
               ES ZZZ  0.000760  0.386673  1.017732  2.467912  145.7604  2.107522
               ES FFF  0.000597  0.401426  1.048395  2.566151  145.7604  2.173559
               ES SXS  0.000597  0.375438  1.005603  2.450490  53.45041  1.964322
statsforecast AutoETS  0.000228  0.374821  1.015205  2.434952  53.61359  <strong>1.938682</strong>
   skforecast AutoETS  0.000993  0.457066  1.217751  2.954003  59.84443  2.409900
       sktime AutoETS  0.000433  0.392802  1.029433  2.571555 385286900  72931.90</pre>
<p>Finally, I also measured the computational time and got the following summary. Note that I moved ES flavours below because they are not directly comparable with the others (they have special pools of models):</p>
<pre>================================================================================
EVALUATION RESULTS for Computational Time in seconds (All Series)
================================================================================
               Method       Min        Q1       Med        Q3        Max      Mean
        ADAM ETS Back  0.008679  0.086219  0.140929  0.241768   <strong>0.923601</strong>  0.181546
         ADAM ETS Opt  0.051302  0.229400  0.315379  0.792827   2.638193  0.550378
         ADAM ETS Two  0.051314  0.287382  0.455149  1.080276   3.535120  0.715090
              ES Back  0.009274  0.085299  0.139511  0.247114   0.958868  0.182547
               ES Opt  0.053176  0.224293  0.312200  0.772161   2.888662  0.541243
               ES Two  0.048539  0.279598  0.446404  1.058847   3.553156  0.703183
statsforecast AutoETS  <strong>0.001770</strong>  <strong>0.007702</strong>  <strong>0.081189</strong>  <strong>0.167040</strong>   1.271202  <strong>0.102575</strong>
   skforecast AutoETS  0.021553  0.243102  0.302078  1.478667   7.482101  0.820576
       sktime AutoETS  0.128021  6.227344 19.170921 41.494513 229.793067 30.712191

================================================================================
               ES XXX  0.008793  0.054139  0.093792  0.144012   0.480976  0.104800
               ES ZZZ  0.010502  0.127297  0.184561  0.447649   1.794031  0.313157
               ES FFF  0.046921  0.215119  1.110657  1.557724   3.489375  1.071004
               ES SXS  0.024909  0.116427  0.434594  0.564973   1.251257  0.403988</pre>
<p>Things to note:</p>
<ol>
<li>Nixtla&#8217;s implementation is actually the fastest and very hard to beat. As far as I understand, they did great work implementing some code in C++ and then using numba (I do not know what that means yet);</li>
<li>Our implementation does not use numba, but our ETS with backcasting is still faster than the skforecast and sktime implementations;</li>
<li>ADAM ETS with backcasting also has the lowest maximum time, implying that in difficult situations it finds a solution relatively quickly compared with the others.</li>
</ol>
<p>So, overall, I would argue that the smooth implementation of ETS is competitive with other implementations. But it has one important benefit: it supports more features. And we plan to expand it further to make it even more useful across a wider variety of cases.</p>
<h2 id="whatsNext">What&#8217;s next?</h2>
<p>There are still a lot of features that we have not managed to implement yet. Here is a non-exhaustive list:</p>
<ol>
<li><a href="/adam/ADAMX.html">Explanatory variables to have ETSX/ARIMAX</a>;</li>
<li><a href="/adam/ADAMARIMA.html">ARIMA</a>;</li>
<li><a href="/adam/ADAMIntermittent.html">Occurrence model</a> for intermittent demand forecasting;</li>
<li><a href="/adam/ADAMscaleModel.html">Scale model</a> for ADAM;</li>
<li><a href="/adam/ADAMUncertaintySimulation.html">Simulation functions</a>;</li>
<li><a href="/adam/diagnostics.html">Model diagnostics</a>;</li>
<li>CES, MSARIMA, SSARIMA, GUM, and SMA &#8211; functions that are available in R and not yet ported to Python.</li>
</ol>
<p>So, lots of work to do. I am sure we will be quite busy well into 2026.</p>
<h2 id="summary">Summary</h2>
<p>It has been a long and winding road, but Filotas and Leo did an amazing job to make this happen. The existing ETS implementation in <code>smooth</code> already works quite well and quite fast. It does not fail as some other implementations do, and it is quite reliable. I have actually spent many years testing the R version on different time series to make sure that it produces something sensible no matter what. The code was translated to Python one-to-one, so I am fairly confident that the function will work as expected in 99.9% of cases (there is always a non-zero probability that something will go wrong). Both ADAM and ES already support a variety of features that you might find useful.</p>
<p>One thing I will kindly ask of you is that if you find a bug or an issue when running experiments on your datasets, please file it in our GitHub repo <a href="https://github.com/config-i1/smooth/issues">here</a> &#8211; we will try to find the time to fix it. Also, if you would like to contribute by translating some features from R to Python or implementing something additional, please get in touch with me.</p>
<p>Finally, I am always glad to hear success stories. If you find the <code>smooth</code> package useful in your work, please let us know. One way of doing that is via the <a href="https://github.com/config-i1/smooth/discussions">Discussions</a> on GitHub, or you can simply send me an email.</p>
<p>Message <a href="https://openforecast.org/2026/04/09/smooth-forecasting-with-the-smooth-package-in-python/">smooth forecasting with the smooth package in Python</a> first appeared on <a href="https://openforecast.org">Open Forecasting</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://openforecast.org/2026/04/09/smooth-forecasting-with-the-smooth-package-in-python/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>The real Dunning-Kruger effect</title>
		<link>https://openforecast.org/2026/03/23/the-real-dunning-kruger-effect/</link>
					<comments>https://openforecast.org/2026/03/23/the-real-dunning-kruger-effect/#respond</comments>
		
		<dc:creator><![CDATA[Ivan Svetunkov]]></dc:creator>
		<pubDate>Mon, 23 Mar 2026 09:03:35 +0000</pubDate>
				<category><![CDATA[Social media]]></category>
		<category><![CDATA[statistics]]></category>
		<category><![CDATA[theory]]></category>
		<guid isPermaLink="false">https://openforecast.org/?p=4096</guid>

					<description><![CDATA[<p>Many of you have seen this image on the Internet — I&#8217;ve seen it myself a few times on LinkedIn lately. People say it depicts the &#8220;Dunning-Kruger&#8221; effect&#8230; But did you know this is actually an internet meme with little to do with the original paper? Here is one of the recent examples, a screenshot [&#8230;]</p>
<p>Message <a href="https://openforecast.org/2026/03/23/the-real-dunning-kruger-effect/">The real Dunning-Kruger effect</a> first appeared on <a href="https://openforecast.org">Open Forecasting</a>.</p>
]]></description>
										<content:encoded><![CDATA[<p>Many of you have seen this image on the Internet — I&#8217;ve seen it myself a few times on LinkedIn lately. People say it depicts the &#8220;Dunning-Kruger&#8221; effect&#8230; But did you know this is actually an internet meme with little to do with the original paper?</p>
<p>Here is one of the recent examples, a screenshot of <a href="https://www.linkedin.com/posts/fotios-petropoulos-04536023_dear-mr-i-reduce-forecast-error-by-30-share-7437246645530140672-NXnT">the post of Fotios Petropoulos</a> about the effect.</p>
<div id="attachment_4098" style="width: 282px" class="wp-caption aligncenter"><a href="https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2026/03/2026-03-22-Dunning-Kruger-Petropoulos.png&amp;nocache=1"><img fetchpriority="high" decoding="async" aria-describedby="caption-attachment-4098" src="https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2026/03/2026-03-22-Dunning-Kruger-Petropoulos-272x300.png&amp;nocache=1" alt="A LinkedIn post by Fotios Petropoulos" width="272" height="300" class="size-medium wp-image-4098" srcset="https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2026/03/2026-03-22-Dunning-Kruger-Petropoulos-272x300.png&amp;nocache=1 272w, https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2026/03/2026-03-22-Dunning-Kruger-Petropoulos.png&amp;nocache=1 556w" sizes="(max-width: 272px) 100vw, 272px" /></a><p id="caption-attachment-4098" class="wp-caption-text">A LinkedIn post by Fotios Petropoulos</p></div>
<p>In the original paper, <a href="https://psycnet.apa.org/doi/10.1037/0022-3514.77.6.1121">Kruger and Dunning (1999)</a> ran experiments with undergraduates on humour, logical reasoning, and grammar. Participants completed a test and estimated their percentile rank. The authors then sorted participants into four quartiles by actual performance and computed averages for actual and self-assessed performance for each quartile. The plots in their paper &#8211; the real Dunning–Kruger effect &#8211; are just four data points per line, not a smooth curve over a learning journey (second image).</p>
<p>What did they find? People in the bottom quartile substantially overestimated their performance, often believing they were average or above. Top performers slightly underestimated their standing. The key finding is an asymmetry in miscalibration: low performers overestimate, high performers slightly underestimate.</p>
<p>This has almost nothing to do with the popular &#8220;experience vs. confidence&#8221; image. The original X‑axis is performance quartile at a single point in time; the meme&#8217;s X‑axis is a vague notion of &#8220;experience&#8221; through time. The original Y‑axis is the assessed test percentile; the meme&#8217;s is a free‑floating &#8220;confidence&#8221; construct. In the actual data, perceived performance increases with actual performance &#8211; there is no early spike, no &#8220;valley of despair,&#8221; no &#8220;slope of enlightenment.&#8221; That swooping curve is an internet-era graphic never reported by Kruger and Dunning, and it misleadingly frames the effect as a personal development trajectory the paper never studied.</p>
<p>There is also a serious critique of the original paper from a statistical point of view. For example, <a href="https://doi.org/10.1016/j.intell.2020.101449">Gignac and Zajenkowski (2020)</a> showed that sorting people into quartiles and plotting average self-assessment against average performance can, by itself, generate the characteristic pattern &#8211; purely as a statistical artefact. In their own empirical data, miscalibration was roughly constant across ability levels, consistent with measurement noise rather than a special cognitive deficit in low performers. You can actually reproduce the pattern using two random uncorrelated variables. Here is a simple example in R:</p>
<pre class="decode">set.seed(41)

x <- rnorm(10000, 100, 10)
y <- rnorm(10000, 100, 10)
plot(x,y)
xQ <- quantile(x)
yQ <- quantile(y)

yMeans <- xMeans <- vector("numeric",4)

for(i in 1:4){
    # ">=" on the lower bound, so that the minimum observation is not dropped
    xMeans[i] <- mean(x[x>=xQ[i] &#038; x<xQ[i+1]])
    yMeans[i] <- mean(y[x>=xQ[i] &#038; x<xQ[i+1]])
}

plot(1:4, xMeans, type="b", ylim=range(xMeans,yMeans),
     xlab="Real performance", ylab="Assessed performance",
     lwd=2)
lines(yMeans, lwd=2, lty=2)
points(yMeans, lwd=2)
legend("topleft",
       legend=c("Actual performance", "Assessed performance"),
       lwd=2, lty=c(1,2), pch=1)</pre>
<p>This produces an image like the following:</p>
<div id="attachment_4100" style="width: 310px" class="wp-caption aligncenter"><a href="https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2026/03/2026-03-22-Dunning-Kruger-R.png&amp;nocache=1"><img decoding="async" aria-describedby="caption-attachment-4100" src="https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2026/03/2026-03-22-Dunning-Kruger-R-300x175.png&amp;nocache=1" alt="Dunning-Kruger plot reproduction" width="300" height="175" class="size-medium wp-image-4100" srcset="https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2026/03/2026-03-22-Dunning-Kruger-R-300x175.png&amp;nocache=1 300w, https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2026/03/2026-03-22-Dunning-Kruger-R-1024x597.png&amp;nocache=1 1024w, https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2026/03/2026-03-22-Dunning-Kruger-R-768x448.png&amp;nocache=1 768w, https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2026/03/2026-03-22-Dunning-Kruger-R.png&amp;nocache=1 1200w" sizes="(max-width: 300px) 100vw, 300px" /></a><p id="caption-attachment-4100" class="wp-caption-text">Dunning-Kruger plot reproduction</p></div>
<p>If you introduce a correlation between the two variables, the image starts looking even more similar to the ones from the original paper.</p>
<p>So there might be a real effect &#8211; many follow-up studies have measured it with more rigorous tools &#8211; but Kruger and Dunning&#8217;s original method was not the right one to establish it. And the &#8220;experience vs confidence&#8221; image is just a meme and a serious misconception that should not be presented as their finding.</p>
<p>P.S. If you wonder who the &#8220;leading expert&#8221; that Fotios Petropoulos refers to in his post is &#8211; it&#8217;s me. Not sure why he doesn&#8217;t tag me properly.</p>
<p>Message <a href="https://openforecast.org/2026/03/23/the-real-dunning-kruger-effect/">The real Dunning-Kruger effect</a> first appeared on <a href="https://openforecast.org">Open Forecasting</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://openforecast.org/2026/03/23/the-real-dunning-kruger-effect/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>There&#8217;s no such thing as &#8220;deterministic forecast&#8221;</title>
		<link>https://openforecast.org/2026/03/02/there-s-no-such-thing-as-deterministic-forecast/</link>
					<comments>https://openforecast.org/2026/03/02/there-s-no-such-thing-as-deterministic-forecast/#respond</comments>
		
		<dc:creator><![CDATA[Ivan Svetunkov]]></dc:creator>
		<pubDate>Mon, 02 Mar 2026 22:45:31 +0000</pubDate>
				<category><![CDATA[Social media]]></category>
		<category><![CDATA[Theory of forecasting]]></category>
		<category><![CDATA[extrapolation methods]]></category>
		<category><![CDATA[statistics]]></category>
		<category><![CDATA[theory]]></category>
		<guid isPermaLink="false">https://openforecast.org/?p=4081</guid>

					<description><![CDATA[<p>Sometimes I see people referring to a &#8220;deterministic&#8221; forecast, and I have some personal issues with this. Because if you apply a model to data then there is nothing deterministic about your forecasts! In many contexts, &#8220;deterministic&#8221; has a precise meaning: no randomness, no uncertainty. A deterministic solution to an optimisation problem (e.g. linear programming) [&#8230;]</p>
<p>Message <a href="https://openforecast.org/2026/03/02/there-s-no-such-thing-as-deterministic-forecast/">There&#8217;s no such thing as &#8220;deterministic forecast&#8221;</a> first appeared on <a href="https://openforecast.org">Open Forecasting</a>.</p>
]]></description>
										<content:encoded><![CDATA[<p>Sometimes I see people referring to a &#8220;deterministic&#8221; forecast, and I have some personal issues with this: if you apply a model to data, there is nothing deterministic about your forecasts!</p>
<p>In many contexts, &#8220;deterministic&#8221; has a precise meaning: no randomness, no uncertainty. A deterministic solution to an optimisation problem (e.g. linear programming) implies that there are no random inputs or outputs once the model and its parameters are fixed. Forecasting is different. As <a href="https://onlinelibrary.wiley.com/doi/10.1002/(SICI)1099-131X(199612)15:7%3C495::AID-FOR640%3E3.0.CO;2-O">Chatfield</a> and many others have pointed out, forecasting has multiple sources of uncertainty, and there is essentially zero chance that the future will unfold exactly as any single number suggests.</p>
<p>Yes, some people use &#8220;deterministic&#8221; as a synonym for &#8220;point forecast&#8221;. But that label is still misleading, because a point forecast is not uncertainty-free &#8211; it is just one summary of a predictive distribution (often the conditional mean, sometimes the median or another functional).</p>
<p>Here’s a quick reality check you can do yourself. Take a dataset, apply your model, and write down the point forecast for the next few observations. Now add one new observation, re-estimate, and forecast again (the image in this post depicts exactly that, but with 50 forecasts produced on different subsamples of data). The point forecast will change unless you are dealing with an exotic situation with non-random data (e.g. every day, you sell exactly 100 units). So, which of the two was the &#8220;deterministic&#8221; forecast? If forecasts were truly deterministic in the strict sense, you would not get multiple plausible values from small, reasonable changes in the sample.</p>
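<p>To see this in action without any packages, here is a minimal Python sketch (my own toy code, purely illustrative, not the exact experiment from the image): simple exponential smoothing fitted by a crude grid search over the smoothing parameter, first without and then with the last observation.</p>

```python
import random

def ses_fit_forecast(y):
    """Fit Simple Exponential Smoothing by a grid search over alpha
    (minimising in-sample squared errors) and return the point forecast."""
    best_alpha, best_sse = None, float("inf")
    for alpha in [i / 100 for i in range(1, 100)]:
        level, sse = y[0], 0.0
        for obs in y[1:]:
            sse += (obs - level) ** 2
            level += alpha * (obs - level)
        if sse < best_sse:
            best_alpha, best_sse = alpha, sse
    # Produce the final forecast with the best alpha found
    level = y[0]
    for obs in y[1:]:
        level += best_alpha * (obs - level)
    return level

random.seed(42)
y = [100 + random.gauss(0, 10) for _ in range(50)]

f1 = ses_fit_forecast(y[:-1])  # forecast before the last observation arrives
f2 = ses_fit_forecast(y)       # forecast after adding one observation
print(f1, f2)
```

Both numbers are legitimate point forecasts from the same model on almost the same data, yet they differ, because the forecast is a function of a random sample rather than a deterministic quantity.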
<p>This happens because any forecasting method (statistical or ML) depends on data and on modelling choices: parameter estimation, feature selection, splitting rules, tuning, even decisions like &#8220;use α=0.1&#8221;. Those choices can be fixed across samples of data, but fixing them does not remove uncertainty &#8211; it only hides it. The randomness is still there in the data and in the fact that we only observe a sample of it.</p>
<p>So when you see someone mentioning &#8220;deterministic forecast&#8221;, it&#8217;s worth translating it mentally to: &#8220;a point forecast, probably a conditional mean&#8221;. If you care about decisions and risk, you should know that there is an uncertainty associated with this so called &#8220;deterministic forecast&#8221;, and that it should not be ignored. But this is a topic for another discussion in another post.</p>
<p>Message <a href="https://openforecast.org/2026/03/02/there-s-no-such-thing-as-deterministic-forecast/">There&#8217;s no such thing as &#8220;deterministic forecast&#8221;</a> first appeared on <a href="https://openforecast.org">Open Forecasting</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://openforecast.org/2026/03/02/there-s-no-such-thing-as-deterministic-forecast/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>Scaling of error measures</title>
		<link>https://openforecast.org/2026/02/23/scaling-of-error-measures/</link>
					<comments>https://openforecast.org/2026/02/23/scaling-of-error-measures/#respond</comments>
		
		<dc:creator><![CDATA[Ivan Svetunkov]]></dc:creator>
		<pubDate>Mon, 23 Feb 2026 13:36:12 +0000</pubDate>
				<category><![CDATA[Forecast evaluation]]></category>
		<category><![CDATA[Social media]]></category>
		<category><![CDATA[Theory of forecasting]]></category>
		<category><![CDATA[error measures]]></category>
		<category><![CDATA[theory]]></category>
		<guid isPermaLink="false">https://openforecast.org/?p=4054</guid>

					<description><![CDATA[<p>Apparently, we need to talk about scaling of error measures because this is not as obvious as it seems. In forecasting literature, since early days of the area, there has been a general consensus that the forecast errors from the individual time series should not be analysed and aggregated as is. This is because you [&#8230;]</p>
<p>Message <a href="https://openforecast.org/2026/02/23/scaling-of-error-measures/">Scaling of error measures</a> first appeared on <a href="https://openforecast.org">Open Forecasting</a>.</p>
]]></description>
										<content:encoded><![CDATA[<p>Apparently, we need to talk about scaling of error measures because this is not as obvious as it seems.</p>
<p>In the forecasting literature, since the early days of the field, there has been a general consensus that forecast errors from individual time series should not be analysed and aggregated as is. This is because different time series can capture the dynamics of very different processes.</p>
<p>Indeed, if you forecast sales of apples in kilograms, your actual values are in kilograms, and your point forecast is in the same units. Subtracting one from the other tells us how many kilograms of apples we missed with our forecast. But if we then average the forecast errors for apples and beer, we would be aggregating quantities in different units, which contradicts basic aggregation principles.</p>
<p>Furthermore, if a company sells thousands of kilograms of apples but only a handful of jet engines, aggregating forecast errors across them (e.g. 3000 vs 3) might introduce all sorts of issues, because the model&#8217;s performance on apples might mask its performance on jet engines. Yet jet engines are much more expensive than apples, and forecasting them accurately might matter more to the company.</p>
<p>So, the forecasting literature has agreed that forecast errors need to be scaled somehow, to make them unitless and to avoid distorting the performance of models on time series with different volumes. There are several ways of doing that, some poor and some reasonable. The current state of the art is to divide error measures by an in-sample statistic, to avoid potential holdout-sample distortions. Using the mean absolute differences (MAD) of the data for this (thus ending up with MASE or RMSSE) is considered standard. A couple of years ago, <a href="/2019/08/25/are-you-sure-youre-precise-measuring-accuracy-of-point-forecasts/">I wrote a post about the advantages and disadvantages of several scaling methods</a>.</p>
<p>But there is one method that I haven&#8217;t looked at and which is not very well discussed in the forecasting literature. It relies on the monetary value of forecasts. We could multiply each individual forecast error &#8220;e&#8221; by the price of the product &#8220;p&#8221; (thus moving to the missed income per product) and then divide everything by the overall income (price times quantity) from different products. This can be written as:</p>
<p>\begin{equation}<br />
\text{monetary Mean Error} = \frac{\sum_{j=1}^n (p_j \times e_j)} {\sum_{j=1}^n (p_j \times q_j)}<br />
\end{equation}</p>
<p>(the above formula can be modified to use squares or absolute values of the error). This way, we switch from the original units to monetary values, and the resulting measure tells you what share of the overall income was missed. This is useful because it connects model performance with managerial decisions and takes the value of each product into account (so the expensive jet engines are not masked by the cheap apples).</p>
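<p>As a sketch of this calculation (the function name and the data are mine, purely for illustration), following the formula above:</p>

```python
def monetary_mean_error(errors, prices, quantities):
    """Monetary Mean Error: missed income as a share of total income.
    errors[j] = actual - forecast for product j (in the product's own units),
    prices[j] and quantities[j] are the price and sold quantity of product j."""
    missed_income = sum(p * e for p, e in zip(prices, errors))
    total_income = sum(p * q for p, q in zip(prices, quantities))
    return missed_income / total_income

# Apples: cheap, large error in kg; jet engines: expensive, tiny error in units
errors = [150.0, 0.5]
prices = [2.0, 1e6]
quantities = [3000.0, 3.0]
print(monetary_mean_error(errors, prices, quantities))
```

Note that the jet engine error dominates the result here, despite being tiny in units, which is exactly the point of weighting by price. If all prices are equal, the prices cancel out and the measure reduces to the sum of errors over the total quantity sold.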
<p>However, it might have a potential issue similar to the one that MAE/Mean or wMAPE have: if the sales of a product are not stationary, the denominator will change, driving the proportion up or down irrespective of how good the forecast is. I am not sure whether this needs to be addressed, because there is an argument that if the income from a product has increased and the error has not changed, then the proportion of the missed income has decreased, which makes sense. But if we do need to address it, we can switch to the MAD multiplied by price in the denominator. In fact, this was sort of done in the <a href="https://doi.org/10.1016/j.ijforecast.2021.11.013">M5 competition</a>, which used a weighted RMSSE relying on the income from each product over the last 4 weeks of data.</p>
<p>But here is one more interesting thing about this error measure. If we <strong>assume that prices for all products are exactly the same</strong>, they disappear from the numerator and the denominator, leaving us with the sum of errors divided by the overall sales of all products. This still maintains the original idea of the proportion of the missed income, but now relies on a very strong assumption, which is probably not correct in real life (apples and jet engines for the same price?). Furthermore, this would again mask the performance of the model on the expensive products. I personally don&#8217;t like this measure and find the assumption unrealistic and potentially misleading. Having said that, I can see some cases where it could still be acceptable and useful (e.g. similar products with similar dynamics and similar prices).</p>
<p>Summarising:</p>
<ol>
<li>If you are conducting a forecasting experiment without a specific context, I&#8217;d recommend using RMSSE or some other similar measure with scaling.</li>
<li>If you have prices of products, income-based scaling might be more informative.</li>
<li>Setting all prices to the same value does not sound appealing to me, but I understand that there is a context where this might work.</li>
</ol>
<p>Message <a href="https://openforecast.org/2026/02/23/scaling-of-error-measures/">Scaling of error measures</a> first appeared on <a href="https://openforecast.org">Open Forecasting</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://openforecast.org/2026/02/23/scaling-of-error-measures/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>smooth v4.4.0</title>
		<link>https://openforecast.org/2026/02/09/smooth-v4-4-0/</link>
					<comments>https://openforecast.org/2026/02/09/smooth-v4-4-0/#respond</comments>
		
		<dc:creator><![CDATA[Ivan Svetunkov]]></dc:creator>
		<pubDate>Mon, 09 Feb 2026 09:02:21 +0000</pubDate>
				<category><![CDATA[Package smooth for R]]></category>
		<category><![CDATA[R]]></category>
		<category><![CDATA[ADAM]]></category>
		<category><![CDATA[ARIMA]]></category>
		<category><![CDATA[CES]]></category>
		<category><![CDATA[ETS]]></category>
		<category><![CDATA[GUM]]></category>
		<category><![CDATA[smooth]]></category>
		<guid isPermaLink="false">https://openforecast.org/?p=3959</guid>

					<description><![CDATA[<p>Great news, everyone! smooth package for R version 4.4.0 is now on CRAN. Why is this great news? Let me explain! On this page: What&#8217;s new? Evaluation Setup Results What&#8217;s next? Here is what&#8217;s new since 4.3.0: First, I have worked on tuning the initialisation in adam() in case of backcasting, and improved the [&#8230;]</p>
<p>Message <a href="https://openforecast.org/2026/02/09/smooth-v4-4-0/">smooth v4.4.0</a> first appeared on <a href="https://openforecast.org">Open Forecasting</a>.</p>
]]></description>
										<content:encoded><![CDATA[<p>Great news, everyone! The smooth package for R, version 4.4.0, is now on CRAN. Why is this great news? Let me explain!</p>
<p>On this page:</p>
<ul>
<li><a href="#whatsNew">What&#8217;s new?</a></li>
<li><a href="#evaluation">Evaluation</a></li>
<ul>
<li><a href="#evaluationSetup">Setup</a></li>
<li><a href="#evaluationResults">Results</a></li>
</ul>
<li><a href="#whatsNext">What&#8217;s next?</a></li>
</ul>
<h3 id="whatsNew">Here is what&#8217;s new since 4.3.0:</h3>
<p>First, I have worked on tuning the initialisation in <code>adam()</code> in case of backcasting, and improved the <code>msdecompose()</code> function a bit to get more robust results. This was necessary to make sure that when the smoothing parameters are close to zero, initial values would still make sense. This is already in <code>adam</code> (use <code>smoother="global"</code> to test), but will become the default behaviour in the next version of the package, when we iron everything out. This is all a part of a larger work with Kandrika Pritularga on a paper about the initialisation of dynamic models.</p>
<p>Second, I have fixed a long-standing issue with the eigenvalues calculation inside the dynamic models, which applies only in case of <code>bounds="admissible"</code> and might impact ARIMA, CES and GUM. The parameter restrictions are now applied consistently across all functions, guaranteeing that they will not fail and will produce stable/invertible parameter estimates.</p>
<p>Third, I have added the Sparse ARMA function, which constructs an ARMA(p,q) with only the specified orders, dropping all lower-order terms. For example, SpARMA(2,3) would take the following form:<br />
\begin{equation*}<br />
y_t = \phi_2 y_{t-2} + \theta_3 \epsilon_{t-3} + \epsilon_{t}<br />
\end{equation*}<br />
This weird model is needed for a project I am working on together with Devon Barrow, Nikos Kourentzes and Yves Sagaert. I&#8217;ll explain more when we get the final draft of the paper.</p>
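<p>To make the structure of this model concrete, here is a hypothetical simulation of the SpARMA(2,3) recursion above in Python (just the equation written out with my own function name, not the smooth implementation):</p>

```python
import random

def simulate_sparma23(n, phi2, theta3, sigma=1.0, seed=7):
    """Simulate y_t = phi2*y_{t-2} + theta3*eps_{t-3} + eps_t,
    i.e. an ARMA(2,3) in which phi1 = theta1 = theta2 = 0."""
    rng = random.Random(seed)
    eps = [rng.gauss(0, sigma) for _ in range(n)]
    y = [0.0] * n
    for t in range(n):
        ar = phi2 * y[t - 2] if t >= 2 else 0.0   # only the lag-2 AR term
        ma = theta3 * eps[t - 3] if t >= 3 else 0.0  # only the lag-3 MA term
        y[t] = ar + ma + eps[t]
    return y

y = simulate_sparma23(200, phi2=0.5, theta3=0.3)
print(len(y), round(sum(y) / len(y), 3))
```

With |phi2| &lt; 1 the simulated series is stationary around zero, and the intermediate lags contribute nothing, which is the whole point of the sparse specification.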
<p>And something very important, which you will not notice: I refactored the C++ code in the package so that it is available not only for R, but also for Python&#8230; Why? I&#8217;ll explain in the next post :). But this also means that the old functions that relied on the previous generation of the C++ code are now discontinued, and all the smooth functions use the new core. This applies to <code>es()</code>, <code>ssarima()</code>, <code>msarima()</code>, <code>ces()</code>, <code>gum()</code> and <code>sma()</code>. You will not notice any change, except that some of them should become a bit faster and probably more robust. And this also means that all of them will now be able to use methods for the <code>adam()</code> function. For example, the <code>summary()</code> will produce the proper output with standard errors and confidence intervals for all estimated parameters.</p>
<h2 id="evaluation">Evaluation</h2>
<p><strong>DISCLAIMER</strong>: The previous evaluation was for smooth v4.3.0; you can find it <a href="/2025/07/04/smooth-v4-3-0-in-r-what-s-new-and-what-s-next/">here</a>. I have changed one of the error measures (sCE to SAME), but the rest is the same, so the results are broadly comparable between the versions.</p>
<h3 id="evaluationSetup">The setup</h3>
<p>As usual, in situations like this, I have run the evaluation on the M1, M3 and Tourism competition data. This time, I have added more flavours of ETS model selection, so that you can see how the pool of models impacts forecasting accuracy. A short description:</p>
<ol>
<li>XXX &#8211; select between pure additive ETS models only;</li>
<li>ZZZ &#8211; select from the pool of all 30 models, but use branch-and-bound to kick out the less suitable models;</li>
<li>ZXZ &#8211; same as (2), but without the multiplicative trend models. This is used in the <code>smooth</code> functions <strong>by default</strong>;</li>
<li>FFF &#8211; select from the pool of all 30 models (exhaustive search);</li>
<li>SXS &#8211; the pool of models that is used by default in <code>ets()</code> from the <code>forecast</code> package in R.</li>
</ol>
<p>I also tested three types of the ETS initialisation:</p>
<ol>
<li>Back &#8211; <code>initial="backcasting"</code></li>
<li>Opt &#8211; <code>initial="optimal"</code></li>
<li>Two &#8211; <code>initial="two-stage"</code></li>
</ol>
<p>Backcasting is now the default method of initialisation, and does well in many cases, but I found that optimal initials (if done correctly) help in some difficult situations, as long as you have enough computational time.</p>
<p>I used two error measures and computational time to check how functions work. The first error measure is called RMSSE (Root Mean Squared Scaled Error) from <a href="http://dx.doi.org/10.1016/j.ijforecast.2021.11.013">M5 competition</a>, motivated by <a href="http://dx.doi.org/10.1016/j.ijforecast.2022.08.003">Athanasopoulos &#038; Kourentzes (2023)</a>:</p>
<p>\begin{equation*}<br />
\mathrm{RMSSE} = \frac{1}{\sqrt{\frac{1}{T-1} \sum_{t=1}^{T-1} \Delta_t^2}} \mathrm{RMSE},<br />
\end{equation*}<br />
where \(\mathrm{RMSE} = \sqrt{\frac{1}{h} \sum_{j=1}^h e^2_{t+j}}\) is the Root Mean Squared Error of the point forecasts, and \(\Delta_t\) are the first differences of the in-sample actual values.</p>
<p>The second measure does not have a standard name in the literature, but the idea is to measure the bias of forecasts while removing its sign, so that positively biased forecasts on some time series are not cancelled out by negatively biased ones on others. I call this measure &#8220;Scaled Absolute Mean Error&#8221; (SAME):</p>
<p>\begin{equation*}<br />
\mathrm{SAME} = \frac{1}{\frac{1}{T-1} \sum_{t=1}^{T-1} |\Delta_t|} \mathrm{AME},<br />
\end{equation*}<br />
where \(\mathrm{AME}= \left| \frac{1}{h} \sum_{j=1}^h e_{t+j} \right|\).</p>
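<p>Both measures can be sketched in a few lines of Python, following the formulas above (greybox implements them in R; this is just an illustration with made-up numbers):</p>

```python
def rmsse(holdout, forecasts, insample):
    """Root Mean Squared Scaled Error: RMSE scaled by the
    root mean squared first differences of the in-sample data."""
    h = len(holdout)
    rmse = (sum((a - f) ** 2 for a, f in zip(holdout, forecasts)) / h) ** 0.5
    diffs = [insample[t + 1] - insample[t] for t in range(len(insample) - 1)]
    scale = (sum(d ** 2 for d in diffs) / len(diffs)) ** 0.5
    return rmse / scale

def same(holdout, forecasts, insample):
    """Scaled Absolute Mean Error: the absolute mean error scaled by the
    mean absolute first differences of the in-sample data."""
    h = len(holdout)
    ame = abs(sum(a - f for a, f in zip(holdout, forecasts)) / h)
    diffs = [insample[t + 1] - insample[t] for t in range(len(insample) - 1)]
    scale = sum(abs(d) for d in diffs) / len(diffs)
    return ame / scale

insample = [10, 12, 11, 13, 12, 14]
holdout = [15, 13]
forecasts = [14, 12]
print(rmsse(holdout, forecasts, insample), same(holdout, forecasts, insample))
```

Since both errors here are +1, SAME picks up the bias, while RMSSE measures the overall magnitude of the misses relative to the series&#8217; own variability.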
<p>For both measures, lower values are better. As for the computational time, I have measured it for each model and each series, and this time I provide the distribution of times to better see how the methods perform.</p>
<div class="su-spoiler su-spoiler-style-fancy su-spoiler-icon-plus su-spoiler-closed" data-scroll-offset="0" data-anchor-in-url="no"><div class="su-spoiler-title" tabindex="0" role="button"><span class="su-spoiler-icon"></span>Boring code in R</div><div class="su-spoiler-content su-u-clearfix su-u-trim">
<pre class="decode">library(Mcomp)
library(Tcomp)
library(forecast)
library(smooth)

library(doMC)
registerDoMC(detectCores())

# Create a small but neat function that will return a vector of error measures
errorMeasuresFunction <- function(object, holdout, insample){
        holdout <- as.vector(holdout);
        insample <- as.vector(insample);
	# RMSSE and SAME are defined in greybox v2.0.7
        return(c(RMSSE(holdout, object$mean, mean(diff(insample)^2)),
                 SAME(holdout, object$mean, mean(abs(diff(insample)))),
                 object$timeElapsed))
}

datasets <- c(M1,M3,tourism)
datasetLength <- length(datasets)

# Method configuration list
# Each method specifies: fn (function name), pkg (package), model and initial
methodsConfig <- list(
	# ETS and Auto ARIMA from the forecast package in R
	"ETS" = list(fn = "ets", pkg = "forecast", use_x_only = TRUE),
	"Auto ARIMA" = list(fn = "auto.arima", pkg = "forecast", use_x_only = TRUE),
	# ADAM with different initialisation schemes
	"ADAM ETS Back" = list(fn = "adam", pkg = "smooth", model = "ZXZ", initial = "back"),
	"ADAM ETS Opt" = list(fn = "adam", pkg = "smooth", model = "ZXZ", initial = "opt"),
	"ADAM ETS Two" = list(fn = "adam", pkg = "smooth", model = "ZXZ", initial = "two"),
	# ES, which is a wrapper of ADAM. Should give very similar results to ADAM on regular data
	"ES Back" = list(fn = "es", pkg = "smooth", model = "ZXZ", initial = "back"),
	"ES Opt" = list(fn = "es", pkg = "smooth", model = "ZXZ", initial = "opt"),
	"ES Two" = list(fn = "es", pkg = "smooth", model = "ZXZ", initial = "two"),
	# Several flavours for model selection in ES
	"ES XXX" = list(fn = "es", pkg = "smooth", model = "XXX", initial = "back"),
	"ES ZZZ" = list(fn = "es", pkg = "smooth", model = "ZZZ", initial = "back"),
	"ES FFF" = list(fn = "es", pkg = "smooth", model = "FFF", initial = "back"),
	"ES SXS" = list(fn = "es", pkg = "smooth", model = "SXS", initial = "back"),
	# ARIMA implementations in smooth
	"MSARIMA" = list(fn = "auto.msarima", pkg = "smooth", initial = "back"),
	"SSARIMA" = list(fn = "auto.ssarima", pkg = "smooth", initial = "back"),
	# Complex Exponential Smoothing
	"CES" = list(fn = "auto.ces", pkg = "smooth", initial = "back"),
	# Generalised Univariate Model (experimental)
	"GUM" = list(fn = "auto.gum", pkg = "smooth", initial = "back")
)

methodsNames <- names(methodsConfig)
methodsNumber <- length(methodsNames)

measuresNames <- c("RMSSE","SAME","Time")
measuresNumber <- length(measuresNames)

testResults <- array(NA, c(methodsNumber, datasetLength, measuresNumber),
                     dimnames = list(methodsNames, NULL, measuresNames))

# Unified loop over all methods
for(j in seq_along(methodsConfig)){
	cfg <- methodsConfig[[j]]
	cat("Running method:", methodsNames[j], "\n")

	result <- foreach(i = 1:datasetLength, .combine = "cbind",
	                  .packages = c("smooth", "forecast")) %dopar% {
		startTime <- Sys.time()

		# Build model call based on method type
		if(isTRUE(cfg$use_x_only)){
			# forecast package methods: ets, auto.arima
			test <- do.call(cfg$fn, list(datasets[[i]]$x))
		}else if(cfg$fn %in% c("adam", "es")) {
			# adam and es take dataset and model
			test <- do.call(cfg$fn, list(datasets[[i]], model=cfg$model, initial = cfg$initial))
		}else{
			# auto.msarima, auto.ssarima, auto.ces, auto.gum
			test <- do.call(cfg$fn, list(datasets[[i]], initial = cfg$initial))
		}

		# Build forecast call
		forecast_args <- list(test, h = datasets[[i]]$h)
		testForecast <- do.call(forecast, forecast_args)
		testForecast$timeElapsed <- Sys.time() - startTime

		return(errorMeasuresFunction(testForecast, datasets[[i]]$xx, datasets[[i]]$x))
	}
	testResults[j,,] <- t(result)
}

</pre>
</div></div>
<h3 id="evaluationResults">Results</h3>
<p>And here are the results for the smooth functions in v4.4.0 for R. First, we summarise the RMSSE values: I produce the quartiles of the RMSSE distribution together with the mean.</p>
<pre class="decode">cbind(t(apply(testResults[,,"RMSSE"],1,quantile, na.rm=T)),
      mean=apply(testResults[,,"RMSSE"],1,mean)) |> round(4)</pre>
<pre>                  0%    25%    50%    75%      100%   mean
ETS           0.0245 0.6772 1.1806 2.3765   51.6160 1.9697
Auto ARIMA    0.0246 0.6802 1.1790 2.3583   51.6160 1.9864
ADAM ETS Back 0.0183 <strong>0.6647</strong> <strong>1.1620</strong> <strong>2.3023</strong>   <strong>50.2585</strong> <strong>1.9283</strong>
ADAM ETS Opt  0.0242 0.6714 1.1868 2.3623   51.6160 1.9432
ADAM ETS Two  0.0246 0.6690 1.1875 2.3374   51.6160 1.9480
ES Back       0.0183 0.6674 1.1647 2.3164   <strong>50.2585</strong> 1.9292
ES Opt        0.0242 0.6740 1.1858 2.3644   51.6160 1.9469
ES Two        0.0245 0.6717 1.1874 2.3463   51.6160 1.9538
ES XXX        0.0183 0.6777 1.1708 2.3062   <strong>50.2585</strong> 1.9613
ES ZZZ        <strong>0.0108</strong> 0.6682 1.1816 2.3611  201.4959 2.0841
ES FFF        0.0145 0.6795 1.2170 2.4575 5946.1858 3.3033
ES SXS        0.0183 0.6754 1.1709 2.3539   <strong>50.2585</strong> 1.9448
MSARIMA       0.0278 0.6988 1.1898 2.4208   51.6160 2.0750
SSARIMA       0.0277 0.7371 1.2544 2.4425   51.6160 2.0625
CES Back      0.0450 0.6761 1.1741 2.3205   51.0571 1.9650
GUM Back      0.0333 0.7077 1.2073 2.4533   51.6184 2.0461
</pre>
<p>The worst performing models are the ETS with the multiplicative trend (ES ZZZ and ES FFF). This is because there are outliers in some time series, and the multiplicative trend reacts to them by amending the trend value to something large (e.g. 2, i.e. a twofold increase in the level at each step), after which it can never return to a reasonable level (see the explanation of this phenomenon in <a href="https://openforecast.org/adam/ADAMETSMultiplicativeAlternative.html">Section 6.6 of the ADAM book</a>). As expected, ADAM ETS performs very similarly to ES, and we can see that the default initialisation (backcasting) is pretty good in terms of RMSSE values. To be fair, if the models were tested on a different dataset, the optimal initialisation might do better.</p>
<p>Here is a table with the SAME results:</p>
<pre class="decode">cbind(t(apply(testResults[,,"SAME"],1,quantile, na.rm=T)),
      mean=apply(testResults[,,"SAME"],1,mean)) |> round(4)</pre>
<pre>                 0%    25%    50%    75%      100%   mean
ETS           8e-04 0.3757 1.0203 2.5097   54.6872 1.9983
Auto ARIMA    <strong>0e+00</strong> 0.3992 1.0429 2.4565   53.2710 2.0446
ADAM ETS Back 1e-04 0.3752 0.9965 <strong>2.4047</strong>   <strong>52.3418</strong> 1.9518
ADAM ETS Opt  5e-04 0.3733 1.0212 2.4848   55.1018 1.9618
ADAM ETS Two  8e-04 0.3780 1.0316 2.4511   55.1019 1.9712
ES Back       <strong>0e+00</strong> 0.3733 <strong>0.9945</strong> 2.4122   53.4504 <strong>1.9485</strong>
ES Opt        2e-04 <strong>0.3727</strong> 1.0255 2.4756   54.6860 1.9673
ES Two        1e-04 0.3855 1.0323 2.4535   54.6856 1.9799
ES XXX        1e-04 0.3733 1.0050 2.4257   53.1697 1.9927
ES ZZZ        3e-04 0.3824 1.0135 2.4885  229.7626 2.1376
ES FFF        3e-04 0.3972 1.0489 2.6042 3748.4268 2.9501
ES SXS        6e-04 0.3750 1.0125 2.4627   53.4504 1.9725
MSARIMA       1e-04 0.3960 1.0094 2.5409   54.7916 2.1227
SSARIMA       1e-04 0.4401 1.1222 2.5673   52.5023 2.1248
CES Back      6e-04 0.3767 1.0079 2.4085   54.9026 2.0052
GUM Back      0e+00 0.3803 1.0575 2.6259   63.0637 2.0858
</pre>
<p>In terms of bias, the smooth implementations of ETS do well again, and we can see the same issue with the multiplicative trend as before. Another thing to note is that MSARIMA and SSARIMA are not as good as Auto ARIMA from the forecast package on these datasets, at least in terms of mean RMSSE and SAME. In fact, GUM and CES are now better than both of them in terms of both error measures.</p>
<p>Finally, here is a table with the computational time:</p>
<pre class="decode">cbind(t(apply(testResults[,,"Time"],1,quantile, na.rm=T)),
      mean=apply(testResults[,,"Time"],1,mean)) |> round(4)</pre>
<pre>                  0%    25%    50%     75%    100%   mean
ETS           <strong>0.0032</strong> <strong>0.0117</strong> 0.1660  0.6728  1.6400 0.3631
Auto ARIMA    0.0100 0.1184 0.3618  1.0548 54.3652 1.4760
ADAM ETS Back 0.0162 0.1062 0.1854  0.4022  2.5109 0.2950
ADAM ETS Opt  0.0319 0.1920 0.3103  0.6792  3.8933 0.5368
ADAM ETS Two  0.0427 0.2548 0.4035  0.8567  3.7178 0.6331
ES Back       0.0153 0.0896 <strong>0.1521</strong>  0.3335  2.1128 0.2476
ES Opt        0.0303 0.1667 0.2565  0.5910  3.5887 0.4522
ES Two        0.0483 0.2561 0.4016  0.8626  3.5892 0.6309
MSARIMA Back  0.0614 0.3418 0.6947  0.9868  3.9677 0.7534
SSARIMA Back  0.0292 0.2963 0.8988  2.1729 13.7635 1.6581
CES Back      0.0146 0.0400 0.1834  <strong>0.2298</strong>  <strong>1.2099</strong> <strong>0.1713</strong>
GUM Back      0.0165 0.2101 1.5221  3.0543  9.5380 1.9506

# Separate table for special pools of ETS.
# The time is proportional to the number of models here
=========================================================
                  0%    25%    50%     75%    100%   mean
ES XXX        0.0114 0.0539 0.0782  0.1110  0.8163 0.0859
ES ZZZ        0.0147 0.1371 0.2690  0.4947  2.2049 0.3780
ES FFF        0.0529 0.2775 1.1539  1.5926  3.8552 1.1231
ES SXS        0.0323 0.1303 0.4491  0.6013  2.2170 0.4581
</pre>
<p><em><br />
I have manually moved the ES flavours with specific model pools below, because there is no point in comparing their computational times with those of the others (they select from different pools of models and thus are not directly comparable).</em></p>
<p>What we can see from this is that ES with backcasting is faster than the other models in this setting (in terms of both mean and median computational time). CES is very fast in terms of mean computational time, probably because of its very small pool of models to choose from (only four). SSARIMA is pretty slow, which is due to the nature of its order selection algorithm (I don't plan to update it any time soon, but if someone wants to contribute &#8211; let me know). The interesting thing is that Auto ARIMA, while relatively fine in terms of median time, has the highest maximum, meaning that on some time series it got stuck for some unknown reason. The series that caused the biggest issue for Auto ARIMA is N389 from the M1 competition. I'm not sure what the issue was, and I don't have time to investigate it.</p>
<div id="attachment_4002" style="width: 310px" class="wp-caption aligncenter"><a href="https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2026/02/smoot-4-4-0-time-vs-RMSSE.png&amp;nocache=1"><img decoding="async" aria-describedby="caption-attachment-4002" src="https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2026/02/smoot-4-4-0-time-vs-RMSSE-300x180.png&amp;nocache=1" alt="Mean computational time vs mean RMSSE" width="300" height="180" class="size-medium wp-image-4002" srcset="https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2026/02/smoot-4-4-0-time-vs-RMSSE-300x180.png&amp;nocache=1 300w, https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2026/02/smoot-4-4-0-time-vs-RMSSE-768x461.png&amp;nocache=1 768w, https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2026/02/smoot-4-4-0-time-vs-RMSSE.png&amp;nocache=1 1000w" sizes="(max-width: 300px) 100vw, 300px" /></a><p id="caption-attachment-4002" class="wp-caption-text">Mean computational time vs mean RMSSE</p></div>
<p>Comparing the mean computational time with the mean RMSSE value (image above), it looks like the overall tendency for the <code>smooth</code> + <code>forecast</code> functions on the M1, M3 and Tourism datasets is that additional computational time does not improve the accuracy. But it also looks like the simpler pool of pure additive models (ETS(X,X,X)) harms the accuracy in comparison with the branch-and-bound-based selection of the default <code>model="ZXZ"</code>. There seems to be a sweet spot in terms of the pool of models to choose from (no multiplicative trend, allow mixed models). This aligns well with the papers by <a href="https://doi.org/10.1080/01605682.2024.2421339">Petropoulos et al. (2025)</a>, who investigated the accuracy of arbitrarily short pools of models, and <a href="https://doi.org/10.1016/j.ijpe.2018.05.019">Kourentzes et al. (2019)</a>, who showed how pooling (if done correctly) can improve the accuracy on average.</p>
<h3 id="whatsNext">What's next?</h3>
<p>For R, the main task now is to rewrite the <code>oes()</code> function and substitute it with the <code>om()</code> one &#8211; "Occurrence Model". This should be equivalent to <code>adam()</code> in functionality, allowing ETS, ARIMA and explanatory variables to be introduced in the occurrence part of the model. This is a huge piece of work, which I hope to progress slowly throughout 2026 and finish by the end of the year. Doing that will also allow me to remove the last bits of the old C++ code and switch to the ADAM core completely, introducing more functionality for capturing patterns in intermittent demand. The minor task is to test <code>smoother="global"</code> more for the ETS initialisation and roll it out as the default in the next release for both R and Python.</p>
<p>For Python,... What Python? Ah! You'll see soon :)</p>
<p>Message <a href="https://openforecast.org/2026/02/09/smooth-v4-4-0/">smooth v4.4.0</a> first appeared on <a href="https://openforecast.org">Open Forecasting</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://openforecast.org/2026/02/09/smooth-v4-4-0/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>Forecasting Competitions Datasets in Python</title>
		<link>https://openforecast.org/2026/01/26/forecasting-competitions-datasets-in-python/</link>
					<comments>https://openforecast.org/2026/01/26/forecasting-competitions-datasets-in-python/#respond</comments>
		
		<dc:creator><![CDATA[Ivan Svetunkov]]></dc:creator>
		<pubDate>Mon, 26 Jan 2026 09:29:25 +0000</pubDate>
				<category><![CDATA[Python]]></category>
		<category><![CDATA[Social media]]></category>
		<category><![CDATA[Competitions]]></category>
		<category><![CDATA[time series]]></category>
		<guid isPermaLink="false">https://openforecast.org/?p=3955</guid>

					<description><![CDATA[<p>Here is one small, unexpected piece of news: I now have my first package on PyPI! It’s called fcompdata, and let me tell you a little bit about it. When I test my functions in R, I usually use the M1, M3, and tourism competition datasets because they are diverse enough, containing seasonal, non-seasonal, trended, [&#8230;]</p>
<p>Message <a href="https://openforecast.org/2026/01/26/forecasting-competitions-datasets-in-python/">Forecasting Competitions Datasets in Python</a> first appeared on <a href="https://openforecast.org">Open Forecasting</a>.</p>
]]></description>
										<content:encoded><![CDATA[<p>Here is one small, unexpected piece of news: I now have my first package on PyPI! It’s called <a href="https://pypi.org/project/fcompdata/">fcompdata</a>, and let me tell you a little bit about it.</p>
<p>When I test my functions in R, I usually use the M1, M3, and tourism competition datasets because they are diverse enough, containing seasonal, non-seasonal, trended, and non-trended time series of different frequencies (yearly, quarterly, monthly). The total number of these series is 5,315, which is large enough but not too heavy for my PC. So, when I run something on those datasets, it becomes like a stress test for the forecasting approach, and I can see where it fails and how it can be improved. I consider this type of test a toy experiment — something to do before applying anything to real-world data.</p>
<p>In R, there are the Mcomp and Tcomp packages that contain these datasets, and I like how they are organised. You can do something like this:</p>
<pre class="decode">series <- Mcomp::M3[[2568]]                    # one series from the M3 dataset
ourModel <- adam(series$x)                     # fit a model to the training part
ourForecast <- forecast(ourModel, h=series$h)  # forecast over the set horizon
ourError <- series$xx - ourForecast$mean       # compare with the holdout values</pre>
<p>Each series from the dataset contains all the necessary attributes to run the experiment without trouble. This is easy and straightforward. Plus, I don’t need to download or organise any data — I just use the installed package.</p>
<p>When I started vibe coding in Python, I realised that I missed this functionality. So, with the help of Claude AI, I created a Python script to download the data from the Monash repository and organise it the way I liked. But then I realised two things, which motivated me to package it:</p>
<ol>
<li>I needed to drag this script with me to every project I worked on. It would be much easier to just run "pip install fcompdata" and forget about everything else.</li>
<li>Some series in the Monash repository differ from those in the R package.</li>
</ol>
<p>Wait, what?! Really?</p>
<p>Yes. The difference is tiny — it’s a matter of rounding. For example, series N350 from the M1 competition data (T169 from the quarterly data subset) has three digits in the R package and only two if downloaded from the Monash repository (Zenodo website).</p>
<p>Who cares?! It's just one digit difference, right?</p>
<p>Well, if you want to reproduce results across different languages, this tiny difference might become your nightmare. So, I care (and probably nobody else in the world), and I decided to create a proper Python package. You can now do this in Python and relax:</p>
<pre class="decode"># in the terminal:
pip install fcompdata

# then in Python:
from fcompdata import M1, M3, Tourism
series = M3[2568]</pre>
<p>The "series" object is now an instance of the MCompSeries class that has the same attributes as in R: series.x, series.h, series.xx, etc.</p>
<p>As simple as that!</p>
<p>One more thing: I’ve added support for the M4 competition data, which — when imported — will be downloaded and formatted properly. The dataset is large (100k time series), and I personally don’t like it. I even wrote <a href="https://openforecast.org/2020/03/01/m-competitions-from-m4-to-m5-reservations-and-expectations/">a post about it back in 2020</a>. But if I want the package to be useful to a wider audience, I shouldn’t impose my personal preferences — you should decide for yourselves whether to use it or not.</p>
<p>P.S. Submitting to PyPI gave me a good understanding of the submission process for Python and why it can be such a mess. My package was published just a few seconds after submission — nobody looked at it, nobody ran any tests. CRAN does a variety of checks to ensure you don’t submit garbage. PyPI doesn’t care. So, I’ve gained more respect for CRAN after submitting this package to PyPI.</p>
<p>Message <a href="https://openforecast.org/2026/01/26/forecasting-competitions-datasets-in-python/">Forecasting Competitions Datasets in Python</a> first appeared on <a href="https://openforecast.org">Open Forecasting</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://openforecast.org/2026/01/26/forecasting-competitions-datasets-in-python/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>Risky business: how to select your model based on risk preferences</title>
		<link>https://openforecast.org/2026/01/19/risky-business-how-to-select-your-model-based-on-risk-preferences/</link>
					<comments>https://openforecast.org/2026/01/19/risky-business-how-to-select-your-model-based-on-risk-preferences/#respond</comments>
		
		<dc:creator><![CDATA[Ivan Svetunkov]]></dc:creator>
		<pubDate>Mon, 19 Jan 2026 11:28:04 +0000</pubDate>
				<category><![CDATA[Applied forecasting]]></category>
		<category><![CDATA[Papers]]></category>
		<category><![CDATA[Social media]]></category>
		<category><![CDATA[Theory of forecasting]]></category>
		<category><![CDATA[error measures]]></category>
		<category><![CDATA[extrapolation methods]]></category>
		<category><![CDATA[Information criteria]]></category>
		<category><![CDATA[model combination]]></category>
		<category><![CDATA[model selection]]></category>
		<category><![CDATA[papers]]></category>
		<category><![CDATA[theory]]></category>
		<guid isPermaLink="false">https://openforecast.org/?p=3950</guid>

					<description><![CDATA[<p>What do you use for model selection? Do you select the best model based on its cross-validated performance, or do you use in-sample measures like AIC? If so, there is a way to improve your selection process further. JORS recently published the paper of Nikos Kourentzes and I based on a simple but powerful idea: [&#8230;]</p>
<p>Message <a href="https://openforecast.org/2026/01/19/risky-business-how-to-select-your-model-based-on-risk-preferences/">Risky business: how to select your model based on risk preferences</a> first appeared on <a href="https://openforecast.org">Open Forecasting</a>.</p>
]]></description>
										<content:encoded><![CDATA[<p>What do you use for model selection? Do you select the best model based on its cross-validated performance, or do you use in-sample measures like AIC? If so, there is a way to improve your selection process further.</p>
<p>JORS recently published a paper by Nikos Kourentzes and me based on a simple but powerful idea: instead of using summary statistics (like the mean RMSE of cross-validated errors), you should consider the entire distribution and choose a specific quantile. This aligns with <a href="https://openforecast.org/2024/03/27/what-does-lower-error-measure-really-mean/">my previous post on error measures</a>, but here is the core intuition:</p>
<p>The distribution of error measures is almost always asymmetric. If you only look at the average, you end up with a &#8220;mean temperature in the hospital&#8221; statistic, which doesn&#8217;t reflect how models actually behave. Some models perform great on most series but fail miserably on a few.</p>
<p>What can we do in this case? We can look at quantiles of the distribution.</p>
<p>For example, if we use the 84th quantile, we compare the models based on their &#8220;bad&#8221; performance &#8211; situations where they fail and produce less accurate forecasts. If you choose the best performing model there, you will end up with something that does not fail as badly. So your preferences for the model become risk-averse in this situation.</p>
<p>If you focus on a lower quantile (e.g. the 16th), you are looking at how models do on the well-behaved series and ignoring how they do on the difficult ones. So, your model selection preferences can be described as risk-tolerant, because you accept that the best performing model might fail on a difficult time series.</p>
<p>Furthermore, the median (50th quantile, the middle of the sample) corresponds to the risk-neutral situation, because it ignores the tails of the distribution.</p>
<p>What about the mean? This is a risk-agnostic strategy: it says nothing specific about the performance on either the difficult or the easy time series &#8211; it takes everything into account and nothing in particular, hiding the true risk profile.</p>
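<p>To make this concrete, here is a minimal Python sketch (with made-up error values, not data from the paper) of selecting a model by a quantile of its cross-validated error distribution rather than by the mean:</p>

```python
import numpy as np

# Toy cross-validated RMSE values: model A is more accurate on most series
# but fails badly on one of them; model B is consistent everywhere.
errors = {
    "A": np.array([1.0, 1.0, 1.0, 10.0]),
    "B": np.array([2.0, 2.0, 2.0, 2.0]),
}

def select_model(errors, risk="averse"):
    """Select a model by a quantile of its error distribution:
    'averse' -> 84th percentile, 'neutral' -> median, 'tolerant' -> 16th."""
    q = {"averse": 0.84, "neutral": 0.50, "tolerant": 0.16}[risk]
    scores = {name: np.quantile(vals, q) for name, vals in errors.items()}
    return min(scores, key=scores.get)

print(select_model(errors, "averse"))    # B: the consistent model wins
print(select_model(errors, "tolerant"))  # A: the usually-better model wins
```

Note how the choice flips with the risk preference, even though the underlying errors are the same.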
<p>So what?</p>
<p>In the paper, we show that using a risk-averse strategy tends to improve overall forecasting accuracy in day-to-day situations. Conversely, a risk-tolerant strategy can be beneficial when disruptions are anticipated, as standard models are likely to fail anyway.</p>
<p>So, next time you select a model, think about the measure you are using. If it’s just the mean RMSE, keep in mind that you might be ignoring the inherent risks of that selection.</p>
<p>P.S. While the discussion above applies to the distribution of error measures, our paper specifically focused on point AIC (in-sample performance). But it is a distance measure as well, so the logic explained above holds.</p>
<p>P.P.S. Nikos wrote a <a href="https://www.linkedin.com/posts/nikos-kourentzes-3660515_forecasting-datascience-analytics-activity-7414687127269007360-pLAh">post about this paper here</a>.</p>
<p>P.P.P.S. And here is <a href="https://github.com/trnnick/working_papers/blob/fd1973624e97fc755a9c2401f05c78b056780e34/Kourentzes_2026_Incorporating%20risk%20preferences%20in%20forecast%20selectionk.pdf">the link to the paper</a>.</p>
<p>Message <a href="https://openforecast.org/2026/01/19/risky-business-how-to-select-your-model-based-on-risk-preferences/">Risky business: how to select your model based on risk preferences</a> first appeared on <a href="https://openforecast.org">Open Forecasting</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://openforecast.org/2026/01/19/risky-business-how-to-select-your-model-based-on-risk-preferences/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>Teaching Statistics and Descriptive Analytics in the world of AI</title>
		<link>https://openforecast.org/2026/01/07/teaching-statistics-and-descriptive-analytics-in-the-world-of-ai/</link>
					<comments>https://openforecast.org/2026/01/07/teaching-statistics-and-descriptive-analytics-in-the-world-of-ai/#respond</comments>
		
		<dc:creator><![CDATA[Ivan Svetunkov]]></dc:creator>
		<pubDate>Wed, 07 Jan 2026 17:32:28 +0000</pubDate>
				<category><![CDATA[Analytics]]></category>
		<category><![CDATA[Statistics]]></category>
		<category><![CDATA[statistics]]></category>
		<category><![CDATA[teaching]]></category>
		<guid isPermaLink="false">https://openforecast.org/?p=3946</guid>

					<description><![CDATA[<p>Teaching statistics as a flipped classroom with the help of AI? You heard that right! That’s exactly what I tried this year &#8211; and here are the results. Attached to this post is the student evaluation score for the module. Yes, the number of responses is quite low (only 50% of the cohort), but it [&#8230;]</p>
<p>Message <a href="https://openforecast.org/2026/01/07/teaching-statistics-and-descriptive-analytics-in-the-world-of-ai/">Teaching Statistics and Descriptive Analytics in the world of AI</a> first appeared on <a href="https://openforecast.org">Open Forecasting</a>.</p>
]]></description>
										<content:encoded><![CDATA[<p>Teaching statistics as a flipped classroom with the help of AI? You heard that right! That’s exactly what I tried this year &#8211; and here are the results.</p>
<p>Attached to this post is the student evaluation score for the module. Yes, the number of responses is quite low (only 50% of the cohort), but it should still give a sense of how students perceived Statistics and Descriptive Analytics. Of course, this reflects only their impression &#8211; coursework submissions are yet to come &#8211; but it’s still an encouraging sign that some things worked well.</p>
<p>I’ve taught this module since 2018, first with Dave Worthington and later with Alisa Yusupova. Normally, I focused on the second half, covering regression through lectures and workshops. But this year, I took on the full module and realised I didn’t want to teach probability theory and statistics in the traditional way &#8211; long monologues in lectures followed by awkward silence in workshops. That format, I believe, no longer works. After all, students can always ask their favourite LLM to explain concepts they don&#8217;t understand. And some don&#8217;t even do that &#8211; they just ask to solve problems without understanding them. So, what can be done in this brave new world?</p>
<p>I don’t yet have a definitive answer &#8211; only the results of an experiment.</p>
<p>Lectures. This year, I used Google NotebookLM to prepare lecture materials. I provided my existing notes, slides, and relevant texts, then asked it to produce podcasts on specific topics. This took more time than expected, as I had to review the generated content, adjust prompts, and refine focus areas, listening to the podcasts over and over again. Once ready, I uploaded the materials to Moodle and asked students to listen beforehand. In class, we skipped formal lectures and instead had whiteboard &#038; marker discussions. I asked questions, showed derivations, and encouraged debate. With a class of 26 students, it was possible to create much more interaction than in previous years.</p>
<p>Workshops. We still had problem-solving sessions, but I allowed (actually encouraged) students to use LLMs to solve tasks and explain why the solutions were correct. The aim was to emphasise reasoning and assumptions over simply obtaining the right number. This worked with mixed success, and I still need to think about how it can be improved further.</p>
<p>Did it work overall?</p>
<p>I&#8217;m not entirely sure. Not all students engaged with the materials in advance, but those who did seemed to benefit and appreciated the approach. What I do know is that the &#8220;two-hour monologue while everyone tries not to fall asleep&#8221; format does not work any more. For universities, and for the (very!) expensive UK education, to remain relevant, we must innovate and rethink how we teach.</p>
<p>What would you change if you were teaching a technical subject at university in the era of AI? I’d love to hear your ideas.</p>
<p>Message <a href="https://openforecast.org/2026/01/07/teaching-statistics-and-descriptive-analytics-in-the-world-of-ai/">Teaching Statistics and Descriptive Analytics in the world of AI</a> first appeared on <a href="https://openforecast.org">Open Forecasting</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://openforecast.org/2026/01/07/teaching-statistics-and-descriptive-analytics-in-the-world-of-ai/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>AID paper rejected from the IJPR</title>
		<link>https://openforecast.org/2025/11/14/aid-paper-rejected-from-the-ijpr/</link>
					<comments>https://openforecast.org/2025/11/14/aid-paper-rejected-from-the-ijpr/#respond</comments>
		
		<dc:creator><![CDATA[Ivan Svetunkov]]></dc:creator>
		<pubDate>Fri, 14 Nov 2025 11:12:16 +0000</pubDate>
				<category><![CDATA[Papers]]></category>
		<category><![CDATA[Social media]]></category>
		<category><![CDATA[intermittent demand]]></category>
		<category><![CDATA[papers]]></category>
		<guid isPermaLink="false">https://openforecast.org/?p=3941</guid>

					<description><![CDATA[<p>So, our paper with Anna Sroginis got rejected from a special issue of the International Journal of Production Research after a second round of revision. And here is what I think about this! First things first, why am I writing this post? I want to share failures with the community, because I am tired of [&#8230;]</p>
<p>Message <a href="https://openforecast.org/2025/11/14/aid-paper-rejected-from-the-ijpr/">AID paper rejected from the IJPR</a> first appeared on <a href="https://openforecast.org">Open Forecasting</a>.</p>
]]></description>
										<content:encoded><![CDATA[<p>So, <a href="/2025/04/11/svetunkov-sroginis-2025-model-based-demand-classification/">our paper with Anna Sroginis</a> got rejected from a special issue of the International Journal of Production Research after a second round of revision. And here is what I think about this!</p>
<p>First things first, <strong>why am I writing this post</strong>? I want to share failures with the community, because I am tired of all the success stories. It is okay not to win, and this happens much more often than it seems.</p>
<p>Now, about the paper. In the first round of revisions, four reviewers looked at it and provided their comments. We expanded the paper accordingly, making it now 46 pages long (ouch!). We introduced inventory simulations and showed how using some basic principles improves forecasting accuracy and can lead to a reduction in inventory costs.</p>
<p>In the second round, the AE added one more reviewer. After careful consideration, two of the reviewers recommended major revisions, while the other two suggested a strong rejection, claiming that the paper does not make new and significant contributions to the production research literature.</p>
<p>Obviously, I disagree with this evaluation. Based on the reviewers’ comments, I have a feeling they didn’t read the paper in full (their main concerns relate to Section 3, and some of these could have been resolved if they had reached Section 5). But this probably also means that the paper in its current state is too big and needs to be rewritten to become more focused. Maybe this is what confused the reviewers.</p>
<p>So, what&#8217;s next?</p>
<p>We will amend it to address the reviewers’ comments, shorten it a bit to make it more focused, and then submit to another OR-related journal.</p>
<p>And while we are doing that, I have <a href="https://arxiv.org/abs/2504.05894v2">updated the arXiv version of the paper</a> to show what we did after the first round, and here is a brief summary of the main findings:</p>
<ul>
<li>Using a stockout dummy variable and capturing the level of data correctly (removing the effect of stockouts) improves the accuracy of forecasting approaches;</li>
<li>Stockout detection should be done for both the training and the test sets. If series with stockouts are not removed from the test set, the forecasts will be evaluated incorrectly;</li>
<li>Splitting the demand into demand sizes and demand occurrence, producing forecasts for each of the parts and then combining the results substantially improves the accuracy;</li>
<li>Using the feature for regular/intermittent demand improves the forecasting accuracy, but does not seem to impact the inventory performance. Note that this separation is straightforward in AID: if after removing the stockouts, there are some zeroes left, the demand is identified as intermittent;</li>
<li>The further split into smooth/lumpy leads to slight improvements in terms of accuracy, without a substantial impact on the inventory;</li>
<li>The split into count/fractional demand does not bring value in terms of forecasting accuracy or inventory performance.</li>
</ul>
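<p>The size/occurrence split from the third bullet can be sketched in a few lines of Python. This is a naive illustration with toy data and simple mean forecasts, not the actual approach from the paper:</p>

```python
import numpy as np

# Toy intermittent demand history
demand = np.array([0, 3, 0, 0, 5, 4, 0, 6], dtype=float)

# Split into occurrence (was there demand at all?) and sizes (non-zero values)
occurrence = (demand > 0).astype(float)
sizes = demand[demand > 0]

# Forecast each part separately (here: naive in-sample means), then recombine
p_hat = occurrence.mean()   # estimated probability of demand occurring
z_hat = sizes.mean()        # estimated demand size, given it occurs
combined = p_hat * z_hat    # expected demand per period
print(combined)             # 0.5 * 4.5 = 2.25
```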
<p>Message <a href="https://openforecast.org/2025/11/14/aid-paper-rejected-from-the-ijpr/">AID paper rejected from the IJPR</a> first appeared on <a href="https://openforecast.org">Open Forecasting</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://openforecast.org/2025/11/14/aid-paper-rejected-from-the-ijpr/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>Several crucial steps in demand forecasting with ML</title>
		<link>https://openforecast.org/2025/10/08/several-crucial-steps-in-demand-forecasting-with-ml/</link>
					<comments>https://openforecast.org/2025/10/08/several-crucial-steps-in-demand-forecasting-with-ml/#respond</comments>
		
		<dc:creator><![CDATA[Ivan Svetunkov]]></dc:creator>
		<pubDate>Wed, 08 Oct 2025 10:32:41 +0000</pubDate>
				<category><![CDATA[Social media]]></category>
		<category><![CDATA[Theory of forecasting]]></category>
		<guid isPermaLink="false">https://openforecast.org/?p=3935</guid>

					<description><![CDATA[<p>When I read posts written by some ML experts, I sometimes notice that they either overlook or do not clearly explain a few crucial steps in demand forecasting. In this post, I want to highlight the three most important ones based on my personal experience. First and foremost, stationarity (see the proper definition of the [&#8230;]</p>
<p>Message <a href="https://openforecast.org/2025/10/08/several-crucial-steps-in-demand-forecasting-with-ml/">Several crucial steps in demand forecasting with ML</a> first appeared on <a href="https://openforecast.org">Open Forecasting</a>.</p>
]]></description>
										<content:encoded><![CDATA[<p>When I read posts written by some ML experts, I sometimes notice that they either overlook or do not clearly explain a few crucial steps in demand forecasting. In this post, I want to highlight the three most important ones based on my personal experience.</p>
<p>First and foremost, <strong>stationarity</strong> (see the proper definition of the term <a href="/adam/ARIMAIntro.html">here</a>). Time series often exhibit either a changing level or clear trends. While some models (such as ETS) take care of that directly via specific components, others need some preliminary steps before being applied. The simplest thing one can do is to take differences of the original data if it shows any form of non-stationarity and model the rates of change instead of just demand. For ML, this is extremely important because typical approaches (such as decision trees, k-NN, neural networks) cannot extrapolate. So, getting rid of the trend and/or ensuring that the level does not change over time will help ML approaches do the job they are supposed to do. And don&#8217;t use the global trend (see image in the post), as time series rarely exhibit a constant increase or decrease. In real life, the trend is usually stochastic, implying that the average sales change at a varying rate, not a fixed one.</p>
<p>Second, real time series often exhibit <strong>heteroscedasticity</strong> (i.e. the variance of the data increases with the level). The simplest way to stabilise the variance is to take logarithms of the data. This is not a universal solution, but it works in many cases. This way, the error in the model should have a constant variance. A more advanced approach is to use the Box-Cox transformation, but this requires estimating the parameter lambda, which is not always straightforward. The main issue with this arises when working with intermittent demand, where some unpredictable zeroes occur (the logarithm of zero equals -infinity). In that case, you might want to take logarithms of demand sizes (non-zero values) instead of the demand itself and switch to a mixture model. Another simple but inelegant trick is to add one to every observation and then take logarithms. This works but breaks my heart.</p>
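<p>Both options for intermittent demand mentioned above (logarithms of the non-zero demand sizes, or the heart-breaking "add one" trick) fit into a few lines (toy data):</p>

```python
import numpy as np

demand = np.array([12.0, 0.0, 30.0, 0.0, 55.0])

# Plain logarithms fail on the zeroes of intermittent demand: log(0) = -inf.
# Option 1: take logs of the non-zero demand sizes only (mixture-model route)
sizes_log = np.log(demand[demand > 0])

# Option 2: the "add one" trick, log(y + 1), with its exact inverse
log1p = np.log1p(demand)
back = np.expm1(log1p)   # recovers the original demand values
```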
<p>Third, <strong>seasonality</strong> is extremely important in time series. There are many ways one can capture it. The simplest is by introducing dummy variables, but this might cause issues because, in reality, seasonality often changes over time (see <a href="/2025/09/22/evolving-seasonality/">this recent post</a>). So, a better way of capturing it is either by extracting the seasonal component from STL or ETS and using it as a feature or by using the lagged values of your data. Depending on the specific situation and approach, there can be many other ways of capturing seasonality, and frankly, I struggle to come up with a universal one.</p>
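<p>The lagged-values option can be sketched as a small feature builder (a hypothetical helper, shown here with quarterly data, so m = 4):</p>

```python
import numpy as np

def seasonal_lag_features(y, m, n_lags=1):
    """Build y[t-m], y[t-2m], ... as seasonal features for an ML model."""
    rows = [[y[t - k * m] for k in range(1, n_lags + 1)]
            for t in range(n_lags * m, len(y))]
    return np.array(rows)

# Three years of quarterly data with a recurring within-year pattern
y = np.array([10., 20., 30., 40., 12., 22., 33., 41., 11., 24., 31., 44.])
X = seasonal_lag_features(y, m=4)   # value of the same quarter one year ago
target = y[4:]                      # targets aligned with the feature rows
```

Unlike fixed seasonal dummies, the lagged values follow the seasonality as it evolves over time.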
<p>Bonus: if you work with intermittent demand, splitting it into demand sizes and demand occurrence might boost the accuracy of your approach if the underlying levels are captured correctly. Anna Sroginis and I show this improvement for LightGBM, regression, and simple local level approaches <a href="/2025/04/11/svetunkov-sroginis-2025-model-based-demand-classification/">in our paper</a> (currently under revision).</p>
<p>P.S. Kandrika and I will deliver a course on &#8220;Demand Forecasting Principles with examples in R&#8221; in November, where we will discuss these and other important aspects of demand forecasting. You can still sign up for it <a href="https://www.lancaster.ac.uk/centre-for-marketing-analytics-and-forecasting/grow-with-us/demand-forecasting-with-r/">here</a>.</p>
<p>Message <a href="https://openforecast.org/2025/10/08/several-crucial-steps-in-demand-forecasting-with-ml/">Several crucial steps in demand forecasting with ML</a> first appeared on <a href="https://openforecast.org">Open Forecasting</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://openforecast.org/2025/10/08/several-crucial-steps-in-demand-forecasting-with-ml/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
	</channel>
</rss>
