matstat: Integer mathematical statistics library by jnohlgard · Pull Request #8733 · RIOT-OS/RIOT

jnohlgard · 2018-03-05T07:30:45Z

Contribution description

Introduce a new library for estimating parameters of sampled random distributions. The library uses only integer operations and the algorithms only require a single pass over the input data, which reduces the memory usage to O(1).

Includes unit tests of all library functions and for some basic verification of the accuracy of the computed values.

Issues/PRs references

Required by #8531

jnohlgard · 2018-03-05T07:31:56Z

I didn't know who to assign, @kYc0o could you reassign to someone you think is suitable for reviewing this PR?

kYc0o · 2018-03-07T12:11:47Z

I can take a look, but I don't feel super competent in these matters. Maybe @cladmi? BTW, we are somehow getting into a policy of not assigning people, but instead assigning ourselves (as maintainers) to PRs which we feel we can review. Of course we can still assign someone if we know the person would agree.

jnohlgard · 2018-03-07T12:38:21Z

Yeah, preferably for the review it would be someone with a little experience in numeric algorithms and mathematical statistics, to validate the overall design.

miri64 · 2018-03-07T12:42:30Z

sys/include/matstat.h

+ *
+ * @return  sample variance
+ */
+uint64_t matstat_variance(const matstat_state_t *state, int32_t mean);


Why does the mean have to be provided? We are able to calculate it from state, so why not do it in the function?

Saves an expensive 64 bit division if you need both mean and variance

A change of algorithm to improve numerical robustness required that the mean be computed for every value added, so I removed the mean function argument

jnohlgard · 2018-03-07T12:52:54Z

I found a specific failure case which causes the variance to explode, working on a unit test case for that particular issue

jnohlgard · 2018-03-07T22:48:33Z

Added workaround for a specific truncation error which yielded negative variance in certain situations. A precondition for the fault to occur is that the real variance of the input values is very small, and that a bigger absolute value comes before a smaller value. Also added a test vector for this particular failure mode to catch any future regressions.

cladmi

A few suggestions inline, non blocking.
I understand what a variance is but I would like some reference for the algorithm using offset to review properly. The same reference put in the implementation would be great too.
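
For reference, one well-known offset-based single-pass method is the "shifted data" formula described in the Wikipedia article on algorithms for calculating variance. Whether the first revision of this PR followed it exactly is an assumption; the reviewed hunks only show that `value - state->offset` is squared and accumulated into `sum_sq`. With a constant offset K (a symbol introduced here for illustration) and n samples:

```latex
s^2 = \frac{\sum_{i=1}^{n} (x_i - K)^2 \;-\; \frac{1}{n}\left(\sum_{i=1}^{n} (x_i - K)\right)^2}{n - 1}
```

In exact arithmetic the result does not depend on K, but choosing K close to the mean keeps both sums small, which limits overflow and cancellation.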

cladmi · 2018-03-08T11:13:17Z

sys/matstat/matstat.c

+    /* Assuming the differences will be small on average and the number of
+     * samples is reasonably limited, to prevent overflow in sum_sq */
+    state->sum_sq += (int64_t)value * value;
+    ++state->count;


I would prefer ++state->count just after the state->sum += value.
This way, the first part of the function handles the min/max/mean and the second part the variance and so the offset and sum_sq values.

cladmi · 2018-03-08T11:13:59Z

sys/matstat/matstat.c

+int32_t matstat_mean(const matstat_state_t *state)
+{
+    if (state->count == 0) {
+        /* We don't have any way of returning an error */


I would prefer a note in the documentation rather than the comment.

cladmi · 2018-03-08T11:14:11Z

sys/matstat/matstat.c

+uint64_t matstat_variance(const matstat_state_t *state, int32_t mean)
+{
+    if (state->count < 2) {
+        /* We don't have any way of returning an error */


I would prefer a note in the documentation rather than the comment.

cladmi · 2018-03-08T11:16:29Z

sys/include/matstat.h

+ */
+typedef struct {
+    int64_t sum;        /**< Sum of values added */
+    uint64_t sum_sq;    /**< Sum of squared values added */


Same as in the functions, I would prefer sum_sq defined close to offset as they are used together.

cladmi · 2018-03-08T11:32:17Z

sys/matstat/matstat.c

+    value -= state->offset;
+    /* Assuming the differences will be small on average and the number of
+     * samples is reasonably limited, to prevent overflow in sum_sq */
+    state->sum_sq += (int64_t)value * value;


Maybe replace the comment with an assert and a note in the documentation?

jnohlgard · 2018-03-08T15:27:10Z

Changing the variance algorithm to a better method, will post back when it is done.

jnohlgard · 2018-03-12T09:35:07Z

Updated the implementation algorithm to Welford's algorithm for variance
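
A minimal sketch of an integer-only Welford update, to make the single-pass idea concrete. The type `sketch_state_t` and the functions `sketch_add` and `sketch_variance` are hypothetical names chosen for this illustration and are not the matstat API; the sketch also assumes the deviations are small enough that the products and their running sum fit in an `int64_t`.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical state for illustration only; the field layout of the real
 * matstat_state_t may differ. */
typedef struct {
    int64_t sum;      /* sum of all samples added so far */
    int64_t sum_sq;   /* running sum of squared deviations from the mean */
    uint32_t count;   /* number of samples added so far */
    int32_t mean;     /* truncated running mean, sum / count */
} sketch_state_t;

/* Welford update: one pass, O(1) memory, integer arithmetic only. */
static void sketch_add(sketch_state_t *s, int32_t value)
{
    assert(s->count < UINT32_MAX);  /* the sample counter must not wrap */
    int32_t old_mean = s->mean;

    s->sum += value;
    ++s->count;
    /* the mean of int32_t samples always fits in an int32_t */
    s->mean = (int32_t)(s->sum / (int64_t)s->count);
    /* sum_sq += (x - old_mean) * (x - new_mean) */
    s->sum_sq += ((int64_t)value - old_mean) * ((int64_t)value - s->mean);
}

/* Sample variance, sum_sq / (count - 1); 0 if fewer than two samples or if
 * truncation has pushed sum_sq below zero. */
static uint64_t sketch_variance(const sketch_state_t *s)
{
    if (s->count < 2 || s->sum_sq < 0) {
        return 0;
    }
    return (uint64_t)s->sum_sq / (s->count - 1);
}
```

Starting from a zero-initialised state and adding the samples 2, 4, 4, 4, 5, 5, 7, 9, this sketch accumulates sum_sq = 39, where exact arithmetic gives 32, because the running mean is truncated at every step. That kind of rounding drift is what the later comments in this thread deal with.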

kaspar030 · 2018-04-11T20:49:54Z

Looks nice!

(not a bug, thus I'm untagging the release)

bergzand · 2018-04-12T12:54:35Z

@gebart So far the module looks great. I've been playing around a bit and noticed that it is possible to remove the mean variable from the matstat_t struct by reordering the variance calculations a bit, making use of the fact that the sum and the count are already stored. This does however add a division in the add function and two more divisions in the merge function. Have you noticed the same and do you have any opinion on this? I've opened jnohlgard#17 with this code

jnohlgard · 2018-04-12T13:08:46Z

The reason I decided to leave the mean in the struct was to avoid having to perform the same division sum / count twice, but maybe it could be worth it to save on RAM. Do you have any benchmarks for the CPU overhead? The 64 bit divisions are quite expensive on Cortex M and other constrained architectures.

bergzand · 2018-04-12T13:12:05Z

Then let's leave it as it is. As you say, 64-bit division is not fast, and I think you're right that it's not worth it to reduce the struct size by an int32_t.

bergzand · 2018-04-12T13:12:28Z

But no, I can't back this with actual measurements :(

jnohlgard · 2018-04-13T11:27:42Z

There are some corner cases where the computed variance becomes negative because of the limited range of small values. The true variance is close to zero, but it becomes negative after a number of truncations and roundings make the fractions flip over to the negative side. This can be detected by the sum_sq variable suddenly becoming very big.
I am considering changing the variance and sum_sq variables to signed int and letting them show a negative variance when this happens; it can simply be discarded as "close to zero" when interpreting the results. With unsigned ints the variance becomes extremely large in these cases, which messes up the printouts.
A different approach is to check for underflow and truncate the sum to zero when it becomes negative.

jnohlgard · 2018-04-13T11:44:08Z

@bergzand sorry, I was maybe a bit too quick when I replied before. The mean is only a 32 bit division, maybe it is worth the RAM savings to recompute on every add? I don't know. Perhaps we should revisit this after merging the basic implementation?

bergzand · 2018-04-13T13:25:14Z

> There are some corner cases where the computed variance becomes negative, due to limited range of small values. The true variance is close to zero but becomes negative after a number of truncations and roundings make the fractions flip over to the negative side. This can be detected by the sum_sq variable suddenly becoming very big.

Can you provide a test case for this?

Is it possible to wrap the sum_sq addition in an absolute value operation, for example
state->sum_sq = abs(sum_sq + (value - state->mean) * (value - new_mean)); or does this screw too much with the results?

bergzand · 2018-04-13T13:26:34Z

> A different approach is to check for underflow and truncate the sum to zero when it becomes negative.

Oh, my suggestion is essentially something like this :)

jnohlgard · 2018-04-13T13:31:43Z

OK to squash?

bergzand · 2018-04-13T13:32:34Z

> OK to squash?

👍

jnohlgard · 2018-04-13T13:33:51Z

For future reference, here is a test input for the negative variance tests.
Start with an empty matstat_state_t and use matstat_merge to successively merge each of the given states in order and you should end up with a negative variance like I did below.

2018-03-22 10:44:07,610 - INFO #    interval    count       sum       sum_sq    min   max  mean  variance
2018-03-22 10:44:07,612 - INFO #   16 -   17:    2686      5414         1380      1     3     2      0
2018-03-22 10:44:07,615 - INFO #   18 -   19:    2643      5272         3263      1     3     1      1
2018-03-22 10:44:07,627 - INFO #   20 -   23:    2650      5328          719      1     3     2      0
2018-03-22 10:44:07,629 - INFO #   24 -   31:    2562      5117         2756      1     3     1      1
2018-03-22 10:44:07,641 - INFO #   32 -   47:    2579      5157          635      1     3     1      0
2018-03-22 10:44:07,644 - INFO #   48 -   79:    2533      5050         2944      1     3     1      1
2018-03-22 10:44:07,646 - INFO #   80 -  143:    2630      5276         1078      1     3     2      0
2018-03-22 10:44:07,658 - INFO #  144 -  271:    2667      5333          974      1     3     1      0
2018-03-22 10:44:07,661 - INFO #  272 -  527:    2414      4859         1074      1     3     2      0
2018-03-22 10:44:07,673 - INFO #       TOTAL    23364     46806       -11106      1     3     2 789570863061659  <=== SIC!

2018-03-22 10:58:57,213 - INFO #    interval    count       sum       sum_sq    min   max  mean  variance
2018-03-22 10:58:57,215 - INFO #   16 -   17:   55988    294157        63150      3     7     5      1
2018-03-22 10:58:57,227 - INFO #   18 -   19:   55785    274856        99726      3     6     4      1
2018-03-22 10:58:57,230 - INFO #   20 -   23:   55833    237601        25047      3     6     4      0
2018-03-22 10:58:57,233 - INFO #   24 -   31:   55919    216642        48885      3     4     3      0
2018-03-22 10:58:57,244 - INFO #   32 -   47:   56151    217615        49162      3     4     3      0
2018-03-22 10:58:57,248 - INFO #   48 -   79:   55743    216097        48857      3     4     3      0
2018-03-22 10:58:57,251 - INFO #   80 -  143:   56212    217884        49247      3     4     3      0
2018-03-22 10:58:57,261 - INFO #  144 -  271:   56090    217362        49088      3     4     3      0
2018-03-22 10:58:57,265 - INFO #  272 -  527:   52853    204947        46381      3     4     3      0
2018-03-22 10:58:57,276 - INFO #       TOTAL   500574   2097161      -516847      3     7     4 36851256607346  <=== SIC!
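
The huge TOTAL variances above are consistent with a negative sum_sq being converted to an unsigned 64-bit value and then divided by count - 1. The sketch below reproduces that arithmetic; `wrapped_variance` is a hypothetical helper written for this check, and the assumption that the library computed the value this way is inferred from the numbers, not stated in the thread.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: divide a signed sum of squares by (count - 1) after
 * converting it to unsigned, as an unsigned 64-bit variance routine would. */
static uint64_t wrapped_variance(int64_t sum_sq, uint32_t count)
{
    assert(count >= 2);
    /* a negative sum_sq converts to 2^64 + sum_sq */
    return (uint64_t)sum_sq / (count - 1);
}
```

wrapped_variance(-11106, 23364) evaluates to 789570863061659 and wrapped_variance(-516847, 500574) to 36851256607346, matching the two TOTAL rows.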

bergzand · 2018-04-13T13:49:09Z

@gebart To get this merged, maybe add a small warning in the header about rounding caused by the integer arithmetic, and to expect weird behaviour when using it with many small-valued numbers. This can be added to the header file.

Is this okay with you, or do you want to hold this until you have a fix for this corner case?

jnohlgard · 2018-04-16T12:08:16Z

I think I will just add a check for underflow in the sum_sq addition
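
One way such an underflow check could look, as a sketch only: `sum_sq_add_clamped` is a hypothetical helper, and the actual fix in the PR may be structured differently. It applies a signed increment to an unsigned accumulator and clamps at zero instead of wrapping.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: apply a signed increment to an unsigned accumulator,
 * clamping at zero instead of wrapping around. */
static void sum_sq_add_clamped(uint64_t *sum_sq, int64_t delta)
{
    assert(sum_sq != NULL);
    if (delta < 0) {
        /* magnitude of delta, computed without overflowing on INT64_MIN */
        uint64_t mag = (uint64_t)(-(delta + 1)) + 1u;
        *sum_sq = (mag > *sum_sq) ? 0 : *sum_sq - mag;
    }
    else {
        /* overflow on the positive side is not handled in this sketch */
        *sum_sq += (uint64_t)delta;
    }
}
```

Compared with taking the absolute value, clamping treats a slightly negative sum as "close to zero" instead of turning it into a small positive spread.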

jnohlgard · 2018-04-16T12:08:37Z

and add that log file data as a unit test

cladmi

Tests pass with iotlab-m3 and wsn430-v1_3b.

I read through add, mean, variance and merge. I did not review the tests.

Regarding the variance part:

I understand and agree with the update and finalize steps taken from https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Online_algorithm
For the merge I am almost OK, but maybe it is only me having trouble juggling values in my head. I think it assumes that m + n == count is similar enough to count - 1.
Our variance definition is sum_sq / (count - 1), and the algorithm assumes sum_sq = sigma2 * count.
Non-blocking suggestion: if that is the case, I would prefer to have it stated more explicitly in the comment, like:
```
(using sum_sq = sigma2 * n, instead of sum_sq = sigma2 * (n-1) to simplify algorithm)
```
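
For comparison, the pairwise combination step from the same Wikipedia article can be written directly in terms of the sums of squared deviations, which sidesteps the n versus n - 1 question. The sketch below is an illustration under that assumption; `merge_state_t` and `sketch_merge` are hypothetical names, and the merge function in the PR may be derived differently (the thread does not show it). It assumes the intermediate product fits in 64 bits.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical state for illustration only. */
typedef struct {
    int64_t sum;      /* sum of samples */
    uint64_t sum_sq;  /* sum of squared deviations from the mean */
    uint32_t count;   /* number of samples */
} merge_state_t;

/* Combine src into dest:
 * sum_sq = sum_sq_a + sum_sq_b + delta^2 * n_a * n_b / (n_a + n_b) */
static void sketch_merge(merge_state_t *dest, const merge_state_t *src)
{
    if (src->count == 0) {
        return;             /* nothing to merge */
    }
    if (dest->count == 0) {
        *dest = *src;       /* merging into an empty state is a copy */
        return;
    }
    uint64_t n = (uint64_t)dest->count + src->count;
    assert(n <= UINT32_MAX);
    /* difference of the truncated means of the two partitions */
    int64_t delta = src->sum / (int64_t)src->count
                  - dest->sum / (int64_t)dest->count;
    uint64_t delta_sq = (uint64_t)(delta * delta);

    dest->sum_sq += src->sum_sq + delta_sq * dest->count * src->count / n;
    dest->sum += src->sum;
    dest->count = (uint32_t)n;
}
```

Merging the partitions {1, 2, 3} and {7, 8, 9}, each with sum_sq = 2, gives sum_sq = 58, which equals the sum of squared deviations of the six values around their mean of 5.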

jnohlgard · 2018-04-25T05:28:40Z

Added a regression test for negative variance after merge
Add safeguard against negative variance to fix the failing test.
Update comment on sigma2 as suggested by @cladmi
Rebased

kaspar030 · 2018-04-26T13:46:08Z

Seems like this just needs squashing.

jnohlgard · 2018-04-28T06:07:04Z

Squashed

jnohlgard · 2018-04-28T07:44:01Z

@cladmi @kaspar030 @bergzand @miri64 @kYc0o Thanks for the help getting this merged!

matstat: Integer mathematical statistics library

jnohlgard added the "Type: new feature" and "CI: ready for build" labels Mar 5, 2018

jnohlgard added this to the Release 2018.04 milestone Mar 5, 2018

jnohlgard assigned kYc0o Mar 5, 2018

jnohlgard force-pushed the pr/matstat branch from 1da61fd to b2a7e95 Compare March 7, 2018 05:56

miri64 reviewed Mar 7, 2018

jnohlgard force-pushed the pr/matstat branch from b2a7e95 to 481174e Compare March 7, 2018 22:43

cladmi reviewed Mar 8, 2018

jnohlgard force-pushed the pr/matstat branch 2 times, most recently from 2313926 to f347c05 Compare March 17, 2018 09:11

jnohlgard unassigned kYc0o Mar 23, 2018

jnohlgard force-pushed the pr/matstat branch 3 times, most recently from f6fcd33 to 01a5b26 Compare March 29, 2018 17:50

jnohlgard force-pushed the pr/matstat branch from 01a5b26 to c82c1a4 Compare April 5, 2018 10:20

jnohlgard mentioned this pull request Apr 11, 2018

tests/bench_timers: A comprehensive benchmark for periph_timer #8531

Merged

kaspar030 removed this from the Release 2018.04 milestone Apr 11, 2018

bergzand mentioned this pull request Apr 12, 2018

matstat: rework to remove mean variable jnohlgard/RIOT#17

Closed

jnohlgard force-pushed the pr/matstat branch from c82c1a4 to 8fce774 Compare April 13, 2018 13:36

cladmi approved these changes Apr 16, 2018

jnohlgard force-pushed the pr/matstat branch from 8fce774 to c4ac8cf Compare April 25, 2018 05:26

jnohlgard added this to the Release 2018.07 milestone Apr 25, 2018

Joakim Nohlgård added 2 commits April 28, 2018 08:03

sys/matstat: Integer mathematical statistics library

6927260

unittests: Add tests for matstat library

5c59f6a

jnohlgard force-pushed the pr/matstat branch from a51749f to 5c59f6a Compare April 28, 2018 06:03

jnohlgard merged commit 9c85ce1 into RIOT-OS:master Apr 28, 2018

maxvankessel pushed a commit to maxvankessel/RIOT that referenced this pull request May 8, 2018

Merge pull request RIOT-OS#8733 from gebart/pr/matstat

13966d3

matstat: Integer mathematical statistics library

jnohlgard deleted the pr/matstat branch September 24, 2018 09:37
