Skip to content

matstat: Integer mathematical statistics library#8733

Merged
jnohlgard merged 2 commits intoRIOT-OS:masterfrom
jnohlgard:pr/matstat
Apr 28, 2018
Merged

matstat: Integer mathematical statistics library#8733
jnohlgard merged 2 commits intoRIOT-OS:masterfrom
jnohlgard:pr/matstat

Conversation

@jnohlgard
Copy link
Copy Markdown
Member

Contribution description

Introduce a new library for estimating parameters of sampled random distributions. The library uses only integer operations and the algorithms only require a single pass over the input data, which reduces the memory usage to O(1).

Includes unit tests of all library functions and for some basic verification of the accuracy of the computed values.

Issues/PRs references

Required by #8531

@jnohlgard jnohlgard added Type: new feature The issue requests / The PR implemements a new feature for RIOT CI: ready for build If set, CI server will compile all applications for all available boards for the labeled PR labels Mar 5, 2018
@jnohlgard jnohlgard added this to the Release 2018.04 milestone Mar 5, 2018
@jnohlgard
Copy link
Copy Markdown
Member Author

I didn't know who to assign, @kYc0o could you reassign to someone you think is suitable for reviewing this PR?

@kYc0o
Copy link
Copy Markdown
Contributor

kYc0o commented Mar 7, 2018

I can take a look but I don't feel super competent on this matters. Maybe @cladmi ? BTW, we somehow are getting into a policy of not assigning people, but instead assign ourselves (as maintainers) to PRs which we feel we can review. Of course we can still assign someone if we know the person would agree.

@jnohlgard
Copy link
Copy Markdown
Member Author

Yeah, preferably for the review it would be someone with a little experience in numeric algorithms and mathematical statistics, to validate the overall design.

*
* @return sample variance
*/
uint64_t matstat_variance(const matstat_state_t *state, int32_t mean);
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why has the mean to be provided? We are able to calculate it from state so why not do it in the function?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Saves an expensive 64 bit division if you need both mean and variance

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A change of algorithm to improve numerical robustness required that the mean be computed for every value added, so I removed the mean function argument

@jnohlgard
Copy link
Copy Markdown
Member Author

I found a specific failure case which causes the variance to explode, working on a unit test case for that particular issue

@jnohlgard
Copy link
Copy Markdown
Member Author

Added workaround for a specific truncation error which yielded negative variance in certain situations. A precondition for the fault to occur is that the real variance of the input values is very small, and that a bigger absolute value comes before a smaller value. Also added a test vector for this particular failure mode to catch any future regressions.

Copy link
Copy Markdown
Contributor

@cladmi cladmi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few suggestions inline, non blocking.
I understand what a variance is but I would like some reference for the algorithm using offset to review properly. The same reference put in the implementation would be great too.

/* Assuming the differences will be small on average and the number of
* samples is reasonably limited, to prevent overflow in sum_sq */
state->sum_sq += (int64_t)value * value;
++state->count;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would prefer ++state->count just after the state->sum += value.
This way, the first part of the function handles the min/max/mean and the second part the variance and so the offset and sum_sq values.

int32_t matstat_mean(const matstat_state_t *state)
{
if (state->count == 0) {
/* We don't have any way of returning an error */
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I prefer a note in the documentation than the comment.

uint64_t matstat_variance(const matstat_state_t *state, int32_t mean)
{
if (state->count < 2) {
/* We don't have any way of returning an error */
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I prefer a note in the documentation than the comment.

*/
typedef struct {
int64_t sum; /**< Sum of values added */
uint64_t sum_sq; /**< Sum of squared values added */
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as in the functions, I would prefer sum_sq defined close to offset as they are used together.

value -= state->offset;
/* Assuming the differences will be small on average and the number of
* samples is reasonably limited, to prevent overflow in sum_sq */
state->sum_sq += (int64_t)value * value;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe replace the comment by an assert and a note in the documentation ?

@jnohlgard
Copy link
Copy Markdown
Member Author

Changing the variance algorithm to a better method, will post back when it is done.

@jnohlgard
Copy link
Copy Markdown
Member Author

Updated the implementation algorithm to Welford's algorithm for variance

@kaspar030
Copy link
Copy Markdown
Contributor

Looks nice!

(not a bug, thus I'm untagging the release)

@bergzand
Copy link
Copy Markdown
Member

@gebart So far the module looks great. I've been playing around a bit and noticed that it is possible to remove the mean variable from the matstat_t struct by reordering the variance calculations a bit, making use of the fact that the sum and the count are already stored. This does however add a division in the add function and two more divisions in the merge function. Have you noticed the same and do you have any opinion on this? I've opened jnohlgard#17 with this code

@jnohlgard
Copy link
Copy Markdown
Member Author

The reason I decided to leave the mean in the struct was to avoid having to perform the same division sum / count twice, but maybe it could be worth it to save on RAM. Do you have any benchmarks for the CPU overhead? The 64 bit divisions are quite expensive on Cortex M and other constrained architectures.

@bergzand
Copy link
Copy Markdown
Member

Then lets leave it as it is. As you say, 64bit division is not fast, and I think you're right that it's not worth it to reduce the struct size by a int32_t.

@bergzand
Copy link
Copy Markdown
Member

But no, I can't back this with actual measurements :(

@jnohlgard
Copy link
Copy Markdown
Member Author

there are some corner cases where the computed variance becomes negative, due to limited range of small values. The true variance is close to zero but becomes negative after a number of truncations and roundings make the fractions flip over to the negative side. This can be detected by the sum_sq variable suddenly becoming very big.
I am considering changing the variance and sum_sq variables to signed int and letting it show negative variance when this happens, it can simply be discarded as "close to zero" when interpreting the results. It kind of messes up the printouts in many cases when the variance becomes extremely large when using unsigned ints.
A different approach is to check for underflow and truncate the sum to zero when it becomes negative.

@jnohlgard
Copy link
Copy Markdown
Member Author

@bergzand sorry, I was maybe a bit too quick when I replied before. The mean is only a 32 bit division, maybe it is worth the RAM savings to recompute on every add? I don't know. Perhaps we should revisit this after merging the basic implementation?

@bergzand
Copy link
Copy Markdown
Member

there are some corner cases where the computed variance becomes negative, due to limited range of small values. The true variance is close to zero but becomes negative after a number of truncations and roundings make the fractions flip over to the negative side. This can be detected by the sum_sq variable suddenly becoming very big.

Can you provide a test case for this?

Is it possible to wrap the sum_sq addition in an absolute operator:
state->sum_sq = abs (sum_sq + (value - state->mean) * (value - new_mean));, or does this screw too much with the results?

@bergzand
Copy link
Copy Markdown
Member

A different approach is to check for underflow and truncate the sum to zero when it becomes negative.

Oh, my suggestion is essentially something like this :)

@jnohlgard
Copy link
Copy Markdown
Member Author

OK to squash?

@bergzand
Copy link
Copy Markdown
Member

OK to squash?

👍

@jnohlgard
Copy link
Copy Markdown
Member Author

For future reference, here is a test input for the negative variance tests.
Start with an empty matstat_state_t and use matstat_merge to successively merge each of the given states in order and you should end up with a negative variance like I did below.

2018-03-22 10:44:07,610 - INFO #    interval    count       sum       sum_sq    min   max  mean  variance
2018-03-22 10:44:07,612 - INFO #   16 -   17:    2686      5414         1380      1     3     2      0
2018-03-22 10:44:07,615 - INFO #   18 -   19:    2643      5272         3263      1     3     1      1
2018-03-22 10:44:07,627 - INFO #   20 -   23:    2650      5328          719      1     3     2      0
2018-03-22 10:44:07,629 - INFO #   24 -   31:    2562      5117         2756      1     3     1      1
2018-03-22 10:44:07,641 - INFO #   32 -   47:    2579      5157          635      1     3     1      0
2018-03-22 10:44:07,644 - INFO #   48 -   79:    2533      5050         2944      1     3     1      1
2018-03-22 10:44:07,646 - INFO #   80 -  143:    2630      5276         1078      1     3     2      0
2018-03-22 10:44:07,658 - INFO #  144 -  271:    2667      5333          974      1     3     1      0
2018-03-22 10:44:07,661 - INFO #  272 -  527:    2414      4859         1074      1     3     2      0
2018-03-22 10:44:07,673 - INFO #       TOTAL    23364     46806       -11106      1     3     2 789570863061659  <=== SIC!

2018-03-22 10:58:57,213 - INFO #    interval    count       sum       sum_sq    min   max  mean  variance
2018-03-22 10:58:57,215 - INFO #   16 -   17:   55988    294157        63150      3     7     5      1
2018-03-22 10:58:57,227 - INFO #   18 -   19:   55785    274856        99726      3     6     4      1
2018-03-22 10:58:57,230 - INFO #   20 -   23:   55833    237601        25047      3     6     4      0
2018-03-22 10:58:57,233 - INFO #   24 -   31:   55919    216642        48885      3     4     3      0
2018-03-22 10:58:57,244 - INFO #   32 -   47:   56151    217615        49162      3     4     3      0
2018-03-22 10:58:57,248 - INFO #   48 -   79:   55743    216097        48857      3     4     3      0
2018-03-22 10:58:57,251 - INFO #   80 -  143:   56212    217884        49247      3     4     3      0
2018-03-22 10:58:57,261 - INFO #  144 -  271:   56090    217362        49088      3     4     3      0
2018-03-22 10:58:57,265 - INFO #  272 -  527:   52853    204947        46381      3     4     3      0
2018-03-22 10:58:57,276 - INFO #       TOTAL   500574   2097161      -516847      3     7     4 36851256607346  <=== SIC!

@bergzand
Copy link
Copy Markdown
Member

@gebart To get this merged, maybe add a small warning in the header about rounding because of integers and to expect weird behaviour when using it with many small valued numbers. This can be added to the headers file here

Is this okay with you or do you want to hold this until you have a fix for this corned case?

@jnohlgard
Copy link
Copy Markdown
Member Author

I think I will just add a check for underflow in the sum_sq addition

@jnohlgard
Copy link
Copy Markdown
Member Author

and add that log file data as a unit test

Copy link
Copy Markdown
Contributor

@cladmi cladmi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tests pass with iotlab-m3 and wsn430-v1_3b.

I read-reviewed the add, mean, variance and merge. I did not reviewed tests.

When reviewing the variance part.

  • I understand and agree for the update and finalize taken from https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Online_algorithm

  • For the merge I am almost ok, but maybe it's only me having trouble juggling with values in my head. I think it assumes that m + n == count is similar enough to count -1.
    Our variance definition is sum_sq / (count -1), and the algorithm assumes sum_sq = sigma2 * count.
    Non blocking suggestion: If that's the case, I would prefer to have a it more explicitely in the comment like:

    (using sum_sq = sigma2 * n, instead of sum_sq = sigma2 * (n-1) to simplify algorithm)
    

@jnohlgard
Copy link
Copy Markdown
Member Author

  • Added a regression test for negative variance after merge
  • Add safeguard against negative variance to fix the failing test.
  • Update comment on sigma2 as suggested by @cladmi
  • Rebased

@jnohlgard jnohlgard added this to the Release 2018.07 milestone Apr 25, 2018
@kaspar030
Copy link
Copy Markdown
Contributor

Seems like this just needs squashing.

@jnohlgard
Copy link
Copy Markdown
Member Author

Squashed

@jnohlgard jnohlgard merged commit 9c85ce1 into RIOT-OS:master Apr 28, 2018
@jnohlgard
Copy link
Copy Markdown
Member Author

@cladmi @kaspar030 @bergzand @miri64 @kYc0o Thanks for the help getting this merged!

maxvankessel pushed a commit to maxvankessel/RIOT that referenced this pull request May 8, 2018
matstat: Integer mathematical statistics library
@jnohlgard jnohlgard deleted the pr/matstat branch September 24, 2018 09:37
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

CI: ready for build If set, CI server will compile all applications for all available boards for the labeled PR Type: new feature The issue requests / The PR implemements a new feature for RIOT

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants