matstat: Integer mathematical statistics library#8733
matstat: Integer mathematical statistics library#8733jnohlgard merged 2 commits intoRIOT-OS:masterfrom
Conversation
|
I didn't know who to assign, @kYc0o could you reassign to someone you think is suitable for reviewing this PR? |
|
I can take a look but I don't feel super competent on this matters. Maybe @cladmi ? BTW, we somehow are getting into a policy of not assigning people, but instead assign ourselves (as maintainers) to PRs which we feel we can review. Of course we can still assign someone if we know the person would agree. |
|
Yeah, preferably for the review it would be someone with a little experience in numeric algorithms and mathematical statistics, to validate the overall design. |
sys/include/matstat.h
Outdated
| * | ||
| * @return sample variance | ||
| */ | ||
| uint64_t matstat_variance(const matstat_state_t *state, int32_t mean); |
There was a problem hiding this comment.
Why has the mean to be provided? We are able to calculate it from state so why not do it in the function?
There was a problem hiding this comment.
Saves an expensive 64 bit division if you need both mean and variance
There was a problem hiding this comment.
A change of algorithm to improve numerical robustness required that the mean be computed for every value added, so I removed the mean function argument
|
I found a specific failure case which causes the variance to explode, working on a unit test case for that particular issue |
|
Added workaround for a specific truncation error which yielded negative variance in certain situations. A precondition for the fault to occur is that the real variance of the input values is very small, and that a bigger absolute value comes before a smaller value. Also added a test vector for this particular failure mode to catch any future regressions. |
cladmi
left a comment
There was a problem hiding this comment.
A few suggestions inline, non blocking.
I understand what a variance is but I would like some reference for the algorithm using offset to review properly. The same reference put in the implementation would be great too.
| /* Assuming the differences will be small on average and the number of | ||
| * samples is reasonably limited, to prevent overflow in sum_sq */ | ||
| state->sum_sq += (int64_t)value * value; | ||
| ++state->count; |
There was a problem hiding this comment.
I would prefer ++state->count just after the state->sum += value.
This way, the first part of the function handles the min/max/mean and the second part the variance and so the offset and sum_sq values.
sys/matstat/matstat.c
Outdated
| int32_t matstat_mean(const matstat_state_t *state) | ||
| { | ||
| if (state->count == 0) { | ||
| /* We don't have any way of returning an error */ |
There was a problem hiding this comment.
I prefer a note in the documentation than the comment.
| uint64_t matstat_variance(const matstat_state_t *state, int32_t mean) | ||
| { | ||
| if (state->count < 2) { | ||
| /* We don't have any way of returning an error */ |
There was a problem hiding this comment.
I prefer a note in the documentation than the comment.
sys/include/matstat.h
Outdated
| */ | ||
| typedef struct { | ||
| int64_t sum; /**< Sum of values added */ | ||
| uint64_t sum_sq; /**< Sum of squared values added */ |
There was a problem hiding this comment.
Same as in the functions, I would prefer sum_sq defined close to offset as they are used together.
sys/matstat/matstat.c
Outdated
| value -= state->offset; | ||
| /* Assuming the differences will be small on average and the number of | ||
| * samples is reasonably limited, to prevent overflow in sum_sq */ | ||
| state->sum_sq += (int64_t)value * value; |
There was a problem hiding this comment.
Maybe replace the comment by an assert and a note in the documentation ?
|
Changing the variance algorithm to a better method, will post back when it is done. |
|
Updated the implementation algorithm to Welford's algorithm for variance |
2313926 to
f347c05
Compare
f6fcd33 to
01a5b26
Compare
|
Looks nice! (not a bug, thus I'm untagging the release) |
|
@gebart So far the module looks great. I've been playing around a bit and noticed that it is possible to remove the |
|
The reason I decided to leave the mean in the struct was to avoid having to perform the same division |
|
Then lets leave it as it is. As you say, 64bit division is not fast, and I think you're right that it's not worth it to reduce the struct size by a |
|
But no, I can't back this with actual measurements :( |
|
there are some corner cases where the computed variance becomes negative, due to limited range of small values. The true variance is close to zero but becomes negative after a number of truncations and roundings make the fractions flip over to the negative side. This can be detected by the sum_sq variable suddenly becoming very big. |
|
@bergzand sorry, I was maybe a bit too quick when I replied before. The mean is only a 32 bit division, maybe it is worth the RAM savings to recompute on every add? I don't know. Perhaps we should revisit this after merging the basic implementation? |
Can you provide a test case for this? Is it possible to wrap the |
Oh, my suggestion is essentially something like this :) |
|
OK to squash? |
👍 |
|
For future reference, here is a test input for the negative variance tests. |
|
@gebart To get this merged, maybe add a small warning in the header about rounding because of integers and to expect weird behaviour when using it with many small valued numbers. This can be added to the headers file here Is this okay with you or do you want to hold this until you have a fix for this corned case? |
|
I think I will just add a check for underflow in the sum_sq addition |
|
and add that log file data as a unit test |
There was a problem hiding this comment.
Tests pass with iotlab-m3 and wsn430-v1_3b.
I read-reviewed the add, mean, variance and merge. I did not reviewed tests.
When reviewing the variance part.
-
I understand and agree for the update and finalize taken from https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Online_algorithm
-
For the
mergeI am almost ok, but maybe it's only me having trouble juggling with values in my head. I think it assumes thatm + n == countis similar enough tocount -1.
Our variance definition issum_sq / (count -1), and the algorithm assumessum_sq = sigma2 * count.
Non blocking suggestion: If that's the case, I would prefer to have a it more explicitely in the comment like:(using sum_sq = sigma2 * n, instead of sum_sq = sigma2 * (n-1) to simplify algorithm)
|
|
Seems like this just needs squashing. |
|
Squashed |
|
@cladmi @kaspar030 @bergzand @miri64 @kYc0o Thanks for the help getting this merged! |
matstat: Integer mathematical statistics library
Contribution description
Introduce a new library for estimating parameters of sampled random distributions. The library uses only integer operations and the algorithms only require a single pass over the input data, which reduces the memory usage to O(1).
Includes unit tests of all library functions and for some basic verification of the accuracy of the computed values.
Issues/PRs references
Required by #8531