Allow controlling the random number generator for RTrees training#16251
opencv-pushbot merged 1 commit into opencv:3.4 from pwuertz:rtrees_set_rng
Conversation
Ok, so the way …
Please follow the existing API. Pass … Also, the ABI checker seems broken / misconfigured ...
modules/ml/src/rtrees.cpp (outdated)

```cpp
vector<float> varImportance;
vector<int> allVars, activeVars;
RNG rng;
RNG initRng = RNG((uint64)-1);
```
What about this: alalek@pr16251_r ?
Let's re-use the global theRNG() (it also allows saving / restoring the RNG state):

```cpp
cv::theRNG() = RNG(<your train seed>);
// ... call train() ...
```
Personally, I'd be ok with this. But it would break the default behavior of RTrees.
People might be relying on deterministic RTrees results.
Honestly, I was actually surprised that OpenCV does this ^^.
Applied the suggested change. The method now uses RNG instances instead of RNG states. This makes the method inaccessible from Python though :/. Also, I think this sort of hides the true intent of the function: setting a fixed, specific seed for deterministic training.
I think the current idea of having a custom, per-algorithm RNG is very handy. It makes training reproducible by default and independent from other parts of your application. With a per-algorithm seed-setter the current behavior is maintained, yet it enables use cases where (deterministic) randomness is required (batched or continuous training).
I think your solution (shared RNG) looks way better. You just have to decide whether dropping the deterministic default behavior is acceptable. It shouldn't be a massive shock to anyone; random forests are called "random" for a reason ;).
Thank you! Looks good to me. This patch should go into the 3.4 branch first. We will merge changes from 3.4 into master regularly (weekly / bi-weekly). So, please:
Note: no need to re-open the PR, apply the changes "in place".
Thanks for your help!
Currently the random number generator for training cv::ml::RTrees is seeded with a hardcoded value, which ensures that every forest trained with the same data yields identical results. This PR adds a method for modifying the training seed per instance, in order to allow training new or additional solutions if desired.
Ultimately, this enables parallelized training / evaluation via forest subdivision, i.e. using N threads with M trees each instead of training a single N*M-wide forest. There are other benefits to this as well, like determining the variance of a model for a given set of hyperparameters.
Update: PR modified to make use of the global theRNG() for RTrees training, making random forests random by default. The previous deterministic behavior is achieved by calling cv::setRNGSeed before training.