
Allow controlling the random number generator for RTrees training#16251

Merged

opencv-pushbot merged 1 commit into opencv:3.4 from pwuertz:rtrees_set_rng on Dec 29, 2019

Conversation

@pwuertz (Contributor) commented on Dec 28, 2019

Currently, the random number generator used for training cv::ml::RTrees is seeded with a hardcoded value, which ensures that every forest trained on the same data yields identical results.

This PR adds a method for modifying the training seed per instance, allowing new or additional solutions to be trained if desired.

Ultimately, this enables parallelized training/evaluation via forest subdivision, i.e. using N threads with M trees each instead of training a single N*M-tree forest.
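
The subdivision idea can be sketched generically in plain Python (this is not the OpenCV API; `train_subforest` and the per-thread seeding scheme are illustrative assumptions):

```python
import random
from concurrent.futures import ThreadPoolExecutor

def train_subforest(seed, n_trees):
    # Stand-in for training M trees on one thread; a per-thread seed
    # keeps each sub-forest's randomness independent and reproducible.
    rng = random.Random(seed)
    return [rng.random() for _ in range(n_trees)]

N, M = 4, 8  # N threads, M trees per thread
with ThreadPoolExecutor(max_workers=N) as pool:
    subforests = list(pool.map(lambda s: train_subforest(s, M), range(N)))

# The merged ensemble has the same N*M trees a single wide forest would.
forest = [tree for sub in subforests for tree in sub]
assert len(forest) == N * M
```

Because each sub-forest is seeded independently, the merged ensemble is reproducible even though the sub-forests are trained concurrently.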

There are other benefits to this as well, like determining the variance of a model for a given set of hyperparameters.

Update: the PR was modified to use the global theRNG() for RTrees training, making random forests random by default. The previous deterministic behavior can be restored by calling cv::setRNGSeed before training.
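
The resulting default/override behavior can be illustrated with Python's standard random module standing in for cv::theRNG(); `train_forest` is a made-up stand-in for RTrees::train(), not OpenCV API:

```python
import random

def train_forest(n_trees):
    # Hypothetical stand-in for RTrees::train(); like the patched
    # OpenCV code, it draws from the *global* RNG.
    return [random.random() for _ in range(n_trees)]

# Random by default: two training runs yield different forests.
first = train_forest(5)
second = train_forest(5)
assert first != second

# Deterministic on demand: seed the global RNG before each run,
# analogous to calling cv::setRNGSeed(42) before train().
random.seed(42)
third = train_forest(5)
random.seed(42)
fourth = train_forest(5)
assert third == fourth
```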

@pwuertz (Contributor, Author) commented on Dec 28, 2019

OK, so the way RTrees and RTreesImpl are designed, there is no way of adding anything without breaking the ABI. That is acceptable when switching from, say, 4.2 to 4.3, right?

@alalek (Member) commented on Dec 28, 2019

uint64_t seed

Please follow the existing API: pass an RNG instead of a seed. See setAnnealEnergyRNG().

Also, there is theRNG() (a thread-local object). Perhaps it makes sense to use it instead of a custom per-algorithm RNG.

ABI checker seems broken / misconfigured ...

vector<float> varImportance;
vector<int> allVars, activeVars;
RNG rng;
RNG initRng = RNG((uint64)-1);
@alalek (Member) commented:
What about this: alalek@pr16251_r ?
Let's re-use the global theRNG() (it also allows saving/restoring the RNG state).

cv::theRNG() = RNG(your train seed);
... call train() ...

@pwuertz (Contributor, Author) replied:

Personally, I'd be OK with this, but it would break the default behavior of RTrees.
People might be relying on deterministic RTrees results.
Honestly, I was actually surprised that OpenCV does this ^^.

@pwuertz (Contributor, Author) commented on Dec 28, 2019

Pass an RNG instead of a seed. See setAnnealEnergyRNG().

Applied the suggested change; the method now takes RNG instances instead of RNG states.

This makes the method inaccessible from Python, though :/.

Also, I think this somewhat hides the true intent of the function: setting a fixed, specific seed for deterministic training.

Please follow the existing API.

cv::setRNGSeed seems to be a prominent API example that uses integer seed values, and it is closely related to the use case of the function I suggested. I'd also argue that setting integer seeds for quasi-random processes is the more common API pattern.

Also, there is theRNG() (a thread-local object). Perhaps it makes sense to use it instead of a custom per-algorithm RNG.

I think the current idea of having a custom, per-algorithm RNG is very handy: it makes training reproducible by default and independent from other parts of your application.
Sharing a global or thread-local RNG for training would break this behavior.

With a per-algorithm seed setter the current behavior is maintained, yet it enables use cases where (deterministic) randomness is required (batched or continuous training).
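
The trade-off can be sketched in plain Python (the `Forest` class is a made-up illustration of the setter-based proposal, not the OpenCV API that was ultimately merged):

```python
import random

class Forest:
    # Hypothetical per-algorithm RNG with a fixed default seed,
    # mirroring the original seed-setter proposal in this PR.
    def __init__(self, seed=0):
        self._rng = random.Random(seed)  # private state, unaffected by random.seed()

    def train(self, n_trees):
        return [self._rng.random() for _ in range(n_trees)]

# Reproducible by default, regardless of global RNG usage elsewhere.
random.seed()  # perturb the global RNG; has no effect on Forest
assert Forest().train(3) == Forest().train(3)

# Callers can still opt into different solutions via the seed.
assert Forest(seed=1).train(3) != Forest(seed=2).train(3)
```

A global thread-local RNG (the approach the PR finally adopted) trades this isolation for a single, familiar control point.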

@pwuertz (Contributor, Author) commented on Dec 28, 2019

I think your solution (a shared RNG) looks way better. You just have to decide whether dropping the deterministic default behavior is acceptable.

It shouldn't be a massive shock to anyone; random forests are called "random" for a reason ;).

@alalek (Member) commented on Dec 29, 2019

Thank you! Looks good to me.


This patch should go into the 3.4 branch first. We merge changes from 3.4 into master regularly (weekly/bi-weekly).

So, please:

  • change the "base" branch of this PR from master to 3.4 (use the "Edit" button near the PR title)
  • rebase your commits from master onto the 3.4 branch. For example:
    git rebase -i --onto upstream/3.4 upstream/master
    (check the list of your commits, then save and quit: Esc + "wq" + Enter),
    where upstream is configured by following this GitHub guide and fetched (git fetch upstream)
  • push the rebased commits to the source branch of your fork (with the --force option)

Note: there is no need to re-open the PR; apply the changes in place.
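
The rebase step above can be exercised safely in a throwaway repository; here local branches named 3.4 and master stand in for the upstream remote, and all file names and commit messages are illustrative:

```shell
set -e
tmp=$(mktemp -d)
cd "$tmp"
git init -q -b master repo && cd repo   # -b needs git >= 2.28
git config user.email you@example.com
git config user.name you

echo base > file && git add file && git commit -qm "base"
git branch 3.4                                      # 3.4 diverges here
echo m1 >> file && git commit -qam "master-only"    # master moves ahead

git checkout -q -b rtrees_set_rng                   # feature branch off master
echo rng > rng.txt && git add rng.txt && git commit -qm "feature"

# Replay only the commits unique to the feature branch onto 3.4,
# dropping the master-only history; this is the non-interactive
# equivalent of "git rebase -i --onto upstream/3.4 upstream/master".
git rebase -q --onto 3.4 master rtrees_set_rng

git log --format=%s   # now shows "feature" then "base"; "master-only" is gone
```

After verifying locally, the real feature branch is force-pushed to the fork as described above.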

@pwuertz pwuertz changed the base branch from master to 3.4 December 29, 2019 21:01
@alalek (Member) left a review comment

Well done 👍

@pwuertz (Contributor, Author) commented on Dec 29, 2019

Thanks for your help!

@pwuertz pwuertz changed the title Allow setting RNG seed for RTrees training. Allow controlling the random number generator for RTrees training Dec 29, 2019
@opencv-pushbot opencv-pushbot merged commit 8aebef2 into opencv:3.4 Dec 29, 2019
@pwuertz pwuertz deleted the rtrees_set_rng branch December 29, 2019 22:49
@alalek alalek mentioned this pull request Dec 31, 2019
3 participants