
Allow RandomForest* and ExtraTrees* to have a higher max_samples than 1.0 when bootstrap=True #28507

@adam2392

Description


Describe the workflow you want to enable

Currently, random/extra-trees forests can bootstrap-sample the data with max_samples in the interval (0.0, 1.0]. This enables an out-of-bag (OOB) performance estimate in forests.

However, this caps the expected fraction of unique in-bag samples at about 63%, leaving about 37% of samples for out-of-bag estimation. You should be able to raise this proportion. For instance, perhaps I want to leverage roughly 80% of my data to fit each tree and 20% to estimate OOB performance; this requires setting max_samples=1.6.

Beyond that, no paper suggests that 63% is the required cutoff for bootstrapping the samples in random/extra-trees forests. I am happy to submit a PR if the core-dev team thinks the proposed solution is simple and reasonable.

See https://stats.stackexchange.com/questions/126107/expected-proportion-of-the-sample-when-bootstrapping for a good reference and explanation.
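A quick sketch of the arithmetic behind these percentages (`expected_inbag_fraction` is a hypothetical helper, not part of scikit-learn): drawing `n * max_samples` times with replacement from `n` samples leaves each sample out-of-bag with probability roughly `e^(-max_samples)`, so the expected unique in-bag fraction is about `1 - e^(-max_samples)`.

```python
import math

def expected_inbag_fraction(max_samples):
    """Approximate expected fraction of unique samples drawn in-bag
    when making n * max_samples bootstrap draws from n samples."""
    return 1.0 - math.exp(-max_samples)

print(round(expected_inbag_fraction(1.0), 3))  # classic bootstrap, ~0.632
print(round(expected_inbag_fraction(1.6), 3))  # ~0.798, i.e. roughly 80% in-bag
```

This is why max_samples=1.6 corresponds to the ~80/20 in-bag/OOB split described above.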

Describe your proposed solution

The proposed solution is backward-compatible and adds minimal complexity to the codebase.

  1. We change

     def _get_n_samples_bootstrap(n_samples, max_samples):
         """
         Get the number of samples in a bootstrap sample.

         Parameters
         ----------
         n_samples : int
             Number of samples in the dataset.
         max_samples : int or float
             The maximum number of samples to draw from the total available:
                 - if float, this indicates a fraction of the total and should be
                   in the interval `(0.0, 1.0]`;
                 - if int, this indicates the exact number of samples;
                 - if None, this indicates the total number of samples.

         Returns
         -------
         n_samples_bootstrap : int
             The total number of samples to draw for the bootstrap sample.
         """
         if max_samples is None:
             return n_samples

         if isinstance(max_samples, Integral):
             if max_samples > n_samples:
                 msg = "`max_samples` must be <= n_samples={} but got value {}"
                 raise ValueError(msg.format(n_samples, max_samples))
             return max_samples

         if isinstance(max_samples, Real):
             return max(round(n_samples * max_samples), 1)

     to the following:
from numbers import Integral, Real

import numpy as np


def _get_n_samples_bootstrap(n_samples, max_samples):
    """
    Get the number of samples in a bootstrap sample.

    The expected total number of unique samples in a bootstrap sample is
    required to be at most ``n_samples - 1``.
    This is equivalent to the expected number of out-of-bag samples being at
    least 1.

    Parameters
    ----------
    n_samples : int
        Number of samples in the dataset.
    max_samples : int or float
        The maximum number of samples to draw from the total available:
            - if float, this indicates a fraction of the total;
            - if int, this indicates the exact number of samples;
            - if None, this indicates the total number of samples.

    Returns
    -------
    n_samples_bootstrap : int
        The total number of samples to draw for the bootstrap sample.
    """
    if max_samples is None:
        return n_samples

    if isinstance(max_samples, Integral):
        # Expected number of unique samples drawn in-bag.
        expected_unique_samples = (1 - np.exp(-max_samples / n_samples)) * n_samples
        if expected_unique_samples > n_samples - 1:
            raise ValueError(
                "The expected number of unique samples in the bootstrap sample"
                f" must be at most {n_samples - 1}. It is: {expected_unique_samples}"
            )
        return max_samples

    if isinstance(max_samples, Real):
        expected_unique_samples = (1 - np.exp(-max_samples)) * n_samples
        if expected_unique_samples > n_samples - 1:
            raise ValueError(
                "The expected number of unique samples in the bootstrap sample"
                f" must be at most {n_samples - 1}. It is: {expected_unique_samples}"
            )
        return max(round(n_samples * max_samples), 1)

Note that we probably want some reasonable control over how large max_samples can be relative to n_samples. For instance, if max_samples = 10 * n_samples, nearly every sample is drawn in-bag for each tree and almost no samples remain for the OOB computation. Thus a reasonable cutoff is to always require that at least 1 sample is expected to be OOB.

  • if max_samples is an integer -> the expected number of OOB samples is e^(-max_samples/n_samples) * n_samples, and this must be at least 1 (equivalently, the expected number of unique in-bag samples, (1 - e^(-max_samples/n_samples)) * n_samples, must be at most n_samples - 1)
  • if max_samples is a float -> the expected number of OOB samples is e^(-max_samples) * n_samples, and this must be at least 1 (i.e. you are expected to sample at least 1 sample out of bag).
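To make the cutoff concrete, here is a small numeric check of the constraint (`expected_oob` is a hypothetical helper for illustration):

```python
import math

def expected_oob(n_samples, frac):
    """Expected number of OOB samples when making n_samples * frac
    bootstrap draws: each sample is left out with probability e^(-frac)."""
    return n_samples * math.exp(-frac)

print(round(expected_oob(100, 1.0), 1))   # ~36.8 OOB samples (classic bootstrap)
print(round(expected_oob(100, 1.6), 1))   # ~20.2 OOB samples
print(round(expected_oob(100, 10.0), 4))  # ~0.0045 -> fails the >= 1 cutoff
```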

Alternatively, we can impose a stricter heuristic of at least 5 expected OOB samples. Either way, this works for most use cases, because people would typically want to raise the in-bag percentage from 63% to, say, 80% or 90% at most, but not 99.99%.

Describe alternatives you've considered, if relevant

There is no other way to enable this functionality without forking the code.

Additional context

This adds flexibility to the trees and may help support other issues that require more fine-grained control over what is in-bag vs. OOB, such as #19710.
