GroupShuffleSplit does not work as described in the documentation. #13369

#### Description

I need to produce train/validation/test splits with predefined groups. I don't want to use LeavePGroupsOut because I need to split the data into training and validation sets according to my desired percentages. The documentation of GroupShuffleSplit says the following about the `test_size` parameter:
> test_size : float, int, None, optional
> If float, should be between 0.0 and 1.0 and represent the proportion of the dataset to include in the test split. **If int, represents the absolute number of test samples**. If None, the value is set to the complement of the train size. By default, the value is set to 0.2. The default will change in version 0.21. It will remain 0.2 only if train_size is unspecified, otherwise it will complement the specified train_size.

However, this is not what actually happens, as the following code shows:

#### Steps/Code to Reproduce


- (1)

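```python
from sklearn.model_selection import GroupShuffleSplit

# TR_set and tr_groups are my data and its group labels (not shown here)
tr, ts = next(GroupShuffleSplit(n_splits=1, test_size=3).split(TR_set, groups=tr_groups))
print(tr)
print(ts)
```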

- (2)
```python
tr, ts = next(GroupShuffleSplit(n_splits=1, test_size=0.1).split(TR_set, groups=tr_groups))
print(len(tr))
print(len(ts))
```



#### Actual Results

This prints, for instance:

- (1)
`[  0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17
  18  19  20  21  22  23  24  25  28  29  30  31  32  33  34  35  36  37
  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53  54  55
  56  57  58  59  60  61  62  63  64  65  66  67  68  69  70  71  72  73
  74  75  76  77  78  79  80  81  82  83  84  85  86  87  88  91  92  93
  99 101 102 103 104 105 106 107]
[ 26  27  89  90  94  95  96  97  98 100]`

- (2)
`70
38
`

As you can see from (1), the test set contains 10 samples, not 3. This is almost always the case. I checked the groups of the indices: apparently, if test_size is an integer, it represents the absolute number of test groups, **not samples**. I think the documentation needs to be fixed since it's misleading.
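
The integer case can be reproduced with a minimal, self-contained sketch. The data below is made up for illustration (6 groups of 4 samples each); `X`, `groups`, and `random_state=0` are assumptions, not my real data:

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# Hypothetical data: 6 groups with 4 samples each (24 samples in total).
groups = np.repeat(np.arange(6), 4)
X = np.zeros((len(groups), 1))

splitter = GroupShuffleSplit(n_splits=1, test_size=3, random_state=0)
train_idx, test_idx = next(splitter.split(X, groups=groups))

# Count test samples and the distinct groups they belong to.
n_test_samples = len(test_idx)
n_test_groups = len(np.unique(groups[test_idx]))
print(n_test_samples, n_test_groups)
```

If `test_size=3` counted samples, the first number would be 3. Because it counts groups, the test split holds 3 whole groups, which is 12 samples with this toy data.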

Also, when test_size is a float, the split mostly does not respect the specified ratio. This may be due to unequal group sizes, but then there should be a note or warning explaining how test_size behaves as a ratio when the groups have unequal sizes. In (2), the test set is 35% of the whole set (38 of 108 samples), whereas it is supposed to be 10%.
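
A rough sketch of why the sample ratio can drift if the fraction is applied to groups rather than samples. The group sizes are invented, `sample_fraction_in_test` is a hypothetical helper, and rounding the number of test groups up with `ceil` is an assumption; this is not scikit-learn's implementation:

```python
import math
import random


def sample_fraction_in_test(group_sizes, test_size, seed):
    """Pick ceil(test_size * n_groups) groups at random and return the
    fraction of all samples that end up in the test split."""
    rng = random.Random(seed)
    n_test_groups = math.ceil(test_size * len(group_sizes))
    test_groups = rng.sample(range(len(group_sizes)), n_test_groups)
    n_test_samples = sum(group_sizes[g] for g in test_groups)
    return n_test_samples / sum(group_sizes)


# Invented, deliberately unequal group sizes (108 samples in 10 groups).
group_sizes = [38, 20, 10, 10, 8, 6, 6, 4, 4, 2]

# Repeat the group-level 10% split with different seeds.
fractions = [sample_fraction_in_test(group_sizes, 0.1, seed) for seed in range(20)]
print(min(fractions), max(fractions))
```

With these sizes, 10% of the groups is a single group, so the test split can hold anywhere from 2 to 38 of the 108 samples depending on which group is drawn.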

So either I'm missing something, or the documentation is incorrect.

Thanks.

#### Versions





System:
    python: 3.7.1 | packaged by conda-forge | (default, Nov 13 2018, 18:15:35)  [GCC 4.8.2 20140120 (Red Hat 4.8.2-15)]
executable: /home/burak/anaconda3/bin/python
   machine: Linux-4.15.0-45-generic-x86_64-with-debian-buster-sid

BLAS:
    macros: SCIPY_MKL_H=None, HAVE_CBLAS=None
  lib_dirs: /home/burak/anaconda3/lib
cblas_libs: mkl_rt, pthread

Python deps:
       pip: 18.1
setuptools: 40.2.0
   sklearn: 0.20.1
     numpy: 1.15.4
     scipy: 1.1.0
    Cython: 0.28.5
    pandas: 0.23.4
