
StratifiedShuffleSplit is producing incorrect class proportions in test split #8913

@arafat-al-mahmud

I have 20 samples from 2 classes: 18 from class 0 and 2 from class 1. Class 0 therefore makes up 18/20 = 0.9 (90%) of the data, and class 1 makes up 2/20 = 0.1 (10%). Since the split is stratified, I expect about 10% of the test samples to come from class 1, but no class 1 samples appear in any of the test splits.
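The proportions above can be checked with a short NumPy sketch. The names `n_test` and `ideal_test_counts` are illustrative, and rounding `test_size * n_samples` to get the test-set size is an assumption made for this check, not a statement about how the library computes it.

```python
import numpy as np

# Labels from the report: 18 samples of class 0, 2 samples of class 1.
y = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])
test_size = 0.2

# Assumed test-set size: test_size * n_samples, rounded to an integer.
n_test = int(round(test_size * len(y)))

# Per-class share of the full data set.
classes, counts = np.unique(y, return_counts=True)
proportions = counts / len(y)

# Number of test samples each class would get under exact stratification.
ideal_test_counts = proportions * n_test

for c, p, e in zip(classes, proportions, ideal_test_counts):
    print(f"class {c}: proportion {p:.2f}, ideal test count {e:.1f}")
```

With a 4-sample test set, the ideal share for class 1 is a fraction of one sample, so an integer split can only place 0 or 1 class 1 samples in the test set (0% or 25%).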

import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

# 20 samples: 18 from class 0 and 2 from class 1 (indices 8 and 9)
X = np.array([[1, 2], [3, 4], [1, 2], [3, 4], [5, 6], [1, 2], [3, 4], [1, 2], [3, 4], [5, 6], [7, 8], [1, 2], [3, 4], [1, 2], [3, 4], [5, 6], [1, 2], [3, 4], [1, 2], [3, 4]])
y = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])
sss = StratifiedShuffleSplit(n_splits=3, test_size=0.2, random_state=0)

# Print the train and test indices of each split
for train_index, test_index in sss.split(X, y):
    print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]

Output:

TRAIN: [10  9  3  7  1 12 16 15 13 11  4  6 19  8 18  2] TEST: [ 5 17  0 14]
TRAIN: [ 5 11  8 14 12 18  4 19 13  9 17  6  1 10  2 15] TEST: [ 0 16  3  7]
TRAIN: [14  2  0 16 10  8 15  1  5 18 12 19 17 11  9  4] TEST: [ 3 13  6  7]
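To confirm that class 1 is absent from every test split, the reported test indices can be mapped back to their labels. This is a minimal sketch that hard-codes the indices from the output above; `reported_test_splits` and `class1_counts` are illustrative names.

```python
import numpy as np

# Labels from the report; class 1 sits at indices 8 and 9.
y = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])

# Test indices copied from the reported output, one list per split.
reported_test_splits = [
    [5, 17, 0, 14],
    [0, 16, 3, 7],
    [3, 13, 6, 7],
]

class1_counts = []
for i, test_index in enumerate(reported_test_splits):
    # Count how many test samples fall in each class (index 0 and index 1).
    per_class = np.bincount(y[test_index], minlength=2)
    class1_counts.append(int(per_class[1]))
    print(f"split {i}: class 0 = {per_class[0]}, class 1 = {per_class[1]}")
```

Indices 8 and 9 appear in none of the three test sets, so each split holds four class 0 samples and no class 1 samples.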
