StratifiedShuffleSplit is producing incorrect class proportions in test split #8913
Closed
Description
I have 20 samples from 2 classes: 18 from class 0 and 2 from class 1. Class 0 therefore makes up 18/20 = 0.9 (90%) of the data and class 1 makes up 2/20 = 0.1 (10%). Since the split is stratified, I expect about 10% of the test samples to come from class 1. Instead, 0% of the test samples are drawn from class 1 in every split.
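To make the expectation concrete, here is a minimal sketch of the arithmetic, using only the standard library. It assumes the labels and `test_size=0.2` from this report; the variable names are illustrative and not part of scikit-learn. It computes the per-class test counts that exact proportionality would call for.

```python
from collections import Counter

# Labels from this report: 18 samples of class 0, 2 samples of class 1.
y = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
test_size = 0.2

# Total number of test samples for a 20% split of 20 samples.
n_test = round(test_size * len(y))

# Ideal (possibly fractional) number of test samples per class.
class_counts = Counter(y)
expected_test_counts = {
    label: test_size * count for label, count in sorted(class_counts.items())
}

print("test samples per split:", n_test)
for label, expected in expected_test_counts.items():
    print(f"class {label}: {class_counts[label]} samples, "
          f"ideal test count {expected:.1f}")
```

The ideal count for class 1 is fractional (0.4 of a sample), so with a test set of 4 samples any split has to round it to a whole number of class 1 samples.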
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

# 20 samples: 18 labelled 0 and 2 labelled 1 (at indices 8 and 9).
X = np.array([[1, 2], [3, 4], [1, 2], [3, 4], [5, 6], [1, 2], [3, 4], [1, 2], [3, 4], [5, 6], [7, 8], [1, 2], [3, 4], [1, 2], [3, 4], [5, 6], [1, 2], [3, 4], [1, 2], [3, 4]])
y = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])

sss = StratifiedShuffleSplit(n_splits=3, test_size=0.2, random_state=0)
for train_index, test_index in sss.split(X, y):
    print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
Output:
TRAIN: [10 9 3 7 1 12 16 15 13 11 4 6 19 8 18 2] TEST: [ 5 17 0 14]
TRAIN: [ 5 11 8 14 12 18 4 19 13 9 17 6 1 10 2 15] TEST: [ 0 16 3 7]
TRAIN: [14 2 0 16 10 8 15 1 5 18 12 19 17 11 9 4] TEST: [ 3 13 6 7]
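The claim can be checked directly against the output above. The following sketch copies the three reported test index lists by hand (an assumption of this example: they are transcribed from the output, not regenerated) and counts how many class 1 samples each one contains.

```python
# Labels from this report; class 1 sits at indices 8 and 9.
y = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

# Test indices transcribed from the output reported above.
reported_test_indices = [
    [5, 17, 0, 14],
    [0, 16, 3, 7],
    [3, 13, 6, 7],
]

# Since labels are 0/1, summing them counts the class 1 samples.
class1_counts = [sum(y[i] for i in indices) for indices in reported_test_indices]

for split, (indices, n_class1) in enumerate(
    zip(reported_test_indices, class1_counts), start=1
):
    share = n_class1 / len(indices)
    print(f"split {split}: {n_class1} of {len(indices)} test samples "
          f"are class 1 ({share:.0%})")
```

Neither index 8 nor index 9 appears in any of the reported test sets, so both class 1 samples land in the training set in all three splits.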