StratifiedShuffleSplit generates overlapping train and test indices #6121

@XuesongYang

Description

Why is there overlap between dev_idx and t_idx in the following code? The two index sets should be disjoint.

```
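# Imports assumed for this snippet (not shown in the original report).
# feats, labels, and trans are the reporter's NumPy arrays, defined elsewhere.
import numpy as np
from sklearn.cross_validation import StratifiedShuffleSplit  # legacy module matching the n_iter signature
from nose.tools import assert_equal  # assumption: the traceback shows a unittest-backed assert_equal

train_test_split = StratifiedShuffleSplit(labels, n_iter=1, test_size=0.2, random_state=0)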
for train_idx, test_idx in train_test_split:
    train_tmp = set(train_idx)
    test_tmp = set(test_idx)
    assert_equal(train_tmp.intersection(test_tmp), set())
    X_train = np.copy(feats[train_idx])
    y_train = np.copy(labels[train_idx])
    trans_train = np.copy(trans[train_idx])
    X_valid = np.copy(feats[test_idx])
    y_valid = np.copy(labels[test_idx])
    trans_valid = np.copy(trans[test_idx])
del feats
del labels
del trans
dev_test_split = StratifiedShuffleSplit(y_valid, n_iter=1, test_size=0.5, random_state=0)
for dev_idx, t_idx in dev_test_split:
    dev_tmp = set(dev_idx)
    t_tmp = set(t_idx)
    assert_equal(dev_tmp.intersection(t_tmp), set())
    X_dev = np.copy(X_valid[dev_idx])
    y_dev = np.copy(y_valid[dev_idx])
    trans_dev = np.copy(trans_valid[dev_idx])
    X_test = np.copy(X_valid[t_idx])
    y_test = np.copy(y_valid[t_idx])
    trans_test = np.copy(trans_valid[t_idx])
del X_valid
del y_valid
del trans_valid
```

The second assert_equal() check fails with the following error:

```
    assert_equal(dev_tmp.intersection(t_tmp), set())
  File "/home/xyang45/miniconda2/lib/python2.7/unittest/case.py", line 513, in assertEqual
    assertion_func(first, second, msg=msg)
  File "/home/xyang45/miniconda2/lib/python2.7/unittest/case.py", line 796, in assertSetEqual
    self.fail(self._formatMessage(msg, standardMsg))
  File "/home/xyang45/miniconda2/lib/python2.7/unittest/case.py", line 410, in fail
    raise self.failureException(msg)
AssertionError: Items in the first set but not the second:
1160
1161
907
1070
1747
2232
```
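
For reference, the invariant expected here (every index lands in exactly one of the two splits) can be sketched without scikit-learn. The following is a minimal pure-Python illustration, not scikit-learn's implementation; `stratified_shuffle_split` and `find_overlap` are hypothetical helper names, and the label list is made-up example data.

```python
import random
from collections import defaultdict


def stratified_shuffle_split(labels, test_size=0.2, seed=0):
    """Return (train_idx, test_idx), roughly preserving class proportions.

    Illustrative sketch only: each index is assigned to exactly one side,
    so the two lists are disjoint by construction.
    """
    rng = random.Random(seed)

    # Group sample indices by class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for label in sorted(by_class):  # assumes labels are mutually sortable
        members = by_class[label]
        rng.shuffle(members)
        n_test = int(round(len(members) * test_size))
        test_idx.extend(members[:n_test])
        train_idx.extend(members[n_test:])
    return sorted(train_idx), sorted(test_idx)


def find_overlap(idx_a, idx_b):
    """Return the sorted indices that appear in both splits."""
    return sorted(set(idx_a) & set(idx_b))


# Made-up labels: three classes with 50, 30, and 20 samples.
labels = [0] * 50 + [1] * 30 + [2] * 20
train_idx, test_idx = stratified_shuffle_split(labels, test_size=0.2, seed=0)
print(len(train_idx), len(test_idx), find_overlap(train_idx, test_idx))
```

Running a helper like `find_overlap` on the `dev_idx` and `t_idx` arrays from the report would list the offending indices directly, which may be easier to read than the set-difference message in the assertion failure.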
