IndexError("Too many indices for array") raised when attempting to run the K-Means|| example #803

@hristog

What happened:
Following the steps from the K-Means|| example raises IndexError("Too many indices for array") during km.fit(X).

What you expected to happen:
The K-Means algorithm to successfully complete the fitting stage.

Minimal Complete Verifiable Example:

Reproducible by following the steps from the K-Means|| example:

import dask_ml.datasets
import dask_ml.cluster


X, y = dask_ml.datasets.make_blobs(n_samples=10000000,
                                   chunks=1000000,
                                   random_state=0,
                                   centers=3)
X = X.persist()

km = dask_ml.cluster.KMeans(n_clusters=3, init_max_iter=2, oversampling_factor=10)
km.fit(X)

Anything else we need to know?:
Traceback:

In []: km.fit(X)
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-4-e41873a7fbf2> in <module>
----> 1 km.fit(X)

~/anaconda3/lib/python3.8/site-packages/dask_ml/cluster/k_means.py in fit(self, X, y)
    194     def fit(self, X, y=None):
    195         X = self._check_array(X)
--> 196         labels, centroids, inertia, n_iter = k_means(
    197             X,
    198             self.n_clusters,

~/anaconda3/lib/python3.8/site-packages/dask_ml/cluster/k_means.py in k_means(X, n_clusters, init, precompute_distances, n_init, max_iter, verbose, tol, random_state, copy_x, n_jobs, algorithm, return_n_iter, oversampling_factor, init_max_iter)
    266     * n_jobs=-1
    267     """
--> 268     labels, inertia, centers, n_iter = _kmeans_single_lloyd(
    269         X,
    270         n_clusters,

~/anaconda3/lib/python3.8/site-packages/dask_ml/cluster/k_means.py in _kmeans_single_lloyd(X, n_clusters, max_iter, init, verbose, x_squared_norms, random_state, tol, precompute_distances, oversampling_factor, init_max_iter)
    569             # Require at least one per bucket, to avoid division by 0.
    570             counts = da.maximum(counts, 1)
--> 571             new_centers = new_centers / counts[:, None]
    572             (new_centers,) = compute(new_centers)
    573 

~/anaconda3/lib/python3.8/site-packages/dask/array/core.py in __getitem__(self, index)
   1694         )
   1695 
-> 1696         index2 = normalize_index(index, self.shape)
   1697         dependencies = {self.name}
   1698         for i in index2:

~/anaconda3/lib/python3.8/site-packages/dask/array/slicing.py in normalize_index(idx, shape)
    895     idx = idx + (slice(None),) * (len(shape) - n_sliced_dims)
    896     if len([i for i in idx if i is not None]) > len(shape):
--> 897         raise IndexError("Too many indices for array")
    898 
    899     none_shape = []

IndexError: Too many indices for array

Inspection of values via ipdb:

In []: km.fit(X)
> /root/anaconda3/lib/python3.8/site-packages/dask/array/slicing.py(898)normalize_index()
    897         import ipdb; ipdb.set_trace()
--> 898         raise IndexError("Too many indices for array")
    899 

ipdb> list
    893             n_sliced_dims += 1
    894 
    895     idx = idx + (slice(None),) * (len(shape) - n_sliced_dims)
    896     if len([i for i in idx if i is not None]) > len(shape):
    897         import ipdb; ipdb.set_trace()
--> 898         raise IndexError("Too many indices for array")
    899 
    900     none_shape = []
    901     i = 0
    902     for ind in idx:
    903         if ind is not None:

ipdb> pp idx
(slice(None, None, None), None)
ipdb> pp n_sliced_dims
1
ipdb> pp len([i for i in idx if i is not None])
1
ipdb> pp shape
()
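
For reference, the values above are enough to trip the check on their own. The following is a minimal sketch in plain Python, not dask's actual code: check_index is a hypothetical stand-in that copies only the two lines visible in the listing above, and the way it computes n_sliced_dims (one per non-None entry) is an assumption that happens to match the value 1 printed by ipdb.

```python
def check_index(idx, shape):
    """Hypothetical, simplified stand-in for the check shown in the ipdb listing."""
    # Assumption: each non-None index entry consumes one existing dimension.
    n_sliced_dims = sum(1 for i in idx if i is not None)
    # Pad with full slices for any dimensions not explicitly indexed.
    # For a 0-d shape the multiplier is negative, so nothing is appended.
    idx = idx + (slice(None),) * (len(shape) - n_sliced_dims)
    if len([i for i in idx if i is not None]) > len(shape):
        raise IndexError("Too many indices for array")
    return idx


# The index tuple produced by an expression such as counts[:, None].
idx = (slice(None), None)

# A 1-d shape accepts the index: one real slice plus one new axis.
ok = check_index(idx, (3,))

# A 0-d shape, as seen in `pp shape` above, rejects it.
try:
    check_index(idx, ())
    error_message = None
except IndexError as exc:
    error_message = str(exc)

print(ok)
print(error_message)
```

So the index itself looks fine for a 1-d counts array; the error only appears because the array being indexed has shape () at that point.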

Possibly related to #802.

Environment:

  • Dask version: 2021.03.0
  • Dask_ml version: 1.8.0
  • Python version: 3.8.5 (default, Sep 4 2020, 07:30:14), [GCC 7.3.0]
  • Operating System: Debian Buster
  • Install method (conda, pip, source): pip

If this is not reproducible on the maintainers' side, I'll be happy to provide a Dockerfile with more concrete details about the underlying environment.

Also, please let me know if this should be moved to the Dask repository instead.

Thanks!
