Description
Hi,
I've run into what I think is a small bug when concatenating dask arrays of strings whose dtypes differ:
In [106]: a = np.array(['CA-0', 'CA-1'])
In [107]: b = np.array(['TX-0', 'TX-10', 'TX-101', 'TX-102'])
In [108]: a = da.from_array(a, chunks=2)
In [109]: b = da.from_array(b, chunks=4)
In [110]: da.concatenate([a, b]).compute()
Out[110]:
array(['CA-0', 'CA-1', 'TX-0', 'TX-1', 'TX-1', 'TX-1'],
dtype='|S4')
In [111]: da.concatenate([b, a]).compute()
Out[111]:
array(['TX-0', 'TX-10', 'TX-101', 'TX-102', 'CA-0', 'CA-1'],
dtype='|S6')
If the array with the "smaller" dtype (in this case, S4) is the first array in the sequence to be concatenated, then this "smaller" dtype is used for the end result, truncating the entries in the array with the "larger" dtype (in this case, S6). If the order of the arrays is swapped so that the array with the "larger" dtype comes first, then the concatenation works properly.
It looks to me like the error occurs in the dask.array.core.concatenate3 function, where the dtype of the result is inferred from the first array in the sequence rather than taken from the dtype computed in the concatenate function itself.
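To illustrate what I suspect is happening, here is a minimal NumPy-only sketch. It is not dask's actual code: concat_into is a hypothetical helper I wrote for this example, and it assumes the result buffer is preallocated with whatever dtype it is given. Passing the first array's dtype reproduces the truncation, while passing the promoted dtype from np.result_type keeps the full strings.

```python
import numpy as np

a = np.array(['CA-0', 'CA-1'])
b = np.array(['TX-0', 'TX-10', 'TX-101', 'TX-102'])


def concat_into(arrays, dtype):
    """Hypothetical helper (not part of dask): copy 1-D arrays into a
    buffer preallocated with the given dtype."""
    out = np.empty(sum(len(x) for x in arrays), dtype=dtype)
    pos = 0
    for x in arrays:
        # Assignment casts to the buffer's dtype, so strings longer than
        # the buffer's item width are silently cut short.
        out[pos:pos + len(x)] = x
        pos += len(x)
    return out


# Buffer dtype taken from the first array: the longer 'TX-...' entries are cut.
truncated = concat_into([a, b], a.dtype)

# Buffer dtype promoted across all inputs: wide enough for every entry.
common = np.result_type(a.dtype, b.dtype)
correct = concat_into([a, b], common)

print(truncated)
print(correct)
```

If this is indeed the cause, a possible workaround until it is fixed might be to cast every input to the common dtype before concatenating, so that the order of the arrays no longer matters.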
Todd