BUG: `Dataframe.astype(...)` Fails With `'bool' object has no attribute 'all'`
Modin version checks
-
[X] I have checked that this issue has not already been reported.
-
[X] I have confirmed this bug exists on the latest released version of Modin.
-
[ ] I have confirmed this bug exists on the main branch of Modin. (In order to do this you can follow this guide.)
Reproducible Example
import modin.pandas as pd
import numpy as np
# This will cast to a float type due to NaNs
df = pd.DataFrame({"a": [1, 2, 3, np.nan]})
# This passes
df.astype({"a": pd.Int64Dtype()})
>> a
0 1
1 2
2 3
3 <NA>
# This fails
df.astype(pd.Int64Dtype())
>> AttributeError: 'bool' object has no attribute 'all'
Issue Description
I think this is the same error as #7276, but that issue was closed. When using astype on a Dataframe it fails at this check
File .../modin/core/dataframe/pandas/dataframe/dataframe.py:
1736, in PandasDataframe.astype(self, col_dtypes, errors)
1730 return df.astype(
1731 {k: v for k, v in col_dtypes.items() if k in df}, errors=errors
1732 )
1734 else:
1735 # Assume that the dtype is a scalar.
-> 1736 if not (col_dtypes == self_dtypes).all():
1737 new_dtypes = self_dtypes.copy()
1738 new_dtype = pandas.api.types.pandas_dtype(col_dtypes)
AttributeError: 'bool' object has no attribute 'all'
When the type argument is a single value (i.e. astype(pd.Int64Dtype())) then it seems that col_dtypes == self_dtypes works out as a single bool value (hence no all attribute).
Note that this works Ok if the argument is a dictionary of column namess to dtypes.
This also seems to be the same for Series, i.e.:
df["a"].astype(pd.Int64Dtype())
Fails with the same error
Expected Behavior
In native Pandas
df.astype(pd.Int64Dtype())
casts the DataFrame/series to the argument type
Error Logs
No response
Installed Versions
'0.31.0'
In fact this seems to work ok if in line 1736 in .../modin/core/dataframe/pandas/dataframe/dataframe.py
1735 # Assume that the dtype is a scalar.
-> 1736 if not (col_dtypes == self_dtypes).all():
1737 new_dtypes = self_dtypes.copy()
the ordering of the arguments is swappe, i.e.:
1735 # Assume that the dtype is a scalar.
-> 1736 if not (self_dtypes == col_dtypes).all():
1737 new_dtypes = self_dtypes.copy()
I guess some ordering dependency in how the result is calculated?
I can open a PR to change, but not sure if there is some deeper reasoning here.
I have encountered this issue while working on a separate project and looking into this as the code suggests
self_dtypes is of type pandas.Series
col_dtypes in the else branch is a scalar object
col_dtypes == self_dtypes uses the col_dtypes.__eq__(obj) method first, which for pandas dtypes is implemented unconditionally, so it will always return False as the left-hand side is a dtype and the left-hand side is a pandas.Series
for strings it works because the direct equality is not implemented, so it tires self_dtypes.__eq__(obj)
flipping the comparison, self_dtypes == col_dtypes notices that the lhs is a series and the rhs is a scalar so it brodcasts to a pandas.Series of type bool, which has a member all to call
@zombie-einstein, to me the solution seems correct, but I am new to modin
hope one of the maintainters of modin can take a further look into this issue
More general context below
here is an example of why the flipping works:
import pandas as pd
s = pd.Series(['a', 'b', 'c'])
int_64 = pd.Int64Dtype()
string = 'a'
print(s == string)
# 0 True
# 1 False
# 2 False
# dtype: bool
print(string == s)
# 0 True
# 1 False
# 2 False
# dtype: bool
print(int_64 == s)
# False
print(s == int_64)
# 0 False
# 1 False
# 2 False
# dtype: bool
note on usage and thoughts on it:
The documentation of astype of modin itself does say the following
def astype(self, col_dtypes, errors: str = "raise"):
"""
Convert the columns dtypes to given dtypes.
Parameters
----------
col_dtypes : dictionary of {col: dtype,...} or str
Where col is the column name and dtype is a NumPy dtype.
errors : {'raise', 'ignore'}, default: 'raise'
Control raising of exceptions on invalid data for provided dtype.
Returns
-------
BaseDataFrame
Dataframe with updated dtypes.
"""
so assert isinstance(col_dtypes, dict | str) should pass and we should not be passing dtype objects, however I would argue this is bad UX, for reference pandas accepts data type objects for astype, documentation here