Clean up special casing in as_column for non-typed input #14636

mroeschke wants to merge 21 commits into rapidsai:branch-24.04 from …mn/misc_list_like_handling
Conversation
wence- left a comment:
I think this is a good direction!
```python
def can_memoryview(arbitrary: Any) -> bool:
    try:
        memoryview(arbitrary)
        return True
    except TypeError:
        return False
```
suggestion (replacing `can_memoryview`):

```python
def as_memoryview(arbitrary: Any) -> Optional[memoryview]:
    try:
        return memoryview(arbitrary)
    except TypeError:
        return None
```
```python
elif can_memoryview(arbitrary):
    data = as_column(
        memoryview(arbitrary), dtype=dtype, nan_as_null=nan_as_null
    )
```
suggestion (replacing the branch above):

```python
elif (view := as_memoryview(arbitrary)) is not None:
    data = as_column(view, dtype=dtype, nan_as_null=nan_as_null)
```
Then one could merge this handler into the isinstance(arbitrary, memoryview) case as well with:
```python
if isinstance((view := as_memoryview(arbitrary)), memoryview):
    data = as_column(np.asarray(view), dtype=dtype, nan_as_null=nan_as_null)
```
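To make the suggestion concrete, here is a self-contained sketch of the `as_memoryview` helper on its own (using `bytes`, which supports the buffer protocol, and a plain `list`, which does not):

```python
from typing import Any, Optional

def as_memoryview(arbitrary: Any) -> Optional[memoryview]:
    # Return a memoryview if the object supports the buffer
    # protocol, otherwise None (instead of a bare True/False).
    try:
        return memoryview(arbitrary)
    except TypeError:
        return None

# bytes expose the buffer protocol; a plain list does not
assert isinstance(as_memoryview(b"abc"), memoryview)
assert as_memoryview([1, 2, 3]) is None
```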
Which brings me to a further suggestion for this function. One of the issues is that it's very difficult to read the control flow. AIUI, we basically have lots of different branches that check whether we can produce a Column, assign that to `data`, and then eventually return `data`. But it's unclear (for example here) when we do `data = as_column(memoryview(arb), ...)` whether there is any further processing to be done, or if we are now ready to return the final answer.
I think I would prefer if all of these branches return ... rather than assigning to a variable and going to fallthrough.
> I think I would prefer if all of these branches return ... rather than assigning to a variable and going to fallthrough.

Good point. I'll try to have every branch do `return ...` for clarity. One aspect I'm not sure about: when `dtype=` is specified, whether an "astype" is done for all branches if the input is typed.
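A toy illustration of the proposed style (the function name and branches are illustrative only, not cudf's real code): each branch returns its result directly instead of assigning to `data` and falling through to a single return at the bottom.

```python
from typing import Any

def as_column_sketch(arbitrary: Any) -> list:
    # Each branch returns immediately, so a reader knows no
    # further processing happens after a successful match.
    if isinstance(arbitrary, memoryview):
        return list(arbitrary)
    if isinstance(arbitrary, range):
        return list(arbitrary)
    raise TypeError(f"cannot convert {type(arbitrary).__name__}")

assert as_column_sketch(memoryview(b"ab")) == [97, 98]
assert as_column_sketch(range(3)) == [0, 1, 2]
```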
```python
    )
    np_type = np_dtype.type
    pa_type = np_to_pa_dtype(np_dtype)
    from_pandas = True if nan_as_null is None else nan_as_null
```
suggestion:

```python
    from_pandas = nan_as_null is None or nan_as_null
```
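The suggested form is a pure boolean simplification; assuming `nan_as_null` can only be `None`, `True`, or `False`, a quick check that both expressions always agree:

```python
# Verify the two expressions are equivalent for every valid input
for nan_as_null in (None, True, False):
    original = True if nan_as_null is None else nan_as_null
    suggested = nan_as_null is None or nan_as_null
    assert original == suggested, nan_as_null
```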
```python
    ):
        arbitrary = arbitrary.cast(typ)
except (pa.ArrowInvalid, pa.ArrowTypeError):
    if any(is_column_like(val) for val in arbitrary):
```
question: are we guaranteed that arbitrary is iterable at this point?
I believe so, since an initial `pa.array` call proxy-checks that, but this would fail here for generators, for example, so I'll make sure this branch casts the input to a list or similar.
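A minimal illustration of why generators are a problem here: a generator can only be iterated once, so after a failed parse attempt has consumed it, the `any(...)` scan would see nothing. Materializing to a list up front avoids this.

```python
def gen():
    yield from (1, 2, 3)

g = gen()
list(g)                 # a first pass (e.g. a parse attempt) consumes it
assert list(g) == []    # a second pass now sees nothing

vals = list(gen())      # materialize once up front instead
assert vals == [1, 2, 3]
assert any(v == 2 for v in vals)  # later scans still work
```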
```python
        data = as_column(arbitrary, nan_as_null=nan_as_null)
else:
    try:
        arbitrary = pa.array(arbitrary, from_pandas=from_pandas)
```
Do we want to go via arrow first and only then try and fall back to pandas? Or should we just go via pandas? I note this PR does not fix #14627 because I think we still go down the same (different) arrow vs pandas codepath.
question: I still wonder if this is more complicated than it needs to be. If we're in these branches, we have an "arbitrary" object that wasn't a Series/array/arrow array. Is there a disadvantage to (in all cases) going via pandas? Concretely: are there objects that arrow can convert that pandas cannot?
Parsing via Arrow first is beneficial for cases like:

```python
In [1]: import pandas as pd

In [2]: pd.Series([1, None])
Out[2]:
0    1.0
1    NaN
dtype: float64

In [3]: import pyarrow as pa

In [4]: pa.array([1, None])
Out[4]:
<pyarrow.lib.Int64Array object at 0x7face4728580>
[
  1,
  null
]
```

We could at this point do data introspection for these two cases and parse with arrow if we see either case, otherwise always parse with pandas.
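A hypothetical sketch of that introspection (the helper name and checks are illustrative, not cudf code): prefer arrow when integers appear alongside `None`, so nulls don't force the float64/NaN coercion shown above.

```python
def should_use_arrow(values: list) -> bool:
    # Prefer arrow when nulls appear alongside integers, so
    # [1, None] stays int64-with-null rather than float64-with-NaN.
    has_null = any(v is None for v in values)
    all_int_or_null = all(v is None or isinstance(v, int) for v in values)
    return has_null and all_int_or_null

assert should_use_arrow([1, None])          # arrow keeps int64 + null
assert not should_use_arrow([1.5, None])    # floats: pandas is fine
assert not should_use_arrow([1, 2, 3])      # no nulls: either path works
```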
Implemented @wence-'s suggestion in #14636 (comment)

Authors:
- Matthew Roeschke (https://github.com/mroeschke)

Approvers:
- Bradley Dice (https://github.com/bdice)

URL: #14643
…14961) Broken off of #14636; these cases are strict about a `dtype` being set, so there is no need for them to be in a try/except.

Authors:
- Matthew Roeschke (https://github.com/mroeschke)

Approvers:
- Lawrence Mitchell (https://github.com/wence-)

URL: #14961
Going to close in favor of #15276
Redo of #14636. Clean up special casing for non-typed inputs to essentially do:

```python
try:
    arbitrary = pa.array(arbitrary)
except:
    arbitrary = pd.Series(arbitrary)
return as_column(arbitrary)
```

Additionally, this change matches a behavior with pandas that will parse string data with a `dtype=datetime64` type similar to the 2.2 behavior (fail if the resolution of the type doesn't match the string data).

Authors:
- Matthew Roeschke (https://github.com/mroeschke)
- GALI PREM SAGAR (https://github.com/galipremsagar)

Approvers:
- Vyas Ramasubramani (https://github.com/vyasr)

URL: #15276
Description
Cleans up a lot of special casing for array-like inputs in `as_column` that are not typed. The logic now is essentially: try `pa.array(arbitrary)`, fall back to `pd.Series(arbitrary)` on failure, and then call `as_column` on the result.

Allows sharing of logic in the pandas/arrow code paths for consistency across different inputs, and roughly aligns with the comment in #14627 (comment). cc @wence-

This also fixes a bug where `as_column(list-like, dtype=cudf.CategoricalDtype|IntervalDtype(...))` did not keep its specific metadata.