Clean up special casing in as_column for non-typed input #14636

mroeschke wants to merge 21 commits into rapidsai:branch-24.04 from …mn/misc_list_like_handling
Conversation
wence- left a comment:
I think this is a good direction!
```python
def can_memoryview(arbitrary: Any) -> bool:
    try:
        memoryview(arbitrary)
        return True
    except TypeError:
        return False
```
suggestion (replacing `can_memoryview`):

```python
def as_memoryview(arbitrary: Any) -> Optional[memoryview]:
    try:
        return memoryview(arbitrary)
    except TypeError:
        return None
```
```python
elif can_memoryview(arbitrary):
    data = as_column(
        memoryview(arbitrary), dtype=dtype, nan_as_null=nan_as_null
    )
```
suggestion (replacing the branch above):

```python
elif (view := as_memoryview(arbitrary)) is not None:
    data = as_column(view, dtype=dtype, nan_as_null=nan_as_null)
```
Then one could merge this handler into the isinstance(arbitrary, memoryview) case as well with:
```python
if isinstance((view := as_memoryview(arbitrary)), memoryview):
    data = as_column(np.asarray(view), dtype=dtype, nan_as_null=nan_as_null)
```
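To make the suggestion concrete, here is a self-contained sketch of the `as_memoryview` helper on its own (using `bytes`, which supports the buffer protocol, and a plain `list`, which does not):

```python
from typing import Any, Optional

def as_memoryview(arbitrary: Any) -> Optional[memoryview]:
    # Return a memoryview if the object supports the buffer
    # protocol, otherwise None (instead of a bare True/False).
    try:
        return memoryview(arbitrary)
    except TypeError:
        return None

# bytes expose the buffer protocol; a plain list does not
assert isinstance(as_memoryview(b"abc"), memoryview)
assert as_memoryview([1, 2, 3]) is None
```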
Which brings me to a further suggestion for this function. One of the issues is that it's very difficult to read the control flow. AIUI, we basically have lots of different branches that check whether we can produce a Column, assign that to `data`, and then eventually return `data`. But it's unclear (for example here) when we do `data = as_column(memoryview(arb), ...)` whether there is any further processing to be done, or if we are now ready to return the final answer.
I think I would prefer if all of these branches return ... rather than assigning to a variable and going to fallthrough.
> I think I would prefer if all of these branches return ... rather than assigning to a variable and going to fallthrough.

Good point. I'll try to have every branch do `return ...` for clarity. One aspect I'm not sure about: when `dtype=` is specified, whether an "astype" is done for all branches if the input is typed.
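A toy illustration of the proposed style (the function name and branches are illustrative only, not cudf's real code): each branch returns its result directly instead of assigning to `data` and falling through to a single return at the bottom.

```python
from typing import Any

def as_column_sketch(arbitrary: Any) -> list:
    # Each branch returns immediately, so a reader knows no
    # further processing happens after a successful match.
    if isinstance(arbitrary, memoryview):
        return list(arbitrary)
    if isinstance(arbitrary, range):
        return list(arbitrary)
    raise TypeError(f"cannot convert {type(arbitrary).__name__}")

assert as_column_sketch(memoryview(b"ab")) == [97, 98]
assert as_column_sketch(range(3)) == [0, 1, 2]
```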
```python
    )
    np_type = np_dtype.type
    pa_type = np_to_pa_dtype(np_dtype)
    from_pandas = True if nan_as_null is None else nan_as_null
```
suggestion:

```python
    from_pandas = nan_as_null is None or nan_as_null
```
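The suggested form is a pure boolean simplification; assuming `nan_as_null` can only be `None`, `True`, or `False`, a quick check that both expressions always agree:

```python
# Verify the two expressions are equivalent for every valid input
for nan_as_null in (None, True, False):
    original = True if nan_as_null is None else nan_as_null
    suggested = nan_as_null is None or nan_as_null
    assert original == suggested, nan_as_null
```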
```python
    ):
        arbitrary = arbitrary.cast(typ)
except (pa.ArrowInvalid, pa.ArrowTypeError):
    if any(is_column_like(val) for val in arbitrary):
```
question: are we guaranteed that arbitrary is iterable at this point?
I believe so, since an initial `pa.array` call proxy-checks that, but this would fail here for generators, for example, so I'll make sure this branch casts the input to a list or similar.
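A minimal illustration of why generators are a problem here: a generator can only be iterated once, so after a failed parse attempt has consumed it, the `any(...)` scan would see nothing. Materializing to a list up front avoids this.

```python
def gen():
    yield from (1, 2, 3)

g = gen()
list(g)                 # a first pass (e.g. a parse attempt) consumes it
assert list(g) == []    # a second pass now sees nothing

vals = list(gen())      # materialize once up front instead
assert vals == [1, 2, 3]
assert any(v == 2 for v in vals)  # later scans still work
```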
```python
        data = as_column(arbitrary, nan_as_null=nan_as_null)
else:
    try:
        arbitrary = pa.array(arbitrary, from_pandas=from_pandas)
```
Do we want to go via arrow first and only then try and fall back to pandas? Or should we just go via pandas? I note this PR does not fix #14627 because I think we still go down the same (different) arrow vs pandas codepath.
question: I still wonder if this is more complicated than it needs to be. If we're in these branches, we have an "arbitrary" object that wasn't a Series/array/arrow array. Is there a disadvantage to (in all cases) going via pandas? Concretely: are there objects that arrow can convert that pandas cannot?
Parsing via Arrow first is beneficial for cases like:

```python
In [1]: import pandas as pd

In [2]: pd.Series([1, None])
Out[2]:
0    1.0
1    NaN
dtype: float64

In [3]: import pyarrow as pa

In [4]: pa.array([1, None])
Out[4]:
<pyarrow.lib.Int64Array object at 0x7face4728580>
[
  1,
  null
]
```

We could at this point do data introspection for these two cases and parse with arrow if we see either case, otherwise always parse with pandas.
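A hypothetical sketch of that introspection (the helper name and checks are illustrative, not cudf code): prefer arrow when integers appear alongside `None`, so nulls don't force the float64/NaN coercion shown above.

```python
def should_use_arrow(values: list) -> bool:
    # Prefer arrow when nulls appear alongside integers, so
    # [1, None] stays int64-with-null rather than float64-with-NaN.
    has_null = any(v is None for v in values)
    all_int_or_null = all(v is None or isinstance(v, int) for v in values)
    return has_null and all_int_or_null

assert should_use_arrow([1, None])          # arrow keeps int64 + null
assert not should_use_arrow([1.5, None])    # floats: pandas is fine
assert not should_use_arrow([1, 2, 3])      # no nulls: either path works
```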
Implemented @wence-'s suggestion in #14636 (comment)

Authors:
- Matthew Roeschke (https://github.com/mroeschke)

Approvers:
- Bradley Dice (https://github.com/bdice)

URL: #14643
…14961) Broken off of #14636; these cases are strict about a `dtype` being set, so there is no need for them to be in a try/except.

Authors:
- Matthew Roeschke (https://github.com/mroeschke)

Approvers:
- Lawrence Mitchell (https://github.com/wence-)

URL: #14961
Going to close in favor of #15276
Redo of #14636. Clean up special casing for non-typed inputs to essentially do:

```python
try:
    arbitrary = pa.array(arbitrary)
except:
    arbitrary = pd.Series(arbitrary)
return as_column(arbitrary)
```

Additionally, this change matches a behavior with pandas that will parse string data with a `dtype=datetime64` type similar to the 2.2 behavior (fail if the resolution of the type doesn't match the string data).

Authors:
- Matthew Roeschke (https://github.com/mroeschke)
- GALI PREM SAGAR (https://github.com/galipremsagar)

Approvers:
- Vyas Ramasubramani (https://github.com/vyasr)

URL: #15276
Description
Cleans up a lot of special casing for array-like inputs in `as_column` that are not typed. The logic now is essentially: try `pa.array(arbitrary)`, fall back to `pd.Series(arbitrary)` on failure, and then call `as_column` on the result.

Allows sharing of logic in the pandas/arrow code paths for consistency across different inputs, and roughly aligns with the comment in #14627 (comment). cc @wence-

This also fixes a bug where `as_column(list-like, dtype=cudf.CategoricalDtype|IntervalDtype(...))` did not keep its specific metadata.