Reading tables with a dask-cudf DataFrame #224
Codecov Report
@@            Coverage Diff            @@
##              main      #224   +/-   ##
=========================================
  Coverage   100.00%   100.00%
=========================================
  Files           64        64
  Lines         2589      2590    +1
  Branches       362       361    -1
=========================================
+ Hits          2589      2590    +1
nils-braun left a comment
Thanks @sarahyurick!
I have only two comments before we can merge:
- There are two additional input methods, `hive.py` and `dask.py`. The latter is trivial (I guess a Dask-cudf data frame is also a Dask data frame, so we can just keep the logic), but you should also add a check like the one in the intake plugin to disallow GPUs in `hive.py` (or we rewrite it to allow GPUs, but maybe that is something for the next step). I am actually wondering why the tests did not fail for hive...
- Can you make sure the coverage is at 100% again? On the pandas-like PR I already asked how we can best test the CPU behaviour via GitHub Actions. I think that, to begin with, we need `# pragma: no cover` comments in all GPU-only places. I would like to keep the 100% coverage if possible (even if this means we will need some coverage exceptions).
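A minimal sketch of the kind of guard suggested for `hive.py` above. The class name `HiveLikeInput`, the method names `to_dc` and `_read_on_cpu`, and the error message are hypothetical stand-ins, not the project's actual code. The sketch only shows the GPU branch failing early and carrying the `# pragma: no cover` marker, which makes coverage.py exclude the whole `if` block.

```python
class HiveLikeInput:
    """Hypothetical stand-in for an input plugin that only supports CPU reads."""

    def to_dc(self, input_item, table_name, gpu=False, **kwargs):
        if gpu:  # pragma: no cover
            # GPU-only place: excluded from coverage, as suggested in the review.
            raise NotImplementedError("Hive input does not support gpu=True yet")

        # Placeholder for the existing CPU code path, which stays unchanged.
        return self._read_on_cpu(input_item, table_name, **kwargs)

    def _read_on_cpu(self, input_item, table_name, **kwargs):
        # Hypothetical helper: a real plugin would build a Dask DataFrame here.
        return (table_name, input_item, kwargs)
```

Raising before any Hive-specific work happens keeps the failure cheap and makes the unsupported combination explicit to the caller.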
Sounds good - I've updated
    if gpu:  # pragma: no cover
        import dask_cudf

        return dask_cudf.from_cudf(
            cudf.from_pandas(input_item), npartitions=npartitions, **kwargs,
        )
    else:
        return dd.from_pandas(input_item, npartitions=npartitions, **kwargs)
Given that this input util accepts both cudf and pandas dataframes as valid inputs, you'd probably need an additional check here to see whether `input_item` is a pandas dataframe, and call `cudf.from_pandas` only in that case.
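One way the suggested check could look, as a sketch rather than the change that was actually merged. The helper names `frame_backend` and `to_dask_frame` are hypothetical, and detecting the backend from the type's module name is an assumption chosen so that cudf never has to be imported on a CPU-only machine. Converting a cudf input back to pandas when `gpu=False` is likewise an assumption, not something the review asks for.

```python
def frame_backend(input_item):
    """Return "pandas" or "cudf" for a supported frame, else None.

    Looking at the name of the module that defines the type avoids
    importing cudf on machines without a GPU stack installed.
    """
    root = type(input_item).__module__.split(".")[0]
    return root if root in ("pandas", "cudf") else None


def to_dask_frame(input_item, gpu=False, npartitions=1, **kwargs):
    """Hypothetical helper: wrap an in-memory frame in a Dask collection."""
    backend = frame_backend(input_item)
    if backend is None:
        raise TypeError(f"Unsupported input type: {type(input_item).__name__}")

    if gpu:  # pragma: no cover
        import cudf
        import dask_cudf

        # Convert only when the input is still a pandas object.
        if backend == "pandas":
            input_item = cudf.from_pandas(input_item)
        return dask_cudf.from_cudf(input_item, npartitions=npartitions, **kwargs)

    import dask.dataframe as dd

    if backend == "cudf":  # pragma: no cover
        # Assumption: move the data to host memory when gpu=False.
        input_item = input_item.to_pandas()
    return dd.from_pandas(input_item, npartitions=npartitions, **kwargs)
```

With this shape, a cudf input passed with `gpu=True` goes straight to `dask_cudf.from_cudf` without the extra `cudf.from_pandas` call.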
I like this! LGTM!
Updated version of #219. Also tagging @ayushdg if you have time to double check the `pandaslike.py` changes specifically?