Skip to content

Reading tables with a dask-cudf DataFrame#224

Merged
nils-braun merged 3 commits intodask-contrib:mainfrom
sarahyurick:read_with_gpu
Aug 25, 2021
Merged

Reading tables with a dask-cudf DataFrame#224
nils-braun merged 3 commits intodask-contrib:mainfrom
sarahyurick:read_with_gpu

Conversation

@sarahyurick
Copy link
Collaborator

Updated version of #219. Also tagging @ayushdg if you have time to double check the pandaslike.py changes specifically?

@codecov-commenter
Copy link

codecov-commenter commented Aug 23, 2021

Codecov Report

Merging #224 (9aade51) into main (4dab949) will not change coverage.
The diff coverage is 100.00%.

Impacted file tree graph

@@            Coverage Diff            @@
##              main      #224   +/-   ##
=========================================
  Coverage   100.00%   100.00%           
=========================================
  Files           64        64           
  Lines         2589      2590    +1     
  Branches       362       361    -1     
=========================================
+ Hits          2589      2590    +1     
Impacted Files Coverage Δ
dask_sql/context.py 100.00% <ø> (ø)
dask_sql/input_utils/convert.py 100.00% <ø> (ø)
dask_sql/input_utils/hive.py 100.00% <100.00%> (ø)
dask_sql/input_utils/intake.py 100.00% <100.00%> (ø)
dask_sql/input_utils/location.py 100.00% <100.00%> (ø)
dask_sql/input_utils/pandaslike.py 100.00% <100.00%> (ø)
dask_sql/physical/rel/custom/create_table.py 100.00% <100.00%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 4dab949...9aade51. Read the comment docs.

Copy link
Collaborator

@nils-braun nils-braun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @sarahyurick!
I have only two comments before we can merge:

  • there are two additional input methods, hive.py and dask.py. The latter is trivial (I guess a Dask-cudf data frame is also a Dask data frame, so we can just keep the logic), but you should also add a check like in the intake plugin to not allow for GPUs in hive.py (or we also re-write it to allow GPUs, but maybe that is something for the next step). I am actually wondering why the tests did not fail for hive...
  • can you make sure the coverage is again 100%? On the pandas-like-PR I did already ask, how we can best test the CPU behaviour via GitHub actions. I think for the beginning, we need to have # pragma: no cover comments in all gpu-only places. I would like to keep the 100% coverage if possible (even if this means we will need some coverage exceptions).

@sarahyurick
Copy link
Collaborator Author

Sounds good - I've updated hive.py and added some # pragma: no covers - let me know if I missed any!

Comment on lines +30 to +37
if gpu: # pragma: no cover
import dask_cudf

return dask_cudf.from_cudf(
cudf.from_pandas(input_item), npartitions=npartitions, **kwargs,
)
else:
return dd.from_pandas(input_item, npartitions=npartitions, **kwargs)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given that this input util accepts both cudf and pandas dataframes as valid inputs, you'd probably need an additional check here to check if input_item is a pandas dataframe or not, and call the from_pandas function only for that case.

@nils-braun
Copy link
Collaborator

I like this! LGTM!

@nils-braun nils-braun merged commit ece7ec7 into dask-contrib:main Aug 25, 2021
@charlesbluca charlesbluca mentioned this pull request Oct 1, 2021
@sarahyurick sarahyurick deleted the read_with_gpu branch September 21, 2022 23:46
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants