Skip to content

Dispatch iloc calls to getitem#6355

Merged
TomAugspurger merged 4 commits intodask:masterfrom
gforsyth:iloc_to_getitem
Jun 30, 2020
Merged

Dispatch iloc calls to getitem#6355
TomAugspurger merged 4 commits intodask:masterfrom
gforsyth:iloc_to_getitem

Conversation

@gforsyth
Copy link
Contributor

  • Tests added / passed
  • Passes black dask / flake8 dask

Alternate solution to speeding up certain iloc calls xref #6345 #6264

This is a much smaller change and also provides a nice speedup for my simple local benchmarks:
timeseries parquet file created with dask.dataframe.demo with 4 columns and 34447680 rows

current master:

In [3]: %time len(ddf)
CPU times: user 11.8 s, sys: 1.89 s, total: 13.7 s
Wall time: 7.1 s
Out[3]: 34447680

In [4]: %time len(ddf)
CPU times: user 10.1 s, sys: 1.6 s, total: 11.7 s
Wall time: 5.2 s
Out[4]: 34447680

this branch:

In [3]: %time len(ddf)
CPU times: user 5.9 s, sys: 777 ms, total: 6.68 s
Wall time: 3.79 s
Out[3]: 34447680

In [4]: %time len(ddf)
CPU times: user 4.01 s, sys: 504 ms, total: 4.51 s
Wall time: 1.72 s
Out[4]: 34447680

Comment on lines +66 to +70
return self._getitem(cindexer)

def _getitem(self, cindexer):
col_names = self.obj.columns[cindexer]
return self.obj.__getitem__(col_names)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thoughts on inlining this method? If it's not going to be used elsewhere I wouldn't mind just sticking the two lines directly into the method above.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

happy to inline it, and it's unlikely to be used elsewhere.

@mrocklin
Copy link
Member

This PR may win the compute-hours-saved / lines-of-code contest

Co-authored-by: Tom Augspurger <TomAugspurger@users.noreply.github.com>
Copy link
Member

@TomAugspurger TomAugspurger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add a couple sanity tests that the result of a df.iloc[:, ...] is correct with duplicate columns?

  • df.iloc[:, 0] with duplicate columns and different values for the columns with the same name
  • df.iloc[:, list_of_integers] selecting some duplicates and some non-duplicate columns
  • df.iloc[:, negative_slice]

@gforsyth
Copy link
Contributor Author

Can you add a couple sanity tests that the result of a df.iloc[:, ...] is correct with duplicate columns?

  • df.iloc[:, 0] with duplicate columns and different values for the columns with the same name
  • df.iloc[:, list_of_integers] selecting some duplicates and some non-duplicate columns
  • df.iloc[:, negative_slice]

Added these tests with duplicate columns and without.

@TomAugspurger TomAugspurger merged commit 7aaef9f into dask:master Jun 30, 2020
@TomAugspurger
Copy link
Member

Thanks Gil

kumarprabhu1988 pushed a commit to kumarprabhu1988/dask that referenced this pull request Oct 29, 2020
* Dispatch `iloc` calls to `getitem`
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants