Try to make divisions behavior clearer#8379
Merged
Conversation
gjoseph92
reviewed
Nov 15, 2021
Collaborator
gjoseph92
left a comment
There was a problem hiding this comment.
Thanks @jsignell! This does help clarify things quite a lot.
I have one other suggestion I couldn't comment on the lines for (noting that npartitions is ignored when divisions is given):
npartitions: int, None, or 'auto'
The ideal number of output partitions. If None, use the same as
the input. If 'auto' then decide by memory use.
Only used when ``divisions`` is not given. If ``divisions`` is given,
the number of output partitions will be ``len(divisions) - 1``.
dask/dataframe/core.py
Outdated
Comment on lines
+4342
to
+4351
| >>> divisions = pd.date_range(start="2021-01-01", end="2021-01-07", freq='1D') | ||
| ... divisions | ||
| DatetimeIndex(['2021-01-01', '2021-01-02', '2021-01-03', '2021-01-04', | ||
| '2021-01-05', '2021-01-06', '2021-01-07'], | ||
| dtype='datetime64[ns]', freq='D') | ||
|
|
||
| Note that ``len(divisons)`` is equal to ``npartitions + 1``. This is because ``divisions`` | ||
| represents the upper and lower bounds of each partition. | ||
|
|
||
| >>> ddf2 = ddf.set_index("timestamp", sorted=True, divisions=divisions) |
Collaborator
There was a problem hiding this comment.
I wonder if the example would be easier to read if we used the name column and wrote the divisions by hand:
>>> divisions = ["Alice", "Frank", "Laura", "Quinn", "Ursula", "Zelda"]
>>> ddf2 = ddf.set_index("name", divisions=divisions)This doesn't illustrate sorted=True, but it does show a probably-more-common case of using a different unsorted column, and that writing divisions yourself is not that scary.
dask/dataframe/core.py
Outdated
| @@ -4310,7 +4310,7 @@ def set_index( | |||
| See https://docs.dask.org/en/latest/dataframe-design.html#partitions | |||
Collaborator
There was a problem hiding this comment.
divisions: list, optional
The "dividing lines" used to split the new index into partitions.
For ``divisions=[0, 10, 50, 100]``, there would be three output partitions,
where the new index contained [0, 10), [10, 50), and [50, 100), respectively.
See https://docs.dask.org/en/latest/dataframe-design.html#partitions.
If not given (default), good divisions are calculated by immediately computing
the data and looking at the distribution of its values. For large datasets,
this can be expensive.
Note that if ``sorted=True``, specified divisions are assumed to match
the existing partitions in the data; if this is untrue you should
leave divisions empty and call ``repartition`` after ``set_index``.
Something like this might make the divisions parameter easier to understand for me, don't know if it would help anyone else?
Co-authored-by: Gabe Joseph <gjoseph92@gmail.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
DataFrame.divisions#8264pre-commit run --all-filesI think this makes things clearer and it makes the example runnable without being huge.