Try to make divisions behavior clearer by jsignell · Pull Request #8379 · dask/dask

jsignell · 2021-11-15T17:41:41Z

Closes Improve documentation of DataFrame.divisions #8264
Tests added / passed
Passes pre-commit run --all-files

I think this makes things clearer and it makes the example runnable without being huge.

gjoseph92

Thanks @jsignell! This does help clarify things quite a lot.

I have one other suggestion I couldn't comment on the lines for (noting that npartitions is ignored when divisions is given):

        npartitions: int, None, or 'auto'
            The ideal number of output partitions. If None, use the same as
            the input. If 'auto' then decide by memory use.
            Only used when ``divisions`` is not given. If ``divisions`` is given,
            the number of output partitions will be ``len(divisions) - 1``.
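
As a rough illustration of the precedence rule in the suggested wording above: the helper below is hypothetical and is not dask's code; it only restates the docstring (explicit `divisions` win, otherwise `npartitions`, otherwise the input partition count). The name `output_partition_count` and its parameters are assumptions made for this sketch.

```python
def output_partition_count(input_npartitions, npartitions=None, divisions=None):
    """Hypothetical helper restating the docstring's rule; not part of dask."""
    if divisions is not None:
        # Explicit divisions take priority: N boundaries delimit N - 1 partitions,
        # so any npartitions argument is ignored.
        return len(divisions) - 1
    if npartitions is None:
        # No request made: keep the number of input partitions.
        return input_npartitions
    if npartitions == "auto":
        # 'auto' is decided from memory use, which this sketch cannot know.
        raise NotImplementedError("'auto' depends on memory use")
    return npartitions


print(output_partition_count(4))
print(output_partition_count(4, npartitions=8))
print(output_partition_count(4, npartitions=8, divisions=[0, 10, 50, 100]))
```

The last call shows the case the note is about: `npartitions=8` is passed, but the four boundaries in `divisions` determine the result.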

dask/dataframe/core.py

gjoseph92 · 2021-11-15T22:00:04Z

+        >>> divisions = pd.date_range(start="2021-01-01", end="2021-01-07", freq='1D')
+        >>> divisions
+        DatetimeIndex(['2021-01-01', '2021-01-02', '2021-01-03', '2021-01-04',
+                    '2021-01-05', '2021-01-06', '2021-01-07'],
+                    dtype='datetime64[ns]', freq='D')
+
+        Note that ``len(divisions)`` is equal to ``npartitions + 1``. This is because ``divisions``
+        represents the upper and lower bounds of each partition.
+
+        >>> ddf2 = ddf.set_index("timestamp", sorted=True, divisions=divisions)


I wonder if the example would be easier to read if we used the name column and wrote the divisions by hand:

>>> divisions = ["Alice", "Frank", "Laura", "Quinn", "Ursula", "Zelda"]
>>> ddf2 = ddf.set_index("name", divisions=divisions)

This doesn't illustrate sorted=True, but it does show a probably-more-common case of using a different unsorted column, and that writing divisions yourself is not that scary.

Yeah I like that idea.
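
To make the hand-written name divisions above concrete, here is a minimal pure-Python sketch of which partition a given name would fall into. It assumes each partition covers a half-open range starting at its lower boundary, with the last boundary included in the final partition (as the dask design docs describe). `partition_for` is a hypothetical helper, not a dask function, and raising on out-of-range values is a choice made for this sketch only.

```python
from bisect import bisect_right

divisions = ["Alice", "Frank", "Laura", "Quinn", "Ursula", "Zelda"]


def partition_for(value, divisions):
    """Hypothetical helper: index of the partition that would hold ``value``."""
    if not divisions[0] <= value <= divisions[-1]:
        # Sketch-only choice: reject values outside the outer boundaries.
        raise ValueError(f"{value!r} falls outside the divisions")
    # The interior boundaries are the lower bounds of partitions 1..n-1.
    # bisect_right sends a value equal to a boundary to the partition that
    # starts there, and everything at or above the last interior boundary
    # (including the final division) to the last partition.
    return bisect_right(divisions[1:-1], value)


for name in ["Alice", "Bob", "Frank", "Mallory", "Zelda"]:
    print(name, partition_for(name, divisions))
```

Six boundaries give five partitions, so the indices printed here range from 0 to 4.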

dask/dataframe/core.py

gjoseph92 · 2021-11-15T22:26:56Z

@@ -4310,7 +4310,7 @@ def set_index(
            See https://docs.dask.org/en/latest/dataframe-design.html#partitions


        divisions: list, optional
            The "dividing lines" used to split the new index into partitions.
            For ``divisions=[0, 10, 50, 100]``, there would be three output
            partitions, where the new index contained [0, 10), [10, 50), and
            [50, 100), respectively.
            See https://docs.dask.org/en/latest/dataframe-design.html#partitions.
            If not given (default), good divisions are calculated by immediately
            computing the data and looking at the distribution of its values.
            For large datasets, this can be expensive.
            Note that if ``sorted=True``, specified divisions are assumed to match
            the existing partitions in the data; if this is untrue you should
            leave divisions empty and call ``repartition`` after ``set_index``.

Something like this might make the divisions parameter easier to understand, at least for me. I don't know if it would help anyone else?
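
The `divisions=[0, 10, 50, 100]` case from the suggested docstring can be spelled out by pairing each boundary with the next one. This is only a sketch: `partition_bounds` is a hypothetical helper, and the sortedness check is an assumption about what a sensible input looks like, not a statement about dask's validation.

```python
def partition_bounds(divisions):
    """Hypothetical helper: (lower, upper) boundary pair for each partition."""
    if list(divisions) != sorted(divisions):
        # Assumption for this sketch: boundaries must be in ascending order.
        raise ValueError("divisions must be sorted")
    # Pair each boundary with its successor: N boundaries yield N - 1 pairs.
    return list(zip(divisions, divisions[1:]))


bounds = partition_bounds([0, 10, 50, 100])
print(bounds)       # [(0, 10), (10, 50), (50, 100)]
print(len(bounds))  # 3
```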

Co-authored-by: Gabe Joseph <gjoseph92@gmail.com>

Try to make divisions behavior clearer

eec4995

github-actions bot added the dataframe label Nov 15, 2021

gjoseph92 reviewed Nov 15, 2021

charlesbluca mentioned this pull request Nov 16, 2021

Standardizing type for divisions #8388

Closed

jsignell and others added 4 commits November 17, 2021 09:49

Apply suggestions from code review

05de734

Co-authored-by: Gabe Joseph <gjoseph92@gmail.com>

More input from Gabe

17faa9e

Add import

51b9ebf

Fix doctest

83d3f1d

jsignell merged commit fd88539 into dask:main Nov 17, 2021

jsignell deleted the divisions branch November 17, 2021 17:16

gjoseph92 mentioned this pull request Nov 30, 2021

[Discussion] Don't compute divisions by default in set_index? #8435

Open

jsignell merged 5 commits into dask:main from jsignell:divisions
