
Docs - Update page on creating and storing Dask DataFrames#9025

Merged
mrocklin merged 3 commits into dask:main from scharlottej13:docs-create-df on May 10, 2022
Conversation

@scharlottej13
Contributor

  • Closes #xxxx
  • Tests added / passed
  • Passes pre-commit run --all-files

This is a PR updating the docs page on creating and storing Dask DataFrames. For context, this was driven by taking a look at Google Analytics and noticing this page is in the top 10 in terms of page views, but there was some outdated content on it (screenshot below for the past month):

[Screenshot taken 2022-05-04: Google Analytics page views for the past month]

@github-actions github-actions bot added the documentation (Improve or add to documentation) label May 4, 2022

Changing the ``blocksize`` parameter will change the number of partitions (see the explanation on
:ref:`partitions <dataframe-design-partitions>`). A good rule of thumb when working with
Dask DataFrames is to keep your partitions under 100MB in size.
Contributor Author
Can we make a recommendation on maximum partition size?

Member

I think your < 100MB recommendation sounds good!
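The <100 MB rule of thumb agreed on above can be turned into simple arithmetic: the number of partitions is roughly the file size divided by ``blocksize``. A minimal back-of-the-envelope sketch (the helper name is hypothetical, not a Dask API):

```python
import math

def n_partitions(file_size_bytes: int, blocksize_bytes: int) -> int:
    """Rough partition count for a single file: each partition
    holds at most ``blocksize`` bytes of the input file."""
    return math.ceil(file_size_bytes / blocksize_bytes)

# A 1 GB CSV read with a 64 MB blocksize yields ~16 partitions,
# each well under the ~100 MB-per-partition rule of thumb.
print(n_partitions(1_000_000_000, 64_000_000))  # → 16
```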

Contributor

@bryanwweber bryanwweber left a comment

Thanks @scharlottej13! I love all the additional cross-links to existing resources this page surfaces. Just a couple of small clarifying comments.

Comment on lines +66 to +70
It supports loading multiple files at once:

.. code-block:: python

>>> df = dd.read_csv('myfiles.*.csv')
Contributor

Naive user hat on... It may not be clear that the * in the filename string indicates that Dask should load multiple files. Perhaps there's some context here that's missing in the diff view 😄 If not, can the mechanism by which Dask knows that this should result in multiple files be explained briefly?

Contributor Author

Thank you! Added that this is using a globstring.
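For readers unfamiliar with globstrings: the ``*`` in ``'myfiles.*.csv'`` is a glob wildcard that matches any run of characters, so the pattern expands to every matching file. The stdlib ``fnmatch`` module illustrates the matching rule itself (this sketch only demonstrates pattern semantics; it does not call Dask):

```python
from fnmatch import fnmatch

# Hypothetical file listing; '*' matches any characters, so the
# pattern selects both CSV files but not the unrelated text file.
files = ["myfiles.2021.csv", "myfiles.2022.csv", "notes.txt"]
matches = [f for f in files if fnmatch(f, "myfiles.*.csv")]
print(matches)  # → ['myfiles.2021.csv', 'myfiles.2022.csv']
```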


>>> df = dd.read_csv('myfiles.*.csv')

Or you can break up a single large file with the ``blocksize`` parameter:
Contributor

"Break up" here refers to loading a single large file into multiple partitions, but given the context from the previous example about loading multiple files, this implies (to me, anyways) that using blocksize will result in multiple files on disk. This is clarified in the following paragraph, so maybe nothing needs to be done, but I thought I'd mention it in case it was easy to rearrange.

Contributor Author

Thx for flagging! I left it as-is; my logic being it's simpler to keep the example short, and people who want to use blocksize will read the sentences below, while others can skip past.

>>> df = dd.read_csv('my-data-*.csv')
>>> df = dd.read_csv('hdfs:///path/to/my-data-*.csv')
>>> df = dd.read_csv('s3://bucket-name/my-data-*.csv')
>>> df = dd.read_parquet("path/to/my/parquet/")
Contributor

@bryanwweber bryanwweber May 4, 2022

It's not clear to (naive user) me how this points to multiple Parquet files, since the CSV example used the *. Can this be clarified? On a related note, Dask usually writes a folder called <filename>.parquet which contains files called part.0.parquet, which may also be confusing here.

Contributor Author

Good point! I added that this is a directory of parquet files.
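To make the directory convention concrete: a Parquet dataset written by Dask is a folder containing one part file per partition, and ``read_parquet`` is pointed at the folder, not a single file. A stdlib-only sketch of that on-disk layout (placeholder empty files stand in for real Parquet data):

```python
import glob
import os
import tempfile

# Simulate the directory layout dd.read_parquet("path/to/my/parquet/")
# expects: one part.N.parquet file per partition inside the folder.
root = tempfile.mkdtemp()
for i in range(3):
    open(os.path.join(root, f"part.{i}.parquet"), "w").close()

layout = sorted(os.path.basename(p)
                for p in glob.glob(os.path.join(root, "*.parquet")))
print(layout)  # → ['part.0.parquet', 'part.1.parquet', 'part.2.parquet']
```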


.. code-block:: python

>>> df = dd.read_csv('largefile.csv', blocksize=25e6) # 25MB chunks
Contributor

I think blocksize="25MB" works and is a bit easier to read.
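Dask accepts human-readable byte strings like ``"25MB"`` here (it parses them with ``dask.utils.parse_bytes``). As a toy illustration of how such a string maps to the numeric ``25e6``, here is a minimal stdlib-only parser handling just the simple ``<number><unit>`` form; it is a sketch, not Dask's actual implementation:

```python
import re

UNITS = {"B": 1, "kB": 1_000, "MB": 1_000_000, "GB": 1_000_000_000}

def parse_blocksize(s: str) -> int:
    """Toy parser: '25MB' -> 25_000_000 bytes."""
    m = re.fullmatch(r"(\d+(?:\.\d+)?)\s*([kMG]?B)", s)
    if not m:
        raise ValueError(f"cannot parse {s!r}")
    number, unit = m.groups()
    return int(float(number) * UNITS[unit])

print(parse_blocksize("25MB"))  # → 25000000, same as blocksize=25e6
```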

@MrPowers
Contributor

MrPowers commented May 7, 2022

PR looks great and it's awesome you're making the docs better!

@mrocklin
Member

This seems like a strict improvement. There is probably more that can be done here, but let's merge this in and do more in a follow-up if we have time.

@mrocklin mrocklin merged commit 00776d2 into dask:main May 10, 2022
@scharlottej13 scharlottej13 deleted the docs-create-df branch May 10, 2022 15:36
erayaslan pushed a commit to erayaslan/dask that referenced this pull request May 12, 2022