
Docs - Update page on creating and storing Dask DataFrames#9025

Merged
mrocklin merged 3 commits into dask:main from scharlottej13:docs-create-df on May 10, 2022
Conversation

@scharlottej13
Contributor

  • Closes #xxxx
  • Tests added / passed
  • Passes pre-commit run --all-files

This is a PR updating the docs page on creating and storing Dask DataFrames. For context, this was driven by taking a look at Google Analytics and noticing this page is in the top 10 in terms of page views, but there was some outdated content on it (screenshot below for the past month):

[Screenshot taken 2022-05-04: Google Analytics page views for the past month]

@github-actions github-actions bot added the documentation (Improve or add to documentation) label May 4, 2022

Changing the ``blocksize`` parameter will change the number of partitions (see the explanation on
:ref:`partitions <dataframe-design-partitions>`). A good rule of thumb when working with
Dask DataFrames is to keep your partitions under 100MB in size.
Contributor Author
Can we make a recommendation on maximum partition size?

Member

I think your < 100MB recommendation sounds good!
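The <100 MB rule of thumb agreed on above can be turned into simple arithmetic: the number of partitions is roughly the file size divided by ``blocksize``. A minimal back-of-the-envelope sketch (the helper name is hypothetical, not a Dask API):

```python
import math

def n_partitions(file_size_bytes: int, blocksize_bytes: int) -> int:
    """Rough partition count for a single file: each partition
    holds at most ``blocksize`` bytes of the input file."""
    return math.ceil(file_size_bytes / blocksize_bytes)

# A 1 GB CSV read with a 64 MB blocksize yields ~16 partitions,
# each well under the ~100 MB-per-partition rule of thumb.
print(n_partitions(1_000_000_000, 64_000_000))  # → 16
```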

Contributor

@bryanwweber bryanwweber left a comment

Thanks @scharlottej13! I love all the additional cross-links to existing resources this page surfaces. Just a couple of small clarifying comments.

Comment on lines +66 to +70
It supports loading multiple files at once:

.. code-block:: python

>>> df = dd.read_csv('myfiles.*.csv')
Contributor

Naive user hat on... It may not be clear that the * in the filename string indicates that Dask should load multiple files. Perhaps there's some context here that's missing in the diff view 😄 If not, can the mechanism by which Dask knows that this should result in multiple files be explained briefly?

Contributor Author

Thank you! Added that this is using a globstring.
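For readers unfamiliar with globstrings: the ``*`` in ``'myfiles.*.csv'`` is a glob wildcard that matches any run of characters, so the pattern expands to every matching file. The stdlib ``fnmatch`` module illustrates the matching rule itself (this sketch only demonstrates pattern semantics; it does not call Dask):

```python
from fnmatch import fnmatch

# Hypothetical file listing; '*' matches any characters, so the
# pattern selects both CSV files but not the unrelated text file.
files = ["myfiles.2021.csv", "myfiles.2022.csv", "notes.txt"]
matches = [f for f in files if fnmatch(f, "myfiles.*.csv")]
print(matches)  # → ['myfiles.2021.csv', 'myfiles.2022.csv']
```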


>>> df = dd.read_csv('myfiles.*.csv')

Or you can break up a single large file with the ``blocksize`` parameter:
Contributor

"Break up" here refers to loading a single large file into multiple partitions, but given the context from the previous example about loading multiple files, this implies (to me, anyways) that using blocksize will result in multiple files on disk. This is clarified in the following paragraph, so maybe nothing needs to be done, but I thought I'd mention it in case it was easy to rearrange.

Contributor Author

Thx for flagging! I left it as-is; my logic being it's simpler to keep the example short, and people who want to use blocksize will read the sentences below, while others can skip past.

>>> df = dd.read_csv('my-data-*.csv')
>>> df = dd.read_csv('hdfs:///path/to/my-data-*.csv')
>>> df = dd.read_csv('s3://bucket-name/my-data-*.csv')
>>> df = dd.read_parquet("path/to/my/parquet/")
Contributor

@bryanwweber bryanwweber May 4, 2022

It's not clear to (naive user) me how this points to multiple Parquet files, since the CSV example used the *. Can this be clarified? On a related note, Dask usually writes a folder called <filename>.parquet which contains files called part.0.parquet, which may also be confusing here.

Contributor Author

Good point! I added that this is a directory of parquet files.
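To make the directory convention concrete: a Parquet dataset written by Dask is a folder containing one part file per partition, and ``read_parquet`` is pointed at the folder, not a single file. A stdlib-only sketch of that on-disk layout (placeholder empty files stand in for real Parquet data):

```python
import glob
import os
import tempfile

# Simulate the directory layout dd.read_parquet("path/to/my/parquet/")
# expects: one part.N.parquet file per partition inside the folder.
root = tempfile.mkdtemp()
for i in range(3):
    open(os.path.join(root, f"part.{i}.parquet"), "w").close()

layout = sorted(os.path.basename(p)
                for p in glob.glob(os.path.join(root, "*.parquet")))
print(layout)  # → ['part.0.parquet', 'part.1.parquet', 'part.2.parquet']
```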


.. code-block:: python

>>> df = dd.read_csv('largefile.csv', blocksize=25e6) # 25MB chunks
Contributor

I think blocksize="25MB" works and is a bit easier to read.
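Dask accepts human-readable byte strings like ``"25MB"`` here (it parses them with ``dask.utils.parse_bytes``). As a toy illustration of how such a string maps to the numeric ``25e6``, here is a minimal stdlib-only parser handling just the simple ``<number><unit>`` form; it is a sketch, not Dask's actual implementation:

```python
import re

UNITS = {"B": 1, "kB": 1_000, "MB": 1_000_000, "GB": 1_000_000_000}

def parse_blocksize(s: str) -> int:
    """Toy parser: '25MB' -> 25_000_000 bytes."""
    m = re.fullmatch(r"(\d+(?:\.\d+)?)\s*([kMG]?B)", s)
    if not m:
        raise ValueError(f"cannot parse {s!r}")
    number, unit = m.groups()
    return int(float(number) * UNITS[unit])

print(parse_blocksize("25MB"))  # → 25000000, same as blocksize=25e6
```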

@MrPowers
Contributor

MrPowers commented May 7, 2022

PR looks great and it's awesome you're making the docs better!

@mrocklin
Member

This seems like a strict improvement. There is probably more that can be done here, but let's merge this in and do more in a follow-up if we have time.

@mrocklin mrocklin merged commit 00776d2 into dask:main May 10, 2022
@scharlottej13 scharlottej13 deleted the docs-create-df branch May 10, 2022 15:36
erayaslan pushed a commit to erayaslan/dask that referenced this pull request May 12, 2022