Docs - Update page on creating and storing Dask DataFrames #9025
Conversation
> Changing the ``blocksize`` parameter will change the number of partitions (see the explanation on
> :ref:`partitions <dataframe-design-partitions>`). A good rule of thumb when working with
> Dask DataFrames is to keep your partitions under 100MB in size.
Can we make a recommendation on maximum partition size?
I think your < 100MB recommendation sounds good!
bryanwweber left a comment
Thanks @scharlottej13! I love all the additional cross-links to existing resources this page surfaces. Just a couple of small clarifying comments.
docs/source/dataframe-create.rst
Outdated
> It supports loading multiple files at once:
>
> .. code-block:: python
>
>    >>> df = dd.read_csv('myfiles.*.csv')
Naive user hat on... It may not be clear that the * in the filename string indicates that Dask should load multiple files. Perhaps there's some context here that's missing in the diff view 😄 If not, can the mechanism by which Dask knows that this should result in multiple files be explained briefly?
thank you! added that this is using a globstring
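To make the globstring point concrete for readers of this thread, here is a minimal sketch using only the standard library's `fnmatch`, which implements the same shell-style wildcard convention that `'myfiles.*.csv'` relies on. (This is an illustration of the matching rule, not Dask's actual file-resolution code, and the filenames are made up.)

```python
from fnmatch import fnmatch

# Hypothetical filenames on disk; the pattern is the one from the docs example.
filenames = ["myfiles.0.csv", "myfiles.1.csv", "myfiles.2.csv", "other.txt"]
pattern = "myfiles.*.csv"

# '*' matches any run of characters, so every 'myfiles.<N>.csv' shard matches
# while 'other.txt' does not.
matched = [name for name in filenames if fnmatch(name, pattern)]
print(matched)  # ['myfiles.0.csv', 'myfiles.1.csv', 'myfiles.2.csv']
```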
docs/source/dataframe-create.rst
Outdated
> .. code-block:: python
>
>    >>> df = dd.read_csv('myfiles.*.csv')
>
> Or you can break up a single large file with the ``blocksize`` parameter:
"Break up" here refers to loading a single large file into multiple partitions, but given the context from the previous example about loading multiple files, this implies (to me, anyways) that using blocksize will result in multiple files on disk. This is clarified in the following paragraph, so maybe nothing needs to be done, but I thought I'd mention it in case it was easy to rearrange.
Thanks for flagging! I left it as-is; my logic being it's simpler to keep the example short, and people who want to use blocksize will read the sentences below, while others can skip past.
> >>> df = dd.read_csv('my-data-*.csv')
> >>> df = dd.read_csv('hdfs:///path/to/my-data-*.csv')
> >>> df = dd.read_csv('s3://bucket-name/my-data-*.csv')
> >>> df = dd.read_parquet("path/to/my/parquet/")
It's not clear to (naive user) me how this points to multiple Parquet files, since the CSV example used the *. Can this be clarified? On a related note, Dask usually writes a folder called <filename>.parquet which contains files called part.0.parquet, which may also be confusing here.
Good point! I added that this is a directory of Parquet files.
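A quick sketch of the on-disk layout described in the comment above, using only the standard library: Dask typically writes a directory containing `part.N.parquet` files, and pointing `read_parquet` at the directory means "read every part file inside it." (The directory name `my-data.parquet` and the exact part-file names here are illustrative.)

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    # Fake the layout Dask's to_parquet commonly produces:
    # a directory holding part.0.parquet, part.1.parquet, ...
    outdir = Path(tmp) / "my-data.parquet"
    outdir.mkdir()
    for i in range(3):
        (outdir / f"part.{i}.parquet").touch()

    # Enumerating the part files inside the directory is all it takes
    # to see why no '*' is needed in the read_parquet path.
    parts = sorted(p.name for p in outdir.glob("part.*.parquet"))
    print(parts)  # ['part.0.parquet', 'part.1.parquet', 'part.2.parquet']
```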
> .. code-block:: python
>
>    >>> df = dd.read_csv('largefile.csv', blocksize=25e6)  # 25MB chunks
I think `blocksize="25MB"` works and is a bit easier to read.
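For illustration, a minimal sketch of how a human-readable size string like `"25MB"` could map to a byte count. Dask ships a real parser for this (`dask.utils.parse_bytes`); this toy version handles only a few decimal units and is not the actual implementation:

```python
# Decimal units only, for simplicity; this is a hypothetical helper,
# not Dask's parser.
UNITS = {"B": 1, "KB": 10**3, "MB": 10**6, "GB": 10**9}

def parse_size(text: str) -> int:
    """Convert a string like '25MB' to a number of bytes."""
    text = text.strip().upper()
    # Check longer suffixes first so 'MB' isn't mistaken for 'B'.
    for unit in sorted(UNITS, key=len, reverse=True):
        if text.endswith(unit):
            return int(float(text[: -len(unit)]) * UNITS[unit])
    return int(text)  # plain number of bytes

print(parse_size("25MB"))  # 25000000, i.e. the same as blocksize=25e6
```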
PR looks great and it's awesome you're making the docs better!
This seems like a strict improvement. There is probably more that can be done here, but let's merge this in and do more in a follow-up if we have time.
`pre-commit run --all-files`

This is a PR for updating the docs page on creating and storing Dask DataFrames. For context, this was driven by taking a look at Google Analytics and noticing this page is in the top 10 in terms of page views, but there was some outdated content on it (screenshot below for the past month):