
Fix multi-gpu documentation #591

Merged

rnyak merged 5 commits into main from fix_documentation on Jan 5, 2023
Conversation

@bbozkaya
Contributor

This PR fixes the user warning and the README documentation about data partitions for multi-GPU training. This addresses #550.

@bbozkaya bbozkaya requested a review from sararb December 29, 2022 16:30
@bbozkaya bbozkaya added the documentation Improvements or additions to documentation label Dec 29, 2022
@github-actions

"so npartitions>=global_size. Cudf or pandas can be used for repartitioning "
"e.g.: df.to_parquet('file.parquet', row_group_size=N_ROWS/NPARTITIONS, engine"
"='pyarrow') as npartitions=nr_rows/row_group_size."
"eg. df.to_parquet('file.parquet', row_group_size=N_ROWS/NPARTITIONS) or "
Contributor

Can we replace `df` with `pandas`, similar to the example with cudf?

Suggested change

```diff
- "eg. df.to_parquet('file.parquet', row_group_size=N_ROWS/NPARTITIONS) or "
+ "eg. pandas.to_parquet('file.parquet', row_group_size=N_ROWS/NPARTITIONS) or "
```

Comment thread docs/source/multi_gpu_train.md Outdated
<b>Note:</b> When using `DistributedDataParallel`, our data loader splits data between the GPUs based on dataset partitions. For that reason, the number of partitions of the dataset must be equal to, or an integer multiple of, the number of processes. If the Parquet file has a small number of row groups (partitions), try repartitioning it and saving it again using cudf or pandas before training. The data loader checks `dataloader.dataset.npartitions` and will repartition if needed, but we advise users to repartition the dataset and save it beforehand for better efficiency. Use pandas or cudf for repartitioning. Example of repartitioning a Parquet file with cudf:

```df.to_parquet("filename.parquet", row_group_size=10000)```
```cudf.to_parquet("filename.parquet", row_group_size_rows=10000)```
Contributor
I'd recommend writing `pdf.to_parquet(...)` for pandas and `gdf.to_parquet(...)` for cuDF dataframes.

rnyak and others added 4 commits January 5, 2023 11:38

- Modified references to pandas and cudf data objects.
- Modified references to pandas and cudf data objects in the documentation.
- Fixed documentation line length.
@rnyak
Contributor

rnyak commented Jan 5, 2023

rerun tests

@rnyak rnyak merged commit 0c491cf into main Jan 5, 2023
@rnyak rnyak deleted the fix_documentation branch January 5, 2023 19:11

Labels

documentation Improvements or additions to documentation

3 participants