
Fix multi-gpu documentation #591

Merged

rnyak merged 5 commits into main from fix_documentation on Jan 5, 2023
Conversation

@bbozkaya
Contributor

This PR fixes the user warning and the README documentation about data partitions for multi-GPU training. This addresses #550.

@bbozkaya bbozkaya requested a review from sararb December 29, 2022 16:30
@bbozkaya bbozkaya added the documentation Improvements or additions to documentation label Dec 29, 2022
@github-actions

"so npartitions>=global_size. Cudf or pandas can be used for repartitioning "
"e.g.: df.to_parquet('file.parquet', row_group_size=N_ROWS/NPARTITIONS, engine"
"='pyarrow') as npartitions=nr_rows/row_group_size."
"eg. df.to_parquet('file.parquet', row_group_size=N_ROWS/NPARTITIONS) or "
Contributor

Can we replace `df` with `pandas`, similar to the example with cudf?

Suggested change

```diff
- "eg. df.to_parquet('file.parquet', row_group_size=N_ROWS/NPARTITIONS) or "
+ "eg. pandas.to_parquet('file.parquet', row_group_size=N_ROWS/NPARTITIONS) or "
```

Comment thread docs/source/multi_gpu_train.md Outdated
<b>Note:</b> When using `DistributedDataParallel`, our data loader splits data between the GPUs based on dataset partitions. For that reason, the number of partitions of the dataset must be equal to, or an integer multiple of, the number of processes. If the Parquet file has a small number of row groups (partitions), try repartitioning it and saving it again using cudf or pandas before training. The data loader checks `dataloader.dataset.npartitions` and will repartition if needed, but we advise users to repartition the dataset and save it beforehand for better efficiency. Use pandas or cudf for repartitioning. Example of repartitioning a Parquet file with cudf:

```df.to_parquet("filename.parquet", row_group_size=10000)```
```cudf.to_parquet("filename.parquet", row_group_size_rows=10000)```
Contributor
I'd recommend writing `pdf.to_parquet(...)` for pandas and `gdf.to_parquet(...)` for cuDF dataframes.

rnyak and others added 4 commits January 5, 2023 11:38

- Modified references to pandas and cudf data objects.
- Modified references to pandas and cudf data objects in the documentation.
- Fixed documentation line length.
@rnyak
Contributor

rnyak commented Jan 5, 2023

rerun tests

@rnyak rnyak merged commit 0c491cf into main Jan 5, 2023
@rnyak rnyak deleted the fix_documentation branch January 5, 2023 19:11

Labels

documentation Improvements or additions to documentation

3 participants