Documentation on multi-GPU training with DataParallel and DistributedDataParallel #492

@viswa-nvidia

Description

Provide documentation on how to train a Transformers4Rec model using multiple GPUs with DataParallel (DP) and DistributedDataParallel (DDP).

  • Short explanation of DP and DDP with links to the PyTorch documentation
  • Describe the code snippets, command-line examples, and environment variables necessary to use the Trainer with DP and DDP
  • Table comparing runtime vs. number of GPUs for one of the integration tests on a single GPU, DP, and DDP
  • Explain that the learning rate needs to be increased when using DP and DDP
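The items above could be illustrated with a minimal sketch like the following. It is not Transformers4Rec API: the `nn.Linear` stand-in model and the `scale_lr` helper are illustrative assumptions. DP replicates the model inside a single process, while DDP runs one process per GPU and is typically launched with `torchrun`; the linear scaling rule (multiply the base learning rate by the number of GPUs) is one common heuristic for the LR adjustment mentioned in the last bullet.

```python
import torch
import torch.nn as nn

# Hypothetical helper: the linear scaling rule multiplies the base
# learning rate by the number of participating GPUs (world size).
def scale_lr(base_lr: float, world_size: int) -> float:
    return base_lr * world_size

# --- DataParallel (DP): single process, model replicated per GPU ---
model = nn.Linear(16, 2)           # stand-in for a Transformers4Rec model
dp_model = nn.DataParallel(model)  # falls back to a plain forward on CPU
out = dp_model(torch.randn(8, 16))
print(out.shape)  # torch.Size([8, 2])

# --- DistributedDataParallel (DDP): one process per GPU ---
# Typically launched with:  torchrun --nproc_per_node=<num_gpus> train.py
# Inside train.py one would do (sketch only, not executed here):
#   torch.distributed.init_process_group(backend="nccl")
#   local_rank = int(os.environ["LOCAL_RANK"])
#   ddp_model = nn.parallel.DistributedDataParallel(
#       model.to(local_rank), device_ids=[local_rank])

print(scale_lr(1e-3, 4))  # 0.004
```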

Metadata

Labels

documentation (Improvements or additions to documentation)
