- training in PyTorch works in 3 simple steps:
- compute loss during the forward pass
- compute the gradients during the backward pass
- update the model using the optimizer
- in single-GPU training, all three of these steps take place on the same GPU
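The three steps above can be sketched as a minimal training loop; the model, data, and hyperparameters below are illustrative toys, not anything from the notes:

```python
import torch
import torch.nn as nn

# toy model, optimizer, and a synthetic batch (all illustrative)
model = nn.Linear(4, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
x, y = torch.randn(8, 4), torch.randn(8, 1)

# 1) forward pass: compute the loss
loss = nn.functional.mse_loss(model(x), y)

# 2) backward pass: compute the gradients
optimizer.zero_grad()
loss.backward()

# 3) update the model using the optimizer
optimizer.step()
```

On a GPU, the same loop applies after moving the model and batch to the device with `.to("cuda")`.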
- PyTorch DDP: DistributedDataParallel
- used when the entire model fits on one GPU, but you want to distribute training across multiple GPUs
- how it works:
- each GPU is initialized with the same initial model and optimizer state; a full copy of the model weights is stored on every GPU
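A minimal sketch of that initialization step, assuming a single process with the CPU `gloo` backend for illustration (a real multi-GPU job would launch one process per GPU, e.g. with `torchrun`, and typically use the `nccl` backend):

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# single-process setup for illustration; a launcher like torchrun
# normally sets the address/port and per-process rank
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group("gloo", rank=0, world_size=1)

# each process builds the same model; wrapping it in DDP broadcasts
# rank 0's weights so every replica starts from identical parameters
model = torch.nn.Linear(4, 1)
ddp_model = DDP(model)

# each process also builds its own optimizer over the replicated weights
optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.1)

dist.destroy_process_group()
```

From here, each process runs the usual forward/backward/step loop on its own data shard, with DDP synchronizing gradients during the backward pass.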