[feature] Multi-GPU DistributedDataParallel Fixed #496
GitHub pull request #496 of commit b97a3eac5535f5419c18bc3836bc925d30a69323, no merge conflicts.
Running as SYSTEM
Setting status of b97a3eac5535f5419c18bc3836bc925d30a69323 to PENDING with url http://10.20.17.181:8080/job/transformers4rec_tests/198/ and message: 'Build started for merge commit.'
Using context: Jenkins Unit Test Run
Building on master in workspace /var/jenkins_home/workspace/transformers4rec_tests
using credential nvidia-merlin-bot
Cloning the remote Git repository
Cloning repository https://github.com/NVIDIA-Merlin/Transformers4Rec.git
> git init /var/jenkins_home/workspace/transformers4rec_tests/transformers4rec # timeout=10
Fetching upstream changes from https://github.com/NVIDIA-Merlin/Transformers4Rec.git
> git --version # timeout=10
using GIT_ASKPASS to set credentials This is the bot credentials for our CI/CD
> git fetch --tags --force --progress -- https://github.com/NVIDIA-Merlin/Transformers4Rec.git +refs/heads/*:refs/remotes/origin/* # timeout=10
> git config remote.origin.url https://github.com/NVIDIA-Merlin/Transformers4Rec.git # timeout=10
> git config --add remote.origin.fetch +refs/heads/*:refs/remotes/origin/* # timeout=10
> git config remote.origin.url https://github.com/NVIDIA-Merlin/Transformers4Rec.git # timeout=10
Fetching upstream changes from https://github.com/NVIDIA-Merlin/Transformers4Rec.git
using GIT_ASKPASS to set credentials This is the bot credentials for our CI/CD
> git fetch --tags --force --progress -- https://github.com/NVIDIA-Merlin/Transformers4Rec.git +refs/pull/496/*:refs/remotes/origin/pr/496/* # timeout=10
> git rev-parse b97a3eac5535f5419c18bc3836bc925d30a69323^{commit} # timeout=10
Checking out Revision b97a3eac5535f5419c18bc3836bc925d30a69323 (detached)
> git config core.sparsecheckout # timeout=10
> git checkout -f b97a3eac5535f5419c18bc3836bc925d30a69323 # timeout=10
Commit message: "Multi-GPU DistributedDataParallel Fixed"
> git rev-list --no-walk f2a1cd5770f0d65274792b7142d4d8fd1b756761 # timeout=10
First time build. Skipping changelog.
[transformers4rec_tests] $ /bin/bash /tmp/jenkins7094064427877866399.sh
============================= test session starts ==============================
platform linux -- Python 3.8.10, pytest-7.1.3, pluggy-1.0.0
rootdir: /var/jenkins_home/workspace/transformers4rec_tests/transformers4rec
plugins: anyio-3.6.1, xdist-2.5.0, forked-1.4.0, cov-4.0.0
collected 1 item
GitHub pull request #496 of commit ff2561f94c0f1afcafcd39ef85b89eaf4e2bc7d2, no merge conflicts.
Running as SYSTEM
Setting status of ff2561f94c0f1afcafcd39ef85b89eaf4e2bc7d2 to PENDING with url http://10.20.17.181:8080/job/transformers4rec_tests/199/ and message: 'Build started for merge commit.'
> git rev-parse ff2561f94c0f1afcafcd39ef85b89eaf4e2bc7d2^{commit} # timeout=10
Checking out Revision ff2561f94c0f1afcafcd39ef85b89eaf4e2bc7d2 (detached)
> git config core.sparsecheckout # timeout=10
> git checkout -f ff2561f94c0f1afcafcd39ef85b89eaf4e2bc7d2 # timeout=10
Commit message: "fixed formatting problems"
> git rev-list --no-walk b97a3eac5535f5419c18bc3836bc925d30a69323 # timeout=10
[transformers4rec_tests] $ /bin/bash /tmp/jenkins13360986045226705283.sh
============================= test session starts ==============================
platform linux -- Python 3.8.10, pytest-7.1.3, pluggy-1.0.0
rootdir: /var/jenkins_home/workspace/transformers4rec_tests/transformers4rec
plugins: anyio-3.6.1, xdist-2.5.0, forked-1.4.0, cov-4.0.0
collected 1 item
GitHub pull request #496 of commit d787c9ee19c27960c040c416a30e2d0c00a67b89, no merge conflicts.
Running as SYSTEM
Setting status of d787c9ee19c27960c040c416a30e2d0c00a67b89 to PENDING with url http://10.20.17.181:8080/job/transformers4rec_tests/200/ and message: 'Build started for merge commit.'
> git rev-parse d787c9ee19c27960c040c416a30e2d0c00a67b89^{commit} # timeout=10
Checking out Revision d787c9ee19c27960c040c416a30e2d0c00a67b89 (detached)
> git config core.sparsecheckout # timeout=10
> git checkout -f d787c9ee19c27960c040c416a30e2d0c00a67b89 # timeout=10
Commit message: "fixed with black and isort"
> git rev-list --no-walk ff2561f94c0f1afcafcd39ef85b89eaf4e2bc7d2 # timeout=10
[transformers4rec_tests] $ /bin/bash /tmp/jenkins2715036970437953748.sh
============================= test session starts ==============================
platform linux -- Python 3.8.10, pytest-7.1.3, pluggy-1.0.0
rootdir: /var/jenkins_home/workspace/transformers4rec_tests/transformers4rec
plugins: anyio-3.6.1, xdist-2.5.0, forked-1.4.0, cov-4.0.0
collected 1 item
Documentation preview: https://nvidia-merlin.github.io/Transformers4Rec/review/pr-496
self.set_dataset(buffer_size, engine, reader_kwargs)

if (global_rank is not None) and (self.dataset.npartitions < global_size):
Add a warning here instructing the user to save the parquet files in multiple partitions (row groups) for better performance. In the warning we can include an example of how to save with pandas or cuDF:
df.to_parquet("filename.parquet", row_group_size=10000, engine="pyarrow")
The final number of partitions = number of rows / row_group_size
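As a rough sketch of the warning suggested above (not the code merged in this PR), the check and message could look like the following. The helper names `expected_row_groups` and `warn_if_too_few_partitions` are hypothetical, and the sketch assumes the partition count follows the row-group arithmetic stated in the comment, rounded up because the last row group may be partial.

```python
import logging
import math

logger = logging.getLogger(__name__)


def expected_row_groups(n_rows: int, row_group_size: int) -> int:
    """Row groups written for n_rows rows; the last group may be smaller."""
    return math.ceil(n_rows / row_group_size)


def warn_if_too_few_partitions(npartitions: int, global_size: int) -> bool:
    """Hypothetical helper: warn when there are fewer partitions than GPUs.

    Returns True if a warning was logged, False otherwise.
    """
    if npartitions < global_size:
        logger.warning(
            "The dataset has %d partition(s) but %d GPUs are in use. "
            "Consider saving the parquet files with more row groups, e.g. "
            'df.to_parquet("filename.parquet", row_group_size=10000, engine="pyarrow")',
            npartitions,
            global_size,
        )
        return True
    return False
```

For example, 25,000 rows written with `row_group_size=10000` would give three row groups under this arithmetic, which would still trigger the warning on four GPUs.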
  def compute(self):
      # Computing the mean of the batch metrics (for each cut-off at topk)
-     return torch.cat(self.metric_mean, axis=0).mean(0)
+     return dim_zero_cat(self.metric_mean).mean(0)
I think the fix is OK, as dim_zero_cat might be able to deal with both lists and tensors.
I am curious whether you have tried setting self.add_state(..., dist_reduce_fx="mean"), and whether it gives the same accuracy with faster compute: if metrics are averaged per GPU before being synced in compute(), less data would have to be communicated among GPUs.
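On the accuracy half of that question, a small arithmetic sketch may help. It does not call torchmetrics at all; it only compares the two reduction orders on made-up numbers. The function names and the values are hypothetical, and whether the ranks see equal amounts of data in practice is an assumption that would need checking.

```python
from statistics import mean


def global_mean(per_gpu_values):
    """Concatenate every GPU's batch metrics, then average once."""
    flat = [value for gpu in per_gpu_values for value in gpu]
    return mean(flat)


def mean_of_means(per_gpu_values):
    """Average on each GPU first, then average the per-GPU results."""
    return mean(mean(gpu) for gpu in per_gpu_values)


# Made-up metric values: two GPUs with the same number of batches each.
balanced = [[0.25, 0.75], [0.5, 1.0]]
# Same values, but one GPU processed three batches and the other only one.
unbalanced = [[0.25, 0.5, 0.75], [1.0]]

print(global_mean(balanced), mean_of_means(balanced))      # same value
print(global_mean(unbalanced), mean_of_means(unbalanced))  # values differ
```

The two orders agree when every GPU contributes the same number of values and can drift apart when they do not, so averaging per GPU first trades some exactness for less communication.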
GitHub pull request #496 of commit 7e7e5e3e90d0c972e80303dc8f4422e132ac6b2a, no merge conflicts.
Running as SYSTEM
Setting status of 7e7e5e3e90d0c972e80303dc8f4422e132ac6b2a to PENDING with url http://10.20.17.181:8080/job/transformers4rec_tests/201/ and message: 'Build started for merge commit.'
> git rev-parse 7e7e5e3e90d0c972e80303dc8f4422e132ac6b2a^{commit} # timeout=10
Checking out Revision 7e7e5e3e90d0c972e80303dc8f4422e132ac6b2a (detached)
> git config core.sparsecheckout # timeout=10
> git checkout -f 7e7e5e3e90d0c972e80303dc8f4422e132ac6b2a # timeout=10
Commit message: "added user warning to repartition"
> git rev-list --no-walk d787c9ee19c27960c040c416a30e2d0c00a67b89 # timeout=10
[transformers4rec_tests] $ /bin/bash /tmp/jenkins5153853890988784195.sh
============================= test session starts ==============================
platform linux -- Python 3.8.10, pytest-7.1.3, pluggy-1.0.0
rootdir: /var/jenkins_home/workspace/transformers4rec_tests/transformers4rec
plugins: anyio-3.6.1, xdist-2.5.0, forked-1.4.0, cov-4.0.0
collected 1 item
GitHub pull request #496 of commit 3dded907d3d42ef32f30c4e77126889d8e1b6af4, no merge conflicts.
Running as SYSTEM
Setting status of 3dded907d3d42ef32f30c4e77126889d8e1b6af4 to PENDING with url http://10.20.17.181:8080/job/transformers4rec_tests/207/ and message: 'Build started for merge commit.'
> git rev-parse 3dded907d3d42ef32f30c4e77126889d8e1b6af4^{commit} # timeout=10
Checking out Revision 3dded907d3d42ef32f30c4e77126889d8e1b6af4 (detached)
> git config core.sparsecheckout # timeout=10
> git checkout -f 3dded907d3d42ef32f30c4e77126889d8e1b6af4 # timeout=10
Commit message: "changed print to logger.warning"
> git rev-list --no-walk cd455a1ab814ca2f6332069cdb673f3e28200306 # timeout=10
[transformers4rec_tests] $ /bin/bash /tmp/jenkins1459850529842728832.sh
============================= test session starts ==============================
platform linux -- Python 3.8.10, pytest-7.1.3, pluggy-1.0.0
rootdir: /var/jenkins_home/workspace/transformers4rec_tests/transformers4rec
plugins: anyio-3.6.1, xdist-2.5.0, forked-1.4.0, cov-4.0.0
collected 1 item
Fixes #456
Goals ⚽
- DistributedDataParallel training for next-item-prediction tasks using the transformers4rec.Trainer class for training.
- [...] DataParallel training).
Implementation Details 🚧
- DistributedDataParallel needs the number of partitions to be equal to or greater than the number of GPUs, i.e. dataloader.dataset.npartitions >= dataloader.args.global_size. We must check the number of partitions of the dataset and re-partition it if needed.
- In DistributedDataParallel mode, torch.cat(self.metric_mean, 0) failed. It was replaced by torchmetrics.utilities.data.dim_zero_cat(), which has the same functionality but works for DistributedDataParallel too.
- [...] global_rank and global_size must be passed to its constructor.
Testing Details 🔍
- For DistributedDataParallel training, make sure CUDA_VISIBLE_DEVICES is set correctly and run the script using torch distributed launch as shown below:
    python -m torch.distributed.launch --nproc_per_node $N_GPUS your_script.py --arguments
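
The partition requirement under Implementation Details can be illustrated with a small sketch. It assumes a simple round-robin assignment of partitions to ranks, which is only an illustration and not necessarily how the dataloader in this PR shards data; `partitions_for_rank` and `idle_ranks` are hypothetical names.

```python
def partitions_for_rank(npartitions: int, global_rank: int, global_size: int) -> list:
    """Partition indices a rank would read under round-robin assignment."""
    return list(range(global_rank, npartitions, global_size))


def idle_ranks(npartitions: int, global_size: int) -> list:
    """Ranks that would receive no partition at all."""
    return [
        rank
        for rank in range(global_size)
        if not partitions_for_rank(npartitions, rank, global_size)
    ]


# With 2 partitions and 4 GPUs, two ranks have nothing to read.
print(idle_ranks(npartitions=2, global_size=4))
# With 8 partitions and 4 GPUs, every rank gets two partitions.
print([partitions_for_rank(8, rank, 4) for rank in range(4)])
```

Under this assumption, any rank left without a partition would have no batches to process, which is why the dataset is checked against global_size and re-partitioned when needed.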