
Add feature to increase the number of host to device transfer threads #4693

Merged
miladm merged 9 commits into master from dataloader_change on Mar 9, 2023

Conversation

@chandrasekhard2
Collaborator

This config improves the ResNet training performance on TPU v4 chips.

Epoch 3 train begin 19:02:08
| Training Device=xla:0/1 Epoch=3 Step=5100 Loss=3.82812 Rate=2014.43 GlobalRate=1220.24 Time=19:02:14
| Training Device=xla:0/3 Epoch=3 Step=5100 Loss=3.39062 Rate=2014.43 GlobalRate=1223.24 Time=19:02:14
| Training Device=xla:0/0 Epoch=3 Step=5100 Loss=3.64062 Rate=2014.44 GlobalRate=1221.29 Time=19:02:14
| Training Device=xla:0/2 Epoch=3 Step=5100 Loss=3.34375 Rate=2014.61 GlobalRate=1221.54 Time=19:02:14
| Training Device=xla:0/0 Epoch=3 Step=5400 Loss=3.51562 Rate=2221.05 GlobalRate=1254.91 Time=19:02:30
| Training Device=xla:0/3 Epoch=3 Step=5400 Loss=3.40625 Rate=2221.03 GlobalRate=1256.85 Time=19:02:30
| Training Device=xla:0/1 Epoch=3 Step=5400 Loss=3.54688 Rate=2221.02 GlobalRate=1253.85 Time=19:02:30
| Training Device=xla:0/2 Epoch=3 Step=5400 Loss=3.43750 Rate=2221.59 GlobalRate=1255.17 Time=19:02:30
| Training Device=xla:0/1 Epoch=3 Step=5700 Loss=2.89062 Rate=2267.96 GlobalRate=1284.59 Time=19:02:47
| Training Device=xla:0/3 Epoch=3 Step=5700 Loss=3.50000 Rate=2267.92 GlobalRate=1287.56 Time=19:02:47
| Training Device=xla:0/0 Epoch=3 Step=5700 Loss=3.42188 Rate=2267.86 GlobalRate=1285.63 Time=19:02:47
| Training Device=xla:0/2 Epoch=3 Step=5700 Loss=3.45312 Rate=2268.25 GlobalRate=1285.89 Time=19:02:47
| Training Device=xla:0/0 Epoch=3 Step=6000 Loss=3.03125 Rate=2323.80 GlobalRate=1315.59 Time=19:03:03
| Training Device=xla:0/3 Epoch=3 Step=6000 Loss=3.32812 Rate=2323.76 GlobalRate=1317.51 Time=19:03:03
| Training Device=xla:0/1 Epoch=3 Step=6000 Loss=3.28125 Rate=2323.71 GlobalRate=1314.55 Time=19:03:03
| Training Device=xla:0/2 Epoch=3 Step=6000 Loss=3.37500 Rate=2323.66 GlobalRate=1315.84 Time=19:03:03
| Training Device=xla:0/1 Epoch=3 Step=6300 Loss=3.51562 Rate=2377.17 GlobalRate=1343.67 Time=19:03:19
| Training Device=xla:0/0 Epoch=3 Step=6300 Loss=3.12500 Rate=2377.16 GlobalRate=1344.70 Time=19:03:19
| Training Device=xla:0/3 Epoch=3 Step=6300 Loss=3.37500 Rate=2377.15 GlobalRate=1346.61 Time=19:03:19
| Training Device=xla:0/2 Epoch=3 Step=6300 Loss=3.12500 Rate=2377.08 GlobalRate=1344.95 Time=19:03:19
| Training Device=xla:0/3 Epoch=3 Step=6600 Loss=3.31250 Rate=2368.34 GlobalRate=1373.45 Time=19:03:35
| Training Device=xla:0/0 Epoch=3 Step=6600 Loss=3.20312 Rate=2368.35 GlobalRate=1371.56 Time=19:03:35
| Training Device=xla:0/1 Epoch=3 Step=6600 Loss=3.10938 Rate=2368.32 GlobalRate=1370.53 Time=19:03:35
| Training Device=xla:0/2 Epoch=3 Step=6600 Loss=3.04688 Rate=2367.79 GlobalRate=1371.79 Time=19:03:35
| Training Device=xla:0/3 Epoch=3 Step=6900 Loss=3.20312 Rate=2376.36 GlobalRate=1399.20 Time=19:03:51
| Training Device=xla:0/1 Epoch=3 Step=6900 Loss=3.20312 Rate=2376.36 GlobalRate=1396.30 Time=19:03:51
| Training Device=xla:0/0 Epoch=3 Step=6900 Loss=3.23438 Rate=2376.29 GlobalRate=1397.32 Time=19:03:51
| Training Device=xla:0/2 Epoch=3 Step=6900 Loss=3.26562 Rate=2376.85 GlobalRate=1397.57 Time=19:03:51
| Training Device=xla:0/2 Epoch=3 Step=7200 Loss=3.03125 Rate=2481.11 GlobalRate=1424.40 Time=19:04:06
| Training Device=xla:0/0 Epoch=3 Step=7200 Loss=3.18750 Rate=2480.62 GlobalRate=1424.14 Time=19:04:06
| Training Device=xla:0/3 Epoch=3 Step=7200 Loss=3.26562 Rate=2480.56 GlobalRate=1426.01 Time=19:04:06
| Training Device=xla:0/1 Epoch=3 Step=7200 Loss=3.17188 Rate=2480.55 GlobalRate=1423.12 Time=19:04:06
Epoch 3 train end 19:04:20
Epoch 3 test begin 19:04:20
| Test Device=xla:0/0 Step=0 Epoch=3 Time=19:04:21
| Test Device=xla:0/3 Step=0 Epoch=3 Time=19:04:22
| Test Device=xla:0/1 Step=0 Epoch=3 Time=19:04:22
| Test Device=xla:0/2 Step=0 Epoch=3 Time=19:04:22
Epoch 3 test end 19:04:28, Accuracy=31.63
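
For reference, a minimal sketch of how the new knob would be used from the data-loading side. This is an illustration rather than the benchmark script: the keyword name host_to_device_transfer_threads and the value 4 are assumptions taken from the PR title and diff, and the real runs use the ImageNet pipeline in test/test_train_mp_imagenet.py.

import torch
from torch.utils.data import DataLoader, TensorDataset
import torch_xla.core.xla_model as xm
import torch_xla.distributed.parallel_loader as pl

# Tiny stand-in dataset; a real run would use the ImageNet loader from the test script.
dataset = TensorDataset(
    torch.randn(512, 3, 224, 224), torch.zeros(512, dtype=torch.int64))
loader = DataLoader(dataset, batch_size=128)

device = xm.xla_device()
# MpDeviceLoader forwards extra keyword arguments to ParallelLoader, which is
# where the number of host-to-device transfer threads is assumed to be exposed.
device_loader = pl.MpDeviceLoader(
    loader, device, host_to_device_transfer_threads=4)  # illustrative value

for step, (data, target) in enumerate(device_loader):
    pass  # forward/backward/optimizer step would go here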

Comment thread test/test_train_mp_imagenet.py Outdated
 DEFAULT_KWARGS = dict(
     batch_size=128,
-    test_set_batch_size=64,
+    test_set_batch_size=128,
Collaborator

does this work on v3 too?

Collaborator Author

It works on v3 but I haven't checked on v2. I will revert this.

Comment thread test/test_train_mp_imagenet.py Outdated
1. It is recommended to use this config in conjunction with the XLA_USE_BF16=1 flag.
2. Hyperparameters can be tuned to further improve the accuracy.
'''
OPTIMIZED_KWARGS = dict(
Contributor

This is great, thanks @chandrasekhard2
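
As a usage note on the excerpt above: XLA_USE_BF16 is an environment variable read by torch_xla, so it is normally exported when launching the script (e.g. XLA_USE_BF16=1 python test/test_train_mp_imagenet.py --fake_data). A sketch of setting it from Python instead, before torch_xla is touched:

import os

# Ask torch_xla to map torch.float32 tensors to bfloat16 on the TPU.
# Set this before torch_xla creates any tensors, i.e. at the very top of the script.
os.environ["XLA_USE_BF16"] = "1"

import torch_xla.core.xla_model as xm  # imported only after the env var is set

device = xm.xla_device()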

Collaborator

NOTE for future reference: the configs that work best on v4-8 are not expected to work as well on other hardware configurations (e.g., v4-128).

@chandrasekhard2 what do you think about renaming your OPTIMIZED_KWARGS variable? See below:

OPTIMIZED_KWARGS_v4_8 = dict(...
OPTIMIZED_KWARGS = OPTIMIZED_KWARGS_v4_8

Collaborator

I agree with renaming this to clarify the accelerator this was optimized for. If it was optimized for multiple v4 sizes, then we can name it OPTIMIZED_KWARGS_v4.

If we're going to start publishing optimized hyperparameters for specific accelerators, we should move them to some sort of config file system, e.g. gin.

The corresponding Flax and TensorFlow examples both implement something like this.
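
For illustration, a minimal sketch of what a gin-based setup could look like. The gin calls are standard gin-config usage; the function name, parameter names, values, and the resnet_v4.gin file are hypothetical, not the script's actual API.

import gin


@gin.configurable
def train_imagenet(batch_size=128, lr=0.1, num_epochs=18):
    print(f'training with batch_size={batch_size}, lr={lr}, num_epochs={num_epochs}')


# A per-accelerator file such as resnet_v4.gin would then carry the overrides:
#   train_imagenet.batch_size = 256
#   train_imagenet.lr = 0.4
gin.parse_config_file('resnet_v4.gin')
train_imagenet()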

Collaborator

@JackCaoG left a comment

Mostly LGTM. Can we post a more detailed fake-data + real-data benchmark with and without this PR? Does this impact convergence?

Collaborator

@miladm left a comment

Thanks @chandrasekhard2

Added a couple of comments.

Also, can you please document (in this PR) how much performance gain we should expect from this change?

Comment thread test/test_train_mp_imagenet.py Outdated
1. It is recommended to use this config in conjunction with the XLA_USE_BF16=1 flag.
2. Hyperparameters can be tuned to further improve the accuracy.
'''
OPTIMIZED_KWARGS = dict(
Collaborator

NOTE for future reference: the configs that work best on v4-8 are not expected to work as well on other hardware configurations (e.g., v4-128).

@chandrasekhard2 what do you think about renaming your OPTIMIZED_KWARGS variable? See below:

OPTIMIZED_KWARGS_v4_8 = dict(...
OPTIMIZED_KWARGS = OPTIMIZED_KWARGS_v4_8

@chandrasekhard2
Collaborator Author

> Thanks @chandrasekhard2
>
> Added a couple of comments.
>
> Also, can you please document (in this PR) how much performance gain we should expect from this change?

@miladm - This config improves the performance even on v4-128. It's just that there is still a gap when we compare it to the optimized TensorFlow ResNet.
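
For illustration, one way per-accelerator defaults could be kept selectable (the later commits on this PR add a flag to switch the v4-optimized config and allow multiple configs keyed by TPU version; the names and values below are hypothetical):

import argparse

DEFAULT_KWARGS = dict(batch_size=128, test_set_batch_size=64)
# Per-TPU-version overrides; only v4 has been tuned in this PR.
OPTIMIZED_KWARGS = {
    'v4': dict(DEFAULT_KWARGS, host_to_device_transfer_threads=4),
}

parser = argparse.ArgumentParser()
parser.add_argument('--tpu_version', default=None, choices=sorted(OPTIMIZED_KWARGS))
args, unknown = parser.parse_known_args()

# Fall back to the generic defaults when no tuned config exists for the accelerator.
kwargs = OPTIMIZED_KWARGS.get(args.tpu_version, DEFAULT_KWARGS)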

Comment thread test/test_train_mp_imagenet.py Outdated
@JackCaoG
Collaborator

@chandrasekhard2 Let's see if we can address the review comments and run some experiments before EOD tomorrow; we will cut the next RC soon and we only have ~2 weeks left before the release.

@JackCaoG requested review from miladm and will-cromar on March 6, 2023 at 22:01
Comment thread test/test_train_mp_imagenet.py Outdated
# Best config to achieve peak performance on TPU v4
# 1. It is recommended to use this config in conjunction with the XLA_USE_BF16=1 flag.
# 2. Hyperparameters can be tuned to further improve the accuracy.

Collaborator

remove empty line maybe?

Comment thread test/test_train_mp_imagenet.py Outdated
# 2. Hyperparameters can be tuned to further improve the accuracy.

OPTIMIZED_KWARGS_v4 = dict(
batch_size=128,
Collaborator

I thought we could do 256 for v4?

Collaborator Author

Step time would jump from 39ms to 82ms if we increase the batch size from 128 to 256. (3ms more)
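
For context, treating those as average per-step times on one core: 128 images / 0.039 s ≈ 3,280 images/s, while 256 images / 0.082 s ≈ 3,120 images/s, so doubling the batch size would actually reduce per-core throughput slightly rather than improve it.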

JackCaoG added a commit that referenced this pull request Mar 7, 2023
JackCaoG added a commit that referenced this pull request Mar 7, 2023
JackCaoG added a commit that referenced this pull request Mar 7, 2023
@JackCaoG
Collaborator

JackCaoG commented Mar 7, 2023

CI failure is unrelated; a rebase should solve it.

Collaborator

@miladm left a comment

Thanks @chandrasekhard2! LGTM.

@miladm merged commit ea135c6 into master on Mar 9, 2023
mateuszlewko pushed a commit that referenced this pull request Mar 15, 2023
…#4693)

* Add feature to increase the number of host to device transfer threads

* Revert test set batch_size to 64

* Rename the config name to OPTIMIZED_KWARGS_v4

* Change description to v4 instead of just v4-8 as this config improves ResNet performance even on v4 slices and pods

* remove extra line

* Add flag to switch v4 optimized config

* Modify name to more generalized way to keep it open for v5 config as well

* Add flexibility to define multiple configs based on TPU versions

* Keep the command consistent
ManfeiBai pushed a commit to ManfeiBai/PyTorchXLA that referenced this pull request Mar 29, 2023
…pytorch#4693)

* Add feature to increase the number of host to device transfer threads

* Revert test set batch_size to 64

* Rename the config name to OPTIMIZED_KWARGS_v4

* Change description to v4 instead of just v4-8 as this config improves ResNet performance even on v4 slices and pods

* remove extra line

* Add flag to switch v4 optimized config

* Modify name to more generalized way to keep it open for v5 config as well

* Add flexibility to define multiple configs based on TPU versions

* Keep the command consistent