
Add feature to increase the number of host to device transfer threads #4693

Merged
miladm merged 9 commits into master from dataloader_change on Mar 9, 2023

Conversation

@chandrasekhard2
Collaborator

This config improves the ResNet training performance on TPU v4 chips.

Epoch 3 train begin 19:02:08
| Training Device=xla:0/1 Epoch=3 Step=5100 Loss=3.82812 Rate=2014.43 GlobalRate=1220.24 Time=19:02:14
| Training Device=xla:0/3 Epoch=3 Step=5100 Loss=3.39062 Rate=2014.43 GlobalRate=1223.24 Time=19:02:14
| Training Device=xla:0/0 Epoch=3 Step=5100 Loss=3.64062 Rate=2014.44 GlobalRate=1221.29 Time=19:02:14
| Training Device=xla:0/2 Epoch=3 Step=5100 Loss=3.34375 Rate=2014.61 GlobalRate=1221.54 Time=19:02:14
| Training Device=xla:0/0 Epoch=3 Step=5400 Loss=3.51562 Rate=2221.05 GlobalRate=1254.91 Time=19:02:30
| Training Device=xla:0/3 Epoch=3 Step=5400 Loss=3.40625 Rate=2221.03 GlobalRate=1256.85 Time=19:02:30
| Training Device=xla:0/1 Epoch=3 Step=5400 Loss=3.54688 Rate=2221.02 GlobalRate=1253.85 Time=19:02:30
| Training Device=xla:0/2 Epoch=3 Step=5400 Loss=3.43750 Rate=2221.59 GlobalRate=1255.17 Time=19:02:30
| Training Device=xla:0/1 Epoch=3 Step=5700 Loss=2.89062 Rate=2267.96 GlobalRate=1284.59 Time=19:02:47
| Training Device=xla:0/3 Epoch=3 Step=5700 Loss=3.50000 Rate=2267.92 GlobalRate=1287.56 Time=19:02:47
| Training Device=xla:0/0 Epoch=3 Step=5700 Loss=3.42188 Rate=2267.86 GlobalRate=1285.63 Time=19:02:47
| Training Device=xla:0/2 Epoch=3 Step=5700 Loss=3.45312 Rate=2268.25 GlobalRate=1285.89 Time=19:02:47
| Training Device=xla:0/0 Epoch=3 Step=6000 Loss=3.03125 Rate=2323.80 GlobalRate=1315.59 Time=19:03:03
| Training Device=xla:0/3 Epoch=3 Step=6000 Loss=3.32812 Rate=2323.76 GlobalRate=1317.51 Time=19:03:03
| Training Device=xla:0/1 Epoch=3 Step=6000 Loss=3.28125 Rate=2323.71 GlobalRate=1314.55 Time=19:03:03
| Training Device=xla:0/2 Epoch=3 Step=6000 Loss=3.37500 Rate=2323.66 GlobalRate=1315.84 Time=19:03:03
| Training Device=xla:0/1 Epoch=3 Step=6300 Loss=3.51562 Rate=2377.17 GlobalRate=1343.67 Time=19:03:19
| Training Device=xla:0/0 Epoch=3 Step=6300 Loss=3.12500 Rate=2377.16 GlobalRate=1344.70 Time=19:03:19
| Training Device=xla:0/3 Epoch=3 Step=6300 Loss=3.37500 Rate=2377.15 GlobalRate=1346.61 Time=19:03:19
| Training Device=xla:0/2 Epoch=3 Step=6300 Loss=3.12500 Rate=2377.08 GlobalRate=1344.95 Time=19:03:19
| Training Device=xla:0/3 Epoch=3 Step=6600 Loss=3.31250 Rate=2368.34 GlobalRate=1373.45 Time=19:03:35
| Training Device=xla:0/0 Epoch=3 Step=6600 Loss=3.20312 Rate=2368.35 GlobalRate=1371.56 Time=19:03:35
| Training Device=xla:0/1 Epoch=3 Step=6600 Loss=3.10938 Rate=2368.32 GlobalRate=1370.53 Time=19:03:35
| Training Device=xla:0/2 Epoch=3 Step=6600 Loss=3.04688 Rate=2367.79 GlobalRate=1371.79 Time=19:03:35
| Training Device=xla:0/3 Epoch=3 Step=6900 Loss=3.20312 Rate=2376.36 GlobalRate=1399.20 Time=19:03:51
| Training Device=xla:0/1 Epoch=3 Step=6900 Loss=3.20312 Rate=2376.36 GlobalRate=1396.30 Time=19:03:51
| Training Device=xla:0/0 Epoch=3 Step=6900 Loss=3.23438 Rate=2376.29 GlobalRate=1397.32 Time=19:03:51
| Training Device=xla:0/2 Epoch=3 Step=6900 Loss=3.26562 Rate=2376.85 GlobalRate=1397.57 Time=19:03:51
| Training Device=xla:0/2 Epoch=3 Step=7200 Loss=3.03125 Rate=2481.11 GlobalRate=1424.40 Time=19:04:06
| Training Device=xla:0/0 Epoch=3 Step=7200 Loss=3.18750 Rate=2480.62 GlobalRate=1424.14 Time=19:04:06
| Training Device=xla:0/3 Epoch=3 Step=7200 Loss=3.26562 Rate=2480.56 GlobalRate=1426.01 Time=19:04:06
| Training Device=xla:0/1 Epoch=3 Step=7200 Loss=3.17188 Rate=2480.55 GlobalRate=1423.12 Time=19:04:06
Epoch 3 train end 19:04:20
Epoch 3 test begin 19:04:20
| Test Device=xla:0/0 Step=0 Epoch=3 Time=19:04:21
| Test Device=xla:0/3 Step=0 Epoch=3 Time=19:04:22
| Test Device=xla:0/1 Step=0 Epoch=3 Time=19:04:22
| Test Device=xla:0/2 Step=0 Epoch=3 Time=19:04:22
Epoch 3 test end 19:04:28, Accuracy=31.63
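
For reference, a minimal sketch of how the new knob would be used from the data-loading side. This is an illustration rather than the benchmark script: the keyword name host_to_device_transfer_threads and the value 4 are assumptions taken from the PR title and diff, and the real runs use the ImageNet pipeline in test/test_train_mp_imagenet.py.

import torch
from torch.utils.data import DataLoader, TensorDataset
import torch_xla.core.xla_model as xm
import torch_xla.distributed.parallel_loader as pl

# Tiny stand-in dataset; a real run would use the ImageNet loader from the test script.
dataset = TensorDataset(
    torch.randn(512, 3, 224, 224), torch.zeros(512, dtype=torch.int64))
loader = DataLoader(dataset, batch_size=128)

device = xm.xla_device()
# MpDeviceLoader forwards extra keyword arguments to ParallelLoader, which is
# where the number of host-to-device transfer threads is assumed to be exposed.
device_loader = pl.MpDeviceLoader(
    loader, device, host_to_device_transfer_threads=4)  # illustrative value

for step, (data, target) in enumerate(device_loader):
    pass  # forward/backward/optimizer step would go here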

Comment thread test/test_train_mp_imagenet.py Outdated
 DEFAULT_KWARGS = dict(
     batch_size=128,
-    test_set_batch_size=64,
+    test_set_batch_size=128,
Collaborator

does this work on v3 too?

Collaborator Author

It works on v3 but I haven't checked on v2. I will revert this.

Comment thread test/test_train_mp_imagenet.py Outdated
1. It is recommended to use this config in conjunction with the XLA_USE_BF16=1 flag.
2. Hyperparameters can be tuned to further improve the accuracy.
'''
OPTIMIZED_KWARGS = dict(
Contributor

This is great, thanks @chandrasekhard2
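
As a usage note on the excerpt above: XLA_USE_BF16 is an environment variable read by torch_xla, so it is normally exported when launching the script (e.g. XLA_USE_BF16=1 python test/test_train_mp_imagenet.py --fake_data). A sketch of setting it from Python instead, before torch_xla is touched:

import os

# Ask torch_xla to map torch.float32 tensors to bfloat16 on the TPU.
# Set this before torch_xla creates any tensors, i.e. at the very top of the script.
os.environ["XLA_USE_BF16"] = "1"

import torch_xla.core.xla_model as xm  # imported only after the env var is set

device = xm.xla_device()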

Collaborator

NOTE for future reference: the configs that work best on v4-8 are not expected to work as well on other hardware configurations (e.g., v4-128).

@chandrasekhard2 what do you think about renaming your OPTIMIZED_KWARGS variable? See below:

OPTIMIZED_KWARGS_v4_8 = dict(...
OPTIMIZED_KWARGS = OPTIMIZED_KWARGS_v4_8

Collaborator

I agree with renaming this to clarify the accelerator this was optimized for. If it was optimized for multiple v4 sizes, then we can name it OPTIMIZED_KWARGS_v4.

If we're going to start publishing optimized hyperparameters for specific accelerators, we should move them to some sort of config file system, e.g. gin.

The corresponding Flax and TensorFlow examples both implement something like this.
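
For illustration, a minimal sketch of what a gin-based setup could look like. The gin calls are standard gin-config usage; the function name, parameter names, values, and the resnet_v4.gin file are hypothetical, not the script's actual API.

import gin


@gin.configurable
def train_imagenet(batch_size=128, lr=0.1, num_epochs=18):
    print(f'training with batch_size={batch_size}, lr={lr}, num_epochs={num_epochs}')


# A per-accelerator file such as resnet_v4.gin would then carry the overrides:
#   train_imagenet.batch_size = 256
#   train_imagenet.lr = 0.4
gin.parse_config_file('resnet_v4.gin')
train_imagenet()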

Collaborator

@JackCaoG left a comment

Mostly LGTM. Can we post a more detailed fake-data + real-data benchmark with and without this PR? Does this impact convergence?

Collaborator

@miladm left a comment

Thanks @chandrasekhard2

Added a couple of comments.

Also, can you please document (in this PR) how much performance gain we should expect from this change?

Comment thread test/test_train_mp_imagenet.py Outdated
1. It is recommended to use this config in conjunction with the XLA_USE_BF16=1 flag.
2. Hyperparameters can be tuned to further improve the accuracy.
'''
OPTIMIZED_KWARGS = dict(
Collaborator

NOTE for future reference: the configs that work best on v4-8 are not expected to work as well on other hardware configurations (e.g., v4-128).

@chandrasekhard2 what do you think about renaming your OPTIMIZED_KWARGS variable? See below:

OPTIMIZED_KWARGS_v4_8 = dict(...
OPTIMIZED_KWARGS = OPTIMIZED_KWARGS_v4_8

@chandrasekhard2
Collaborator Author

> Thanks @chandrasekhard2
>
> Added a couple of comments.
>
> Also, can you please document (in this PR) how much performance gain we should expect from this change?

@miladm - This config improves the performance even on v4-128. It's just that there is still a gap when we compare it to the optimized TensorFlow ResNet.
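
For illustration, one way per-accelerator defaults could be kept selectable (the later commits on this PR add a flag to switch the v4-optimized config and allow multiple configs keyed by TPU version; the names and values below are hypothetical):

import argparse

DEFAULT_KWARGS = dict(batch_size=128, test_set_batch_size=64)
# Per-TPU-version overrides; only v4 has been tuned in this PR.
OPTIMIZED_KWARGS = {
    'v4': dict(DEFAULT_KWARGS, host_to_device_transfer_threads=4),
}

parser = argparse.ArgumentParser()
parser.add_argument('--tpu_version', default=None, choices=sorted(OPTIMIZED_KWARGS))
args, unknown = parser.parse_known_args()

# Fall back to the generic defaults when no tuned config exists for the accelerator.
kwargs = OPTIMIZED_KWARGS.get(args.tpu_version, DEFAULT_KWARGS)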

Comment thread test/test_train_mp_imagenet.py Outdated
@JackCaoG
Collaborator

@chandrasekhard2 Let's see if we can address the review comments and run some experiments before EOD tomorrow; we will cut the next RC soon and we only have ~2 weeks left before the release.

@JackCaoG requested review from miladm and will-cromar on March 6, 2023 at 22:01
Comment thread test/test_train_mp_imagenet.py Outdated
# Best config to achieve peak performance on TPU v4
# 1. It is recommended to use this config in conjunction with the XLA_USE_BF16=1 flag.
# 2. Hyperparameters can be tuned to further improve the accuracy.

Collaborator

remove empty line maybe?

Comment thread test/test_train_mp_imagenet.py Outdated
# 2. Hyperparameters can be tuned to further improve the accuracy.

OPTIMIZED_KWARGS_v4 = dict(
batch_size=128,
Collaborator

I thought we could do 256 for v4?

Collaborator Author

Step time would jump from 39ms to 82ms if we increase the batch size from 128 to 256. (3ms more)
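
For context, treating those as average per-step times on one core: 128 images / 0.039 s ≈ 3,280 images/s, while 256 images / 0.082 s ≈ 3,120 images/s, so doubling the batch size would actually reduce per-core throughput slightly rather than improve it.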

JackCaoG added a commit that referenced this pull request Mar 7, 2023
JackCaoG added a commit that referenced this pull request Mar 7, 2023
JackCaoG added a commit that referenced this pull request Mar 7, 2023
@JackCaoG
Collaborator

JackCaoG commented Mar 7, 2023

CI failure is unrelated; a rebase should solve it.

Collaborator

@miladm left a comment

Thanks @chandrasekhard2! LGTM.

@miladm merged commit ea135c6 into master on Mar 9, 2023
mateuszlewko pushed a commit that referenced this pull request Mar 15, 2023
…#4693)

* Add feature to increase the number of host to device transfer threads

* Revert test set batch_size to 64

* Rename the config name to OPTIMIZED_KWARGS_v4

* Change description to v4 instead of just v4-8 as this config improves ResNet performance even on v4 slices and pods

* remove extra line

* Add flag to switch v4 optimized config

* Modify name to more generalized way to keep it open for v5 config as well

* Add flexibility to define multiple configs based on TPU versions

* Keep the command consistent
ManfeiBai pushed a commit to ManfeiBai/PyTorchXLA that referenced this pull request Mar 29, 2023
…pytorch#4693)

* Add feature to increase the number of host to device transfer threads

* Revert test set batch_size to 64

* Rename the config name to OPTIMIZED_KWARGS_v4

* Change description to v4 instead of just v4-8 as this config improves ResNet performance even on v4 slices and pods

* remove extra line

* Add flag to switch v4 optimized config

* Modify name to more generalized way to keep it open for v5 config as well

* Add flexibility to define multiple configs based on TPU versions

* Keep the command consistent