Add feature to increase the number of host to device transfer threads #4693
Conversation
DEFAULT_KWARGS = dict(
    batch_size=128,
-   test_set_batch_size=64,
+   test_set_batch_size=128,
does this work on v3 too?
It works on v3 but I haven't checked on v2. I will revert this.
1. It is recommended to use this config in conjunction with XLA_USE_BF16=1 Flag.
2. Hyperparameters can be tuned to further improve the accuracy.
'''
OPTIMIZED_KWARGS = dict(
NOTE for future reference: the configs that work best on v4-8 are not expected to work as well on other hardware configs (e.g. v4-128)
@chandrasekhard2 wdyt about re-naming your OPTIMIZED_KWARGS variable? See below:
OPTIMIZED_KWARGS_v4_8 = dict(...
OPTIMIZED_KWARGS = OPTIMIZED_KWARGS_v4_8
I agree with renaming this to clarify the accelerator this was optimized for. If it was optimized for multiple v4 sizes, then we can name it OPTIMIZED_KWARGS_v4.
If we're going to start publishing optimized hyperparameters for specific accelerators, we should move them to some sort of config file system, e.g. gin.
The corresponding Flax and TensorFlow examples both implement something like this.
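A minimal sketch of the per-accelerator config idea discussed above; the dictionary contents, the CONFIGS_BY_TPU_VERSION table, and the select_kwargs helper are illustrative assumptions, not the code actually added in this PR:

# Hypothetical sketch: keep one tuned dict per TPU version and fall back to the defaults.
# Names below (CONFIGS_BY_TPU_VERSION, select_kwargs) are not from this PR.
DEFAULT_KWARGS = dict(batch_size=128, test_set_batch_size=64)
OPTIMIZED_KWARGS_v4 = dict(DEFAULT_KWARGS, batch_size=128)  # tuned on v4; may not transfer to other hardware

CONFIGS_BY_TPU_VERSION = {
    'v4': OPTIMIZED_KWARGS_v4,
}

def select_kwargs(tpu_version):
    # Unknown or older TPU versions keep the default hyperparameters.
    return CONFIGS_BY_TPU_VERSION.get(tpu_version, DEFAULT_KWARGS)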
JackCaoG
left a comment
Mostly LGTM. Can we post a more detailed fake-data + real-data benchmark with and without this PR? Does this impact convergence?
miladm
left a comment
Thanks @chandrasekhard2
Added a couple of comments.
Also, can you please document (in this PR) how much performance gain we should expect from this PR?
@miladm - This config improves the performance even on v4-128. It's just that there is still a gap if we compare it to the optimized TensorFlow ResNet.
@chandrasekhard2 Let's see if we can address the review comments and run some experiments before EOD tomorrow; we will cut the next RC soon and we only have ~2 weeks left before release.
# Best config to achieve peak performance on TPU v4
# 1. It is recommended to use this config in conjunction with XLA_USE_BF16=1 Flag.
# 2. Hyperparameters can be tuned to further improve the accuracy.
OPTIMIZED_KWARGS_v4 = dict(
    batch_size=128,
I thought we could do 256 for v4?
Step time would jump from 39ms to 82ms if we increase the batch size from 128 to 256 (about 3ms more per 128 images).
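For context, the step times quoted above imply a slightly lower per-core throughput at batch size 256, which is why the larger batch is not a clear win; the arithmetic below just restates the figures from this thread:

# Images per second per core implied by the quoted step times.
print(128 / 0.039)  # batch_size=128 at 39 ms/step -> ~3282 img/s
print(256 / 0.082)  # batch_size=256 at 82 ms/step -> ~3122 img/s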
CI failure is unrelated; a rebase would solve it.
… resnet performance even on v4 slices and pods
fde638b to 40f55a2
miladm
left a comment
Thanks @chandrasekhard2! LGTM.
…#4693)
* Add feature to increase the number of host to device transfer threads
* Revert test set batch_size to 64
* Rename the config name to OPTIMIZED_KWARGS_v4
* Change description to v4 instead of just v4-8 as this config improves resnet performance even on v4 slices and pods
* remove extra line
* Add flag to switch v4 optimized config
* Modify name to more generalized way to keep it open for v5 config as well
* Add flexibility to define multiple configs based on TPU versions
* Keep the command consistent
This config improves the ResNet training performance on TPU v4 chips.
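Since the config comments recommend pairing it with XLA_USE_BF16=1 (which makes PyTorch/XLA store float32 tensors as bfloat16 on TPU), a training script could surface that recommendation with a small check along these lines; this is a sketch, not code from the PR:

import os

# OPTIMIZED_KWARGS_v4 above is tuned assuming XLA_USE_BF16=1 is set.
if os.environ.get('XLA_USE_BF16') != '1':
    print('Warning: the v4-optimized config is recommended together with XLA_USE_BF16=1')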