improve performance of unique with inverse indices #16145
ngimel wants to merge 3 commits into pytorch:master
Conversation
Before I land this, could you please locally check correctness against the CPU as well? The unit tests only cover small_3d test cases.
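For reference, a minimal sketch of such a CPU-vs-CUDA correctness check might look like the following (the input shape and value range here are illustrative assumptions, not the actual test):

```python
import torch

# Compare CUDA unique/inverse results against the CPU implementation.
x_cpu = torch.randint(0, 1000, (100_000,))
x_cuda = x_cpu.cuda()

u_cpu, inv_cpu = torch.unique(x_cpu, sorted=True, return_inverse=True)
u_cuda, inv_cuda = torch.unique(x_cuda, sorted=True, return_inverse=True)

# Both the unique values and the inverse indices should match exactly.
assert torch.equal(u_cpu, u_cuda.cpu())
assert torch.equal(inv_cpu, inv_cuda.cpu())
```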
This is amazing -- thank you so much @ngimel for the quick fix!
@soumith tested against Justin's benchmark cases, all clean. Do you want some tests added to the test suite?
@ngimel your testing sounds good, and there are smoke tests with small_3d matrices in test_torch / test_cuda which are hopefully sufficient. If you see anything worth adding as a unit test, do send a follow-up PR. I'll land this now.
facebook-github-bot left a comment
@soumith is landing this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
Summary: Partial fix for #15804, only w/o dim. For jcjohnson's benchmarking script I'm getting the following results on V100:

Before:
```
Running with N = 10000, M = 10000
cuda (no inverse): 0.98 ms
cpu (no inverse): 0.96 ms
cuda (with inverse): 1.07 ms
cpu (with inverse): 1.76 ms
Running with N = 10000, M = 100000
cuda (no inverse): 0.76 ms
cpu (no inverse): 1.53 ms
cuda (with inverse): 1.23 ms
cpu (with inverse): 3.02 ms
Running with N = 100000, M = 100000
cuda (no inverse): 1.28 ms
cpu (no inverse): 11.22 ms
cuda (with inverse): 69.76 ms
cpu (with inverse): 20.28 ms
Running with N = 100000, M = 1000000
cuda (no inverse): 0.78 ms
cpu (no inverse): 18.78 ms
cuda (with inverse): 133.45 ms
cpu (with inverse): 34.09 ms
Running with N = 500000, M = 500000
cuda (no inverse): 1.43 ms
cpu (no inverse): 61.13 ms
cuda (with inverse): 3315.18 ms
cpu (with inverse): 104.57 ms
Running with N = 500000, M = 5000000
cuda (no inverse): 0.86 ms
cpu (no inverse): 96.44 ms
cuda (with inverse): 5209.93 ms
cpu (with inverse): 176.10 ms
```

After:
```
Running with N = 10000, M = 10000
cuda (no inverse): 1.04 ms
cpu (no inverse): 0.94 ms
cuda (with inverse): 0.64 ms
cpu (with inverse): 1.76 ms
Running with N = 10000, M = 100000
cuda (no inverse): 0.77 ms
cpu (no inverse): 1.55 ms
cuda (with inverse): 0.58 ms
cpu (with inverse): 2.79 ms
Running with N = 100000, M = 100000
cuda (no inverse): 1.30 ms
cpu (no inverse): 14.15 ms
cuda (with inverse): 1.63 ms
cpu (with inverse): 20.90 ms
Running with N = 100000, M = 1000000
cuda (no inverse): 0.82 ms
cpu (no inverse): 18.63 ms
cuda (with inverse): 0.61 ms
cpu (with inverse): 33.52 ms
Running with N = 500000, M = 500000
cuda (no inverse): 1.51 ms
cpu (no inverse): 59.81 ms
cuda (with inverse): 1.23 ms
cpu (with inverse): 110.69 ms
Running with N = 500000, M = 5000000
cuda (no inverse): 0.92 ms
cpu (no inverse): 104.26 ms
cuda (with inverse): 0.84 ms
cpu (with inverse): 187.12 ms
```

Pull Request resolved: pytorch/pytorch#16145
Differential Revision: D13738821
Pulled By: soumith
fbshipit-source-id: 0811fb4ade47e3b466cebbc124e3f3333a986749
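For context, a timing loop in the spirit of the numbers above could look like this. It is a sketch under assumed parameters, not jcjohnson's actual script: N is the number of distinct candidate values, M the input length, and the iteration count is arbitrary.

```python
import time
import torch

def bench_unique(N, M, device, with_inverse, iters=10):
    # M elements drawn from N distinct candidate values, matching the N/M pairs above.
    x = torch.randint(0, N, (M,), device=device)
    torch.unique(x, return_inverse=with_inverse)  # warm-up
    if device == "cuda":
        torch.cuda.synchronize()
    start = time.time()
    for _ in range(iters):
        torch.unique(x, return_inverse=with_inverse)
    if device == "cuda":
        torch.cuda.synchronize()  # wait for kernels before stopping the clock
    return (time.time() - start) / iters * 1000  # average ms per call

for N, M in [(10_000, 10_000), (100_000, 1_000_000)]:
    print(f"Running with N = {N}, M = {M}")
    for device in ("cuda", "cpu"):
        for inv in (False, True):
            label = "with inverse" if inv else "no inverse"
            print(f"{device} ({label}): {bench_unique(N, M, device, inv):.2f} ms")
```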
@ngimel @jcjohnson Is it possible to have the …
Yes, it's possible, my original commit was returning …
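For readers following the thread, a small example of what the inverse indices returned by `torch.unique` look like:

```python
import torch

x = torch.tensor([3, 1, 3, 2, 1], device="cuda")
uniq, inverse = torch.unique(x, sorted=True, return_inverse=True)
# uniq:    tensor([1, 2, 3], device='cuda:0')
# inverse: tensor([2, 0, 2, 1, 0], device='cuda:0')
# Each entry of `inverse` indexes into `uniq`, so uniq[inverse]
# reconstructs the original input:
assert torch.equal(uniq[inverse], x)
```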
Summary:
Pull Request resolved: #537

pytorch/pytorch#8899 had added CUDA support for `torch.unique()`. pytorch/pytorch#16145 has some timing stats that could be relevant.

---

Experiment results: https://fb.quip.com/olQOA853j0mb

Words per second (`gpu-unique_wps_avg_vs_base`): 1.046x
Total train time (`gpu-unique_total_train_time_vs_base`; excl. ar_AR-fr_XX): 0.987x

Even though the train-time reduction is pretty minimal (probably overshadowed by random variance, scheduling delay, etc.), WPS does seem to be ~5% faster - so might as well land this.

Training time for ar_AR-fr_XX increased significantly - but that's because it trained for many more updates (`gpu-unique_num_updates_avg_vs_base`) - and it also ended up with +1.43 BLEU. I think this is probably just an anomaly?

Reviewed By: akinh, jmp84
Differential Revision: D15073468
fbshipit-source-id: c2dba562b6d4fb4d15d2a56d03ce6a6e3ddff07d