Rename sparse contiguous() to coalesce(); make out of place; speed up THC Embedding 10x (#1302)
I made some modifications to make sparse accumulation into dense gradients on GPU a lot faster (~10x) by avoiding coalescing.

Here are my previous notes from #1147: I dug into speeding up sparse for nn.Embedding based on my benchmark on GPU (https://gist.github.com/adamlerer/865c1a09000c7dc8208e1456255209c2). I think my findings apply more broadly, though.
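The trick above can be sketched in plain Python (this is an illustration of the accumulation strategy, not the actual THC kernel): scatter-add each (index, row) pair of an uncoalesced sparse gradient straight into the dense weight. Duplicate indices need no special handling, because each occurrence just adds again, which yields exactly the sum that a coalesce pass would have computed up front.

```python
# Hedged pure-Python sketch (not the real CUDA kernel): accumulate an
# uncoalesced sparse gradient directly into a dense weight matrix.

def scatter_add_rows(dense, indices, rows):
    """dense[i] += row for each (i, row) pair; duplicates simply accumulate."""
    for i, row in zip(indices, rows):
        dense[i] = [d + r for d, r in zip(dense[i], row)]
    return dense

# Row 2 appears twice, as when the same embedding index occurs twice in a batch.
grad_indices = [2, 0, 2]
grad_rows = [[1.0, 1.0], [0.5, 0.5], [2.0, 2.0]]

weight = [[0.0, 0.0] for _ in range(4)]
scatter_add_rows(weight, grad_indices, grad_rows)
print(weight)  # row 2 ends up as [3.0, 3.0], the same sum coalescing would give
```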
Btw I'm still unsure about whether we should make coalesce in-place or out-of-place.

Pro in-place: out-of-place is going to be extremely inefficient.

Pro out-of-place: mutating a tensor under the user's feet in order to coalesce it can cause hard-to-track bugs.

Always coalesce: there's a third option, which is to always keep tensors in coalesced form. This is probably hard to make efficient. But it's possible that if we move …
Heisenbugs suck. I'll switch to out-of-place coalesce. It will be up to the user to coalesce tensors explicitly if they're going to be operating on them multiple times.
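The out-of-place semantics can be sketched in plain Python (a hedged illustration, not the tensor implementation): coalescing builds a fresh (indices, values) pair with sorted, duplicate-free indices and summed values, leaving the input representation untouched.

```python
# Hedged pure-Python sketch of an out-of-place coalesce: duplicates are
# summed into a new representation; the input is never mutated.

def coalesce(indices, values):
    """Return a new (sorted_indices, summed_values) pair; input is untouched."""
    acc = {}
    for i, v in zip(indices, values):
        acc[i] = acc.get(i, 0.0) + v
    keys = sorted(acc)
    return keys, [acc[k] for k in keys]

idx, val = [3, 1, 3], [1.0, 2.0, 4.0]
new_idx, new_val = coalesce(idx, val)
print(new_idx, new_val)  # [1, 3] [2.0, 5.0]
print(idx, val)          # original unchanged: [3, 1, 3] [1.0, 2.0, 4.0]
```

As in the PR, a caller that operates on the same tensor many times would call coalesce once up front and reuse the result.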
I'm not too qualified to review this diff, but some thoughts did occur to me:
Thanks @ezyang.

Re: duplicates, the current implementation must sum duplicates, because we take advantage of this to do some ops efficiently.
Note to self: in the torch tests, …
This is a first cut at documentation on sparse tensors, at the encouragement of @adamlerer. It uses the new "coalesce" terminology that is currently under review in #1302. There is no documentation for the methods; I've stuck to documenting aspects that are unlikely to change. Towards #1303.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>
This is ready for review and merge. @apaszke you'll note that you were right in #1147 to suggest making coalesce out-of-place. Re: Embedding perf, here are the results of the benchmark script at https://gist.github.com/adamlerer/4159ddb4bd19e87de1b4bb3215a1db6f (cf. 3s for `new` before this PR).
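The shape of that comparison can be mimicked with a pure-Python micro-benchmark (a hedged sketch, not the linked GPU gist): time a direct scatter-add of an uncoalesced gradient against a coalesce-first path, and check that both produce the same dense result.

```python
import time

# Hedged pure-Python micro-benchmark (illustrative only; the real benchmark
# runs THC Embedding on GPU): direct accumulation vs. coalesce-then-add.

def direct_add(dense, indices, values):
    for i, v in zip(indices, values):
        dense[i] += v

def coalesce_then_add(dense, indices, values):
    acc = {}
    for i, v in zip(indices, values):
        acc[i] = acc.get(i, 0.0) + v
    for i in sorted(acc):
        dense[i] += acc[i]

n, reps = 1000, 500
indices = [k % n for k in range(3 * n)]  # every index duplicated 3x
values = [1.0] * len(indices)

a = [0.0] * n
t0 = time.perf_counter()
for _ in range(reps):
    direct_add(a, indices, values)
print("direct:        ", time.perf_counter() - t0)

b = [0.0] * n
t0 = time.perf_counter()
for _ in range(reps):
    coalesce_then_add(b, indices, values)
print("coalesce first:", time.perf_counter() - t0)

assert a == b  # both strategies yield the same dense result
```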
👍
Rename sparse tensor `contiguous` to `coalesce`, and be more sane about in-place vs. out-of-place (i.e. always do it in place).

One thing I'm wondering: should the Python methods `tensor.indices()`, `tensor.values()`, and `tensor.__repr__()` automatically coalesce first? This would have the advantage of hiding the implementation detail of coalescing from the Python user, guaranteeing expected invariants such as `t.indices() == t.coalesce_().indices()`. On the other hand, it limits what you can do in Python (i.e. if you want to do something that uses the indices without coalescing, for efficiency).

Opinions welcome!
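The invariant in question can be illustrated in plain Python (a hedged sketch, not the real tensor API): on an uncoalesced representation, the raw index list differs from what you get after coalescing, so an `indices()` accessor would have to coalesce first to guarantee the equality above.

```python
# Hedged pure-Python illustration of why indices() on an uncoalesced tensor
# does not match the coalesced indices unless the accessor coalesces first.

def coalesce(indices, values):
    """Out-of-place coalesce: sorted unique indices, duplicate values summed."""
    acc = {}
    for i, v in zip(indices, values):
        acc[i] = acc.get(i, 0.0) + v
    keys = sorted(acc)
    return keys, [acc[k] for k in keys]

raw_indices, raw_values = [5, 0, 5], [1.0, 1.0, 1.0]
co_indices, co_values = coalesce(raw_indices, raw_values)

print(raw_indices == co_indices)  # False: [5, 0, 5] vs [0, 5]
print(co_indices, co_values)      # [0, 5] [1.0, 2.0]
```

Auto-coalescing in the accessors hides this mismatch from Python users, at the cost of ruling out index-level tricks on the uncoalesced form.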