Batched torch.eig() and use of magma for CUDA torch.eig() operation #32932

mfkasim1 wants to merge 253 commits into pytorch:master from mfkasim1:batchedeig
Conversation
Summary: The need for this arises because sometimes we change a build script's `std=c++XX` flag, and the change does not get caught until the compilation has progressed for a while. #31757 Pull Request resolved: #32819 Differential Revision: D19697205 Pulled By: ezyang fbshipit-source-id: b045a1d15e24c4c6007b5d1464756051d32bf911
Summary: Pull Request resolved: #28870 Differential Revision: D19698758 Pulled By: ezyang fbshipit-source-id: 23167ec5bf9f7ab81012a124206bb4c2bdd6ca06
Summary: * New ops supported for exporting. * Updates on support for tensor indexing and dynamic list of tensors. * lara-hdr, spandantiwari Should we also include updates on torchvision support in this page? cc houseroad, neginraoof Please review if I have missed anything. Pull Request resolved: #32805 Reviewed By: hl475 Differential Revision: D19635699 Pulled By: houseroad fbshipit-source-id: b6be4fce641f852dcbceed20b4433f4037d8024a
Summary: This should be `BoolTensor` Pull Request resolved: #30385 Differential Revision: D19698414 Pulled By: ezyang fbshipit-source-id: 68f1e10eb9d4b99552bb158f6ad7e6ff0f7cc1c4
Summary: I noticed the description of the initialization of convolutional modules is inconsistent with the actual implementation. There are two such cases: 1) `k` in the initialization of ConvTranspose modules is not dependent on the input channels but on the output channels (`kaiming_uniform_` uses the size of the second dimension of `weight` which is transposed in the first two dimensions). 2) Both the normal convolutions and the transposed ones use `k` divided by `groups`. Pull Request resolved: #30079 Differential Revision: D19698511 Pulled By: ezyang fbshipit-source-id: 1ba938fbbd97663eaf29fd1245872179d2761fff
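The corrected initialization bound can be checked numerically. A minimal sketch (the `conv_init_bound` helper and the idea of passing the relevant channel count explicitly are illustrative, not code from the patch):

```python
import math

def conv_init_bound(channels, kernel_size, groups=1):
    # Documented init: weights ~ U(-sqrt(k), sqrt(k)) with
    # k = groups / (channels * prod(kernel_size)).
    # Per the commit, `channels` is in_channels for Conv* modules but
    # out_channels for ConvTranspose* modules, because kaiming_uniform_
    # reads fan-in from dim 1 of the weight, which is transposed there.
    k = groups / (channels * math.prod(kernel_size))
    return math.sqrt(k)

# Conv2d(16, 33, kernel_size=(3, 5)):          bound uses in_channels  = 16
# ConvTranspose2d(16, 33, kernel_size=(3, 5)): bound uses out_channels = 33
conv_bound = conv_init_bound(16, (3, 5))
convt_bound = conv_init_bound(33, (3, 5))
```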
Summary: Pull Request resolved: #32882 Update tensorboard binary and unit tests to Python 3. Test Plan:

```
> buck test //caffe2/caffe2/contrib/tensorboard:tensorboard_test
```

```
> buck test //caffe2/caffe2/contrib/tensorboard:tensorboard_exporter_test
```

Reviewed By: sanekmelnikov Differential Revision: D19670873 fbshipit-source-id: f5eb65ccbb4ecfdc801b9fa05a60d4c5c29dc428
Summary: **Running Clang-Tidy**, **Pre-commit Tidy/Linting Hook**, **Building PyTorch with ASAN** shouldn't belong to **Windows development tips**. Pull Request resolved: #28412 Differential Revision: D19700228 Pulled By: ezyang fbshipit-source-id: 39d999c68e4bd9264f4ae1fdab517871c883a663
Summary: Pull Request resolved: #28763 Differential Revision: D19698808 Pulled By: ezyang fbshipit-source-id: 7820acd7b0715ebf1d9ae954dca0058b6759075e
Summary: Adding symbolic for onnx einsum as part of opset 12 Pull Request resolved: #32716 Reviewed By: hl475 Differential Revision: D19626168 Pulled By: houseroad fbshipit-source-id: d8cc8af5f05f36aca3cd55dead602261ccdfec51
Summary: e.g. `tensor[torch.tensor([0, 1, 0], dtype=torch.bool)]` Previously the mask was of type uint8. Both uint8 and bool should be supported for export. Pull Request resolved: #32445 Reviewed By: hl475 Differential Revision: D19610713 Pulled By: houseroad fbshipit-source-id: 8df636e0c3cb0b82919a689242a962c79220209c
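A pure-Python sketch of the selection semantics that boolean-mask indexing implements (`masked_select` here is a hypothetical helper for illustration, not the exporter's code):

```python
def masked_select(values, mask):
    # tensor[mask] keeps the values where the mask is truthy. A uint8
    # mask of 0s and 1s selects exactly the same elements as a bool
    # mask, which is why the exporter accepts both dtypes.
    return [v for v, keep in zip(values, mask) if keep]

bool_mask = [False, True, False]
uint8_mask = [0, 1, 0]
# both masks select only the element at index 1
```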
Summary: Pull Request resolved: #28935 Differential Revision: D19698781 Pulled By: ezyang fbshipit-source-id: abdd735c98656ed16cd326529441d1fcec2ace3e
Summary: Pull Request resolved: #32923 As per https://isocpp.org/wiki/faq/ctors#construct-on-first-use-v2 and https://isocpp.org/wiki/faq/ctors#static-init-order-on-first-use-members, we should be using leaky singletons to avoid static initialization order problem. Closes #27412 ghstack-source-id: 97601384 Test Plan: waitforbuildbot Differential Revision: D19688986 fbshipit-source-id: 8c1935fb7da8a7116dbca55eb43dc04bc02695ac
Summary: Fix for constant folding flaky tests Looks like the constant folding test modules are sometimes exported with ONNX_ATEN op export type, which is causing the CI failures. I'm unable to repro this issue locally, but my guess is that the op export param is being overwritten on CI build at some point. This PR sets the op export type and hopefully fixes the issue. Pull Request resolved: #32546 Reviewed By: hl475 Differential Revision: D19606919 Pulled By: houseroad fbshipit-source-id: 31793d6857bbbf99b43b4a7c22a045a56ae19e44
Summary: SpatialBNFakeLoweredFp16NNPI this is the fake operator for SpatialBN that gets lowered into add/mul/div, etc. Test Plan: test_spatialbn Reviewed By: tracelogfb, amylittleyang Differential Revision: D19658680 fbshipit-source-id: 2abddbcd9a2023ac75c494f20eaac2051b7139dc
… case (#32383) Summary: Step 2 of #31975. Vectorized memory access is enabled. Generated code: https://github.com/zasdfgbnm/things/blob/master/2020Q1/disassembly-elementwise-vec.ipynb

The instantiated kernel:

```
void at::native::modern::elementwise_kernel<4, 64, 4,
    at::native::add_kernel_cuda(at::TensorIterator&, c10::Scalar)::{lambda()#1}::operator()() const
        ::{lambda()#4}::operator()() const::{lambda(float, float)#1},
    at::detail::Array<char*, 3> >(int, ..., at::detail::Array<char*, 3>)
```

Its fast path in the generated SASS (full disassembly in the notebook linked above) loads and stores 128 bits at a time:

```
/*0110*/ LDG.E.128.SYS R8, [R8] ;
/*0120*/ LDG.E.128.SYS R4, [R2] ;
/*0190*/ FFMA R4, R4, c[0x0][0x168], R8 ;
/*01a0*/ STG.E.128.SYS [R12], R4 ;
```

while the remainder that does not fill a full block falls back to scalar `LDG.E.SYS`/`STG.E.SYS` accesses. We can clearly see the `LDG.E.128` in it, which is a result of vectorization.

Benchmark: https://github.com/zasdfgbnm/things/blob/master/2020Q1/benchmark-vec.ipynb

Benchmark on P100, dtype `uint8`, before:

```
1.4.0a0+a5b4d78 e1d9702
22.2 µs ± 89.8 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
34.7 µs ± 38.2 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
52 µs ± 312 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
86.9 µs ± 135 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
154 µs ± 204 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
291 µs ± 668 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
566 µs ± 1.16 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
1.18 ms ± 1.54 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
2.29 ms ± 1.48 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
4.4 ms ± 1.15 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
```

after:

```
1.4.0a0+a5b4d78 1281cdf
24 µs ± 116 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
30.5 µs ± 355 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
43.1 µs ± 300 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
67.6 µs ± 113 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
116 µs ± 275 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
215 µs ± 142 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
413 µs ± 791 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
824 µs ± 891 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
1.63 ms ± 478 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
3.19 ms ± 1.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
```

Benchmark on P100, dtype `half`, before:

```
1.4.0a0+a5b4d78 1c017f0
30.8 µs ± 226 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
43.4 µs ± 164 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
69.1 µs ± 83 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
119 µs ± 103 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
224 µs ± 99.1 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
418 µs ± 206 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
865 µs ± 237 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
1.69 ms ± 695 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
3.3 ms ± 527 ns per loop (mean ± std. dev. of 7 runs, 100 loops each)
6.77 ms ± 741 ns per loop (mean ± std. dev. of 7 runs, 100 loops each)
```

after:

```
1.4.0a0+a5b4d78 7e50ee2
28.9 µs ± 61.3 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
40.2 µs ± 244 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
63.8 µs ± 350 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
109 µs ± 196 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
199 µs ± 157 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
380 µs ± 446 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
743 µs ± 2.17 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
1.47 ms ± 1.34 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
2.91 ms ± 9.17 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
5.8 ms ± 296 ns per loop (mean ± std. dev. of 7 runs, 100 loops each)
```

cc: csarofeen ptrblck Pull Request resolved: #32383 Differential Revision: D19697455 Pulled By: ngimel fbshipit-source-id: 0707481c2f334e6634c000b4afd275b2fee8fbe1
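The vectorized-body-plus-scalar-tail structure visible in the disassembly can be sketched in plain Python (a conceptual model only; the names and the `vec=4` width mirror the kernel's per-thread work size, not actual CUDA code):

```python
def elementwise_add(dst, a, b, vec=4):
    # Sketch of the vectorized elementwise pattern: the main loop covers
    # `vec` elements per step (the 128-bit LDG.E.128 fast path), and a
    # scalar tail loop handles the remainder that does not fill a step.
    n = len(a)
    main = n - n % vec
    for i in range(0, main, vec):        # vectorized body
        for j in range(vec):
            dst[i + j] = a[i + j] + b[i + j]
    for i in range(main, n):             # scalar tail
        dst[i] = a[i] + b[i]
    return dst
```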
…32384) Summary: The `BatchNorm*` part of the issue (see gh-12013) seems to have been fixed in the master branch and these tests would make it concrete. However I would appreciate comments on #12013 (comment) on whether the current behaviour is satisfactory. Pull Request resolved: #32384 Differential Revision: D19704154 Pulled By: ngimel fbshipit-source-id: 1bbbbf1ae1215a460b22cf26e6b263e518ecf60b
Summary:
- Show the values in question, like glog.
- Handle expressions with logical operators properly by adding parentheses around expressions.
- Allow outputting nullptr (some builds failed without this).

Pull Request resolved: #29539 Reviewed By: dreiss Differential Revision: D19698991 Pulled By: ljk53 fbshipit-source-id: e329c01622cfc386ac009904092519a4adfe94a8
#32907) Summary: Pull Request resolved: #32907 All op-specific information used in this logic was available to the parser itself, so the check can be done in that context, no codegen needed. No change in the warning behavior itself, mod minor formatting tweak - passes existing tests. Saves like ~275K binary size on mac: ``` -rwxr-xr-x 1 bhosmer 1876110778 16502064 Feb 1 00:43 torch/lib/libtorch_python.dylib -rwxr-xr-x 1 bhosmer 1876110778 16247888 Feb 1 00:44 torch/lib/libtorch_python.dylib ``` [codegen diff](bhosmer/scratch@deprecation_warning_before...deprecation_warning_after) More important than the size savings is the minimization of codegen. Ideally the generated artifact should express distinctive per-op properties in as minimal a form as practically possible - e.g. here instead of generating check-and-warn behavior into every binding, we generate only the data that triggers the behavior in the parser. (And actually we were generating it already.) Test Plan: Imported from OSS Differential Revision: D19679928 Pulled By: bhosmer fbshipit-source-id: cf0140573118430720c6b797c762fe5be98acd86
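The data-driven idea can be sketched as follows (the table contents and the `parse` helper are hypothetical illustrations of the approach, not the actual generated artifact or parser code):

```python
import warnings

# Instead of generating a check-and-warn snippet into every binding, the
# codegen emits only a table of deprecated signatures; the hand-written
# parser consults it once, keeping per-op generated code minimal.
DEPRECATED_SIGNATURES = {  # hypothetical table contents
    "add(Tensor self, Scalar alpha, Tensor other)":
        "add(Tensor self, Tensor other, *, Scalar alpha)",
}

def parse(signature):
    replacement = DEPRECATED_SIGNATURES.get(signature)
    if replacement is not None:
        warnings.warn(f"{signature} is deprecated; use {replacement} instead")
    # ... actual argument parsing would go here ...
    return signature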
Summary: The default value is removed because it is explained right below. Pull Request resolved: #32945 Reviewed By: soumith Differential Revision: D19706567 Pulled By: ailzhang fbshipit-source-id: 1b7cc87991532f69b81aaae2451d944f70dda427
Summary: Should fix #32346 hopefully. Now when _flat_weights list is updated, `None` elements are appended to it if some weights are missing, subsequent `setattr` calls for the missing weights should repair _flat_weights and make it suitable to use in the backend. Pull Request resolved: #32939 Differential Revision: D19710990 Pulled By: ngimel fbshipit-source-id: c978c7519464e94beeffa9bc33b9172854a2f298
Summary: Pull Request resolved: #32935 Mock away the content of the onnxified net with some low-cost ops so that we can still mimic the input/output transfer while doing minimal work on the card. Test Plan:

```
buck run glow/fb/test:sparsenn_test -- --gtest_filter='SparseNNTest.vanillaC2' --onnxifi_debug_mode --onnxifi_loop_test_mode --nocaffe2_predictor_use_memonger
```

Differential Revision: D19631971 fbshipit-source-id: f970c55ccb410702f479255eeb750e01e3f8c2ae
…32952) Summary: Pull Request resolved: #32952 When the Async() version of clearAndWaitForOutstandingRpcs() was written, we didn't yet have the generic Future<T> class, and hadn't worked out our error model fully. This change fixes that method to properly propagate the first encountered error to the future, using a bool+CAS. ghstack-source-id: 97665749 Test Plan: existing test coverage, buck test mode/dev-nosan caffe2/test/... Differential Revision: D19710337 fbshipit-source-id: 66ce5593a94a16ea624930dbb9409917ef5cfd5d
…plete tensor types. Test Plan: revert-hammer Differential Revision: D19900566 Original commit changeset: c8eaad70c8ea fbshipit-source-id: 764f2139fdf19f22a397694d011078ec525f5e8a
…ec (#32962) Summary: Pull Request resolved: #32962 As per gchanan's comments on #30445, I've used `torch.set_default_dtype` in test_data_parallel instead of specifying dtype=torch.double everywhere. Also, renamed dtype2prec to dtype2prec_DONTUSE ghstack-source-id: 98388429 Test Plan: waitforbuildbot Differential Revision: D19714374 fbshipit-source-id: eb55bbca33881625636ba9ea6dd4cb692f25668e
Summary: Globally define

```c++
constexpr int num_threads = C10_WARP_SIZE * 2;
constexpr int thread_work_size = 4;
constexpr int block_work_size = thread_work_size * num_threads;
```

and kill all the template arguments passing these values. These are effectively global, but we are now passing them around as template arguments, causing much inconvenience in coding. Pull Request resolved: #33308 Differential Revision: D19907250 Pulled By: ngimel fbshipit-source-id: 4623b69baea7e6e77f460ffdfa07cf9f8cba588a
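A sketch of the arithmetic these constants encode and how they determine launch size (assuming a warp size of 32; `grid_size` is an illustrative helper, not a PyTorch function):

```python
# Each block of num_threads threads processes block_work_size elements,
# thread_work_size of them per thread.
C10_WARP_SIZE = 32                                # assumed warp size
num_threads = C10_WARP_SIZE * 2                   # 64
thread_work_size = 4
block_work_size = thread_work_size * num_threads  # 256

def grid_size(numel):
    # number of blocks needed to cover numel elements (ceiling division)
    return (numel + block_work_size - 1) // block_work_size
```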
Summary: Fixes the `TensorIterator` parts of #32863 (THC is still broken) `TensorIterator::split` now keeps track of the `view_offsets` into the full tensor range. With this, I can take the base offset for the reduced dimension and translate partial results from the sub-iter into the index range of the full tensor. This happens only once for each intermediate result, so we should still benefit from the performance of 32-bit indexing in loops. Pull Request resolved: #33310 Differential Revision: D19906136 Pulled By: ngimel fbshipit-source-id: 3372ee4b8d5b115a53be79aeafc52e80ff9c490b
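The offset-translation idea can be sketched with a 1-D argmax over chunks (a conceptual model of sub-iterators and `view_offsets`, not the TensorIterator code):

```python
def global_argmax(values, chunk_size):
    # Compute argmax per chunk (each chunk standing in for a sub-iterator
    # that can use fast 32-bit indexing), then translate the local index
    # by the chunk's base offset -- its "view offset" -- into the index
    # range of the full tensor. Translation happens once per chunk.
    best_val, best_idx = None, None
    for base in range(0, len(values), chunk_size):
        chunk = values[base:base + chunk_size]
        local = max(range(len(chunk)), key=chunk.__getitem__)
        if best_val is None or chunk[local] > best_val:
            best_val, best_idx = chunk[local], base + local
    return best_idx
```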
Summary: Pull Request resolved: #33387 CI is broken. Skip two functions to fix the problem. Test Plan: ci Reviewed By: hl475 Differential Revision: D19926249 fbshipit-source-id: a46d1465c59de8616d2af5fb0b9cc18532359f88
Summary: In dper2, the local net is hard-coded by whitelisting some layers. Add SparseFeatureGating-related layers to the local net explicitly.
Test Plan:
* workflow: f167812211
* QRT: fall back looks normal
{F228442018}
Differential Revision: D19852280
fbshipit-source-id: 6fecc3d745c3f742d029575a7b9fe320618f1863
Summary: Pull Request resolved: #33325 Closes #32924. There was a bug where for TCPStore, we would not respect the timeout passed into `init_process_group` while constructing the TCPStore. Instead, we'd set the timeout after the rendezvous created the store, meaning that we used the default timeout of 300s while connecting to the server. This diff passes the timeout passed into `init_process_group` to rendezvous so that it can be passed into the constructor for TCPStore, so that we can use the right timeout at construction time. Question: Should we make this change for FileStore as well? Currently the FileStore constructor does not take in a timeout at all. ghstack-source-id: 98401875 Test Plan: Added a UT Differential Revision: D19871946 fbshipit-source-id: dd002180c4c883216645b8a97cc472c6116ac117
…tializing Test Plan: revert-hammer Differential Revision: D19871946 Original commit changeset: dd002180c4c8 fbshipit-source-id: 40b0676c51e43366c0700e81d16cc7927ee8efc2
Summary: GitHub commits: facebook/fb303@80dda47 facebookarchive/fbzmq@797af57 pytorch/FBGEMM@b2fceb9 Test Plan: n/a Reviewed By: zpao fbshipit-source-id: dde5fb9abca185422df11dc61c658dc333ad63ca
Summary: Pull Request resolved: #32974 Pull Request resolved: pytorch/FBGEMM#286 Re-attempt of D18805426 . Decided to be consistent with PyTorch Adagrad There was an inconsistency in the order of operation between scalar and SIMD code when we compute Adagrad. This diff make them consistent by doing w += lr * grad / (sqrt(moment) + epsilon) in Adagrad and w += lr / (sqrt(moment) + epsilon) * grad in RowWiseSparseAdagrad. The Adagrad order is consistent with PyTorch (see aten/src/ATen/native/cpu/PointwiseOpsKernel.cpp addcmul_cpu_kernel function). The RowWiseSparseAdagrad order is to make compute more efficient. In RowWiseSparseAdagrad, lr / (sqrt(moment) + epsilon) is shared among all elements in the row And, we're not going to use FMA to be consistent with PyTorch (even though it provides a little accuracy benefit) Test Plan: CI Reviewed By: wx1988 Differential Revision: D19342865 fbshipit-source-id: e950c16f2e1c4a2f2a3ef53b1705db373c67f341
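The two operation orders the diff makes consistent can be sketched in plain Python (illustrative scalar helpers; the real kernels operate on vectors and use SIMD, and the orders are mathematically equal but can round differently in floating point):

```python
import math

def adagrad_update(w, grad, moment, lr, eps):
    # Element-wise Adagrad order, chosen to match PyTorch:
    # w += lr * grad / (sqrt(moment) + eps)
    return w + lr * grad / (math.sqrt(moment) + eps)

def rowwise_adagrad_update(row, grads, moment, lr, eps):
    # Row-wise order: hoist lr / (sqrt(moment) + eps) out of the loop,
    # since it is shared by every element of the row.
    scale = lr / (math.sqrt(moment) + eps)
    return [w + scale * g for w, g in zip(row, grads)]
```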
Summary: GitHub commits: pytorch/FBGEMM@19c040c Test Plan: n/a Reviewed By: zpao fbshipit-source-id: ddc41000622a682874ab3a11fdf4a91038f9c15f
I think I messed up the commits. Sorry. I will try to fix it.
Summary: Another pull request following up issue pytorch#32531. Here I implemented the backward operation for `torch.eig` under the condition that all the eigenvalues are real. This pull request is independent of my other pull request pytorch#32932; there is no dependency between the two. Pull Request resolved: pytorch#33090 Differential Revision: D19814347 Pulled By: albanD fbshipit-source-id: 2fae30964e97987abb690544df8240aedeae56e8
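For the real-eigenvalue case, the eigenvalue gradient can be checked against finite differences on a 2×2 symmetric example. This is a standalone sketch, not the PR's implementation; it uses the standard fact that for a symmetric matrix the gradient of an eigenvalue with respect to the matrix is the outer product of its unit eigenvector.

```python
import math

def lam_plus(a, b, c):
    # larger eigenvalue of the symmetric matrix [[a, b], [b, c]]
    return (a + c) / 2 + math.sqrt(((a - c) / 2) ** 2 + b ** 2)

def eigval_grad(a, b, c):
    # d(lam_plus)/dA = v v^T with v the unit eigenvector for lam_plus;
    # (b, lam - a) solves (A - lam I) v = 0 for the off-diagonal b != 0.
    lam = lam_plus(a, b, c)
    v = (b, lam - a)
    if v == (0.0, 0.0):          # already-diagonal edge case
        v = (1.0, 0.0)
    n = math.hypot(*v)
    v = (v[0] / n, v[1] / n)
    return [[v[0] * v[0], v[0] * v[1]],
            [v[1] * v[0], v[1] * v[1]]]
```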
This pull request is the first one in response to my feature request #32531 about batched `torch.eig()`. It is actually harder than I previously thought, and I ended up writing `torch.eig()` in ATen (replacing the legacy functions used beforehand). I also created an eig function for CUDA using magma, which in my runs shows a slight improvement in running time.

I haven't written the tests for batched `eig` yet, just to show that the change I made here does not break the previous tests for `torch.eig()`. Should I add a new function in `test_torch.py` for batched eig, or should I just change the function `test_eig`?
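For context, the workaround this PR replaces is looping over the batch in Python and decomposing one matrix at a time. A minimal sketch with a hypothetical 2×2 symmetric solver standing in for the per-matrix eig call:

```python
import math

def eig2x2_symmetric(a, b, c):
    # eigenvalues of the symmetric matrix [[a, b], [b, c]], ascending
    mid = (a + c) / 2
    rad = math.sqrt(((a - c) / 2) ** 2 + b ** 2)
    return (mid - rad, mid + rad)

def batched_eig(batch):
    # Pre-PR workaround: a Python-level loop over the batch. The PR moves
    # this loop into ATen (and magma on CUDA) so a (*, n, n) input can be
    # decomposed in a single call.
    return [eig2x2_symmetric(a, b, c) for (a, b, c) in batch]
```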