Batched torch.eig() and use of magma for CUDA torch.eig() operation #32932

Closed
mfkasim1 wants to merge 253 commits into pytorch:master from mfkasim1:batchedeig

Conversation

@mfkasim1
Contributor

@mfkasim1 mfkasim1 commented Feb 3, 2020

This pull request is the first in response to my feature request #32531 for batched torch.eig().
It turned out to be harder than I expected, and I ended up rewriting torch.eig() in ATen (replacing the legacy functions used previously).
I also created an eig function for CUDA using MAGMA, which in my runs shows a slight improvement in running time.
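
To make the intended batch semantics concrete, here is a minimal pure-Python sketch. It is not the ATen or MAGMA code from this PR. It assumes that a batched result simply stacks the single-matrix eigenvalue layout (one `[real, imag]` pair per eigenvalue) along a leading batch dimension, and it handles only 2x2 matrices through the closed-form characteristic polynomial. The helper names `eig2x2` and `batched_eig2x2` are hypothetical.

```python
import cmath


def eig2x2(matrix):
    """Eigenvalues of one real 2x2 matrix as [real, imag] pairs."""
    (a, b), (c, d) = matrix
    half_trace = (a + d) / 2.0
    det = a * d - b * c
    # Roots of lambda^2 - trace * lambda + det = 0.
    root = cmath.sqrt(half_trace * half_trace - det)
    values = (half_trace + root, half_trace - root)
    return [[v.real, v.imag] for v in values]


def batched_eig2x2(batch):
    """Apply eig2x2 to every matrix in the batch (result shape: batch x 2 x 2)."""
    return [eig2x2(matrix) for matrix in batch]


batch = [
    [[2.0, 0.0], [0.0, 3.0]],   # real eigenvalues 3 and 2
    [[0.0, -1.0], [1.0, 0.0]],  # rotation: eigenvalues +i and -i
]
eigenvalues = batched_eig2x2(batch)
```

The second matrix shows why the imaginary column matters even for real input: a rotation matrix has purely imaginary eigenvalues.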

I haven't written tests for batched eig yet; for now I only want to show that this change does not break the existing tests for torch.eig().
Should I add a new function in test_torch.py for batched eig, or should I just change the existing test_eig function?

@kostmo
Member

kostmo commented Feb 3, 2020

💊 CircleCI build failures summary and remediations

As of commit 75bb4ac:

  • 1/3 broken upstream at merge base 87dc2db since Feb 17

    Please rebase on the viable/strict branch:

    If your commit is newer than viable/strict, you can try basing on an older, stable commit:

    git fetch origin viable/strict
    git rebase --onto viable/strict $(git merge-base origin/master HEAD)
    

    If your commit is older than viable/strict:

    git fetch origin viable/strict
    git rebase viable/strict
    

    Check out the recency history of this "viable master" tracking branch.

  • 2/3 failures introduced in this PR

Detailed failure analysis

One may explore the probable reasons each build failed interactively on the Dr. CI website.

🕵️ 1 new failure recognized by patterns

The following build failures do not appear to be due to upstream breakage:

See CircleCI build pytorch_xla_linux_xenial_py3_6_clang7_test (1/1)

Step: "Test"

Feb 17 20:28:52 FAIL [0.186s]: test_eig_xla_float64 (__main__.TestTorchDeviceTypeXLA)
Feb 17 20:28:52     return fn(slf, device, *args, **kwargs) 
Feb 17 20:28:52   File "/var/lib/jenkins/workspace/xla/test/../../test/test_torch.py", line 13219, in test_batch_eig 
Feb 17 20:28:52     run_test(batch_dims + (5,), eigenvectors) 
Feb 17 20:28:52   File "/var/lib/jenkins/workspace/xla/test/../../test/test_torch.py", line 13186, in run_test 
Feb 17 20:28:52     self.assertEqual(torch.zeros_like(oute_imag, device=device), oute_imag, "Eigenvectors not all real") 
Feb 17 20:28:52   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_device_type.py", line 335, in assertEqual 
Feb 17 20:28:52     prec = max(self.precision, prec) 
Feb 17 20:28:52 TypeError: '>' not supported between instances of 'str' and 'float' 
Feb 17 20:28:52  
Feb 17 20:28:52 ====================================================================== 
Feb 17 20:28:52 FAIL [0.186s]: test_eig_xla_float64 (__main__.TestTorchDeviceTypeXLA) 
Feb 17 20:28:52 ---------------------------------------------------------------------- 
Feb 17 20:28:52 Traceback (most recent call last): 
Feb 17 20:28:52   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_device_type.py", line 198, in instantiated_test 
Feb 17 20:28:52     result = test(self, device_arg, dtype) 
Feb 17 20:28:52   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_device_type.py", line 431, in dep_fn 
Feb 17 20:28:52     return fn(slf, device, *args, **kwargs) 
Feb 17 20:28:52   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_device_type.py", line 431, in dep_fn 
Feb 17 20:28:52     return fn(slf, device, *args, **kwargs) 
Feb 17 20:28:52   File "/var/lib/jenkins/workspace/xla/test/../../test/test_torch.py", line 13148, in test_eig 
Feb 17 20:28:52     self.assertFalse(v.is_contiguous(), 'V is contiguous') 
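
The `TypeError` in the log above is consistent with the message string being bound to the tolerance parameter when it is passed as the third positional argument. The sketch below reproduces that failure mode with a hypothetical `assert_equal` helper. Its signature is an assumption inferred from the traceback line `prec = max(self.precision, prec)`, not the actual test-suite method.

```python
DEFAULT_PRECISION = 1e-5  # hypothetical stand-in for self.precision


def assert_equal(x, y, prec=None, message=""):
    """Hypothetical helper whose third positional parameter is a tolerance."""
    if prec is None:
        prec = DEFAULT_PRECISION
    # If a message string was passed positionally, prec is a str here and
    # max() raises: '>' not supported between instances of 'str' and 'float'.
    prec = max(DEFAULT_PRECISION, prec)
    assert abs(x - y) <= prec, message


def positional_message_fails():
    """Return True if passing the message positionally raises TypeError."""
    try:
        assert_equal(0.0, 0.0, "Eigenvectors not all real")
    except TypeError:
        return True
    return False


def keyword_message_works():
    """Passing the message by keyword leaves the tolerance untouched."""
    assert_equal(0.0, 0.0, message="Eigenvectors not all real")
    return True
```

Under that assumption, passing the message by keyword in the test would avoid the error.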

🚧 1 upstream failure recognized by patterns:

These builds matched patterns, but were probably caused by upstream breakages:


This comment was automatically generated by Dr. CI. Follow this link to opt-out of these comments for your Pull Requests.

Please report bugs/suggestions on the GitHub issue tracker.

This comment has been revised 12 times.

v0dro and others added 10 commits February 3, 2020 10:12
Summary:
The need for this is felt because sometimes we change a build script and change the `std=c++XX` flag, which does not get caught until the compilation has progressed for a while.

#31757
Pull Request resolved: #32819

Differential Revision: D19697205

Pulled By: ezyang

fbshipit-source-id: b045a1d15e24c4c6007b5d1464756051d32bf911
…32862)

Summary:
Fixes #32001
Pull Request resolved: #32862

Differential Revision: D19695935

Pulled By: ezyang

fbshipit-source-id: bb37eb7a187214aa69259828024366f479a258d7
Summary: Pull Request resolved: #28870

Differential Revision: D19698758

Pulled By: ezyang

fbshipit-source-id: 23167ec5bf9f7ab81012a124206bb4c2bdd6ca06
Summary:
* New ops supported for exporting.
* Updates on support for tensor indexing and dynamic list of tensors.
* lara-hdr, spandantiwari Should we also include updates on torchvision support in this page?

cc houseroad, neginraoof Please review if I have missed anything.
Pull Request resolved: #32805

Reviewed By: hl475

Differential Revision: D19635699

Pulled By: houseroad

fbshipit-source-id: b6be4fce641f852dcbceed20b4433f4037d8024a
Summary:
This should be `BoolTensor`
Pull Request resolved: #30385

Differential Revision: D19698414

Pulled By: ezyang

fbshipit-source-id: 68f1e10eb9d4b99552bb158f6ad7e6ff0f7cc1c4
Summary:
Very similar to #16267 but handling directories.

Stoked to contribute!
Pull Request resolved: #27836

Differential Revision: D19698398

Pulled By: ezyang

fbshipit-source-id: eabc3a44d258124f860babb47ab91e22c2c3d6cc
Summary:
I noticed the description of the initialization of convolutional modules is inconsistent with the actual implementation. There are two such cases:

1) `k` in the initialization of ConvTranspose modules is not dependent on the input channels but on the output channels (`kaiming_uniform_` uses the size of the second dimension of `weight` which is transposed in the first two dimensions).

2) Both the normal convolutions and the transposed ones use `k` divided by `groups`.
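
The two cases above can be sketched numerically. This is an illustration under stated assumptions, not the library code: it assumes the weights are initialized with `kaiming_uniform_` using `a = sqrt(5)`, so that the uniform bound reduces to `sqrt(1 / fan_in)`, with `fan_in` taken from the second weight dimension times the kernel size. The helper names are hypothetical.

```python
import math


def conv_weight_shape(c_in, c_out, kernel, groups):
    """Weight layout of a regular convolution: (C_out, C_in / groups, *kernel)."""
    return (c_out, c_in // groups) + tuple(kernel)


def conv_transpose_weight_shape(c_in, c_out, kernel, groups):
    """Transposed convolution swaps the first two dims: (C_in, C_out / groups, *kernel)."""
    return (c_in, c_out // groups) + tuple(kernel)


def init_bound(weight_shape):
    """Uniform bound sqrt(k), with k = 1 / fan_in (assumes a = sqrt(5))."""
    fan_in = weight_shape[1]
    for size in weight_shape[2:]:
        fan_in *= size
    return math.sqrt(1.0 / fan_in)


# Example: 8 input channels, 16 output channels, 3x3 kernel, 2 groups.
conv_bound = init_bound(conv_weight_shape(8, 16, (3, 3), 2))
transpose_bound = init_bound(conv_transpose_weight_shape(8, 16, (3, 3), 2))
```

In this example `k = groups / (C_in * 9)` for the regular convolution and `k = groups / (C_out * 9)` for the transposed one, matching the two corrections described in the summary.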
Pull Request resolved: #30079

Differential Revision: D19698511

Pulled By: ezyang

fbshipit-source-id: 1ba938fbbd97663eaf29fd1245872179d2761fff
Summary:
Pull Request resolved: #32882

Update tensorboard binary and unit tests to python 3

Test Plan:
```
> buck test //caffe2/caffe2/contrib/tensorboard:tensorboard_test
```
```
> buck test //caffe2/caffe2/contrib/tensorboard:tensorboard_exporter_test
```

Reviewed By: sanekmelnikov

Differential Revision: D19670873

fbshipit-source-id: f5eb65ccbb4ecfdc801b9fa05a60d4c5c29dc428
Summary:
**Running Clang-Tidy**, **Pre-commit Tidy/Linting Hook**, **Building PyTorch with ASAN** shouldn't belong to **Windows development tips**.
Pull Request resolved: #28412

Differential Revision: D19700228

Pulled By: ezyang

fbshipit-source-id: 39d999c68e4bd9264f4ae1fdab517871c883a663
Summary: Pull Request resolved: #28763

Differential Revision: D19698808

Pulled By: ezyang

fbshipit-source-id: 7820acd7b0715ebf1d9ae954dca0058b6759075e
@smessmer smessmer requested review from orionr and zou3519 and removed request for orionr February 3, 2020 20:47
@smessmer smessmer added the triaged This issue has been looked at a team member, and triaged and prioritized into an appropriate module label Feb 3, 2020
neginraoof and others added 14 commits February 3, 2020 12:56
Summary:
Adding symbolic for onnx einsum as part of opset 12
Pull Request resolved: #32716

Reviewed By: hl475

Differential Revision: D19626168

Pulled By: houseroad

fbshipit-source-id: d8cc8af5f05f36aca3cd55dead602261ccdfec51
Summary:
e.g. `tensor[torch.tensor([0, 1, 0], dtype=torch.bool)]`
Previously the mask was of type uint8. Both uint8 and bool should be supported for export.
Pull Request resolved: #32445

Reviewed By: hl475

Differential Revision: D19610713

Pulled By: houseroad

fbshipit-source-id: 8df636e0c3cb0b82919a689242a962c79220209c
Summary: Pull Request resolved: #28935

Differential Revision: D19698781

Pulled By: ezyang

fbshipit-source-id: abdd735c98656ed16cd326529441d1fcec2ace3e
Summary:
Pull Request resolved: #32923

As per
https://isocpp.org/wiki/faq/ctors#construct-on-first-use-v2 and
https://isocpp.org/wiki/faq/ctors#static-init-order-on-first-use-members, we
should be using leaky singletons to avoid static initialization order problem.

Closes #27412
ghstack-source-id: 97601384

Test Plan: waitforbuildbot

Differential Revision: D19688986

fbshipit-source-id: 8c1935fb7da8a7116dbca55eb43dc04bc02695ac
Summary:
Fix for constant folding flaky tests
Looks like the constant folding test modules are sometimes exported with ONNX_ATEN op export type, which is causing the CI failures.
I'm unable to repro this issue locally, but my guess is that the op export param is being overwritten on CI build at some point.
This PR sets the op export type and hopefully fixes the issue.
Pull Request resolved: #32546

Reviewed By: hl475

Differential Revision: D19606919

Pulled By: houseroad

fbshipit-source-id: 31793d6857bbbf99b43b4a7c22a045a56ae19e44
Summary:
SpatialBNFakeLoweredFp16NNPI

this is the fake operator for SpatialBN that gets lowered into add/mul/div, etc.

Test Plan: test_spatialbn

Reviewed By: tracelogfb, amylittleyang

Differential Revision: D19658680

fbshipit-source-id: 2abddbcd9a2023ac75c494f20eaac2051b7139dc
… case (#32383)

Summary:
Step 2 of #31975

Vectorized memory access is enabled. Generated code: https://github.com/zasdfgbnm/things/blob/master/2020Q1/disassembly-elementwise-vec.ipynb

```
void at::native::modern::elementwise_kernel<4, 64, 4, at::native::add_kernel_cuda(at::TensorIterator&, c10::Scalar)::{lambda()#1}::operator()() const::{lambda()#4}::operator()() const::{lambda(float, float)#1}, at::detail::Array<char*, 3> >(int, at::native::add_kernel_cuda(at::TensorIterator&, c10::Scalar)::{lambda()#1}::operator()() const::{lambda()#4}::operator()() const::{lambda(float, float)#1}, at::detail::Array<char*, 3>)
```

**ASM:**

```
	.section	.text._ZN2at6native6modern18elementwise_kernelILi4ELi64ELi4EZZZNS0_15add_kernel_cudaERNS_14TensorIteratorEN3c106ScalarEENKUlvE_clEvENKUlvE2_clEvEUlffE_NS_6detail5ArrayIPcLi3EEEEEviT2_T3_,"ax",progbits
	.sectioninfo	@"SHI_REGISTERS=20"
	.align	128
        .global         _ZN2at6native6modern18elementwise_kernelILi4ELi64ELi4EZZZNS0_15add_kernel_cudaERNS_14TensorIteratorEN3c106ScalarEENKUlvE_clEvENKUlvE2_clEvEUlffE_NS_6detail5ArrayIPcLi3EEEEEviT2_T3_
        .type           _ZN2at6native6modern18elementwise_kernelILi4ELi64ELi4EZZZNS0_15add_kernel_cudaERNS_14TensorIteratorEN3c106ScalarEENKUlvE_clEvENKUlvE2_clEvEUlffE_NS_6detail5ArrayIPcLi3EEEEEviT2_T3_,function
        .size           _ZN2at6native6modern18elementwise_kernelILi4ELi64ELi4EZZZNS0_15add_kernel_cudaERNS_14TensorIteratorEN3c106ScalarEENKUlvE_clEvENKUlvE2_clEvEUlffE_NS_6detail5ArrayIPcLi3EEEEEviT2_T3_,(.L_40898 - _ZN2at6native6modern18elementwise_kernelILi4ELi64ELi4EZZZNS0_15add_kernel_cudaERNS_14TensorIteratorEN3c106ScalarEENKUlvE_clEvENKUlvE2_clEvEUlffE_NS_6detail5ArrayIPcLi3EEEEEviT2_T3_)
        .other          _ZN2at6native6modern18elementwise_kernelILi4ELi64ELi4EZZZNS0_15add_kernel_cudaERNS_14TensorIteratorEN3c106ScalarEENKUlvE_clEvENKUlvE2_clEvEUlffE_NS_6detail5ArrayIPcLi3EEEEEviT2_T3_,@"STO_CUDA_ENTRY STV_DEFAULT"
_ZN2at6native6modern18elementwise_kernelILi4ELi64ELi4EZZZNS0_15add_kernel_cudaERNS_14TensorIteratorEN3c106ScalarEENKUlvE_clEvENKUlvE2_clEvEUlffE_NS_6detail5ArrayIPcLi3EEEEEviT2_T3_:
.text._ZN2at6native6modern18elementwise_kernelILi4ELi64ELi4EZZZNS0_15add_kernel_cudaERNS_14TensorIteratorEN3c106ScalarEENKUlvE_clEvENKUlvE2_clEvEUlffE_NS_6detail5ArrayIPcLi3EEEEEviT2_T3_:
	//## File "/home/xgao/pytorch/aten/src/ATen/native/cuda/Loops.cuh", line 294
        /*0000*/                   IMAD.MOV.U32 R1, RZ, RZ, c[0x0][0x28] ;
        /*0010*/              @!PT SHFL.IDX PT, RZ, RZ, RZ, RZ ;
        /*0020*/                   S2R R9, SR_CTAID.X ;
	//## File "/home/xgao/pytorch/aten/src/ATen/native/cuda/MemoryAccess.cuh", line 177
        /*0030*/                   S2R R0, SR_TID.X ;
	//## File "/home/xgao/pytorch/aten/src/ATen/native/cuda/Loops.cuh", line 294
        /*0040*/                   IMAD.SHL.U32 R9, R9, 0x100, RZ ;
        /*0050*/                   IADD3 R5, -R9, c[0x0][0x160], RZ ;
	//## File "/home/xgao/pytorch/aten/src/ATen/native/cuda/Loops.cuh", line 256
        /*0060*/                   SHF.R.S32.HI R17, RZ, 0x1f, R9 ;
	//## File "/home/xgao/pytorch/aten/src/ATen/native/cuda/Loops.cuh", line 296
        /*0070*/                   ISETP.GE.AND P0, PT, R5, 0x100, PT ;
        /*0080*/              @!P0 BRA `(.L_3173) ;
	//## File "/home/xgao/pytorch/aten/src/ATen/native/cuda/Loops.cuh", line 256
        /*0090*/                   IMAD.SHL.U32 R12, R9.reuse, 0x4, RZ ;
        /*00a0*/                   SHF.L.U64.HI R17, R9, 0x2, R17 ;
	//## File "/home/xgao/pytorch/aten/src/ATen/native/cuda/Loops.cuh", line 260
        /*00b0*/                   IADD3 R8, P0, R12.reuse, c[0x0][0x188], RZ ;
        /*00c0*/                   IADD3 R2, P1, R12, c[0x0][0x190], RZ ;
        /*00d0*/                   IADD3.X R9, R17.reuse, c[0x0][0x18c], RZ, P0, !PT ;
        /*00e0*/                   IADD3.X R3, R17, c[0x0][0x194], RZ, P1, !PT ;
	//## File "/home/xgao/pytorch/aten/src/ATen/native/cuda/MemoryAccess.cuh", line 218
        /*00f0*/                   IMAD.WIDE R8, R0, 0x10, R8 ;
        /*0100*/                   IMAD.WIDE R2, R0, 0x10, R2 ;
        /*0110*/                   LDG.E.128.SYS R8, [R8] ;
        /*0120*/                   LDG.E.128.SYS R4, [R2] ;
	//## File "/home/xgao/pytorch/aten/src/ATen/native/cuda/Loops.cuh", line 256
        /*0130*/                   IADD3 R12, P0, R12, c[0x0][0x180], RZ ;
        /*0140*/                   IADD3.X R13, R17, c[0x0][0x184], RZ, P0, !PT ;
	//## File "/home/xgao/pytorch/aten/src/ATen/native/cuda/MemoryAccess.cuh", line 238
        /*0150*/                   IMAD.WIDE R12, R0, 0x10, R12 ;
	//## File "/home/xgao/pytorch/aten/src/ATen/native/cuda/Loops.cuh", line 196
        /*0160*/                   FFMA R7, R7, c[0x0][0x168], R11 ;
        /*0170*/                   FFMA R6, R6, c[0x0][0x168], R10 ;
        /*0180*/                   FFMA R5, R5, c[0x0][0x168], R9 ;
        /*0190*/                   FFMA R4, R4, c[0x0][0x168], R8 ;
	//## File "/home/xgao/pytorch/aten/src/ATen/native/cuda/MemoryAccess.cuh", line 238
        /*01a0*/                   STG.E.128.SYS [R12], R4 ;
	//## File "/home/xgao/pytorch/aten/src/ATen/native/cuda/Loops.cuh", line 301
        /*01b0*/                   EXIT ;
.L_3173:
	//## File "/home/xgao/pytorch/aten/src/ATen/native/cuda/MemoryAccess.cuh", line 180
        /*01c0*/                   ISETP.GE.AND P0, PT, R0, R5, PT ;
        /*01d0*/                   BMOV.32.CLEAR RZ, B0 ;
        /*01e0*/                   BSSY B0, `(.L_3174) ;
        /*01f0*/               P0 BRA `(.L_3175) ;
	//## File "/home/xgao/pytorch/aten/src/ATen/native/cuda/MemoryAccess.cuh", line 183
        /*0200*/                   IADD3 R3, P1, R9, R0, RZ ;
        /*0210*/                   LEA.HI.X.SX32 R4, R0, R17, 0x1, P1 ;
        /*0220*/                   LEA R2, P1, R3, c[0x0][0x188], 0x2 ;
        /*0230*/                   LEA.HI.X R3, R3, c[0x0][0x18c], R4, 0x2, P1 ;
        /*0240*/                   LDG.E.SYS R8, [R2] ;
	//## File "/home/xgao/pytorch/aten/src/ATen/native/cuda/MemoryAccess.cuh", line 184
        /*0250*/                   IADD3 R4, R0, 0x40, RZ ;
	//## File "/home/xgao/pytorch/aten/src/ATen/native/cuda/MemoryAccess.cuh", line 180
        /*0260*/                   ISETP.GE.AND P1, PT, R4, R5, PT ;
        /*0270*/               P1 BRA `(.L_3175) ;
	//## File "/home/xgao/pytorch/aten/src/ATen/native/cuda/MemoryAccess.cuh", line 183
        /*0280*/                   LDG.E.SYS R4, [R2+0x100] ;
	//## File "/home/xgao/pytorch/aten/src/ATen/native/cuda/MemoryAccess.cuh", line 184
        /*0290*/                   IADD3 R6, R0, 0x80, RZ ;
	//## File "/home/xgao/pytorch/aten/src/ATen/native/cuda/MemoryAccess.cuh", line 180
        /*02a0*/                   ISETP.GE.AND P1, PT, R6, R5, PT ;
        /*02b0*/               P1 BRA `(.L_3175) ;
	//## File "/home/xgao/pytorch/aten/src/ATen/native/cuda/MemoryAccess.cuh", line 184
        /*02c0*/                   IADD3 R10, R0, 0xc0, RZ ;
	//## File "/home/xgao/pytorch/aten/src/ATen/native/cuda/MemoryAccess.cuh", line 183
        /*02d0*/                   LDG.E.SYS R7, [R2+0x200] ;
	//## File "/home/xgao/pytorch/aten/src/ATen/native/cuda/MemoryAccess.cuh", line 180
        /*02e0*/                   ISETP.GE.AND P1, PT, R10, R5, PT ;
	//## File "/home/xgao/pytorch/aten/src/ATen/native/cuda/MemoryAccess.cuh", line 183
        /*02f0*/              @!P1 LDG.E.SYS R6, [R2+0x300] ;
.L_3175:
        /*0300*/                   BSYNC B0 ;
.L_3174:
	//## File "/home/xgao/pytorch/aten/src/ATen/native/cuda/MemoryAccess.cuh", line 180
        /*0310*/                   BMOV.32.CLEAR RZ, B0 ;
        /*0320*/                   BSSY B0, `(.L_3176) ;
        /*0330*/               P0 BRA `(.L_3177) ;
	//## File "/home/xgao/pytorch/aten/src/ATen/native/cuda/MemoryAccess.cuh", line 183
        /*0340*/                   IADD3 R3, P1, R9, R0, RZ ;
        /*0350*/                   LEA.HI.X.SX32 R10, R0, R17, 0x1, P1 ;
        /*0360*/                   LEA R2, P1, R3, c[0x0][0x190], 0x2 ;
        /*0370*/                   LEA.HI.X R3, R3, c[0x0][0x194], R10, 0x2, P1 ;
        /*0380*/                   LDG.E.SYS R11, [R2] ;
	//## File "/home/xgao/pytorch/aten/src/ATen/native/cuda/MemoryAccess.cuh", line 184
        /*0390*/                   IADD3 R10, R0, 0x40, RZ ;
	//## File "/home/xgao/pytorch/aten/src/ATen/native/cuda/MemoryAccess.cuh", line 180
        /*03a0*/                   ISETP.GE.AND P1, PT, R10, R5, PT ;
        /*03b0*/               P1 BRA `(.L_3177) ;
	//## File "/home/xgao/pytorch/aten/src/ATen/native/cuda/MemoryAccess.cuh", line 183
        /*03c0*/                   LDG.E.SYS R13, [R2+0x100] ;
	//## File "/home/xgao/pytorch/aten/src/ATen/native/cuda/MemoryAccess.cuh", line 184
        /*03d0*/                   IADD3 R10, R0, 0x80, RZ ;
	//## File "/home/xgao/pytorch/aten/src/ATen/native/cuda/MemoryAccess.cuh", line 180
        /*03e0*/                   ISETP.GE.AND P1, PT, R10, R5, PT ;
        /*03f0*/               P1 BRA `(.L_3177) ;
	//## File "/home/xgao/pytorch/aten/src/ATen/native/cuda/MemoryAccess.cuh", line 184
        /*0400*/                   IADD3 R10, R0, 0xc0, RZ ;
	//## File "/home/xgao/pytorch/aten/src/ATen/native/cuda/MemoryAccess.cuh", line 180
        /*0410*/                   ISETP.GE.AND P1, PT, R10, R5, PT ;
	//## File "/home/xgao/pytorch/aten/src/ATen/native/cuda/MemoryAccess.cuh", line 183
        /*0420*/                   LDG.E.SYS R10, [R2+0x200] ;
        /*0430*/              @!P1 LDG.E.SYS R15, [R2+0x300] ;
.L_3177:
        /*0440*/                   BSYNC B0 ;
.L_3176:
	//## File "/home/xgao/pytorch/aten/src/ATen/native/cuda/MemoryAccess.cuh", line 193
        /*0450*/               P0 EXIT ;
	//## File "/home/xgao/pytorch/aten/src/ATen/native/cuda/MemoryAccess.cuh", line 196
        /*0460*/                   IADD3 R9, P0, R9, R0, RZ ;
        /*0470*/                   FFMA R11, R11, c[0x0][0x168], R8 ;
	//## File "/home/xgao/pytorch/aten/src/ATen/native/cuda/MemoryAccess.cuh", line 197
        /*0480*/                   IADD3 R14, R0, 0x40, RZ ;
	//## File "/home/xgao/pytorch/aten/src/ATen/native/cuda/MemoryAccess.cuh", line 196
        /*0490*/                   LEA.HI.X.SX32 R12, R0, R17, 0x1, P0 ;
        /*04a0*/                   LEA R2, P0, R9.reuse, c[0x0][0x180], 0x2 ;
	//## File "/home/xgao/pytorch/aten/src/ATen/native/cuda/MemoryAccess.cuh", line 193
        /*04b0*/                   ISETP.GE.AND P1, PT, R14, R5, PT ;
	//## File "/home/xgao/pytorch/aten/src/ATen/native/cuda/MemoryAccess.cuh", line 196
        /*04c0*/                   LEA.HI.X R3, R9, c[0x0][0x184], R12, 0x2, P0 ;
        /*04d0*/                   STG.E.SYS [R2], R11 ;
	//## File "/home/xgao/pytorch/aten/src/ATen/native/cuda/MemoryAccess.cuh", line 193
        /*04e0*/               P1 EXIT ;
	//## File "/home/xgao/pytorch/aten/src/ATen/native/cuda/MemoryAccess.cuh", line 197
        /*04f0*/                   IADD3 R8, R0, 0x80, RZ ;
	//## File "/home/xgao/pytorch/aten/src/ATen/native/cuda/Loops.cuh", line 196
        /*0500*/                   FFMA R13, R13, c[0x0][0x168], R4 ;
	//## File "/home/xgao/pytorch/aten/src/ATen/native/cuda/MemoryAccess.cuh", line 193
        /*0510*/                   ISETP.GE.AND P0, PT, R8, R5, PT ;
	//## File "/home/xgao/pytorch/aten/src/ATen/native/cuda/MemoryAccess.cuh", line 196
        /*0520*/                   STG.E.SYS [R2+0x100], R13 ;
	//## File "/home/xgao/pytorch/aten/src/ATen/native/cuda/MemoryAccess.cuh", line 193
        /*0530*/               P0 EXIT ;
	//## File "/home/xgao/pytorch/aten/src/ATen/native/cuda/MemoryAccess.cuh", line 197
        /*0540*/                   IADD3 R0, R0, 0xc0, RZ ;
	//## File "/home/xgao/pytorch/aten/src/ATen/native/cuda/Loops.cuh", line 196
        /*0550*/                   FFMA R7, R10, c[0x0][0x168], R7 ;
        /*0560*/                   FFMA R15, R15, c[0x0][0x168], R6 ;
	//## File "/home/xgao/pytorch/aten/src/ATen/native/cuda/MemoryAccess.cuh", line 193
        /*0570*/                   ISETP.GE.AND P0, PT, R0, R5, PT ;
	//## File "/home/xgao/pytorch/aten/src/ATen/native/cuda/MemoryAccess.cuh", line 196
        /*0580*/                   STG.E.SYS [R2+0x200], R7 ;
	//## File "/home/xgao/pytorch/aten/src/ATen/native/cuda/MemoryAccess.cuh", line 193
        /*0590*/               P0 EXIT ;
	//## File "/home/xgao/pytorch/aten/src/ATen/native/cuda/MemoryAccess.cuh", line 196
        /*05a0*/                   STG.E.SYS [R2+0x300], R15 ;
        /*05b0*/                   EXIT ;
.L_3178:
        /*05c0*/                   BRA `(.L_3178);
        /*05d0*/                   NOP;
        /*05e0*/                   NOP;
        /*05f0*/                   NOP;
.L_40898:
```

We can clearly see the `LDG.E.128` in it, which is a result of vectorization.

Benchmark: https://github.com/zasdfgbnm/things/blob/master/2020Q1/benchmark-vec.ipynb

Benchmark on P100, dtype `uint8`:

Before:
```
1.4.0a0+a5b4d78
e1d9702
22.2 µs ± 89.8 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
34.7 µs ± 38.2 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
52 µs ± 312 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
86.9 µs ± 135 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
154 µs ± 204 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
291 µs ± 668 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
566 µs ± 1.16 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
1.18 ms ± 1.54 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
2.29 ms ± 1.48 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
4.4 ms ± 1.15 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
```

After:
```
1.4.0a0+a5b4d78
1281cdf
24 µs ± 116 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
30.5 µs ± 355 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
43.1 µs ± 300 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
67.6 µs ± 113 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
116 µs ± 275 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
215 µs ± 142 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
413 µs ± 791 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
824 µs ± 891 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
1.63 ms ± 478 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
3.19 ms ± 1.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
```

Benchmark on P100, dtype `half`:

Before:
```
1.4.0a0+a5b4d78
1c017f0
30.8 µs ± 226 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
43.4 µs ± 164 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
69.1 µs ± 83 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
119 µs ± 103 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
224 µs ± 99.1 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
418 µs ± 206 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
865 µs ± 237 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
1.69 ms ± 695 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
3.3 ms ± 527 ns per loop (mean ± std. dev. of 7 runs, 100 loops each)
6.77 ms ± 741 ns per loop (mean ± std. dev. of 7 runs, 100 loops each)
```

After:

```
1.4.0a0+a5b4d78
7e50ee2
28.9 µs ± 61.3 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
40.2 µs ± 244 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
63.8 µs ± 350 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
109 µs ± 196 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
199 µs ± 157 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
380 µs ± 446 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
743 µs ± 2.17 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
1.47 ms ± 1.34 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
2.91 ms ± 9.17 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
5.8 ms ± 296 ns per loop (mean ± std. dev. of 7 runs, 100 loops each)
```
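
As a rough reading of the timings above, the ratio before/after can be computed for individual rows. The sketch below uses only figures quoted in the listings (the first and last `uint8` rows and the last `half` row); the `speedup` helper is hypothetical, and which tensor size each row corresponds to is not stated here.

```python
def speedup(before, after):
    """Ratio of old to new runtime; values above 1 mean the new code is faster."""
    return before / after


# Last (largest) row of each listing, in milliseconds.
uint8_largest = speedup(4.4, 3.19)
half_largest = speedup(6.77, 5.8)

# First (smallest) uint8 row, in microseconds.
uint8_smallest = speedup(22.2, 24.0)
```

For the last rows this gives roughly 1.38x for `uint8` and 1.17x for `half`, while the first `uint8` row is slightly slower after the change (ratio below 1).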

cc: csarofeen ptrblck
Pull Request resolved: #32383

Differential Revision: D19697455

Pulled By: ngimel

fbshipit-source-id: 0707481c2f334e6634c000b4afd275b2fee8fbe1
…32384)

Summary:
The `BatchNorm*` part of the issue (see gh-12013) seems to have been fixed in the master branch and these tests would make it concrete.

However I would appreciate comments on #12013 (comment) on whether the current behaviour is satisfactory.
Pull Request resolved: #32384

Differential Revision: D19704154

Pulled By: ngimel

fbshipit-source-id: 1bbbbf1ae1215a460b22cf26e6b263e518ecf60b
Summary:
- Show values in question like glog.
- Handle expressions with logical operators properly by adding
  parentheses around expressions.
- Allow outputting nullptr (some build failed without this)
Pull Request resolved: #29539

Reviewed By: dreiss

Differential Revision: D19698991

Pulled By: ljk53

fbshipit-source-id: e329c01622cfc386ac009904092519a4adfe94a8
#32907)

Summary:
Pull Request resolved: #32907

All op-specific information used in this logic was available to the
parser itself, so the check can be done in that context, no codegen
needed.

No change in the warning behavior itself, modulo a minor formatting tweak;
it passes existing tests. Saves about 254K of binary size on Mac (16502064 - 16247888 = 254176 bytes):
```
-rwxr-xr-x  1 bhosmer  1876110778   16502064 Feb  1 00:43 torch/lib/libtorch_python.dylib
-rwxr-xr-x  1 bhosmer  1876110778   16247888 Feb  1 00:44 torch/lib/libtorch_python.dylib
```

[codegen diff](bhosmer/scratch@deprecation_warning_before...deprecation_warning_after)

More important than the size savings is the minimization of codegen. Ideally the generated artifact should express distinctive per-op properties in as minimal a form as practically possible - e.g. here instead of generating check-and-warn behavior into every binding, we generate only the data that triggers the behavior in the parser. (And actually we were generating it already.)

Test Plan: Imported from OSS

Differential Revision: D19679928

Pulled By: bhosmer

fbshipit-source-id: cf0140573118430720c6b797c762fe5be98acd86
Summary:
The default value is removed because it is explained right below.
Pull Request resolved: #32945

Reviewed By: soumith

Differential Revision: D19706567

Pulled By: ailzhang

fbshipit-source-id: 1b7cc87991532f69b81aaae2451d944f70dda427
Summary:
This should fix #32346. When the `_flat_weights` list is updated, `None` elements are now appended to it for any missing weights; subsequent `setattr` calls for those weights should repair `_flat_weights` and make it suitable for use in the backend.
Pull Request resolved: #32939

Differential Revision: D19710990

Pulled By: ngimel

fbshipit-source-id: c978c7519464e94beeffa9bc33b9172854a2f298
Summary:
Pull Request resolved: #32935

Mock away the content of onnxified net with some low cost ops so that we can still mimic the input/output transfer while doing minimal work on the card.

Test Plan:
```
buck run glow/fb/test:sparsenn_test -- --gtest_filter='SparseNNTest.vanillaC2' --onnxifi_debug_mode --onnxifi_loop_test_mode --nocaffe2_predictor_use_memonger
```

Differential Revision: D19631971

fbshipit-source-id: f970c55ccb410702f479255eeb750e01e3f8c2ae
…32952)

Summary:
Pull Request resolved: #32952

When the Async() version of clearAndWaitForOutstandingRpcs() was written,
we didn't yet have the generic Future<T> class, and hadn't worked out our
error model fully.

This change fixes that method to properly propagate the first encountered error
to the future, using a bool+CAS.
ghstack-source-id: 97665749

Test Plan: existing test coverage, buck test mode/dev-nosan caffe2/test/...

Differential Revision: D19710337

fbshipit-source-id: 66ce5593a94a16ea624930dbb9409917ef5cfd5d
houseroad and others added 16 commits February 15, 2020 11:37
…plete tensor types.

Test Plan: revert-hammer

Differential Revision:
D19900566

Original commit changeset: c8eaad70c8ea

fbshipit-source-id: 764f2139fdf19f22a397694d011078ec525f5e8a
…ec (#32962)

Summary:
Pull Request resolved: #32962

As per gchanan's comments on
#30445, I've used
`torch.set_default_dtype` in test_data_parallel instead of specifying
dtype=torch.double everywhere. Also, renamed dtype2prec to dtype2prec_DONTUSE
ghstack-source-id: 98388429

Test Plan: waitforbuildbot

Differential Revision: D19714374

fbshipit-source-id: eb55bbca33881625636ba9ea6dd4cb692f25668e
Summary:
Globally define
```C++
constexpr int num_threads = C10_WARP_SIZE * 2;
constexpr int thread_work_size = 4;
constexpr int block_work_size = thread_work_size * num_threads;
```
and kill all the template arguments passing these values.

These are effectively global, but we currently pass them around as template arguments, which is inconvenient when writing code.
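
The arithmetic behind these constants can be sketched as follows. It assumes `C10_WARP_SIZE` is 32 (the NVIDIA value), and the `num_blocks` helper is hypothetical, added only to show how a per-block work size typically translates into a grid size.

```python
C10_WARP_SIZE = 32  # assumption: NVIDIA warp size

num_threads = C10_WARP_SIZE * 2                   # 64 threads per block
thread_work_size = 4                              # elements handled per thread
block_work_size = thread_work_size * num_threads  # 256 elements per block


def num_blocks(n):
    """Hypothetical helper: blocks needed to cover n elements (ceiling division)."""
    return (n + block_work_size - 1) // block_work_size
```

With these values each block covers 256 (0x100) elements, so 1000 elements need 4 blocks and 1025 elements need 5.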
Pull Request resolved: #33308

Differential Revision: D19907250

Pulled By: ngimel

fbshipit-source-id: 4623b69baea7e6e77f460ffdfa07cf9f8cba588a
Summary:
Fixes the `TensorIterator` parts of #32863 (THC is still broken)

`TensorIterator::split` now keeps track of the `view_offsets` into the full tensor range. With this, I can take the base offset for the reduced dimension and translate partial results from the sub-iter into the index range of the full tensor. This happens only once for each intermediate result, so we should still benefit from the performance of 32-bit indexing in loops.
Pull Request resolved: #33310

Differential Revision: D19906136

Pulled By: ngimel

fbshipit-source-id: 3372ee4b8d5b115a53be79aeafc52e80ff9c490b
Summary:
Pull Request resolved: #33387

CI is broken. Skip two functions to fix the problem.

Test Plan: ci

Reviewed By: hl475

Differential Revision: D19926249

fbshipit-source-id: a46d1465c59de8616d2af5fb0b9cc18532359f88
Summary: in dper2, local net is hard-coded by whitelisting some layers. Add SparseFeatureGating related layers to local net explicitly.

Test Plan:
* workflow: f167812211
* QRT: fall back looks normal

{F228442018}

Differential Revision: D19852280

fbshipit-source-id: 6fecc3d745c3f742d029575a7b9fe320618f1863
Summary:
Pull Request resolved: #33325

Closes #32924. There was a bug where for TCPStore, we would not respect the timeout passed into `init_process_group` while constructing the TCPStore. Instead, we'd set the timeout after the rendezvous created the store, meaning that we used the default timeout of 300s while connecting to the server. This diff passes the timeout passed into `init_process_group` to rendezvous so that it can be passed into the constructor for TCPStore, so that we can use the right timeout at construction time.

Question: Should we make this change for FileStore as well? Currently the FileStore constructor does not take in a timeout at all.
ghstack-source-id: 98401875

Test Plan: Added a UT

Differential Revision: D19871946

fbshipit-source-id: dd002180c4c883216645b8a97cc472c6116ac117
…tializing

Test Plan: revert-hammer

Differential Revision:
D19871946

Original commit changeset: dd002180c4c8

fbshipit-source-id: 40b0676c51e43366c0700e81d16cc7927ee8efc2
Summary:
GitHub commits:

facebook/fb303@80dda47
facebookarchive/fbzmq@797af57
pytorch/FBGEMM@b2fceb9

Test Plan: n/a

Reviewed By: zpao

fbshipit-source-id: dde5fb9abca185422df11dc61c658dc333ad63ca
Summary:
Pull Request resolved: #32974

Pull Request resolved: pytorch/FBGEMM#286

Re-attempt of D18805426 . Decided to be consistent with PyTorch Adagrad

There was an inconsistency in the order of operations between the scalar and SIMD code when computing Adagrad. This diff makes them consistent by doing w += lr * grad / (sqrt(moment) + epsilon) in Adagrad and w += lr / (sqrt(moment) + epsilon) * grad in RowWiseSparseAdagrad.

The Adagrad order is consistent with PyTorch (see aten/src/ATen/native/cpu/PointwiseOpsKernel.cpp addcmul_cpu_kernel function). The RowWiseSparseAdagrad order makes the computation more efficient: in RowWiseSparseAdagrad, lr / (sqrt(moment) + epsilon) is shared among all elements in the row.

And, we're not going to use FMA to be consistent with PyTorch (even though it provides a little accuracy benefit)
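
The two orderings described above can be written out as a small sketch. It is not the FBGEMM kernel: the moment update is omitted, the function names are hypothetical, and the sign convention follows the formulas as written in this summary.

```python
import math


def adagrad_step(w, grad, moment, lr, epsilon):
    """Adagrad order: (lr * grad) first, then divide by the denominator."""
    return w + lr * grad / (math.sqrt(moment) + epsilon)


def rowwise_adagrad_step(row, grads, moment, lr, epsilon):
    """Row-wise order: the factor lr / (sqrt(moment) + epsilon) is computed
    once and shared by every element in the row."""
    step = lr / (math.sqrt(moment) + epsilon)
    return [w + step * g for w, g in zip(row, grads)]


elementwise = [adagrad_step(w, g, 4.0, 0.1, 1e-8)
               for w, g in zip([1.0, 2.0], [0.5, -0.25])]
rowwise = rowwise_adagrad_step([1.0, 2.0], [0.5, -0.25], 4.0, 0.1, 1e-8)
```

The two orderings are mathematically equal but may differ in the last bits of the floating-point result, which is why scalar and SIMD paths need to agree on one of them.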

Test Plan: CI

Reviewed By: wx1988

Differential Revision: D19342865

fbshipit-source-id: e950c16f2e1c4a2f2a3ef53b1705db373c67f341
Summary:
GitHub commits:

pytorch/FBGEMM@19c040c

Test Plan: n/a

Reviewed By: zpao

fbshipit-source-id: ddc41000622a682874ab3a11fdf4a91038f9c15f
@mfkasim1
Contributor Author

I think I messed up the commits. Sorry. I will try to fix it.

@mfkasim1 mfkasim1 closed this Feb 17, 2020
ttumiel pushed a commit to ttumiel/pytorch that referenced this pull request Mar 4, 2020
Summary:
Another pull request to follow up issue pytorch#32531.
Here I implemented the backward operation for `torch.eig` under the condition that all the eigenvalues are real.

This pull request is independent of my other pull request pytorch#32932; there is no dependency between the two.
Pull Request resolved: pytorch#33090

Differential Revision: D19814347

Pulled By: albanD

fbshipit-source-id: 2fae30964e97987abb690544df8240aedeae56e8

Labels

open source triaged This issue has been looked at a team member, and triaged and prioritized into an appropriate module