Batched torch.eig() and use of magma for CUDA torch.eig() operation #32932

mfkasim1 wants to merge 253 commits into pytorch:master from mfkasim1:batchedeig
Conversation
Summary: The need for this arises because sometimes we change a build script's `std=c++XX` flag, and the change does not get caught until the compilation has progressed for a while. #31757 Pull Request resolved: #32819 Differential Revision: D19697205 Pulled By: ezyang fbshipit-source-id: b045a1d15e24c4c6007b5d1464756051d32bf911
Summary: Pull Request resolved: #28870 Differential Revision: D19698758 Pulled By: ezyang fbshipit-source-id: 23167ec5bf9f7ab81012a124206bb4c2bdd6ca06
Summary: * New ops supported for exporting. * Updates on support for tensor indexing and dynamic list of tensors. * lara-hdr, spandantiwari Should we also include updates on torchvision support in this page? cc houseroad, neginraoof Please review if I have missed anything. Pull Request resolved: #32805 Reviewed By: hl475 Differential Revision: D19635699 Pulled By: houseroad fbshipit-source-id: b6be4fce641f852dcbceed20b4433f4037d8024a
Summary: This should be `BoolTensor` Pull Request resolved: #30385 Differential Revision: D19698414 Pulled By: ezyang fbshipit-source-id: 68f1e10eb9d4b99552bb158f6ad7e6ff0f7cc1c4
Summary: I noticed the description of the initialization of convolutional modules is inconsistent with the actual implementation. There are two such cases: 1) `k` in the initialization of ConvTranspose modules is not dependent on the input channels but on the output channels (`kaiming_uniform_` uses the size of the second dimension of `weight` which is transposed in the first two dimensions). 2) Both the normal convolutions and the transposed ones use `k` divided by `groups`. Pull Request resolved: #30079 Differential Revision: D19698511 Pulled By: ezyang fbshipit-source-id: 1ba938fbbd97663eaf29fd1245872179d2761fff
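The corrected initialization bound can be checked numerically. A minimal sketch (the `conv_init_bound` helper and the idea of passing the relevant channel count explicitly are illustrative, not code from the patch):

```python
import math

def conv_init_bound(channels, kernel_size, groups=1):
    # Documented init: weights ~ U(-sqrt(k), sqrt(k)) with
    # k = groups / (channels * prod(kernel_size)).
    # Per the commit, `channels` is in_channels for Conv* modules but
    # out_channels for ConvTranspose* modules, because kaiming_uniform_
    # reads fan-in from dim 1 of the weight, which is transposed there.
    k = groups / (channels * math.prod(kernel_size))
    return math.sqrt(k)

# Conv2d(16, 33, kernel_size=(3, 5)):          bound uses in_channels  = 16
# ConvTranspose2d(16, 33, kernel_size=(3, 5)): bound uses out_channels = 33
conv_bound = conv_init_bound(16, (3, 5))
convt_bound = conv_init_bound(33, (3, 5))
```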
Summary: Pull Request resolved: #32882 Update tensorboard binary and unit tests to Python 3. Test Plan:

```
> buck test //caffe2/caffe2/contrib/tensorboard:tensorboard_test
```

```
> buck test //caffe2/caffe2/contrib/tensorboard:tensorboard_exporter_test
```

Reviewed By: sanekmelnikov Differential Revision: D19670873 fbshipit-source-id: f5eb65ccbb4ecfdc801b9fa05a60d4c5c29dc428
Summary: **Running Clang-Tidy**, **Pre-commit Tidy/Linting Hook**, **Building PyTorch with ASAN** shouldn't belong to **Windows development tips**. Pull Request resolved: #28412 Differential Revision: D19700228 Pulled By: ezyang fbshipit-source-id: 39d999c68e4bd9264f4ae1fdab517871c883a663
Summary: Pull Request resolved: #28763 Differential Revision: D19698808 Pulled By: ezyang fbshipit-source-id: 7820acd7b0715ebf1d9ae954dca0058b6759075e
Summary: Adding symbolic for onnx einsum as part of opset 12 Pull Request resolved: #32716 Reviewed By: hl475 Differential Revision: D19626168 Pulled By: houseroad fbshipit-source-id: d8cc8af5f05f36aca3cd55dead602261ccdfec51
Summary: e.g. `tensor[torch.tensor([0, 1, 0], dtype=torch.bool)]` Previously the mask was of type uint8. Both uint8 and bool should be supported for export. Pull Request resolved: #32445 Reviewed By: hl475 Differential Revision: D19610713 Pulled By: houseroad fbshipit-source-id: 8df636e0c3cb0b82919a689242a962c79220209c
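A pure-Python sketch of the selection semantics that boolean-mask indexing implements (`masked_select` here is a hypothetical helper for illustration, not the exporter's code):

```python
def masked_select(values, mask):
    # tensor[mask] keeps the values where the mask is truthy. A uint8
    # mask of 0s and 1s selects exactly the same elements as a bool
    # mask, which is why the exporter accepts both dtypes.
    return [v for v, keep in zip(values, mask) if keep]

bool_mask = [False, True, False]
uint8_mask = [0, 1, 0]
# both masks select only the element at index 1
```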
Summary: Pull Request resolved: #28935 Differential Revision: D19698781 Pulled By: ezyang fbshipit-source-id: abdd735c98656ed16cd326529441d1fcec2ace3e
Summary: Pull Request resolved: #32923 As per https://isocpp.org/wiki/faq/ctors#construct-on-first-use-v2 and https://isocpp.org/wiki/faq/ctors#static-init-order-on-first-use-members, we should be using leaky singletons to avoid static initialization order problem. Closes #27412 ghstack-source-id: 97601384 Test Plan: waitforbuildbot Differential Revision: D19688986 fbshipit-source-id: 8c1935fb7da8a7116dbca55eb43dc04bc02695ac
Summary: Fix for constant folding flaky tests Looks like the constant folding test modules are sometimes exported with ONNX_ATEN op export type, which is causing the CI failures. I'm unable to repro this issue locally, but my guess is that the op export param is being overwritten on CI build at some point. This PR sets the op export type and hopefully fixes the issue. Pull Request resolved: #32546 Reviewed By: hl475 Differential Revision: D19606919 Pulled By: houseroad fbshipit-source-id: 31793d6857bbbf99b43b4a7c22a045a56ae19e44
Summary: SpatialBNFakeLoweredFp16NNPI this is the fake operator for SpatialBN that gets lowered into add/mul/div, etc. Test Plan: test_spatialbn Reviewed By: tracelogfb, amylittleyang Differential Revision: D19658680 fbshipit-source-id: 2abddbcd9a2023ac75c494f20eaac2051b7139dc
… case (#32383) Summary: Step 2 of #31975. Vectorized memory access is enabled. Generated code: https://github.com/zasdfgbnm/things/blob/master/2020Q1/disassembly-elementwise-vec.ipynb

The instantiated kernel:

```
void at::native::modern::elementwise_kernel<4, 64, 4,
    at::native::add_kernel_cuda(at::TensorIterator&, c10::Scalar)::{lambda()#1}::operator()() const
        ::{lambda()#4}::operator()() const::{lambda(float, float)#1},
    at::detail::Array<char*, 3> >(int, ..., at::detail::Array<char*, 3>)
```

Its fast path in the generated SASS (full disassembly in the notebook linked above) loads and stores 128 bits at a time:

```
/*0110*/ LDG.E.128.SYS R8, [R8] ;
/*0120*/ LDG.E.128.SYS R4, [R2] ;
/*0190*/ FFMA R4, R4, c[0x0][0x168], R8 ;
/*01a0*/ STG.E.128.SYS [R12], R4 ;
```

while the remainder that does not fill a full block falls back to scalar `LDG.E.SYS`/`STG.E.SYS` accesses. We can clearly see the `LDG.E.128` in it, which is a result of vectorization.

Benchmark: https://github.com/zasdfgbnm/things/blob/master/2020Q1/benchmark-vec.ipynb

Benchmark on P100, dtype `uint8`, before:

```
1.4.0a0+a5b4d78 e1d9702
22.2 µs ± 89.8 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
34.7 µs ± 38.2 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
52 µs ± 312 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
86.9 µs ± 135 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
154 µs ± 204 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
291 µs ± 668 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
566 µs ± 1.16 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
1.18 ms ± 1.54 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
2.29 ms ± 1.48 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
4.4 ms ± 1.15 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
```

after:

```
1.4.0a0+a5b4d78 1281cdf
24 µs ± 116 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
30.5 µs ± 355 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
43.1 µs ± 300 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
67.6 µs ± 113 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
116 µs ± 275 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
215 µs ± 142 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
413 µs ± 791 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
824 µs ± 891 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
1.63 ms ± 478 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
3.19 ms ± 1.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
```

Benchmark on P100, dtype `half`, before:

```
1.4.0a0+a5b4d78 1c017f0
30.8 µs ± 226 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
43.4 µs ± 164 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
69.1 µs ± 83 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
119 µs ± 103 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
224 µs ± 99.1 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
418 µs ± 206 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
865 µs ± 237 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
1.69 ms ± 695 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
3.3 ms ± 527 ns per loop (mean ± std. dev. of 7 runs, 100 loops each)
6.77 ms ± 741 ns per loop (mean ± std. dev. of 7 runs, 100 loops each)
```

after:

```
1.4.0a0+a5b4d78 7e50ee2
28.9 µs ± 61.3 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
40.2 µs ± 244 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
63.8 µs ± 350 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
109 µs ± 196 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
199 µs ± 157 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
380 µs ± 446 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
743 µs ± 2.17 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
1.47 ms ± 1.34 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
2.91 ms ± 9.17 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
5.8 ms ± 296 ns per loop (mean ± std. dev. of 7 runs, 100 loops each)
```

cc: csarofeen ptrblck Pull Request resolved: #32383 Differential Revision: D19697455 Pulled By: ngimel fbshipit-source-id: 0707481c2f334e6634c000b4afd275b2fee8fbe1
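The vectorized-body-plus-scalar-tail structure visible in the disassembly can be sketched in plain Python (a conceptual model only; the names and the `vec=4` width mirror the kernel's per-thread work size, not actual CUDA code):

```python
def elementwise_add(dst, a, b, vec=4):
    # Sketch of the vectorized elementwise pattern: the main loop covers
    # `vec` elements per step (the 128-bit LDG.E.128 fast path), and a
    # scalar tail loop handles the remainder that does not fill a step.
    n = len(a)
    main = n - n % vec
    for i in range(0, main, vec):        # vectorized body
        for j in range(vec):
            dst[i + j] = a[i + j] + b[i + j]
    for i in range(main, n):             # scalar tail
        dst[i] = a[i] + b[i]
    return dst
```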
…32384) Summary: The `BatchNorm*` part of the issue (see gh-12013) seems to have been fixed in the master branch and these tests would make it concrete. However I would appreciate comments on #12013 (comment) on whether the current behaviour is satisfactory. Pull Request resolved: #32384 Differential Revision: D19704154 Pulled By: ngimel fbshipit-source-id: 1bbbbf1ae1215a460b22cf26e6b263e518ecf60b
Summary:
- Show the values in question, like glog.
- Handle expressions with logical operators properly by adding parentheses around expressions.
- Allow outputting nullptr (some builds failed without this).

Pull Request resolved: #29539 Reviewed By: dreiss Differential Revision: D19698991 Pulled By: ljk53 fbshipit-source-id: e329c01622cfc386ac009904092519a4adfe94a8
#32907) Summary: Pull Request resolved: #32907 All op-specific information used in this logic was available to the parser itself, so the check can be done in that context, no codegen needed. No change in the warning behavior itself, mod minor formatting tweak - passes existing tests. Saves like ~275K binary size on mac: ``` -rwxr-xr-x 1 bhosmer 1876110778 16502064 Feb 1 00:43 torch/lib/libtorch_python.dylib -rwxr-xr-x 1 bhosmer 1876110778 16247888 Feb 1 00:44 torch/lib/libtorch_python.dylib ``` [codegen diff](bhosmer/scratch@deprecation_warning_before...deprecation_warning_after) More important than the size savings is the minimization of codegen. Ideally the generated artifact should express distinctive per-op properties in as minimal a form as practically possible - e.g. here instead of generating check-and-warn behavior into every binding, we generate only the data that triggers the behavior in the parser. (And actually we were generating it already.) Test Plan: Imported from OSS Differential Revision: D19679928 Pulled By: bhosmer fbshipit-source-id: cf0140573118430720c6b797c762fe5be98acd86
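The data-driven idea can be sketched as follows (the table contents and the `parse` helper are hypothetical illustrations of the approach, not the actual generated artifact or parser code):

```python
import warnings

# Instead of generating a check-and-warn snippet into every binding, the
# codegen emits only a table of deprecated signatures; the hand-written
# parser consults it once, keeping per-op generated code minimal.
DEPRECATED_SIGNATURES = {  # hypothetical table contents
    "add(Tensor self, Scalar alpha, Tensor other)":
        "add(Tensor self, Tensor other, *, Scalar alpha)",
}

def parse(signature):
    replacement = DEPRECATED_SIGNATURES.get(signature)
    if replacement is not None:
        warnings.warn(f"{signature} is deprecated; use {replacement} instead")
    # ... actual argument parsing would go here ...
    return signature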
Summary: The default value is removed because it is explained right below. Pull Request resolved: #32945 Reviewed By: soumith Differential Revision: D19706567 Pulled By: ailzhang fbshipit-source-id: 1b7cc87991532f69b81aaae2451d944f70dda427
Summary: Should fix #32346 hopefully. Now when _flat_weights list is updated, `None` elements are appended to it if some weights are missing, subsequent `setattr` calls for the missing weights should repair _flat_weights and make it suitable to use in the backend. Pull Request resolved: #32939 Differential Revision: D19710990 Pulled By: ngimel fbshipit-source-id: c978c7519464e94beeffa9bc33b9172854a2f298
Summary: Pull Request resolved: #32935 Mock away the content of the onnxified net with some low-cost ops so that we can still mimic the input/output transfer while doing minimal work on the card. Test Plan:

```
buck run glow/fb/test:sparsenn_test -- --gtest_filter='SparseNNTest.vanillaC2' --onnxifi_debug_mode --onnxifi_loop_test_mode --nocaffe2_predictor_use_memonger
```

Differential Revision: D19631971 fbshipit-source-id: f970c55ccb410702f479255eeb750e01e3f8c2ae
…32952) Summary: Pull Request resolved: #32952 When the Async() version of clearAndWaitForOutstandingRpcs() was written, we didn't yet have the generic Future<T> class, and hadn't worked out our error model fully. This change fixes that method to properly propagate the first encountered error to the future, using a bool+CAS. ghstack-source-id: 97665749 Test Plan: existing test coverage, buck test mode/dev-nosan caffe2/test/... Differential Revision: D19710337 fbshipit-source-id: 66ce5593a94a16ea624930dbb9409917ef5cfd5d
…plete tensor types. Test Plan: revert-hammer Differential Revision: D19900566 Original commit changeset: c8eaad70c8ea fbshipit-source-id: 764f2139fdf19f22a397694d011078ec525f5e8a
…ec (#32962) Summary: Pull Request resolved: #32962 As per gchanan's comments on #30445, I've used `torch.set_default_dtype` in test_data_parallel instead of specifying dtype=torch.double everywhere. Also, renamed dtype2prec to dtype2prec_DONTUSE ghstack-source-id: 98388429 Test Plan: waitforbuildbot Differential Revision: D19714374 fbshipit-source-id: eb55bbca33881625636ba9ea6dd4cb692f25668e
Summary: Globally define

```c++
constexpr int num_threads = C10_WARP_SIZE * 2;
constexpr int thread_work_size = 4;
constexpr int block_work_size = thread_work_size * num_threads;
```

and kill all the template arguments passing these values. These are effectively global, but we are now passing them around as template arguments, causing much inconvenience in coding. Pull Request resolved: #33308 Differential Revision: D19907250 Pulled By: ngimel fbshipit-source-id: 4623b69baea7e6e77f460ffdfa07cf9f8cba588a
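A sketch of the arithmetic these constants encode and how they determine launch size (assuming a warp size of 32; `grid_size` is an illustrative helper, not a PyTorch function):

```python
# Each block of num_threads threads processes block_work_size elements,
# thread_work_size of them per thread.
C10_WARP_SIZE = 32                                # assumed warp size
num_threads = C10_WARP_SIZE * 2                   # 64
thread_work_size = 4
block_work_size = thread_work_size * num_threads  # 256

def grid_size(numel):
    # number of blocks needed to cover numel elements (ceiling division)
    return (numel + block_work_size - 1) // block_work_size
```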
Summary: Fixes the `TensorIterator` parts of #32863 (THC is still broken) `TensorIterator::split` now keeps track of the `view_offsets` into the full tensor range. With this, I can take the base offset for the reduced dimension and translate partial results from the sub-iter into the index range of the full tensor. This happens only once for each intermediate result, so we should still benefit from the performance of 32-bit indexing in loops. Pull Request resolved: #33310 Differential Revision: D19906136 Pulled By: ngimel fbshipit-source-id: 3372ee4b8d5b115a53be79aeafc52e80ff9c490b
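The offset-translation idea can be sketched with a 1-D argmax over chunks (a conceptual model of sub-iterators and `view_offsets`, not the TensorIterator code):

```python
def global_argmax(values, chunk_size):
    # Compute argmax per chunk (each chunk standing in for a sub-iterator
    # that can use fast 32-bit indexing), then translate the local index
    # by the chunk's base offset -- its "view offset" -- into the index
    # range of the full tensor. Translation happens once per chunk.
    best_val, best_idx = None, None
    for base in range(0, len(values), chunk_size):
        chunk = values[base:base + chunk_size]
        local = max(range(len(chunk)), key=chunk.__getitem__)
        if best_val is None or chunk[local] > best_val:
            best_val, best_idx = chunk[local], base + local
    return best_idx
```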
Summary: Pull Request resolved: #33387 CI is broken. Skip two functions to fix the problem. Test Plan: ci Reviewed By: hl475 Differential Revision: D19926249 fbshipit-source-id: a46d1465c59de8616d2af5fb0b9cc18532359f88
Summary: In dper2, the local net is hard-coded by whitelisting some layers. Add SparseFeatureGating-related layers to the local net explicitly.
Test Plan:
* workflow: f167812211
* QRT: fall back looks normal
{F228442018}
Differential Revision: D19852280
fbshipit-source-id: 6fecc3d745c3f742d029575a7b9fe320618f1863
Summary: Pull Request resolved: #33325 Closes #32924. There was a bug where for TCPStore, we would not respect the timeout passed into `init_process_group` while constructing the TCPStore. Instead, we'd set the timeout after the rendezvous created the store, meaning that we used the default timeout of 300s while connecting to the server. This diff passes the timeout passed into `init_process_group` to rendezvous so that it can be passed into the constructor for TCPStore, so that we can use the right timeout at construction time. Question: Should we make this change for FileStore as well? Currently the FileStore constructor does not take in a timeout at all. ghstack-source-id: 98401875 Test Plan: Added a UT Differential Revision: D19871946 fbshipit-source-id: dd002180c4c883216645b8a97cc472c6116ac117
…tializing Test Plan: revert-hammer Differential Revision: D19871946 Original commit changeset: dd002180c4c8 fbshipit-source-id: 40b0676c51e43366c0700e81d16cc7927ee8efc2
Summary: GitHub commits: facebook/fb303@80dda47 facebookarchive/fbzmq@797af57 pytorch/FBGEMM@b2fceb9 Test Plan: n/a Reviewed By: zpao fbshipit-source-id: dde5fb9abca185422df11dc61c658dc333ad63ca
Summary: Pull Request resolved: #32974 Pull Request resolved: pytorch/FBGEMM#286 Re-attempt of D18805426 . Decided to be consistent with PyTorch Adagrad There was an inconsistency in the order of operation between scalar and SIMD code when we compute Adagrad. This diff make them consistent by doing w += lr * grad / (sqrt(moment) + epsilon) in Adagrad and w += lr / (sqrt(moment) + epsilon) * grad in RowWiseSparseAdagrad. The Adagrad order is consistent with PyTorch (see aten/src/ATen/native/cpu/PointwiseOpsKernel.cpp addcmul_cpu_kernel function). The RowWiseSparseAdagrad order is to make compute more efficient. In RowWiseSparseAdagrad, lr / (sqrt(moment) + epsilon) is shared among all elements in the row And, we're not going to use FMA to be consistent with PyTorch (even though it provides a little accuracy benefit) Test Plan: CI Reviewed By: wx1988 Differential Revision: D19342865 fbshipit-source-id: e950c16f2e1c4a2f2a3ef53b1705db373c67f341
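The two operation orders the diff makes consistent can be sketched in plain Python (illustrative scalar helpers; the real kernels operate on vectors and use SIMD, and the orders are mathematically equal but can round differently in floating point):

```python
import math

def adagrad_update(w, grad, moment, lr, eps):
    # Element-wise Adagrad order, chosen to match PyTorch:
    # w += lr * grad / (sqrt(moment) + eps)
    return w + lr * grad / (math.sqrt(moment) + eps)

def rowwise_adagrad_update(row, grads, moment, lr, eps):
    # Row-wise order: hoist lr / (sqrt(moment) + eps) out of the loop,
    # since it is shared by every element of the row.
    scale = lr / (math.sqrt(moment) + eps)
    return [w + scale * g for w, g in zip(row, grads)]
```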
Summary: GitHub commits: pytorch/FBGEMM@19c040c Test Plan: n/a Reviewed By: zpao fbshipit-source-id: ddc41000622a682874ab3a11fdf4a91038f9c15f
I think I messed up the commits. Sorry. I will try to fix it.
Summary: Another pull request following up issue pytorch#32531. Here I implemented the backward operation for `torch.eig` under the condition that all the eigenvalues are real. This pull request is independent of my other pull request pytorch#32932; there is no dependency between the two. Pull Request resolved: pytorch#33090 Differential Revision: D19814347 Pulled By: albanD fbshipit-source-id: 2fae30964e97987abb690544df8240aedeae56e8
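For the real-eigenvalue case, the eigenvalue gradient can be checked against finite differences on a 2×2 symmetric example. This is a standalone sketch, not the PR's implementation; it uses the standard fact that for a symmetric matrix the gradient of an eigenvalue with respect to the matrix is the outer product of its unit eigenvector.

```python
import math

def lam_plus(a, b, c):
    # larger eigenvalue of the symmetric matrix [[a, b], [b, c]]
    return (a + c) / 2 + math.sqrt(((a - c) / 2) ** 2 + b ** 2)

def eigval_grad(a, b, c):
    # d(lam_plus)/dA = v v^T with v the unit eigenvector for lam_plus;
    # (b, lam - a) solves (A - lam I) v = 0 for the off-diagonal b != 0.
    lam = lam_plus(a, b, c)
    v = (b, lam - a)
    if v == (0.0, 0.0):          # already-diagonal edge case
        v = (1.0, 0.0)
    n = math.hypot(*v)
    v = (v[0] / n, v[1] / n)
    return [[v[0] * v[0], v[0] * v[1]],
            [v[1] * v[0], v[1] * v[1]]]
```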
This pull request is the first one in response to my feature request #32531 about batched `torch.eig()`. It is actually harder than I previously thought, and I ended up writing `torch.eig()` in ATen (replacing the legacy functions used beforehand). I also created an eig function for CUDA using magma, which in my runs shows a slight improvement in running time.

I haven't written the tests for batched `eig` yet, just to show that the change I made here does not break the previous tests for `torch.eig()`. Should I add a new function in `test_torch.py` for batched eig, or should I just change the function `test_eig`?
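For context, the workaround this PR replaces is looping over the batch in Python and decomposing one matrix at a time. A minimal sketch with a hypothetical 2×2 symmetric solver standing in for the per-matrix eig call:

```python
import math

def eig2x2_symmetric(a, b, c):
    # eigenvalues of the symmetric matrix [[a, b], [b, c]], ascending
    mid = (a + c) / 2
    rad = math.sqrt(((a - c) / 2) ** 2 + b ** 2)
    return (mid - rad, mid + rad)

def batched_eig(batch):
    # Pre-PR workaround: a Python-level loop over the batch. The PR moves
    # this loop into ATen (and magma on CUDA) so a (*, n, n) input can be
    # decomposed in a single call.
    return [eig2x2_symmetric(a, b, c) for (a, b, c) in batch]
```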