Merge more changes from into the forked repo #5
Merged
imaginary-person merged 27 commits into imaginary-person:master on Jan 22, 2021
Conversation
Summary: Based on quickwritereader's comment (#50439 (comment)): those builtins were added in gcc-8 or newer. Fixes #50439 Pull Request resolved: #50640 Reviewed By: walterddr Differential Revision: D25934384 Pulled By: malfet fbshipit-source-id: b5dcfcf644ab92a78279c4dca5dbffbb8d8aae0c
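As a general illustration only (not the actual change in #50640), builtins that exist only in gcc-8 or newer are typically handled by gating on `__GNUC__` at compile time and keeping a portable fallback for older compilers. In this sketch both branches use portable code — `__builtin_popcountll` merely stands in for a gcc-8-only builtin — so the example compiles anywhere:

```
// Hypothetical sketch of compiler-version gating; not the real #50640 fix.
#include <cstdint>
#include <cstdio>

// Portable fallback used when the compiler is too old to provide the builtin.
static inline int popcount_portable(uint64_t x) {
  int n = 0;
  while (x) { x &= x - 1; ++n; }
  return n;
}

static inline int popcount(uint64_t x) {
#if defined(__GNUC__) && !defined(__clang__) && __GNUC__ >= 8
  // A builtin available only on gcc-8 or newer would go here;
  // __builtin_popcountll is used as a stand-in so the sketch stays compilable.
  return __builtin_popcountll(x);
#else
  return popcount_portable(x);
#endif
}

int main() {
  std::printf("%d\n", popcount(0xF0F0ull));  // prints 8
  return 0;
}
```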
…#50769) Summary: Pull Request resolved: #50769 A couple of these lines were newly added in the last couple of days, but they're not necessary anymore. This PR removes them and also adds an assertion to make sure we don't add any more. ghstack-source-id: 120133715 Test Plan: waitforsandcastle Reviewed By: bhosmer Differential Revision: D25961316 fbshipit-source-id: e2befc5b6215b42decb2acedcacfb50734857e2f
…der import.h (#50795) Summary: Pull Request resolved: #50795 There's [a post](https://fb.workplace.com/groups/2148543255442743/permalink/2583012411995823/) about a customer having to pass in `-Wno-global-constructors` to disable warnings related to calling constructors for global objects. This is related to the initialization of `default_extra_files_mobile` in `import.h`. It requires end users to pass in the compiler flag, since the definition is now in code (.cpp files) that they will be compiling. In addition, it makes the API for `_load_for_mobile` non-re-entrant (i.e. can not be safely used concurrently from multiple threads without the caller taking a mutex/lock) if the `extra_files_mobile` argument is not explicitly passed in. Instead, a better option would be to create different overloads; one which requires all 3 parameters, and one that can work with 1-2. This solves the problem without creating a static variable. ghstack-source-id: 120127083 Test Plan: Build Lite Interpreter and sandcastle. Reviewed By: raziel Differential Revision: D25968216 fbshipit-source-id: fbd80dfcafb8ef7231aca301445c4a2ca9a08995
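A minimal sketch of the overload approach described above, assuming simplified placeholder types; the name mirrors `_load_for_mobile`, but the real signature and types in `import.h` differ:

```
// Sketch: replace a static default argument with overloads so no global
// constructor runs and callers don't share mutable static state.
// Module, ReadAdapter, and the int device parameter are placeholders.
#include <memory>
#include <string>
#include <unordered_map>
#include <utility>

using ExtraFilesMap = std::unordered_map<std::string, std::string>;
struct Module {};
struct ReadAdapter {};

// Full overload: the caller supplies all three arguments explicitly.
Module load_for_mobile(std::unique_ptr<ReadAdapter> /*rai*/,
                       int /*device*/,
                       ExtraFilesMap& /*extra_files*/) {
  return Module{};  // the real implementation would deserialize the module here
}

// Convenience overload: builds a local (non-static) map and forwards, so there
// is no exit-time destructor and no shared mutable default across threads.
Module load_for_mobile(std::unique_ptr<ReadAdapter> rai, int device) {
  ExtraFilesMap extra_files;
  return load_for_mobile(std::move(rai), device, extra_files);
}
```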
Summary: This not only specifies the data types of these NaNs, but also indicates that the function isn't signaling anything unusual. Pull Request resolved: #50412 Reviewed By: mrshenli Differential Revision: D25899828 Pulled By: ezyang fbshipit-source-id: a8ded10954ad08cba3098aa473c6b77f2e03dc93
Summary: Pull Request resolved: #50876 Test Plan: Imported from OSS Reviewed By: nikithamalgifb Differential Revision: D26002640 Pulled By: ansley fbshipit-source-id: 4de8a63ef227ae3d46fab231f739c8472289ca4d
Summary: ROCm 3.5 replaced hcc with hip-clang and deprecated HIP_HCC_FLAGS. HIP_CLANG_FLAGS should be used moving forward. HIP_HCC_FLAGS will be removed soon. Pull Request resolved: #50917 Reviewed By: ejguan Differential Revision: D26008094 Pulled By: walterddr fbshipit-source-id: cfec4f96fbd9bd338834a841c37267f6a4703cab
#50739) Summary: Pull Request resolved: #50739 This does not turn on batched grad testing for autogenerated NewModuleTest tests and CriterionTest tests. Those are coming later. Test Plan: - run tests Reviewed By: ejguan Differential Revision: D25997677 Pulled By: zou3519 fbshipit-source-id: b4b2d68e0f99c3d573faf237e1e531d0b3fced40
Summary: Pull Request resolved: #50838 Similar to `torch.save` and `torch.jit.save`, accept an IO-like object instead of just a file. Test Plan: Imported from OSS Reviewed By: nikithamalgifb Differential Revision: D25982719 Pulled By: suo fbshipit-source-id: 42f3665932bbaa6897215002d116df6338edae50
Summary: Pull Request resolved: #50850 Test Plan: Imported from OSS Reviewed By: Chillee Differential Revision: D25985085 Pulled By: ZolotukhinM fbshipit-source-id: e51709423c2c12b37b449a9d7bb22be04cda7ef1
Summary: Pull Request resolved: #50288 torch::deploy will bundle the objects contained in libtorch-python together with frozenpython into a shared library. Therefore, the libtorch-python objs can't bring with them a dependency on system python. Buck TARGETS are added throughout the caffe2 tree to make available objects or headers that will be needed by torch::deploy but would have brought unsuitable dependencies if accessed using existing targets. CMakeLists are modified to separate a torch-python-objs object library which lets torch::deploy compile these objs with the same compile flags as libtorch_python used, but without some of the link-time dependencies such as python. CudaIPCTypes is moved from libtorch_python to libtorch_cuda because it is really not a python binding, and it statically registers a cuda_ipc_callback which would be duplicated if included in each copy of torch::deploy. Test Plan: no new functionality, just ensure existing tests continue to pass Reviewed By: malfet Differential Revision: D25850785 fbshipit-source-id: b0b81c050cbee04e9de96888f8a09d29238a9db8
Summary: Pull Request resolved: #50279 This allows different sample inputs to have different behavior for the same operator. For example, `div(..., rounding_mode='true')` will promote but other rounding modes don't. The current boolean flag is too restrictive to allow this. Test Plan: Imported from OSS Reviewed By: ngimel Differential Revision: D25950011 Pulled By: mruberry fbshipit-source-id: 7e82b82bedc626b2b6970d92d5b25676183ec384
Summary: This is benchmarking tooling for working with sparse tensors. To implement it, we extended the `benchmarking util` PR [https://github.com/pytorch/pytorch/issues/38338](https://github.com/pytorch/pytorch/pull/38338) to sparse tensors. To extend the proposed utility library, the **FuzzedTensor** class was extended by creating a new **FuzzedSparseTensor** class. In addition, two new operator classes were added, `UnaryOpSparseFuzzer` and `BinaryOpSparseFuzzer`. The `FuzzedSparseTensor` class adds new input parameters to the constructor: 1. `sparse_dim`: the number of sparse dimensions in a sparse tensor. 2. `nnz`: the number of non-zero elements in the sparse tensor. 3. `density`: the density of the sparse tensor. 4. `coalesced`: the sparse tensor format permits coalesced/uncoalesced sparse tensors. It also removes `probability_contiguous`, `max_allocation_bytes`, `roll_parameter`, and `tensor_constructor`, as they are dense-tensor-related parameters. In addition, I've extended `torch.utils.benchmark.examples` to work with the new classes `FuzzedSparseTensor`, `UnaryOpSparseFuzzer`, and `BinaryOpSparseFuzzer`. Hopefully, this tooling and these examples will help with other benchmarks in other PRs. Looking forward to your thoughts and feedback. cc robieta, mruberry, ngimel Pull Request resolved: #48397 Reviewed By: ejguan Differential Revision: D26008137 Pulled By: mruberry fbshipit-source-id: 2f37811c7c3eaa3494a0f2500e519267f2186dfb
Summary: Pull Request resolved: #50883 Test Plan: Imported from OSS Reviewed By: ejguan Differential Revision: D26003682 Pulled By: anjali411 fbshipit-source-id: f02967d2d236d740cd8647891f732f1d63098d3e
Summary: Pull Request resolved: #50593 There is no equivalent of torch.FloatTensor or torch.cuda.FloatTensor for complex types, so `get_gpu_type` and `get_cpu_type` are broken for complex tensors. Also found a few places that explicitly cast inputs to floating point types, which would drop the imaginary component before running the test. Test Plan: Imported from OSS Reviewed By: ngimel Differential Revision: D25954050 Pulled By: mruberry fbshipit-source-id: 1fa8e5af233aa095c839d5e2f860564baaf92aef
Summary: Pull Request resolved: #50736 Exposes tanh and sigmoid to other backends Test Plan: buck test caffe2/test/cpp/tensorexpr:tensorexpr -- "ATen.fast" Reviewed By: bertmaher Differential Revision: D25884911 fbshipit-source-id: f9a5286450331f60935cfd40bb23f4a4f4c1d087
…ort.h (#50832) Summary: Pull Request resolved: #50832 Please see the previous diff in this stack for the motivation to do so. This makes the same change but for the non-mobile codebase. ghstack-source-id: 120184012 Test Plan: Sandcastle + Build Reviewed By: raziel, iseeyuan Differential Revision: D25979986 fbshipit-source-id: 7708f4f6a50cb16d7a23651e5655144d277d0a4f
Summary: Pull Request resolved: #50912 Test Plan: Sandcastle tests Reviewed By: ansley Differential Revision: D26001948 fbshipit-source-id: 3bfe6a8283a2b1882ed472f836ae1b6e720e519f
Summary: Pull Request resolved: #50583 Test Plan: Imported from OSS Reviewed By: pbelevich Differential Revision: D26010501 Pulled By: ansley fbshipit-source-id: 947121af7e57c16c96f849fbbb3fa83e97d003b2
Summary: Pull Request resolved: #50951 Test Plan: Imported from OSS Reviewed By: fmassa Differential Revision: D26021488 Pulled By: IvanKobzarev fbshipit-source-id: 6d295762bb1160a3ed8bafac08e03e1eeb07d688
Summary: Can't think of a reason not to .gitignore the test-reports folder. This can be helpful because: 1. running `python test/test*.py` from the github root directory creates the folder at the root. 2. the CI test report path generated by `torch/testing/_internal/common_utils.py` creates the folder in the same path where the test python file is located. Creating a PR to make sure CI is happy. This is also needed by #50923. Pull Request resolved: #50952 Reviewed By: samestep Differential Revision: D26022436 Pulled By: walterddr fbshipit-source-id: 83e6296de802bd1754b802b8c70502c317f078c9
imaginary-person pushed a commit that referenced this pull request on May 26, 2021
Summary: added more statistic info for static runtime
Test Plan: caffe2/benchmarks/static_runtime:static_runtime_cpptest
Expected output example:
Static runtime ms per iter: 0.939483. Iters per second: 1064.41
Node #0: 0.195671 ms/iter, %wide_offset.1 : Tensor = aten::add(%wide.1, %self._mu, %4)
Node #1: 0.169457 ms/iter, %wide_normalized.1 : Tensor = aten::mul(%wide_offset.1, %self._sigma)
Node #2: 0.118218 ms/iter, %wide_preproc.1 : Tensor = aten::clamp(%wide_normalized.1, %5, %6)
Node #3: 0.038814 ms/iter, %user_emb_t.1 : Tensor = aten::transpose(%user_emb.1, %4, %7)
Node #4: 0.0860747 ms/iter, %dp_unflatten.1 : Tensor = aten::bmm(%ad_emb_packed.1, %user_emb_t.1)
Node #5: 0.0102666 ms/iter, %31 : Tensor = static_runtime::flatten_copy(%dp_unflatten.1, %4, %8)
Node #6: 0.000476333 ms/iter, %19 : Tensor[] = prim::ListConstruct(%31, %wide_preproc.1)
Node #7: 0.0707332 ms/iter, %input.1 : Tensor = aten::cat(%19, %4)
Node #8: 0.123695 ms/iter, %fc1.1 : Tensor = aten::addmm(%self._fc_b, %input.1, %29, %4, %4)
Node #9: 0.0309244 ms/iter, %23 : Tensor = aten::sigmoid(%fc1.1)
Node #10: 0.0046297 ms/iter, %24 : (Tensor) = prim::TupleConstruct(%23)
Time per node type:
0.195671 ms. 23.0483%. aten::add (1 nodes)
0.169457 ms. 19.9605%. aten::mul (1 nodes, out variant)
0.123695 ms. 14.5702%. aten::addmm (1 nodes, out variant)
0.118218 ms. 13.925%. aten::clamp (1 nodes, out variant)
0.0860747 ms. 10.1388%. aten::bmm (1 nodes, out variant)
0.0707332 ms. 8.33175%. aten::cat (1 nodes, out variant)
0.038814 ms. 4.57195%. aten::transpose (1 nodes)
0.0309244 ms. 3.64263%. aten::sigmoid (1 nodes, out variant)
0.0102666 ms. 1.20932%. static_runtime::flatten_copy (1 nodes, out variant)
0.0046297 ms. 0.545338%. prim::TupleConstruct (1 nodes, out variant)
0.000476333 ms. 0.0561079%. prim::ListConstruct (1 nodes, out variant)
0.848959 ms. in Total
StaticRuntime setup time: 0.018925 ms
Memory allocation time: 0.019808 ms
Memory deallocation time: 0.0120445 ms
Outputs deallocation time: 0.0864947 ms
Total memory managed: 19328 bytes
Total number of reused tensors: 3
Total number of 'out' variant nodes/total number of nodes: 9/11 (81.8182%)
Reviewed By: hlu1 Differential Revision: D28553029 fbshipit-source-id: 55e7eab50b4b475ae219896100bdf4f6678875a4
imaginary-person pushed a commit that referenced this pull request on Jul 2, 2021
Summary: Pull Request resolved: pytorch#60987 We were seeing deadlocks as follows during shutdown:
```
Thread 1 (LWP 2432101):
#0 0x00007efca470190b in __pause_nocancel () from /lib64/libc.so.6
#1 0x00007efca49de485 in __pthread_mutex_lock_full () from /lib64/libpthread.so.0
#2 0x00007ef91d4c42c6 in __cuda_CallJitEntryPoint () from /lib64/libnvidia-ptxjitcompiler.so.1
#3 0x00007efc651ac8f1 in ?? () from /lib64/libcuda.so
#4 0x00007efc651aee03 in ?? () from /lib64/libcuda.so
#5 0x00007efc64f76b84 in ?? () from /lib64/libcuda.so
#6 0x00007efc64f77f5d in ?? () from /lib64/libcuda.so
#7 0x00007efc64eac858 in ?? () from /lib64/libcuda.so
#8 0x00007efc64eacfbc in ?? () from /lib64/libcuda.so
#9 0x00007efc7810a924 in ?? () from /usr/local/cuda/lib64/libcublas.so.11
#10 0x00007efc780fa2be in ?? () from /usr/local/cuda/lib64/libcublas.so.11
#11 0x00007efc78111044 in ?? () from /usr/local/cuda/lib64/libcublas.so.11
#12 0x00007efc7811580a in ?? () from /usr/local/cuda/lib64/libcublas.so.11
#13 0x00007efc78115aa4 in ?? () from /usr/local/cuda/lib64/libcublas.so.11
#14 0x00007efc781079ec in ?? () from /usr/local/cuda/lib64/libcublas.so.11
#15 0x00007efc780e6a7a in ?? () from /usr/local/cuda/lib64/libcublas.so.11
#16 0x00007efc7811cfa5 in ?? () from /usr/local/cuda/lib64/libcublas.so.11
#17 0x00007efc777ea98c in ?? () from /usr/local/cuda/lib64/libcublas.so.11
#18 0x00007efc777ebd80 in ?? () from /usr/local/cuda/lib64/libcublas.so.11
#19 0x00007efc777ea2c9 in ?? () from /usr/local/cuda/lib64/libcublas.so.11
#20 0x00007efc778c2e2d in cublasDestroy_v2 () from /usr/local/cuda/lib64/libcublas.so.11
#21 0x00007efc51a3fb56 in std::_Sp_counted_ptr_inplace<at::cuda::(anonymous namespace)::DeviceThreadHandlePool<cublasContext*, &at::cuda::(anonymous namespace)::createCublasHandle, &at::cuda::(anonymous namespace)::destroyCublasHandle>, std::allocator<at::cuda::(anonymous namespace)::DeviceThreadHandlePool<cublasContext*, &at::cuda::(anonymous namespace)::createCublasHandle, &at::cuda::(anonymous namespace)::destroyCublasHandle> >, (__gnu_cxx::_Lock_policy)2>::_M_dispose() () from /data/users/pritam/pytorch/torch/lib/libtorch_cuda.so
#22 0x00007efc51a3fc5f in std::shared_ptr<at::cuda::(anonymous namespace)::DeviceThreadHandlePool<cublasContext*, &at::cuda::(anonymous namespace)::createCublasHandle, &at::cuda::(anonymous namespace)::destroyCublasHandle> >::~shared_ptr() () from /data/users/pritam/pytorch/torch/lib/libtorch_cuda.so
#23 0x00007efca4648b0c in __run_exit_handlers () from /lib64/libc.so.6
#24 0x00007efca4648c40 in exit () from /lib64/libc.so.6
#25 0x0000558c8852e5f9 in Py_Exit (sts=0) at /tmp/build/80754af9/python_1614362349910/work/Python/pylifecycle.c:2292
#26 0x0000558c8852e6a7 in handle_system_exit () at /tmp/build/80754af9/python_1614362349910/work/Python/pythonrun.c:636
#27 0x0000558c8852e742 in PyErr_PrintEx (set_sys_last_vars=<optimized out>, set_sys_last_vars=<optimized out>) at /tmp/build/80754af9/python_1614362349910/work/Python/pythonrun.c:646
#28 0x0000558c88540dd6 in PyRun_SimpleStringFlags (command=0x7efca4dc9050 "from multiprocessing.spawn import spawn_main; spawn_main(tracker_fd=9, pipe_handle=13)\n", flags=0x7ffe3a986110) at /tmp/build/80754af9/python_1614362349910/work/Python/pythonrun.c:457
#29 0x0000558c88540ead in pymain_run_command (cf=0x7ffe3a986110, command=<optimized out>) at /tmp/build/80754af9/python_1614362349910/work/Modules/main.c:420
#30 pymain_run_python (pymain=0x7ffe3a986220) at /tmp/build/80754af9/python_1614362349910/work/Modules/main.c:2907
#31 pymain_main (pymain=0x7ffe3a986220) at /tmp/build/80754af9/python_1614362349910/work/Modules/main.c:3460
#32 0x0000558c8854122c in _Py_UnixMain (argc=<optimized out>, argv=<optimized out>) at /tmp/build/80754af9/python_1614362349910/work/Modules/main.c:3495
#33 0x00007efca4632493 in __libc_start_main () from /lib64/libc.so.6
#34 0x0000558c884e5e90 in _start () at ../sysdeps/x86_64/elf/start.S:103
```
This was likely caused by a static singleton that wasn't leaky. Following the guidance in https://isocpp.org/wiki/faq/ctors#construct-on-first-use-v2, this change uses a leaky singleton instead. ghstack-source-id: 132847448 Test Plan: Verified locally. Reviewed By: malfet Differential Revision: D29468866 fbshipit-source-id: 89250594c5cd2643417b1da584c658b742dc5a5c
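The leaky-singleton fix referenced above follows the construct-on-first-use idiom: allocate the handle pool with `new` and never destroy it, so no exit-time destructor calls back into the CUDA driver during shutdown. A minimal sketch of the pattern, with `HandlePool` as a simplified stand-in for the real pool type:

```
// Sketch of the construct-on-first-use, intentionally-leaky singleton idiom.
// HandlePool is a placeholder; in PyTorch the real type wraps cuBLAS handles.
#include <mutex>
#include <vector>

struct HandlePool {
  std::mutex mutex;
  std::vector<int> handles;  // stand-in for real driver handles
  ~HandlePool() { /* would call into the driver -- unsafe at exit() time */ }
};

HandlePool& getHandlePool() {
  // Heap-allocated and never deleted: no exit-time destructor runs, so
  // shutdown never re-enters the (possibly already torn down) driver.
  static HandlePool* pool = new HandlePool();
  return *pool;
}
```

The trade-off is that the pool is deliberately leaked at process exit, which is harmless since the OS reclaims the memory anyway.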
imaginary-person pushed a commit that referenced this pull request on Jul 20, 2021
Summary: Pull Request resolved: pytorch#61588 As part of debugging pytorch#60290, we discovered the following deadlock:
```
Thread 79 (Thread 0x7f52ff7fe700 (LWP 205437)):
#0 pthread_cond_timedwait@GLIBC_2.3.2 () at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:225
#1 0x0000564880199152 in PyCOND_TIMEDWAIT (cond=0x564880346080 <gil_cond>, mut=0x564880346100 <gil_mutex>, us=5000) at /home/builder/ktietz/cos6/ci_cos6/python_1622833237666/work/Python/condvar.h:103
#2 take_gil (tstate=0x7f5254005ef0) at /home/builder/ktietz/cos6/ci_cos6/python_1622833237666/work/Python/ceval_gil.h:224
#3 0x0000564880217b62 in PyEval_AcquireThread (tstate=0x7f5254005ef0) at /home/builder/ktietz/cos6/ci_cos6/python_1622833237666/work/Python/ceval.c:278
#4 0x00007f557d54aabd in pybind11::gil_scoped_acquire::gil_scoped_acquire() () from /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_python.so
#5 0x00007f557da7792f in (anonymous namespace)::concrete_decref_fn(c10::impl::PyInterpreter const*, _object*) () from /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_python.so
#6 0x00007f5560dadba6 in c10::TensorImpl::release_resources() () from /opt/conda/lib/python3.6/site-packages/torch/lib/libc10.so
#7 0x00007f5574c885bc in std::_Sp_counted_ptr_inplace<torch::distributed::autograd::DistAutogradContext, std::allocator<torch::distributed::autograd::DistAutogradContext>, (__gnu_cxx::_Lock_policy)2>::_M_dispose() () from /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so
#8 0x00007f5574c815e9 in std::__detail::_Hashtable_alloc<std::allocator<std::__detail::_Hash_node<std::pair<long const, std::shared_ptr<torch::distributed::autograd::DistAutogradContext> >, false> > >::_M_deallocate_node(std::__detail::_Hash_node<std::pair<long const, std::shared_ptr<torch::distributed::autograd::DistAutogradContext> >, false>*) [clone .isra.325] () from /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so
#9 0x00007f5574c81bf1 in torch::distributed::autograd::DistAutogradContainer::eraseContextIdAndReset(torch::distributed::autograd::DistAutogradContainer::ContextsShard&, long) () from /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so
#10 0x00007f5574c86e83 in torch::distributed::autograd::DistAutogradContainer::releaseContextIfPresent(long) () from /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so
#11 0x00007f5574cc6395 in torch::distributed::rpc::RequestCallbackNoPython::processCleanupAutogradContextReq(torch::distributed::rpc::RpcCommandBase&) const () from /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so
#12 0x00007f5574cccf15 in torch::distributed::rpc::RequestCallbackNoPython::processRpc(torch::distributed::rpc::RpcCommandBase&, torch::distributed::rpc::MessageType const&, std::vector<c10::Stream, std::allocator<c10::Stream> >) const () from /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so

Thread 72 (Thread 0x7f53077fe700 (LWP 205412)):
#0 __lll_lock_wait () at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
#1 0x00007f55bc62adbd in __GI___pthread_mutex_lock (mutex=0x564884396440) at ../nptl/pthread_mutex_lock.c:80
#2 0x00007f5574c82a2f in torch::distributed::autograd::DistAutogradContainer::retrieveContext(long) () from /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so
#3 0x00007f557de9bb2f in pybind11::cpp_function::initialize<torch::distributed::autograd::(anonymous namespace)::dist_autograd_init(_object*, _object*)::{lambda(long)#11}, pybind11::dict, long, pybind11::name, pybind11::scope, pybind11::sibling, char [931], pybind11::arg>(torch::distributed::autograd::(anonymous namespace)::dist_autograd_init(_object*, _object*)::{lambda(long)#11}&&, pybind11::dict (*)(long), pybind11::name const&, pybind11::scope const&, pybind11::sibling const&, char const (&) [931], pybind11::arg const&)::{lambda(pybind11::detail::function_call&)#3}::_FUN(pybind11::detail::function_call) () from /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_python.so
```
Basically, Thread 72 holds the GIL and tries to acquire the lock on DistAutogradContainer to perform a lookup on a map. On the other hand, Thread 79 holds the lock on DistAutogradContainer to remove a Tensor, and as part of the TensorImpl destructor, concrete_decref_fn is called, which waits for the GIL. As a result, we have a deadlock. To fix this issue, I've ensured we release the GIL when we call `retrieveContext` and acquire it later when needed. ghstack-source-id: 133493659 Test Plan: waitforbuildbot Reviewed By: mrshenli Differential Revision: D29682624 fbshipit-source-id: f68a1fb39040ca0447a26e456a97bce64af6b79c
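The fix described above is the usual pybind11 pattern of dropping the GIL before taking a C++ lock that another thread might hold while waiting for the GIL. A simplified sketch of that ordering, with placeholder `Container`/`Context` types rather than the real `DistAutogradContainer` code:

```
// Sketch: release the GIL before acquiring a C++ mutex, so a thread that holds
// the mutex while waiting for the GIL (as in the deadlock above) can finish.
// Container/Context are simplified placeholders; get_context_binding is meant
// to be exposed via m.def(...) and therefore entered with the GIL held.
#include <pybind11/pybind11.h>
#include <cstdint>
#include <mutex>
#include <unordered_map>

namespace py = pybind11;

struct Context { int64_t id; };

struct Container {
  std::mutex mutex;
  std::unordered_map<int64_t, Context> contexts;
};
Container container;

Context retrieve_context(int64_t context_id) {
  // Only the C++ lock is taken here; the calling thread must NOT hold the GIL.
  std::lock_guard<std::mutex> guard(container.mutex);
  return container.contexts.at(context_id);
}

py::dict get_context_binding(int64_t context_id) {
  Context ctx;
  {
    py::gil_scoped_release no_gil;  // drop the GIL before touching the C++ lock
    ctx = retrieve_context(context_id);
  }                                 // GIL is re-acquired when no_gil goes out of scope
  py::dict result;                  // building Python objects requires the GIL
  result["context_id"] = ctx.id;
  return result;
}
```

The ordering is what breaks the lock-order inversion shown in the backtrace: the GIL is released before the container mutex is taken, and only re-acquired after the mutex has been dropped.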