
Integrate from upstream #282

Merged
iotamudelta merged 112 commits into ROCm:master from iotamudelta:ifu
Oct 23, 2018

Conversation

@iotamudelta

No description provided.

Roy Li and others added 30 commits October 16, 2018 09:19
Summary:
Pull Request resolved: pytorch#12381

The workflow passes after D10150834, so we can restore strides.

Reviewed By: ezyang

Differential Revision: D10220313

fbshipit-source-id: aaf9edebf4ff739cbe45b2d32e77918fce47ba34
Summary:
Fix pytorch#12624, an internal use case of legacy `reduce`.
Add a test in test_nn.
Pull Request resolved: pytorch#12689

Reviewed By: ezyang

Differential Revision: D10391195

Pulled By: ailzhang

fbshipit-source-id: 1af2b258c4abb2b6527eaaeac63e8bf1762c66a1
…d49783 (pytorch#12676)

Summary:
Pull Request resolved: pytorch#12676

Previous import was 06f6d63d5529e3a94533c9f34c402be1793420b1

Included changes:
- **[1cbe274](onnx/onnx@1cbe274)**: fix the optimizer (ROCm#1510) <Lu Fang>
- **[481ad99](onnx/onnx@481ad99)**: Fix TensorProto int32_data comment (ROCm#1509) <Lutz Roeder>
- **[f04fbe0](onnx/onnx@f04fbe0)**: fix ninja external (ROCm#1507) <Rui Zhu>

Reviewed By: jamesr66a, wanchaol

Differential Revision: D10388438

fbshipit-source-id: 298100589ce226c63d4e58edf185c9227fd52c85
Summary:
Pull Request resolved: pytorch#12685

In this diff, we push the fake run of the net into the ONNXIFI transformer, because
1. We cannot do shape inference for every op
2. Since the net has been SSA rewritten, we cannot use shape info from outer workspace directly.

In addition, this diff adds input shape info when querying the `onnxBackendCompatibility` function.

Reviewed By: bddppq

Differential Revision: D10390164

fbshipit-source-id: 80475444da2170c814678ed0ed3298e28a1fba92
Summary:
Pull Request resolved: pytorch#12593

size() returns numel_, but what we really want is nbytes(), which is the capacity.

Reviewed By: salexspb

Differential Revision: D10354488

fbshipit-source-id: f7b37ad79ae78290ce96f37c65caa37d91686f95
Summary:
There were two problems with SN + DP:

1. In SN, the updated _u vector is saved back to module via a `setattr`. However, in DP, everything is run on a replica, so those updates are lost.
2. In DP, the buffers are broadcast via a `broadcast_coalesced`, so on replicas they are all views. Therefore, the `detach_` call won't work.

Fixes are:
1. Update the _u vector in place so that, through the storage shared between the first replica and the parallelized module, the update is retained (see the sketch after this list)
2. Do not call `detach_`.
3. Added comments in SN about the subtlety.
4. Added a note to the DP doc on this particular behavior of DP.
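A minimal sketch of fix 1, with hypothetical helper names (the real logic lives in PyTorch's spectral_norm implementation): the in-place `copy_` writes through the storage that replica 0 shares with the original module, whereas rebinding the attribute would be lost on the replica.

```python
import torch
import torch.nn.functional as F

def power_iteration_inplace(weight_mat, u, eps=1e-12):
    # hypothetical sketch: one power-iteration step for spectral norm
    with torch.no_grad():
        v = F.normalize(torch.mv(weight_mat.t(), u), dim=0, eps=eps)
        # fix 1: update u in place -- replica 0 shares u's storage with
        # the original module, so the update is retained across DP runs;
        # a rebind like `u = F.normalize(...)` would be silently dropped
        u.copy_(F.normalize(torch.mv(weight_mat, v), dim=0, eps=eps))
        # fix 2: no `u.detach_()` here -- broadcast buffers are views
    return u, v
```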

cc crcrpar taesung89 yaoshengfu

Fixes pytorch#11476
Pull Request resolved: pytorch#12671

Differential Revision: D10410232

Pulled By: SsnL

fbshipit-source-id: c447951844a30366d8c196bf9436340e88f3b6d9
Summary:
Addressed Dima's feedback.

The proposal is here: https://fb.quip.com/TbQmAuqIznCf
Pull Request resolved: pytorch#12384

Reviewed By: dzhulgakov

Differential Revision: D10246743

Pulled By: houseroad

fbshipit-source-id: c80db0c35d60ca32965275da705f2b1dfb2a7265
Summary:
This PR contains changes for:
1. Removing the MIOpen softmax operator; it will be added back later with the required functionality
2. Enabling softmax_ops_test on ROCm target

Differential Revision: D10416079

Pulled By: bddppq

fbshipit-source-id: 288099903aa9e0c3378e068fffe6e7d6a9a84841
Summary:
Pull Request resolved: pytorch#12306

In a future diff, I'm going to introduce a non-placement constructor and destructor to TypeMeta.
To make this less ambiguous, this diff first renames the existing ones to PlacementXXX.

Reviewed By: dzhulgakov

Differential Revision: D10184117

fbshipit-source-id: 119120ebc718048bdc1d66e0cc4d6a7840e666a4
Summary:
The mapping protocol stipulates that when `__delitem__` is called, this is passed to `__setitem__` [(well, the same function in the C extension interface)](https://docs.python.org/3/c-api/typeobj.html#c.PyMappingMethods.mp_ass_subscript) with NULL data.

PyTorch master crashes in this situation; with this patch, it no longer does.

Test code (careful, segfaults your interpreter):
```python
import torch
a = torch.randn(5)
del a[2]
```
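With the patch, the C-level assignment slot guards against the NULL value that deletion passes, so the same code fails with a clean exception instead of a crash. A hedged sketch (the exact exception type is an assumption):

```python
import torch

a = torch.randn(5)
try:
    # deletion reaches the shared mp_ass_subscript slot with a NULL
    # value; the guard now raises instead of dereferencing NULL
    del a[2]
except (TypeError, RuntimeError) as e:
    print("caught:", e)
```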
Pull Request resolved: pytorch#12726

Differential Revision: D10414244

Pulled By: colesbury

fbshipit-source-id: c49716e1a0a3d9a117ce88fc394858f1df36ed79
pytorch#12691)

Summary:
Pull Request resolved: pytorch#12691

We check input(0) but not input(1) in BatchMatMul. This may result in a protobuf exception that won't be caught upstream, causing termination of the program. Checking it with `CAFFE_ENFORCE` instead raises an error that will be caught by the upstream inference function. Plus, it prints a clean stack trace showing where things went wrong.

Reviewed By: bddppq, houseroad, BIT-silence

Differential Revision: D10391130

fbshipit-source-id: daf8dcd8fcf9629a0626edad660dff54dd9aeae3
Summary:
- update the sparse tensor docs examples after the print format changed
- update the example that creates an empty sparse tensor:
```
>>> torch.sparse_coo_tensor(torch.LongTensor(size=[1,0]), [], torch.Size([1]))
tensor(indices=tensor([], size=(1, 0)),
       values=tensor([], size=(0,)),
       size=(1,), nnz=0, layout=torch.sparse_coo)
```

zou3519 SsnL yf225
Pull Request resolved: pytorch#12221

Differential Revision: D10412447

Pulled By: weiyangfb

fbshipit-source-id: 155b8cb0965f060e978f12239abdc1b3b41f6ab0
Summary:
Pull Request resolved: pytorch#12696

In the majority of cases, we use `InheritOnnxSchema(type_)`. This diff makes declaring such cases easier.

Reviewed By: bddppq

Differential Revision: D10395109

fbshipit-source-id: 914c1041387d5be386048d923eb832244fc506c3
Summary:
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: pytorch#12693

Differential Revision: D10419424

Pulled By: ezyang

fbshipit-source-id: dc3999253f19b5615849619bd3e4a77ab3ca984e
Summary:
Fixes pytorch#11683.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: pytorch#11705

Differential Revision: D9833057

Pulled By: ezyang

fbshipit-source-id: 18af9bcd77b088326738d567100fbe4a4c869dd6
Summary:
Before-and-after results to come once I run the tests on CI.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: pytorch#12610

Differential Revision: D10419483

Pulled By: ezyang

fbshipit-source-id: 5543e971f8362e4cea64f332ba44a26c2145caea
Summary:
This is a Caffe2 optimization.
With this optimization, the following two ops get a large boost (tested with Mask R-CNN, on one SKX8180 socket):
BatchPermutation op: reduced from 8.296387 ms to 1.4501984 ms.
Pull Request resolved: pytorch#12153

Differential Revision: D10362823

Pulled By: ezyang

fbshipit-source-id: 04d1486f6c7db49270992cd8cde41092154e62ee
Summary:
Module.to uses the Tensor.to parsing facility.
It should not, however, accept "copy" as a keyword/fourth positional
argument.

See pytorch#12571 for discussion.

Thank you SsnL for noticing.
Pull Request resolved: pytorch#12617

Differential Revision: D10392053

Pulled By: ezyang

fbshipit-source-id: b67a5def7993189b4b47193abc7b741b7d07512c
Summary:
Optimize the UpsampleNearest Op.
1. Add OMP
2. Revise the translated_idx method
Pull Request resolved: pytorch#12151

Differential Revision: D10362856

Pulled By: ezyang

fbshipit-source-id: 535a4b87c7423942217f2d79bedc463a0617c67a
Summary:
This PR removes some duplication in `recurrent_op_cudnn.cc`. Instead of four copies of the exact same descriptor, it should work fine with just one. I don't see any other code that relies on those being four separate locations, but if that is what you need, you can always allocate additional descriptors as necessary.

I have not fully tested this out; it's just something I noticed when reading through the descriptor code.

Cheers
Pull Request resolved: pytorch#8321

Differential Revision: D10363744

Pulled By: ezyang

fbshipit-source-id: 733c8242fb86866f1d64cfd79c54ee7bedb03b84
Summary: Add a mapping for conversion -- this will help with debugging as well but is directly used by the TUI stacked on top of this

Reviewed By: duc0

Differential Revision: D10396130

fbshipit-source-id: cdd39278f0ed563bb828b1aebbbd228f486d89c8
Summary:
Where is declared as:

```
where(Tensor condition, Tensor self, Tensor other)
```

Previously the compiler assumed that self must be the first argument.
But this is not true in practice for `where` and for a few other exceptions.

This changes the compiler to take an explicit self argument which gets matched
to the `self` that appears in the schema.

Note that this requires renaming a variant of pow, which referred to
an exponent Tensor as `self` because otherwise that would cause `t^3`
to match against `t` being the exponent.
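For context, a quick example of the user-facing argument order, where the Tensor bound to `self` in the schema is the second argument:

```python
import torch

cond = torch.tensor([True, False, True])
x = torch.tensor([1.0, 2.0, 3.0])    # matched to `self` in the schema
y = torch.tensor([10.0, 20.0, 30.0])
print(torch.where(cond, x, y))       # tensor([ 1., 20.,  3.])
```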
Pull Request resolved: pytorch#12385

Differential Revision: D10364658

Pulled By: zdevito

fbshipit-source-id: 39e030c6912dd19b4b0b9e35fcbabc167b4cc255
Summary:
- Fix broken sparse_coo_examples, update output
- Tensor(...) to tensor(...)
- Fix arguments to math.log to be floats

While the last might be debatable, mypy currently complains when passing an int to math.log. As it is not essential for our examples, let's be clean w.r.t. other people's expectations.

These popped up while checking examples in the context of  pytorch#12500 .
Pull Request resolved: pytorch#12707

Differential Revision: D10415256

Pulled By: SsnL

fbshipit-source-id: c907b576b02cb0f89d8f261173dbf4b3175b4b8d
Summary:
I struggled with yet another DataLoader hang for the entire evening. After numerous experiments, I realized that it is unsafe to do anything when Python is shutting down. We also, unfortunately, implement our DataLoader cleanup logic in `__del__`, a function that may or may not be called during shutdown, and if called, may or may not be called before core library resources are freed.

Fortunately, we are already setting all our workers and pin_memory_thread as daemonic. So in case of Python shutting down, we can just do a no-op in `__del__` and rely on the automatic termination of daemonic children.

An `atexit` hook is used to detect Python exit.
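A minimal sketch of that pattern, with illustrative names rather than the actual DataLoader internals:

```python
import atexit

_python_exit = False

def _set_python_exit():
    # flipped as the interpreter begins shutting down
    global _python_exit
    _python_exit = True

atexit.register(_set_python_exit)

class _LoaderIterSketch:
    def __del__(self):
        if _python_exit:
            # no-op at shutdown: daemonic workers and the
            # pin_memory_thread are terminated automatically
            return
        self._shutdown_workers()  # hypothetical normal-path cleanup
```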
Pull Request resolved: pytorch#12700

Differential Revision: D10419027

Pulled By: SsnL

fbshipit-source-id: 5753e70d03e69eb1c9ec4ae2154252d51e2f79b0
Summary:
They are flaky in master.

ashishfarmer petrex

Pull Request resolved: pytorch#12749

Differential Revision: D10420265

Pulled By: bddppq

fbshipit-source-id: cac58efb711941786b10b07ada58e0d59ab1db1d
Differential Revision:
D10220313

Original commit changeset: aaf9edebf4ff

fbshipit-source-id: 46c4d23d89d47be26c3f4967476271d8c2f95f11
Summary:
include atomicAdd commentary as this is less well known

There is some discussion in pytorch#12207

Unfortunately, I cannot seem to get the ..include working in `_tensor_docs.py` and `_torch_docs.py`. I could use a hint for that.
Pull Request resolved: pytorch#12217

Differential Revision: D10419739

Pulled By: SsnL

fbshipit-source-id: eecd04fb7486bd9c6ee64cd34859d61a0a97ec4e
Summary:
Fixed the second example in NLLLoss.
The LogSoftmax activation was missing after the convolution layer. Without this activation, the second example loss was sometimes negative.
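Paraphrasing the corrected example (shapes follow the NLLLoss docs; the point is that the loss must receive log-probabilities, which keeps it non-negative):

```python
import torch
import torch.nn as nn

N, C = 5, 4
conv = nn.Conv2d(16, C, (3, 3))
log_softmax = nn.LogSoftmax(dim=1)   # the activation that was missing
loss = nn.NLLLoss()

data = torch.randn(N, 16, 10, 10)                      # conv -> (N, C, 8, 8)
target = torch.empty(N, 8, 8, dtype=torch.long).random_(0, C)
out = loss(log_softmax(conv(data)), target)            # now always >= 0
```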
Pull Request resolved: pytorch#12703

Differential Revision: D10419694

Pulled By: ezyang

fbshipit-source-id: 98bfefd1050290dd5b29d3ce18fe075103db4674
Summary:
Pull Request resolved: pytorch#12307

This adds non-placement variants of New/Delete to TypeMeta.
In a future diff, this is going to be used from Blob to destruct its contents.

Reviewed By: dzhulgakov

Differential Revision: D10184116

fbshipit-source-id: 7dc5592dbb9d7c4857c0ec7b8570329b33ce5017
Summary:
Pull Request resolved: pytorch#11500

Since TypeMeta already stores a destructor, and we removed the ability from Blob to store a custom destructor in a diff stacked below this, there is now no reason for Blob to store it again.

Reviewed By: ezyang

Differential Revision: D9763423

fbshipit-source-id: d37a792ffd6928ed1906f5ba88bd4f1d1e2b3781
zdevito and others added 22 commits October 19, 2018 00:25
Summary:
trying again without xlocale.h
Pull Request resolved: pytorch#12838

Differential Revision: D10453078

Pulled By: zdevito

fbshipit-source-id: 760852c82e16acee7d1abb8a918822bf5ff59bca
Summary:
I'm trying to do some transformations on Declarations.cwrap and this makes things overly difficult and doesn't do anything useful.
Pull Request resolved: pytorch#12832

Reviewed By: ezyang

Differential Revision: D10450771

Pulled By: gchanan

fbshipit-source-id: 1abb1bce27b323dd3e93b52240e7627cd8e56566
Summary:
Pull Request resolved: pytorch#12736

This updates UpsampleBilinearOp and UpsampleBilinearGradientOp to support scales to bring it inline with ResizeNearestOp pytorch#12720.

Reviewed By: houseroad

Differential Revision: D10416228

fbshipit-source-id: f339b7e06979c9c566afb4cee64a2d939b352957
Summary: Pull Request resolved: pytorch#12833

Differential Revision: D10464815

Pulled By: yf225

fbshipit-source-id: 06a6a673b6bb32f7c252a217f9ce59db35c75e9c
Summary: Pull Request resolved: pytorch#12667

Differential Revision: D10466661

Pulled By: yf225

fbshipit-source-id: a1a150d3b384eb88ba4c7e6d57e59d8ed834e53c
…ytorch#12849)

Summary:
I got annoyed at waiting for OSS to tell me my c10d builds were busted, so
I also added support for building the test scripts in fbcode and fixed the
warnings this uncovered.

Pull Request resolved: pytorch#12849

Reviewed By: pietern

Differential Revision: D10457671

fbshipit-source-id: 5b0e36c606e397323f313f09dfce64d2df88faed
Summary:
Pull Request resolved: pytorch#12845

Attempting to do this again.

Reallocation of strides_ when there is no change in dim seems to have caused the error that broke the internal flow last time. This fixes that. We found a potential race condition in caffe2 counter ops that might be the cause; we will investigate.

Reviewed By: ezyang

Differential Revision: D10421896

fbshipit-source-id: b961ea0bca79757991013a2d60cfe51565689ee9
…2731)

Summary:
Add strings to our set of built-in types for annotations. This is used in the functional library.
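For instance, a script function can now carry `str` annotations (a small sketch):

```python
import torch

@torch.jit.script
def greet(name: str) -> str:
    # `str` is accepted as a built-in annotation type
    return "hello, " + name
```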
Pull Request resolved: pytorch#12731

Differential Revision: D10453153

Pulled By: eellison

fbshipit-source-id: f54177c0c529f2e09f7ff380ddb476c3545ba5b0
Summary:
Pull Request resolved: pytorch#12844

Optimize GroupNormOp

Reviewed By: houseroad

Differential Revision: D10455567

fbshipit-source-id: aee211badd1e0c8ea6196843e3e77f7c612a74d5
pytorch#12836)

Summary:
`OMP_NUM_THREADS` and `MKL_NUM_THREADS` are set to 4 by default in the docker images, which causes `nproc` to only show 4 cores in the docker containers by default, and building PyTorch is slow in this default case. We likely don't need these two flags to be set, and this PR tests that hypothesis.
Pull Request resolved: pytorch#12836

Differential Revision: D10468218

Pulled By: yf225

fbshipit-source-id: 7a57962c962e162a8d97f730626825aa1e371c7f
Differential Revision:
D10421896

Original commit changeset: b961ea0bca79

fbshipit-source-id: 9d9d2ed0c2cb23a3fdf6bbfc9509539aeeb7e382
Summary: Pull Request resolved: pytorch#12717

Reviewed By: ilia-cher

Differential Revision: D10408325

fbshipit-source-id: 82583d0ad4b8db094ee4c5c607b52500826328f7
Summary:
Pull Request resolved: pytorch#12840

Add binding for delete_node

Reviewed By: duc0

Differential Revision: D10453555

fbshipit-source-id: cdcaca8420a9a0c61479961d907ef6bb5478a41d
Summary:
This fixes the issue for pytorch#12168
Pull Request resolved: pytorch#12694

Differential Revision: D10468717

Pulled By: teng-li

fbshipit-source-id: 3df31d75eea19d6085af665f5350d3cb667a5048
Summary:
Pull Request resolved: pytorch#12881

TSIA. This should not change any functionality.

Remaining work:
- change the build script to deprecate use of CAFFE2_USE_MINIMAL_GOOGLE_GLOG and use a C10 macro instead.
- Unify the exception name (EnforceNotMet -> Error)
- Unify the logging and warning APIs (like AT_WARNING)

Reviewed By: dzhulgakov

Differential Revision: D10441597

fbshipit-source-id: 4784dc0cd5af83dacb10c4952a2d1d7236b3f14d
Summary:
This test flushes out the issue that IDEEP cannot handle a tensor with dims like (0, 2), which is a valid tensor shape.
Pull Request resolved: pytorch#8459

Differential Revision: D10419328

Pulled By: yinghai

fbshipit-source-id: c5efcd152364a544180a8305c47a2a2d126ab070
Summary:
Pull Request resolved: pytorch#12790

Add DFS based topological sort to nomnigraph.
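The nomnigraph change is C++; for reference, the same DFS-based algorithm in Python (a generic sketch, not the actual nomnigraph API):

```python
def topological_sort(graph):
    # graph: dict mapping node -> iterable of successor nodes
    order, visiting, done = [], set(), set()

    def visit(node):
        if node in done:
            return
        if node in visiting:
            raise ValueError("graph has a cycle")
        visiting.add(node)
        for succ in graph.get(node, ()):
            visit(succ)
        visiting.remove(node)
        done.add(node)
        order.append(node)  # post-order: all successors already emitted

    for node in graph:
        visit(node)
    order.reverse()  # so predecessors come before successors
    return order
```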

Reviewed By: duc0

Differential Revision: D10434645

fbshipit-source-id: aaf106b0cc37806b8ae61f065c1592a29993eb40
Summary: Pull Request resolved: pytorch#12864

Differential Revision: D10481669

Pulled By: SsnL

fbshipit-source-id: 20831af41aaba75546e6ed6a99f011f0447b1acf
Summary:
* Moves `weak_script` annotation to `torch/_jit_internal.py` folder to resolve dependency issue between `torch.jit` and `torch.nn`
* Add `torch._jit.weak_script` to `tanhshrink` and `softsign`, their tests now pass instead of giving an `unknown builtin op` error
* Blacklist converted `torch.nn.functional` functions from appearing in the builtin op list if they don't actually have corresponding `aten` ops
Pull Request resolved: pytorch#12723

Differential Revision: D10452986

Pulled By: driazati

fbshipit-source-id: c7842bc2d3ba0aaf7ca6e1e228523dbed3d63c36
Summary:
For pytorch#10114

soumith fmassa
Pull Request resolved: pytorch#12708

Differential Revision: D10444102

Pulled By: goldsborough

fbshipit-source-id: 529e737e795bd8801beab2247be3dad296af5a3e
…2835)

Summary:
This is designed to make it easier to see how your codegen changes affected actual generated code.

Limitations:
A) This is NOT robust; if new directories are added that include generated files, they need to be added to tools/generated_dirs.txt.  Note that subdirectories of the list are not included.

B) This is particular to my workflow which I don't claim is generally applicable.  Ideally we would have a script that pumped out a diff that could be attached to PRs.

C) Only works on OSS and definitely won't work on windows.

How to use:
1) python setup.py ...
2) tools/git_add_generated_dirs
3) Edit codegen
4) python setup.py ...
5) git diff to see changes
6) If satisfied: tools/git_reset_generated_dirs, commit, etc.
   If not satisfied: Go to 3)
Pull Request resolved: pytorch#12835

Reviewed By: ezyang

Differential Revision: D10452255

Pulled By: gchanan

fbshipit-source-id: 294fc74d41d1b840c7a26d20e05efd0aff154635
@iotamudelta iotamudelta requested a review from ezyang as a code owner October 22, 2018 15:59
@iotamudelta
Author

@pytorchbot retest this please

@iotamudelta iotamudelta merged commit 5a78d04 into ROCm:master Oct 23, 2018
amd-sriram pushed a commit that referenced this pull request Nov 24, 2025
Commit Messages:
- Update README.md (#289)
- Update version to 1.10.0 (#282)
- add code to read BUILD_VERSION env variable, so that it is used instead of version.txt when creating a wheel (#278)

PRs:
- ROCm/apex#289

amd-sriram pushed a commit that referenced this pull request Mar 2, 2026
Commit Messages:
- Update README with release notes for version 1.11.0 (#310)

Added release notes for version 1.11.0, including new extensions and upgrades. Updated previous release notes for clarity.
- Create custom python operators for MixedFusedLayerNorm and MixedFusedRMSNorm. (#304)
- Add new apex module to jit load system (#294)

* add code to add loader module for jit module

* fix errors to create jit module adder - use correct file name to save code to

* fix errors to create jit module adder - use correct class name of the builder and parameter to supply builder module name

* fix errors to create jit module loader

* add description about jit module script to add jit loader for a jit module with builder provided

* add description about jit module script to add jit loader for a jit module with builder provided

* add attributes and methods to override when creating a jit module builder

* add extra new lines

* update jit module to take the builder file name and extract module name from the builder, update missing entries in the table in readme for adding new module in jit

* refine the description about module to jit

* add description about jit

* add description about jit

* add code to create a builder based on user inputs

* change the example from fused_dense to swiglu

* allow user to skip sources list

* change description of cxx and nvcc flags, add description of methods and fields in the initial builder code created by script
- add details of fused_conv_bias_relu in table of modules and fix error of maximum depth reached (#297)

* add details of fused_conv_bias_relu in table of modules and build flag

* solve the maximum depth error.
- Port fused_conv_bias_relu to ROCm (#295)

* Add support for conv bias relu

* Fix compilation failure

* omit check_cudnn_version_and_warn check (no cuDNN on ROCm)

* Flatten bias for PyTorch from 4D to 1D

* Implement fusion of Conv with ReLU with MIOpen

* Fix compilation issues

* Fix crash for ConvBias

* Fix merge issues

* Add support for ConvBias and ConvBiasMaskRelu

* Fix segmentation fault on bwd for ConvBias

* add code for fusing conv+bias for retinanet, add test case for retinanet

* Fix torch warning

* Fix warnings in a unit test file as well

* add builder and loader for fused_conv_bias_relu module

---------

Co-authored-by: Sergey Solovyev <sergey.solovyev@amd.com>
Co-authored-by: Mikko Tukiainen <mikko.tukiainen@amd.com>
- Bump version from 1.10.0 to 1.11.0 (#293)
- [REDUX] Refactor Apex build process to use the PyTorch JIT extension flow (#291)

* Created initial code for loading fused_dense module dynamically instead of building it. Code uses accelerator and op_builder modules from deepspeed code.

* add apex/git_version_info_installed.py to gitignore as it is dynamically created by setup.py for the build process

* add code for building fused rope dynamically

* add code for building fused bias swiglu dynamically

* fix the code so that fused rope and fused softmax are not compiled in jit mode, add csrc back to setup.py since it is not copied to apex wheel

* load the jit modules inside and this prevents them from building when building the wheel

* convert syncbn module to jit

* fix the unnecessary compile of syncbn module in wheel building due to imports in python module

* add fused layer norm module to jit build

* make focal loss module as jit module

* make focal loss module as jit module

* make xentropy module as jit module

* make bpn module as jit module

* add code to build individual extensions without JIT

* clean up the flags for the modules based on apex/setup.py

* add function to get the backward_pass_guard_args in CudaOpBuilder and make MLP JIT compile

* add fused weight gradient mlp to jit compile

* move fused_weight_gradient_mlp_cuda load inside so that it is not compiled during apex installation

* make fused index mul 2d jit compile and add aten atomic header flag method to CUDAOpBuilder to support its jit compile

* make fast multihead attention as jit module, add generator_args to CudaOpBuilder to support jit of this module

* make transducer loss and transducer joint modules as jit modules, add nvcc_threads_args method in CUDAOpBuilder to support these jit modules

* remove extra method - installed_cuda_version from CUDAOpBuilder

* add apex_C module to jit compile, add py-cpuinfo to requirements.txt as it is needed for TorchCPUOpBuilder

* make nccl allocator as a jit compile module, add nccl_args method to CUDAOpBuilder to support this

* make amp_C as a jit module

* add a few uses of amp_C jit module

* add a few uses of amp_C jit module

* make fused adam as a jit module

* add a few uses of amp_C jit module

* fix the issue with fused adam jit module

* make fused lamb as jit module

* make distributed adam as jit module

* make distributed lamb as jit module

* add remaining amp_C uses with jit loader

* add remaining usage of apexC jit module

* make nccl p2p module as jit compile

* make peer memory module as jit compile

* add code to check for minimum nccl version to compile nccl allocator module

* add provision to provide APEX_CPP_OPS=1  and APEX_CUDA_OPS=1 as replacement for --cpp_ext --cuda_ext command line arguments for building specific extensions in apex, save these settings for later use

* check for minimum torch version for nccl allocator; check if the module is compatible, otherwise remove it from the installed ops list

* add build as a dependency to support wheel building

* Replace is_compatible with is_supported to check for installation conditions, because there is an issue with loading nccl allocator

* Similar to pytorch, we create a make command that the user can use to install aiter. aiter will not be built in setup.py

* update extension import test so that it considers jit compile extensions

* clean up MultiTensorApply usages so that amp_C is not built in jit compile mode

* Adding missing modules from deepspeed repo. Remove extra code in setup.py. Use is_compatible instead of is_supported

* change name of apex_C module

* change the name of cpp and cuda build flags, remove APEX_BUILD_OPS, cleanup the logic to build specific modules

* add missing files used in cpu accelerator

* add make clean command to handle deleting torch extensions installed for jit modules, fix the cpu builder import error

* remove unused code in setup.py, fix the code to build for cpu mode

* Removing unused code

* remove accelerator package and refactor the used code into op_builder.all_ops BuilderUtils class

* remove accelerator package usages

* revert code that was removed by mistake

* Cleaning up the setup file and renaming functions and variables to more readable names.

* Fix the nccl version so that the nccl_allocator.so file can be loaded properly.

The setup() call has an argument called py_modules which copies the python classes into the site-packages folder. The python modules in the compatibility folder lazily load the builder classes. These files are first copied into the parent folder so that they themselves land in site-packages and the kernel can be loaded into python; the temporary copies are then deleted.

* Restore to original importing the extension code.

* renamed compatibility/scaled_masked_softmax_cuda.py, added some extra tests in the contrib test runner

* Added instructions for JIT load and changes in installation options

* Restructuring the README

* Added instructions for building wheel

* replaced TorchCPUBuilder with CPUBuilder, added a main method in contrib test runner

* create a script to build different jit conditions for running different tests

* add script to run tests with different jit builds, add instructions to run jit build and tests in readme, add other tests in readme

* fix the issues with running the tests - improper paths, counting .so files in apex folder

* add mad internal scripts

* remove print statement

* remove testing section from readme

* change location of result file

* remove multiple results file from models.json

* add platform specific description to wheel name even if no CppExtension or CUDAExtension is built with JIT load approach

* add ninja and wheel to requirements to be installed

* Update Release notes in Readme

* Exclude compatibility folder while installing apex

* Update README.md

* Update README.md

* Update README.md

* Adding modification note to the original copyright

* fix the issue with symbolic links for op_builder, csrc when the apex repo is cloned in the docker

* assign the symbolically linked folders into a variable and then loop across the list entries

* remove unnecessary tabs

---------

Co-authored-by: skishore <sriramkumar.kishorekumar@amd.com>
Co-authored-by: sriram <sriram.kumar@silo.ai>
- Pow implementation is very expensive on AMD CDNA4. (#292)

This commit changes it to the mathematically equivalent
exp(y*log(x)) for x > 0 (see the sketch after this commit list).
However, a 1-2 ULP precision loss is possible.
- Update README.md (#289)
- Update version to 1.10.0 (#282)
- add code to read BUILD_VERSION env variable, so that it is used instead of version.txt when creating a wheel (#278)
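The identity behind the pow commit (#292) above, sketched in Python; the ULP caveat arises because log and exp each round once:

```python
import math

def pow_via_exp_log(x: float, y: float) -> float:
    # mathematically x**y == exp(y * ln(x)) for x > 0, but the two
    # roundings (log, then exp) can cost 1-2 ULP versus a direct pow
    assert x > 0
    return math.exp(y * math.log(x))

print(pow_via_exp_log(2.0, 10.0))  # ~1024.0
```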

PRs:
- ROCm/apex#310
