Integrate from upstream #293
Merged
iotamudelta merged 91 commits into ROCm:master on Oct 26, 2018
Conversation
Summary: Pull Request resolved: pytorch#12878 Python 3.6 headers define their own ssize_t, which clashes with our definition. Luckily, they also define a `HAVE_SSIZE_T` macro we can use to check for this case. Reviewed By: ezyang Differential Revision: D10467239 fbshipit-source-id: 661675ad1e30a6ca26d6790eaa75657ef6bf37c2
…icBlock, add move constructor. Summary: We cannot use copying, as it loses recorded callbacks; after copying, tracked values are no longer tracked. Reviewed By: bwasti, duc0 Differential Revision: D10510057 fbshipit-source-id: b64fdef3fb28fc26fe55eba41f4b5007ba6894de
Summary: the topological index shuffled arguments around, updating expect files. Pull Request resolved: pytorch#13005 Differential Revision: D10517246 Pulled By: michaelsuo fbshipit-source-id: 8f95e4e4ca8ff51da0507f9b0eb838c23ddaa821
…ixed reverted bug) (pytorch#12848) Summary: Pull Request resolved: pytorch#12848 Updated all non-test uses of protobuf::MessageLite::SerializeAsString to call SerializeAsString_EnforceCheck so that the return value is checked and can throw an exception if failing. Most of the affected code was called from classes derived from BlobSerializeBase. Didn't touch most tests and ENFORCE calls because they usually do checks anyway. Original commit changeset: c0760e73ecc7 Reviewed By: dzhulgakov Differential Revision: D10453456 fbshipit-source-id: d2f2b7b4578e721924354149f08f627c7e3bf070
Summary: TSIA - we want to deprecate numba in fbcode when moving to new compiler tiers. Converted the old test to a non-numba regular python op test. Reviewed By: xw285cornell Differential Revision: D10519910 fbshipit-source-id: 0e9188a6d0fc159100f0db704b106fbfde3c5833
Summary: Add support for indexing tuples with constant integers by creating a new prim::TupleIndex operator. Pull Request resolved: pytorch#11492 Differential Revision: D9811996 Pulled By: eellison fbshipit-source-id: a458c2522b3c81476252d920e27a8d6c7b9a036b
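A minimal sketch (function name hypothetical) of what this enables: indexing a tuple with a constant integer inside TorchScript, which the compiler can lower to the new prim::TupleIndex operator.

```python
import torch
from typing import Tuple

@torch.jit.script
def second(t: Tuple[int, int, int]) -> int:
    # Constant-integer tuple indexing inside a scripted function.
    return t[1]

print(second((10, 20, 30)))  # 20
```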
Summary: Cleaning up the interface for nomnigraph in C++ world Reviewed By: duc0 Differential Revision: D10438090 fbshipit-source-id: 6b4309b8a4b3730f3309edf0047d4006a001895b
Summary: Basic ops.def update and converter.cc updates. This is the standard way to ingest networks into nomnigraph. Reviewed By: duc0 Differential Revision: D10412639 fbshipit-source-id: a4c523fda96bbe0e31de0d9fcf795ae9c7377c90
Summary: `inspect.stack()` calls are slow since they access a bunch of extra info about the frame. This PR instead uses `inspect.currentframe()` and goes up the stack until it reaches the correct frame. [Context](stackoverflow.com/questions/17407119/python-inspect-stack-is-slow) Pull Request resolved: pytorch#12859 Differential Revision: D10509912 Pulled By: driazati fbshipit-source-id: b85325adf1b3c85a1a3a82e96e567b8be498531b
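A minimal sketch of the technique (helper name hypothetical; the PR's actual implementation lives in the PyTorch source): walk up from the current frame instead of calling `inspect.stack()`, which eagerly collects source context for every frame on the stack.

```python
import inspect

def caller_frame(depth: int = 1):
    # Cheap frame lookup: follow f_back instead of building the
    # full stack summary that inspect.stack() produces.
    frame = inspect.currentframe()
    for _ in range(depth):
        if frame is None:
            break
        frame = frame.f_back
    return frame
```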
Summary: Pull Request resolved: pytorch#13033 Basic graph manipulation exposed to python Reviewed By: ZolotukhinM Differential Revision: D10519720 fbshipit-source-id: 0f9a494d122289a3a9e23d4cff99ac0a21382ec6
Summary: Pull Request resolved: pytorch#13006 In Caffe2, Concat can have 2 outputs, the second being the shape of the first output. In ONNX, Concat has only 1 output. So when exporting, we need to add a `Shape` op over the first output and generate the second output from it. Differential Revision: D10517698 fbshipit-source-id: 38e974423e2506b16d37b49d51c27ad87b73e63a
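For illustration, a sketch of the exported pattern using the `onnx.helper` API (all tensor names hypothetical): a `Shape` node over Concat's single output reproduces Caffe2's second (shape) output.

```python
from onnx import helper

# Concat with one output, as ONNX requires.
concat = helper.make_node("Concat", inputs=["a", "b"], outputs=["cat_out"], axis=0)
# Shape over the first output stands in for Caffe2's second output.
shape = helper.make_node("Shape", inputs=["cat_out"], outputs=["split_info"])
```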
Differential Revision: D10412639 Original commit changeset: a4c523fda96b fbshipit-source-id: 973b6dd30b63b9a08069275278b0780b65067635
Differential Revision: D10438090 Original commit changeset: 6b4309b8a4b3 fbshipit-source-id: 5f6a28cf032e0be2544f0b33508148f4f49e10c5
Summary: Pull Request resolved: pytorch#12940 Dmytro was reading this code and requested that we rename the interface to something that made it more obvious that pooling was going on. Seems reasonable to me! Final name is a suggestion from Pieter. Reviewed By: dzhulgakov Differential Revision: D10492071 fbshipit-source-id: b1c2cac760f666968d58166be649dabfe1127c5e
Summary: Pull Request resolved: pytorch#12995 Reviewed By: Yangqing Differential Revision: D10513246 fbshipit-source-id: 0c6d52e09166d7e8a786c1a0e21685ec9c35b12a
Summary: Pull Request resolved: pytorch#13019 It just makes the semantic meaning of the int32_t a little bit clearer. Reviewed By: zou3519 Differential Revision: D10520295 fbshipit-source-id: 45b0bd1b6afddee17072b628d8e9b87d7c86e501
…h#13047) Summary: fix lint after new flake8 release added new style constraints Pull Request resolved: pytorch#13047 Differential Revision: D10527804 Pulled By: soumith fbshipit-source-id: 6f4d02662570b6339f69117b61037c8394b0bbd8
Summary: In order to support tensorboardX and other visualization tools, we need to make sure a non-empty scope is set on all nodes added by the JIT. This attempts to do this, but is still a WIP. This is a new version of pytorch#10749 Pull Request resolved: pytorch#12400 Reviewed By: ezyang Differential Revision: D10224380 Pulled By: orionr fbshipit-source-id: d1bccd0eee9ef7c4354112c6a39a5987bfac2994
Summary: Closes ROCm#2119. There was a small bug where output_size got sliced with `[-2:]` when we really meant `[2:]` (to remove the batch and channel dimensions). Added a new test for this. Pull Request resolved: pytorch#12952 Differential Revision: D10510678 Pulled By: zou3519 fbshipit-source-id: 4c04a5007fc6d002e1806d6fe981b43d33d6a4f2
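A small illustration of the off-by-slice with a hypothetical 5-D shape. For 4-D (N, C, H, W) input the two slices happen to agree, which is why the bug was easy to miss; for 5-D input they diverge.

```python
output_size = [8, 3, 4, 32, 32]  # hypothetical (N, C, D, H, W)
print(output_size[-2:])  # [32, 32]     wrong: the depth dim is lost
print(output_size[2:])   # [4, 32, 32]  right: batch and channel removed
```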
Summary: these are pretty spammy - unless we have a reason to keep them, let's not Pull Request resolved: pytorch#13017 Differential Revision: D10528295 Pulled By: anderspapitto fbshipit-source-id: 5514371a6e61e13ec070cc5517488523d42f2935
Summary: Pull Request resolved: pytorch#12883 Attempting to do this again; the last try broke OSS CI: D10421896. Reallocation of strides_ when there is no change in dim seems to have caused the error that broke the internal flow last time. This fixes that. We found a potential race condition in caffe2 counter ops that might be the cause; we will investigate. Reviewed By: ezyang Differential Revision: D10469960 fbshipit-source-id: 478186ff0d2f3dba1fbff6231db715322418d79c
Summary: Reopen of pytorch#11253 after fixing bug in index_select Pull Request resolved: pytorch#13001 Differential Revision: D10514987 Pulled By: SsnL fbshipit-source-id: 399a83a1d3246877a3523baf99aaf1ce8066f33f
Summary: This is used to patch our cmake cuda scripts - should be in the installation script. Pull Request resolved: pytorch#13013 Reviewed By: ir413 Differential Revision: D10519104 Pulled By: Yangqing fbshipit-source-id: 542049224ea41068f32d4c0f6399c7e8b684f764
Summary: We are beginning to use this class in a wider-reaching set of use cases. This PR refactors it so that we always access schema properties through methods. This will make adding extra information like alias information easier (i.e., we can add a version of `type()` that returns the type with alias information and another version that returns the type without it). Pull Request resolved: pytorch#12967 Differential Revision: D10502674 Pulled By: zdevito fbshipit-source-id: a88783ed8f20ab3be6460c12da95f9f940891c44
Summary: Pull Request resolved: pytorch#12994 Reviewed By: anderspapitto Differential Revision: D10515291 Pulled By: pjh5 fbshipit-source-id: 191054cdacff308b63e9063d22d62314398e4f88
Summary: Tested manually that this works. Fixes pytorch#12395; obviates pytorch#12774. Pull Request resolved: pytorch#13055 Differential Revision: D10559788 Pulled By: anderspapitto fbshipit-source-id: 5cd8bac6eff548280c8742f36a5e7f2748a24623
Summary: Pull Request resolved: pytorch#13064 Differential Revision: D10561008 Pulled By: yf225 fbshipit-source-id: c48364662efa82865a1bc1a7e2db3a9fb8af10d5
Summary: Pull Request resolved: pytorch#13059 Reviewed By: llyfacebook Differential Revision: D10560147 Pulled By: sf-wind fbshipit-source-id: c8f38b30c9acdf6ae494e56a5876fd4493696e5d
Summary: Pull Request resolved: pytorch#12969 Differential Revision: D10560824 Pulled By: ezyang fbshipit-source-id: 86c21149682db5ebfd9610df9e9845688a3db3b0
Summary: Pull Request resolved: pytorch#13014 Tensor method renaming using clangr Reviewed By: ezyang Differential Revision: D10467556 fbshipit-source-id: 7d7eaf5fc59bbb493c057d5b8bfdda03b140c97e
…ams for memcpy (pytorch#12954) Summary:
- Moved sync_reduction to C++
- Use a dedicated CUDA stream for memcpy (see the sketch below)
- Also use a dedicated CUDA stream for memcpy in queue_reduction

Added test as well. CI should cover both DDP and unittest. Pull Request resolved: pytorch#12954 Differential Revision: D10520069 Pulled By: teng-li fbshipit-source-id: 64348e4e43c15f9695a4c28b036c232587ecfb65
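A sketch of the dedicated-stream idea in the Python API (tensor names hypothetical; the PR implements this in C++): a copy issued on its own stream does not serialize behind compute queued on the default stream.

```python
import torch

copy_stream = torch.cuda.Stream()
src = torch.randn(1 << 20, device="cuda")
dst = torch.empty_like(src)

with torch.cuda.stream(copy_stream):
    # Runs concurrently with work on the default stream.
    dst.copy_(src, non_blocking=True)

# Make the default stream wait for the copy before consuming dst.
torch.cuda.current_stream().wait_stream(copy_stream)
```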
Summary: Pull Request resolved: pytorch#12953 Differential Revision: D10850274 Pulled By: anderspapitto fbshipit-source-id: 42296e6e49ad8c1845040e031eab95ddbaf58ae4
Summary: Pull Request resolved: pytorch#13094 Expose operator_def property Reviewed By: duc0 Differential Revision: D10847125 fbshipit-source-id: 67a066555b690715e1f5f04125fd446ab197f45a
Summary: Pull Request resolved: pytorch#13043 memset on nullptr is undefined behavior, and as a result filament_test is failing in the dev build. This diff makes the operator handle empty output properly, so we can bring that test back. I'm not sure whether it is even valid to call this op with input that would require an empty memset (empty batch?). Will leave this to ninghz and sunnieshang to decide. Reviewed By: xianjiec Differential Revision: D10525605 fbshipit-source-id: a911cdbd62fc3d948328981fd01cd205ec2ad99f
Summary: It's empty. Signed-off-by: Edward Z. Yang <ezyang@fb.com> Pull Request resolved: pytorch#13078 Differential Revision: D10843892 Pulled By: ezyang fbshipit-source-id: 39e6f73b3a8be3e7573c1af727b65da246d4515b
Summary: Signed-off-by: Edward Z. Yang <ezyang@fb.com> Pull Request resolved: pytorch#13074 Differential Revision: D10852728 Pulled By: ezyang fbshipit-source-id: 6b96c941f4655ba240adaa0678844efa2af81d06
…#12656) Summary: Pull Request resolved: pytorch#12656 I originally wanted to do this in two steps, but deleting the Storage-only constructor also changes the default numel state (which breaks tests), so it was easiest to do it all in one go.
- I still need a way to compute the correct TensorTypeId for all of the Caffe2 constructors; rather than hard-code it, I wrote a function, at::detail::computeTensorTypeId(), to do this calculation. Maybe this function could be used more widely, but for now it's used by Caffe2 only.
- Added a pile more TensorTypeIds for all of Caffe2's supported DeviceTypes.
- Because I still can't put an arbitrary TypeMeta in TensorOptions, the TensorTypeId() calculation doesn't respect dtype. For now this is not a problem, but it might block work to split non-POD dtypes into their own TensorTypeIds.
Reviewed By: li-roy Differential Revision: D10380678 fbshipit-source-id: 10c5d12020596fc9f27d5579adffad00513af363
Summary: Pull Request resolved: pytorch#13109 The "right" strategy of creating a socket, binding to an undefined port, closing the socket, and reusing the port it was bound to, was subject to a race condition. Another process could bind to that same port sooner than the tests would, causing an "Address already in use" failure when rank 0 would try and bind to that same port. The THD tests have been using a fixed port since forever. Time will tell if this fixes pytorch#12876. Differential Revision: D10850614 fbshipit-source-id: c19f12bb4916141187ee8ddb52880f5f418310dc
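For illustration, the racy pattern the tests moved away from (helper name hypothetical): the port is free when this function returns, but another process can bind it before rank 0 does.

```python
import socket

def find_free_port():
    # Bind to port 0 to let the OS pick a free port, then release it.
    # Between close() and the later bind by rank 0, any other process
    # may claim the port, causing "Address already in use".
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.bind(("127.0.0.1", 0))
    port = s.getsockname()[1]
    s.close()
    return port
```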
Summary: Codemod generated with clangr shard mode, 25 files per diff, for renaming dims() to sizes() Reviewed By: ezyang Differential Revision: D10848643 fbshipit-source-id: ac75833be8be9162e35b00dcd352f616bc7bbafe
Summary: This adds support for reductions like sum() and mul() to TensorIterator. Performance is similar to existing optimized code for CPU, and generally better than existing code for CUDA kernels. The templatized CUDA kernel requires fewer instantiations than the existing THCReduce/THCReduceAll code. For example, sum() previously generated 43 CUDA kernels, while it now requires only one (larger) CUDA kernel. I suspect this should reduce code size and compilation time, but I haven't measured it. Below are timings for sum() on [CPU](https://ark.intel.com/products/81908/Intel-Xeon-Processor-E5-2680-v3-30M-Cache-2_50-GHz) (12 threads and 1 thread) and CUDA with various tensor sizes.

CPU

| Reduction (dim) | Master | PR | Master (1 thread) | PR (1 thread) |
|----------------------|---------|---------|-------------------|---------------|
| 1024x1024 (all) | 22 us | 34 us | 136 us | 147 us |
| 1024x1024 (0) | 30 us | 28 us | 160 us | 160 us |
| 1024x1024 (1) | 25 us | 25 us | 171 us | 146 us |
| 1024x10x1024 (all) | 542 us | 550 us | 4.14 ms | 3.11 ms |
| 1024x10x1024 (0) | 658 us | 690 us | 6.80 ms | 5.93 ms |
| 1024x10x1024 (1) | 761 us | 757 us | 3.34 ms | 3.52 ms |
| 1024x10x1024 (2) | 538 us | 545 us | 3.73 ms | 3.04 ms |
| 1024x1024x1024 (all) | 72 ms | 71 ms | 364 ms | 357 ms |
| 1024x1024x1024 (0) | 94 ms | 90 ms | 935 ms | 927 ms |
| 1024x1024x1024 (1) | 80 ms | 86 ms | 881 ms | 688 ms |
| 1024x1024x1024 (2) | 71 ms | 71 ms | 456 ms | 354 ms |

CUDA

| Reduction (dim) | M40 base | M40 PR | P100 base | P100 PR |
|----------------------|----------|---------|-----------|---------|
| 1024x10x1024 (all) | 238 us | 182 us | 136 us | 97 us |
| 1024x10x1024 (0) | 166 us | 179 us | 105 us | 84 us |
| 1024x10x1024 (1) | 181 us | 182 us | 89 us | 91 us |
| 1024x10x1024 (2) | 180 us | 168 us | 88 us | 79 us |
| 1024x1024x1024 (all) | 17.5 ms | 16.4 ms | 8.23 ms | 7.48 ms |
| 1024x1024x1024 (0) | 27.2 ms | 28.6 ms | 7.63 ms | 7.38 ms |
| 1024x1024x1024 (1) | 16.5 ms | 16.3 ms | 7.66 ms | 7.40 ms |
| 1024x1024x1024 (2) | 17.8 ms | 16.4 ms | 8.37 ms | 7.31 ms |

Timings were generated with this script: https://gist.github.com/colesbury/d3238b266d8a9872fe6f68f77619b379 Pull Request resolved: pytorch#11908 Differential Revision: D10071760 Pulled By: colesbury fbshipit-source-id: 40e37a0e6803f1628b94cc5a52a10dfbb601f3d6
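A rough timing sketch in the spirit of the linked gist (the gist is the authoritative script for the numbers above; this helper is hypothetical):

```python
import time
import torch

def bench_sum(shape, dim=None, iters=50, device="cpu"):
    # Time sum() over the whole tensor or over one dimension.
    x = torch.randn(*shape, device=device)
    if device == "cuda":
        torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        y = x.sum() if dim is None else x.sum(dim)
    if device == "cuda":
        torch.cuda.synchronize()
    return (time.perf_counter() - start) / iters

print(bench_sum((1024, 1024)))             # all-dims reduction
print(bench_sum((1024, 10, 1024), dim=1))  # reduction over dim 1
```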
Summary: Adds the attribute name to the error message and fixes the corresponding test to actually run Pull Request resolved: pytorch#13072 Differential Revision: D10846622 Pulled By: driazati fbshipit-source-id: a7eee6320c28140c4937ede3d4e4685cfce08d84
Summary: std::memcpy has UB when either src or dest is NULL, even if the length is 0. This can and does happen when the input tensors are scalar tensors. This triggered UBSAN on pytorch#12824, but it is strange that it has not been triggered before. Pull Request resolved: pytorch#13121 Differential Revision: D10853113 Pulled By: zou3519 fbshipit-source-id: c4b4ad5e41de6f73dc755e0c25bc9947576a742d
Summary: Pull Request resolved: pytorch#12843 This adds a cuda implementation for the UpsampleBilinearOp and UpsampleBilinearGradientOp. The CUDA code is based off of the corresponding ResizeNearest operators but with bilinear interpolation logic taken from the CPU implementation. Reviewed By: houseroad Differential Revision: D10453776 fbshipit-source-id: b29ac330b72465974ddb27c0587bca590773fdec
Summary:
* Disable MIOpen convolution on double tensors
* MIOpen: set group count in convolution descriptor
* MIOpen: Honor Max Dim (ROCm 222)
* MIOpen: Batchnorm - Allow half/half and half/float, disallow double
* Limit MIOpen batchnorm to same-precision
* Fix maxdim check. (ROCm 246)
* Fix reversed logic in DISABLE_MIOPEN (ROCm 253)
* Export LANG/LC_ALL also for the test step.
* Make tensors contiguous before calling MIOpen batch norm
* Actually pass dilation to MIOpen.
* Do not use MIOpen if there is dilation and the group size is > 1. This is officially not supported currently.
* Fixes for miopenforward bias call
* Modified init conv descriptor param values and used same value for dilation
* MIOpen: disable transposed convolutions

For attention: bddppq ezyang Pull Request resolved: pytorch#13048 Differential Revision: D10785250 Pulled By: bddppq fbshipit-source-id: f9d9797de644652280d59308e5ea5cc07d177fd4
Summary: Pull Request resolved: pytorch#13096 Codemod generated with clangr shard mode, 25 files per diff, for renaming dims() to sizes() Reviewed By: ezyang Differential Revision: D10842875 fbshipit-source-id: 1784859735ed4d1bd5ccd7ca56e289498374a68f
Summary: Removes test_jit.cpp, which was supposed to have been deleted in pytorch#12030 I had to move zou3519's dynamic DAG tests into `test/cpp/jit/tests.h` too. No other changes to `test_jit.cpp` seem to have happened in the meantime. zdevito Pull Request resolved: pytorch#12988 Differential Revision: D10854320 Pulled By: goldsborough fbshipit-source-id: 7ab533e6e494e34a16ce39bbe62b1150e48fcb58
Summary: Pull Request resolved: pytorch#12713 Reviewed By: li-roy, dzhulgakov Differential Revision: D10404407 fbshipit-source-id: cbc6be2172af068c3fc96e1f6da0b04b6f29ad4b
Summary: Pull Request resolved: pytorch#13126 Signed-off-by: Edward Z. Yang <ezyang@fb.com> Differential Revision: D10454455 Pulled By: ezyang fbshipit-source-id: 7018a41b94e316305751f2f8ad2c2d049799f5d4
Summary: Pull Request resolved: pytorch#13065 - Open-source Caffe2 Int8 (quantized) operators Reviewed By: Yangqing Differential Revision: D10524381 fbshipit-source-id: 6daa153dc247572900c91e37262d033c368b382d
pytorch#13124) Summary: …size at dimension. Pull Request resolved: pytorch#13124 Reviewed By: ezyang Differential Revision: D10853167 Pulled By: gchanan fbshipit-source-id: 76eeb922304bf19243d9bc52da87f2be8d1700ae
akashveramd pushed a commit that referenced this pull request on Jun 13, 2025
…int (#293) Summary: The profiler currently maintains a counter locally, and that counter is not synchronized with the checkpointed train step. This PR fixes the issue.
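A toy sketch of the fix's shape (all names hypothetical; the real change lives in the profiler code): the step counter is seeded from the checkpoint instead of starting at zero, so profiling windows line up with the true global step after a restore.

```python
class TrainProfiler:
    """Hypothetical profiler whose counter resumes from a checkpoint."""

    def __init__(self, start_step: int = 0):
        # Seed from the checkpointed step rather than always from 0.
        self.step_num = start_step

    def step(self):
        self.step_num += 1

# e.g. after loading a checkpoint:
# profiler = TrainProfiler(start_step=checkpoint["step"])
```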
amd-sriram pushed a commit that referenced this pull request on Feb 26, 2026
Commit Messages:
- Create custom python operators for MixedFusedLayerNorm and MixedFusedRMSNorm. (#304)
- Add new apex module to jit load system (#294)
  * add code to add loader module for jit module
  * fix errors to create jit module adder - use correct file name to save code to
  * fix errors to create jit module adder - use correct class name of the builder and parameter to supply builder module name
  * fix errors to create jit module loader
  * add description about jit module script to add jit loader for a jit module with builder provided
  * add description about jit module script to add jit loader for a jit module with builder provided
  * add attributes and methods to override when creating a jit module builder
  * add extra new lines
  * update jit module to take the builder file name and extract module name from the builder, update missing entries in the table in readme for adding new module in jit
  * refine the description about module to jit
  * add description about jit
  * add description about jit
  * add code to create a builder based on user inputs
  * change the example from fused_dense to swiglu
  * allow user to skip sources list
  * change description of cxx and nvcc flags, add description of methods and fields in the initial builder code created by script
- add details of fused_conv_bias_relu in table of modules and fix error of maximum depth reached (#297)
  * add details of fused_conv_bias_relu in table of modules and build flag
  * solve the maximum depth error.
- Port fused_conv_bias_relu to ROCm (#295)
  * Add support for conv bias relu
  * Fix compilation failure
  * omit check_cudnn_version_and_warn check (no cuDNN on ROCm)
  * Flatten bias for PyTorch from 4D to 1D
  * Implement fusion of Conv with ReLU with MIOpen
  * Fix compilation issues
  * Fix crash for ConvBias
  * Fix merge issues
  * Add support for ConvBias and ConvBiasMaskRelu
  * Fix segmentation fault on bwd for ConvBias
  * add code for fusing conv+bias for retinanet, add test case for retinanet
  * Fix torch warning
  * Fix warnings in a unit test file as well
  * add builder and loader for fused_conv_bias_relu module
  Co-authored-by: Sergey Solovyev <sergey.solovyev@amd.com> Co-authored-by: Mikko Tukiainen <mikko.tukiainen@amd.com>
- Bump version from 1.10.0 to 1.11.0 (#293)
- [REDUX] Refactor Apex build process to use the PyTorch JIT extension flow (#291) (see the JIT-load sketch below)
  * Created initial code for loading fused_dense module dynamically instead of building it. Code uses accelerator and op_builder modules from deepspeed code.
  * add apex/git_version_info_installed.py to gitignore as it is dynamically created by setup.py for the build process
  * add code for building fused rope dynamically
  * add code for building fused bias swiglu dynamically
  * fix the code so that fused rope and fused softmax are not compiled in jit mode, add csrc back to setup.py since it is not copied to apex wheel
  * load the jit modules inside and this prevents them from building when building the wheel
  * convert syncbn module to jit
  * fix the unnecessary compile of syncbn module in wheel building due to imports in python module
  * add fused layer norm module to jit build
  * make focal loss module as jit module
  * make focal loss module as jit module
  * make xentropy module as jit module
  * make bpn module as jit module
  * add code to build individual extensions without JIT
  * clean up the flags for the modules based on apex/setup.py
  * add function to get the backward_pass_guard_args in CudaOpBuilder and make MLP JIT compile
  * add fused weight gradient mlp to jit compile
  * move fused_weight_gradient_mlp_cuda load inside so that it is not compiled during apex installation
  * make fused index mul 2d jit compile and add aten atomic header flag method to CUDAOpBuilder to support its jit compile
  * make fast multihead attention as jit module, add generator_args to CudaOpBuilder to support jit of this module
  * make transducer loss and transducer joint modules as jit modules, add nvcc_threads_args method in CUDAOpBuilder to support these jit modules
  * remove extra method - installed_cuda_version from CUDAOpBuilder
  * add apex_C module to jit compile, add py-cpuinfo to requirements.txt as it is needed for TorchCPUOpBuilder
  * make nccl allocator as a jit compile module, add nccl_args method to CUDAOpBuilder to support this
  * make amp_C as a jit module
  * add a few uses of amp_C jit module
  * add a few uses of amp_C jit module
  * make fused adam as a jit module
  * add a few uses of amp_C jit module
  * fix the issue with fused adam jit module
  * make fused lamb as jit module
  * make distributed adam as jit module
  * make distributed lamb as jit module
  * add remaining amp_C uses with jit loader
  * add remaining usage of apexC jit module
  * make nccl p2p module as jit compile
  * make peer memory module as jit compile
  * add code to check for minimum nccl version to compile nccl allocator module
  * add provision to provide APEX_CPP_OPS=1 and APEX_CUDA_OPS=1 as replacement for --cpp_ext --cuda_ext command line arguments for building specific extensions in apex, save these settings for later use
  * check for minimum torch version for nccl allocator; check if the module is compatible, otherwise remove it from the installed ops list
  * add build as a dependency to support wheel building
  * Replace is_compatible to check for installation conditions with is_supported, because there is an issue with loading nccl allocator
  * Similar to pytorch, we create a make command to install aiter that the user can use. There will be no building of aiter in setup.py
  * update extension import test so that it considers jit compile extensions
  * clean up MultiTensorApply usages so that amp_C is not built in jit compile mode
  * Adding missing modules from deepspeed repo. Remove extra code in setup.py. Use is_compatible instead of is_supported
  * change name of apex_C module
  * change the name of cpp and cuda build flags, remove APEX_BUILD_OPS, clean up the logic to build specific modules
  * add missing files used in cpu accelerator
  * add make clean command to handle deleting torch extensions installed for jit modules, fix the cpu builder import error
  * remove unused code in setup.py, fix the code to build for cpu mode
  * Removing unused code
  * remove accelerator package and refactor the used code into op_builder.all_ops BuilderUtils class
  * remove accelerator package usages
  * revert code that was removed by mistake
  * Cleaning up the setup file and renaming functions and variables to more readable names.
  * Fix the nccl version so that the nccl_allocator.so file can be loaded properly. The setup() call has an argument called py_modules which copies the python class into the site-packages folder. The python modules in the compatibility folder lazily load the builder classes. First these files are copied into the parent folder so that they are copied into site-packages and the kernel can be loaded into python; then these temporary files are deleted.
  * Restore the original importing of the extension code.
  * renamed compatibility/scaled_masked_softmax_cuda.py, added some extra tests in the contrib test runner
  * Added instructions for JIT load and changes in installation options
  * Restructuring the README
  * Added instructions for building the wheel
  * replaced TorchCPUBuilder with CPUBuilder, added a main method in the contrib test runner
  * create a script to build different jit conditions for running different tests
  * add script to run tests with different jit builds, add instructions to run jit build and tests in readme, add other tests in readme
  * fix the issues with running the tests - improper paths, counting .so files in the apex folder
  * add mad internal scripts
  * remove print statement
  * remove testing section from readme
  * change location of result file
  * remove multiple results files from models.json
  * add platform-specific description to wheel name even if no CppExtension or CUDAExtension is built with the JIT load approach
  * add ninja and wheel to requirements to be installed
  * Update release notes in README
  * Exclude compatibility folder while installing apex
  * Update README.md
  * Update README.md
  * Update README.md
  * Adding modification note to the original copyright
  * fix the issue with symbolic links for op_builder, csrc when the apex repo is cloned in the docker
  * assign the symbolically linked folders into a variable and then loop across the list entries
  * remove unnecessary tabs
  Co-authored-by: skishore <sriramkumar.kishorekumar@amd.com> Co-authored-by: sriram <sriram.kumar@silo.ai>
- Pow implementation is very expensive on AMD CDNA4. (#292) This commit changes it to the mathematically equivalent exp(y*log(x)) for x > 0; however, a 1-2 ULP precision loss is possible. (See the sketch below.)
- Update README.md (#289)
- Update version to 1.10.0 (#282)
- add code to read BUILD_VERSION env variable, so that it is used instead of version.txt when creating a wheel (#278)

PRs:
- ROCm/apex#304

Fixes:
- https://example.com/issue-292
- https://example.com/issue-278
- https://example.com/issue-295
- https://example.com/issue-294
- https://example.com/issue-289
- https://example.com/issue-304
- https://example.com/issue-291
- https://example.com/issue-282
- https://example.com/issue-293
- https://example.com/issue-297
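The #292 rewrite, shown as a minimal sketch (function name hypothetical): for x > 0, x**y equals exp(y*log(x)), at the cost of a small precision loss.

```python
import math

def pow_via_exp_log(x: float, y: float) -> float:
    # Equivalent to x ** y for x > 0; expect roughly 1-2 ULP of error.
    return math.exp(y * math.log(x))

assert abs(pow_via_exp_log(2.0, 10.0) - 1024.0) < 1e-9
```

The JIT-load flow that the #291 refactor builds on is PyTorch's standard just-in-time extension loader, `torch.utils.cpp_extension.load`. A minimal sketch follows; the source file names are hypothetical, and apex's actual op_builder classes wrap this with extra flag handling.

```python
from torch.utils.cpp_extension import load

# Compiles and loads the extension on first use instead of at install time.
fused_dense = load(
    name="fused_dense",
    sources=["fused_dense.cpp", "fused_dense_cuda.cu"],
    extra_cflags=["-O3"],
    verbose=True,
)
```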
amd-sriram pushed a commit that referenced this pull request on Mar 2, 2026
Commit Messages:
- Update README with release notes for version 1.11.0 (#310) Added release notes for version 1.11.0, including new extensions and upgrades. Updated previous release notes for clarity.
- The remaining commit messages are the same batch listed in the Feb 26, 2026 push above (#304, #294, #297, #295, #293, #291, #292, #289, #282, #278).

PRs:
- ROCm/apex#310

Fixes:
- https://example.com/issue-297
- https://example.com/issue-282
- https://example.com/issue-289
- https://example.com/issue-293
- https://example.com/issue-291
- https://example.com/issue-294
- https://example.com/issue-292
- https://example.com/issue-310
- https://example.com/issue-304
- https://example.com/issue-278
- https://example.com/issue-295