
Integrate from upstream#293

Merged
iotamudelta merged 91 commits into ROCm:master from iotamudelta:ifu
Oct 26, 2018

Conversation

@iotamudelta

No description provided.

smessmer and others added 30 commits October 23, 2018 15:30
Summary:
Pull Request resolved: pytorch#12878

Python 3.6 headers define their own ssize_t, which clashes with our definition.
Luckily, they also define a `HAVE_SSIZE_T` macro we can use to check for this case.

Reviewed By: ezyang

Differential Revision: D10467239

fbshipit-source-id: 661675ad1e30a6ca26d6790eaa75657ef6bf37c2
…icBlock, add move constructor.

Summary:
We cannot use copying because it loses recorded callbacks; after a copy,
tracked values are no longer tracked.

Reviewed By: bwasti, duc0

Differential Revision: D10510057

fbshipit-source-id: b64fdef3fb28fc26fe55eba41f4b5007ba6894de
Summary:
The topological index shuffled arguments around; this updates the expect files.
Pull Request resolved: pytorch#13005

Differential Revision: D10517246

Pulled By: michaelsuo

fbshipit-source-id: 8f95e4e4ca8ff51da0507f9b0eb838c23ddaa821
…ixed reverted bug) (pytorch#12848)

Summary:
Pull Request resolved: pytorch#12848

Updated all non-test uses of protobuf::MessageLite::SerializeAsString to call
SerializeAsString_EnforceCheck so that the return value is checked and can
throw an exception if failing.

Most of the affected code was called from classes derived from  BlobSerializeBase.
Didn't touch most tests and ENFORCE calls because they usually do checks
anyway.

Original commit changeset: c0760e73ecc7

Reviewed By: dzhulgakov

Differential Revision: D10453456

fbshipit-source-id: d2f2b7b4578e721924354149f08f627c7e3bf070
Summary:
TSIA - we want to deprecate numba in fbcode when moving to new compiler tiers.

Converted the old test to a non-numba regular python op test.

Reviewed By: xw285cornell

Differential Revision: D10519910

fbshipit-source-id: 0e9188a6d0fc159100f0db704b106fbfde3c5833
Summary:
Add support for indexing tuples with constant integers by creating a new prim::TupleIndex operator.
Pull Request resolved: pytorch#11492

Differential Revision: D9811996

Pulled By: eellison

fbshipit-source-id: a458c2522b3c81476252d920e27a8d6c7b9a036b
Summary: Cleaning up the interface for nomnigraph in C++ world

Reviewed By: duc0

Differential Revision: D10438090

fbshipit-source-id: 6b4309b8a4b3730f3309edf0047d4006a001895b
Summary:
Basic ops.def update and converter.cc updates

This is the standard way to ingest networks into nomnigraph

Reviewed By: duc0

Differential Revision: D10412639

fbshipit-source-id: a4c523fda96bbe0e31de0d9fcf795ae9c7377c90
Summary:
`inspect.stack()` calls are slow since they access a bunch of extra info about the frame. This PR instead uses `inspect.currentframe()` and goes up the stack until it reaches the correct frame. [Context](stackoverflow.com/questions/17407119/python-inspect-stack-is-slow)
Pull Request resolved: pytorch#12859

Differential Revision: D10509912

Pulled By: driazati

fbshipit-source-id: b85325adf1b3c85a1a3a82e96e567b8be498531b
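As a hedged illustration of the technique (not the PR's actual code), `inspect.currentframe()` plus `f_back` traversal avoids the per-frame `FrameInfo` construction that makes `inspect.stack()` slow:

```python
import inspect

def caller_frame(depth=1):
    """Return the frame `depth` levels above the caller.

    inspect.stack() builds a FrameInfo (filename, line number, source
    context) for every frame on the stack; walking f_back touches only
    the frames we actually need.
    """
    frame = inspect.currentframe().f_back  # skip this function's own frame
    for _ in range(depth):
        if frame.f_back is None:
            break
        frame = frame.f_back
    return frame

def inner():
    return caller_frame().f_code.co_name

def outer():
    return inner()
```

Calling `outer()` here returns the name of `inner`'s caller, i.e. `"outer"`, without ever materializing the full stack.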
Summary:
Pull Request resolved: pytorch#13033

Basic graph manipulation exposed to python

Reviewed By: ZolotukhinM

Differential Revision: D10519720

fbshipit-source-id: 0f9a494d122289a3a9e23d4cff99ac0a21382ec6
Summary:
Pull Request resolved: pytorch#13006

In Caffe2, Concat can have 2 outputs. The second being the output shape of the 1st output. In ONNX, Concat only has 1 output. So when we do the exporting, we need to add a `Shape` to the first output and generate the second output from it.

Differential Revision: D10517698

fbshipit-source-id: 38e974423e2506b16d37b49d51c27ad87b73e63a
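A minimal sketch of that export rule, modeling ONNX nodes as plain dicts rather than real `onnx` protos (names like `Y` and `Y_shape` are hypothetical):

```python
def export_concat(input_names, axis):
    """Map a Caffe2 Concat with two outputs onto two ONNX-style nodes.

    ONNX Concat produces only the concatenated tensor, so the second
    Caffe2 output (the shape of the first) comes from an added Shape node.
    """
    concat = {"op_type": "Concat", "inputs": list(input_names),
              "outputs": ["Y"], "attrs": {"axis": axis}}
    shape = {"op_type": "Shape", "inputs": ["Y"], "outputs": ["Y_shape"]}
    return [concat, shape]

nodes = export_concat(["X0", "X1"], axis=0)
```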
Differential Revision:
D10412639

Original commit changeset: a4c523fda96b

fbshipit-source-id: 973b6dd30b63b9a08069275278b0780b65067635
Differential Revision:
D10438090

Original commit changeset: 6b4309b8a4b3

fbshipit-source-id: 5f6a28cf032e0be2544f0b33508148f4f49e10c5
Summary:
Pull Request resolved: pytorch#12940

Dmytro was reading this code and requested that we rename the interface
to something that made it more obvious that pooling was going on.
Seems reasonable to me! Final name is a suggestion from Pieter.

Reviewed By: dzhulgakov

Differential Revision: D10492071

fbshipit-source-id: b1c2cac760f666968d58166be649dabfe1127c5e
Summary: Pull Request resolved: pytorch#12995

Reviewed By: Yangqing

Differential Revision: D10513246

fbshipit-source-id: 0c6d52e09166d7e8a786c1a0e21685ec9c35b12a
Summary:
Pull Request resolved: pytorch#13019

It just makes the semantic meaning of the int32_t a little
bit clearer.

Reviewed By: zou3519

Differential Revision: D10520295

fbshipit-source-id: 45b0bd1b6afddee17072b628d8e9b87d7c86e501
…h#13047)

Summary:
fix lint after new flake8 release added new style constraints
Pull Request resolved: pytorch#13047

Differential Revision: D10527804

Pulled By: soumith

fbshipit-source-id: 6f4d02662570b6339f69117b61037c8394b0bbd8
Summary:
In order to support tensorboardX and other visualization tools, we need to make sure a non-empty scope is set on all nodes added by the JIT. This PR attempts to do so but is still a WIP.

This is a new version of pytorch#10749
Pull Request resolved: pytorch#12400

Reviewed By: ezyang

Differential Revision: D10224380

Pulled By: orionr

fbshipit-source-id: d1bccd0eee9ef7c4354112c6a39a5987bfac2994
Summary:
Closes ROCm#2119.

There was a small bug where the output_size got sliced with `[-2:]`
where we really meant to slice it as `[2:]` (to remove the batch and
channel dimensions).

Added a new test for this.
Pull Request resolved: pytorch#12952

Differential Revision: D10510678

Pulled By: zou3519

fbshipit-source-id: 4c04a5007fc6d002e1806d6fe981b43d33d6a4f2
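The bug is easy to reproduce in isolation (hypothetical sizes; the real code slices an `output_size` list): on 4-D inputs `[-2:]` and `[2:]` coincide, which is how the error hid, but they diverge as soon as there is an extra spatial dimension:

```python
output_size = [8, 16, 4, 32, 32]  # [N, C, D, H, W], e.g. a volumetric op

spatial_wrong = output_size[-2:]  # keeps only the last two dims
spatial_fixed = output_size[2:]   # drops batch (N) and channel (C)

# On 4-D inputs the two slices agree, masking the bug:
four_d = [8, 16, 32, 32]
assert four_d[-2:] == four_d[2:]
```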
Summary:
these are pretty spammy - unless we have a reason to keep them, let's not
Pull Request resolved: pytorch#13017

Differential Revision: D10528295

Pulled By: anderspapitto

fbshipit-source-id: 5514371a6e61e13ec070cc5517488523d42f2935
Summary:
Pull Request resolved: pytorch#12883

Attempting to do this again. last try broke oss ci: D10421896

Reallocation of strides_ when there is no change in dim seems to cause the error that broke internal flow last time. This fixes that. We also found a potential race condition in caffe2 counter ops that might be the cause; we will investigate that.

Reviewed By: ezyang

Differential Revision: D10469960

fbshipit-source-id: 478186ff0d2f3dba1fbff6231db715322418d79c
Summary:
Reopen of pytorch#11253 after fixing bug in index_select
Pull Request resolved: pytorch#13001

Differential Revision: D10514987

Pulled By: SsnL

fbshipit-source-id: 399a83a1d3246877a3523baf99aaf1ce8066f33f
Summary:
This is used to patch our cmake cuda scripts - should be in the installation script.
Pull Request resolved: pytorch#13013

Reviewed By: ir413

Differential Revision: D10519104

Pulled By: Yangqing

fbshipit-source-id: 542049224ea41068f32d4c0f6399c7e8b684f764
Summary:
We are beginning to use this class in a wider set of use-cases. This PR refactors it so that we always access schema properties through methods. This will make it easier to add extra information like alias information (i.e. we can add a version of `type()` that returns the type with alias information and another that returns it without).
Pull Request resolved: pytorch#12967

Differential Revision: D10502674

Pulled By: zdevito

fbshipit-source-id: a88783ed8f20ab3be6460c12da95f9f940891c44
Summary: Pull Request resolved: pytorch#12994

Reviewed By: anderspapitto

Differential Revision: D10515291

Pulled By: pjh5

fbshipit-source-id: 191054cdacff308b63e9063d22d62314398e4f88
Summary:
tested manually that this works

fixes pytorch#12395
obviates pytorch#12774
Pull Request resolved: pytorch#13055

Differential Revision: D10559788

Pulled By: anderspapitto

fbshipit-source-id: 5cd8bac6eff548280c8742f36a5e7f2748a24623
Summary: Pull Request resolved: pytorch#13064

Differential Revision: D10561008

Pulled By: yf225

fbshipit-source-id: c48364662efa82865a1bc1a7e2db3a9fb8af10d5
Summary: Pull Request resolved: pytorch#13059

Reviewed By: llyfacebook

Differential Revision: D10560147

Pulled By: sf-wind

fbshipit-source-id: c8f38b30c9acdf6ae494e56a5876fd4493696e5d
Summary: Pull Request resolved: pytorch#12969

Differential Revision: D10560824

Pulled By: ezyang

fbshipit-source-id: 86c21149682db5ebfd9610df9e9845688a3db3b0
Summary:
Pull Request resolved: pytorch#13014

Tensor method renaming using clangr

Reviewed By: ezyang

Differential Revision: D10467556

fbshipit-source-id: 7d7eaf5fc59bbb493c057d5b8bfdda03b140c97e
teng-li and others added 26 commits October 24, 2018 21:37
…ams for memcpy (pytorch#12954)

Summary:
- Moved sync_reduction to C++
- Use a dedicated CUDA stream for memcpy
- Also use a dedicated CUDA stream for memcpy in queue_reduction

Added test as well.

CI should cover both DDP and unittest
Pull Request resolved: pytorch#12954

Differential Revision: D10520069

Pulled By: teng-li

fbshipit-source-id: 64348e4e43c15f9695a4c28b036c232587ecfb65
Summary: Pull Request resolved: pytorch#12953

Differential Revision: D10850274

Pulled By: anderspapitto

fbshipit-source-id: 42296e6e49ad8c1845040e031eab95ddbaf58ae4
Summary:
Pull Request resolved: pytorch#13094

Expose operator_def property

Reviewed By: duc0

Differential Revision: D10847125

fbshipit-source-id: 67a066555b690715e1f5f04125fd446ab197f45a
Summary:
Pull Request resolved: pytorch#13043

memset on nullptr is undefined-behavior and as a result filament_test is failing in dev build. This diff is making operator to handle empty output properly, so we can return that test back.

I'm not sure whether it is even valid to call this op with input that would require an empty memset (empty batch?). Will leave this to ninghz and sunnieshang to decide.

Reviewed By: xianjiec

Differential Revision: D10525605

fbshipit-source-id: a911cdbd62fc3d948328981fd01cd205ec2ad99f
Summary:
It's empty.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: pytorch#13078

Differential Revision: D10843892

Pulled By: ezyang

fbshipit-source-id: 39e6f73b3a8be3e7573c1af727b65da246d4515b
Summary:
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: pytorch#13074

Differential Revision: D10852728

Pulled By: ezyang

fbshipit-source-id: 6b96c941f4655ba240adaa0678844efa2af81d06
…#12656)

Summary:
Pull Request resolved: pytorch#12656

I originally wanted to do this in two steps, but deleting the Storage-only
constructor also changes the default numel state (which breaks tests),
so it's easiest to do it all in one go.

- I still need a way to compute the correct TensorTypeId for all of the
  Caffe2 constructors; rather than hard-code it, I wrote a function
  in at::detail::computeTensorTypeId() to do this calculation.  Maybe
  this function could be used more widely, but for now, it's used
  by Caffe2 only.
- Added a pile more TensorTypeId for all of Caffe2's supported DeviceTypes
- Because I still can't put arbitrary TypeMeta in TensorOptions, the
  TensorTypeId() calculation doesn't respect dtype.  For now, this is
  not a problem, but this might block work to split non-POD dtypes
  into their own TensorTypeId.

Reviewed By: li-roy

Differential Revision: D10380678

fbshipit-source-id: 10c5d12020596fc9f27d5579adffad00513af363
Summary:
Pull Request resolved: pytorch#13109

The "right" strategy of creating a socket, binding to an undefined port, closing the socket, and reusing the port it was bound to, was subject to a race condition. Another process could bind to that same port sooner than the tests would, causing an "Address already in use" failure when rank 0 would try and bind to that same port. The THD tests have been using a fixed port since forever. Time will tell if this fixes pytorch#12876.

Differential Revision: D10850614

fbshipit-source-id: c19f12bb4916141187ee8ddb52880f5f418310dc
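The racy pattern being removed can be sketched like this (a minimal reproduction, not the test code itself): the port is only reserved while the probe socket holds it, so between `close()` and the later `bind()` any other process may grab it:

```python
import socket

def find_free_port():
    # Racy: the port is guaranteed free only while this socket holds it.
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.bind(("127.0.0.1", 0))  # kernel picks an unused port
    port = s.getsockname()[1]
    s.close()                 # window opens: anyone may bind this port now
    return port

port = find_free_port()
# Rank 0 later does this, and can fail with "Address already in use"
# if another process bound the port during the window above.
server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(("127.0.0.1", port))
server.close()
```

Using a fixed, reserved port (as the THD tests do) removes the window entirely.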
Summary: Codemod generated with clangr shard mode, 25 files per diff, for renaming dims() to sizes()

Reviewed By: ezyang

Differential Revision: D10848643

fbshipit-source-id: ac75833be8be9162e35b00dcd352f616bc7bbafe
Summary:
This adds support for reductions like sum() and mul() to TensorIterator.
Performance is similar to existing optimized code for CPU, and generally
better than existing code for CUDA kernels.

The templatized CUDA kernel requires fewer instantiations than the
existing THCReduce/THCReduceAll code. For example, sum() previously
generated 43 CUDA kernels, while it now requires only one (larger)
CUDA kernel. I suspect this should reduce code-size and
compilation time, but I haven't measured it.

Below are timings for sum() on [CPU](https://ark.intel.com/products/81908/Intel-Xeon-Processor-E5-2680-v3-30M-Cache-2_50-GHz) (12 threads and 1 thread) and CUDA with various tensor sizes.

CPU

| Reduction (dim)      | Master  | PR      | Master (1 thread) | PR (1 thread) |
|----------------------|---------|---------|-------------------|---------------|
| 1024x1024 (all)      | 22 us   | 34 us   | 136 us            | 147 us        |
| 1024x1024 (0)        | 30 us   | 28 us   | 160 us            | 160 us        |
| 1024x1024 (1)        | 25 us   | 25 us   | 171 us            | 146 us        |
| 1024x10x1024 (all)   | 542 us  | 550 us  | 4.14 ms           | 3.11 ms       |
| 1024x10x1024 (0)     | 658 us  | 690 us  | 6.80 ms           | 5.93 ms       |
| 1024x10x1024 (1)     | 761 us  | 757 us  | 3.34 ms           | 3.52 ms       |
| 1024x10x1024 (2)     | 538 us  | 545 us  | 3.73 ms           | 3.04 ms       |
| 1024x1024x1024 (all) | 72 ms   | 71 ms   | 364 ms            | 357 ms        |
| 1024x1024x1024 (0)   | 94 ms   | 90 ms   | 935 ms            | 927 ms        |
| 1024x1024x1024 (1)   | 80 ms   | 86 ms   | 881 ms            | 688 ms        |
| 1024x1024x1024 (2)   | 71 ms   | 71 ms   | 456 ms            | 354 ms        |

CUDA

| Reduction (dim)      | M40 base | M40 PR  | P100 base | P100 PR   |
|----------------------|----------|---------|-----------|-----------|
| 1024x10x1024 (all)   | 238 us   | 182 us  | 136 us    | 97 us     |
| 1024x10x1024 (0)     | 166 us   | 179 us  | 105 us    | 84 us     |
| 1024x10x1024 (1)     | 181 us   | 182 us  | 89 us     | 91 us     |
| 1024x10x1024 (2)     | 180 us   | 168 us  | 88 us     | 79 us     |
| 1024x1024x1024 (all) | 17.5 ms  | 16.4 ms | 8.23 ms   | 7.48 ms   |
| 1024x1024x1024 (0)   | 27.2 ms  | 28.6 ms | 7.63 ms   | 7.38 ms   |
| 1024x1024x1024 (1)   | 16.5 ms  | 16.3 ms | 7.66 ms   | 7.40 ms   |
| 1024x1024x1024 (2)   | 17.8 ms  | 16.4 ms | 8.37 ms   | 7.31 ms   |

Timings were generated with this script:
https://gist.github.com/colesbury/d3238b266d8a9872fe6f68f77619b379
Pull Request resolved: pytorch#11908

Differential Revision: D10071760

Pulled By: colesbury

fbshipit-source-id: 40e37a0e6803f1628b94cc5a52a10dfbb601f3d6
Summary:
Adds the attribute name to the error message and fixes the corresponding
test to actually run
Pull Request resolved: pytorch#13072

Differential Revision: D10846622

Pulled By: driazati

fbshipit-source-id: a7eee6320c28140c4937ede3d4e4685cfce08d84
Summary:
std::memcpy has UB when either src or dest is NULL, even if length
is 0. This can and does happen when the input tensors are scalar tensors.

This triggered UBSAN on pytorch#12824 but it is strange that it has not
been triggered before.
Pull Request resolved: pytorch#13121

Differential Revision: D10853113

Pulled By: zou3519

fbshipit-source-id: c4b4ad5e41de6f73dc755e0c25bc9947576a742d
Summary:
Pull Request resolved: pytorch#12843

This adds a cuda implementation for the UpsampleBilinearOp and UpsampleBilinearGradientOp.

The CUDA code is based off of the corresponding ResizeNearest operators but with bilinear interpolation logic taken from the CPU implementation.

Reviewed By: houseroad

Differential Revision: D10453776

fbshipit-source-id: b29ac330b72465974ddb27c0587bca590773fdec
Summary:
* Disable MIOpen convolution on double tensors
* MIOpen: set group count in convolution descriptor
* MIOpen: Honor Max Dim (ROCm 222)
* MIOpen: Batchnorm - Allow half/half and half/float, disallow double
* Limit MIOpen batchnorm to same-precision
* Fix maxdim check. (ROCm 246)
* Fix reversed logic in DISABLE_MIOPEN (ROCm 253)
* Export LANG/LC_ALL also for the test step.
* Make tensors contiguous before calling MIOpen batch norm
* Actually pass dilation to MIOpen.
* Do not use MIOpen if there is dilation and the group size is > 1. - This is officially not supported currently.
* Fixes for miopenforward bias call
* Modified init conv descriptor param values and used same value for dilation
* MIOpen: disable transposed convolutions

For attention: bddppq ezyang
Pull Request resolved: pytorch#13048

Differential Revision: D10785250

Pulled By: bddppq

fbshipit-source-id: f9d9797de644652280d59308e5ea5cc07d177fd4
Summary:
Pull Request resolved: pytorch#13096

Codemod generated with clangr shard mode, 25 files per diff, for renaming dims() to sizes()

Reviewed By: ezyang

Differential Revision: D10842875

fbshipit-source-id: 1784859735ed4d1bd5ccd7ca56e289498374a68f
Summary:
Removes test_jit.cpp, which was supposed to have been deleted in pytorch#12030

I had to move zou3519's dynamic DAG tests into `test/cpp/jit/tests.h` too. No other changes to `test_jit.cpp` seem to have happened in the meantime.

zdevito
Pull Request resolved: pytorch#12988

Differential Revision: D10854320

Pulled By: goldsborough

fbshipit-source-id: 7ab533e6e494e34a16ce39bbe62b1150e48fcb58
Summary: Pull Request resolved: pytorch#12713

Reviewed By: li-roy, dzhulgakov

Differential Revision: D10404407

fbshipit-source-id: cbc6be2172af068c3fc96e1f6da0b04b6f29ad4b
Summary:
Pull Request resolved: pytorch#13126

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Differential Revision: D10454455

Pulled By: ezyang

fbshipit-source-id: 7018a41b94e316305751f2f8ad2c2d049799f5d4
Summary:
Pull Request resolved: pytorch#13065

- Open-source Caffe2 Int8 (quantized) operators

Reviewed By: Yangqing

Differential Revision: D10524381

fbshipit-source-id: 6daa153dc247572900c91e37262d033c368b382d
pytorch#13124)

Summary:
…size at dimension.
Pull Request resolved: pytorch#13124

Reviewed By: ezyang

Differential Revision: D10853167

Pulled By: gchanan

fbshipit-source-id: 76eeb922304bf19243d9bc52da87f2be8d1700ae
iotamudelta merged commit 9bdd833 into ROCm:master on Oct 26, 2018
akashveramd pushed a commit that referenced this pull request Jun 13, 2025
…int (#293)

Summary:
The profiler currently maintains a counter locally and that counter is
not synchronized with the checkpointed train step. This PR fixes the
issue.
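A toy model of the desync described above (all names hypothetical): unless the profiler's local counter is overwritten when a checkpoint is restored, it drifts away from the true train step:

```python
class Profiler:
    """Toy model: the local step counter must follow the checkpointed step."""

    def __init__(self):
        self.step_count = 0

    def step(self):
        self.step_count += 1

    def load_checkpoint(self, ckpt):
        # Without this assignment the local counter keeps its pre-restore
        # value and no longer matches the resumed train step.
        self.step_count = ckpt["step"]

p = Profiler()
for _ in range(3):
    p.step()            # local counter reaches 3
p.load_checkpoint({"step": 10})  # resume training from step 10
p.step()                # next step is 11, in sync with the train loop
```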
amd-sriram pushed a commit that referenced this pull request Feb 26, 2026
Commit Messages:
- Create custom python operators for MixedFusedLayerNorm and MixedFusedRMSNorm. (#304)
- Add new apex module to jit load system (#294)

* add code to add loader module for jit module

* fix errors to create jit module adder - use correct file name to save code to

* fix errors to create jit module adder - use correct class name of the builder and parameter to supply builder module name

* fix errors to create jit module loader

* add description about jit module script to add jit loader for a jit module with builder provided

* add description about jit module script to add jit loader for a jit module with builder provided

* add attributes and methods to override when creating a jit module builder

* add extra new lines

* update jit module to take the builder file name and extract module name from the builder, update missing entries in the table in readme for adding new module in jit

* refine the description about module to jit

* add description about jit

* add description about jit

* add code to create a builder based on user inputs

* change the example from fused_dense to swiglu

* allow user to skip sources list

* change description of cxx and nvcc flags, add description of methods and fields in the initial builder code created by script
- add details of fused_conv_bias_relu in table of modules and fix error of maximum depth reached (#297)

* add details of fused_conv_bias_relu in table of modules and build flag

* solve the maximum depth error.
- Port fused_conv_bias_relu to ROCm (#295)

* Add support for conv bias relu

* Fix compilation failure

* omit check_cudnn_version_and_warn check (no cuDNN on ROCm)

* Flatten bias for PyTorch from 4D to 1D

* Implement fusion of Conv with ReLU with MIOpen

* Fix compilation issues

* Fix crash for ConvBias

* Fix merge issues

* Add support for ConvBias and ConvBiasMaskRelu

* Fix segmentation fault on bwd for ConvBias

* add code for fusing conv+bias for retinanet, add test case for retinanet

* Fix torch warning

* Fix warnings in a unit test file as well

* add builder and loader for fused_conv_bias_relu module

---------

Co-authored-by: Sergey Solovyev <sergey.solovyev@amd.com>
Co-authored-by: Mikko Tukiainen <mikko.tukiainen@amd.com>
- Bump version from 1.10.0 to 1.11.0 (#293)
- [REDUX] Refactor Apex build process to use the PyTorch JIT extension flow (#291)

* Created initial code for loading fused_dense module dynamically instead of building it. Code uses accelerator and op_builder modules from deepspeed code.

* add apex/git_version_info_installed.py to gitignore as it is dynamically created by setup.py for the build process

* add code for building fused rope dynamically

* add code for building fused bias swiglu dynamically

* fix the code so that fused rope and fused softmax are not compiled in jit mode, add csrc back to setup.py since it is not copied to apex wheel

* load the jit modules inside and this prevents them from building when building the wheel

* convert syncbn module to jit

* fix the unnecessary compile of syncbn module in wheel building due to imports in python module

* add fused layer norm module to jit build

* make focal loss module as jit module

* make focal loss module as jit module

* make xentropy module as jit module

* make bpn module as jit module

* add code to build individual extensions without JIT

* clean up the flags for the modules based on apex/setup.py

* add function to get the backward_pass_guard_args in CudaOpBuilder and make MLP JIT compile

* add fused weight gradient mlp to jit compile

* move fused_weight_gradient_mlp_cuda load inside so that it is not compiled during apex installation

* make fused index mul 2d jit compile and add aten atomic header flag method to CUDAOpBuilder to support its jit compile

* make fast multihead attention as jit module, add generator_args to CudaOpBuilder support jit of this module

* make transducer loss and transducer joint modules as jit modules, add nvcc_threads_args method in CUDAOpBuilder to support these jit modules

* remove extra method - installed_cuda_version from CUDAOpBuilder

* add apex_C module to jit compile, add py-cpuinfo to requirements.txt as it is needed for TorchCPUOpBuilder

* make nccl allocator as a jit compile module, add nccl_args method to CUDAOpBuilder to support this

* make amp_C as a jit module

* add a few uses of amp_C jit module

* add a few uses of amp_C jit module

* make fused adam as a jit module

* add a few uses of amp_C jit module

* fix the issue with fused adam jit module

* make fused lamb as jit module

* make distributed adam as jit module

* make distributed lamb as jit module

* add remaining amp_C uses with jit loader

* add remaining usage of apexC jit module

* make nccl p2p module as jit compile

* make peer memory module as jit compile

* add code to check for minimum nccl version to compile nccl allocator module

* add provision to provide APEX_CPP_OPS=1  and APEX_CUDA_OPS=1 as replacement for --cpp_ext --cuda_ext command line arguments for building specific extensions in apex, save these settings for later use

* check for minimum torch version for nccl allocator; if the module is not compatible, remove it from the installed ops list

* add build as a dependency to support wheel building

* Replace is_compatible to check for installation conditions with is_supported, because there is an issue with loading nccl allocator

* Similar to pytorch we create a make command to install aiter, that the user can use. There will be no building aiter in the setup.py

* update extension import test so that it considers jit compile extensions

* clean up MultiTensorApply usages so that amp_C is not built in jit compile mode

* Adding missing modules from deepspeed repo. Remove extra code in setup.py. Use is_compatible instead of is_supported

* change name of apex_C module

* change the name of cpp and cuda build flags, remove APEX_BUILD_OPS, cleanup the logic to build specific modules

* add missing files used in cpu accelerator

* add make clean command to handle deleting torch extensions installed for jit modules, fix the cpu builder import error

* remove unused code in setup.py, fix the code to build for cpu mode

* Removing unused code

* remove accelerator package and refactor the used code into op_builder.all_ops BuilderUtils class

* remove accelerator package usages

* revert code that was removed by mistake

* Cleaning up the setup file and renaming functions and variable to more readable names.

* Fix the nccl version so that the nccl_allocator.so file can be loaded properly.

Setup() call has an argument called py_modules which copies the python class into sitepackages folder. The python modules in the compatibility folder do lazy load of the builder classes. First these files are copied in the parent folder so that the files themselves are copied into sitepackages so that the kernel can be loaded into python then these temporary files are deleted.

* Restore to original importing the extension code.

* renamed compatibility/scaled_masked_softmax_cuda.py, added some extra tests in the contrib test runner

* Added instructions for JIT load and changes in installation options

* Restructuring the README

* Added instructions for building wheel

* replaced TorchCPUBuilder with CPUBuilder, added a main method in contrib test runner

* create a script to build different jit conditions for running different tests

* add script to run tests with different jit builds, add instructions to run jit build and tests in readme, add other tests in readme

* fix the issues with running the tests - improper paths, counting .so files in apex folder

* add mad internal scripts

* remove print statement

* remove testing section from readme

* change location of result file

* remove multiple results file from models.json

* add platform specific description to wheel name even if no CppExtension or CUDAExtension is built with JIT load approach

* add ninja and wheel to requirements to be installed

* Update Release notes in Readme

* Exclude compatibility folder while installing apex

* Update README.md

* Update README.md

* Update README.md

* Adding modification note to the original copywrite

* fix the issue with symbolic links for op_builder, csrc when the apex repo is cloned in the docker

* assign the symbolically linked folders into a variable and then loop across the list entries

* remove unnecessary tabs

---------

Co-authored-by: skishore <sriramkumar.kishorekumar@amd.com>
Co-authored-by: sriram <sriram.kumar@silo.ai>
- Pow implementation is very expensive on AMD CDNA4. (#292)

This commit changes it to the mathematically equivalent
exp(y*log(x)) for x > 0.
However, a 1-2 ULP precision loss is possible.
- Update README.md (#289)
- Update version to 1.10.0 (#282)
- add code to read BUILD_VERSION env variable, so that it is used instead of version.txt when creating a wheel (#278)

PRs:
- ROCm/apex#304

Fixes:
- https://example.com/issue-292
- https://example.com/issue-278
- https://example.com/issue-295
- https://example.com/issue-294
- https://example.com/issue-289
- https://example.com/issue-304
- https://example.com/issue-291
- https://example.com/issue-282
- https://example.com/issue-293
- https://example.com/issue-297
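The `Pow` change in the commit list above relies on the identity x**y = exp(y * log(x)), valid for x > 0; a quick numerical sanity check (not the kernel code itself):

```python
import math

def pow_via_exp_log(x, y):
    # Valid only for x > 0: log(x) is undefined otherwise.
    return math.exp(y * math.log(x))

# Matches the built-in pow to within a few ULP for positive bases.
for x, y in [(2.0, 10.0), (3.7, 0.5), (10.0, -2.0)]:
    assert math.isclose(pow_via_exp_log(x, y), x ** y, rel_tol=1e-12)
```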
amd-sriram pushed a commit that referenced this pull request Mar 2, 2026
Commit Messages:
- Update README with release notes for version 1.11.0 (#310)

Added release notes for version 1.11.0, including new extensions and upgrades. Updated previous release notes for clarity.
- Create custom python operators for MixedFusedLayerNorm and MixedFusedRMSNorm. (#304)
- Add new apex module to jit load system (#294)

* add code to add loader module for jit module

* fix errors to create jit module adder - use correct file name to save code to

* fix errors to create jit module adder - use correct class name of the builder and parameter to supply builder module name

* fix errors to create jit module loader

* add description about jit module script to add jit loader for a jit module with builder provided

* add description about jit module script to add jit loader for a jit module with builder provided

* add attributes and methods to override when creating a jit module builder

* add extra new lines

* update jit module to take the builder file name and extract module name from the builder, update missing entries in the table in readme for adding new module in jit

* refine the description about module to jit

* add description about jit

* add description about jit

* add code to create a builder based on user inputs

* change the example from fused_dense to swiglu

* allow user to skip sources list

* change description of cxx and nvcc flags, add description of methods and fields in the initial builder code created by script
- add details of fused_conv_bias_relu in table of modules and fix error of maximum depth reached (#297)

* add details of fused_conv_bias_relu in table of modules and build flag

* solve the maximum depth error.
- Port fused_conv_bias_relu to ROCm (#295)

* Add support for conv bias relu

* Fix compilation failure

* omit check_cudnn_version_and_warn check (no cuDNN on ROCm)

* Flatten bias for PyTorch from 4D to 1D

* Implement fusion of Conv with ReLU with MIOpen

* Fix compilation issues

* Fix crash for ConvBias

* Fix merge issues

* Add support for ConvBias and ConvBiasMaskRelu

* Fix segmentation fault on bwd for ConvBias

* add code for fusing conv+bias for retinanet, add test case for retinanet

* Fix torch warning

* Fix warnings in a unit test file as well

* add builder and loader for fused_conv_bias_relu module

---------

Co-authored-by: Sergey Solovyev <sergey.solovyev@amd.com>
Co-authored-by: Mikko Tukiainen <mikko.tukiainen@amd.com>
- Bump version from 1.10.0 to 1.11.0 (#293)
- [REDUX] Refactor Apex build process to use the PyTorch JIT extension flow (#291)

* Created initial code for loading fused_dense module dynamically instead of building it. Code uses accelerator and op_builder modules from deepspeed code.

* add apex/git_version_info_installed.py to gitignore as it is dynamically created by setup.py for the build process

* add code for building fused rope dynamically

* add code for building fused bias swiglu dynamically

* fix the code so that fused rope and fused softmax are not compiled in jit mode, add csrc back to setup.py since it is not copied to apex wheel

* load the jit modules inside and this prevents them from building when building the wheel

* convert syncbn module to jit

* fix the unnecessary compile of syncbn module in wheel building due to imports in python module

* add fused layer norm module to jit build

* make focal loss module as jit module

* make focal loss module as jit module

* make xentropy module as jit module

* make bpn module as jit module

* add code to build individual extensions without JIT

* clean up the flags for the modules based on apex/setup.py

* add function to get the backward_pass_guard_args in CudaOpBuilder and make MLP JIT compile

* add fused weight gradient mlp to jit compile

* move fused_weight_gradient_mlp_cuda load inside so that it is not compiled during apex installation

* make fused index mul 2d jit compile and add aten atomic header flag method to CUDAOpBuilder to support its jit compile

* make fast multihead attention as jit module, add generator_args to CudaOpBuilder support jit of this module

* make transducer loss and transducer joint modules as jit modules, add nvcc_threads_args method in CUDAOpBuilder to support these jit modules

* remove extra method - installed_cuda_version from CUDAOpBuilder

* add apex_C module to jit compile, add py-cpuinfo to requirements.txt as it is needed for TorchCPUOpBuilder

* make nccl allocator as a jit compile module, add nccl_args method to CUDAOpBuilder to support this

* make amp_C as a jit module

* add a few uses of amp_C jit module

* add a few uses of amp_C jit module

* make fused adam as a jit module

* add a few uses of amp_C jit module

* fix the issue with fused adam jit module

* make fused lamb as jit module

* make distributed adam as jit module

* make distributed lamb as jit module

* add remaining amp_C uses with jit loader

* add remaining usage of apexC jit module

* make nccl p2p module as jit compile

* make peer memory module as jit compile

* add code to check for minimum nccl version to compile nccl allocator module

* add provision to provide APEX_CPP_OPS=1 and APEX_CUDA_OPS=1 as replacements for the --cpp_ext and --cuda_ext command line arguments for building specific extensions in apex, and save these settings for later use

* check for minimum torch version for nccl allocator; if the module is not compatible, remove it from the installed ops list

* add build as a dependency to support wheel building

* Replace is_compatible with is_supported for checking installation conditions, because there is an issue with loading the nccl allocator

* Similar to PyTorch, create a make command that the user can use to install aiter; aiter is no longer built in setup.py

* update extension import test so that it considers jit compile extensions

* clean up MultiTensorApply usages so that amp_C is not built in jit compile mode

* Adding missing modules from deepspeed repo. Remove extra code in setup.py. Use is_compatible instead of is_supported

* change name of apex_C module

* change the name of cpp and cuda build flags, remove APEX_BUILD_OPS, cleanup the logic to build specific modules

* add missing files used in cpu accelerator

* add make clean command to handle deleting torch extensions installed for jit modules, fix the cpu builder import error

* remove unused code in setup.py, fix the code to build for cpu mode

* Removing unused code

* remove accelerator package and refactor the used code into op_builder.all_ops BuilderUtils class

* remove accelerator package usages

* revert code that was removed by mistake

* Cleaning up the setup file and renaming functions and variables to more readable names.

* Fix the nccl version so that the nccl_allocator.so file can be loaded properly.

The setup() call has an argument called py_modules which copies Python modules into the site-packages folder. The Python modules in the compatibility folder lazily load the builder classes. These files are first copied into the parent folder so that they end up in site-packages and the kernel can be loaded from Python; the temporary copies are then deleted.

* Restore to original importing the extension code.

* renamed compatibility/scaled_masked_softmax_cuda.py, added some extra tests in the contrib test runner

* Added instructions for JIT load and changes in installation options

* Restructuring the README

* Added instructions for building wheel

* replaced TorchCPUBuilder with CPUBuilder, added a main method in contrib test runner

* create a script to build different jit conditions for running different tests

* add script to run tests with different jit builds, add instructions to run jit build and tests in readme, add other tests in readme

* fix the issues with running the tests - improper paths, counting .so files in apex folder

* add mad internal scripts

* remove print statement

* remove testing section from readme

* change location of result file

* remove multiple results file from models.json

* add platform-specific description to the wheel name even when no CppExtension or CUDAExtension is built with the JIT load approach

* add ninja and wheel to requirements to be installed

* Update Release notes in Readme

* Exclude compatibility folder while installing apex

* Update README.md

* Update README.md

* Update README.md

* Adding modification note to the original copyright

* fix the issue with symbolic links for op_builder, csrc when the apex repo is cloned in the docker

* assign the symbolically linked folders into a variable and then loop across the list entries

* remove unnecessary tabs

---------

Co-authored-by: skishore <sriramkumar.kishorekumar@amd.com>
Co-authored-by: sriram <sriram.kumar@silo.ai>
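The refactor above repeatedly converts extensions to "jit modules" that are compiled on first use rather than at install time. A minimal sketch of that lazy-load pattern (class and method names are illustrative, not Apex's actual API; real builders call torch.utils.cpp_extension.load() where the placeholder is):

```python
# Hypothetical sketch of the DeepSpeed-style op_builder pattern: an OpBuilder
# only compiles/loads its extension the first time load() is called and caches
# the module afterwards, so a wheel build that never calls load() compiles nothing.

class OpBuilder:
    def __init__(self, name):
        self.name = name
        self._module = None  # populated lazily on first load()

    def is_compatible(self):
        # Real builders check toolchain, torch, and library versions here.
        return True

    def build(self):
        # Real builders invoke torch.utils.cpp_extension.load() here;
        # a placeholder dict keeps this sketch self-contained and runnable.
        return {"name": self.name, "built": True}

    def load(self):
        if self._module is None:
            if not self.is_compatible():
                raise RuntimeError(f"{self.name} is not supported on this system")
            self._module = self.build()
        return self._module

builder = OpBuilder("fused_dense")
mod = builder.load()           # compiles on first use
assert builder.load() is mod   # cached thereafter
```

Moving the load() calls inside functions (as several commits above describe) is what keeps `pip wheel` from triggering any compilation.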
- Pow implementation is very expensive on AMD CDNA4. (#292)

This commit changes it to the mathematically equivalent
exp(y*log(x)) for x > 0.
However, a 1-2 ULP precision loss is possible.
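The rewrite relies on the identity pow(x, y) = exp(y * log(x)), which holds only for x > 0. A quick sketch checking the identity (the actual commit applies it inside the CDNA4 kernel, not in Python):

```python
import math

def pow_via_exp(x, y):
    """pow(x, y) rewritten as exp(y * log(x)); valid only for x > 0."""
    assert x > 0, "identity only holds for positive x"
    return math.exp(y * math.log(x))

# Results agree with x ** y to within a few ULP, matching the noted
# 1-2 ULP precision loss of the transformed kernel.
for x, y in [(2.0, 10.0), (3.5, -1.25), (0.5, 0.5)]:
    assert math.isclose(pow_via_exp(x, y), x ** y, rel_tol=1e-12)
```

The small error comes from rounding in log(x) being amplified by the multiplication with y before exp.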
- Update README.md (#289)
- Update version to 1.10.0 (#282)
- add code to read BUILD_VERSION env variable, so that it is used instead of version.txt when creating a wheel (#278)
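The BUILD_VERSION override described in the last entry can be sketched as follows (function name is hypothetical; the real logic lives in setup.py):

```python
import os

def resolve_version(version_txt_contents):
    """Prefer the BUILD_VERSION env variable; fall back to version.txt."""
    env = os.environ.get("BUILD_VERSION")
    return env if env else version_txt_contents.strip()

# Without the env variable set, version.txt wins.
os.environ.pop("BUILD_VERSION", None)
assert resolve_version("1.10.0\n") == "1.10.0"

# With it set, the env variable overrides version.txt.
os.environ["BUILD_VERSION"] = "1.11.0.dev1"
assert resolve_version("1.10.0\n") == "1.11.0.dev1"
del os.environ["BUILD_VERSION"]
```

This lets CI stamp nightly or release-candidate wheel versions without editing the checked-in version.txt.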

PRs:
- ROCm/apex#310

Fixes:
- https://example.com/issue-297
- https://example.com/issue-282
- https://example.com/issue-289
- https://example.com/issue-293
- https://example.com/issue-291
- https://example.com/issue-294
- https://example.com/issue-292
- https://example.com/issue-310
- https://example.com/issue-304
- https://example.com/issue-278
- https://example.com/issue-295