
[REDUX] Refactor Apex build process to use the PyTorch JIT extension flow #291

Merged
jithunnair-amd merged 83 commits into master from Refactor_build on Jan 26, 2026

Conversation

@jithunnair-amd (Collaborator) commented Dec 4, 2025

Had to revert #247 due to a build breakage seen in AISW HUD runs, and removed the change from the master branch until we can figure out the root cause. This PR will be used to re-merge the changes.

Motivation

Building Apex currently takes around 30 minutes. The motivation behind JIT (just-in-time) load is to reduce installation time to under one minute and to build the modules on demand when they are used or when their tests are run.

In addition, this PR provides flexibility in building specific modules (via arguments selecting the C++ or CUDA extensions, or individual modules).

Because of https://discuss.python.org/t/symbolic-links-in-wheels/1945/19, we resolve symbolic links before calling setup() in setup.py: each link is removed and replaced with a copy of the contents it pointed to. The changes were tested locally as well as in AISW HUD runs.
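
A rough sketch of the idea (not the actual setup.py code; the folder names here are illustrative, based on the commit log mentioning op_builder and csrc being handled this way):

import os
import shutil

# Folders that are symbolic links in the cloned repo (illustrative names).
symlinked_dirs = ["apex/contrib/csrc", "apex/op_builder"]

for link in symlinked_dirs:
    if os.path.islink(link):
        target = os.path.realpath(link)   # where the link points
        os.unlink(link)                   # drop the symlink itself
        shutil.copytree(target, link)     # replace it with a real copy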

Technical Details

  • By default, all modules are JIT compiled (see the sketch after this list)
  • APEX_BUILD_<OP_NAME>=1 precompiles a specific module, e.g. APEX_BUILD_FUSED_DENSE=1
  • APEX_BUILD_CPP_OPS and APEX_BUILD_CUDA_OPS environment variables build the C++ or CUDA extensions ahead of time (replacing the --cpp_ext and --cuda_ext arguments)
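
A minimal sketch of how these environment variables can gate which modules are precompiled in setup.py (names mirror the flags above; the actual selection logic in the PR may differ):

import os

def should_prebuild(op_name, is_cuda_op=True):
    # Build a whole category ahead of time...
    if is_cuda_op and os.environ.get("APEX_BUILD_CUDA_OPS") == "1":
        return True
    if not is_cuda_op and os.environ.get("APEX_BUILD_CPP_OPS") == "1":
        return True
    # ...or a single module, e.g. APEX_BUILD_FUSED_DENSE=1.
    return os.environ.get(f"APEX_BUILD_{op_name.upper()}") == "1"

# Anything not selected here stays JIT-compiled on first use.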

To install from source:
pip install . --no-build-isolation
To build the wheel and install from wheel:

python -m build --wheel --no-isolation 
pip install dist/apex-*.whl 

Currently converted extensions (all of which work on ROCm):

  • apex_C
  • distributed_adam_cuda
  • distributed_lamb_cuda
  • amp_C
  • syncbn
  • fused_layer_norm_cuda
  • fused_dense_cuda
  • fused_weight_gradient_mlp_cuda
  • mlp_cuda
  • scaled_upper_triang_masked_softmax_cuda
  • generic_scaled_masked_softmax_cuda
  • scaled_masked_softmax_cuda
  • scaled_softmax_cuda
  • fused_rotary_positional_embedding
  • fused_bias_swiglu
  • bnp
  • xentropy_cuda
  • focal_loss_cuda
  • fused_index_mul_2d
  • deprecated_fused_adam
  • fused_lamb_cuda
  • fast_multihead_attn
  • transducer_joint_cuda
  • transducer_loss_cuda
  • peer_memory_cuda
  • nccl_p2p_cuda
  • _apex_nccl_allocator

Total - 27 extensions

The following extensions have not been included in JIT load in this PR, as they were not used before (NVIDIA GPU only):

  • fused_conv_bias_relu
  • fast_layer_norm
  • fmhalib
  • fast_bottleneck

Added custom code to support building Apex modules (see the sketch after this list):

  • backward_pass_guard_args (backward pass guards)
  • aten_atomic_args
  • generator_args
  • nvcc_threads_args
  • nccl_args
  • nccl_version
  • is_supported
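
A self-contained sketch of how such helpers can feed flags into a JIT build (the class shape and flag values are assumptions modelled on the DeepSpeed-style op_builder pattern referenced in the commit log, not the actual Apex code):

from torch.utils.cpp_extension import load

class CUDAOpBuilderSketch:
    """Illustrative only; the real CUDAOpBuilder lives in op_builder/."""
    name = "fused_dense_cuda"
    sources = ["csrc/fused_dense.cpp", "csrc/fused_dense_cuda.cu"]  # placeholder paths

    def backward_pass_guard_args(self):
        return []          # extra C++ flags guarding backward-pass code

    def nvcc_threads_args(self):
        return []          # e.g. parallel device-compilation flags

    def is_supported(self):
        return True        # e.g. gate NCCL modules on the detected nccl_version

    def jit_load(self):
        if not self.is_supported():
            raise RuntimeError(f"{self.name} is not supported on this system")
        return load(
            name=self.name,
            sources=self.sources,
            extra_cflags=self.backward_pass_guard_args(),
            extra_cuda_cflags=self.nvcc_threads_args(),
        )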

Other changes

  • aiter is not built as part of setup.py. The user can install aiter with make aiter, similar to PyTorch.
  • make clean removes the torch extensions created with JIT load (see the sketch below).
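
JIT-built extensions land in PyTorch's extension cache, so the clean step only has to remove that directory. A sketch of the idea (the actual make clean target may do more):

import os
import shutil

# Default PyTorch JIT extension cache on Linux; can be overridden via
# the TORCH_EXTENSIONS_DIR environment variable.
cache_dir = os.environ.get(
    "TORCH_EXTENSIONS_DIR",
    os.path.join(os.path.expanduser("~"), ".cache", "torch_extensions"),
)
if os.path.isdir(cache_dir):
    shutil.rmtree(cache_dir)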

Unit tests run

cd tests/L0
PYTHONUNBUFFERED=1 sh run_rocm.sh 2>&1 | tee log_results.txt

cd apex/contrib/test/
PYTHONUNBUFFERED=1 python3 run_rocm_extensions.py 2>&1 | tee log_results_contrib.txt 

torchrun --nproc_per_node 8 apex/contrib/peer_memory/peer_halo_exchange_module_tests.py

cd tests/distributed/synced_batchnorm
sh unit_test.sh

Docker used for the testing (tested both CPU-only and with GPU):
registry-sc-harbor.amd.com/framework/compute-rocm-rel-7.0:32_ubuntu22.04_py3.10_pytorch_release-2.8_d2d97084

Tested the following commands

Ran the different build instructions:

  • Different GPUs: MI200, MI300
  • Counting the number of failed unit tests (L0, contrib) and the number of .so files and torch extensions built (see the sketch after this list)
  • Created MAD engine based scripts for running the different test conditions: https://github.com/amd-sriram/Tools/blob/main/AutomatedBuildTest/models.json
  • DLM workload
    madengine run --tags pyt_deepspeed_megatron_llama2_7b --live-output --additional-context "{'guest_os': 'UBUNTU', 'docker_build_arg':{'BASE_DOCKER':'registry-sc-harbor.amd.com/framework/compute-rocm-dkms-no-npi-hipclang:16771_ubuntu24.04_py3.12_pytorch_rocm7.1_internal_testing_d1fb13a8'}}"
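
A rough sketch of how the .so and torch-extension counts in the tables below can be gathered (the actual test scripts live in the linked repo; this assumes prebuilt extensions are installed inside the apex package directory):

import glob
import os

import apex

# Precompiled extensions shipped as .so files with the installed apex package
# (adjust the path if they are installed at the site-packages top level).
apex_dir = os.path.dirname(apex.__file__)
so_count = len(glob.glob(os.path.join(apex_dir, "**", "*.so"), recursive=True))

# JIT-built extensions end up in the torch extensions cache (Linux default shown).
cache_dir = os.environ.get(
    "TORCH_EXTENSIONS_DIR",
    os.path.join(os.path.expanduser("~"), ".cache", "torch_extensions"),
)
ext_count = len(glob.glob(os.path.join(cache_dir, "**", "*.so"), recursive=True))

print(f"built .so count: {so_count}, torch extensions count: {ext_count}")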

Running extensive tests on MI300

| Command | Failed L0 tests | Failed contrib tests | Built .so count | Torch extensions count |
|---|---|---|---|---|
| python setup.py install --cpp_ext --cuda_ext (old command) | 2 | 0 | 0 | 27 |
| pip install . --no-build-isolation | 2 | 0 | 0 | 27 |
| APEX_BUILD_CPP_OPS=1 pip install . --no-build-isolation | 2 | 0 | 1 | 26 |
| APEX_BUILD_CUDA_OPS=1 pip install . --no-build-isolation | 3 | 0 | 26 | 1 |
| APEX_BUILD_CPP_OPS=1 APEX_BUILD_CUDA_OPS=1 pip install . --no-build-isolation | 2 | 0 | 27 | 0 |
| APEX_BUILD_FUSED_DENSE=1 pip install . --no-build-isolation | 2 | 0 | 1 | 26 |
| APEX_BUILD_FUSED_DENSE=1 ... pip install . --no-build-isolation | 2 | 0 | 27 | 0 |
| python -m build --wheel --no-isolation . | 2 | 0 | 0 | 27 |
| APEX_BUILD_CPP_OPS=1 python -m build --wheel --no-isolation | 2 | 0 | 1 | 26 |
| APEX_BUILD_CUDA_OPS=1 python -m build --wheel --no-isolation | 2 | 0 | 26 | 1 |
| APEX_BUILD_CPP_OPS=1 APEX_BUILD_CUDA_OPS=1 python -m build --wheel --no-isolation | 2 | 0 | 27 | 0 |
| APEX_BUILD_FUSED_DENSE=1 python -m build --wheel --no-isolation | 2 | 0 | 1 | 26 |
| APEX_BUILD_FUSED_DENSE=1 ... pip install . --no-build-isolation | 2 | 0 | 27 | 0 |

Created an issue for the two errors: https://github.com/ROCm/frameworks-internal/issues/14438

Running a few commands on MI200

  • pip install . --no-build-isolation
  • APEX_BUILD_CPP_OPS=1 APEX_BUILD_CUDA_OPS=1 pip install . --no-build-isolation
  • python -m build --wheel --no-isolation .
  • APEX_BUILD_CPP_OPS=1 APEX_BUILD_CUDA_OPS=1 python -m build --wheel --no-isolation

Creating the wheels in a CPU-only docker and running the tests in a GPU docker (MI300)

| Command | Failed L0 tests | Failed contrib tests | Built .so count | Torch extensions count |
|---|---|---|---|---|
| python -m build --wheel --no-isolation . | 2 | 0 | 0 | 27 |
| APEX_BUILD_CPP_OPS=1 python -m build --wheel --no-isolation | 2 | 0 | 1 | 26 |
| APEX_BUILD_CUDA_OPS=1 python -m build --wheel --no-isolation | 2 | 0 | 26 | 1 |
| APEX_BUILD_CPP_OPS=1 APEX_BUILD_CUDA_OPS=1 python -m build --wheel --no-isolation | 2 | 0 | 27 | 0 |
| APEX_BUILD_FUSED_DENSE=1 ... pip install . --no-build-isolation | 2 | 0 | 27 | 0 |

Created scripts for testing JIT build and documented the instructions at https://amd.atlassian.net/wiki/spaces/MLSE/pages/1255652200/Testing

@jithunnair-amd jithunnair-amd merged commit 95043e3 into master Jan 26, 2026
@jithunnair-amd jithunnair-amd deleted the Refactor_build branch January 26, 2026 21:22
@jithunnair-amd jithunnair-amd restored the Refactor_build branch January 26, 2026 21:22
@jithunnair-amd jithunnair-amd deleted the Refactor_build branch January 26, 2026 23:23
@amd-sriram (Collaborator)

! cherry-pick --onto release/1.10.0

@rocm-repo-management-api

Can't perform the cherry-pick keyword: unexpected error


amd-sriram added a commit that referenced this pull request Feb 4, 2026
[REDUX] Refactor Apex build process to use the PyTorch JIT extension flow (#291)

* Created initial code for loading fused_dense module dynamically instead of building it. Code uses accelerator and op_builder modules from deepspeed code.

* add apex/git_version_info_installed.py to gitignore as it is dynamically created by setup.py for the build process

* add code for building fused rope dynamically

* add code for building fused bias swiglu dynamically

* fix the code so that fused rope and fused softmax are not compiled in jit mode, add csrc back to setup.py since it is not copied to apex wheel

* load the jit modules inside and this prevents them from building when building the wheel

* convert syncbn module to jit

* fix the unnecessary compile of syncbn module in wheel building due to imports in python module

* add fused layer norm module to jit build

* make focal loss module as jit module

* make focal loss module as jit module

* make xentropy module as jit module

* make bnp module a jit module

* add code to build individual extensions without JIT

* clean up the flags for the modules based on apex/setup.py

* add function to get the backward_pass_guard_args in CudaOpBuilder and make MLP JIT compile

* add fused weight gradient mlp to jit compile

* move fused_weight_gradient_mlp_cuda load inside so that it is not compiled during apex installation

* make fused index mul 2d jit compile and add aten atomic header flag method to CUDAOpBuilder to support its jit compile

* make fast multihead attention a jit module, add generator_args to CudaOpBuilder to support jit compile of this module

* make transducer loss and transducer joint modules as jit modules, add nvcc_threads_args method in CUDAOpBuilder to support these jit modules

* remove extra method - installed_cuda_version from CUDAOpBuilder

* add apex_C module to jit compile, add py-cpuinfo to requirements.txt as it is needed for TorchCPUOpBuilder

* make nccl allocator as a jit compile module, add nccl_args method to CUDAOpBuilder to support this

* make amp_C as a jit module

* add a few uses of amp_C jit module

* add a few uses of amp_C jit module

* make fused adam as a jit module

* add a few uses of amp_C jit module

* fix the issue with fused adam jit module

* make fused lamb as jit module

* make distributed adam as jit module

* make distributed lamb as jit module

* add remaining amp_C uses with jit loader

* add remaining usage of apex_C jit module

* make nccl p2p module as jit compile

* make peer memory module as jit compile

* add code to check for minimum nccl version to compile nccl allocator module

* add provision to provide APEX_CPP_OPS=1 and APEX_CUDA_OPS=1 as replacements for the --cpp_ext --cuda_ext command line arguments for building specific extensions in apex; save these settings for later use

* check for minimum torch version for nccl allocator; check if the module is compatible, otherwise remove it from the installed ops list

* add build as a dependency to support wheel building

* Replace is_compatible with is_supported for checking installation conditions, because there is an issue with loading nccl allocator

* Similar to pytorch, we create a make command that the user can use to install aiter; aiter is no longer built in setup.py

* update extension import test so that it considers jit compile extensions

* clean up MultiTensorApply usages so that amp_C is not built in jit compile mode

* Adding missing modules from deepspeed repo. Remove extra code in setup.py. Use is_compatible instead of is_supported

* change name of apex_C module

* change the name of cpp and cuda build flags, remove APEX_BUILD_OPS, cleanup the logic to build specific modules

* add missing files used in cpu accelerator

* add make clean command to handle deleting torch extensions installed for jit modules, fix the cpu builder import error

* remove unused code in setup.py, fix the code to build for cpu mode

* Removing unused code

* remove accelerator package and refactor the used code into op_builder.all_ops BuilderUtils class

* remove accelerator package usages

* revert code that was removed by mistake

* Cleaning up the setup file and renaming functions and variables to more readable names.

* Fix the nccl version so that the nccl_allocator.so file can be loaded properly.

The setup() call has an argument called py_modules which copies the Python modules into the site-packages folder. The Python modules in the compatibility folder lazily load the builder classes. These files are first copied into the parent folder so that they themselves end up in site-packages and the kernel can be loaded into Python; the temporary copies are then deleted.

* Restore to original importing the extension code.

* renamed compatibility/scaled_masked_softmax_cuda.py, added some extra tests in the contrib test runner

* Added instructions for JIT load and changes in installation options

* Restructuring the README

* Added instructions for building wheel

* replaced TorchCPUBuilder with CPUBuilder, added a main method in contrib test runner

* create a script to build different jit conditions for running different tests

* add script to run tests with different jit builds, add instructions to run jit build and tests in readme, add other tests in readme

* fix the issues with running the tests - improper paths, counting .so files in apex folder

* add mad internal scripts

* remove print statement

* remove testing section from readme

* change location of result file

* remove multiple results file from models.json

* add platform-specific description to the wheel name even if no CppExtension or CUDAExtension is built with the JIT load approach

* add ninja and wheel to requirements to be installed

* Update Release notes in Readme

* Exclude compatibility folder while installing apex

* Update README.md

* Update README.md

* Update README.md

* Adding modification note to the original copyright

* fix the issue with symbolic links for op_builder, csrc when the apex repo is cloned in the docker

* assign the symbolically linked folders into a variable and then loop across the list entries

* remove unnecessary tabs

---------

Co-authored-by: skishore <sriramkumar.kishorekumar@amd.com>
Co-authored-by: sriram <sriram.kumar@silo.ai>
amd-sriram added a commit that referenced this pull request Feb 4, 2026
[REDUX] Refactor Apex build process to use the PyTorch JIT extension flow (#291) (#296)

jithunnair-amd pushed a commit to ROCm/pytorch that referenced this pull request Feb 4, 2026
[REDUX] Refactor Apex build process to use the PyTorch JIT extension flow ([#291](ROCm/apex#291)) ([#296](ROCm/apex#296))
@ROCm ROCm deleted a comment from rocm-repo-management-api Bot Feb 12, 2026
@amd-sriram (Collaborator)

! cherry-pick --onto release/1.9.0

@amd-sriram (Collaborator)

! cherry-pick --onto release/1.8.0

@rocm-repo-management-api

Created branch autogenerated/release/1.9.0_cherry-pick_pr-291 and #305. It contains a merge conflict. Please resolve it


@rocm-repo-management-api

Created branch autogenerated/release/1.8.0_cherry-pick_pr-291 and #306. It contains a merge conflict. Please resolve it

