
Refactor Apex build process to use the PyTorch JIT extension flow#247

Merged
amd-sriram merged 80 commits into master from Refactor_build on Dec 1, 2025

Conversation

@amd-sriram amd-sriram commented Jul 10, 2025

Motivation

Currently, building Apex takes around 30 minutes. The motivation behind JIT (just-in-time) load is to reduce installation time to under 1 minute and to build the modules on demand, when they are first used or when their tests are run.

In addition, this PR adds the flexibility to precompile specific modules (via environment variables selecting the C++ extensions, the CUDA extensions, or individual ops).

Technical Details

  • By default all modules are JIT compiled
  • APEX_BUILD_<OP_NAME>=1 - precompile specific modules e.g. APEX_BUILD_FUSED_DENSE=1
  • APEX_BUILD_CPP_OPS=1 and APEX_BUILD_CUDA_OPS=1 - precompile all C++ or all CUDA extensions, respectively (replacing the old --cpp_ext and --cuda_ext flags)
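As a sketch of how these switches could be interpreted (the variable names come from this PR, but the parsing logic below is an illustrative assumption, not the actual setup.py implementation):

```python
import os

def requested_prebuilds(environ=None):
    """Illustrative parsing of the APEX_BUILD_* environment switches.

    Returns (build_all_cpp, build_all_cuda, individually_selected_ops).
    The exact behavior in setup.py may differ; this only mirrors the
    semantics described in the PR text.
    """
    environ = os.environ if environ is None else environ
    build_all_cpp = environ.get("APEX_BUILD_CPP_OPS") == "1"
    build_all_cuda = environ.get("APEX_BUILD_CUDA_OPS") == "1"
    # APEX_BUILD_<OP_NAME>=1 selects a single op, e.g. APEX_BUILD_FUSED_DENSE=1
    ops = {
        key[len("APEX_BUILD_"):].lower()
        for key, val in environ.items()
        if key.startswith("APEX_BUILD_")
        and key not in ("APEX_BUILD_CPP_OPS", "APEX_BUILD_CUDA_OPS")
        and val == "1"
    }
    return build_all_cpp, build_all_cuda, ops
```

With no `APEX_BUILD_*` variables set, nothing is precompiled and everything falls through to JIT load.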

To install from source:

pip install . --no-build-isolation

To build a wheel and install from it:

python -m build --wheel --no-isolation
pip install dist/apex-*.whl

Extensions converted in this PR (all of which work on ROCm):

  • apex_C
  • distributed_adam_cuda
  • distributed_lamb_cuda
  • amp_C
  • syncbn
  • fused_layer_norm_cuda
  • fused_dense_cuda
  • fused_weight_gradient_mlp_cuda
  • mlp_cuda
  • scaled_upper_triang_masked_softmax_cuda
  • generic_scaled_masked_softmax_cuda
  • scaled_masked_softmax_cuda
  • scaled_softmax_cuda
  • fused_rotary_positional_embedding
  • fused_bias_swiglu
  • bnp
  • xentropy_cuda
  • focal_loss_cuda
  • fused_index_mul_2d
  • deprecated_fused_adam
  • fused_lamb_cuda
  • fast_multihead_attn
  • transducer_joint_cuda
  • transducer_loss_cuda
  • peer_memory_cuda
  • nccl_p2p_cuda
  • _apex_nccl_allocator

Total - 27 extensions
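The on-demand behavior these extensions rely on can be illustrated with a small lazy-loading sketch. This is a hypothetical stand-in, not Apex's actual loader (which wraps torch.utils.cpp_extension.load); the names `_LazyExtension`, `register_lazy_extension`, and `builder` are invented here for illustration:

```python
import sys
from types import ModuleType

class _LazyExtension(ModuleType):
    """Defer an expensive build until the extension is first used.

    `builder` is any zero-argument callable that compiles and returns the
    real extension module (in Apex's case, a torch.utils.cpp_extension.load
    call); nothing runs until the first attribute access.
    """

    def __init__(self, name, builder):
        super().__init__(name)
        self._builder = builder
        self._real = None

    def __getattr__(self, attr):
        # First attribute access triggers the (slow) compile-and-import step.
        if self._real is None:
            self._real = self._builder()
        return getattr(self._real, attr)

def register_lazy_extension(name, builder):
    # After this, `import <name>` succeeds instantly; the build runs on first use.
    sys.modules[name] = _LazyExtension(name, builder)
```

This is why `pip install` can finish in under a minute: importing an extension is cheap, and the compile cost is paid only by the ops actually exercised.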

The following extensions have not been included in the JIT load flow in this PR, as they are NVIDIA-GPU-only and have not been used before:

  • fused_conv_bias_relu
  • fast_layer_norm
  • fmhalib
  • fast_bottleneck

Custom code added to support building Apex modules:

  • get backward pass guards
  • aten_atomic_args
  • generator_args
  • nvcc_threads_args
  • nccl_args
  • nccl_version
  • is_supported

Other changes

  • aiter is no longer built as part of setup.py. The user can install aiter with the command make aiter, similar to PyTorch.
  • make clean removes the torch extensions created with JIT load.
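For reference, a sketch of where the JIT-built artifacts live, which is what a clean target needs to remove. PyTorch honors the TORCH_EXTENSIONS_DIR environment variable and defaults to ~/.cache/torch_extensions; the exact subdirectory layout varies by torch version, and this helper is illustrative rather than taken from the PR:

```python
import os
from pathlib import Path

def torch_extensions_dir(environ=None) -> Path:
    """Resolve the base directory torch.utils.cpp_extension builds into.

    Mirrors PyTorch's documented behavior: $TORCH_EXTENSIONS_DIR if set,
    otherwise ~/.cache/torch_extensions. (Per-version subfolders below this
    path are an implementation detail and not modeled here.)
    """
    environ = os.environ if environ is None else environ
    override = environ.get("TORCH_EXTENSIONS_DIR")
    return Path(override) if override else Path.home() / ".cache" / "torch_extensions"
```

A `make clean` implementation would essentially remove this directory tree (e.g. with shutil.rmtree).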

Unit tests run

cd tests/L0
PYTHONUNBUFFERED=1 sh run_rocm.sh 2>&1 | tee log_results.txt

cd apex/contrib/test/
PYTHONUNBUFFERED=1 python3 run_rocm_extensions.py 2>&1 | tee log_results_contrib.txt 

torchrun --nproc_per_node 8 apex/contrib/peer_memory/peer_halo_exchange_module_tests.py

cd tests/distributed/synced_batchnorm
sh unit_test.sh

Docker image used for the testing (tested both CPU-only and with GPU):
registry-sc-harbor.amd.com/framework/compute-rocm-rel-7.0:32_ubuntu22.04_py3.10_pytorch_release-2.8_d2d97084

Tested the following commands

Running the different build instructions.

  • Different GPUs: MI200, MI300
  • Counted the number of failed unit tests (L0, contrib) and the number of .so files and torch extensions built
  • Created MAD-engine-based scripts for running the different test conditions: https://github.com/amd-sriram/Tools/blob/main/AutomatedBuildTest/models.json
  • DLM workload
    madengine run --tags pyt_deepspeed_megatron_llama2_7b --live-output --additional-context "{'guest_os': 'UBUNTU', 'docker_build_arg':{'BASE_DOCKER':'registry-sc-harbor.amd.com/framework/compute-rocm-dkms-no-npi-hipclang:16771_ubuntu24.04_py3.12_pytorch_rocm7.1_internal_testing_d1fb13a8'}}"

Running extensive tests on MI300

| Command | Failed L0 tests | Failed contrib tests | Built .so count | Torch extensions count |
| --- | --- | --- | --- | --- |
| `python setup.py install --cpp_ext --cuda_ext` (old command) | 2 | 0 | 0 | 27 |
| `pip install . --no-build-isolation` | 2 | 0 | 0 | 27 |
| `APEX_BUILD_CPP_OPS=1 pip install . --no-build-isolation` | 2 | 0 | 1 | 26 |
| `APEX_BUILD_CUDA_OPS=1 pip install . --no-build-isolation` | 3 | 0 | 26 | 1 |
| `APEX_BUILD_CPP_OPS=1 APEX_BUILD_CUDA_OPS=1 pip install . --no-build-isolation` | 2 | 0 | 27 | 0 |
| `APEX_BUILD_FUSED_DENSE=1 pip install . --no-build-isolation` | 2 | 0 | 1 | 26 |
| `APEX_BUILD_FUSED_DENSE=1 ... pip install . --no-build-isolation` | 2 | 0 | 27 | 0 |
| `python -m build --wheel --no-isolation .` | 2 | 0 | 0 | 27 |
| `APEX_BUILD_CPP_OPS=1 python -m build --wheel --no-isolation` | 2 | 0 | 1 | 26 |
| `APEX_BUILD_CUDA_OPS=1 python -m build --wheel --no-isolation` | 2 | 0 | 26 | 1 |
| `APEX_BUILD_CPP_OPS=1 APEX_BUILD_CUDA_OPS=1 python -m build --wheel --no-isolation` | 2 | 0 | 27 | 0 |
| `APEX_BUILD_FUSED_DENSE=1 python -m build --wheel --no-isolation` | 2 | 0 | 1 | 26 |
| `APEX_BUILD_FUSED_DENSE=1 ... pip install . --no-build-isolation` | 2 | 0 | 27 | 0 |

Created an issue for the two errors: https://github.com/ROCm/frameworks-internal/issues/14438

Running a few commands on MI200

  • pip install . --no-build-isolation
  • APEX_BUILD_CPP_OPS=1 APEX_BUILD_CUDA_OPS=1 pip install . --no-build-isolation
  • python -m build --wheel --no-isolation
  • APEX_BUILD_CPP_OPS=1 APEX_BUILD_CUDA_OPS=1 python -m build --wheel --no-isolation

Built the wheels in a CPU-only Docker container and ran the tests in a GPU Docker container (MI300)

| Command | Failed L0 tests | Failed contrib tests | Built .so count | Torch extensions count |
| --- | --- | --- | --- | --- |
| `python -m build --wheel --no-isolation .` | 2 | 0 | 0 | 27 |
| `APEX_BUILD_CPP_OPS=1 python -m build --wheel --no-isolation` | 2 | 0 | 1 | 26 |
| `APEX_BUILD_CUDA_OPS=1 python -m build --wheel --no-isolation` | 2 | 0 | 26 | 1 |
| `APEX_BUILD_CPP_OPS=1 APEX_BUILD_CUDA_OPS=1 python -m build --wheel --no-isolation` | 2 | 0 | 27 | 0 |
| `APEX_BUILD_FUSED_DENSE=1 ... pip install . --no-build-isolation` | 2 | 0 | 27 | 0 |

Created scripts for testing the JIT build and documented the instructions at https://amd.atlassian.net/wiki/spaces/MLSE/pages/1255652200/Testing

@amd-sriram amd-sriram self-assigned this Jul 10, 2025
@amd-sriram amd-sriram marked this pull request as ready for review July 28, 2025 09:30
@amd-sriram amd-sriram marked this pull request as draft July 28, 2025 11:04
@amd-sriram amd-sriram marked this pull request as ready for review July 30, 2025 14:57
@amd-sriram amd-sriram marked this pull request as draft August 6, 2025 16:29
@amd-sriram amd-sriram marked this pull request as ready for review August 13, 2025 20:26
Comment thread apex/git_version_info.py
# Copyright (c) Microsoft Corporation.
# SPDX-License-Identifier: Apache-2.0

# DeepSpeed Team
Collaborator


@amd-sriram Do we keep this text snippet?

Collaborator Author


This header is legally required. The code was adapted from DeepSpeed, which is Apache-2.0 licensed, and Section 4(c) of the Apache License mandates retaining all copyright and attribution notices in derivative works.
Reference: https://www.apache.org/licenses/LICENSE-2.0.txt (Section 4)

Excluding this snippet would violate the license.

The Apache 2.0 license is compatible with Apex's BSD-3 license, so there is no legal conflict in including this snippet.

jithunnair-amd
jithunnair-amd previously approved these changes Nov 29, 2025
Collaborator

@jithunnair-amd jithunnair-amd left a comment


This is a significant overhaul of the Apex build process. Thank you @amd-sriram for the extensive testing and multiple rounds of refactoring! Both JIT and non-JIT builds seem to build correctly in my testing as well. I was also able to run unit tests on a gfx90a GPU with the JIT build and got expected results.

The only packaging-related issue I noticed so far is that the JIT build installs the extension .py files in the site-packages installation directory, outside the apex subdirectory. This appears to violate the usual packaging conventions. However, there are valid use cases that import Apex extensions without the apex module being specified, e.g. import fused_weight_gradient_mlp_cuda (more in https://github.com/ROCm/frameworks-internal/issues/12681#issuecomment-3591503775). We need to explore in a follow-up PR whether one can import these extensions without having their .py files present in site-packages (e.g. a separate directory for each extension in site-packages, containing an __init__.py and the extension's Python file). But that can be addressed in a follow-up issue, so we can merge this and proceed with testing it in our flows.

@amd-sriram amd-sriram merged commit 44e6f25 into master Dec 1, 2025
4 checks passed
@amd-sriram amd-sriram deleted the Refactor_build branch December 1, 2025 11:24
@jithunnair-amd jithunnair-amd restored the Refactor_build branch December 4, 2025 23:10