
Refactor Apex build process to use the PyTorch JIT extension flow#247

Merged
amd-sriram merged 80 commits into master from Refactor_build on Dec 1, 2025

Conversation

@amd-sriram amd-sriram commented Jul 10, 2025

Motivation

Currently, building Apex takes around 30 minutes. The motivation behind JIT (just-in-time) load is to reduce installation time to under 1 minute and to build the modules on demand, when they are first used or when their tests are run.

In addition, this PR adds the flexibility to precompile specific modules (via environment variables selecting the C++ extensions, the CUDA extensions, or individual ops).

Technical Details

  • By default all modules are JIT compiled
  • APEX_BUILD_<OP_NAME>=1 - precompile specific modules e.g. APEX_BUILD_FUSED_DENSE=1
  • APEX_BUILD_CPP_OPS=1 and APEX_BUILD_CUDA_OPS=1 - precompile all C++ or all CUDA extensions, respectively (replacing the old --cpp_ext and --cuda_ext flags)
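As a sketch of how these switches could be interpreted (the variable names come from this PR, but the parsing logic below is an illustrative assumption, not the actual setup.py implementation):

```python
import os

def requested_prebuilds(environ=None):
    """Illustrative parsing of the APEX_BUILD_* environment switches.

    Returns (build_all_cpp, build_all_cuda, individually_selected_ops).
    The exact behavior in setup.py may differ; this only mirrors the
    semantics described in the PR text.
    """
    environ = os.environ if environ is None else environ
    build_all_cpp = environ.get("APEX_BUILD_CPP_OPS") == "1"
    build_all_cuda = environ.get("APEX_BUILD_CUDA_OPS") == "1"
    # APEX_BUILD_<OP_NAME>=1 selects a single op, e.g. APEX_BUILD_FUSED_DENSE=1
    ops = {
        key[len("APEX_BUILD_"):].lower()
        for key, val in environ.items()
        if key.startswith("APEX_BUILD_")
        and key not in ("APEX_BUILD_CPP_OPS", "APEX_BUILD_CUDA_OPS")
        and val == "1"
    }
    return build_all_cpp, build_all_cuda, ops
```

With no `APEX_BUILD_*` variables set, nothing is precompiled and everything falls through to JIT load.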

To install from source:

pip install . --no-build-isolation

To build a wheel and install from it:

python -m build --wheel --no-isolation
pip install dist/apex-*.whl

Extensions converted in this PR (all of which work on ROCm):

  • apex_C
  • distributed_adam_cuda
  • distributed_lamb_cuda
  • amp_C
  • syncbn
  • fused_layer_norm_cuda
  • fused_dense_cuda
  • fused_weight_gradient_mlp_cuda
  • mlp_cuda
  • scaled_upper_triang_masked_softmax_cuda
  • generic_scaled_masked_softmax_cuda
  • scaled_masked_softmax_cuda
  • scaled_softmax_cuda
  • fused_rotary_positional_embedding
  • fused_bias_swiglu
  • bnp
  • xentropy_cuda
  • focal_loss_cuda
  • fused_index_mul_2d
  • deprecated_fused_adam
  • fused_lamb_cuda
  • fast_multihead_attn
  • transducer_joint_cuda
  • transducer_loss_cuda
  • peer_memory_cuda
  • nccl_p2p_cuda
  • _apex_nccl_allocator

Total - 27 extensions
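The on-demand behavior these extensions rely on can be illustrated with a small lazy-loading sketch. This is a hypothetical stand-in, not Apex's actual loader (which wraps torch.utils.cpp_extension.load); the names `_LazyExtension`, `register_lazy_extension`, and `builder` are invented here for illustration:

```python
import sys
from types import ModuleType

class _LazyExtension(ModuleType):
    """Defer an expensive build until the extension is first used.

    `builder` is any zero-argument callable that compiles and returns the
    real extension module (in Apex's case, a torch.utils.cpp_extension.load
    call); nothing runs until the first attribute access.
    """

    def __init__(self, name, builder):
        super().__init__(name)
        self._builder = builder
        self._real = None

    def __getattr__(self, attr):
        # First attribute access triggers the (slow) compile-and-import step.
        if self._real is None:
            self._real = self._builder()
        return getattr(self._real, attr)

def register_lazy_extension(name, builder):
    # After this, `import <name>` succeeds instantly; the build runs on first use.
    sys.modules[name] = _LazyExtension(name, builder)
```

This is why `pip install` can finish in under a minute: importing an extension is cheap, and the compile cost is paid only by the ops actually exercised.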

The following extensions have not been included in the JIT load flow in this PR, as they are NVIDIA-GPU-only and have not been used before:

  • fused_conv_bias_relu
  • fast_layer_norm
  • fmhalib
  • fast_bottleneck

Custom code added to support building Apex modules:

  • get backward pass guards
  • aten_atomic_args
  • generator_args
  • nvcc_threads_args
  • nccl_args
  • nccl_version
  • is_supported

Other changes

  • aiter is no longer built as part of setup.py. The user can install aiter with the command make aiter, similar to PyTorch.
  • make clean removes the torch extensions created with JIT load.
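For reference, a sketch of where the JIT-built artifacts live, which is what a clean target needs to remove. PyTorch honors the TORCH_EXTENSIONS_DIR environment variable and defaults to ~/.cache/torch_extensions; the exact subdirectory layout varies by torch version, and this helper is illustrative rather than taken from the PR:

```python
import os
from pathlib import Path

def torch_extensions_dir(environ=None) -> Path:
    """Resolve the base directory torch.utils.cpp_extension builds into.

    Mirrors PyTorch's documented behavior: $TORCH_EXTENSIONS_DIR if set,
    otherwise ~/.cache/torch_extensions. (Per-version subfolders below this
    path are an implementation detail and not modeled here.)
    """
    environ = os.environ if environ is None else environ
    override = environ.get("TORCH_EXTENSIONS_DIR")
    return Path(override) if override else Path.home() / ".cache" / "torch_extensions"
```

A `make clean` implementation would essentially remove this directory tree (e.g. with shutil.rmtree).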

Unit tests run

cd tests/L0
PYTHONUNBUFFERED=1 sh run_rocm.sh 2>&1 | tee log_results.txt

cd apex/contrib/test/
PYTHONUNBUFFERED=1 python3 run_rocm_extensions.py 2>&1 | tee log_results_contrib.txt 

torchrun --nproc_per_node 8 apex/contrib/peer_memory/peer_halo_exchange_module_tests.py

cd tests/distributed/synced_batchnorm
sh unit_test.sh

Docker image used for the testing (tested both CPU-only and with GPU):
registry-sc-harbor.amd.com/framework/compute-rocm-rel-7.0:32_ubuntu22.04_py3.10_pytorch_release-2.8_d2d97084

Tested the following commands

Running the different build instructions.

  • Different GPUs: MI200, MI300
  • Counted the number of failed unit tests (L0, contrib) and the number of .so files and torch extensions built
  • Created MAD-engine-based scripts for running the different test conditions: https://github.com/amd-sriram/Tools/blob/main/AutomatedBuildTest/models.json
  • DLM workload
    madengine run --tags pyt_deepspeed_megatron_llama2_7b --live-output --additional-context "{'guest_os': 'UBUNTU', 'docker_build_arg':{'BASE_DOCKER':'registry-sc-harbor.amd.com/framework/compute-rocm-dkms-no-npi-hipclang:16771_ubuntu24.04_py3.12_pytorch_rocm7.1_internal_testing_d1fb13a8'}}"

Running extensive tests on MI300

| Command | Failed L0 tests | Failed contrib tests | Built .so count | Torch extensions count |
| --- | --- | --- | --- | --- |
| `python setup.py install --cpp_ext --cuda_ext` (old command) | 2 | 0 | 0 | 27 |
| `pip install . --no-build-isolation` | 2 | 0 | 0 | 27 |
| `APEX_BUILD_CPP_OPS=1 pip install . --no-build-isolation` | 2 | 0 | 1 | 26 |
| `APEX_BUILD_CUDA_OPS=1 pip install . --no-build-isolation` | 3 | 0 | 26 | 1 |
| `APEX_BUILD_CPP_OPS=1 APEX_BUILD_CUDA_OPS=1 pip install . --no-build-isolation` | 2 | 0 | 27 | 0 |
| `APEX_BUILD_FUSED_DENSE=1 pip install . --no-build-isolation` | 2 | 0 | 1 | 26 |
| `APEX_BUILD_FUSED_DENSE=1 ... pip install . --no-build-isolation` | 2 | 0 | 27 | 0 |
| `python -m build --wheel --no-isolation .` | 2 | 0 | 0 | 27 |
| `APEX_BUILD_CPP_OPS=1 python -m build --wheel --no-isolation` | 2 | 0 | 1 | 26 |
| `APEX_BUILD_CUDA_OPS=1 python -m build --wheel --no-isolation` | 2 | 0 | 26 | 1 |
| `APEX_BUILD_CPP_OPS=1 APEX_BUILD_CUDA_OPS=1 python -m build --wheel --no-isolation` | 2 | 0 | 27 | 0 |
| `APEX_BUILD_FUSED_DENSE=1 python -m build --wheel --no-isolation` | 2 | 0 | 1 | 26 |
| `APEX_BUILD_FUSED_DENSE=1 ... pip install . --no-build-isolation` | 2 | 0 | 27 | 0 |

Created an issue for the two errors: https://github.com/ROCm/frameworks-internal/issues/14438

Running a few commands on MI200

  • pip install . --no-build-isolation
  • APEX_BUILD_CPP_OPS=1 APEX_BUILD_CUDA_OPS=1 pip install . --no-build-isolation
  • python -m build --wheel --no-isolation
  • APEX_BUILD_CPP_OPS=1 APEX_BUILD_CUDA_OPS=1 python -m build --wheel --no-isolation

Built the wheels in a CPU-only Docker container and ran the tests in a GPU Docker container (MI300)

| Command | Failed L0 tests | Failed contrib tests | Built .so count | Torch extensions count |
| --- | --- | --- | --- | --- |
| `python -m build --wheel --no-isolation .` | 2 | 0 | 0 | 27 |
| `APEX_BUILD_CPP_OPS=1 python -m build --wheel --no-isolation` | 2 | 0 | 1 | 26 |
| `APEX_BUILD_CUDA_OPS=1 python -m build --wheel --no-isolation` | 2 | 0 | 26 | 1 |
| `APEX_BUILD_CPP_OPS=1 APEX_BUILD_CUDA_OPS=1 python -m build --wheel --no-isolation` | 2 | 0 | 27 | 0 |
| `APEX_BUILD_FUSED_DENSE=1 ... pip install . --no-build-isolation` | 2 | 0 | 27 | 0 |

Created scripts for testing the JIT build and documented the instructions at https://amd.atlassian.net/wiki/spaces/MLSE/pages/1255652200/Testing

@amd-sriram amd-sriram self-assigned this Jul 10, 2025
@amd-sriram amd-sriram marked this pull request as ready for review July 28, 2025 09:30
@amd-sriram amd-sriram marked this pull request as draft July 28, 2025 11:04
@amd-sriram amd-sriram marked this pull request as ready for review July 30, 2025 14:57
@amd-sriram amd-sriram marked this pull request as draft August 6, 2025 16:29
@amd-sriram amd-sriram marked this pull request as ready for review August 13, 2025 20:26
Comment thread apex/git_version_info.py
# Copyright (c) Microsoft Corporation.
# SPDX-License-Identifier: Apache-2.0

# DeepSpeed Team
Collaborator


@amd-sriram Do we keep this text snippet?

Collaborator Author


This header is legally required. The code was adapted from DeepSpeed, which is Apache-2.0 licensed, and Section 4(c) of the Apache License mandates retaining all copyright and attribution notices in derivative works.
Reference: https://www.apache.org/licenses/LICENSE-2.0.txt (Section 4)

Excluding this snippet would violate the license.

The Apache 2.0 license is compatible with Apex's BSD-3 license, so there is no legal conflict in including this snippet.

jithunnair-amd
jithunnair-amd previously approved these changes Nov 29, 2025
Collaborator

@jithunnair-amd jithunnair-amd left a comment


This is a significant overhaul of the Apex build process. Thank you @amd-sriram for the extensive testing and multiple rounds of refactoring! Both JIT and non-JIT builds seem to build correctly in my testing as well. I was also able to run unit tests on a gfx90a GPU with the JIT build and got expected results.

The only packaging-related issue I noticed so far is that the JIT build installs the extension .py files in the site-packages installation directory, outside the apex subdirectory. This appears to violate the usual packaging conventions. However, there are valid use cases that import Apex extensions without the apex module being specified, e.g. import fused_weight_gradient_mlp_cuda (more in https://github.com/ROCm/frameworks-internal/issues/12681#issuecomment-3591503775). We need to explore in a follow-up PR whether one can import these extensions without having their .py files present in site-packages (e.g. a separate directory for each extension in site-packages, containing an __init__.py and the extension's Python file). But that can be addressed in a follow-up issue, so we can merge this and proceed with testing it in our flows.

@amd-sriram amd-sriram merged commit 44e6f25 into master Dec 1, 2025
4 checks passed
@amd-sriram amd-sriram deleted the Refactor_build branch December 1, 2025 11:24
@jithunnair-amd jithunnair-amd restored the Refactor_build branch December 4, 2025 23:10