Refactor Apex build process to use the PyTorch JIT extension flow#247
amd-sriram merged 80 commits into master
Conversation
Force-pushed from 562ce49 to a31a714
Force-pushed from 6ed31cd to bcd53ae
…ad of building it. Code uses accelerator and op_builder modules from deepspeed code.
…lly created by setup.py for the build process
… jit mode, add csrc back to setup.py since it is not copied to apex wheel
… building the wheel
… imports in python module
… make MLP JIT compile
…piled during apex installation
…thod to CUDAOpBuilder to support its jit compile
…daOpBuilder support jit of this module
… tests in the contrib test runner
…o run jit build and tests in readme, add other tests in readme
…files in apex folder
Force-pushed from 11518d2 to 5df477c
…on or CUDAExtension is built with JIT load approach
# Copyright (c) Microsoft Corporation.
# SPDX-License-Identifier: Apache-2.0

# DeepSpeed Team
This header is legally required. The code was adapted from DeepSpeed, which is Apache 2 licensed, and Section 4(c) of the Apache License mandates retaining all copyright and attribution notices in derivative works.
Reference: https://www.apache.org/licenses/LICENSE-2.0.txt (Section 4)
If we exclude this snippet, it will violate the license.
The Apache 2 license is compatible with Apex's BSD 3 license, so there are no legal conflicts in including this snippet.
This is a significant rehaul of the Apex build process. Thank you @amd-sriram for the extensive testing and multiple rounds of refactoring! Both JIT and non-JIT builds seem to build correctly in my testing as well. I was also able to run unit tests on a gfx90a GPU with the JIT build and got expected results.
The only packaging-related issue I noticed so far is that the JIT build installs the extension .py files in the site-packages installation directory, outside the apex subdirectory. This seems to violate the usual packaging conventions. However, it also appears that there are valid use cases that import apex extensions without the apex module being specified, e.g. import fused_weight_gradient_mlp_cuda (more in https://github.com/ROCm/frameworks-internal/issues/12681#issuecomment-3591503775). We need to explore in a follow-up PR whether one can import these extensions without having their .py files present directly in site-packages (e.g. a separate directory per extension in site-packages, containing an __init__.py and the extension's Python file). But that should be addressed in a follow-up issue, so we can merge this and proceed with testing it in our flows.
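The flat layout can be checked with a small probe; the module name below is the one from this PR, and everything else is standard library. This is only an illustration of how to see where such an extension would resolve from:

```python
import importlib.util
import sysconfig

# Where pip places pure-Python files; a flat extension .py would sit
# directly in this directory rather than under apex/.
site_packages = sysconfig.get_paths()["purelib"]

def locate(mod_name):
    """Return the file backing mod_name, or None if it is not importable."""
    spec = importlib.util.find_spec(mod_name)
    return spec.origin if spec is not None else None

# In an environment without the JIT-built wheel this prints None; after
# installation it would point at a file directly under site-packages.
print(locate("fused_weight_gradient_mlp_cuda"))
```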
Motivation
Currently, building Apex takes around 30 minutes. The motivation behind JIT (just-in-time) load is to reduce the installation time to under 1 minute and then build the modules on demand when they are used or their tests are run.
In addition, this PR provides the flexibility to build specific modules, based on an argument indicating a CPP or CUDA extension or naming specific modules.
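The on-demand flow can be sketched as a cached loader. The `load_op` helper and its wiring below are illustrative, not this PR's actual op_builder API; in the real code the build step delegates to torch.utils.cpp_extension.load:

```python
# Illustrative sketch of the JIT-load pattern (names assumed). The first
# request for an op pays the compile cost; later requests reuse the
# in-process cache (PyTorch additionally caches the built shared object
# on disk under TORCH_EXTENSIONS_DIR).

_loaded = {}

def load_op(name, build_fn):
    """Return the module for `name`, building it on first use via build_fn."""
    if name not in _loaded:
        _loaded[name] = build_fn()  # e.g. a torch.utils.cpp_extension.load call
    return _loaded[name]

calls = []

def fake_build():
    calls.append(1)  # stand-in for the expensive hipcc/nvcc compile
    return object()

first = load_op("mlp_cuda", fake_build)
second = load_op("mlp_cuda", fake_build)
print(first is second, len(calls))  # → True 1 (the build ran exactly once)
```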
Technical Details
To install from source:
pip install . --no-build-isolation
To build the wheel and install from the wheel:
Currently converted extensions (which work on ROCm) include:
Total - 27 extensions
The following extensions have not been included in JIT load in this PR, as they were not used before (NVIDIA GPU only):
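The selective building mentioned under Motivation could be driven by a simple name filter. The APEX_BUILD_EXTENSIONS variable and helper here are hypothetical, not this PR's actual flag:

```python
import os

def selected(name, env=None):
    """True if `name` should be built; an empty/unset filter means build all."""
    env = os.environ if env is None else env
    requested = env.get("APEX_BUILD_EXTENSIONS", "")  # hypothetical variable
    if not requested:
        return True
    return name in {e.strip() for e in requested.split(",")}

print(selected("mlp_cuda", {"APEX_BUILD_EXTENSIONS": "mlp_cuda,fused_adam"}))  # → True
print(selected("peer_memory", {"APEX_BUILD_EXTENSIONS": "mlp_cuda"}))          # → False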
Added custom code to support building apex modules
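The builder layer adapted from DeepSpeed's op_builder can be sketched as follows. Class names, methods, and source paths here are assumptions for illustration, not this PR's exact code; jit_load() stands in for the real torch.utils.cpp_extension.load call so the sketch runs without a compiler:

```python
# Hedged sketch of the op_builder pattern (names assumed): each extension
# declares its sources and a compatibility probe, and load() builds it on
# demand only when the current device supports it.

class OpBuilder:
    NAME = "base"

    def sources(self):
        raise NotImplementedError

    def is_compatible(self):
        # Real builders query the accelerator module (e.g. ROCm arch checks).
        return True

    def jit_load(self):
        return f"compiled:{self.NAME}"  # placeholder for the real JIT build

    def load(self):
        if not self.is_compatible():
            raise RuntimeError(f"{self.NAME} is not supported on this device")
        return self.jit_load()

class MLPBuilder(OpBuilder):
    NAME = "mlp_cuda"

    def sources(self):
        # Paths are illustrative; the real list comes from csrc/.
        return ["csrc/mlp.cpp", "csrc/mlp_cuda_kernel.cu"]

print(MLPBuilder().load())  # → compiled:mlp_cuda
```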
Other changes
make aiter, similar to PyTorch.
make clean to remove torch extensions created with JIT load.
Tested Unit tests
cd tests/L0
PYTHONUNBUFFERED=1 sh run_rocm.sh 2>&1 | tee log_results.txt
cd apex/contrib/test/
PYTHONUNBUFFERED=1 python3 run_rocm_extensions.py 2>&1 | tee log_results_contrib.txt
torchrun --nproc_per_node 8 apex/contrib/peer_memory/peer_halo_exchange_module_tests.py
cd tests/distributed/synced_batchnorm
sh unit_test.sh
Docker used for the testing (tested with CPU only and with GPU)
registry-sc-harbor.amd.com/framework/compute-rocm-rel-7.0:32_ubuntu22.04_py3.10_pytorch_release-2.8_d2d97084
Tested the following commands
Running the different build instructions.
madengine run --tags pyt_deepspeed_megatron_llama2_7b --live-output --additional-context "{'guest_os': 'UBUNTU', 'docker_build_arg':{'BASE_DOCKER':'registry-sc-harbor.amd.com/framework/compute-rocm-dkms-no-npi-hipclang:16771_ubuntu24.04_py3.12_pytorch_rocm7.1_internal_testing_d1fb13a8'}}"
Running extensive tests on MI300
Created an issue for the two errors: https://github.com/ROCm/frameworks-internal/issues/14438
Running a few commands on MI200
Creating the wheels in a docker with CPU and running the tests with GPU docker (MI300)
Created scripts for testing JIT build and documented the instructions at https://amd.atlassian.net/wiki/spaces/MLSE/pages/1255652200/Testing