fix(studio/worker): inject --gcc-install-dir for HIP source builds on Ubuntu 24.04#5517
Conversation
… Ubuntu 24.04
On Ubuntu 24.04 + ROCm clang-20, the HIP source-build fallback in
`_install_package_wheel_first` (causal-conv1d, mamba-ssm source fallback,
flash-attn source fallback) dies at:
/opt/rocm-X.Y/lib/llvm/lib/clang/20/include/__clang_hip_runtime_wrapper.h:112:10:
fatal error: 'cstdlib' file not found
Root cause: clang-20 picks the highest-numbered /usr/lib/gcc/x86_64-linux-gnu/<N>
runtime dir by default. On 24.04 that's gcc-14, whose runtime objects ship in
the gcc-14 package but whose C++ headers (/usr/include/c++/14) come from
libstdc++-14-dev — NOT in the default apt set. libstdc++-13-dev IS in the
default set, so /usr/include/c++/13 exists. clang has no way to discover
that asymmetry and the build fails.
Fix: new `_hipcc_gcc_install_dir()` helper iterates gcc 14 → 11 and returns
the first /usr/lib/gcc/x86_64-linux-gnu/<N> dir where BOTH the runtime AND
/usr/include/c++/<N> exist. The HIP branch of `_install_package_wheel_first`
appends `--gcc-install-dir=<that path>` to HIPCC_COMPILE_FLAGS_APPEND before
invoking pip. Respects an existing `--gcc-install-dir` in the env var
(user-set takes precedence); preserves any other flags the user has set
(appends to the end rather than overwriting). No-op on non-HIP, non-Linux,
non-x86_64.
Mirrors the same fix bbf004c added to studio/setup.sh for the llama.cpp HIP
build branch (unslothai#5301), but via env var since pip-driven source builds can't
take CMake flags directly.
Verified on Ryzen AI MAX+ 395 / Radeon 8060S (gfx1151) / Ubuntu 24.04 /
ROCm 7.13 nightly: `_hipcc_gcc_install_dir()` returns
`/usr/lib/gcc/x86_64-linux-gnu/13`, which matches the manual workaround
that already lets `pip install causal-conv1d` succeed on this hardware.
Tests added (8 new in test_training_worker_flash_attn.py):
- test_hipcc_gcc_install_dir_picks_highest_with_headers
- test_hipcc_gcc_install_dir_picks_14_when_headers_exist
- test_hipcc_gcc_install_dir_returns_none_when_no_match
- test_hipcc_gcc_install_dir_returns_none_on_non_linux
- test_hipcc_gcc_install_dir_returns_none_on_non_x86_64
- test_install_injects_gcc_install_dir_on_hip_source_build
- test_install_appends_to_existing_hipcc_compile_flags
- test_install_respects_user_gcc_install_dir
- test_install_does_not_inject_env_on_cuda
Per @danielhanchen's suggestion in
unslothai#5434 (comment)
for more information, see https://pre-commit.ci
There was a problem hiding this comment.
Code Review
This pull request addresses HIP source build failures on Ubuntu 24.04 by implementing a mechanism to detect and inject a compatible GCC installation directory into the build environment. It introduces the _hipcc_gcc_install_dir helper to locate GCC versions with both runtime and C++ headers and updates the installation logic to append the necessary compiler flags. The changes are accompanied by extensive unit tests. Reviewer feedback recommends dynamically detecting GCC versions to improve maintainability and suggests a safer pattern for updating environment variables to avoid accidentally overwriting existing settings.
|
|
||
| if _platform.machine().lower() != "x86_64": | ||
| return None | ||
| for _ver in (14, 13, 12, 11): |
There was a problem hiding this comment.
The list of GCC versions is hardcoded to (14, 13, 12, 11). This approach will require manual updates as new GCC versions are released (e.g., GCC 15) or if older versions are encountered on different distributions. A more robust approach would be to dynamically detect available versions by listing the directories in /usr/lib/gcc/x86_64-linux-gnu/ and sorting them in descending order to pick the highest compatible version.
There was a problem hiding this comment.
Intentionally kept hardcoded (14, 13, 12, 11) here. Two reasons:
-
Mirrors
bbf004cinstudio/setup.shwhich uses the same literal loopfor _gcc_ver in 14 13 12 11. Keeping the two policies in lockstep means when GCC 15 lands they both update together as one coordinated diff — rather than drifting if only one side gets the dynamic-enumeration treatment. -
Dynamic
listdiradds edge cases (non-numeric subdirs likecurrentsymlinks on some distros,x86-64-pc-linux-gnucross-arch entries, sorting"9"vs"10"lexicographically vs numerically). The hardcoded range covers every gcc shipped on Ubuntu/Debian/Fedora since 22.04, and the next bump is one trivial line edit.
Happy to revisit if setup.sh switches to dynamic enumeration first — they should stay aligned.
| "HIPCC_COMPILE_FLAGS_APPEND": _appended, | ||
| } | ||
| logger.info( | ||
| "HIP source build for %s: appended " |
There was a problem hiding this comment.
The current implementation overwrites the env key in _run_kwargs by creating a new dictionary from os.environ. While safe in the current state of the function, this pattern is fragile as it would silently discard any environment modifications made to _run_kwargs['env'] earlier in the function. It is safer to retrieve and update the existing environment if it has already been initialized.
_env = _run_kwargs.get("env", os.environ).copy()
_env["HIPCC_COMPILE_FLAGS_APPEND"] = _appended
_run_kwargs["env"] = _envThere was a problem hiding this comment.
Good catch — applied in aa30ae5d. Today both forms are equivalent (nothing earlier in the function sets _run_kwargs["env"]), but the get(...).copy() + key-mutation pattern survives any future env modification added upstream without silently dropping it. No behavioural change; the existing tests assert the final HIPCC_COMPILE_FLAGS_APPEND value, not the env-construction pattern.
Use _run_kwargs.get("env", os.environ).copy() + key-mutation instead of
rebuilding env from os.environ directly. Today both forms are equivalent
(no earlier code in _install_package_wheel_first sets _run_kwargs["env"]),
but the .get().copy() pattern survives any future env modification added
upstream of this block without silently throwing it away.
No behavioural change; tests already assert the final HIPCC_COMPILE_FLAGS_APPEND
value, not the env-construction pattern.
Per unslothai#5517 (comment)... (gemini-code-assist[bot])
|
Local pytest run on real This test file alone ( All 9 new tests from this PR + the 8 pre-existing tests in the file, no regressions. Full backend suite (mirrors Same 35 deselections CI uses (GPU-bound tests not relevant on a CPU runner). Zero failures. Env: Python 3.12 sacrificial venv, Full log + reproduction recipe: https://github.com/h34v3nzc0dex/strix-halo-llm-finetune-guide/tree/main/pr-5517-validation. |
|
Thanks for the PR again! |
|
Happy to keep contributing in any way I can. Been hammering this Strix box for 6 months — fun being an alpha tester for the hardware. |
Per @danielhanchen's invitation in #5434 comment — splitting the
causal-conv1d(and any other HIP source-build fallback) cstdlib issue out of the FLA/tilelang PR.Problem
On Ubuntu 24.04 + ROCm clang-20, the HIP source-build fallback inside
_install_package_wheel_first(causal-conv1d,mamba-ssmsource fallback,flash-attnsource fallback) dies at:Root cause: clang-20 picks the highest-numbered
/usr/lib/gcc/x86_64-linux-gnu/<N>runtime dir by default. On 24.04 that's gcc-14, whose runtime objects ship in thegcc-14apt package, but whose C++ headers (/usr/include/c++/14) come fromlibstdc++-14-dev— not in the default apt set.libstdc++-13-devis in the default set, so/usr/include/c++/13exists. clang has no way to discover that asymmetry, so the build fails.Same root cause as the llama.cpp HIP build trip that
bbf004cfixed forstudio/setup.shin #5301.Fix
New
_hipcc_gcc_install_dir()helper iterates gcc 14 → 11 and returns the first/usr/lib/gcc/x86_64-linux-gnu/<N>dir where both the runtime AND/usr/include/c++/<N>exist. The HIP branch of_install_package_wheel_firstappends--gcc-install-dir=<that path>toHIPCC_COMPILE_FLAGS_APPENDin the subprocess env before invoking pip.Properties of the env injection:
--gcc-install-dirinHIPCC_COMPILE_FLAGS_APPEND(skip injection so the user-set path always wins — avoids two competing flags on the clang command line)This mirrors the loop in
bbf004c'ssetup.shfix; the difference is the delivery mechanism (env var vsCMAKE_HIP_FLAGS), because the worker path is pip-driven and can't pass CMake flags directly.Tested
On Ryzen AI MAX+ 395 / Radeon 8060S (
gfx1151) / Ubuntu 24.04 / kernel 6.19.14 mainline / ROCm 7.13 nightly:_hipcc_gcc_install_dir()returns/usr/lib/gcc/x86_64-linux-gnu/13(gcc-14 runtime exists but no/usr/include/c++/14; gcc-13 has both — exact Ubuntu 24.04 default layout)HIPCC_COMPILE_FLAGS_APPEND='--gcc-install-dir=/usr/lib/gcc/x86_64-linux-gnu/13'beforepip install causal-conv1dis what already lets the build succeed on this hardware (per the validation thread for studio: install flash-linear-attention and tilelang for Qwen3.5 family #5434 — repro at https://github.com/h34v3nzc0dex/strix-halo-llm-finetune-guide/tree/main/pr-5434-validation). This PR just wires that workaround into the worker path automatically.causal_conv1d_fn+causal_conv1d_updatekernels run cleanly on gfx1151 (B=1 D=512 T=8192 K=4 bf16forward;D=512 K=4 bf16step-mode update).Test plan
studio/backend/tests/test_training_worker_flash_attn.py— 8 new tests covering the helper logic + env injection paths. CI Backend job will exercise them. (Local sanity: standalone replica of the helper returns the expected path; both modified filespy_compileclean.)monkeypatch.setattr(worker.os.path, "isdir", ...)).test_install_does_not_inject_env_on_cudaraises if the helper is consulted).Scope notes
causal-conv1dauto-install path (predates this PR; it's been in Studio since before studio: install flash-linear-attention and tilelang for Qwen3.5 family #5434). This PR only adds the env knob so the pre-existing path doesn't fail at compile time on 24.04+ROCm.flash-linear-attention+tilelangfor the Qwen3.5 family; this PR is a no-op on either FLA or tilelang.Happy to iterate on the heuristic (e.g. also probe
/usr/lib/gcc/aarch64-linux-gnu/<N>if anyone is doing HIP on aarch64) — currently aarch64 returns None and the env is left alone, matching today's behaviour.