Skip to content

fix(studio/worker): inject --gcc-install-dir for HIP source builds on Ubuntu 24.04#5517

Merged
danielhanchen merged 4 commits into
unslothai:mainfrom
h34v3nzc0dex:fix/hipcc-gcc-install-dir-ubuntu-24-04
May 18, 2026
Merged

fix(studio/worker): inject --gcc-install-dir for HIP source builds on Ubuntu 24.04#5517
danielhanchen merged 4 commits into
unslothai:mainfrom
h34v3nzc0dex:fix/hipcc-gcc-install-dir-ubuntu-24-04

Conversation

@h34v3nzc0dex

Copy link
Copy Markdown
Contributor

Per @danielhanchen's invitation in #5434 comment — splitting the causal-conv1d (and any other HIP source-build fallback) cstdlib issue out of the FLA/tilelang PR.

Problem

On Ubuntu 24.04 + ROCm clang-20, the HIP source-build fallback inside _install_package_wheel_first (causal-conv1d, mamba-ssm source fallback, flash-attn source fallback) dies at:

/opt/rocm-X.Y/lib/llvm/lib/clang/20/include/__clang_hip_runtime_wrapper.h:112:10:
  fatal error: 'cstdlib' file not found

Root cause: clang-20 picks the highest-numbered /usr/lib/gcc/x86_64-linux-gnu/<N> runtime dir by default. On 24.04 that's gcc-14, whose runtime objects ship in the gcc-14 apt package, but whose C++ headers (/usr/include/c++/14) come from libstdc++-14-devnot in the default apt set. libstdc++-13-dev is in the default set, so /usr/include/c++/13 exists. clang has no way to discover that asymmetry, so the build fails.

Same root cause as the llama.cpp HIP build trip that bbf004c fixed for studio/setup.sh in #5301.

Fix

New _hipcc_gcc_install_dir() helper iterates gcc 14 → 11 and returns the first /usr/lib/gcc/x86_64-linux-gnu/<N> dir where both the runtime AND /usr/include/c++/<N> exist. The HIP branch of _install_package_wheel_first appends --gcc-install-dir=<that path> to HIPCC_COMPILE_FLAGS_APPEND in the subprocess env before invoking pip.

Properties of the env injection:

  • Respects user --gcc-install-dir in HIPCC_COMPILE_FLAGS_APPEND (skip injection so the user-set path always wins — avoids two competing flags on the clang command line)
  • Preserves any other flags the user has set (appends to the end, doesn't overwrite)
  • No-op on non-HIP (CUDA path is unchanged)
  • No-op on non-Linux / non-x86_64 (different libstdc++ layouts; safer to leave alone than guess)
  • Returns None when no gcc dir has both halves → skip injection, surface the original error rather than guess wrong

This mirrors the loop in bbf004c's setup.sh fix; the difference is the delivery mechanism (env var vs CMAKE_HIP_FLAGS), because the worker path is pip-driven and can't pass CMake flags directly.

Tested

On Ryzen AI MAX+ 395 / Radeon 8060S (gfx1151) / Ubuntu 24.04 / kernel 6.19.14 mainline / ROCm 7.13 nightly:

Test plan

  • studio/backend/tests/test_training_worker_flash_attn.py — 8 new tests covering the helper logic + env injection paths. CI Backend job will exercise them. (Local sanity: standalone replica of the helper returns the expected path; both modified files py_compile clean.)
  • Helper is fully mockable — no filesystem dependence at test time (monkeypatch.setattr(worker.os.path, "isdir", ...)).
  • CUDA path explicitly asserted no-op (test_install_does_not_inject_env_on_cuda raises if the helper is consulted).

Scope notes

Happy to iterate on the heuristic (e.g. also probe /usr/lib/gcc/aarch64-linux-gnu/<N> if anyone is doing HIP on aarch64) — currently aarch64 returns None and the env is left alone, matching today's behaviour.

… Ubuntu 24.04

On Ubuntu 24.04 + ROCm clang-20, the HIP source-build fallback in
`_install_package_wheel_first` (causal-conv1d, mamba-ssm source fallback,
flash-attn source fallback) dies at:

  /opt/rocm-X.Y/lib/llvm/lib/clang/20/include/__clang_hip_runtime_wrapper.h:112:10:
    fatal error: 'cstdlib' file not found

Root cause: clang-20 picks the highest-numbered /usr/lib/gcc/x86_64-linux-gnu/<N>
runtime dir by default. On 24.04 that's gcc-14, whose runtime objects ship in
the gcc-14 package but whose C++ headers (/usr/include/c++/14) come from
libstdc++-14-dev — NOT in the default apt set. libstdc++-13-dev IS in the
default set, so /usr/include/c++/13 exists. clang has no way to discover
that asymmetry and the build fails.

Fix: new `_hipcc_gcc_install_dir()` helper iterates gcc 14 → 11 and returns
the first /usr/lib/gcc/x86_64-linux-gnu/<N> dir where BOTH the runtime AND
/usr/include/c++/<N> exist. The HIP branch of `_install_package_wheel_first`
appends `--gcc-install-dir=<that path>` to HIPCC_COMPILE_FLAGS_APPEND before
invoking pip. Respects an existing `--gcc-install-dir` in the env var
(user-set takes precedence); preserves any other flags the user has set
(appends to the end rather than overwriting). No-op on non-HIP, non-Linux,
non-x86_64.

Mirrors the same fix bbf004c added to studio/setup.sh for the llama.cpp HIP
build branch (unslothai#5301), but via env var since pip-driven source builds can't
take CMake flags directly.

Verified on Ryzen AI MAX+ 395 / Radeon 8060S (gfx1151) / Ubuntu 24.04 /
ROCm 7.13 nightly: `_hipcc_gcc_install_dir()` returns
`/usr/lib/gcc/x86_64-linux-gnu/13`, which matches the manual workaround
that already lets `pip install causal-conv1d` succeed on this hardware.

Tests added (8 new in test_training_worker_flash_attn.py):
- test_hipcc_gcc_install_dir_picks_highest_with_headers
- test_hipcc_gcc_install_dir_picks_14_when_headers_exist
- test_hipcc_gcc_install_dir_returns_none_when_no_match
- test_hipcc_gcc_install_dir_returns_none_on_non_linux
- test_hipcc_gcc_install_dir_returns_none_on_non_x86_64
- test_install_injects_gcc_install_dir_on_hip_source_build
- test_install_appends_to_existing_hipcc_compile_flags
- test_install_respects_user_gcc_install_dir
- test_install_does_not_inject_env_on_cuda

Per @danielhanchen's suggestion in
unslothai#5434 (comment)

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request addresses HIP source build failures on Ubuntu 24.04 by implementing a mechanism to detect and inject a compatible GCC installation directory into the build environment. It introduces the _hipcc_gcc_install_dir helper to locate GCC versions with both runtime and C++ headers and updates the installation logic to append the necessary compiler flags. The changes are accompanied by extensive unit tests. Reviewer feedback recommends dynamically detecting GCC versions to improve maintainability and suggests a safer pattern for updating environment variables to avoid accidentally overwriting existing settings.


if _platform.machine().lower() != "x86_64":
return None
for _ver in (14, 13, 12, 11):

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The list of GCC versions is hardcoded to (14, 13, 12, 11). This approach will require manual updates as new GCC versions are released (e.g., GCC 15) or if older versions are encountered on different distributions. A more robust approach would be to dynamically detect available versions by listing the directories in /usr/lib/gcc/x86_64-linux-gnu/ and sorting them in descending order to pick the highest compatible version.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Intentionally kept hardcoded (14, 13, 12, 11) here. Two reasons:

  1. Mirrors bbf004c in studio/setup.sh which uses the same literal loop for _gcc_ver in 14 13 12 11. Keeping the two policies in lockstep means when GCC 15 lands they both update together as one coordinated diff — rather than drifting if only one side gets the dynamic-enumeration treatment.

  2. Dynamic listdir adds edge cases (non-numeric subdirs like current symlinks on some distros, x86-64-pc-linux-gnu cross-arch entries, sorting "9" vs "10" lexicographically vs numerically). The hardcoded range covers every gcc shipped on Ubuntu/Debian/Fedora since 22.04, and the next bump is one trivial line edit.

Happy to revisit if setup.sh switches to dynamic enumeration first — they should stay aligned.

Comment thread studio/backend/core/training/worker.py Outdated
Comment on lines +264 to +267
"HIPCC_COMPILE_FLAGS_APPEND": _appended,
}
logger.info(
"HIP source build for %s: appended "

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The current implementation overwrites the env key in _run_kwargs by creating a new dictionary from os.environ. While safe in the current state of the function, this pattern is fragile as it would silently discard any environment modifications made to _run_kwargs['env'] earlier in the function. It is safer to retrieve and update the existing environment if it has already been initialized.

                _env = _run_kwargs.get("env", os.environ).copy()
                _env["HIPCC_COMPILE_FLAGS_APPEND"] = _appended
                _run_kwargs["env"] = _env

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch — applied in aa30ae5d. Today both forms are equivalent (nothing earlier in the function sets _run_kwargs["env"]), but the get(...).copy() + key-mutation pattern survives any future env modification added upstream without silently dropping it. No behavioural change; the existing tests assert the final HIPCC_COMPILE_FLAGS_APPEND value, not the env-construction pattern.

Use _run_kwargs.get("env", os.environ).copy() + key-mutation instead of
rebuilding env from os.environ directly. Today both forms are equivalent
(no earlier code in _install_package_wheel_first sets _run_kwargs["env"]),
but the .get().copy() pattern survives any future env modification added
upstream of this block without silently throwing it away.

No behavioural change; tests already assert the final HIPCC_COMPILE_FLAGS_APPEND
value, not the env-construction pattern.

Per unslothai#5517 (comment)... (gemini-code-assist[bot])
@h34v3nzc0dex

Copy link
Copy Markdown
Contributor Author

Local pytest run on real gfx1151 against the Backend CI dep set — clean across the board.

This test file alone (test_training_worker_flash_attn.py):

17 passed in 1.16s

All 9 new tests from this PR + the 8 pre-existing tests in the file, no regressions.

Full backend suite (mirrors studio-backend-ci.yml's pytest tests/ invocation with the same -k deselect filter):

1073 passed, 47 skipped, 35 deselected, 7 subtests passed in 12.51s

Same 35 deselections CI uses (GPU-bound tests not relevant on a CPU runner). Zero failures.

Env: Python 3.12 sacrificial venv, studio/backend/requirements/studio.txt + the import-chain extras CI installs (python-multipart aiofiles sqlalchemy cryptography pyyaml jinja2 mammoth unpdf requests numpy<3 pytest pytest-asyncio httpx) + CPU torch>=2.4,<2.11 + transformers>=4.51,<5.5. Production venv untouched.

Full log + reproduction recipe: https://github.com/h34v3nzc0dex/strix-halo-llm-finetune-guide/tree/main/pr-5517-validation.

@rolandtannous

Copy link
Copy Markdown
Contributor

@LeoBorcherding

@LeoBorcherding

Copy link
Copy Markdown
Collaborator

looks good, is complimentary to #5301 and will not run into merge conflicts with #5301

@danielhanchen

Copy link
Copy Markdown
Member

Thanks for the PR again!

@danielhanchen danielhanchen merged commit 388ade4 into unslothai:main May 18, 2026
27 of 55 checks passed
@h34v3nzc0dex

h34v3nzc0dex commented May 18, 2026

Copy link
Copy Markdown
Contributor Author

Happy to keep contributing in any way I can. Been hammering this Strix box for 6 months — fun being an alpha tester for the hardware.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants