
Documentation for FLASH_ATTENTION_SKIP_CUDA_BUILD is misleading and causes silent installation of broken packages. #17794

@lingkerio

Description


Summary

The documentation for installing flash-attn (specifically the section on extra-build-variables) is misleading and can lead to a "silent failure" state where uv reports a successful installation, but the installed package is a "hollow" shell containing no compiled CUDA extensions.

The documentation states:

"The FLASH_ATTENTION_SKIP_CUDA_BUILD environment variable ensures that flash-attn is installed from a compatible, pre-built wheel..."

However, this variable only disables local compilation. If uv resolves to a version combination (e.g., latest Torch + Flash Attn) for which no official pre-built wheel exists, passing this variable causes setup.py to simply skip compilation and install a pure-Python package without errors. This results in a broken runtime environment.
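The gate in flash-attn's setup.py boils down to a plain environment-variable check along these lines (a simplified sketch, approximated from the real setup.py rather than copied from it):

```python
import os

def cuda_build_enabled() -> bool:
    # Approximation of the check in flash-attn's setup.py: when the
    # variable is "TRUE", the CUDA extension modules are simply omitted
    # from the build -- no error, no warning.
    return os.getenv("FLASH_ATTENTION_SKIP_CUDA_BUILD", "FALSE") != "TRUE"
```

With the variable set, the build "succeeds" and produces a package that lacks the compiled extension entirely.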

Reproduction

I have created a minimal reproduction case that demonstrates how following the documentation can lead to a broken installation when version pinning is not strict.

1. pyproject.toml

Note: I am intentionally NOT pinning versions to simulate a scenario where uv picks a newer Torch version that flash-attn does not yet have a wheel for.

[project]
name = "flash-attn-repro"
version = "0.1.0"
requires-python = ">=3.10,<3.13"
dependencies = [
    "torch>=2.4.0",
    "flash-attn",
]

[tool.uv.extra-build-dependencies]
flash-attn = [{ requirement = "torch", match-runtime = true }]

[tool.uv.extra-build-variables]
# The docs suggest this ensures a wheel install, but it actually forces a hollow install if no wheel is found.
flash-attn = { FLASH_ATTENTION_SKIP_CUDA_BUILD = "TRUE" }

2. Commands run

uv sync -v
uv run python -c "import flash_attn"

Output

uv sync completes successfully, giving a false sense of security:

DEBUG Guessing wheel URL:  https://github.com/Dao-AILab/flash-attention/releases/download/v2.8.3/flash_attn-2.8.3+cu12torch2.10cxx11abiTRUE-cp312-cp312-linux_x86_64.whl
DEBUG Precompiled wheel not found. Building from source...
...
Installed 2 packages in 10ms
 + flash-attn==2.8.3
 + torch==2.10.0

However, running the verification script reveals the package is broken:

ModuleNotFoundError: No module named 'flash_attn_2_cuda'
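A quick way to detect this hollow state early, rather than waiting for a runtime failure, is to probe for the compiled extension module directly (a minimal check script; `flash_attn_2_cuda` is the module name from the traceback above):

```python
import importlib.util

def has_cuda_ext(name: str = "flash_attn_2_cuda") -> bool:
    # A hollow install may still appear valid to the resolver; only the
    # compiled extension module is missing, so check for it explicitly.
    return importlib.util.find_spec(name) is not None

print("CUDA extension present:", has_cuda_ext())
```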

Analysis

This issue seems related to the edge cases of the build dependency functionality introduced in #13959 and #6437.

Upon analyzing flash-attn's setup.py, it appears that the current documentation advises an anti-pattern that makes the installation process less robust.

The FLASH_ATTENTION_SKIP_CUDA_BUILD variable is redundant when installation succeeds and harmful when it fails:

  1. Redundant when things work: flash-attn's setup.py (via CachedWheelsCommand) already prioritizes downloading wheels before attempting any compilation. If a valid wheel exists, it is installed regardless of this variable. Setting it provides no benefit here.
  2. Harmful when things fail: If uv resolves to a version combination (e.g., a newer Torch) for which no pre-built wheel exists:
  • Without this variable: The setup falls back to local compilation, fails due to missing CUDA/nvcc (in a clean build env), and raises a loud, helpful error. This is the desired fail-fast behavior.
  • With this variable: The setup falls back to local compilation, sees the flag, silently skips all CUDA extensions, and successfully installs a broken, pure-Python "hollow" package.
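The two paths above can be modeled as a small decision function (a simplified sketch of the flow, not the actual setup.py code):

```python
import os

def install_plan(wheel_available: bool) -> str:
    # A matching pre-built wheel always wins, with or without the
    # environment variable -- which is why setting it is redundant
    # on the success path.
    if wheel_available:
        return "install pre-built wheel"
    # No wheel: the variable now decides between fail-fast and silent failure.
    if os.getenv("FLASH_ATTENTION_SKIP_CUDA_BUILD", "FALSE") == "TRUE":
        return "install hollow pure-Python package"  # silent failure
    return "compile CUDA extension locally"  # loud error without CUDA/nvcc
```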

The current recommended configuration creates a dangerous trap where a "Build Failure" (which is easy to diagnose) is suppressed and converted into a "Runtime Failure" (which is confusing and occurs later).

Expected Behavior

  1. Documentation Update: The documentation should stop recommending FLASH_ATTENTION_SKIP_CUDA_BUILD as a standard practice for ensuring wheel installation, as it doesn't actually "ensure" anything that the script doesn't already do by default.
  2. Warning Added: If the variable is mentioned, the docs must warn that enabling it disables the safety mechanism (compilation failure) and can result in silent installation of non-functional packages if version pinning is not strict.
  3. Best Practice: The docs should instead emphasize explicitly pinning both torch and flash-attn to known-good combinations to ensure the resolver picks versions that have matching wheels.
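A pinned configuration might look like the following (the specific versions are illustrative placeholders, not a verified known-good pair; check flash-attn's release assets for a wheel matching your Torch/CUDA/Python combination):

```toml
[project]
dependencies = [
    # Illustrative pins -- substitute a combination for which a
    # pre-built flash-attn wheel actually exists.
    "torch==2.4.0",
    "flash-attn==2.6.3",
]
```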

Platform

Linux 5.15.0-160-generic x86_64 GNU/Linux

Version

uv 0.9.28

Python version

Python 3.12.12


Labels

documentation (Improvements or additions to documentation)
