ROCM support #3279
|
windows support may also be possible but i would need some help testing this as i do not have a windows machine |
|
docs changes:
diff --git a/get-started/installing-+-updating/pip-install.md b/get-started/installing-+-updating/pip-install.md
index c1f0975..5f66dbf 100644
--- a/get-started/installing-+-updating/pip-install.md
+++ b/get-started/installing-+-updating/pip-install.md
@@ -24,6 +24,16 @@ pip uninstall unsloth unsloth_zoo -y && pip install --no-deps git+https://github
If you're installing Unsloth in Jupyter, Colab, or other notebooks, be sure to prefix the command with `!`. This isn't necessary when using a terminal
+**To install Unsloth on AMD GPUs:**
+
+{% hint style="info" %}
+You can safely ignore errors about CUDA not being linked properly if you are installing Unsloth on AMD GPUs.
+{% endhint %}
+
+```bash
+pip install "unsloth[rocm64-torch280]"
+```
+
## Uninstall + Reinstall
If you're still encountering dependency issues with Unsloth, many users have resolved them by forcing uninstalling and reinstalling Unsloth:
diff --git a/get-started/beginner-start-here/unsloth-requirements.md b/get-started/beginner-start-here/unsloth-requirements.md
index 793bd63..b5f5429 100644
--- a/get-started/beginner-start-here/unsloth-requirements.md
+++ b/get-started/beginner-start-here/unsloth-requirements.md
@@ -8,7 +8,7 @@ description: Here are Unsloth's requirements including system and GPU VRAM requi
* **Operating System**: Works on Linux and Windows.
* Supports NVIDIA GPUs since 2018+ including [Blackwell RTX 50](../../basics/training-llms-with-blackwell-rtx-50-series-and-unsloth) series. Minimum CUDA Capability 7.0 (V100, T4, Titan V, RTX 20, 30, 40, A100, H100, L40 etc) [Check your GPU!](https://developer.nvidia.com/cuda-gpus) GTX 1070, 1080 works, but is slow.
-* Unsloth should work on [AMD](https://github.com/unslothai/unsloth/pull/2520) and [Intel](https://github.com/unslothai/unsloth/pull/2621) GPUs! Apple/Silicon/MLX is in the works.
+* Unsloth should work on [AMD](../installing-+-updating/pip-install#amd-installation) and [Intel](https://github.com/unslothai/unsloth/pull/2621) GPUs! Apple/Silicon/MLX is in the works.
* If you have different versions of torch, transformers etc., `pip install unsloth` will automatically install all the latest versions of those libraries so you don't need to worry about version compatibility.
* Your device must have `xformers`, `torch`, `BitsandBytes` and `triton` support.
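A quick way to confirm that the AMD install picked up a ROCm build of PyTorch is to look at `torch.version.hip`, which is set on ROCm builds and `None` on CUDA builds. The helper below is only a sketch: `describe_torch_backend` and the fake module are hypothetical names made up for illustration, and it assumes `torch` is importable when run as a script.

```python
from types import SimpleNamespace


def describe_torch_backend(torch_module):
    """Return 'rocm', 'cuda', or 'cpu' for a torch-like module (hypothetical helper)."""
    version = getattr(torch_module, "version", None)
    if getattr(version, "hip", None):   # set on ROCm builds of PyTorch
        return "rocm"
    if getattr(version, "cuda", None):  # set on CUDA builds of PyTorch
        return "cuda"
    return "cpu"


# Stand-in object so the helper can be exercised without a GPU or torch installed.
fake_rocm_torch = SimpleNamespace(version=SimpleNamespace(hip="6.4", cuda=None))

if __name__ == "__main__":
    print("fake module:", describe_torch_backend(fake_rocm_torch))
    try:
        import torch
    except ImportError:
        print("torch is not installed")
    else:
        # On ROCm builds the GPU is still exposed through the torch.cuda namespace.
        print("real torch:", describe_torch_backend(torch), torch.cuda.is_available())
```

If this reports `cuda` or `cpu` on an AMD machine, the wrong PyTorch wheel was probably installed.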
|
|
seems like 4bit exporting has some issues as 64 blocksize is not supported with rocm (ROCm/bitsandbytes#10), it is possible to have 64 blocksize though depending on warp size so i will look into submitting a pr to bitsandbytes |
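For context on what "blocksize" means here: blockwise quantization splits a tensor into fixed-size blocks and stores one scale per block, so blocksize 64 keeps twice as many scales as blocksize 128. The sketch below is a simplified pure-Python absmax scheme with integer codes in [-7, 7]. It is not the NF4 kernel bitsandbytes ships, and the function names are made up for illustration.

```python
def quantize_blockwise(values, blocksize):
    """Split values into blocks; each block keeps one absmax scale plus int codes in [-7, 7]."""
    blocks = []
    for start in range(0, len(values), blocksize):
        block = values[start:start + blocksize]
        absmax = max(abs(v) for v in block) or 1.0  # avoid dividing by zero on all-zero blocks
        codes = [round(v / absmax * 7) for v in block]
        blocks.append((absmax, codes))
    return blocks


def dequantize_blockwise(blocks):
    """Rebuild approximate values from (absmax, codes) pairs."""
    return [code / 7 * absmax for absmax, codes in blocks for code in codes]


if __name__ == "__main__":
    values = [i / 10 for i in range(-64, 64)]  # 128 sample weights
    for blocksize in (64, 128):
        blocks = quantize_blockwise(values, blocksize)
        restored = dequantize_blockwise(blocks)
        worst = max(abs(a - b) for a, b in zip(values, restored))
        print(f"blocksize={blocksize}: {len(blocks)} scales, worst error {worst:.3f}")
```

On the GPU each block is reduced by a group of threads, which is why the smallest usable blocksize ends up tied to the warp size.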
|
i have found a likely solution, if it works maybe i can switch over the builds to my fork until its merged in so 4bit works |
|
marking as draft until i get this issue fixed as it is fairly major |
|
pr created: bitsandbytes-foundation/bitsandbytes#1748 |
|
should work now, testing changes |
|
works |
|
Works great on AMD MI100. I added this to my vllm Dockerfile and it just worked.
RUN git clone --recurse https://github.com/ROCm/bitsandbytes && cd bitsandbytes && git checkout rocm_enabled_multi_backend && pip install -r requirements-dev.txt && cmake -DCOMPUTE_BACKEND=hip -S . && make -j && pip install .
RUN git clone https://github.com/electron271/unsloth-rocm.git && cd unsloth-rocm && pip install .
RUN pip install unsloth_zoo
Thanks |
great to hear! you also shouldn't need to use the rocm fork of bitsandbytes (afaik), this branch will install rocm supported bitsandbytes as a dependency and if you want to manually install it was merged into main so you can use main bitsandbytes |
|
I ran |
4bit is broken on CDNA gpus as they do not support 64 block size, i am unaware if there is a solution or not |
|
Hi @electron271 , glad to see this fabulous contribution for amd GPU. Let me help on verifying on more kinds of devices and hope to collaborate on this. |
|
I like the way to provide our end user the fresh prebuilt bnb binary directly in the patch. Somehow this does not work in some environment |
i think a dockerfile would be beneficial for systems that don't support this. this error is caused by having an out-of-date system: the minimum usable version of gcc is GCC 13.2, released July 27, 2023. i will note that i had a lot of issues with dockerized rocm when i was trying to get unsloth working on rocm initially, so i'm not sure if i am able to help with it. |
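A rough pre-flight check for the GCC 13.2 minimum mentioned above could look like the sketch below. It assumes the `gcc` on `PATH` is a reasonable proxy for the system toolchain the prebuilt binary needs, which may not hold everywhere, and the helper names are made up for illustration.

```python
import re
import shutil
import subprocess

MIN_GCC = (13, 2)  # minimum quoted in the comment above


def parse_version(text):
    """Turn '13.2.0' (or just '13') into a (major, minor) tuple."""
    numbers = [int(n) for n in re.findall(r"\d+", text)[:2]]
    return tuple(numbers + [0] * (2 - len(numbers)))


def installed_gcc_version():
    """Return (major, minor) of the gcc on PATH, or None if it cannot be determined."""
    if shutil.which("gcc") is None:
        return None
    result = subprocess.run(
        ["gcc", "-dumpfullversion", "-dumpversion"],
        capture_output=True, text=True, check=False,
    )
    return parse_version(result.stdout) if result.stdout.strip() else None


if __name__ == "__main__":
    found = installed_gcc_version()
    if found is None:
        print("gcc not found")
    else:
        verdict = "ok" if found >= MIN_GCC else "older than the stated minimum"
        print(f"gcc {found[0]}.{found[1]}: {verdict}")
```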
|
the upstream bitsandbytes pr should hopefully be able to be merged soon |
|
Hi @electron271 With that said, the official bitsandbytes wheels we build and will eventually publish are compatible with Ubuntu 22.04 (and other supported systems with glibc>=2.24). I am going to go ahead and merge that PR on bitsandbytes soon; we'll drop the ROCm 6.1 build and keep 6.2/6.3/6.4/7.0. We still need to add the RDNA4/CDNA4 build targets (RX 9070/9060, MI350X/MI355X), and need to keep in mind that while this can enable blocksize 64 on RDNA (consumer) it won't for CDNA (datacenter). |
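The RDNA/CDNA split above could be sketched as a small lookup. This is an illustration under stated assumptions, not the bitsandbytes implementation: it assumes RDNA runs 32-wide wavefronts and CDNA 64-wide, that blocksize 64 is only usable at width 32, and that falling back to the next larger blocksize is acceptable. All names are hypothetical.

```python
ALL_BLOCKSIZES = (4096, 2048, 1024, 512, 256, 128, 64)

# Assumed wavefront widths: RDNA (consumer) 32-wide, CDNA (datacenter) 64-wide.
WARP_SIZE = {"rdna": 32, "cdna": 64}


def allowed_blocksizes(warp_size):
    """Assumed rule: blocksize 64 is only available with 32-wide warps."""
    if warp_size == 32:
        return ALL_BLOCKSIZES
    return tuple(b for b in ALL_BLOCKSIZES if b > 64)


def pick_blocksize(requested, warp_size):
    """Return the requested blocksize if allowed, else the next larger allowed one."""
    candidates = [b for b in allowed_blocksizes(warp_size) if b >= requested]
    if not candidates:
        raise ValueError(f"no supported blocksize >= {requested}")
    return min(candidates)


if __name__ == "__main__":
    for family, width in WARP_SIZE.items():
        print(family, "requested 64, using", pick_blocksize(64, width))
```

Under these assumptions a checkpoint quantized with blocksize 64 would need requantizing at 128 to run on CDNA, which matches the caveat in the comment above.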
done, my bitsandbytes builds are temporarily broken though as i reached maximum git lfs bandwidth and the limit resets in ~30 days. will think of a potential solution |
limit ended up resetting so it works now, i may look into hosting it myself but it's probably a bad idea to have sources from unreliable urls in unsloth so it may be best to wait until bitsandbytes has updated
…nslothai#3859)
* add int8 weight-only QAT scheme, add test, fix tests for current torchao version
* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci
* change quantization to PerAxis
* lambda =/
* add torchao messages, remove group_size from int8
* raise exception on missing torchao
* touch up the torchao imports
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
updates: - [github.com/astral-sh/ruff-pre-commit: v0.14.11 → v0.14.13](astral-sh/ruff-pre-commit@v0.14.11...v0.14.13) Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* Implement vLLM patch for notebook detection
  Add patch for vLLM compatibility in notebook environments.
* Fix sys.stdout.fileno for vLLM compatibility
  Patch sys.stdout.fileno for vLLM compatibility in notebooks.
* Add patch_vllm_for_notebooks to initialization
* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci
* Harden vLLM notebook stdout patch
* Use logger for vLLM notebook patch
* Clarify vLLM notebook patch log message
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: danielhanchen <danielhanchen@users.noreply.github.com>
* Handle Transformers 5 vLLM import errors
* Deduplicate vLLM transformers mismatch handling
---------
Co-authored-by: danielhanchen <danielhanchen@users.noreply.github.com>
… models (unslothai#3719)
* add FastSentenceTransformer
* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci
* Gemini code review suggestions
* unsloth-zoo patch only fixed usage for XLMRobertaForMaskedLM, this is a fix for XLMRobertaModel
* refactor do_lower_case
* add some comments
* force disable FP8 loading
* refactor pooling detection, add missing pooling types
* add save_pretrained_merged method which gets modules and config
* fix _save_pretrained_merged
* rename read_pooling_mode, load modules instead of hard-coding em
* comment
* revert save_pretrained_merged change
* propagate trust_remote_code properly
* add super hacky mpnet patch from hell
* refactor _load_modules, add for_inference to from_pretrained, add transformers 5 code for mpnet, add distilbert patches
* add ModernBert
* deberta-v2 support (provisional), fix remote_code
* add generic add_pooling_layer logic
* fix for missing config
* add push_to_hub_merged
* edit messages, throw exception if no HF token
* fix device_map mismatch
* add comments, move import, other suggestions by Datta0
* re-add adapter removal to save_pretrained_merged, but if saving to folder which had adapters before, leave them
* add unsloth branding to save_pretrained_merged
* propagate dtype to internal module when loading for inference
* fix mpnet gradient checkpointing for torch >= 2.9
* same thing for transformers 5, oops =)
* Fix FastSentenceTransformer performance: 6x speedup via torch.compile + SDPA
  The original implementation was 31% slower than naive SentenceTransformer due to conflicting decorators from Unsloth's auto-compiler (@torch.compile on attention modules but @torch.compiler.disable on sub-modules).
  Changes:
  - Add fast encoder path that bypasses Unsloth patching for encoder models
  - Use native torch.compile with mode="reduce-overhead" for 6x speedup
  - Auto-detect and enable SDPA for models that support it (BERT, RoBERTa, etc.)
  - Change defaults: load_in_16bit=True, load_in_4bit=False (16-bit is optimal)
  - Change default: use_gradient_checkpointing=False (conflicts with torch.compile)
  - Add UNSLOTH_COMPILE_DISABLE=1 env var to fall back to old path if needed
  Supported encoder types: mpnet, bert, distilbert, roberta, xlm-roberta, albert, electra
  Benchmark results (BS=32, seq_len=128):
  - Naive 16-bit LoRA: 13-50ms per iter
  - Unsloth 16-bit LoRA: 2-9ms per iter (5.4x-6.7x faster)
  - Memory usage: 61MB-1.3GB (even largest model fits easily)
  Note: 4-bit + torch.compile has a PyTorch bug (pytorch/pytorch#90665). 4-bit is also 1.7-1.9x slower than 16-bit due to dequantization overhead, so 16-bit is recommended for these small encoder models anyway.
* Use Unsloth's prepare_model_for_kbit_training for consistency
  Changed from peft.prepare_model_for_kbit_training to unsloth.models._utils.prepare_model_for_kbit_training.
  Unsloth's version provides:
  - Float32 mixed precision upcasting for LoRA layers
  - Better numerical stability
  - Consistency with rest of Unsloth codebase
* Use relative imports and add float16 machine support
  - Changed absolute import to relative: from ._utils import prepare_model_for_kbit_training
  - Added SUPPORTS_BFLOAT16 import for proper dtype detection
  - Handle devices that don't support bfloat16 by falling back to float16
* add save_pretrained_torchao
* Add auto-compile for torch.compile based on training step breakeven analysis
  Changes:
  - Change default compile_mode from "reduce-overhead" to "default" since CUDA Graphs (used by reduce-overhead) is incompatible with PEFT/LoRA
  - Add _estimate_compile_threshold() to calculate minimum steps needed for torch.compile to be beneficial based on model parameter count
  - Add _apply_torch_compile() helper with accelerate unwrap_model bug workaround
  - Defer torch.compile application to trainer initialization time so we can check max_steps against the breakeven threshold
  - Patch SentenceTransformerTrainer to auto-apply compile when max_steps exceeds the calculated threshold
  Breakeven thresholds (with 1.2x safety margin):
  - 22M params (MiniLM): ~1388 steps
  - 110M params (mpnet): ~242 steps
  - 335M params (snowflake): ~203 steps
  This ensures torch.compile warmup cost is only paid when training is long enough to benefit from the speedup.
* do QAT preparation for fast path
* fix double loading model, thanks Etherl
* do mpnet gradient checkpoint patch if gc is enabled
* remove distilbert patches from mpnet fix
* sanity check on model params, thanks Etherl
* add save_pretrained_gguf, thanks Etherl
* Refine compile threshold estimation for sentence transformers
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Daniel Han <danielhanchen@gmail.com>
Co-authored-by: Daniel Han <danielhanchen@users.noreply.github.com>
* Guard torch.compile on ROCm when triton_key missing
* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci
* Update unsloth/import_fixes.py
  Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
* Tighten ROCm Triton import handling
---------
Co-authored-by: Rachel Li <rachelliqx07@gmail.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
any updates at all |
|
@electron271 Testing various ROCm versions atm, will open a PR soon, the unsloth_zoo issues seem to be resolved. |
|
@electron271 I opened a PR to your fork, can you take a look? |
|
looking right now |
ROCm/PyTorch install combinations
|
@danielhanchen was this closed by mistake or is there another PR up that incorporates these changes? |
|
@electron271 Sorry Unsloth had a rebase, so had to recover your PR at #4271. |
|
@electron271 I also recovered the broader ROCm install-matrix portion of your PR at #4272. |


closes #37