Strix Halo AI Toolboxes

The Project

In August 2025, I got my hands on a Strix Halo machine. I needed to run local inference for some cybersecurity work where cloud LLMs were not an option.

I quickly realized the software ecosystem wasn't ready: a lot of things simply didn't work. So I started digging, learning, and fixing things, and I shared my findings in a video that people found useful.

Thanks to support from the Strix Halo Home Lab community, Framework, and AMD, I've continued to maintain these "Toolboxes" to help others reproduce this setup and get the most out of their Strix Halo hardware for AI workloads.

// WHOAMI

Donato Capitella

Software Engineer and Ethical Hacker. I enjoy understanding systems by breaking them down and documenting the process.

YouTube Channel | LinkedIn Profile | LLM Chronicles

Support my work:

☕ Buy me a coffee

What is Strix Halo?

// RYZEN AI MAX+

"Strix Halo" (Ryzen AI MAX+) is AMD's high-performance mobile processor platform. Its key feature for AI workloads is Unified Memory: the iGPU can access up to 128 GB of system RAM, so it can load much larger models than traditional consumer GPUs, which are limited by their dedicated VRAM.

Architecture	Zen 5 + RDNA 3.5
GPU ID	gfx1151
Max Unified Memory	128 GB

-> Official Product Page

Active Toolboxes

// MAINTAINED CONTAINERS

These are containerized environments built on Toolbx (Docker/Podman). This approach allows you to easily get the specific runtime needed for Strix Halo, keep the host system clean, and instantly switch between different ROCm or software versions without dependency conflicts.

Llama.cpp Toolboxes

Setup for LLM inference. Supports Vulkan and ROCm backends, plus clustering via RDMA.

View Repo ->

ComfyUI Toolboxes

Environment for Image & Video generation. Validated for LTX2, Wan 2.2, HunyuanVideo, and Qwen.

View Repo ->

vLLM Toolboxes

Setup for LLM serving. Includes custom RCCL patches for high-speed clustering.

View Repo ->

LLM Fine-tuning

Training environment. QLoRA and Full Fine-Tuning support for Gemma 3, Qwen 3, and generic models.

View Repo ->

DwarfStar

A small native inference engine optimized first for DeepSeek V4 Flash.

View Repo ->

Llama Cockpit

// TUI CONTAINER MANAGER

Llama Cockpit is a Terminal User Interface (TUI) that makes it easier to manage llama.cpp toolboxes and GGUF weights. It also includes a server mode that doesn't require Toolbox or Distrobox: it runs directly via Docker/Podman for compatibility with any Linux distribution.

GitHub Repo ->

$ pipx install git+https://github.com/kyuz0/llama-toolboxes-cockpit.git

Tutorials & Guides

// YOUTUBE VIDEOS

Performance Benchmarks

// HARDWARE CAPABILITIES

Llama.cpp Benchmarks

Token generation speeds (tokens/sec) across various GGUF models.

View Benchmarks ->

vLLM Benchmarks

Peak multi-user throughput (tokens/sec) and RDMA/RoCE clustering performance.

View Benchmarks ->

ComfyUI Benchmarks

Generation speeds (seconds/it) for HunyuanVideo, Wan 2.2, and Qwen image workflows.

View Benchmarks ->

DwarfStar Benchmarks

Inference performance metrics (tokens/sec) for DeepSeek V4 Flash models.

View Benchmarks ->

SWE-bench Verified Mini

Coding capability and speed metrics on the SWE-bench Verified Mini dataset. Measures real-world execution on Strix Halo using accessible quantized models rather than full-precision server clusters.

View Benchmarks ->

Host Config

// TUNED FOR PERFORMANCE

This is the configuration I use on my Framework Desktop to maintain and benchmark all toolboxes.

My Rig - Sent to me by Framework

System Specifications

Model	Framework Desktop
CPU	Ryzen AI MAX+ 395 "Strix Halo"
Total RAM	128 GB DDR5
OS	Fedora 43 (Linux 6.18.5)

Kernel Parameters

Why Custom Kernel Parameters?

Many guides suggest statically partitioning memory between the CPU and iGPU (e.g., locking 32 GB for video). This wastes memory, because the reserved portion stays unavailable to the CPU even when the GPU is idle. With Unified Dynamic Memory, I can instead let the GPU access nearly all system RAM (up to ~124 GB) on demand, while keeping the flexibility to use it for the CPU when needed.

Performance Note: Benchmarking by Lars Urban (Issue #66) shows a 5-12% performance increase by setting amd_iommu=off instead of the previously recommended pass-through mode.

root@strix-halo:~

# Add these to GRUB_CMDLINE_LINUX in /etc/default/grub

$ sudo vim /etc/default/grub

GRUB_CMDLINE_LINUX="... amd_iommu=off amdgpu.gttsize=126976 ttm.pages_limit=32505856"

$ sudo grub2-mkconfig -o /boot/grub2/grub.cfg
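The new parameters only apply after a reboot. One way to confirm they took effect is to look for them in /proc/cmdline. This is a minimal sketch rather than part of the toolboxes: the check_cmdline helper name is hypothetical, and the values are the ones from the GRUB line above.

```shell
#!/bin/sh
# Report which of the expected kernel parameters are present on a command line.
# Usage: check_cmdline "<cmdline string>"   (defaults to the running kernel)
check_cmdline() {
    cmdline="${1:-$(cat /proc/cmdline)}"
    missing=0
    for param in amd_iommu=off amdgpu.gttsize=126976 ttm.pages_limit=32505856; do
        # Pad with spaces so only whole parameters match.
        case " $cmdline " in
            *" $param "*) echo "ok:      $param" ;;
            *)            echo "missing: $param"; missing=1 ;;
        esac
    done
    return "$missing"
}

# Example (run after rebooting): check_cmdline
```

The function returns a non-zero status if any parameter is missing, so it can also be used in a script that refuses to start benchmarks on a misconfigured host.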

Parameter	How it enables Unified Memory
amd_iommu=off	Disable AMD IOMMU: Turns the AMD IOMMU off entirely, which can improve GPU memory access performance on Strix Halo unified memory setups.
amdgpu.gttsize=126976	GTT Size (Graphics Translation Table): Explicitly sets the maximum unified memory addressable by the GPU to ~124 GB (126976 MiB = 124 × 1024), overriding default driver limits.
ttm.pages_limit=32505856	Pinned Memory Limit: Allows TTM (Translation Table Manager) to pin up to ~124 GB of system RAM (32505856 pages × 4 KiB per page), so the GPU has direct access without swapping.
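The two size values follow from the same target, so they can be derived for machines with a different amount of RAM. The sketch below assumes 4 KiB pages (the usual x86-64 default, which `getconf PAGESIZE` can confirm). The gtt_params helper name is hypothetical.

```shell
#!/bin/sh
# Derive amdgpu.gttsize (in MiB) and ttm.pages_limit (in pages)
# from a target amount of GPU-addressable memory given in GiB.
gtt_params() {
    gib="$1"
    page_kib=4                                     # assumed page size in KiB
    gttsize=$((gib * 1024))                        # GiB -> MiB
    pages_limit=$((gib * 1024 * 1024 / page_kib))  # GiB -> KiB -> pages
    echo "amdgpu.gttsize=$gttsize ttm.pages_limit=$pages_limit"
}

gtt_params 124
# prints: amdgpu.gttsize=126976 ttm.pages_limit=32505856
```

Keeping both values tied to one number avoids a mismatch where the GTT limit and the TTM page limit describe different amounts of memory.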

Container Engine & Permissions

Depending on your Linux distribution, you will need a different container engine to properly access the GPU. Instructions for Fedora and Ubuntu follow.

I test this setup on recent Fedora releases because they have native support for Toolbox, which makes working with containers seamless and convenient. The create command passes additional parameters to explicitly map the GPU devices into the container, as shown below:

user@fedora:~

# Create your toolbox mapped to the host's GPU

$ toolbox create <TOOLBOX_NAME> \
    --image <IMAGE_URL> \
    -- --device /dev/dri --device /dev/kfd \
    --group-add video --group-add render --security-opt seccomp=unconfined

Note: <TOOLBOX_NAME> and <IMAGE_URL> are placeholders. Check the specific toolbox repository for the correct values.
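Once inside the container, a quick sanity check is to confirm that the device nodes mapped with --device are present and accessible. This is a rough sketch, not something shipped with the toolboxes: the check_gpu_nodes helper and its optional directory argument are hypothetical, and the argument exists only so the check can be tried against a scratch directory.

```shell
#!/bin/sh
# Check that the GPU device nodes (kfd and the DRM render nodes) exist
# and are readable and writable by the current user.
# Usage: check_gpu_nodes [device-root]   (device-root defaults to /dev)
check_gpu_nodes() {
    dev_root="${1:-/dev}"
    status=0
    # If no render node exists, the glob stays literal and is reported below.
    for node in "$dev_root/kfd" "$dev_root"/dri/renderD*; do
        if [ -r "$node" ] && [ -w "$node" ]; then
            echo "ok:      $node"
        else
            echo "problem: $node (missing or not readable/writable)"
            status=1
        fi
    done
    return "$status"
}

# Example (run inside the toolbox): check_gpu_nodes
```

A "problem" line usually points at either a missing --device flag or a permissions issue on the host.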

Users running Ubuntu have reported permission issues with the default toolbox package that can break GPU access. They have shared the following configuration using Distrobox, which works:

user@ubuntu:~

# Add your user to required GPU groups

$ sudo usermod -aG video,render $USER

# Ensure the compute device is accessible (persists across reboots)

$ echo -e 'SUBSYSTEM=="kfd", KERNEL=="kfd", MODE="0666"\nSUBSYSTEM=="drm", KERNEL=="renderD*", MODE="0666"' | sudo tee /etc/udev/rules.d/70-kfd.rules

$ sudo udevadm control --reload-rules && sudo udevadm trigger

# Create your distrobox mapped to the host's GPU

$ distrobox create <TOOLBOX_NAME> \
    --image <IMAGE_URL> \
    -- --device /dev/dri --device /dev/kfd \
    --group-add video --group-add render --group-add sudo --security-opt seccomp=unconfined

Note: <TOOLBOX_NAME> and <IMAGE_URL> are placeholders. Check the specific toolbox repository for the correct values.

Note: This Distrobox configuration has been tested on Ubuntu 25.10 with Mainline Kernel 6.18.7-061807. To enable mainline kernels on Ubuntu, you can use the Ubuntu Mainline Kernel Installer.
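If GPU access still fails on Ubuntu, it may be worth checking that the group change from usermod is visible in the current session, since new group memberships only apply after logging in again. This is a minimal sketch; the missing_gpu_groups helper name is hypothetical.

```shell
#!/bin/sh
# Print each required GPU group that is absent from a group list.
# Usage: missing_gpu_groups "<space-separated groups>"
#        (defaults to the current user's groups from `id -nG`)
missing_gpu_groups() {
    groups_list="${1:-$(id -nG)}"
    for group in video render; do
        # Pad with spaces so only whole group names match.
        case " $groups_list " in
            *" $group "*) ;;
            *) echo "$group" ;;
        esac
    done
}

# Example: missing_gpu_groups
```

Empty output means both groups are active in the session. Any group name printed is still missing.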

Power & Performance Tuning

Following the documentation here, I set the tuned accelerator-performance profile to get maximum performance.

root@fedora:~

$ sudo dnf install tuned

$ sudo systemctl enable --now tuned

$ tuned-adm list | grep accelerator

- accelerator-performance - Throughput performance based tuning with disabled higher latency STOP states

$ sudo tuned-adm profile accelerator-performance

$ tuned-adm active

Current active profile: accelerator-performance

root@ubuntu:~

$ sudo apt update && sudo apt install tuned

$ sudo systemctl enable --now tuned

$ tuned-adm list | grep accelerator

- accelerator-performance - Throughput performance based tuning with disabled higher latency STOP states

$ sudo tuned-adm profile accelerator-performance

$ tuned-adm active

Current active profile: accelerator-performance

Community & Support

Connect with other Strix Halo owners, share benchmarks, and get help.

This is a hobby project that takes a lot of time to maintain and test. If you find these toolboxes useful, consider supporting the work.

Strix Halo Home Lab Wiki | Join the Discord | ☕ Buy me a Coffee