
FastFlowLM Docker — Run LLMs on AMD Ryzen AI NPU (Linux)

🤖 This entire project — Dockerfile, README, diagnosis, build process, and testing — was generated by Claude Opus 4.6 running inside GitHub Copilot CLI. A human provided the hardware and the goal ("run an LLM on my NPU"); the AI figured out the rest, including discovering that building FastFlowLM from source on Linux works despite no official support yet.

Run large language models on your AMD Ryzen AI NPU under Linux, using FastFlowLM inside Docker.

Demo — Llama 3.2 1B running on AMD Ryzen AI NPU at ~60 tokens/s

What this is

As of March 2026, running LLMs on AMD's XDNA2 NPU under Linux is not straightforward. AMD's official Ryzen AI 1.7 stack is missing a critical shared library (onnxruntime_providers_ryzenai.so) on Linux (amd/RyzenAI-SW#333), and FastFlowLM doesn't yet ship Linux binaries in releases (FastFlowLM#381).

This Dockerfile builds FastFlowLM from source and packages everything needed into a minimal container that talks directly to the NPU.

Supported hardware

Any AMD processor with an XDNA/XDNA2 NPU, including:

  • Ryzen AI 9 HX 370/375 (Strix Point — XDNA2)
  • Ryzen AI 9 HX 395 (Strix Halo — XDNA2)
  • Ryzen AI 7 PRO 360 (NPU4 / AIE2P) — confirmed by community (#1)
  • Ryzen AI Max / Max+ (Kraken Point)
  • And other XDNA-based APUs

Host prerequisites

Requirement                              How to check
Linux kernel ≥ 6.11 with amdxdna driver  lsmod | grep amdxdna
NPU device node                          ls -la /dev/accel/accel0
NPU firmware ≥ 1.1.0.0                   ls /lib/firmware/amdnpu/
Docker installed                         docker --version
memlock unlimited (recommended)          ulimit -l
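The table's checks can be run in one pass with a small script. This is a hypothetical helper, not something shipped in the repo; each probe prints OK or MISSING and the script never aborts:

```shell
#!/usr/bin/env bash
# One-pass host prerequisite check (hypothetical helper, not part of the repo)

check() {
  local label="$1"; shift
  if "$@" >/dev/null 2>&1; then
    echo "OK       $label"
  else
    echo "MISSING  $label"
  fi
}

check "amdxdna kernel driver"  sh -c 'lsmod | grep -q amdxdna'
check "NPU device node"        test -e /dev/accel/accel0
check "NPU firmware files"     sh -c 'ls /lib/firmware/amdnpu/ | grep -q .'
check "Docker"                 docker --version

# memlock should report "unlimited" for reliable NPU buffer pinning
echo "memlock  $(ulimit -l)"
```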

Quick host setup (Ubuntu 24.04)

The container builds XRT from source internally, so you only need the host-side kernel driver and firmware:

# Install the amdxdna kernel driver (from AMD's xdna-driver repo or PPA)
# See https://github.com/amd/xdna-driver for full instructions
sudo add-apt-repository ppa:amd-team/xrt
sudo apt update && sudo apt install libxrt-npu2

# Set memlock to unlimited (needs reboot)
echo -e "* soft memlock unlimited\n* hard memlock unlimited" | sudo tee -a /etc/security/limits.conf
sudo reboot

Note: If the PPA's XRT doesn't work with your kernel (e.g. kernel 6.17+), you may need to build the xdna-driver from source on the host. See Troubleshooting below.

Build the Docker image

git clone https://github.com/hpenedones/fastflowlm-docker.git
cd fastflowlm-docker
docker build -t fastflowlm .

The build takes ~15-25 minutes (XRT source build + Rust compilation + FLM C++ build). The resulting image is ~440MB (3-stage build, runtime-only).

Usage

# List available NPU models
docker run --rm fastflowlm list

# Download a model (mount cache so it persists)
docker run --rm -v ~/.config/flm:/root/.config/flm fastflowlm pull llama3.2:1b

# Chat with the model on your NPU
docker run -it --rm \
  --device=/dev/accel/accel0 \
  --ulimit memlock=-1:-1 \
  -v ~/.config/flm:/root/.config/flm \
  fastflowlm run llama3.2:1b

# Validate NPU setup
docker run --rm \
  --device=/dev/accel/accel0 \
  --ulimit memlock=-1:-1 \
  fastflowlm validate

# Run OpenAI-compatible API server on port 8000
docker run -d --rm \
  --device=/dev/accel/accel0 \
  --ulimit memlock=-1:-1 \
  -v ~/.config/flm:/root/.config/flm \
  -p 8000:8000 \
  fastflowlm serve
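Once the server is up, any OpenAI-compatible client can point at it. A minimal curl sketch, assuming llama3.2:1b was pulled earlier and that the server exposes the standard /v1/chat/completions path:

```shell
# Request body for the OpenAI-compatible chat endpoint (the model name
# assumes llama3.2:1b was pulled earlier)
BODY='{"model":"llama3.2:1b","messages":[{"role":"user","content":"Say hello in five words."}]}'

# Send it; prints the JSON response while the serve container is running
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d "$BODY" || echo "server not reachable on localhost:8000"
```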

Benchmarks

Measured on an AMD Ryzen AI 9 HX 370 (Strix Point) with 32GB LPDDR5x, Ubuntu 24.04, kernel 6.17, FLM v0.9.35, power mode: performance.

Prompt: "Explain quantum computing in 200 words."

Model          Params  TTFT     Prefill     Decode      Tokens
Qwen3 0.6B     0.6B    535 ms   52.4 tok/s  88.7 tok/s  161
LFM2 1.2B      1.2B    363 ms   49.6 tok/s  62.9 tok/s  240
Llama 3.2 1B   1B      460 ms   95.9 tok/s  60.1 tok/s  271
Qwen3 1.7B     1.7B    640 ms   37.5 tok/s  40.4 tok/s  434
Gemma3 1B      1B      550 ms   34.6 tok/s  37.9 tok/s  201
Llama 3.2 3B   3B      957 ms   46.0 tok/s  24.4 tok/s  294
Phi-4 Mini 4B  4B      926 ms   11.9 tok/s  20.0 tok/s  935
Qwen3 4B       4B      1040 ms  23.1 tok/s  18.7 tok/s  551
  • TTFT = Time To First Token (lower is better)
  • Prefill = prompt processing speed
  • Decode = token generation speed (the number you feel when chatting)
  • Tokens = total tokens generated in the response

All inference runs entirely on the NPU — zero GPU or CPU compute for the model forward pass. The Qwen3 0.6B and Llama 3.2 1B models are the sweet spot for interactive use, delivering 60-89 tokens/s decode speed.
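From those numbers, a rough end-to-end response time is TTFT plus tokens divided by decode speed. Taking the Llama 3.2 1B row as an example:

```shell
# total ≈ TTFT + tokens / decode speed (Llama 3.2 1B row from the table)
awk 'BEGIN { printf "≈ %.1f s end-to-end\n", 0.460 + 271 / 60.1 }'
# prints: ≈ 5.0 s end-to-end
```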

Available models

As of FLM v0.9.35, these models run on the NPU:

Model             Size     Command
Llama 3.2 1B      ~1.2 GB  run llama3.2:1b
Llama 3.2 3B      ~1.8 GB  run llama3.2:3b
Llama 3.1 8B      ~4.5 GB  run llama3.1:8b
Qwen3 0.6B        ~0.4 GB  run qwen3:0.6b
Qwen3 1.7B        ~1.0 GB  run qwen3:1.7b
Qwen3 4B          ~2.3 GB  run qwen3:4b
Gemma3 1B         ~0.7 GB  run gemma3:1b
DeepSeek-R1 8B    ~4.5 GB  run deepseek-r1:8b
Phi-4 Mini 4B     ~2.3 GB  run phi4-mini-it:4b
Whisper V3 Turbo  ~0.6 GB  serve --asr 1 (see Whisper section)

Run flm list (i.e. docker run --rm fastflowlm list) for the complete list.

Speech-to-text with Whisper

FastFlowLM also supports Whisper V3 Turbo for NPU-accelerated speech recognition. Whisper is not an LLM — it uses the --asr flag instead of being run directly.

Whisper demo — JFK Moon speech transcribed on NPU

A 30-second sample of JFK's 1962 Rice University "We choose to go to the Moon" speech is included in this repo (source: Wikimedia Commons, public domain, trimmed to 30s at the 9-minute mark).

API server mode

# Start the server with Whisper (+ an LLM for multimodal use)
docker run -d --rm \
  --device=/dev/accel/accel0 \
  --ulimit memlock=-1:-1 \
  -v ~/.config/flm:/root/.config/flm \
  -p 52625:52625 \
  fastflowlm serve gemma3:1b --asr 1

# Transcribe an audio file (OpenAI-compatible API)
curl -s http://localhost:52625/v1/audio/transcriptions \
  -F "file=@sample_audio.ogg" \
  -F "model=whisper-v3:turbo" | python3 -m json.tool

Supported audio formats: .wav, .mp3, .ogg, .m4a, .flac — anything FFmpeg can decode.

On a Ryzen AI 9 HX 370, a 30-second clip transcribes in ~5 seconds; the full 17-minute speech transcribes in ~2.5 minutes — all on the NPU.

Interactive CLI mode

# Start a chat session with Whisper enabled
docker run -it --rm \
  --device=/dev/accel/accel0 \
  --ulimit memlock=-1:-1 \
  -v ~/.config/flm:/root/.config/flm \
  fastflowlm run gemma3:1b --asr 1

Then inside the chat, use /input "path/to/audio.mp3" summarize it to transcribe and discuss audio files with the LLM.

How it works

The Dockerfile uses a 3-stage build:

  1. XRT builder: Builds XRT base and NPU plugin from the xdna-driver source (no PPA dependency)
  2. FLM builder: Installs build dependencies (cmake, ninja, Rust, Boost, FFmpeg, FFTW3), clones FastFlowLM, and compiles against the source-built XRT
  3. Runtime stage: Copies only the built binary, NPU kernel libraries (.so), and xclbin files into a minimal Ubuntu image with runtime dependencies
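The stages sketch out roughly as follows. This is an illustrative skeleton with placeholder repos, paths, and build commands, not the repository's actual Dockerfile:

```dockerfile
# Illustrative 3-stage skeleton (placeholder tags, paths, and build steps)

# 1. XRT builder: compile XRT and the NPU plugin from the xdna-driver sources
FROM ubuntu:24.04 AS xrt-builder
RUN git clone --recursive https://github.com/amd/xdna-driver.git \
 && cd xdna-driver && ./build.sh          # placeholder build step

# 2. FLM builder: build FastFlowLM against the source-built XRT
FROM ubuntu:24.04 AS flm-builder
COPY --from=xrt-builder /opt/xilinx/xrt /opt/xilinx/xrt
RUN git clone https://github.com/FastFlowLM/FastFlowLM.git \
 && cd FastFlowLM && cmake -G Ninja . && ninja   # placeholder build step

# 3. Runtime: copy only the binary, NPU kernel .so files, and xclbin files
FROM ubuntu:24.04
COPY --from=flm-builder /usr/local/bin/flm /usr/local/bin/flm
ENTRYPOINT ["flm"]
```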

The container accesses the NPU via --device=/dev/accel/accel0. The host kernel's amdxdna driver handles the actual hardware communication.

Troubleshooting

flm validate shows no NPU: Make sure you passed --device=/dev/accel/accel0 and the host has the amdxdna driver loaded.

Permission denied on /dev/accel/accel0: Check device permissions on the host (ls -la /dev/accel/accel0). You may need to add your user to the render group or run the container with --group-add render.

Low memlock limit: The NPU needs a high memlock limit. Pass --ulimit memlock=-1:-1 to Docker, or set unlimited memlock in /etc/security/limits.conf on the host and reboot.
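To confirm the flag actually takes effect, print the limit from inside a throwaway container (ubuntu:24.04 here is just a convenient stock image, not part of this repo):

```shell
# Should print "unlimited" when --ulimit memlock=-1:-1 is honored
docker run --rm --ulimit memlock=-1:-1 ubuntu:24.04 bash -c 'ulimit -l' \
  || echo "docker not available"
```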

XRT version mismatch on the host: The PPA's XRT (2.20.0) may not work with newer kernels (6.17+). If amdxdna fails to load, build the xdna-driver from source on the host. Note: the Docker image already builds XRT from source internally, so this only affects the host-side kernel driver.

NPU4 / AIE2P firmware (e.g. Ryzen AI 7 PRO 360): Kernel 6.17 may require protocol-specific firmware under /usr/lib/firmware/amdnpu/17f0_10/. If flm validate fails, check that the firmware files are present and symlinked correctly — see issue #1 for a detailed walkthrough.

License

The Dockerfile itself is MIT licensed. FastFlowLM and AMD's libraries have their own licenses — see their respective repositories.
