🤖 This entire project — Dockerfile, README, diagnosis, build process, and testing — was generated by Claude Opus 4.6 running inside GitHub Copilot CLI. A human provided the hardware and the goal ("run an LLM on my NPU"); the AI figured out the rest, including discovering that building FastFlowLM from source on Linux works despite no official support yet.
Run large language models on your AMD Ryzen AI NPU under Linux, using FastFlowLM inside Docker.
As of March 2026, running LLMs on AMD's XDNA2 NPU under Linux is not
straightforward. AMD's official Ryzen AI 1.7 stack is missing a critical
shared library (onnxruntime_providers_ryzenai.so) on Linux
(amd/RyzenAI-SW#333),
and FastFlowLM doesn't yet ship Linux binaries in releases
(FastFlowLM#381).
This Dockerfile builds FastFlowLM from source and packages everything needed into a minimal container that talks directly to the NPU.
Any AMD processor with an XDNA/XDNA2 NPU, including:
- Ryzen AI 9 HX 370/375 (Strix Point — XDNA2)
- Ryzen AI 9 HX 395 (Strix Halo — XDNA2)
- Ryzen AI 7 PRO 360 (NPU4 / AIE2P) — confirmed by community (#1)
- Ryzen AI Max / Max+ (Kraken Point)
- And other XDNA-based APUs
| Requirement | How to check |
|---|---|
| Linux kernel ≥ 6.11 with amdxdna driver | `lsmod \| grep amdxdna` |
| NPU device node | `ls -la /dev/accel/accel0` |
| NPU firmware ≥ 1.1.0.0 | `ls /lib/firmware/amdnpu/` |
| Docker installed | `docker --version` |
| memlock unlimited (recommended) | `ulimit -l` |
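The checks in the table can be rolled into a quick pre-flight script; a sketch (commands taken from the table above; adjust paths for your distro):

```bash
# Hypothetical pre-flight check: prints status, never exits non-zero.
echo "kernel: $(uname -r)"                       # want >= 6.11
grep -q amdxdna /proc/modules 2>/dev/null \
  && echo "amdxdna: loaded" || echo "amdxdna: NOT loaded"
[ -e /dev/accel/accel0 ] \
  && echo "NPU device: present" || echo "NPU device: missing"
[ -d /lib/firmware/amdnpu ] \
  && ls /lib/firmware/amdnpu/ || echo "amdnpu firmware: not found"
echo "memlock limit: $(ulimit -l)"               # want: unlimited
```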
The container builds XRT from source internally, so you only need the host-side kernel driver and firmware:
```bash
# Install the amdxdna kernel driver (from AMD's xdna-driver repo or PPA)
# See https://github.com/amd/xdna-driver for full instructions
sudo add-apt-repository ppa:amd-team/xrt
sudo apt update && sudo apt install libxrt-npu2

# Set memlock to unlimited (needs reboot)
echo -e "* soft memlock unlimited\n* hard memlock unlimited" | sudo tee -a /etc/security/limits.conf
sudo reboot
```

Note: If the PPA's XRT doesn't work with your kernel (e.g. kernel 6.17+), you may need to build the xdna-driver from source on the host. See Troubleshooting below.
```bash
git clone https://github.com/hpenedones/fastflowlm-docker.git
cd fastflowlm-docker
docker build -t fastflowlm .
```

The build takes ~15-25 minutes (XRT source build + Rust compilation + FLM C++ build). The resulting image is ~440MB (3-stage build, runtime-only).
```bash
# List available NPU models
docker run --rm fastflowlm list

# Download a model (mount cache so it persists)
docker run --rm -v ~/.config/flm:/root/.config/flm fastflowlm pull llama3.2:1b

# Chat with the model on your NPU
docker run -it --rm \
  --device=/dev/accel/accel0 \
  --ulimit memlock=-1:-1 \
  -v ~/.config/flm:/root/.config/flm \
  fastflowlm run llama3.2:1b
```
```bash
# Validate NPU setup
docker run --rm \
  --device=/dev/accel/accel0 \
  --ulimit memlock=-1:-1 \
  fastflowlm validate
```
```bash
# Run OpenAI-compatible API server on port 8000
docker run -d --rm \
  --device=/dev/accel/accel0 \
  --ulimit memlock=-1:-1 \
  -v ~/.config/flm:/root/.config/flm \
  -p 8000:8000 \
  fastflowlm serve
```

Measured on an AMD Ryzen AI 9 HX 370 (Strix Point) with 32GB LPDDR5x, Ubuntu 24.04, kernel 6.17, FLM v0.9.35, power mode: performance. Prompt: "Explain quantum computing in 200 words."
| Model | Params | TTFT | Prefill | Decode | Tokens |
|---|---|---|---|---|---|
| Qwen3 0.6B | 0.6B | 535 ms | 52.4 tok/s | 88.7 tok/s | 161 |
| LFM2 1.2B | 1.2B | 363 ms | 49.6 tok/s | 62.9 tok/s | 240 |
| Llama 3.2 1B | 1B | 460 ms | 95.9 tok/s | 60.1 tok/s | 271 |
| Qwen3 1.7B | 1.7B | 640 ms | 37.5 tok/s | 40.4 tok/s | 434 |
| Gemma3 1B | 1B | 550 ms | 34.6 tok/s | 37.9 tok/s | 201 |
| Llama 3.2 3B | 3B | 957 ms | 46.0 tok/s | 24.4 tok/s | 294 |
| Phi-4 Mini 4B | 4B | 926 ms | 11.9 tok/s | 20.0 tok/s | 935 |
| Qwen3 4B | 4B | 1040 ms | 23.1 tok/s | 18.7 tok/s | 551 |
- TTFT = Time To First Token (lower is better)
- Prefill = prompt processing speed
- Decode = token generation speed (the number you feel when chatting)
- Tokens = total tokens generated in the response
All inference runs entirely on the NPU — zero GPU or CPU compute for the model forward pass. The Qwen3 0.6B and Llama 3.2 1B models are the sweet spot for interactive use, delivering 60-89 tokens/s decode speed.
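A rough way to read these numbers: total response time ≈ TTFT + tokens ÷ decode speed. Plugging in the Llama 3.2 1B row from the table above:

```bash
# TTFT (s) + tokens / decode speed (tok/s) for the Llama 3.2 1B row
awk 'BEGIN { printf "Llama 3.2 1B: ~%.1f s end-to-end\n", 0.460 + 271/60.1 }'
# prints: Llama 3.2 1B: ~5.0 s end-to-end
```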
As of FLM v0.9.35, these models run on the NPU:
| Model | Size | Command |
|---|---|---|
| Llama 3.2 1B | ~1.2 GB | run llama3.2:1b |
| Llama 3.2 3B | ~1.8 GB | run llama3.2:3b |
| Llama 3.1 8B | ~4.5 GB | run llama3.1:8b |
| Qwen3 0.6B | ~0.4 GB | run qwen3:0.6b |
| Qwen3 1.7B | ~1.0 GB | run qwen3:1.7b |
| Qwen3 4B | ~2.3 GB | run qwen3:4b |
| Gemma3 1B | ~0.7 GB | run gemma3:1b |
| DeepSeek-R1 8B | ~4.5 GB | run deepseek-r1:8b |
| Phi-4 Mini 4B | ~2.3 GB | run phi4-mini-it:4b |
| Whisper V3 Turbo | ~0.6 GB | serve --asr 1 (see Whisper section) |
Run `flm list` for the complete list.
FastFlowLM also supports Whisper V3 Turbo for NPU-accelerated speech recognition.
Whisper is not an LLM — it uses the --asr flag instead of being run directly.
A 30-second sample of JFK's 1962 Rice University "We choose to go to the Moon" speech is included in this repo (source: Wikimedia Commons, public domain, trimmed to 30s at the 9-minute mark).
```bash
# Start the server with Whisper (+ an LLM for multimodal use)
docker run -d --rm \
  --device=/dev/accel/accel0 \
  --ulimit memlock=-1:-1 \
  -v ~/.config/flm:/root/.config/flm \
  -p 52625:52625 \
  fastflowlm serve gemma3:1b --asr 1

# Transcribe an audio file (OpenAI-compatible API)
curl -s http://localhost:52625/v1/audio/transcriptions \
  -F "file=@sample_audio.ogg" \
  -F "model=whisper-v3:turbo" | python3 -m json.tool
```

Supported audio formats: `.wav`, `.mp3`, `.ogg`, `.m4a`, `.flac` — anything FFmpeg can decode.
On a Ryzen AI 9 HX 370, a 30-second clip transcribes in ~5 seconds; the full 17-minute speech transcribes in ~2.5 minutes — all on the NPU.
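Those figures work out to roughly a 6-7x real-time factor (audio duration divided by transcription time); quick arithmetic from the numbers above:

```bash
# real-time factor = audio seconds / transcription seconds
awk 'BEGIN { printf "30 s clip:     %.1fx real-time\n", 30/5 }'
awk 'BEGIN { printf "17 min speech: %.1fx real-time\n", (17*60)/(2.5*60) }'
# prints 6.0x and 6.8x respectively
```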
```bash
# Start a chat session with Whisper enabled
docker run -it --rm \
  --device=/dev/accel/accel0 \
  --ulimit memlock=-1:-1 \
  -v ~/.config/flm:/root/.config/flm \
  fastflowlm run gemma3:1b --asr 1
```

Then inside the chat, use `/input "path/to/audio.mp3" summarize it` to transcribe and discuss audio files with the LLM.
The Dockerfile uses a 3-stage build:
- XRT builder: Builds XRT base and NPU plugin from the xdna-driver source (no PPA dependency)
- FLM builder: Installs build dependencies (cmake, ninja, Rust, Boost, FFmpeg, FFTW3), clones FastFlowLM, and compiles against the source-built XRT
- Runtime stage: Copies only the built binary, NPU kernel libraries (`.so`), and xclbin files into a minimal Ubuntu image with runtime dependencies
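In outline, the three stages look roughly like this (an illustrative skeleton, not the actual Dockerfile; stage names and install paths are assumptions):

```dockerfile
# Stage 1: build XRT + the NPU plugin from the xdna-driver source
FROM ubuntu:24.04 AS xrt-builder
# ... clone amd/xdna-driver, build and install XRT under /opt/xilinx/xrt ...

# Stage 2: build FastFlowLM against the source-built XRT
FROM ubuntu:24.04 AS flm-builder
COPY --from=xrt-builder /opt/xilinx/xrt /opt/xilinx/xrt
# ... install cmake/ninja/Rust/Boost/FFmpeg/FFTW3, clone FastFlowLM, build flm ...

# Stage 3: minimal runtime image
FROM ubuntu:24.04
COPY --from=flm-builder /usr/local/bin/flm /usr/local/bin/flm
# ... copy only the NPU kernel .so libraries and xclbin files + runtime deps ...
ENTRYPOINT ["flm"]
```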
The container accesses the NPU via --device=/dev/accel/accel0. The host
kernel's amdxdna driver handles the actual hardware communication.
`flm validate` shows no NPU: Make sure you passed `--device=/dev/accel/accel0` and the host has the amdxdna driver loaded.

Permission denied on `/dev/accel/accel0`: Check device permissions on the host (`ls -la /dev/accel/accel0`). You may need to add your user to the `render` group or run the container with `--group-add render`.

Low memlock limit: The NPU needs a high memlock limit. Pass `--ulimit memlock=-1:-1` to Docker, or set unlimited memlock in `/etc/security/limits.conf` on the host and reboot.
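To confirm the flag took effect, you can check the limit as seen from inside a throwaway container (a sketch; assumes Docker and the `ubuntu:24.04` image are available):

```bash
# The memlock limit the NPU runtime will actually see inside the container
docker run --rm --ulimit memlock=-1:-1 ubuntu:24.04 sh -c 'ulimit -l'
# should print: unlimited
```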
XRT version mismatch on the host: The PPA's XRT (2.20.0) may not work with newer kernels (6.17+). If amdxdna fails to load, build the xdna-driver from source on the host. Note: the Docker image already builds XRT from source internally, so this only affects the host-side kernel driver.

NPU4 / AIE2P firmware (e.g. Ryzen AI 7 PRO 360): Kernel 6.17 may require protocol-specific firmware under `/usr/lib/firmware/amdnpu/17f0_10/`. If `flm validate` fails, check that the firmware files are present and symlinked correctly — see issue #1 for a detailed walkthrough.
- FastFlowLM — the NPU LLM runtime
- AMD XDNA Driver — Linux NPU driver and XRT
The Dockerfile itself is MIT licensed. FastFlowLM and AMD's libraries have their own licenses — see their respective repositories.

