Intel B70 Llama.cpp & vLLM Toolboxes

This project provides pre-built containers (“toolboxes”) for running LLMs on Intel Arc B70 (and other modern Intel GPUs) using llama.cpp and vLLM. Toolbx is the standard developer container system in Fedora (and works on Ubuntu, openSUSE, Arch, etc).

📊 Interactive Benchmarks: Live performance results are available at kyuz0.github.io/intel-b70-ai-toolboxes/.

Interactive Benchmarks

Interactive performance benchmark results for different backends, models, and quantization levels running on the Intel Arc B70 GPU are published at:

👉 Intel Arc B70 LLM Benchmarks

Supported Toolboxes

You can check the containers on DockerHub: kyuz0/intel-b70-ai-toolboxes.

Container / Repo	Backend/Stack	Purpose / Notes
`kyuz0/intel-b70-ai-toolboxes:sycl`	Intel oneAPI SYCL	Native Intel backend for llama.cpp. Fastest generation performance, utilizes Level Zero. Requires Intel oneAPI Base Toolkit components installed inside the container.
`kyuz0/intel-b70-ai-toolboxes:vulkan`	Vulkan (Mesa/Intel)	Universal backend for llama.cpp using Vulkan. Recommended for compatibility across different host setups and older Intel hardware.
`kyuz0/intel-b70-ai-toolboxes:openvino`	Intel OpenVINO	OpenVINO backend for llama.cpp. Translates GGML graphs into OpenVINO for Intel-optimized inference on CPUs and GPUs. Auto-configured to target the discrete GPU with stateful KV cache.
`kyuz0/intel-b70-vllm-toolbox:latest`	Intel vLLM Scaler	Official Intel vLLM stack optimized for Arc Pro B70, featuring an interactive TUI launcher (`start-vllm`).

The Llama.cpp containers are automatically rebuilt whenever the Llama.cpp master branch is updated. The vLLM container can be rebuilt using the provided GitHub action.

Quick Start

Create and enter your toolbox of choice. (Ubuntu users: remember to use distrobox instead of toolbox in the commands below).

Option A: Vulkan (Intel ANV) - best for compatibility

toolbox create b70-llama-vulkan \
  --image docker.io/kyuz0/intel-b70-ai-toolboxes:vulkan \
  -- --device /dev/dri --group-add video --group-add render --security-opt seccomp=unconfined

toolbox enter b70-llama-vulkan

Option B: SYCL (Native Intel) - best for performance

toolbox create b70-llama-sycl \
  --image docker.io/kyuz0/intel-b70-ai-toolboxes:sycl \
  -- --device /dev/dri --group-add video --group-add render --group-add sudo --security-opt seccomp=unconfined

toolbox enter b70-llama-sycl

Option C: OpenVINO (Intel) - graph-compiled Intel inference

toolbox create b70-llama-openvino \
  --image docker.io/kyuz0/intel-b70-ai-toolboxes:openvino \
  -- --device /dev/dri --group-add video --group-add render --security-opt seccomp=unconfined

toolbox enter b70-llama-openvino

Option D: vLLM (Intel Scaler) - best for high-throughput serving

toolbox create b70-vllm \
  --image docker.io/kyuz0/intel-b70-vllm-toolbox:latest \
  -- --device /dev/dri --shm-size 200g --security-opt seccomp=unconfined --env no_proxy=localhost,127.0.0.1

toolbox enter b70-vllm

Tip: You can also use the included ./refresh-toolboxes.sh [all|b70-llama-vulkan|b70-llama-sycl|b70-llama-openvino|b70-vllm] script to automate the container pulling and creation process.

OpenVINO Notes: Although an OpenVINO backend is available in the toolboxes, it is not recommended and is excluded from standard benchmarking. It fails to run with most modern models (especially Mixture-of-Experts/MoE architectures, which trigger OpenCL compilation crashes with CL_OUT_OF_RESOURCES errors on the GPU). If used, the container auto-exports GGML_OPENVINO_DEVICE=GPU and GGML_OPENVINO_STATEFUL_EXECUTION=1 on entry. When benchmarking with llama-bench, you must pass -fa 1 (flash attention) — this is an upstream requirement for the OpenVINO backend.

⚠️ VRAM reporting: llama-cli --list-devices will report system RAM (~64 GiB) instead of GPU VRAM (32 GiB). This is an upstream llama.cpp limitation — the OpenVINO backend does not query device VRAM. Use gguf-vram-estimator.py to check if a model fits in GPU memory before loading.

2. Check GPU Access

Inside the toolbox:

# For SYCL / vLLM
llama-cli --list-devices
# or
sycl-ls

3. Run Inference

For Llama.cpp toolboxes: Download your GGUF models and run them natively.

Server Mode (API):

llama-server -m models/your-model.gguf -c 8192 -ngl 999

CLI Mode:

llama-cli -ngl 999 -m models/your-model.gguf -p "Write a haiku about Intel graphics."

For vLLM toolbox: The vLLM toolbox comes with an interactive TUI. Simply run:

start-vllm

Host Configuration

Ensure you are running an up-to-date kernel (6.8+) for the best Intel GPU driver support (i915 or xe drivers).

For some advanced hardware scheduling features or to enable GuC/HuC firmware on older kernels, you may need to add the following to your GRUB boot parameters: i915.enable_guc=3 or intel_iommu=on depending on the hardware platform (not always necessary for B70 which uses xe out of the box in newer kernels).

Memory Planning and VRAM Estimator

To estimate VRAM requirements for models (including context overhead), use the included tool:

gguf-vram-estimator.py models/my-model.gguf --contexts 32768

Building Locally

You can build the containers yourself to customize packages or llama.cpp versions.

cd toolboxes
docker build -t llama-sycl -f Dockerfile.sycl .

Name		Name	Last commit message	Last commit date
Latest commit History 42 Commits
.github/workflows		.github/workflows
benchmark		benchmark
docs		docs
toolboxes		toolboxes
.gitignore		.gitignore
AGENTS.md		AGENTS.md
README.md		README.md
refresh-toolboxes.sh		refresh-toolboxes.sh

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Intel B70 Llama.cpp & vLLM Toolboxes

Table of Contents

Interactive Benchmarks

Supported Toolboxes

Quick Start

2. Check GPU Access

3. Run Inference

Host Configuration

Memory Planning and VRAM Estimator

Building Locally

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Intel B70 Llama.cpp & vLLM Toolboxes

Table of Contents

Interactive Benchmarks

Supported Toolboxes

Quick Start

2. Check GPU Access

3. Run Inference

Host Configuration

Memory Planning and VRAM Estimator

Building Locally

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages