This project provides pre-built containers (“toolboxes”) for running LLMs on Intel Arc B70 (and other modern Intel GPUs) using llama.cpp and vLLM. Toolbx is the standard developer container system in Fedora (and works on Ubuntu, openSUSE, Arch, etc.).
📊 Interactive Benchmarks: Live performance results for different backends, models, and quantization levels running on the Intel Arc B70 GPU are published at kyuz0.github.io/intel-b70-ai-toolboxes/.
You can check the containers on DockerHub: kyuz0/intel-b70-ai-toolboxes.
| Container / Repo | Backend/Stack | Purpose / Notes |
|---|---|---|
| kyuz0/intel-b70-ai-toolboxes:sycl | Intel oneAPI SYCL | Native Intel backend for llama.cpp. Fastest generation performance, utilizes Level Zero. Requires Intel oneAPI Base Toolkit components installed inside the container. |
| kyuz0/intel-b70-ai-toolboxes:vulkan | Vulkan (Mesa/Intel) | Universal backend for llama.cpp using Vulkan. Recommended for compatibility across different host setups and older Intel hardware. |
| kyuz0/intel-b70-ai-toolboxes:openvino | Intel OpenVINO | OpenVINO backend for llama.cpp. Translates GGML graphs into OpenVINO for Intel-optimized inference on CPUs and GPUs. Auto-configured to target the discrete GPU with stateful KV cache. |
| kyuz0/intel-b70-vllm-toolbox:latest | Intel vLLM Scaler | Official Intel vLLM stack optimized for Arc Pro B70, featuring an interactive TUI launcher (start-vllm). |
The llama.cpp containers are automatically rebuilt whenever the llama.cpp master branch is updated. The vLLM container can be rebuilt using the provided GitHub Action.
Create and enter your toolbox of choice. (Ubuntu users: remember to use distrobox instead of toolbox in the commands below).
Option A: Vulkan (Intel ANV) - best for compatibility
toolbox create b70-llama-vulkan \
--image docker.io/kyuz0/intel-b70-ai-toolboxes:vulkan \
-- --device /dev/dri --group-add video --group-add render --security-opt seccomp=unconfined
toolbox enter b70-llama-vulkan
Option B: SYCL (Native Intel) - best for performance
toolbox create b70-llama-sycl \
--image docker.io/kyuz0/intel-b70-ai-toolboxes:sycl \
-- --device /dev/dri --group-add video --group-add render --group-add sudo --security-opt seccomp=unconfined
toolbox enter b70-llama-sycl
Option C: OpenVINO (Intel) - graph-compiled Intel inference
toolbox create b70-llama-openvino \
--image docker.io/kyuz0/intel-b70-ai-toolboxes:openvino \
-- --device /dev/dri --group-add video --group-add render --security-opt seccomp=unconfined
toolbox enter b70-llama-openvino
Option D: vLLM (Intel Scaler) - best for high-throughput serving
toolbox create b70-vllm \
--image docker.io/kyuz0/intel-b70-vllm-toolbox:latest \
-- --device /dev/dri --shm-size 200g --security-opt seccomp=unconfined --env no_proxy=localhost,127.0.0.1
toolbox enter b70-vllm
Tip: You can also use the included ./refresh-toolboxes.sh [all|b70-llama-vulkan|b70-llama-sycl|b70-llama-openvino|b70-vllm] script to automate the container pulling and creation process.
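For Ubuntu hosts, the toolbox commands above need translating to distrobox. The sketch below shows one possible translation of Option A. It assumes that distrobox's `--name`, `--image`, and `--additional-flags` options forward the device and group flags to the container engine in the same way as the arguments after `--` in `toolbox create`; this has not been checked against this project's scripts. The helper name `distrobox_create_cmd` is hypothetical, and it only prints the command so it can be reviewed before running.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the distrobox equivalent of a `toolbox create` call.
# Usage: distrobox_create_cmd <container-name> <image> <extra engine flags...>
distrobox_create_cmd() {
    local name="$1" image="$2"
    shift 2
    # "$*" joins the remaining flags into one string for --additional-flags.
    printf 'distrobox create --name %s --image %s --additional-flags "%s"\n' \
        "$name" "$image" "$*"
}

# Option A (Vulkan), using the same image and flags as the toolbox command.
distrobox_create_cmd b70-llama-vulkan \
    docker.io/kyuz0/intel-b70-ai-toolboxes:vulkan \
    --device /dev/dri --group-add video --group-add render \
    --security-opt seccomp=unconfined

# After reviewing and running the printed command, enter the container with:
#   distrobox enter b70-llama-vulkan
```

The other options should translate the same way by swapping in the container name, image, and flags from the corresponding toolbox command.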
OpenVINO Notes: Although an OpenVINO backend is available in the toolboxes, it is not recommended and is excluded from standard benchmarking. It fails to run with most modern models, especially Mixture-of-Experts (MoE) architectures, which trigger OpenCL compilation crashes with CL_OUT_OF_RESOURCES errors on the GPU. If used, the container auto-exports GGML_OPENVINO_DEVICE=GPU and GGML_OPENVINO_STATEFUL_EXECUTION=1 on entry. When benchmarking with llama-bench, you must pass -fa 1 (flash attention); this is an upstream requirement for the OpenVINO backend.
⚠️ VRAM reporting: llama-cli --list-devices will report system RAM (~64 GiB) instead of GPU VRAM (32 GiB). This is an upstream llama.cpp limitation: the OpenVINO backend does not query device VRAM. Use gguf-vram-estimator.py to check if a model fits in GPU memory before loading.
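Because the reported device memory cannot be trusted here, a quick sanity check on the file size can rule out models that are obviously too large. The following is a minimal sketch, not a replacement for gguf-vram-estimator.py: it only compares the GGUF file size against the 32 GiB of VRAM minus some headroom. The helper name `fits_in_vram` and the 4 GiB headroom for context are assumptions chosen for illustration, not measured figures.

```shell
#!/usr/bin/env bash
# Rough lower-bound check: a GGUF file larger than the VRAM budget cannot fit.
# Hypothetical helper; the headroom value is an assumption, not a measurement.
# Usage: fits_in_vram <model-size-bytes> <vram-GiB> <headroom-GiB>
fits_in_vram() {
    local size_bytes="$1" vram_gib="$2" headroom_gib="$3"
    local budget_bytes=$(( (vram_gib - headroom_gib) * 1024 * 1024 * 1024 ))
    [ "$size_bytes" -le "$budget_bytes" ]
}

model="${1:-models/your-model.gguf}"
if [ -f "$model" ]; then
    # Arithmetic expansion strips any padding that wc may add.
    size=$(( $(wc -c < "$model") ))
    if fits_in_vram "$size" 32 4; then
        echo "$model: weights are within the 32 GiB budget (4 GiB reserved for context)"
    else
        echo "$model: weights alone exceed the budget and will not fit entirely on the GPU"
    fi
fi
```

A model that passes this check can still run out of memory at large context sizes, which is what the estimator script accounts for.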
Inside the toolbox:
# For SYCL / vLLM
llama-cli --list-devices
# or
sycl-ls
For llama.cpp toolboxes: Download your GGUF models and run them natively.
Server Mode (API):
llama-server -m models/your-model.gguf -c 8192 -ngl 999
CLI Mode:
llama-cli -ngl 999 -m models/your-model.gguf -p "Write a haiku about Intel graphics."
For vLLM toolbox: The vLLM toolbox comes with an interactive TUI. Simply run:
start-vllm
Ensure you are running an up-to-date kernel (6.8+) for the best Intel GPU driver support (i915 or xe drivers).
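The kernel recommendation can be checked on the host with a short script. This is a sketch under simple assumptions: the helper name `kernel_at_least` is hypothetical, it compares only the major and minor fields of `uname -r`, and it detects the driver by looking for the module directory under /sys/module.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed if a kernel release is >= a minimum "major.minor".
# Usage: kernel_at_least <min-major.minor> <release, e.g. "6.8.0-45-generic">
kernel_at_least() {
    local min="$1" release="$2"
    local min_major="${min%%.*}" min_minor="${min#*.}"
    local major="${release%%.*}"
    local rest="${release#*.}"
    local minor="${rest%%[!0-9]*}"   # keep only the leading digits of the minor field
    [ "$major" -gt "$min_major" ] ||
        { [ "$major" -eq "$min_major" ] && [ "$minor" -ge "$min_minor" ]; }
}

if kernel_at_least 6.8 "$(uname -r)"; then
    echo "kernel $(uname -r): meets the 6.8+ recommendation"
else
    echo "kernel $(uname -r): older than 6.8, consider upgrading"
fi

# Report which Intel GPU kernel module is loaded (i915 or xe).
for module in i915 xe; do
    if [ -d "/sys/module/$module" ]; then
        echo "driver loaded: $module"
    fi
done
```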
For some advanced hardware scheduling features or to enable GuC/HuC firmware on older kernels, you may need to add the following to your GRUB boot parameters:
i915.enable_guc=3 or intel_iommu=on depending on the hardware platform (not always necessary for B70 which uses xe out of the box in newer kernels).
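Before editing the boot configuration, it may help to see whether either parameter is already active. The sketch below reads the running kernel's command line from /proc/cmdline; the helper name `cmdline_has` is hypothetical, and it matches whole parameters only.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed if a kernel command line contains an exact parameter.
# Usage: cmdline_has <parameter> <command-line string>
cmdline_has() {
    local param="$1" cmdline="$2"
    # Pad with spaces so only whole, space-delimited parameters match.
    case " $cmdline " in
        *" $param "*) return 0 ;;
        *) return 1 ;;
    esac
}

if [ -r /proc/cmdline ]; then
    current="$(cat /proc/cmdline)"
    for param in i915.enable_guc=3 intel_iommu=on; do
        if cmdline_has "$param" "$current"; then
            echo "set:     $param"
        else
            echo "not set: $param"
        fi
    done
fi
```

How a missing parameter gets added depends on the distribution. On Fedora, `sudo grubby --update-kernel=ALL --args="i915.enable_guc=3"` is the usual route; on Ubuntu, the parameter typically goes into GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub, followed by `sudo update-grub`. A reboot is needed either way.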
To estimate VRAM requirements for models (including context overhead), use the included tool:
gguf-vram-estimator.py models/my-model.gguf --contexts 32768
You can build the containers yourself to customize packages or llama.cpp versions.
cd toolboxes
docker build -t llama-sycl -f Dockerfile.sycl .
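To build more than one backend, the same command can be generated per Dockerfile. This sketch assumes the other Dockerfiles in the directory follow the same `Dockerfile.<backend>` naming as Dockerfile.sycl, which the text above does not confirm; the helper name `tag_for_dockerfile` and the `llama-<backend>` tag scheme are hypothetical and simply mirror the example. It prints the commands rather than running them.

```shell
#!/usr/bin/env bash
# Hypothetical helper: derive an image tag from a Dockerfile name,
# mirroring the Dockerfile.sycl -> llama-sycl pairing used above.
tag_for_dockerfile() {
    local file="${1##*/}"          # drop any leading directory
    printf 'llama-%s\n' "${file#Dockerfile.}"
}

# Print (not run) a build command for each Dockerfile.<backend> in this directory.
for dockerfile in Dockerfile.*; do
    [ -e "$dockerfile" ] || continue   # skip the literal pattern when nothing matches
    echo "docker build -t $(tag_for_dockerfile "$dockerfile") -f $dockerfile ."
done
```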