AI Workloads

Colima provides an ideal environment to run GPU-powered AI workloads on Apple Silicon devices.

By leveraging Krunkit for GPU access, AI models run in isolated, secure containers.

Requirements

  • Apple Silicon Mac (M1, M2, M3, or newer)
  • macOS 13 or newer
  • Sufficient RAM for your chosen model (8GB+ recommended)
  • Krunkit installed (see below)

Installing Krunkit

Krunkit is required for GPU access on Apple Silicon. Install it via Homebrew:

brew tap slp/krunkit && brew install krunkit

Getting Started

1. Start Colima with Krunkit

The krunkit VM type enables GPU access for AI workloads:

colima start --vm-type krunkit

2. Run a Model

Run an AI model:

# Start an interactive chat session
colima model run gemma3

# Or run a one-off prompt
colima model run gemma3 "Explain quantum computing in simple terms"

This downloads the model (if not cached) and either starts an interactive chat session or returns the response for a one-off prompt.

Model Runners

Colima supports two model runner backends:

  • Docker Model Runner (default) - Uses Docker’s native AI model runner. Supports Docker AI Registry and HuggingFace.
  • Ramalama - Uses Ramalama for model execution. Supports HuggingFace and Ollama registries.

To switch between runners:

# Use docker runner (default)
colima model run gemma3 --runner docker

# Use ramalama runner
colima model run gemma3 --runner ramalama

The runner can also be configured in the Colima config file. See Configuration - AI Workloads for details.
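
As a rough sketch of what that config entry might look like (the key names below are assumptions, not confirmed syntax; treat Configuration - AI Workloads as the authoritative reference):

```yaml
# Hypothetical excerpt from the Colima config file
# (key names are assumptions; consult the configuration reference)
model:
  runner: ramalama
```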

Supported Model Registries

Colima supports models from multiple registries:

Docker AI Registry (Default)

NOTE: Docker AI Registry is only available with the Docker Model Runner.

The Docker AI Registry provides curated, optimized models for local inference. When no prefix is specified, Docker AI Registry is used by default.

# Run models from Docker AI Registry (default, no prefix needed)
colima model run gemma3
colima model run llama3.2
colima model run qwen2.5
colima model run phi4
colima model run mistral

Browse available models at hub.docker.com/u/ai.

HuggingFace Hub

HuggingFace hosts thousands of open-source models. The prefix syntax differs by runner:

# Docker Model Runner uses hf.co/ prefix
colima model run hf.co/microsoft/Phi-3-mini-4k-instruct-gguf

# Ramalama uses hf:// prefix
colima model run hf://microsoft/Phi-3-mini-4k-instruct-gguf --runner ramalama

Browse models at huggingface.co/models.

Ollama Registry

NOTE: Ollama Registry is only available with the Ramalama runner.

Ollama provides a curated collection of popular open-source models optimized for local inference. Use the ollama:// prefix.

# Run Ollama models with ollama:// prefix (requires ramalama runner)
colima model run ollama://gemma3 --runner ramalama
colima model run ollama://llama3.2 --runner ramalama

Browse available models at ollama.com/library.

Available Commands

Running Models

Without a prompt, you enter an interactive chat session for continuous conversation:

# Start an interactive chat session
colima model run gemma3

For one-off prompts, specify the prompt after the model name:

# One-off prompt
colima model run gemma3 "What is the capital of France?"

More examples:

# Run different models from Docker AI Registry
colima model run llama3.2
colima model run qwen2.5
colima model run phi4

# Run a HuggingFace model (requires hf.co/ prefix)
colima model run hf.co/microsoft/Phi-3-mini-4k-instruct-gguf

# See all available options
colima model --help

Serving Models

Serve a model with a web-based chat interface:

# Serve a model (available at localhost:8080)
colima model serve gemma3

# Or serve a HuggingFace model
colima model serve hf.co/microsoft/Phi-3-mini-4k-instruct-gguf

# Serve on a custom port
colima model serve gemma3 --port 9000

The chat interface will be available at http://localhost:8080 by default. Use --port to specify a different port.
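
Since the Open WebUI section below points an OpenAI API base URL at this server, the served model evidently exposes an OpenAI-compatible API you can also query directly. A minimal sketch; the /v1/chat/completions path and the request schema are assumptions based on the OpenAI convention, so adjust if your server differs:

```shell
# JSON payload following the OpenAI chat completions schema
# (schema and endpoint path are assumptions based on the OpenAI convention)
payload='{"model": "gemma3", "messages": [{"role": "user", "content": "Say hello"}]}'

# Send it to the server started by `colima model serve`;
# prints a notice instead of failing if the server is not running
curl -s http://localhost:8080/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d "$payload" || echo "model server not reachable on localhost:8080"
```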

Open WebUI

Open WebUI is a feature-rich, self-hosted web interface for AI models. You can use it with Colima by serving a model and pointing Open WebUI at the served endpoint.

First, serve a model:

colima model serve gemma3

Then run Open WebUI in Docker, specifying the Colima model server as the OpenAI API base URL:

docker run --rm -it -p 3000:8080 \
  -e OPENAI_API_BASE_URL=http://host.docker.internal:8080 \
  -v open-webui:/app/backend/data \
  --name open-webui \
  ghcr.io/open-webui/open-webui:main

Open WebUI will be available at http://localhost:3000.

With web search enabled:

docker run --rm -it -p 3000:8080 \
  -e OPENAI_API_BASE_URL=http://host.docker.internal:8080 \
  -e ENABLE_WEB_SEARCH=true \
  -v open-webui:/app/backend/data \
  --name open-webui \
  ghcr.io/open-webui/open-webui:main

Note: host.docker.internal allows the Docker container to access services running on the host machine (in this case, the Colima model server on port 8080).

Resource Management

One of the key benefits of running AI workloads in Colima is the ability to control resource usage. This prevents models from consuming all available system resources.

CPU and Memory Limits

Allocate specific resources to your AI instance:

# Start with 4 CPUs and 8GB RAM
colima start --vm-type krunkit --cpus 4 --memory 8

Disk Space

Set the disk size for model storage:

# Start with 50GB disk
colima start --vm-type krunkit --disk 50

Model Memory Requirements

Model Size      Minimum RAM   Recommended RAM   Examples
Tiny (1-2B)     4GB           8GB               TinyLlama (1.1B), Gemma 2B
Small (3-4B)    8GB           12GB              Phi-3 Mini (3.8B), Gemma 3
Medium (7-8B)   12GB          16GB              Llama 3.2 (8B), Mistral 7B
Large (13B+)    16GB          32GB              Phi-4 (14B), Llama 13B, Mixtral

Tip: Check the model page on HuggingFace or Ollama for specific hardware requirements before running a model.

Note: TinyLlama is useful for validating your setup but has limited capabilities. For meaningful results, use models with 3B+ parameters.

Using Profiles for AI

Create a dedicated profile for AI workloads to keep them separate from your container development:

# Create an AI-specific profile
colima start ai --vm-type krunkit --cpus 4 --memory 16 --disk 50

# Setup and run models on the AI profile
colima model setup -p ai
colima model run gemma3 -p ai

# Your main development profile remains unaffected
colima start  # default profile with Docker

This allows you to:

  • Keep AI resources isolated from development containers
  • Quickly switch between AI and development workloads
  • Delete AI resources without affecting your development environment

Security

Colima’s AI workloads benefit from multiple layers of security:

Container Isolation

Models run inside containerized environments, isolated from:

  • Your host filesystem
  • Other running containers
  • Network access (unless explicitly enabled)

Resource Limits

Unlike running models directly on your host, Colima enforces:

  • CPU limits to prevent system freeze
  • Memory limits to prevent OOM conditions
  • Disk quotas to manage storage

Extra Isolation

While mounted host volumes are not accessible to AI containers by default, you can take security a step further by disabling volume mounts entirely:

colima start --vm-type krunkit --mount=none

This ensures no host filesystem access is possible from within the VM, providing maximum isolation for untrusted models.

Troubleshooting

Model Download Issues

If a model fails to download:

# Check available disk space
colima ssh -- df -h

# Increase disk if needed (stop, then start with larger disk)
colima stop
colima start --vm-type krunkit --disk 100

GPU Not Detected

Ensure you’re using the krunkit VM type:

# Check current VM type
colima status

# Restart with krunkit
colima stop
colima start --vm-type krunkit

Out of Memory

If models crash or run slowly:

# Stop current instance
colima stop

# Start with more memory
colima start --vm-type krunkit --memory 16

Resetting AI Setup

To completely reset your AI environment:

# Delete the profile and its data
colima delete --data

# Start fresh
colima start --vm-type krunkit
colima model setup