Clarifai provides an easy-to-use platform to serve AI/ML models in production.
This guide walks you through deploying models on Clarifai — from scaffolding a project to running predictions in production.
Three commands to go from zero to a deployed model:
pip install -U clarifai
# 1. Authenticate
clarifai login
# 2. Scaffold a model (auto-selects GPU based on model size)
clarifai model init --toolkit vllm --model-name Qwen/Qwen3-0.6B
# 3. Deploy to production (auto-creates all infrastructure)
clarifai model deploy ./Qwen3-0.6BThe CLI handles compute cluster creation, nodepool setup, image building, and deployment automatically. When it finishes you'll see:
── Ready ──────────────────────────────────────────────
Model deployed successfully!
Model: https://clarifai.com/you/main/models/qwen3-0-6b
Deployment: deploy-qwen3-0-6b-abc123
Instance: g5.xlarge
── Next Steps ─────────────────────────────────────────
Predict: clarifai model predict you/main/models/qwen3-0-6b "Hello"
Logs: clarifai model logs --deployment "deploy-qwen3-0-6b-abc123"
Undeploy: clarifai model undeploy --deployment "deploy-qwen3-0-6b-abc123"
| I want to... | Do this |
|---|---|
| Deploy a HuggingFace LLM (Qwen, Llama, Gemma, etc.) | clarifai model init --toolkit vllm --model-name <hf-repo> then clarifai model deploy |
| Deploy an MCP tool server | clarifai model init --toolkit mcp then write your tools |
| Wrap an existing API or local server | clarifai model init --toolkit openai or --toolkit ollama |
| Build a fully custom model (vision, embeddings, etc.) | clarifai model init then edit model.py — see Writing a Custom Model |
- Quick Start
- Installation & Authentication
- Model Development Workflow
- Writing a Custom Model
- Model Types
- Using the Python SDK Client
- Available Examples
- Supported Data Types
- Advanced Topics
pip install -U clarifai# Interactive (prompts for your PAT)
clarifai login
# Non-interactive
clarifai login --pat $MY_PAT
# Org account
clarifai login --pat $PAT --user-id my-org
# Dev environment
clarifai login https://api-dev.clarifai.comThe CLI saves your credentials locally. Verify with:
clarifai whoamiYou can manage multiple environments (prod, staging, org accounts) using named contexts:
clarifai config ls # List all contexts
clarifai config use <name> # Switch active contextNote: You can generate a PAT in your Clarifai account under Personal Settings → Security.
init → serve (optional) → deploy → predict → manage
scaffold test locally push to run status / logs /
project before deploying production inference undeploy
Generate a ready-to-deploy model project with a single command:
# Deploy a HuggingFace LLM with vLLM
clarifai model init --toolkit vllm --model-name Qwen/Qwen3-0.6B
# Deploy with SGLang
clarifai model init --toolkit sglang --model-name Qwen/Qwen2-7B
# Deploy a HuggingFace model directly
clarifai model init --toolkit huggingface --model-name google/gemma-2b
# Wrap a local Ollama model
clarifai model init --toolkit ollama --model-name llama3.1
# Create an MCP tool server
clarifai model init --toolkit mcp my-mcp-server
# Wrap an OpenAI-compatible API
clarifai model init --toolkit openai my-wrapper
# Blank Python model (full control)
clarifai model init my-modelAvailable toolkits:
| Toolkit | Use Case |
|---|---|
vllm |
High-throughput LLM serving with vLLM |
sglang |
Fast LLM serving with SGLang |
huggingface |
HuggingFace Transformers (direct inference) |
ollama |
Wrap a local Ollama model |
lmstudio |
Wrap a local LM Studio model |
mcp |
MCP tool server (FastMCP) |
openai |
OpenAI-compatible API wrapper |
python |
Blank Python model (default) |
What it creates:
my-model/
├── config.yaml # Simplified config (auto-filled at deploy time)
├── requirements.txt # Dependencies
└── 1/
└── model.py # Model implementation
Smart GPU selection: For HuggingFace models, the CLI queries the model's metadata (parameter count, quantization, architecture) and auto-selects the smallest GPU instance that fits:
$ clarifai model init --toolkit vllm --model-name Qwen/Qwen3-4B
Instance: g5.xlarge (Estimated 15.9 GiB VRAM, fits g5.xlarge (22 GiB))
Before deploying, test your model locally:
# Run in current Python environment (fastest)
clarifai model serve ./my-model
# Auto-create virtualenv and install deps
clarifai model serve ./my-model --mode env
# Build and run inside Docker (recommended for production parity)
clarifai model serve ./my-model --mode container
# Standalone gRPC server (no login required, offline development)
clarifai model serve ./my-model --grpc --port 9000What happens (default API-connected mode):
- Validates your
config.yaml - Starts the model server locally
- Registers with Clarifai so you can test via the Playground UI
- Displays a predict command and Playground URL
- Cleans up on Ctrl+C
| Option | Default | Description |
|---|---|---|
MODEL_PATH |
. |
Model directory containing config.yaml |
--mode |
none |
none (current env), env (auto-venv), container (Docker) |
--grpc |
off | Standalone gRPC server, no API connection needed |
--port |
8000 | Server port (with --grpc) |
--concurrency |
32 | Max concurrent requests |
--keep-image |
off | Keep Docker image after exit (container mode) |
Deploy your model to Clarifai cloud compute in one command. All infrastructure (compute cluster, nodepool, deployment) is created automatically:
# Deploy (uses instance from config.yaml)
clarifai model deploy ./my-model
# Override instance at deploy time
clarifai model deploy ./my-model --instance g6e.2xlarge
# GPU shorthands work too
clarifai model deploy ./my-model --instance a10g
# Specific cloud and region
clarifai model deploy ./my-model --instance g5.xlarge --cloud aws --region us-west-2
# Autoscaling
clarifai model deploy ./my-model --min-replicas 2 --max-replicas 10
# Deploy an already-uploaded model by URL
clarifai model deploy --model-url https://clarifai.com/user/app/models/id --instance g5.xlargeBrowse available instances:
$ clarifai list-instances
Instance Cloud Region GPUs Accelerator GPU Mem
──────────────────────────────────────────────────────────────────
g5.xlarge AWS us-east-1 1 NVIDIA-A10G 24 GiB
g6e.2xlarge AWS us-east-1 1 NVIDIA-L40S 48 GiB
t3a.2xlarge AWS us-east-1 0 CPU -
n1-standard-4-t4 GCP us-central1 1 NVIDIA-T4 16 GiB
...| Option | Default | Description |
|---|---|---|
MODEL_PATH |
. |
Local model directory |
--instance |
auto | Hardware instance (e.g., g5.xlarge, a10g) |
--model-url |
- | Deploy an already-uploaded model (skip upload) |
--min-replicas |
1 | Minimum running replicas |
--max-replicas |
5 | Maximum replicas for autoscaling |
--cloud |
auto | Cloud provider (aws, gcp, vultr) |
--region |
auto | Cloud region (e.g., us-east-1) |
Deploy phases:
── Validate ── Config validation, HuggingFace repo access check
── Upload ──── Build and push model image
── Deploy ──── Create compute cluster, nodepool, deployment
── Monitor ─── Stream pod events until ready
── Ready ───── Show model URL, predict command, status/log commands
Once deployed, run predictions directly from the CLI:
# Text input
clarifai model predict user/app/models/my-model "Hello world"
# Pipe from stdin
echo "Explain quantum computing" | clarifai model predict user/app/models/my-model
# Image/video/audio file
clarifai model predict user/app/models/detector --file photo.jpg
# Image URL
clarifai model predict user/app/models/detector --url https://example.com/img.jpg
# Named parameters
clarifai model predict user/app/models/llm -i prompt="Hello" -i max_tokens=200
# OpenAI chat mode
clarifai model predict openai/chat-completion/models/GPT-4 --chat "What is AI?"
# JSON output
clarifai model predict user/app/models/my-model "Hello" -o json
# Inspect available methods and their parameters
clarifai model predict user/app/models/my-model --info| Option | Description |
|---|---|
MODEL |
Model as user/app/models/id or full URL |
INPUT |
Text input (positional) |
--file PATH |
Local file (image, audio, video) |
--url URL |
Remote media URL |
-i KEY=VALUE |
Named parameter (repeatable) |
--chat TEXT |
OpenAI chat mode |
--info |
Show available methods, then exit |
-o text|json |
Output format (default: text) |
--deployment ID |
Route to a specific deployment |
Check status:
clarifai model status --deployment deploy-abc123
clarifai model status user/app/models/my-modelStream logs:
# Model stdout/stderr
clarifai model logs --deployment deploy-abc123
# Kubernetes scheduling/scaling events
clarifai model logs --deployment deploy-abc123 --log-type events
# Print current logs and exit
clarifai model logs --deployment deploy-abc123 --no-followRemove a deployment:
clarifai model undeploy --deployment deploy-abc123Most users don't need this section. If you're deploying a HuggingFace model,
clarifai model init --toolkit vllmgenerates everything for you. Read on if you want to customize the generated code or build a model from scratch.
my-model/
├── config.yaml # Model configuration
├── requirements.txt # Python dependencies
└── 1/
└── model.py # Model implementation
The simplified config requires only a few fields. Everything else is auto-filled at deploy time:
model:
id: "my-model"
build_info:
python_version: "3.11"
compute:
instance: g5.xlarge
checkpoints:
repo_id: Qwen/Qwen3-0.6B
type: huggingface
when: runtimeKey fields:
| Field | Required | Description |
|---|---|---|
model.id |
Yes | Model identifier |
model.model_type_id |
No | Defaults to any-to-any |
compute.instance |
Yes (for GPU) | Instance type (run clarifai list-instances to see options) |
build_info.python_version |
No | Defaults to 3.12 |
checkpoints.repo_id |
No | HuggingFace repo for auto-download |
checkpoints.when |
No | runtime (default), build, or upload |
Tip:
user_idandapp_idare resolved automatically from your CLI context — you don't need to put them in config.yaml.
Tip: For large models, use
when: runtimeto download checkpoints at runtime. This keeps Docker images small and speeds up builds.
Using a custom Docker base image (for vLLM, SGLang, etc.):
build_info:
image: "lmsysorg/sglang:latest"
checkpoints:
repo_id: Qwen/Qwen3-4B
type: huggingface
when: runtime
compute:
instance: g5.xlarge
model:
id: qwen3-4bYour model must define a class that inherits from one of the base classes (ModelClass, OpenAIModelClass, or MCPModelClass) and expose at least one method decorated with @<BaseClass>.method.
from typing import Iterator
from clarifai.runners.models.model_class import ModelClass
class MyModel(ModelClass):
def load_model(self):
"""Called once at startup. Load weights, initialize pipelines, etc."""
self.model = ...
@ModelClass.method
def predict(self, prompt: str = "", max_tokens: int = 256) -> str:
"""Single request → single response."""
return self.model(prompt, max_tokens=max_tokens)
@ModelClass.method
def generate(self, prompt: str = "") -> Iterator[str]:
"""Single request → streamed response."""
for token in self.model.stream(prompt):
yield tokenMethod types (determined by type hints):
| Pattern | Signature | Use Case |
|---|---|---|
| Unary | def predict(self, ...) -> str |
Standard request-response |
| Server streaming | def generate(self, ...) -> Iterator[str] |
LLM token streaming |
| Bidirectional streaming | def stream(self, input: Iterator[str]) -> Iterator[str] |
Real-time processing |
load_model() (optional): Use this for expensive one-time initialization — loading model weights, creating pipelines, downloading checkpoints. Called once when the container starts.
Testing: When you run clarifai model serve, it loads your model and you can test via the Playground UI or CLI predict.
List all Python dependencies your model needs:
torch==2.5.1
transformers
accelerate
Tip: If your model requires PyTorch, specify the version explicitly (e.g.,
torch==2.5.1ortorch==2.4.1). Clarifai provides optimized pre-built images for these versions that significantly speed up builds and reduce cold start times.
Clarifai provides three base classes. Pick the one that matches your use case:
| Base Class | When to Use |
|---|---|
ModelClass |
Custom Python inference — classification, detection, embeddings, any logic |
OpenAIModelClass |
LLM serving via OpenAI-compatible API (vLLM, SGLang, LMDeploy) |
MCPModelClass |
MCP tool servers for AI agents (FastMCP) |
For custom inference logic — classification, detection, embeddings, or any Python code.
from clarifai.runners.models.model_class import ModelClass
from clarifai.runners.utils.data_types import Image, Text
class ImageCaptioner(ModelClass):
def load_model(self):
from transformers import pipeline
self.pipe = pipeline("image-to-text", model="Salesforce/blip-image-captioning-base")
@ModelClass.method
def predict(self, image: Image) -> str:
pil_image = image.to_pil()
result = self.pipe(pil_image)
return result[0]["generated_text"]See: hello-world/ for a minimal example.
For models served via an OpenAI-compatible API (vLLM, SGLang, LMDeploy, etc.):
from typing import Iterator, List
from openai import OpenAI
from clarifai.runners.models.openai_class import OpenAIModelClass
from clarifai.runners.utils.data_utils import Param
from clarifai.runners.utils.openai_convertor import build_openai_messages
class MyLLM(OpenAIModelClass):
client = OpenAI(api_key="local-key", base_url="http://localhost:8000/v1")
model = client.models.list().data[0].id
def load_model(self):
"""Start the inference server (vLLM, SGLang, etc.)."""
pass
@OpenAIModelClass.method
def predict(
self,
prompt: str = "",
chat_history: List[dict] = None,
max_tokens: int = Param(default=256, description="Max tokens to generate."),
temperature: float = Param(default=1.0, description="Sampling temperature."),
) -> str:
messages = build_openai_messages(prompt, chat_history)
response = self.client.chat.completions.create(
model=self.model, messages=messages,
max_completion_tokens=max_tokens, temperature=temperature,
)
return response.choices[0].message.content
@OpenAIModelClass.method
def generate(
self,
prompt: str = "",
chat_history: List[dict] = None,
max_tokens: int = Param(default=256, description="Max tokens to generate."),
temperature: float = Param(default=1.0, description="Sampling temperature."),
) -> Iterator[str]:
messages = build_openai_messages(prompt, chat_history)
for chunk in self.client.chat.completions.create(
model=self.model, messages=messages,
max_completion_tokens=max_tokens, temperature=temperature, stream=True,
):
if chunk.choices:
yield chunk.choices[0].delta.content or ""See: llm/vllm-qwen3-0.6B/ for a vLLM example, or llm/sglang-qwen3.5-4b/ for SGLang.
Tip: The fastest way to deploy an LLM is
clarifai model init --toolkit vllm --model-name <hf-repo>— it generates the full model.py, config.yaml, and requirements.txt with the right server setup.
For deploying MCP tool servers that AI agents can use:
from fastmcp import FastMCP
from pydantic import Field
from clarifai.runners.models.mcp_class import MCPModelClass
server = FastMCP("my-tools", instructions="Useful tools for agents.", stateless_http=True)
@server.tool("hello", description="Say hello to someone")
def hello(name: str = Field(description="Name to greet")) -> str:
return f"Hello, {name}!"
@server.resource("config://version")
def get_version():
return "1.0.0"
class MyModel(MCPModelClass):
def get_server(self) -> FastMCP:
return serverSee: mcp/math/ for a minimal example, or the mcp/ directory for real-world MCP servers (GitHub, Postgres, browser tools, code execution, etc.).
In addition to the CLI (clarifai model predict), you can call your deployed model from Python code. The SDK client mirrors the exact method signatures defined in your model.py:
from clarifai.client.model import Model
# Initialize by model URL
model = Model(model_url="https://clarifai.com/user/app/models/my-model")
# Or by IDs
model = Model(user_id="user", app_id="app", model_id="my-model")Discover methods and signatures:
# List available methods
print(model.available_methods())
# Get method signature (parameter names, types, defaults)
print(model.method_signature(method_name="predict"))
# Generate a ready-to-use code snippet
print(model.generate_client_script())Unary prediction (request → response):
result = model.predict(prompt="Hello world")
print(result)
# Batch prediction (multiple inputs)
results = model.predict([
{"prompt": "Hello"},
{"prompt": "World"},
])Streaming prediction (request → stream of responses):
for chunk in model.generate(prompt="Explain quantum computing"):
print(chunk, end="", flush=True)Bidirectional streaming:
from clarifai.runners.utils.data_types import Audio
for transcript in model.transcribe(audio=iter(Audio(bytes=b"..."))):
print(transcript.text)| Example | Toolkit | Description |
|---|---|---|
vllm-qwen3-0.6B |
vLLM | Qwen3 0.6B with vLLM |
sglang-qwen3.5-4b |
SGLang | Qwen3.5 4B with SGLang |
lmdeploy-qwen3-0.6b |
LMDeploy | Qwen3 0.6B with LMDeploy |
hf-llama-3_2-1b-instruct |
HuggingFace | Llama 3.2 1B with Transformers |
agentic-gpt-5_1 |
Agentic | Agentic model with tool use |
agentic-gpt-oss-20b |
Agentic | Open-source agentic model |
| Example | Description |
|---|---|
browser-tools |
Web browsing, scraping, and search |
github-mcp-server |
GitHub operations (issues, PRs, repos) |
postgres |
PostgreSQL database operations |
google-drive |
Google Drive file management |
code-execution-docker-version |
Sandboxed code execution |
slack-tools-server |
Slack integration |
web-search |
Web search functionality |
| Example | Description |
|---|---|
qwen2_5-VL-3B-Instruct-vllm |
Qwen2.5 Vision 3B with vLLM |
sglang-deepseek-ocr |
DeepSeek OCR with SGLang |
| Example | Description |
|---|---|
hello-world |
Minimal starter (text in → text out) |
image-classifier |
Image classification |
image-detector |
Object detection |
image-segmenter |
Image segmentation |
text-embedder |
Text embeddings |
ocr |
Optical character recognition |
multimodal-models |
Multi-modal examples |
Clarifai's model framework supports rich data typing for both inputs and outputs. Each method parameter and return type must be annotated.
Common types:
| Type | Import | Example |
|---|---|---|
str, int, float, bool |
built-in | prompt: str = "" |
Text |
clarifai.runners.utils.data_types |
Text("hello"), Text(url="...") |
Image |
clarifai.runners.utils.data_types |
Image(url="..."), Image.from_pil(img) |
Audio |
clarifai.runners.utils.data_types |
Audio(bytes=b"...") |
Video |
clarifai.runners.utils.data_types |
Video(url="...") |
List[T], Dict[K, V] |
typing |
List[Image], Dict[str, float] |
Iterator[T] |
typing |
Iterator[str] (for streaming) |
For the full reference including Concept, Region, Frame, NamedFields, and container types, see SUPPORTED_DATATYPE.md.
Models like Llama, Gemma, and other gated repos require an HuggingFace token:
# Set in environment (recommended)
export HF_TOKEN=hf_...
# Or add to config.yaml
checkpoints:
repo_id: meta-llama/Llama-3.1-8B-Instruct
hf_token: hf_...The CLI validates HuggingFace access early (before building the Docker image) and provides actionable error messages if access is denied.
Note: When
HF_TOKENis in your environment, the CLI automatically persists it toconfig.yamlduring deploy so the container can access it.
Browse all available GPU instances across cloud providers:
clarifai list-instances # All instances
clarifai list-instances --cloud aws # AWS only
clarifai list-instances --gpu H100 # H100 instances
clarifai list-instances --min-gpus 2 # Multi-GPU
clarifai list-instances --min-gpu-mem 48Gi # 48+ GiB GPU memoryIf you want to upload a model image without immediately deploying it:
clarifai model upload ./my-modelYou can deploy it later with:
clarifai model deploy --model-url https://clarifai.com/user/app/models/my-model --instance g5.xlargeFor classification or detection models that output concepts/labels, define them in config.yaml:
concepts:
- id: "0"
name: bus
- id: "1"
name: person
- id: "2"
name: carNote: If your model loads from HuggingFace with
checkpointsdefined, concepts are inferred automatically.
Any command can use a different authentication context for a single invocation:
clarifai --context staging model deploy ./my-model
clarifai --context prod-openai model predict openai/chat-completion/models/GPT-4 "Hello"