Summary

This example demonstrates how to export Google's Gemma 3 vision-language model and run it locally on ExecuTorch with the CUDA backend.

Exporting the model

To export the model, we use Optimum ExecuTorch, a library that exports models directly from Hugging Face Transformers.

Setting up Optimum ExecuTorch

Install via pip:

pip install optimum-executorch

Or install from source:

git clone https://github.com/huggingface/optimum-executorch.git
cd optimum-executorch
python install_dev.py
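
To confirm the installation, a quick sanity check (this assumes the executorch subcommand registered with optimum-cli, as used throughout this guide):

# Verify the package and the export CLI are available
pip show optimum-executorch
optimum-cli export executorch --help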

CUDA Support

This guide focuses on CUDA backend support for Gemma3, which accelerates inference on NVIDIA GPUs.
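
Before exporting, it is worth confirming that PyTorch can see your GPU (a standard check, not specific to this example):

# Confirm CUDA is visible to PyTorch before exporting
python -c "import torch; print(torch.cuda.is_available(), torch.cuda.get_device_name(0))"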

Exporting with CUDA

optimum-cli export executorch \
  --model "google/gemma-3-4b-it" \
  --task "multimodal-text-to-text" \
  --recipe "cuda" \
  --dtype bfloat16 \
  --device cuda \
  --output_dir="path/to/output/dir"

This will generate:

  • model.pte - The exported model
  • aoti_cuda_blob.ptd - The CUDA kernel blob required for runtime
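
Once the export finishes, you can confirm both artifacts were written (the directory is the --output_dir passed above):

# Both files should be present in the output directory
ls -lh path/to/output/dir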

Exporting with INT4 Quantization (Tile Packed)

For improved performance and reduced memory footprint, you can export Gemma3 with INT4 weight quantization using tile-packed format:

optimum-cli export executorch \
  --model "google/gemma-3-4b-it" \
  --task "multimodal-text-to-text" \
  --recipe "cuda" \
  --dtype bfloat16 \
  --device cuda \
  --qlinear 4w \
  --qlinear_encoder 4w \
  --qlinear_packing_format tile_packed_to_4d \
  --qlinear_encoder_packing_format tile_packed_to_4d \
  --output_dir="path/to/output/dir"

This will generate the same files (model.pte and aoti_cuda_blob.ptd) in the specified output directory.
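
To gauge the memory savings from INT4 quantization, you can compare artifact sizes across the two exports (the directory names below are illustrative, assuming each variant was exported to its own --output_dir):

# Compare the bfloat16 and INT4 artifacts; the INT4 model.pte should be substantially smaller
du -h bf16_out/model.pte bf16_out/aoti_cuda_blob.ptd
du -h int4_out/model.pte int4_out/aoti_cuda_blob.ptd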

See the "Building the Gemma3 runner" section below for instructions on building with CUDA support, and the "Running the model" section for runtime instructions.

Running the model

To run the model, we will use the Gemma3 runner, which is built on ExecuTorch's MultiModal runner API. The Gemma3 runner does the following:

  • Image input: Load image files (PNG, JPG, etc.) and format them as input tensors for the model
  • Text input: Process text prompts using the tokenizer
  • Inference: Feed the formatted inputs to the multimodal runner

Obtaining the tokenizer

You can download the tokenizer.json file from a Gemma 3 repo on Hugging Face:

curl -L https://huggingface.co/unsloth/gemma-3-1b-it/resolve/main/tokenizer.json -o tokenizer.json
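
The official google/gemma-3-4b-it repo is gated, so if you want the tokenizer from there instead, log in with a Hugging Face access token first (huggingface-cli ships with the huggingface_hub package):

# One-time login, then fetch just the tokenizer file
huggingface-cli login
huggingface-cli download google/gemma-3-4b-it tokenizer.json --local-dir .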

Building the Gemma3 runner

Prerequisites

Ensure you have a CUDA-capable GPU and CUDA toolkit installed on your system.
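
A quick way to verify both (driver via nvidia-smi, toolkit via nvcc):

# Check that the CUDA driver and toolkit are installed
nvidia-smi
nvcc --version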

Building for CUDA

# Build the Gemma3 runner with CUDA enabled
make gemma3-cuda

# Build the Gemma3 runner with CPU enabled
make gemma3-cpu

Running the model

You need to provide the following files to run Gemma3:

  • model.pte - The exported model file
  • aoti_cuda_blob.ptd - The CUDA kernel blob
  • tokenizer.json - The tokenizer file
  • An image file (PNG, JPG, etc.)

Example usage

# Here we use the ExecuTorch logo as the example image.
./cmake-out/examples/models/gemma3/gemma3_e2e_runner \
  --model_path path/to/model.pte \
  --data_path path/to/aoti_cuda_blob.ptd \
  --tokenizer_path path/to/tokenizer.json \
  --image_path docs/source/_static/img/et-logo.png \
  --temperature 0
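
For repeated runs it can help to wrap the invocation in a small script (a minimal sketch; the MODEL_DIR variable is illustrative and should point at the --output_dir used during export):

#!/usr/bin/env bash
set -euo pipefail

# Adjust these paths to your layout
MODEL_DIR=path/to/output/dir
./cmake-out/examples/models/gemma3/gemma3_e2e_runner \
  --model_path "$MODEL_DIR/model.pte" \
  --data_path "$MODEL_DIR/aoti_cuda_blob.ptd" \
  --tokenizer_path path/to/tokenizer.json \
  --image_path docs/source/_static/img/et-logo.png \
  --temperature 0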

Example output

Okay, let's break down what's in the image!

It appears to be a stylized graphic combining:

*   **A Microchip:** The core shape is a representation of a microchip (the integrated circuit).
*   **An "On" Symbol:**  There's an "On" symbol (often represented as a circle with a vertical line) incorporated into the microchip design.
*   **Color Scheme:** The microchip is colored in gray, and
PyTorchObserver {"prompt_tokens":271,"generated_tokens":99,"model_load_start_ms":0,"model_load_end_ms":0,"inference_start_ms":1761118126790,"inference_end_ms":1761118128385,"prompt_eval_end_ms":1761118127175,"first_token_ms":1761118127175,"aggregate_sampling_time_ms":86,"SCALING_FACTOR_UNITS_PER_SECOND":1000}
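
As a rough reading of the PyTorchObserver line above: prefill processed 271 prompt tokens in about 385 ms (prompt_eval_end_ms minus inference_start_ms), and decoding produced 99 tokens in about 1.21 s (inference_end_ms minus first_token_ms), i.e., on the order of 80 tokens per second. Exact numbers will vary with your GPU and export settings.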