server: support Vertex AI compatible API by ngxson · Pull Request #22545 · ggml-org/llama.cpp

ngxson · 2026-04-30T10:02:13Z

Overview

Support Vertex AI compatible API (allow deploying llama.cpp to Vertex AI)

Based on an initial patch provided by @alvarobartt

Ref:

NOTE: the stream endpoint is not supported at the moment, but we can add it in the future, see: https://docs.cloud.google.com/vertex-ai/docs/reference/rest/v1/projects.locations.endpoints/serverStreamingPredict

Requirements

I have read and agree with the contributing guidelines
AI usage disclosure: yes, I use it to write down the patch that I already had in my mind

alvarobartt

Thanks for the follow-up @ngxson!

We should also handle the following environment variables:

AIP_HTTP_PORT
AIP_HEALTH_ROUTE (defaults to /health)
AIP_MODE defines whether it needs to be enabled or not, so we'll need to check that it's set to PREDICTION; let's add a conditional based on that environment variable before void register_vertexai();

Those (and more) are defined in https://docs.cloud.google.com/vertex-ai/docs/predictions/custom-container-requirements#aip-variables, but we only need to handle the ones listed above. For example, AIP_STORAGE_URI would be a pointer to a Google Cloud Storage bucket and would ideally be forwarded to -m ..., but we handle the download in a custom entrypoint, so there is no need to do that on the server side. And apologies for not sending that earlier!
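For illustration, the environment handling described in this comment could be sketched as follows. This is a minimal sketch and not the code from the PR: `env_or`, `aip_settings` and `load_aip_settings` are hypothetical names, and the `8080` fallback is assumed from the snippet discussed later in the review.

```cpp
#include <cstdlib>
#include <string>

// Hypothetical helper (not part of llama.cpp): read an environment variable,
// returning `fallback` when the variable is unset.
static std::string env_or(const char * name, const std::string & fallback) {
    const char * value = std::getenv(name);
    return value ? std::string(value) : fallback;
}

// Hypothetical settings struct; the field names are illustrative only.
struct aip_settings {
    bool        enabled;      // true only when AIP_MODE is exactly "PREDICTION"
    std::string path_health;  // health route, "/health" when AIP_HEALTH_ROUTE is unset
    int         port;         // HTTP port, 8080 when AIP_HTTP_PORT is unset (assumed default)
};

static aip_settings load_aip_settings() {
    aip_settings s;
    s.enabled     = env_or("AIP_MODE", "") == "PREDICTION";
    s.path_health = env_or("AIP_HEALTH_ROUTE", "/health");
    // std::stoi throws std::invalid_argument if the value is not numeric.
    s.port        = std::stoi(env_or("AIP_HTTP_PORT", "8080"));
    return s;
}
```

With this shape, a caller would only register the Vertex AI routes when `load_aip_settings().enabled` is true, which matches the conditional requested above.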

ngxson · 2026-04-30T10:38:08Z

Hey @alvarobartt, I added support for a custom health endpoint and a custom port in the last commit, could you please take a look?

alvarobartt

Also, going forward they are renaming Vertex AI to Agent Platform or Gemini Enterprise Agent Platform (see https://cloud.google.com/model-garden). To be on the safe side, maybe we can simply name the variables and such e.g. "google", "google-agent-platform", "gemini-agent-platform", or something along those lines, to stay aligned on the naming?

Note that we still use Vertex AI here and there, but going forward they are going to stop using that name, and on the Rust side of things we usually expose those via a feature named "google" to keep things simpler. Just thinking out loud, whatever you feel is best!

Update: if it helps AIP_* stands for AI Platform

alvarobartt · 2026-04-30T10:41:43Z

+
+    // Ref: https://docs.cloud.google.com/vertex-ai/docs/predictions/custom-container-requirements#aip-variables
+    vertexai_params() {
+        enabled = getenv("AIP_MODE", "") == "PREDICT";


Suggested change

enabled = getenv("AIP_MODE", "") == "PREDICT";

enabled = getenv("AIP_MODE", "") == "PREDICTION";

alvarobartt · 2026-04-30T10:42:05Z

+        enabled = getenv("AIP_MODE", "") == "PREDICT";
+        path_health = getenv("AIP_HEALTH_ROUTE", "", true); // default: using the route defined in server.cpp
+        path_predict = getenv("AIP_PREDICT_ROUTE", "/predict", true);
+        port = std::stoi(getenv("PORT", "8080"));


Suggested change

port = std::stoi(getenv("PORT", "8080"));

port = std::stoi(getenv("AIP_HTTP_PORT", "8080"));

Should we check for collisions there with the --port arg and the LLAMA_ARG_PORT environment variables?

Ah yes, that's a good idea, I will add a warning if both of them are defined.
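The collision check discussed here could look roughly like the following. This is a hedged sketch rather than the PR's implementation: `resolve_port` is a hypothetical helper, and letting `AIP_HTTP_PORT` take precedence is an assumption, since the thread only says that a warning is emitted when both are defined.

```cpp
#include <cstdio>
#include <cstdlib>
#include <string>

// Hypothetical helper (not part of llama.cpp): pick the port to listen on.
// `cli_port` is the value from --port / LLAMA_ARG_PORT, or a negative number
// when neither was provided. `default_port` is used when nothing is set.
static int resolve_port(int cli_port, int default_port) {
    const char * aip = std::getenv("AIP_HTTP_PORT");
    if (aip == nullptr) {
        // Not running under Vertex AI: keep the regular behaviour.
        return cli_port >= 0 ? cli_port : default_port;
    }
    // std::stoi throws std::invalid_argument if the value is not numeric.
    const int aip_port = std::stoi(aip);
    if (cli_port >= 0) {
        // Both sources are defined: warn, then (by assumption) prefer AIP_HTTP_PORT.
        std::fprintf(stderr,
                     "warning: both --port/LLAMA_ARG_PORT (%d) and AIP_HTTP_PORT (%d) are set, using %d\n",
                     cli_port, aip_port, aip_port);
    }
    return aip_port;
}
```

The opposite precedence would work just as well; what matters for the user is that the conflict is reported rather than resolved silently.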

ngxson · 2026-04-30T10:56:12Z

Note that we still use Vertex AI here and there, but going forward they are going to stop using that name, and on the Rust side of things we usually expose those via a feature named "google" to keep things simpler. Just thinking out loud, whatever you feel is best!

Yes, that makes sense to me, no problem changing the name. On my personal projects, I name things gcp_ for Google Cloud Platform. Do you think that would be a better name? (I'm not quite sure whether Vertex AI is part of GCP or not)

alvarobartt · 2026-04-30T11:02:21Z

100%, gcp is fair in this context! And yes, it's part of Google Cloud Platform, though as mentioned they're going through a rename (see https://cloud.google.com/products/gemini-enterprise-agent-platform?hl=en), i.e. they already say "(formerly Vertex AI)". This was announced too, and I'd expect it to take some time until people pick up that it's the same thing, so gcp is generic enough 👍🏻

ngxson · 2026-04-30T11:31:08Z

All problems should be addressed now, could you give it a try on GCP to confirm that it works? @alvarobartt
Let me know if you need help building the Docker image.

alvarobartt · 2026-04-30T14:32:25Z

Sure @ngxson let me do that and report back (it might take a while as Vertex AI takes some time to allocate the instances, etc.) 🤗

alvarobartt

Hey @ngxson I just verified on Vertex AI and it works as expected, find below a reproducer for transparency 🤗

Let me just clarify whether streaming is supported as of today for these containers!

# NOTE: `pip install google-cloud-aiplatform --upgrade --quiet`
from google.cloud import aiplatform

aiplatform.init(project="<PROJECT_ID>", location="us-central1")

model = aiplatform.Model.upload(
    display_name="Gemma-4-E2B-IT",
    serving_container_image_uri="<CONTAINER_URI>",
    serving_container_environment_variables={
        "LLAMA_ARG_HF_REPO": "unsloth/gemma-4-E2B-it-GGUF:Q4_K_M",
        "LLAMA_ARG_JINJA": "true",
        "LLAMA_ARG_N_GPU_LAYERS": "99",
        "LLAMA_ARG_CTX_SIZE": "131072",
        "LLAMA_ARG_CONTEXT_SHIFT": "false",
    },
    serving_container_ports=[8080],
)
model.wait()

endpoint = aiplatform.Endpoint.create(display_name="Gemma-4-E2B-IT-API")

deployed_model = model.deploy(
    endpoint=endpoint,
    machine_type="g2-standard-12",
    accelerator_type="NVIDIA_L4",
    accelerator_count=1,
)

output = deployed_model.predict(
    instances=[
        {
            "@requestFormat": "chatCompletions",
            "messages": [
                {
                    "role": "user",
                    "content": [
                        {"type": "text", "text": "What's in this image?"},
                        {
                            "type": "image_url",
                            "image_url": {
                                "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/rabbit.png"
                            },
                        },
                    ],
                },
            ],
            "max_new_tokens": 256,
            "top_p": 0.95,
            "temperature": 1.0,
            "top_k": 64,
        },
    ]
)

deployed_model.undeploy_all()
deployed_model.delete()
model.delete()

ngxson · 2026-05-07T21:46:11Z

Hey @ggerganov, could you give the 2nd approval? And also update the ggml-org readme.md to include Google Cloud as a cloud provider? (not sure if we have a special URL for llama.cpp, @alvarobartt?) Thanks!

* server: support Vertex AI compatible API
* a bit safer
* support other AIP_* env var
* various fixes
* if AIP_MODE is unset, do nothing
* fix test case
* fix windows build

ngxson added 2 commits April 30, 2026 11:55

server: support Vertex AI compatible API

bfc135f

Merge branch 'master' into xsn/vertexai

d34f971

github-actions bot added the examples, python, and server labels Apr 30, 2026

alvarobartt reviewed Apr 30, 2026

ngxson added 2 commits April 30, 2026 12:15

a bit safer

5dd6c9e

support other AIP_* env var

5e11eaf

alvarobartt reviewed Apr 30, 2026

various fixes

348e608

if AIP_MODE is unset, do nothing

331e4d2

ngxson added 2 commits April 30, 2026 20:07

fix test case

9233271

fix windows build

1b2bd86

alvarobartt approved these changes May 6, 2026

ngxson marked this pull request as ready for review May 8, 2026 12:44

ngxson requested a review from a team as a code owner May 8, 2026 12:44

ggerganov approved these changes May 8, 2026

ngxson requested a review from ServeurpersoCom May 8, 2026 12:54

ServeurpersoCom approved these changes May 8, 2026

ngxson merged commit 29debb3 into master May 8, 2026
48 of 49 checks passed

ngxson deleted the xsn/vertexai branch June 13, 2026 12:00
