server: support Vertex AI compatible API #22545

Merged
ngxson merged 8 commits into master from xsn/vertexai on May 8, 2026
@ngxson ngxson commented Apr 30, 2026

Collaborator

Overview

Support Vertex AI compatible API (allow deploying llama.cpp to Vertex AI)

Based on an initial patch provided by @alvarobartt

Ref:

NOTE: the streaming endpoint is not supported at the moment, but we can add it in the future, see: https://docs.cloud.google.com/vertex-ai/docs/reference/rest/v1/projects.locations.endpoints/serverStreamingPredict

Requirements

  • I have read and agree with the contributing guidelines
  • AI usage disclosure: yes, I use it to write down the patch that I already had in my mind

@github-actions github-actions Bot added examples python python script changes server labels Apr 30, 2026

@alvarobartt alvarobartt left a comment

Thanks for the follow-up @ngxson!

We should also handle the following environment variables:

  • AIP_HTTP_PORT
  • AIP_HEALTH_ROUTE (defaults to /health)
  • AIP_MODE defines whether the feature needs to be enabled or not, so we'll need to check that it's set to PREDICTION; let's add a conditional based on that environment variable before void register_vertexai();

Those (and more) are defined in https://docs.cloud.google.com/vertex-ai/docs/predictions/custom-container-requirements#aip-variables, but we only need to handle the ones listed above. For example, AIP_STORAGE_URI would point to a Google Cloud Storage bucket and would ideally be forwarded to -m ..., but we handle the download in a custom entrypoint, so there is no need to do that on the server side. And apologies for not sending that earlier!
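The variables listed in this comment can be read in one place. The sketch below is illustrative Python, not the C++ code from this PR: `AipSettings` and `read_aip_settings` are hypothetical names, and the `8080` port default and the `AIP_PREDICT_ROUTE` / `/predict` pair are assumptions rather than something this comment states.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass
class AipSettings:
    enabled: bool        # True only when AIP_MODE is exactly "PREDICTION"
    port: int            # taken from AIP_HTTP_PORT
    health_route: str    # taken from AIP_HEALTH_ROUTE
    predict_route: str   # taken from AIP_PREDICT_ROUTE (assumed variable)


def read_aip_settings(env: Optional[Mapping[str, str]] = None) -> AipSettings:
    """Hypothetical helper: collect the AIP_* variables into one object."""
    env = os.environ if env is None else env
    return AipSettings(
        # If AIP_MODE is unset or has any other value, the feature stays off.
        enabled=env.get("AIP_MODE", "") == "PREDICTION",
        port=int(env.get("AIP_HTTP_PORT", "8080")),            # assumed default
        health_route=env.get("AIP_HEALTH_ROUTE", "/health"),   # default per the comment
        predict_route=env.get("AIP_PREDICT_ROUTE", "/predict"),  # assumed default
    )


if __name__ == "__main__":
    print(read_aip_settings())
```

Passing a plain dict instead of `os.environ` keeps the helper easy to exercise without touching the real process environment.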

@ngxson

ngxson commented Apr 30, 2026

Collaborator Author

Hey @alvarobartt, I added support for a custom health endpoint and a custom port in the last commit, could you please take a look?

@alvarobartt alvarobartt left a comment

Also, moving forward they are renaming Vertex AI to Agent Platform or Gemini Enterprise Agent Platform (see https://cloud.google.com/model-garden), so to be on the safe side maybe we can simply name the variables and such e.g. "google", "google-agent-platform", "gemini-agent-platform", or something along those lines, to stay aligned with the naming?

Note that we still use Vertex AI here and there, but moving forward they are going to stop using those and on the Rust side of things we usually expose those via a feature named "google" to keep things simpler, so just thinking out loud, whatever you feel it's best!

Update: if it helps AIP_* stands for AI Platform

Comment thread tools/server/server-http.cpp Outdated

// Ref: https://docs.cloud.google.com/vertex-ai/docs/predictions/custom-container-requirements#aip-variables
vertexai_params() {
enabled = getenv("AIP_MODE", "") == "PREDICT";

Suggested change
- enabled = getenv("AIP_MODE", "") == "PREDICT";
+ enabled = getenv("AIP_MODE", "") == "PREDICTION";

Comment thread tools/server/server-http.cpp Outdated
enabled = getenv("AIP_MODE", "") == "PREDICT";
path_health = getenv("AIP_HEALTH_ROUTE", "", true); // default: using the route defined in server.cpp
path_predict = getenv("AIP_PREDICT_ROUTE", "/predict", true);
port = std::stoi(getenv("PORT", "8080"));

Suggested change
- port = std::stoi(getenv("PORT", "8080"));
+ port = std::stoi(getenv("AIP_HTTP_PORT", "8080"));

Should we check for collisions there with the --port arg and the LLAMA_ARG_PORT environment variables?

Collaborator Author

Ah yes, that's a good idea, I will add a warning if both of them are defined
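The collision check could look roughly like the following. This is a sketch in Python rather than the server's C++, `resolve_port` is a hypothetical helper, and the precedence rule (AIP_HTTP_PORT wins over `--port` and LLAMA_ARG_PORT) is an assumption made for illustration; the thread only says that a warning is emitted when both are defined.

```python
from typing import List, Mapping, Optional, Tuple

DEFAULT_PORT = 8080  # assumed fallback, matching the default in the snippet under review


def resolve_port(
    env: Mapping[str, str], cli_port: Optional[int] = None
) -> Tuple[int, List[str]]:
    """Hypothetical helper: pick the listening port and collect warnings."""
    warnings: List[str] = []
    aip = env.get("AIP_HTTP_PORT")
    llama = env.get("LLAMA_ARG_PORT")

    # Port requested through llama.cpp's own mechanisms (--port flag first,
    # then the LLAMA_ARG_PORT environment variable).
    explicit = cli_port if cli_port is not None else (int(llama) if llama else None)

    if aip is None:
        return (explicit if explicit is not None else DEFAULT_PORT), warnings

    if explicit is not None:
        # Both sources are defined: warn and (by assumption) prefer AIP_HTTP_PORT.
        warnings.append(
            f"both AIP_HTTP_PORT={aip} and an explicit port ({explicit}) are set; "
            f"using AIP_HTTP_PORT"
        )
    return int(aip), warnings


if __name__ == "__main__":
    port, notes = resolve_port({"AIP_HTTP_PORT": "8081"}, cli_port=9090)
    for note in notes:
        print("warning:", note)
    print("listening on", port)
```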

@ngxson

ngxson commented Apr 30, 2026

Collaborator Author

Note that we still use Vertex AI here and there, but moving forward they are going to stop using those and on the Rust side of things we usually expose those via a feature named "google" to keep things simpler, so just thinking out loud, whatever you feel it's best!

Yes, that makes sense to me, no problem changing the name. On my personal projects, I name things gcp_ for Google Cloud Platform. Do you think that would be a better name? (I'm not quite sure if Vertex AI is part of GCP or not)

@alvarobartt

alvarobartt commented Apr 30, 2026

100%, gcp is fair in this context! And yes, it's part of Google Cloud Platform, though as mentioned they're going through a rename as per https://cloud.google.com/products/gemini-enterprise-agent-platform?hl=en, i.e. they already say "(formerly Vertex AI)". This was announced too, so I'd expect it to take some time until people pick up that it's the same thing; gcp is generic enough 👍🏻

@ngxson

ngxson commented Apr 30, 2026

Collaborator Author

All problems should be addressed now, could you give it a try on GCP to confirm that it works? @alvarobartt
Lmk if you need help building the Docker image

@alvarobartt

Sure @ngxson let me do that and report back (it might take a while as Vertex AI takes some time to allocate the instances, etc.) 🤗

@alvarobartt alvarobartt left a comment

Hey @ngxson I just verified on Vertex AI and it works as expected, find below a reproducer for transparency 🤗

Let me just clarify streaming, and if that's something supported as of today for these containers!

# NOTE: `pip install google-cloud-aiplatform --upgrade --quiet`
from google.cloud import aiplatform

aiplatform.init(project="<PROJECT_ID>", location="us-central1")

model = aiplatform.Model.upload(
    display_name="Gemma-4-E2B-IT",
    serving_container_image_uri="<CONTAINER_URI>",
    serving_container_environment_variables={
        "LLAMA_ARG_HF_REPO": "unsloth/gemma-4-E2B-it-GGUF:Q4_K_M",
        "LLAMA_ARG_JINJA": "true",
        "LLAMA_ARG_N_GPU_LAYERS": "99",
        "LLAMA_ARG_CTX_SIZE": "131072",
        "LLAMA_ARG_CONTEXT_SHIFT": "false",
    },
    serving_container_ports=[8080],
)
model.wait()

endpoint = aiplatform.Endpoint.create(display_name="Gemma-4-E2B-IT-API")

deployed_model = model.deploy(
    endpoint=endpoint,
    machine_type="g2-standard-12",
    accelerator_type="NVIDIA_L4",
    accelerator_count=1,
)

output = deployed_model.predict(
    instances=[
        {
            "@requestFormat": "chatCompletions",
            "messages": [
                {
                    "role": "user",
                    "content": [
                        {"type": "text", "text": "What's in this image?"},
                        {
                            "type": "image_url",
                            "image_url": {
                                "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/rabbit.png"
                            },
                        },
                    ],
                },
            ],
            "max_new_tokens": 256,
            "top_p": 0.95,
            "temperature": 1.0,
            "top_k": 64,
        },
    ]
)

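print(output.predictions)  # inspect what the endpoint returned

# Clean up: undeploy the model, then delete the endpoint and the uploaded model
deployed_model.undeploy_all()
deployed_model.delete()
model.delete()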
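For readers who want to exercise the predict route without the SDK, the request body can be built by hand. The sketch below assumes the server accepts the standard Vertex AI envelope with a top-level `instances` array, which is what the SDK's `predict(instances=[...])` call sends; `build_predict_payload` is a hypothetical helper, and how the server treats each sampling field is not shown in this thread.

```python
import json
from typing import Any, Dict, List


def build_predict_payload(
    messages: List[Dict[str, Any]], **sampling: Any
) -> Dict[str, Any]:
    """Hypothetical helper: wrap one chat request in a Vertex AI predict body."""
    instance: Dict[str, Any] = {
        # Same marker the reproducer uses to select the chat completions format.
        "@requestFormat": "chatCompletions",
        "messages": messages,
    }
    # Sampling options sit next to the messages, as in the reproducer above.
    instance.update(sampling)
    return {"instances": [instance]}


payload = build_predict_payload(
    [{"role": "user", "content": "Hello!"}],
    temperature=1.0,
    top_p=0.95,
)

# JSON body that could be POSTed to the route named by AIP_PREDICT_ROUTE.
body = json.dumps(payload, indent=2)

if __name__ == "__main__":
    print(body)
```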

@ngxson

ngxson commented May 7, 2026

Collaborator Author

Hey @ggerganov could you give the 2nd approval? And also update the ggml-org readme.md to include google cloud as a cloud provider? (not sure if we have a special URL for llama.cpp @alvarobartt ?) Thanks!

@ngxson ngxson marked this pull request as ready for review May 8, 2026 12:44
@ngxson ngxson requested a review from a team as a code owner May 8, 2026 12:44
@ngxson ngxson requested a review from ServeurpersoCom May 8, 2026 12:54
@ngxson ngxson merged commit 29debb3 into master May 8, 2026
48 of 49 checks passed
cetarthoriphros pushed a commit to cetarthoriphros/llama.cpp that referenced this pull request May 9, 2026
* server: support Vertex AI compatible API

* a bit safer

* support other AIP_* env var

* various fixes

* if AIP_MODE is unset, do nothing

* fix test case

* fix windows build
meh pushed a commit to meh/llama.cpp that referenced this pull request May 10, 2026
rsenthilkumar6 pushed a commit to rsenthilkumar6/llama.cpp that referenced this pull request May 19, 2026
baramofme pushed a commit to baramofme/llama-cpp-turboquant that referenced this pull request May 23, 2026
winstonma pushed a commit to winstonma/llama.cpp that referenced this pull request May 27, 2026
fewtarius pushed a commit to fewtarius/llama.cpp that referenced this pull request May 30, 2026
@ngxson ngxson deleted the xsn/vertexai branch June 13, 2026 12:00