server: support Vertex AI compatible API #22545
Thanks for the follow-up @ngxson!
We should also handle the following environment variables:
* AIP_HTTP_PORT
* AIP_HEALTH_ROUTE (defaults to /health)
* AIP_MODE is what defines whether this needs to be enabled or not, so we'll need to check that it's set to PREDICTION; let's add a conditional based on that environment variable before void register_vertexai();
Those (and more) are defined in https://docs.cloud.google.com/vertex-ai/docs/predictions/custom-container-requirements#aip-variables, but we only need to handle the ones listed above. For example, AIP_STORAGE_URI would point to a Google Cloud Storage bucket and would ideally be forwarded to -m ..., but we handle the download in a custom entrypoint, so there's no need to do that on the server side. And apologies for not sending that earlier!
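To make the requested behaviour concrete, here is a minimal Python sketch of the environment-variable logic (the server itself is C++, so this only illustrates the logic). The helper name `resolve_vertexai_params` and the dataclass are hypothetical; the `/health` default comes from the comment above, while the `8080` and `/predict` defaults are taken from the snippet under review later in this thread.

```python
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass
class VertexAIParams:
    enabled: bool
    port: int
    path_health: str
    path_predict: str


def resolve_vertexai_params(env: Mapping[str, str]) -> Optional[VertexAIParams]:
    """Hypothetical helper: derive Vertex AI settings from AIP_* variables."""
    # Only act when AIP_MODE is exactly PREDICTION; otherwise do nothing.
    if env.get("AIP_MODE", "") != "PREDICTION":
        return None
    return VertexAIParams(
        enabled=True,
        port=int(env.get("AIP_HTTP_PORT", "8080")),            # assumed default
        path_health=env.get("AIP_HEALTH_ROUTE", "/health"),    # default per the comment above
        path_predict=env.get("AIP_PREDICT_ROUTE", "/predict"), # assumed default
    )
```

Passing `os.environ` as `env` would read the real process environment; taking a mapping keeps the function easy to test.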
hey @alvarobartt, I added support for a custom health endpoint and a custom port in the last commit, could you please take a look?
Also, moving forward they are now naming Vertex AI as Agent Platform or Gemini Enterprise Agent Platform (see https://cloud.google.com/model-garden), so to be on the safe side maybe we can simply name the variables and such as e.g. "google", "google-agent-platform", "gemini-agent-platform", or something along those lines, to be aligned on the naming?
Note that we still use Vertex AI here and there, but moving forward they are going to stop using that name. On the Rust side of things we usually expose those via a feature named "google" to keep things simpler. Just thinking out loud, so whatever you feel is best!
Update: if it helps, AIP_* stands for AI Platform
// Ref: https://docs.cloud.google.com/vertex-ai/docs/predictions/custom-container-requirements#aip-variables
vertexai_params() {
    enabled = getenv("AIP_MODE", "") == "PREDICT";
Suggested change:
-    enabled = getenv("AIP_MODE", "") == "PREDICT";
+    enabled = getenv("AIP_MODE", "") == "PREDICTION";
    enabled = getenv("AIP_MODE", "") == "PREDICT";
    path_health = getenv("AIP_HEALTH_ROUTE", "", true); // default: using the route defined in server.cpp
    path_predict = getenv("AIP_PREDICT_ROUTE", "/predict", true);
    port = std::stoi(getenv("PORT", "8080"));
Suggested change:
-    port = std::stoi(getenv("PORT", "8080"));
+    port = std::stoi(getenv("AIP_HTTP_PORT", "8080"));
Should we check for collisions there with the --port arg and the LLAMA_ARG_PORT environment variable?
ah yes that's a good idea, I will add a warning if both of them are defined
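As a rough illustration of the collision check discussed here, the Python sketch below only reports conflicts. The thread does not say which setting takes precedence, so the sketch deliberately does not pick a winner. The helper name `port_conflict_warnings` and the message wording are hypothetical.

```python
from typing import List, Mapping, Optional


def port_conflict_warnings(
    env: Mapping[str, str], cli_port: Optional[int] = None
) -> List[str]:
    """Hypothetical helper: warn when AIP_HTTP_PORT competes with other port settings."""
    warnings: List[str] = []
    # No Vertex AI port configured, so nothing can collide with it.
    if "AIP_HTTP_PORT" not in env:
        return warnings
    if cli_port is not None:
        warnings.append(f"both AIP_HTTP_PORT and --port ({cli_port}) are defined")
    if "LLAMA_ARG_PORT" in env:
        warnings.append("both AIP_HTTP_PORT and LLAMA_ARG_PORT are defined")
    return warnings
```

Here `cli_port` stands for an explicitly passed `--port` value and is `None` when the flag was not given.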
Yes that makes sense to me, no problem changing the name. On my personal projects, I name things
100%,
All problems should be addressed now, could you give it a try on GCP to confirm that it works? @alvarobartt
Sure @ngxson let me do that and report back (it might take a while as Vertex AI takes some time to allocate the instances, etc.) 🤗
Hey @ngxson I just verified on Vertex AI and it works as expected, find below a reproducer for transparency 🤗
Let me just clarify streaming, and if that's something supported as of today for these containers!
# NOTE: `pip install google-cloud-aiplatform --upgrade --quiet`
from google.cloud import aiplatform

aiplatform.init(project="<PROJECT_ID>", location="us-central1")

# Register the llama.cpp server container as a Vertex AI model.
model = aiplatform.Model.upload(
    display_name="Gemma-4-E2B-IT",
    serving_container_image_uri="<CONTAINER_URI>",
    serving_container_environment_variables={
        "LLAMA_ARG_HF_REPO": "unsloth/gemma-4-E2B-it-GGUF:Q4_K_M",
        "LLAMA_ARG_JINJA": "true",
        "LLAMA_ARG_N_GPU_LAYERS": "99",
        "LLAMA_ARG_CTX_SIZE": "131072",
        "LLAMA_ARG_CONTEXT_SHIFT": "false",
    },
    serving_container_ports=[8080],
)
model.wait()

# Create an endpoint and deploy the model on a single NVIDIA L4.
endpoint = aiplatform.Endpoint.create(display_name="Gemma-4-E2B-IT-API")
# NOTE: `model.deploy()` returns the endpoint, so `deployed_model` is the endpoint handle.
deployed_model = model.deploy(
    endpoint=endpoint,
    machine_type="g2-standard-12",
    accelerator_type="NVIDIA_L4",
    accelerator_count=1,
)

# Send a chat-completions style request through the Vertex AI predict API.
output = deployed_model.predict(
    instances=[
        {
            "@requestFormat": "chatCompletions",
            "messages": [
                {
                    "role": "user",
                    "content": [
                        {"type": "text", "text": "What's in this image?"},
                        {
                            "type": "image_url",
                            "image_url": {
                                "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/rabbit.png"
                            },
                        },
                    ],
                },
            ],
            "max_new_tokens": 256,
            "top_p": 0.95,
            "temperature": 1.0,
            "top_k": 64,
        },
    ]
)

# Clean up: undeploy, then delete the endpoint and the model.
deployed_model.undeploy_all()
deployed_model.delete()
model.delete()
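The SDK call above wraps each chat request as one entry of an `instances` list. Assuming the container receives this as a JSON body of the form `{"instances": [...]}` on its predict route, as described in the Vertex AI custom container requirements, a request for a local smoke test could be built roughly as follows. The helper name `build_predict_request` is hypothetical, and the URL assumes the default port 8080 and `/predict` route from the reviewed snippet.

```python
import json
import urllib.request
from typing import Any, Dict, List


def build_predict_request(
    url: str, messages: List[Dict[str, Any]], **sampling: Any
) -> urllib.request.Request:
    """Hypothetical helper: wrap one chat request in a Vertex AI style predict body."""
    instance = {"@requestFormat": "chatCompletions", "messages": messages, **sampling}
    body = json.dumps({"instances": [instance]}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_predict_request(
    "http://localhost:8080/predict",  # assumed local address, not from the PR
    [{"role": "user", "content": "Hello"}],
    temperature=1.0,
)
# urllib.request.urlopen(req) would send it; omitted so the sketch runs offline.
```

The response format returned by the server is not shown in this thread, so the sketch stops at building the request.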
Hey @ggerganov could you give the 2nd approval? And also update the ggml-org readme.md to include google cloud as a cloud provider? (not sure if we have a special URL for llama.cpp @alvarobartt ?) Thanks!
* server: support Vertex AI compatible API
* a bit safer
* support other AIP_* env var
* various fixes
* if AIP_MODE is unset, do nothing
* fix test case
* fix windows build

Overview
Support Vertex AI compatible API (allow deploying llama.cpp to Vertex AI)
Based on an initial patch provided by @alvarobartt
Ref:
NOTE: the stream endpoint is not supported at the moment, but we can add it in the future, see: https://docs.cloud.google.com/vertex-ai/docs/reference/rest/v1/projects.locations.endpoints/serverStreamingPredict
Requirements