{"id":167378,"date":"2026-05-06T00:50:32","date_gmt":"2026-05-05T21:50:32","guid":{"rendered":"https:\/\/computingforgeeks.com\/?p=167378"},"modified":"2026-05-06T12:46:29","modified_gmt":"2026-05-06T09:46:29","slug":"vertex-ai-gemini-python-streaming-tool-use","status":"publish","type":"post","link":"https:\/\/computingforgeeks.com\/vertex-ai-gemini-python-streaming-tool-use\/","title":{"rendered":"Use Vertex AI Gemini in Python: Streaming, Tools, Vision"},"content":{"rendered":"\n<p class=\"wp-block-paragraph\">Most Vertex AI Gemini tutorials on the open web were written before the SDK changed. They show <code>from vertexai.generative_models import GenerativeModel<\/code>, an API Google deprecated in mid-2025 and removes entirely on June 24, 2026. Code that ships against that import will start raising <code>ModuleNotFoundError<\/code> the day it lands.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">This guide is the current playbook for Vertex AI Gemini in Python: the unified <code>google-genai<\/code> SDK, the same package that powers Gemini in AI Studio, but pointed at Vertex AI for production work. Every command and every snippet was tested live on Google Cloud on May 5, 2026, and the screenshots are real captures from that session. 
You will set up auth, call Gemini 2.5 Flash, stream responses, force JSON output into a Pydantic schema, call functions, send images, combine streaming with tool use, and handle the four error modes you will hit in production.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">If you already use the <a href=\"https:\/\/computingforgeeks.com\/install-configure-gemini-cli\/\" target=\"_blank\" rel=\"noreferrer noopener\">Gemini CLI<\/a> for one-off prompts or follow the <a href=\"https:\/\/computingforgeeks.com\/gemini-cli-cheat-sheet\/\" target=\"_blank\" rel=\"noreferrer noopener\">Gemini CLI cheat sheet<\/a> from the terminal, the Python path here is the next step: programmatic, billed through Vertex AI, and ready to run inside FastAPI, Cloud Run, or a Kubernetes job.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><em>Tested May 2026 on macOS with Python 3.13.11, <code>google-genai<\/code> 1.75.0, Pydantic 2.13.3, against Vertex AI in <code>us-central1<\/code> using <code>gemini-2.5-flash<\/code> and <code>gemini-2.5-pro<\/code>.<\/em><\/p>\n\n\n\n<h2 class=\"wp-block-heading\">What changed: google-genai vs vertexai.generative_models<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">Two SDKs reach Gemini from Python today. 
Only one has a future.<\/p>\n\n\n\n<figure class=\"wp-block-table\"><table><thead><tr><th>SDK<\/th><th>Import path<\/th><th>Status<\/th><th>Last day to migrate<\/th><\/tr><\/thead><tbody><tr><td><code>google-genai<\/code><\/td><td><code>from google import genai<\/code><\/td><td>Active, GA, <a href=\"https:\/\/googleapis.github.io\/python-genai\/\" target=\"_blank\" rel=\"noreferrer noopener\">documented here<\/a><\/td><td>n\/a (this is the target)<\/td><\/tr><tr><td><code>google-cloud-aiplatform[generative]<\/code><\/td><td><code>from vertexai.generative_models import GenerativeModel<\/code><\/td><td>Deprecated June 24, 2025<\/td><td>June 24, 2026<\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">The unified <code>google-genai<\/code> package supports both backends through one client: pass <code>vertexai=True<\/code> with a project and location to talk to Vertex AI, or pass <code>api_key=...<\/code> to talk to AI Studio. Same methods, same response shapes, different backend. For production on Google Cloud, Vertex AI is the right pick: it bills against your project, supports IAM properly, and gives you regional endpoints.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Step 1: Set reusable shell variables<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">Every command in this guide assumes a few variables. Setting them once means you swap your project ID and region in one place, then paste the rest as is. 
Open a fresh shell and export them:<\/p>\n\n\n\n<pre class=\"wp-block-code code\"><code>export PROJECT_ID=\"your-gcp-project-id\"\nexport REGION=\"us-central1\"\nexport SA_NAME=\"gemini-vertex-sa\"\nexport SA_KEY_PATH=\"${HOME}\/sa-keys\/${PROJECT_ID}-vertex.json\"\nexport ADMIN_EMAIL=\"you@example.com\"<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">Confirm the values are exported before you continue:<\/p>\n\n\n\n<pre class=\"wp-block-code code\"><code>echo \"Project:  ${PROJECT_ID}\"\necho \"Region:   ${REGION}\"\necho \"SA name:  ${SA_NAME}\"\necho \"SA key:   ${SA_KEY_PATH}\"<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">Pick <code>us-central1<\/code> if you are unsure: it has the broadest model coverage. <code>europe-west4<\/code> is the standard EU pick. The variables only live for the current shell, so re-export them if you reconnect.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Step 2: Enable Vertex AI and grant IAM<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">The <code>aiplatform.googleapis.com<\/code> API is the gate to Vertex AI. Enable it on the project:<\/p>\n\n\n\n<pre class=\"wp-block-code code\"><code>gcloud services enable aiplatform.googleapis.com --project=\"${PROJECT_ID}\"<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">The console shows the activation in a few seconds. 
The image below captures both the API enable and the local Python check that follows.<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"920\" height=\"800\" src=\"https:\/\/computingforgeeks.com\/wp-content\/uploads\/2026\/05\/wm-vertex-ai-setup-gcloud-adc.png\" alt=\"Enable Vertex AI API, authenticate with gcloud, and verify Python client init\" class=\"wp-image-167371\" title=\"\" srcset=\"https:\/\/computingforgeeks.com\/wp-content\/uploads\/2026\/05\/wm-vertex-ai-setup-gcloud-adc.png 920w, https:\/\/computingforgeeks.com\/wp-content\/uploads\/2026\/05\/wm-vertex-ai-setup-gcloud-adc-300x261.png 300w, https:\/\/computingforgeeks.com\/wp-content\/uploads\/2026\/05\/wm-vertex-ai-setup-gcloud-adc-768x668.png 768w\" sizes=\"auto, (max-width: 920px) 100vw, 920px\" \/><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">Now create a service account for production code paths and grant it the Vertex AI User role. Avoid <code>roles\/owner<\/code> here: the principle of least privilege matters when this key may end up baked into a CI runner.<\/p>\n\n\n\n<pre class=\"wp-block-code code\"><code>gcloud iam service-accounts create \"${SA_NAME}\" \\\n  --display-name=\"Gemini on Vertex AI\" \\\n  --project=\"${PROJECT_ID}\"\n\ngcloud projects add-iam-policy-binding \"${PROJECT_ID}\" \\\n  --member=\"serviceAccount:${SA_NAME}@${PROJECT_ID}.iam.gserviceaccount.com\" \\\n  --role=\"roles\/aiplatform.user\"<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">The <code>roles\/aiplatform.user<\/code> binding includes the <code>aiplatform.endpoints.predict<\/code> permission, which is what the SDK ultimately calls. If you skip this and run the SDK as a service account that lacks the role, the request returns a 403 with that exact permission name in the message. 
The error section at the end of this guide shows the trace.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Step 3: Authenticate from Python (ADC and service-account JSON)<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">The SDK reads credentials through the standard <code>google-auth<\/code> chain. You have two practical paths: Application Default Credentials for your laptop, and a service-account JSON for servers and CI. Both work without changing a line of Python.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">For local development, run ADC once:<\/p>\n\n\n\n<pre class=\"wp-block-code code\"><code>gcloud auth application-default login<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">This writes <code>~\/.config\/gcloud\/application_default_credentials.json<\/code>. The SDK picks it up automatically. For a server or container, generate a key for the service account and point <code>GOOGLE_APPLICATION_CREDENTIALS<\/code> at it:<\/p>\n\n\n\n<pre class=\"wp-block-code code\"><code>mkdir -p \"$(dirname \"${SA_KEY_PATH}\")\"\n\ngcloud iam service-accounts keys create \"${SA_KEY_PATH}\" \\\n  --iam-account=\"${SA_NAME}@${PROJECT_ID}.iam.gserviceaccount.com\"\n\nexport GOOGLE_APPLICATION_CREDENTIALS=\"${SA_KEY_PATH}\"<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">If you deploy to GKE, prefer Workload Identity Federation over a JSON key. The <a href=\"https:\/\/computingforgeeks.com\/gke-workload-identity-federation-complete-guide\/\" target=\"_blank\" rel=\"noreferrer noopener\">GKE Workload Identity walkthrough<\/a> covers the binding so pods inherit the SA without a key on disk. 
For GitHub Actions, the same idea applies with <a href=\"https:\/\/computingforgeeks.com\/gcp-workload-identity-federation-github-actions\/\" target=\"_blank\" rel=\"noreferrer noopener\">Workload Identity Federation for GitHub<\/a>.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Step 4: First call to Vertex AI Gemini in Python<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">Create a clean virtualenv and pin the SDK:<\/p>\n\n\n\n<pre class=\"wp-block-code code\"><code>python3 -m venv venv\nsource venv\/bin\/activate\npip install --upgrade pip\n\npip install \"google-genai==1.75.0\" \"pydantic\" \"Pillow\"  #https:\/\/pypi.org\/project\/google-genai\/<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">Save this as <code>demos\/01_hello.py<\/code>:<\/p>\n\n\n\n<pre class=\"wp-block-code code\"><code>from google import genai\n\nclient = genai.Client(\n    vertexai=True,\n    project=\"your-gcp-project-id\",\n    location=\"us-central1\",\n)\n\nresponse = client.models.generate_content(\n    model=\"gemini-2.5-flash\",\n    contents=\"In one short sentence, explain what Vertex AI is.\",\n)\n\nprint(response.text)\nprint(\"---\")\nprint(f\"Model:  {response.model_version}\")\nprint(f\"Tokens: {response.usage_metadata.prompt_token_count} in, \"\n      f\"{response.usage_metadata.candidates_token_count} out, \"\n      f\"{response.usage_metadata.total_token_count} total\")<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">Run it. 
The first call uses about 40 visible tokens (prompt plus answer) and stays well below a cent on Flash, even with thinking tokens counted:<\/p>\n\n\n\n<pre class=\"wp-block-code code\"><code>python demos\/01_hello.py<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">The output confirms the call hit Gemini 2.5 Flash and reports token usage:<\/p>\n\n\n\n<pre class=\"wp-block-code code\"><code>Vertex AI is Google Cloud's unified, end-to-end platform for building, deploying, and managing machine learning models throughout their lifecycle.\n---\nModel:  gemini-2.5-flash\nTokens: 11 in, 29 out, 614 total<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">The <code>total_token_count<\/code> includes thinking tokens that Gemini 2.5 generates internally before producing the visible answer, which is why the total is well above input plus output. You see this on every 2.5-series call.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Step 5: Stream responses (sync and async)<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">Streaming is what makes a chat UI feel alive. Without it, the user stares at a spinner for 4 seconds. With it, the first token appears in around 300 ms and the rest flows in. The SDK exposes <code>generate_content_stream<\/code> on both the sync and async clients.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Sync is the right pick for CLI tools and scripts:<\/p>\n\n\n\n<pre class=\"wp-block-code code\"><code>from google import genai\n\nclient = genai.Client(vertexai=True, project=\"your-gcp-project-id\", location=\"us-central1\")\n\nprompt = \"List three reasons engineers stream LLM output. One short bullet each.\"\n\nfor chunk in client.models.generate_content_stream(\n    model=\"gemini-2.5-flash\",\n    contents=prompt,\n):\n    # chunk.text can be None when a chunk has no text part; print nothing then\n    print(chunk.text or \"\", end=\"\", flush=True)\n\nprint()<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">Each <code>chunk<\/code> carries a few tokens and you flush them as they arrive. 
The image below shows back-to-back runs of the hello-world script and the streaming script:<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"920\" height=\"800\" src=\"https:\/\/computingforgeeks.com\/wp-content\/uploads\/2026\/05\/wm-vertex-ai-gemini-python-streaming-output.png\" alt=\"Run a hello world prompt then stream the next response chunk by chunk\" class=\"wp-image-167373\" title=\"\" srcset=\"https:\/\/computingforgeeks.com\/wp-content\/uploads\/2026\/05\/wm-vertex-ai-gemini-python-streaming-output.png 920w, https:\/\/computingforgeeks.com\/wp-content\/uploads\/2026\/05\/wm-vertex-ai-gemini-python-streaming-output-300x261.png 300w, https:\/\/computingforgeeks.com\/wp-content\/uploads\/2026\/05\/wm-vertex-ai-gemini-python-streaming-output-768x668.png 768w\" sizes=\"auto, (max-width: 920px) 100vw, 920px\" \/><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">Async is what you want behind a FastAPI endpoint or any code that already runs on an event loop. 
The async client lives at <code>client.aio<\/code> and mirrors the sync methods:<\/p>\n\n\n\n<pre class=\"wp-block-code code\"><code>import asyncio\nfrom google import genai\n\nclient = genai.Client(vertexai=True, project=\"your-gcp-project-id\", location=\"us-central1\")\n\n\nasync def main() -> None:\n    stream = await client.aio.models.generate_content_stream(\n        model=\"gemini-2.5-flash\",\n        contents=\"Explain async streaming in two short sentences.\",\n    )\n    async for chunk in stream:\n        # chunk.text can be None when a chunk has no text part; print nothing then\n        print(chunk.text or \"\", end=\"\", flush=True)\n    print()\n\n\nasyncio.run(main())<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">Wrap that in a <a href=\"https:\/\/computingforgeeks.com\/deploy-google-cloud-run-terraform\/\" target=\"_blank\" rel=\"noreferrer noopener\">Cloud Run service<\/a> with FastAPI&#8217;s <code>StreamingResponse<\/code> and you have a Server-Sent Events endpoint that scales to zero, so you pay only while requests are being served.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Step 6: JSON output, system instructions, and safety settings<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">Free-form text is fine for chat. For automation you want structure. The SDK accepts a Pydantic class as <code>response_schema<\/code> and Gemini conforms to it: no regex, no <code>json.loads<\/code> wrapped in a try\/except, no &#8220;the model forgot the closing brace again.&#8221;<\/p>\n\n\n\n<pre class=\"wp-block-code code\"><code>from pydantic import BaseModel\nfrom google import genai\nfrom google.genai import types\n\n\nclass IncidentSummary(BaseModel):\n    severity: str\n    affected_service: str\n    likely_cause: str\n    next_step: str\n\n\nclient = genai.Client(vertexai=True, project=\"your-gcp-project-id\", location=\"us-central1\")\n\nresponse = client.models.generate_content(\n    model=\"gemini-2.5-flash\",\n    contents=(\n        \"Triage this Nginx alert: 502 Bad Gateway spiking on \/api after a \"\n        \"PostgreSQL maintenance window. 
Return one IncidentSummary.\"\n    ),\n    config=types.GenerateContentConfig(\n        response_mime_type=\"application\/json\",\n        response_schema=IncidentSummary,\n    ),\n)\n\nincident: IncidentSummary = response.parsed\nprint(f\"Severity:        {incident.severity}\")\nprint(f\"Service:         {incident.affected_service}\")\nprint(f\"Likely cause:    {incident.likely_cause}\")\nprint(f\"Next step:       {incident.next_step}\")<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">The captured run produced a clean parse without retries:<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"920\" height=\"800\" src=\"https:\/\/computingforgeeks.com\/wp-content\/uploads\/2026\/05\/wm-vertex-ai-gemini-json-pydantic-output.png\" alt=\"Force Gemini to return JSON that fits an IncidentSummary Pydantic schema\" class=\"wp-image-167374\" title=\"\" srcset=\"https:\/\/computingforgeeks.com\/wp-content\/uploads\/2026\/05\/wm-vertex-ai-gemini-json-pydantic-output.png 920w, https:\/\/computingforgeeks.com\/wp-content\/uploads\/2026\/05\/wm-vertex-ai-gemini-json-pydantic-output-300x261.png 300w, https:\/\/computingforgeeks.com\/wp-content\/uploads\/2026\/05\/wm-vertex-ai-gemini-json-pydantic-output-768x668.png 768w\" sizes=\"auto, (max-width: 920px) 100vw, 920px\" \/><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">Two more knobs on the same <code>GenerateContentConfig<\/code> are worth setting on day one. <code>system_instruction<\/code> pins the persona so you stop repeating &#8220;act as a senior SRE&#8221; in every prompt. <code>safety_settings<\/code> tunes the harm thresholds that block content; the defaults are mid-strict and you may need to lower them for security or red-team workloads:<\/p>\n\n\n\n<pre class=\"wp-block-code code\"><code>config = types.GenerateContentConfig(\n    system_instruction=\"You are a senior Linux SysAdmin. Be terse. 
Reply in one sentence unless asked otherwise.\",\n    safety_settings=[\n        types.SafetySetting(\n            category=\"HARM_CATEGORY_HATE_SPEECH\",\n            threshold=\"BLOCK_ONLY_HIGH\",\n        ),\n    ],\n)<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">Pass that <code>config<\/code> to any <code>generate_content<\/code> or <code>generate_content_stream<\/code> call and the model honours both.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Step 7: Function calling, automatic and manual<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">Function calling is how Gemini asks your code to do things it cannot do itself: query a database, hit an internal API, run a kubectl command. The SDK supports two patterns. Pick automatic when the function lives in the same Python process. Pick manual when the tool is on a different host, in a different language, or when you want the model to suggest calls but never execute them.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">For the automatic pattern, define a regular Python function with a docstring and type hints. The SDK introspects both, generates the JSON schema, and calls the function for you when Gemini decides a tool call is needed:<\/p>\n\n\n\n<pre class=\"wp-block-code code\"><code>from google import genai\nfrom google.genai import types\n\n\ndef get_server_status(hostname: str) -> dict:\n    \"\"\"Returns the systemd status of a Linux server.\n\n    Args:\n        hostname: The server hostname, e.g. 
web01.example.com.\n    \"\"\"\n    fake_db = {\n        \"web01.example.com\":   {\"status\": \"active\", \"load\": \"0.42\", \"uptime_days\": 18},\n        \"db01.example.com\":    {\"status\": \"active\", \"load\": \"1.10\", \"uptime_days\": 92},\n        \"cache01.example.com\": {\"status\": \"failed\", \"load\": \"n\/a\",  \"uptime_days\": 0},\n    }\n    return fake_db.get(hostname, {\"status\": \"unknown\", \"error\": \"host not in inventory\"})\n\n\nclient = genai.Client(vertexai=True, project=\"your-gcp-project-id\", location=\"us-central1\")\n\nresponse = client.models.generate_content(\n    model=\"gemini-2.5-flash\",\n    contents=(\n        \"I have three boxes: web01.example.com, db01.example.com, \"\n        \"and cache01.example.com. Which one is in trouble?\"\n    ),\n    config=types.GenerateContentConfig(tools=[get_server_status]),\n)\n\nprint(response.text)<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">The model called <code>get_server_status<\/code> three times (once per host), got the dict back each time, and produced a final natural-language answer. No state machine, no manual loop:<\/p>\n\n\n\n<pre class=\"wp-block-code code\"><code>It looks like cache01.example.com is in trouble. It has a status of failed and an uptime of 0 days. The other two servers, web01.example.com and db01.example.com, are active.<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">The manual pattern is a different shape. You declare the tool with <code>types.FunctionDeclaration<\/code>, give it a JSON schema, and the model returns a <code>function_call<\/code> part instead of a text part. 
Your code dispatches the call, runs whatever it needs, and decides whether to feed the result back:<\/p>\n\n\n\n<pre class=\"wp-block-code code\"><code>from google import genai\nfrom google.genai import types\n\nlist_pods = types.FunctionDeclaration(\n    name=\"list_pods\",\n    description=\"Returns Kubernetes pods in a namespace.\",\n    parameters_json_schema={\n        \"type\": \"object\",\n        \"properties\": {\n            \"namespace\": {\n                \"type\": \"string\",\n                \"description\": \"The Kubernetes namespace, e.g. default, kube-system.\",\n            },\n            \"label_selector\": {\n                \"type\": \"string\",\n                \"description\": \"Optional label selector, e.g. app=nginx.\",\n            },\n        },\n        \"required\": [\"namespace\"],\n    },\n)\n\ntool = types.Tool(function_declarations=[list_pods])\n\nclient = genai.Client(vertexai=True, project=\"your-gcp-project-id\", location=\"us-central1\")\n\nresponse = client.models.generate_content(\n    model=\"gemini-2.5-flash\",\n    contents=\"Show me every nginx pod running in the staging namespace.\",\n    config=types.GenerateContentConfig(tools=[tool]),\n)\n\ncall = response.candidates[0].content.parts[0].function_call\nprint(f\"Function:  {call.name}\")\nprint(f\"Arguments: {dict(call.args)}\")<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">The captured output of both runs side by side:<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"920\" height=\"800\" src=\"https:\/\/computingforgeeks.com\/wp-content\/uploads\/2026\/05\/wm-vertex-ai-gemini-function-call-args.png\" alt=\"Manual function declaration plus the automatic function calling pattern\" class=\"wp-image-167372\" title=\"\" srcset=\"https:\/\/computingforgeeks.com\/wp-content\/uploads\/2026\/05\/wm-vertex-ai-gemini-function-call-args.png 920w, 
https:\/\/computingforgeeks.com\/wp-content\/uploads\/2026\/05\/wm-vertex-ai-gemini-function-call-args-300x261.png 300w, https:\/\/computingforgeeks.com\/wp-content\/uploads\/2026\/05\/wm-vertex-ai-gemini-function-call-args-768x668.png 768w\" sizes=\"auto, (max-width: 920px) 100vw, 920px\" \/><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">The manual call extracts <code>{'namespace': 'staging', 'label_selector': 'app=nginx'}<\/code> from the prompt without you parsing a single string. From there you would pass those args to <code>kubectl<\/code> via <code>subprocess<\/code>, the Python Kubernetes client, or any RPC of your choice.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Step 8: Vision and multimodal inputs<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">Gemini accepts images in three shapes: a public or signed URL, raw bytes from disk, or a Cloud Storage URI. For anything bigger than a few megabytes, GCS is the cleanest path: you skip the bytes-over-HTTP overhead and the SDK reuses the upload across requests.<\/p>\n\n\n\n<pre class=\"wp-block-code code\"><code>from google import genai\nfrom google.genai import types\n\nclient = genai.Client(vertexai=True, project=\"your-gcp-project-id\", location=\"us-central1\")\n\nresponse = client.models.generate_content(\n    model=\"gemini-2.5-flash\",\n    contents=[\n        \"Describe this image in two sentences. Then list every object you see.\",\n        types.Part.from_uri(\n            file_uri=\"gs:\/\/generativeai-downloads\/images\/scones.jpg\",\n            mime_type=\"image\/jpeg\",\n        ),\n    ],\n)\n\nprint(response.text)<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">The <code>gs:\/\/generativeai-downloads\/images\/scones.jpg<\/code> URI is one of Google&#8217;s public sample images, useful for sanity-checking vision without uploading your own bucket. 
The captured response shows how detailed the model can get on a single request:<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"920\" height=\"800\" src=\"https:\/\/computingforgeeks.com\/wp-content\/uploads\/2026\/05\/wm-vertex-ai-gemini-vision-image-description.png\" alt=\"Pass an image URI on Cloud Storage and have Gemini describe it\" class=\"wp-image-167375\" title=\"\" srcset=\"https:\/\/computingforgeeks.com\/wp-content\/uploads\/2026\/05\/wm-vertex-ai-gemini-vision-image-description.png 920w, https:\/\/computingforgeeks.com\/wp-content\/uploads\/2026\/05\/wm-vertex-ai-gemini-vision-image-description-300x261.png 300w, https:\/\/computingforgeeks.com\/wp-content\/uploads\/2026\/05\/wm-vertex-ai-gemini-vision-image-description-768x668.png 768w\" sizes=\"auto, (max-width: 920px) 100vw, 920px\" \/><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">For local images, swap <code>Part.from_uri<\/code> for <code>Part.from_bytes<\/code>. Inline bytes work up to about 7 MB, after which Vertex AI rejects the request with a payload-size error:<\/p>\n\n\n\n<pre class=\"wp-block-code code\"><code>with open(\"diagram.png\", \"rb\") as f:\n    image_bytes = f.read()\n\nresponse = client.models.generate_content(\n    model=\"gemini-2.5-flash\",\n    contents=[\n        \"What is shown in this image? Be specific about colors and text.\",\n        types.Part.from_bytes(data=image_bytes, mime_type=\"image\/png\"),\n    ],\n)\nprint(response.text)<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">Supported MIME types span <code>image\/png<\/code>, <code>image\/jpeg<\/code>, <code>image\/webp<\/code>, <code>image\/heic<\/code>, and <code>image\/heif<\/code>. Video inputs use <code>video\/mp4<\/code> and several siblings; audio uses <code>audio\/mp3<\/code> and <code>audio\/wav<\/code>. 
PDFs work too, treated as multi-page images, so a document-ingestion step in a RAG pipeline can pass a whole file in one call.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Step 9: Streaming with function calling, the production pattern<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">Streaming and tool use are usually shown in isolation. The pattern that ships in real products combines them: the model streams a partial answer, decides it needs to call a function, your code runs the function, and the model streams the rest. With automatic function calling the SDK handles the whole loop and you still get chunks as they arrive.<\/p>\n\n\n\n<pre class=\"wp-block-code code\"><code>from google import genai\nfrom google.genai import types\n\n\ndef get_disk_usage(mountpoint: str) -> dict:\n    \"\"\"Return current disk usage for a mountpoint.\n\n    Args:\n        mountpoint: The mountpoint to query, e.g. \/var, \/, \/home.\n    \"\"\"\n    fake = {\n        \"\/\":     {\"used_gb\": 32, \"total_gb\": 100, \"percent\": 32},\n        \"\/var\":  {\"used_gb\": 88, \"total_gb\": 100, \"percent\": 88},\n        \"\/home\": {\"used_gb\": 12, \"total_gb\": 50,  \"percent\": 24},\n    }\n    return fake.get(mountpoint, {\"error\": \"mountpoint not found\"})\n\n\nclient = genai.Client(vertexai=True, project=\"your-gcp-project-id\", location=\"us-central1\")\n\nstream = client.models.generate_content_stream(\n    model=\"gemini-2.5-flash\",\n    contents=(\n        \"Check disk usage on \/, \/var, and \/home. Tell me which mountpoint \"\n        \"is closest to filling up and recommend one action.\"\n    ),\n    config=types.GenerateContentConfig(tools=[get_disk_usage]),\n)\n\nfor chunk in stream:\n    if chunk.text:\n        print(chunk.text, end=\"\", flush=True)\nprint()<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">The model issues three calls to <code>get_disk_usage<\/code>, gets the dicts back, and streams a final recommendation. 
The captured run also shows pre-flight token counting and what a 404 looks like when you ask for a model that does not exist:<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"920\" height=\"800\" src=\"https:\/\/computingforgeeks.com\/wp-content\/uploads\/2026\/05\/wm-vertex-ai-gemini-streaming-with-tools.png\" alt=\"Production pattern combining streaming with automatic function calling\" class=\"wp-image-167376\" title=\"\" srcset=\"https:\/\/computingforgeeks.com\/wp-content\/uploads\/2026\/05\/wm-vertex-ai-gemini-streaming-with-tools.png 920w, https:\/\/computingforgeeks.com\/wp-content\/uploads\/2026\/05\/wm-vertex-ai-gemini-streaming-with-tools-300x261.png 300w, https:\/\/computingforgeeks.com\/wp-content\/uploads\/2026\/05\/wm-vertex-ai-gemini-streaming-with-tools-768x668.png 768w\" sizes=\"auto, (max-width: 920px) 100vw, 920px\" \/><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">Pre-flight token counting is the cheapest check you can run. It costs no model time and helps you catch a 200 KB prompt before you send it. <code>client.models.count_tokens<\/code> takes the same model and contents you would pass to <code>generate_content<\/code>:<\/p>\n\n\n\n<pre class=\"wp-block-code code\"><code>count = client.models.count_tokens(\n    model=\"gemini-2.5-flash\",\n    contents=\"Refactor this 5KB Python file.\" + \"x\" * 5000,\n)\nprint(count.total_tokens)<\/code><\/pre>\n\n\n\n<h2 class=\"wp-block-heading\">Step 10: Thinking, context caching, and the Batch API<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">Three features on Vertex AI cut cost or unlock new use cases. They all live behind the same <code>GenerateContentConfig<\/code> you already know, plus dedicated client methods for the longer-running ones.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Thinking mode lets Gemini 2.5 reason longer before answering. 
You set a thinking budget in tokens, and you can ask the SDK to expose the trace so you see how the model got there:<\/p>\n\n\n\n<pre class=\"wp-block-code code\"><code>response = client.models.generate_content(\n    model=\"gemini-2.5-pro\",\n    contents=(\n        \"A Kubernetes pod restarts every 47 minutes with OOMKilled. \"\n        \"The container limit is 512Mi and the JVM is set to -Xmx384m. \"\n        \"What is the most likely root cause and how do I confirm?\"\n    ),\n    config=types.GenerateContentConfig(\n        thinking_config=types.ThinkingConfig(\n            thinking_budget=2048,\n            include_thoughts=True,\n        ),\n    ),\n)\n\nfor part in response.candidates[0].content.parts:\n    if part.thought:\n        print(\"=== THOUGHT ===\")\n    elif part.text:\n        print(\"=== ANSWER ===\")\n    print(part.text)<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">On the OOMKilled prompt, Pro spent 1,616 thinking tokens before producing a 1,767-token answer that walked through Native Memory Tracking, JVM metaspace caps, and the <code>MaxRAMPercentage<\/code> flag. Thinking is what turns Gemini from a fast generator into a slow but careful debugger; reach for it on real triage, leave Flash on for chat.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Context caching is the cost lever almost nobody uses. If you send the same long preamble (a 30-page PDF, a code repo, a knowledge base) on every request, cache it once and reference it by name. 
Cached input tokens cost roughly 90% less than fresh tokens on Gemini 2.5:<\/p>\n\n\n\n<pre class=\"wp-block-code code\"><code>cached = client.caches.create(\n    model=\"gemini-2.5-flash\",\n    config=types.CreateCachedContentConfig(\n        contents=[\n            types.Content(role=\"user\", parts=[\n                types.Part.from_uri(\n                    file_uri=\"gs:\/\/my-bucket\/runbook.pdf\",\n                    mime_type=\"application\/pdf\",\n                )\n            ])\n        ],\n        system_instruction=\"Answer questions strictly from this runbook.\",\n        ttl=\"3600s\",\n    ),\n)\n\nresponse = client.models.generate_content(\n    model=\"gemini-2.5-flash\",\n    contents=\"What is the rollback procedure if step 3 fails?\",\n    config=types.GenerateContentConfig(cached_content=cached.name),\n)\nprint(response.text)<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">The Batch API on Vertex AI is the third lever. It accepts up to 200,000 prompts in a single job, sourced from a BigQuery table or a GCS file, and processes them asynchronously at a lower per-token cost. Use it for content generation jobs, dataset labelling, or evaluation runs where latency does not matter:<\/p>\n\n\n\n<pre class=\"wp-block-code code\"><code>job = client.batches.create(\n    model=\"gemini-2.5-flash\",\n    src=\"bq:\/\/storage-samples.generative_ai.batch_requests_for_multimodal_input\",\n    config=types.CreateBatchJobConfig(\n        dest=\"bq:\/\/your-gcp-project-id.gemini_results.batch_001\",\n    ),\n)\nprint(f\"Job: {job.name}, state: {job.state}\")<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">On cost: a streaming chat that runs for an hour against Gemini 2.5 Flash with a 200-token system prompt and 2,000 tokens of conversation per turn lands at roughly half a cent per turn. Add caching on the system prompt and that drops further. 
The <a href=\"https:\/\/computingforgeeks.com\/gcp-costs-explained\/\" target=\"_blank\" rel=\"noreferrer noopener\">GCP costs explained<\/a> guide covers the broader cost shape on Google Cloud if you want the bigger picture.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Migrating from vertexai.generative_models<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">If you already have code on the old SDK, the migration is mechanical. Most signatures have a one-line equivalent:<\/p>\n\n\n\n<figure class=\"wp-block-table\"><table><thead><tr><th>Old (<code>vertexai.generative_models<\/code>)<\/th><th>New (<code>google.genai<\/code>)<\/th><\/tr><\/thead><tbody><tr><td><code>vertexai.init(project=..., location=...)<\/code><\/td><td><code>genai.Client(vertexai=True, project=..., location=...)<\/code><\/td><\/tr><tr><td><code>GenerativeModel(\"gemini-1.5-flash\")<\/code><\/td><td>model name passed per call: <code>client.models.generate_content(model=\"gemini-2.5-flash\", ...)<\/code><\/td><\/tr><tr><td><code>model.generate_content(prompt)<\/code><\/td><td><code>client.models.generate_content(model=..., contents=...)<\/code><\/td><\/tr><tr><td><code>model.generate_content(prompt, stream=True)<\/code><\/td><td><code>client.models.generate_content_stream(model=..., contents=...)<\/code><\/td><\/tr><tr><td><code>Tool.from_function_declarations([...])<\/code><\/td><td><code>types.Tool(function_declarations=[...])<\/code><\/td><\/tr><tr><td><code>GenerationConfig(...)<\/code><\/td><td><code>types.GenerateContentConfig(...)<\/code><\/td><\/tr><tr><td><code>Part.from_data(data, mime_type)<\/code><\/td><td><code>types.Part.from_bytes(data=data, mime_type=...)<\/code><\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">The <a href=\"https:\/\/docs.cloud.google.com\/vertex-ai\/generative-ai\/docs\/sdks\/overview\" target=\"_blank\" rel=\"noreferrer noopener\">official SDK overview<\/a> tracks the full surface and notes the June 24, 2026 removal date for the old 
generative modules. Run <code>pip install --upgrade google-genai<\/code> on a feature branch, swap the imports, and run your existing tests; in most codebases the diff is under 50 lines.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Common errors and fixes<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">Four error shapes cover almost every hit you will see in production. Each one is captured from a real failed run.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">404: &#8220;Publisher Model &#8230; was not found&#8221;<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">The trace looks like this when the model ID is wrong, the model is not GA in your region, or the API was never enabled:<\/p>\n\n\n\n<pre class=\"wp-block-code code\"><code>google.genai.errors.ClientError: 404 NOT_FOUND. {'error': {'code': 404, 'message': \"Publisher Model `projects\/your-gcp-project-id\/locations\/us-central1\/publishers\/google\/models\/gemini-9.9-megadeluxe` was not found or your project does not have access to it...\"}}<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">Run <code>gcloud services enable aiplatform.googleapis.com<\/code> first, then check the <a href=\"https:\/\/docs.cloud.google.com\/vertex-ai\/generative-ai\/docs\/models\/gemini\/2-5-flash\" target=\"_blank\" rel=\"noreferrer noopener\">model availability page<\/a> for the model ID and supported regions. New projects sometimes propagate model access slowly; if you just enabled the API, wait two minutes and retry.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">403: &#8220;Permission &#8216;aiplatform.endpoints.predict&#8217; denied&#8221;<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">This is what you see when the calling principal is missing the Vertex AI User role:<\/p>\n\n\n\n<pre class=\"wp-block-code code\"><code>google.genai.errors.ClientError: 403 PERMISSION_DENIED. 
Permission 'aiplatform.endpoints.predict' denied on resource '\/\/aiplatform.googleapis.com\/projects\/your-gcp-project-id\/locations\/us-central1\/publishers\/google\/models\/gemini-2.5-flash'<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">Grant the role to the user or service account that is making the call:<\/p>\n\n\n\n<pre class=\"wp-block-code code\"><code>gcloud projects add-iam-policy-binding \"${PROJECT_ID}\" \\\n  --member=\"serviceAccount:${SA_NAME}@${PROJECT_ID}.iam.gserviceaccount.com\" \\\n  --role=\"roles\/aiplatform.user\"<\/code><\/pre>\n\n\n\n<h3 class=\"wp-block-heading\">DefaultCredentialsError: &#8220;File &#8230; was not found&#8221;<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">You typically see this after rotating service-account keys without updating the env var:<\/p>\n\n\n\n<pre class=\"wp-block-code code\"><code>google.auth.exceptions.DefaultCredentialsError: File \/Users\/you\/old-key.json was not found.<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">The <code>GOOGLE_APPLICATION_CREDENTIALS<\/code> env var points to a key that no longer exists. Either re-export it to a valid path or unset it and rely on ADC:<\/p>\n\n\n\n<pre class=\"wp-block-code code\"><code>unset GOOGLE_APPLICATION_CREDENTIALS\ngcloud auth application-default login<\/code><\/pre>\n\n\n\n<h3 class=\"wp-block-heading\">401: &#8220;API key not valid&#8221;<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">This one trips most people on day one. The trace looks like an auth issue but it is really a backend selection issue:<\/p>\n\n\n\n<pre class=\"wp-block-code code\"><code>google.genai.errors.ClientError: 401 API key not valid<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">The SDK fell through to AI Studio mode because <code>vertexai=True<\/code> was missing. 
Pass it to the constructor or set <code>GOOGLE_GENAI_USE_VERTEXAI=True<\/code> in the environment:<\/p>\n\n\n\n<pre class=\"wp-block-code code\"><code>export GOOGLE_GENAI_USE_VERTEXAI=True\nexport GOOGLE_CLOUD_PROJECT=\"${PROJECT_ID}\"\nexport GOOGLE_CLOUD_LOCATION=\"${REGION}\"<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">With those three env vars set, you can drop the <code>project<\/code> and <code>location<\/code> kwargs from <code>genai.Client()<\/code>; the SDK reads them at init.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">From here, the natural next step is RAG: store embeddings on Cloud SQL with pgvector, retrieve documents, and stitch the context into a Gemini prompt. The <a href=\"https:\/\/computingforgeeks.com\/gcp-cloud-sql-postgresql-terraform\/\" target=\"_blank\" rel=\"noreferrer noopener\">Cloud SQL Postgres on GCP guide<\/a> covers the database side. If you would rather keep Gemini for hosted inference but run open-source models locally for sensitive data, the <a href=\"https:\/\/computingforgeeks.com\/install-ollama-rocky-ubuntu\/\" target=\"_blank\" rel=\"noreferrer noopener\">Ollama install on Rocky and Ubuntu<\/a> walkthrough is the parallel track. Either way, the SDK pattern stays the same: one client, one method per shape, and the migration to whatever Gemini ships next is one upgrade away.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Most Vertex AI Gemini tutorials on the open web were written before the SDK changed. They show from vertexai.generative_models import GenerativeModel, an API Google deprecated in mid-2025 and removes entirely on June 24, 2026. Code that ships against that import will start raising ModuleNotFoundError the day it lands. 
This guide is the current playbook for &#8230; <a title=\"Use Vertex AI Gemini in Python: Streaming, Tools, Vision\" class=\"read-more\" href=\"https:\/\/computingforgeeks.com\/vertex-ai-gemini-python-streaming-tool-use\/\" aria-label=\"Read more about Use Vertex AI Gemini in Python: Streaming, Tools, Vision\">Read more<\/a><\/p>\n","protected":false},"author":3,"featured_media":167377,"comment_status":"open","ping_status":"","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[39034,2680,36939],"tags":[669,36175,372],"cfg_series":[39811],"class_list":["post-167378","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-ai","category-cloud","category-gcp","tag-dev","tag-gcp","tag-python","cfg_series-gcp-platform"],"_links":{"self":[{"href":"https:\/\/computingforgeeks.com\/wp-json\/wp\/v2\/posts\/167378","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/computingforgeeks.com\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/computingforgeeks.com\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/computingforgeeks.com\/wp-json\/wp\/v2\/users\/3"}],"replies":[{"embeddable":true,"href":"https:\/\/computingforgeeks.com\/wp-json\/wp\/v2\/comments?post=167378"}],"version-history":[{"count":2,"href":"https:\/\/computingforgeeks.com\/wp-json\/wp\/v2\/posts\/167378\/revisions"}],"predecessor-version":[{"id":167385,"href":"https:\/\/computingforgeeks.com\/wp-json\/wp\/v2\/posts\/167378\/revisions\/167385"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/computingforgeeks.com\/wp-json\/wp\/v2\/media\/167377"}],"wp:attachment":[{"href":"https:\/\/computingforgeeks.com\/wp-json\/wp\/v2\/media?parent=167378"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/computingforgeeks.com\/wp-json\/wp\/v2\/categories?post=167378"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/computingforgeeks.com\/wp-json\/wp\/v2\/tags?pos
t=167378"},{"taxonomy":"cfg_series","embeddable":true,"href":"https:\/\/computingforgeeks.com\/wp-json\/wp\/v2\/cfg_series?post=167378"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}