agent-vision-mcp

English | 中文

🖼️ An MCP server that gives non-vision AI models the ability to analyze images

What is this?

Many users connect coding assistants like Claude Code to text-only models such as DeepSeek v4 that have no vision capability. As a result, when you send an image in a conversation, the main model can't understand its content.

agent-vision-mcp solves this: it's an MCP (Model Context Protocol) server that lets you configure any OpenAI-compatible vision API (e.g. Gemini Flash, Alibaba Qwen-VL, OpenAI, or your own self-hosted vision model). When the main model encounters an image, it automatically calls this tool to get a text analysis of the image.

In short: give your text model eyes.

How it works

You: Please analyze the error in this screenshot [uploads screenshot.png]
    ↓
Claude Code (DeepSeek v4) sees an image but can't understand it
    ↓
Claude Code calls the MCP tool: analyze_image(image="screenshot.png", prompt="Read the error in the screenshot")
    ↓
agent-vision-mcp sends the image to your configured vision API (e.g. Gemini Flash)
    ↓
The vision model returns: "The screenshot shows terminal output with a TypeError: Cannot read property 'map' of undefined error at src/app.ts line 42..."
    ↓
Claude Code receives the text description and replies: "Your code has a TypeError at src/app.ts:42..."

Quick start

1. Install

npm install -g @kitlau/agent-vision-mcp

2. Prepare a vision API

You need an OpenAI-compatible vision API. Here are some common options:

Provider	Cost	Configuration
Gemini Flash	Cheap (free tier available)	Get an API key, use the OpenAI-compatible endpoint
Alibaba Qwen-VL	Pay-as-you-go	Get an API key from the DashScope platform
OpenAI	More expensive	Use `https://api.openai.com/v1` directly
Self-hosted vLLM/LiteLLM	Depends on hardware	Runs locally, no external cost

3. Configure the MCP client

Transport: stdio — This MCP server runs as a child process and communicates with the client via standard input/output (stdin/stdout) using JSON-RPC.

Add the following to your Claude Code / Claude Desktop MCP configuration:

{
  "mcpServers": {
    "agent-vision": {
      "type": "stdio",
      "command": "npx",
      "args": ["-y", "@kitlau/agent-vision-mcp"],
      "env": {
        "VISION_API_KEY": "your-api-key",
        "VISION_BASE_URL": "https://api.openai.com/v1",
        "VISION_MODEL_NAME": "gpt-4o"
      }
    }
  }
}

4. Use it

Once configured, just send an image in the conversation. The main model will automatically call the analyze_image tool to analyze it.

Configuration reference

Environment variables

Variable	Required	Default	Description
`VISION_API_KEY`	✅ Yes	-	The vision API key
`VISION_BASE_URL`	No	`https://api.openai.com/v1`	OpenAI-compatible API endpoint
`VISION_MODEL_NAME`	No	`gpt-4o`	Model name
`VISION_MAX_TOKENS`	No	`1024`	Max output tokens
`VISION_MAX_IMAGE_SIZE`	No	`20971520` (20MB)	Max image size in bytes
`VISION_CACHE_ENABLED`	No	`false`	Enable caching (avoid re-analyzing the same image)
`VISION_CACHE_DIR`	No	System temp dir	Directory for cache files

Per-provider configuration examples

OpenAI

{
  "env": {
    "VISION_API_KEY": "sk-proj-...",
    "VISION_BASE_URL": "https://api.openai.com/v1",
    "VISION_MODEL_NAME": "gpt-4o"
  }
}

Gemini (via OpenAI-compatible layer)

{
  "env": {
    "VISION_API_KEY": "AIza...",
    "VISION_BASE_URL": "https://generativelanguage.googleapis.com/v1beta/openai",
    "VISION_MODEL_NAME": "gemini-2.0-flash"
  }
}

Alibaba Qwen-VL (DashScope)

{
  "env": {
    "VISION_API_KEY": "sk-...",
    "VISION_BASE_URL": "https://dashscope.aliyuncs.com/compatible-mode/v1",
    "VISION_MODEL_NAME": "qwen-vl-max"
  }
}

Self-hosted vLLM

{
  "env": {
    "VISION_API_KEY": "not-needed",
    "VISION_BASE_URL": "http://localhost:8000/v1",
    "VISION_MODEL_NAME": "Qwen2-VL-7B-Instruct"
  }
}

Self-hosted LiteLLM (aggregating proxy)

{
  "env": {
    "VISION_API_KEY": "sk-your-lite-llm-key",
    "VISION_BASE_URL": "http://localhost:4000/v1",
    "VISION_MODEL_NAME": "gemini-2.0-flash"
  }
}

Tip: LiteLLM is a great option — it converts various vision APIs into the OpenAI-compatible format, so you can use it to aggregate multiple providers.

Tool: analyze_image

Parameters

Parameter	Type	Required	Description
`image`	string	Yes	The image. Accepts: base64 data URL, HTTP/HTTPS link, local file path. Formats: JPEG, PNG, WebP, GIF, BMP
`prompt`	string	Yes	The analysis instruction. Tells the vision model what information you want

prompt examples

"Describe everything in this image in detail"
"Extract all visible text in the image (OCR)"
"What UI elements and interactive components are in this design mockup?"
"Analyze the trends and key data points in this chart"
"Read the error message in this terminal output screenshot"
"What objects are in this photo, and how are they positioned relative to each other?"

Caching

When caching is enabled, the same image + prompt combination won't be re-sent to the vision API; the previously cached result is returned directly.

{
  "env": {
    "VISION_CACHE_ENABLED": "true",
    "VISION_CACHE_DIR": "/tmp/my-vision-cache"
  }
}

Cache TTL: 1 hour
Cache key: SHA256(image data + "::" + prompt)
Storage: one JSON file per cache entry

Development

# Clone the project
git clone https://github.com/kitlau86/agent-vision-mcp.git
cd agent-vision-mcp

# Install dependencies
npm install

# Dev mode (run TypeScript directly with tsx)
npm run dev

# Build
npm run build

# Run the build output
npm start

Tech stack

TypeScript (strict mode)
@modelcontextprotocol/sdk v1.x (MCP framework)
openai v4.x (OpenAI-compatible API client)
zod v3.x (parameter validation)

Troubleshooting

Startup error "VISION_API_KEY 未设置"

Make sure the VISION_API_KEY environment variable is set in your MCP client configuration.

Tool call returns "API 密钥无效（401）"

Check that the API key is correct and has permission to access the specified model. Make sure there are no extra spaces or quotes in the key.

Tool call returns "API 端点或模型不存在（404）"

Check that VISION_BASE_URL and VISION_MODEL_NAME are correct. Note that VISION_BASE_URL usually ends with /v1.

Tool call returns "请求被限流（429）"

Free-tier APIs usually have rate limits. Wait a moment and retry, or consider enabling caching to reduce duplicate requests.

Returns "不支持的图片格式"

Currently JPEG, PNG, WebP, GIF, and BMP are supported. Convert the image to one of these formats and try again.

License

MIT

Name		Name	Last commit message	Last commit date
Latest commit History 7 Commits
docs		docs
src		src
.env.example		.env.example
.gitignore		.gitignore
CLAUDE.md		CLAUDE.md
LICENSE		LICENSE
README-CN.md		README-CN.md
README.md		README.md
package-lock.json		package-lock.json
package.json		package.json
tsconfig.json		tsconfig.json

Folders and files

Latest commit

History

Repository files navigation

agent-vision-mcp

What is this?

How it works

Quick start

1. Install

2. Prepare a vision API

3. Configure the MCP client

4. Use it

Configuration reference

Environment variables

Per-provider configuration examples

OpenAI

Gemini (via OpenAI-compatible layer)

Alibaba Qwen-VL (DashScope)

Self-hosted vLLM

Self-hosted LiteLLM (aggregating proxy)

Tool: analyze_image

Parameters

prompt examples

Caching

Development

Tech stack

Troubleshooting

Startup error "VISION_API_KEY 未设置"

Tool call returns "API 密钥无效（401）"

Tool call returns "API 端点或模型不存在（404）"

Tool call returns "请求被限流（429）"

Returns "不支持的图片格式"

License

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages