OSS chat-native image generation + editing — the open-source counterpart to Google's Nano-Banana 2 / OpenAI's GPT-Image-2, served as an OpenAI-compatible LiteLLM provider on top of ComfyUI.
The mascot above was generated by protoBanana itself — chat completion through
protolabs/qwen-image-chat, prompt: "a friendly cartoon banana waving hello, simple white background".
ComfyUI is the open-source standard for composable image pipelines. Every
pipeline is a workflow JSON: a graph of nodes, weights, prompts, sampler
settings, conditional branches. It's expressive — and it speaks only its
own /prompt REST API. Nothing in the wider OpenAI client ecosystem
knows what a ComfyUI workflow is.
LiteLLM is the open-source standard for unifying LLM providers behind the OpenAI spec. Every OpenAI-compatible client — Open WebUI, the Anthropic / OpenAI SDKs, a curl one-liner, your CLI tool — already speaks it. It has no native ComfyUI provider.
protoBanana is the bridge. The provider package registers a custom LiteLLM provider that does three things:
- Maps a model name (e.g. comfyui-qwen-image/qwen_image_edit_2511) to a workflow JSON on disk.
- Translates between specs: takes an OpenAI request shape (/v1/images/generations, /v1/images/edits, or /v1/chat/completions with image parts) and patches the right slots of the workflow JSON (prompt, init image, mask, seed, size, custom KSampler params, etc.).
- Submits to ComfyUI, polls /history/<id>, fetches /view, and returns the OpenAI response shape (ImageResponse(b64_json) or a chat completion with markdown-embedded images).
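The "patches the right slots" step can be pictured with a small sketch. It assumes a workflow exported from ComfyUI in API format (a dict of node id to `class_type` plus `inputs`); the `SLOTS` table, the node ids, and the function names are hypothetical and are not taken from the provider package.

```python
import copy

# Hypothetical slot map: which node id / input key receives each request field.
# Real workflows would need their own mapping; these ids are made up.
SLOTS = {
    "prompt": ("6", "text"),
    "seed": ("3", "seed"),
    "width": ("5", "width"),
    "height": ("5", "height"),
}


def parse_size(size):
    """Turn an OpenAI-style size string such as "1024x768" into (width, height)."""
    width, height = size.lower().split("x")
    return int(width), int(height)


def patch_workflow(workflow, prompt, size="1024x1024", seed=0, slots=SLOTS):
    """Return a patched copy of an API-format workflow; the input is left untouched."""
    patched = copy.deepcopy(workflow)
    width, height = parse_size(size)
    values = {"prompt": prompt, "seed": seed, "width": width, "height": height}
    for field, (node_id, input_key) in slots.items():
        patched[node_id]["inputs"][input_key] = values[field]
    return patched


# A tiny stand-in for a workflow exported from ComfyUI in API format.
workflow = {
    "3": {"class_type": "KSampler", "inputs": {"seed": 1, "steps": 20}},
    "5": {"class_type": "EmptyLatentImage",
          "inputs": {"width": 512, "height": 512, "batch_size": 1}},
    "6": {"class_type": "CLIPTextEncode", "inputs": {"text": ""}},
}
patched = patch_workflow(workflow, "a cat in a hat, watercolor", size="1024x768", seed=42)
```

Copying before patching keeps the on-disk template reusable across requests, which matters when one workflow backs many concurrent calls.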
The net effect:
A ComfyUI workflow you authored in the web UI becomes a model name any OpenAI client can call.
That's why the same gateway alias works from Open WebUI, the Gradio app
in this repo, protoCLI, a Python SDK script, or a curl /v1/images/generations.
You don't write a new client per consumer; you write a workflow once and
it appears in every OpenAI-compatible surface you have.
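As an illustration of "any OpenAI client", here is a minimal standard-library sketch that builds a /v1/images/generations request and decodes the b64_json payload. The gateway URL, the `sk-example` key, and the helper names are placeholders, and whether a given workflow honors `size` depends on how it is patched.

```python
import base64
import json
import urllib.request


def build_image_request(gateway_url, api_key, model, prompt, size="1024x1024"):
    """Build (but do not send) a POST to the gateway's /v1/images/generations route."""
    body = json.dumps({"model": model, "prompt": prompt, "size": size}).encode("utf-8")
    return urllib.request.Request(
        gateway_url.rstrip("/") + "/v1/images/generations",
        data=body,
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def decode_first_image(response_json):
    """Decode data[0].b64_json from an OpenAI-style image response into raw bytes."""
    return base64.b64decode(response_json["data"][0]["b64_json"])


request = build_image_request(
    "http://your-gateway:4000",  # placeholder host
    "sk-example",                # placeholder key
    "protolabs/qwen-image",
    "a friendly cartoon banana waving hello, simple white background",
)

# Sending it requires a running gateway, so it is left commented out:
# with urllib.request.urlopen(request) as resp:
#     png_bytes = decode_first_image(json.load(resp))
```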
Authoring those workflows is out of scope here — that's ComfyUI's job.
See protoLabsAI/comfy-workflows
for the workflow library that gets bind-mounted into the gateway. The
seven workflows shipped with this repo (qwen_image_2512,
qwen_image_edit_2511, multiref_*, inpaint_*, outpaint_*,
region_edit_*, bgremove_*) are reference implementations that prove
the bridge works across every category — gen, edit, mask, multi-image,
agent-routed chat — and back the model aliases below.
One gateway alias drives the full conversational image experience:
- protolabs/qwen-image: text-to-image (/v1/images/generations)
- protolabs/qwen-image-edit: image + instruction (/v1/images/edits)
- protolabs/qwen-image-chat: multi-turn "draw → now make it blue", multi-reference compose, region edit, background removal, outpaint, all routed by an LLM agent that owns the chat surface
The chat path is agent-driven: an LLM (default protolabs/fast =
Qwen3.6-35B-A3B-FP8) decides whether to respond conversationally, call
an image tool, or chain multiple tools. Conversational replies, clarifying
questions, and "remove the bg, then put a sunset behind" chains all
work; see docs/agent.md. The chat path falls back to a deterministic
keyword classifier when no LLM endpoint is configured or the endpoint fails.
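The agent-first, classifier-second behavior can be sketched roughly as follows. The keyword table, the route names, and the `agent` callable signature are all assumptions for illustration; the real rules live in docs/intent-router.md and docs/agent.md and may differ.

```python
# Hypothetical keyword table, checked in order; not the project's actual rules.
KEYWORDS = [
    ("bgremove", ("remove the background", "remove the bg", "sticker")),
    ("outpaint", ("extend", "wider", "outpaint")),
    ("edit", ("make it", "change", "replace")),
]


def classify_by_keywords(text, has_image):
    """Deterministic fallback: first matching keyword wins, else plain gen or edit."""
    lowered = text.lower()
    if not has_image:
        return "gen"
    for route, phrases in KEYWORDS:
        if any(phrase in lowered for phrase in phrases):
            return route
    return "edit"


def pick_route(text, has_image, agent=None):
    """Ask the agent when one is configured; fall back if it is missing or fails."""
    if agent is not None:
        try:
            return agent(text, has_image)
        except Exception:
            pass  # endpoint unreachable or bad reply: use the classifier instead
    return classify_by_keywords(text, has_image)
```

Keeping the fallback free of network calls means a chat turn still resolves to some route even when the LLM endpoint is down.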
Backed by Qwen-Image-2512 (gen) + Qwen-Image-Edit-2511 (edit, multi-ref, inpaint, outpaint, region edit) + BiRefNet/RMBG-2.0 (sticker) + SAM 3 (text→mask grounding for region edit). All seven phases shipped.
Nano-Banana 2 and GPT-Image-2 made conversational image editing
mainstream. They're closed-source, hosted, and metered. For organizations
that can't or won't send their data to a third party, the equivalent
experience didn't exist as a single drop-in stack.
protoBanana fills that gap. It's the same call shape (/v1/chat/completions
with image output), the same UX ("draw a cat" → "now make it blue"),
running entirely on local GPUs through your own LiteLLM gateway.
| | nano-banana 2 | protoBanana (Phase 1) |
|---|---|---|
| Operation auto-routing per chat turn | ✓ | ✓ |
| Conversational replies + clarifying questions | ✓ | ✓ (agent path) |
| Chained operations in one chat turn | ✓ | ✓ (agent calls tools in sequence) |
| Text-to-image | ✓ | ✓ |
| Single-image instruction edit | ✓ | ✓ |
| Multi-reference compose | up to 14 refs | up to 3 (Qwen-Image-Edit cap) |
| Background removal / sticker | ✓ | ✓ |
| Text-region edit ("change the man's tie") | ✓ | ✓ (SAM 3 text→mask) |
| Inpaint with provided mask | ✓ | ✓ (/v1/images/edits + mask) |
| Outpaint | ✓ | ✓ ("extend left", "make this wider") |
| Hosted | yes | no — all local |
| Cost per image | metered | electricity |
See PHASES.md for the per-phase rationale.
```bash
# 1. Install into your LiteLLM gateway environment.
#    [tracing] pulls langfuse v2 (LiteLLM-compatible).
#    [agent]   pulls openai client for the chat agent.
pip install 'protobanana[tracing,agent] @ git+https://github.com/protoLabsAI/protoBanana.git'
```

```yaml
# 2. Add to LiteLLM config.yaml:
model_list:
  - model_name: protolabs/qwen-image
    litellm_params:
      model: protobanana/qwen_image_2512
      api_base: http://your-comfyui-host:8188
    model_info: { mode: image_generation }
  - model_name: protolabs/qwen-image-chat
    litellm_params:
      model: protobanana/chat
      api_base: http://your-comfyui-host:8188
    model_info: { mode: chat, supports_vision: true }

litellm_settings:
  custom_provider_map:
    - { provider: "protobanana", custom_handler: "protobanana.handler" }
```

```bash
# 3. Mount the workflows dir into the gateway container at /app/workflows
#    (or set PROTOBANANA_WORKFLOWS_DIR)

# 4. (Optional but recommended) Enable the chat agent:
#    PROTOBANANA_AGENT_BASE=http://localhost:4000/v1   # gateway calls itself
#    PROTOBANANA_AGENT_KEY=$LITELLM_MASTER_KEY
#    PROTOBANANA_AGENT_MODEL=protolabs/fast            # or protolabs/smart

# 5. Hit it like any OpenAI chat endpoint
curl -X POST http://your-gateway:4000/v1/chat/completions \
  -H "Authorization: Bearer $KEY" \
  -H "Content-Type: application/json" \
  -d '{"model":"protolabs/qwen-image-chat","messages":[
        {"role":"user","content":"a cat in a hat, watercolor"}
      ]}'

# Then continue the conversation:
#   {"role":"user","content":"make it a bowling cap"}
# → agent picks region_edit("the hat" → "a bowling cap"), preserves
#   everything else pixel-perfect.
```

Returns an assistant message with a markdown-embedded data:image/png;base64,... URL; Open WebUI displays it inline like a regular image attachment.
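Clients that do not render markdown can pull the image bytes out of the reply themselves. The sketch below assumes the image arrives as a standard markdown image whose target is a base64 data URL; the exact alt text and surrounding prose are assumptions, and `extract_images` is a hypothetical helper.

```python
import base64
import re

# Matches markdown images whose target is a base64 data URL:
#   ![alt](data:image/png;base64,AAAA...)
DATA_URL_IMAGE = re.compile(
    r"!\[[^\]]*\]\(data:(image/[\w.+-]+);base64,([A-Za-z0-9+/=]+)\)"
)


def extract_images(content):
    """Return a list of (mime_type, raw_bytes) for each embedded image in the reply."""
    return [
        (mime, base64.b64decode(payload))
        for mime, payload in DATA_URL_IMAGE.findall(content)
    ]


# Stand-in reply; a real one would carry a full PNG payload.
payload = base64.b64encode(b"\x89PNG fake bytes").decode("ascii")
reply = f"Here is your image:\n\n![generated](data:image/png;base64,{payload})"
images = extract_images(reply)
```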
See docs/installation.md for the full setup (ComfyUI install, model downloads + symlinks, GPU planning).
```
            OpenAI client (Open WebUI / protoCLI / curl)
                                  │
                                  ▼
                           LiteLLM gateway
                                  │
                                  ▼
                         ProtoBananaProvider
                                  │
                    ┌─────────────┴─────────────┐
                    ▼                           ▼
             chat agent loop           keyword classifier
            (LLM picks tool)         (fallback when no LLM)
                    │                           │
                    └─────────────┬─────────────┘
                                  ▼
    ┌─────────┬─────────┬─────────┬─────────┬─────────┬─────────┐
    ▼         ▼         ▼         ▼         ▼         ▼         ▼
   gen       edit    region     multi   bgremove   inpaint  outpaint
                      edit       ref
    │         │         │         │         │         │         │
    └─────────┴─────────┴─────────┴─────────┴─────────┴─────────┘
                                  │
                                  ▼
                            ComfyUIClient
                          (HTTP transport)
                                  │
                                  ▼
                               ComfyUI
              (Qwen-Image-2512 / Qwen-Image-Edit-2511 /
                    BiRefNet / RMBG-2.0 / SAM 3)
```
The chat agent is the default; the keyword classifier kicks in only when
PROTOBANANA_AGENT_BASE is unset or the LLM endpoint fails. Either path
calls the same six route modules.
See docs/architecture.md for the full breakdown and docs/agent.md for the agent loop in detail.
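The ComfyUIClient step follows the submit, poll, fetch sequence against ComfyUI's /prompt, /history/<id>, and /view endpoints. A rough sketch under stated assumptions: `post_json`, `get_json`, and `get_bytes` are hypothetical injected HTTP helpers (so the flow can be exercised without a server), and the function is not the package's actual client.

```python
import time
import urllib.parse


def run_workflow(workflow, post_json, get_json, get_bytes,
                 poll_interval=1.0, timeout=300.0):
    """Submit a workflow, wait for its history entry, return output images as bytes.

    post_json(path, body), get_json(path) and get_bytes(path) are injected
    transport helpers; their names and signatures are assumptions.
    """
    prompt_id = post_json("/prompt", {"prompt": workflow})["prompt_id"]

    deadline = time.monotonic() + timeout
    while True:
        history = get_json(f"/history/{prompt_id}")
        if prompt_id in history:  # the entry shows up once execution has ended
            break
        if time.monotonic() > deadline:
            raise TimeoutError(f"workflow {prompt_id} did not finish in {timeout}s")
        time.sleep(poll_interval)

    images = []
    for node_output in history[prompt_id]["outputs"].values():
        for image in node_output.get("images", []):
            query = urllib.parse.urlencode({
                "filename": image["filename"],
                "subfolder": image.get("subfolder", ""),
                "type": image.get("type", "output"),
            })
            images.append(get_bytes(f"/view?{query}"))
    return images
```

Injecting the transport keeps the polling logic testable with fakes and leaves the choice of HTTP library to the caller.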
The Gradio app at app/ is a reference consumer + quick-test surface,
not the product. It exists for two specific reasons:
- Dogfooding the bridge. Every tab posts to the gateway exactly the way any other OpenAI client would. If a workflow regresses or a provider change breaks the request shape, Gradio surfaces it before downstream consumers do.
- Fast iteration on workflows. Author a workflow in ComfyUI, drop the JSON into the gateway's mount, add a model alias to LiteLLM, hit the Gradio tab to validate end-to-end — usually 30 seconds.
Five tabs covering Generate, Edit, Multi-ref, Sticker, and Chat (multi-turn auto-routing). Runs anywhere with Python 3.11+; intended for local debugging AND HuggingFace Space deployment. See app/README.md and docs/gradio-app.md.
```bash
pip install -e ".[gradio]"
GATEWAY_URL=http://your-gateway:4000/v1 GATEWAY_API_KEY=sk-... python -m app
```

If you only want the bridge in your gateway (for example, you'll consume it from
Open WebUI, your own CLI, or a non-Python client), skip the [gradio]
extra and the app/ directory entirely. The provider package is
self-contained.
| Document | Contents |
|---|---|
| PROPOSAL.md | The strategic system design + why-this-shape |
| PHASES.md | The 7-phase roadmap with status, models needed, acceptance criteria |
| JOURNEY.md | How we got here — the full backfill (research → broken integrations → gateway → agent) |
| HOWTO.md | User-facing guide: prompting recipes, multi-ref tricks, intent keywords |
| app/README.md | Gradio test/eval UI — local + HF Space |
| docs/installation.md | Full setup from a clean machine |
| docs/operating.md | Day-2 ops: GPU planning, model swaps, troubleshooting |
| docs/architecture.md | Component breakdown + extension points |
| docs/agent.md | The tool-use chat agent — loop, tools, env, fallback, multi-step examples |
| docs/workflows-cookbook.md | How to add a new ComfyUI workflow |
| docs/intent-router.md | How the keyword fallback path routes requests |
| docs/gradio-app.md | Test/eval UI architecture + HF Space deploy |
| docs/api.md | Client-facing API reference |
| docs/observability.md | Langfuse tracing — what's captured, env, recommended views |
| docs/validating-workflows.md | Static workflow validator + e2e smoke (pre-merge gate) |
| docs/benchmarks.md | Quality + latency methodology |
| DECISIONS.md | Architectural decision records |
| CHANGELOG.md | Per-version log |
protoBanana is a synthesis, not an invention. Credit:
| Component | Source |
|---|---|
| Image gen + edit | Qwen-Image, Qwen-Image-Edit-2511 (Alibaba) |
| Background removal | BiRefNet, RMBG-2.0 (BRIA) |
| Region segmentation (Phase 4) | Florence-2 (Microsoft), SAM 2.1 (Meta) |
| Universal inpaint (Phase 5) | LanPaint |
| Bundled ComfyUI nodes | ComfyUI-RMBG (1038lab) — RMBG/BiRefNet/SAM/Grounding |
| LLM gateway | LiteLLM (BerriAI) |
| Image runtime | ComfyUI |
| Original paradigm | Nano-Banana 2 (Google), GPT-Image-2 (OpenAI) |
Apache-2.0. Workflows and node-pack dependencies retain their original
licenses (see workflows/<name>.json's _doc field for per-workflow
notes — RMBG-2.0 is CC BY-NC 4.0 / non-commercial).
See CITATION.cff.
