protoBanana mascot — a friendly cartoon banana waving hello

protoBanana

OSS chat-native image generation + editing — the open-source counterpart to Google's Nano-Banana 2 / OpenAI's GPT-Image-2, served as an OpenAI-compatible LiteLLM provider on top of ComfyUI.

The mascot above was generated by protoBanana itself — chat completion through protolabs/qwen-image-chat, prompt: "a friendly cartoon banana waving hello, simple white background".

The capability — workflow JSON as OpenAI model name

ComfyUI is the open-source standard for composable image pipelines. Every pipeline is a workflow JSON: a graph of nodes, weights, prompts, sampler settings, conditional branches. It's expressive — and it speaks only its own /prompt REST API. Nothing in the wider OpenAI client ecosystem knows what a ComfyUI workflow is.

LiteLLM is the open-source standard for unifying LLM providers behind the OpenAI spec. Every OpenAI-compatible client — Open WebUI, the Anthropic / OpenAI SDKs, a curl one-liner, your CLI tool — already speaks it. It has no native ComfyUI provider.

protoBanana is the bridge. The provider package registers a custom LiteLLM provider that does three things:

  1. Maps a model name (e.g. comfyui-qwen-image/qwen_image_edit_2511) to a workflow JSON on disk.
  2. Translates between specs: takes an OpenAI request shape — /v1/images/generations, /v1/images/edits, or /v1/chat/completions with image parts — and patches the right slots of the workflow JSON (prompt, init image, mask, seed, size, custom KSampler params, etc.).
  3. Submits to ComfyUI, polls /history/<id>, fetches /view, and returns the OpenAI response shape (ImageResponse(b64_json) or a chat completion with markdown-embedded images).
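
Steps 1 and 2 can be sketched roughly as follows. This is an illustration under assumptions, not the provider's actual code: the `workflow_path` and `patch_workflow` helpers, the slot map, and the node ids are hypothetical. The only thing taken as given is ComfyUI's API-format workflow JSON, in which each node id maps to a `class_type` and an `inputs` dict.

```python
import copy
from pathlib import Path

# The Quickstart mounts workflows at /app/workflows; the file layout is assumed.
WORKFLOWS_DIR = Path("/app/workflows")


def workflow_path(model: str) -> Path:
    """Map 'provider/workflow_name' to a workflow JSON path (assumed layout)."""
    name = model.split("/", 1)[-1]
    return WORKFLOWS_DIR / f"{name}.json"


def patch_workflow(workflow: dict, slots: dict, values: dict) -> dict:
    """Return a copy of the workflow with request values written into slots.

    `slots` maps a logical name ("prompt", "seed") to (node_id, input_name).
    """
    patched = copy.deepcopy(workflow)
    for name, value in values.items():
        node_id, input_name = slots[name]
        patched[node_id]["inputs"][input_name] = value
    return patched


# Tiny stand-in for an exported API-format workflow (hypothetical node ids).
example = {
    "3": {"class_type": "KSampler", "inputs": {"seed": 0, "steps": 20}},
    "6": {"class_type": "CLIPTextEncode", "inputs": {"text": ""}},
}
slots = {"prompt": ("6", "text"), "seed": ("3", "seed")}
patched = patch_workflow(
    example, slots, {"prompt": "a cat in a hat, watercolor", "seed": 42}
)
```

An explicit slot map keeps the patching independent of node types, so a workflow with separate positive and negative prompt nodes only needs a different map rather than different code.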

The net effect:

A ComfyUI workflow you authored in the web UI becomes a model name any OpenAI client can call.

That's why the same gateway alias works from Open WebUI, the Gradio app in this repo, protoCLI, a Python SDK script, or a curl /v1/images/generations. You don't write a new client per consumer; you write a workflow once and it appears in every OpenAI-compatible surface you have.
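
As an illustration of the "Python script" consumer, here is a minimal standard-library sketch that builds a request to /v1/images/generations and decodes an OpenAI-style response. The gateway host is the Quickstart placeholder, the helper names are hypothetical, and whether the provider honors `response_format` is an assumption.

```python
import base64
import json
import urllib.request

# Placeholder host from the Quickstart; replace with your gateway.
GATEWAY = "http://your-gateway:4000"


def build_image_request(
    api_key: str, prompt: str, model: str = "protolabs/qwen-image"
) -> urllib.request.Request:
    """Build (but do not send) a POST to /v1/images/generations."""
    # response_format is an OpenAI images parameter; support here is assumed.
    body = {"model": model, "prompt": prompt, "response_format": "b64_json"}
    return urllib.request.Request(
        f"{GATEWAY}/v1/images/generations",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def decode_first_image(response: dict) -> bytes:
    """Decode the first image from an OpenAI-style images response."""
    return base64.b64decode(response["data"][0]["b64_json"])


request = build_image_request("sk-example", "a cat in a hat, watercolor")
# To send it against a live gateway:
#   with urllib.request.urlopen(request) as resp:
#       png_bytes = decode_first_image(json.load(resp))
```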

Authoring those workflows is out of scope here — that's ComfyUI's job. See protoLabsAI/comfy-workflows for the workflow library that gets bind-mounted into the gateway. The seven workflows shipped with this repo (qwen_image_2512, qwen_image_edit_2511, multiref_*, inpaint_*, outpaint_*, region_edit_*, bgremove_*) are reference implementations that prove the bridge works across every category — gen, edit, mask, multi-image, agent-routed chat — and back the model aliases below.

What it is

Three gateway aliases cover the surface, and the chat alias alone drives the full conversational image experience:

  • protolabs/qwen-image — text-to-image (/v1/images/generations)
  • protolabs/qwen-image-edit — image + instruction (/v1/images/edits)
  • protolabs/qwen-image-chat — multi-turn "draw → now make it blue", multi-reference compose, region edit, background removal, outpaint — routed by an LLM agent that owns the chat surface

The chat path is agent-driven: an LLM (by default protolabs/fast, which is Qwen3.6-35B-A3B-FP8) decides whether to respond conversationally, call an image tool, or chain multiple tools. Conversational replies, clarifying questions, and "remove the bg, then put a sunset behind" chains all work; see docs/agent.md. The provider falls back to a deterministic keyword classifier when no LLM endpoint is configured or the endpoint fails.
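
The agent-or-fallback decision might look something like the sketch below. It is heavily simplified: the keyword table and function names are hypothetical (the real intent keywords are documented in docs/intent-router.md), and the agent is reduced to a callable that returns a route name, whereas the real agent can also reply conversationally or chain tools.

```python
import os
from typing import Callable, Mapping, Optional

# Hypothetical keyword table, for illustration only.
KEYWORD_ROUTES = [
    (("remove the background", "remove the bg", "sticker"), "bgremove"),
    (("extend", "wider", "outpaint"), "outpaint"),
]


def classify_by_keyword(text: str, has_image: bool) -> str:
    """Deterministic fallback: first keyword match wins, else gen or edit."""
    lowered = text.lower()
    for keywords, route in KEYWORD_ROUTES:
        if any(keyword in lowered for keyword in keywords):
            return route
    return "edit" if has_image else "gen"


def choose_route(
    text: str,
    has_image: bool,
    agent: Optional[Callable[[str, bool], str]] = None,
    env: Optional[Mapping[str, str]] = None,
) -> str:
    """Prefer the agent when configured; fall back on any agent failure."""
    env = os.environ if env is None else env
    if agent is not None and env.get("PROTOBANANA_AGENT_BASE"):
        try:
            return agent(text, has_image)
        except Exception:
            pass  # LLM endpoint failed; fall through to the classifier
    return classify_by_keyword(text, has_image)


def broken_agent(text: str, has_image: bool) -> str:
    raise ConnectionError("LLM endpoint unreachable")


configured = {"PROTOBANANA_AGENT_BASE": "http://localhost:4000/v1"}
route = choose_route("make this wider", True, agent=broken_agent, env=configured)
```

Here `route` comes from the keyword classifier, because the stand-in agent raises.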

Backed by Qwen-Image-2512 (gen) + Qwen-Image-Edit-2511 (edit, multi-ref, inpaint, outpaint, region edit) + BiRefNet/RMBG-2.0 (sticker) + SAM 3 (text→mask grounding for region edit). All seven phases shipped.

Why this exists

Nano-Banana 2 and GPT-Image-2 made conversational image editing mainstream. They're closed-source, hosted, and metered. For organizations that can't or won't send their data to a third party, the equivalent experience didn't exist as a single drop-in stack.

protoBanana fills that gap. It's the same call shape (/v1/chat/completions with image output), the same UX ("draw a cat" → "now make it blue"), running entirely on local GPUs through your own LiteLLM gateway.

Headline numbers

| | nano-banana 2 | protoBanana (Phase 1) |
|---|---|---|
| Operation auto-routing per chat turn | | |
| Conversational replies + clarifying questions | | ✓ (agent path) |
| Chained operations in one chat turn | | ✓ (agent calls tools in sequence) |
| Text-to-image | | |
| Single-image instruction edit | | |
| Multi-reference compose | up to 14 refs | up to 3 (Qwen-Image-Edit cap) |
| Background removal / sticker | | |
| Text-region edit ("change the man's tie") | | ✓ (SAM 3 text→mask) |
| Inpaint with provided mask | | ✓ (/v1/images/edits + mask) |
| Outpaint | | ✓ ("extend left", "make this wider") |
| Hosted | yes | no (all local) |
| Cost per image | metered | electricity |

See PHASES.md for the per-phase rationale.

Quickstart

# 1. Install into your LiteLLM gateway environment.
#    [tracing] pulls langfuse v2 (LiteLLM-compatible).
#    [agent] pulls openai client for the chat agent.
pip install 'protobanana[tracing,agent] @ git+https://github.com/protoLabsAI/protoBanana.git'

# 2. Add to LiteLLM config.yaml:
model_list:
  - model_name: protolabs/qwen-image
    litellm_params:
      model: protobanana/qwen_image_2512
      api_base: http://your-comfyui-host:8188
    model_info: { mode: image_generation }

  - model_name: protolabs/qwen-image-chat
    litellm_params:
      model: protobanana/chat
      api_base: http://your-comfyui-host:8188
    model_info: { mode: chat, supports_vision: true }

litellm_settings:
  custom_provider_map:
    - { provider: "protobanana", custom_handler: "protobanana.handler" }

# 3. Mount the workflows dir into the gateway container at /app/workflows
#    (or set PROTOBANANA_WORKFLOWS_DIR)

# 4. (Optional but recommended) Enable the chat agent:
#    PROTOBANANA_AGENT_BASE=http://localhost:4000/v1   # gateway calls itself
#    PROTOBANANA_AGENT_KEY=$LITELLM_MASTER_KEY
#    PROTOBANANA_AGENT_MODEL=protolabs/fast            # or protolabs/smart

# 5. Hit it like any OpenAI chat endpoint
curl -X POST http://your-gateway:4000/v1/chat/completions \
  -H "Authorization: Bearer $KEY" \
  -H "Content-Type: application/json" \
  -d '{"model":"protolabs/qwen-image-chat","messages":[
    {"role":"user","content":"a cat in a hat, watercolor"}
  ]}'

# Then continue the conversation:
#   {"role":"user","content":"make it a bowling cap"}
# → agent picks region_edit("the hat" → "a bowling cap"), preserves
#   everything else pixel-perfect.

The response is an assistant message containing a markdown-embedded data:image/png;base64,... URL, which Open WebUI displays inline like a regular image attachment.
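
Clients that do not render markdown can pull the image bytes out of the reply themselves. The sketch below assumes the image is embedded with standard markdown image syntax, `![alt](data:image/png;base64,...)`; the exact alt text and layout of the real reply are not specified here, and `extract_images` is a hypothetical helper.

```python
import base64
import re

# Matches ![alt](data:image/<type>;base64,<payload>) in a markdown string.
DATA_URL_RE = re.compile(
    r"!\[[^\]]*\]\(data:image/(\w+);base64,([A-Za-z0-9+/=]+)\)"
)


def extract_images(markdown: str) -> list[tuple[str, bytes]]:
    """Return (image_type, raw_bytes) for each embedded data-URL image."""
    return [
        (match.group(1), base64.b64decode(match.group(2)))
        for match in DATA_URL_RE.finditer(markdown)
    ]


# Stand-in reply content; a real one would carry a full PNG payload.
payload = base64.b64encode(b"\x89PNG demo bytes").decode("ascii")
reply = f"Here is your cat.\n\n![image](data:image/png;base64,{payload})"
images = extract_images(reply)
```

Each returned tuple can be written straight to disk, for example with `Path("cat.png").write_bytes(images[0][1])`.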

See docs/installation.md for the full setup (ComfyUI install, model downloads + symlinks, GPU planning).

Architecture

                 OpenAI client (Open WebUI / protoCLI / curl)
                          │
                          ▼
                    LiteLLM gateway
                          │
                          ▼
                  ProtoBananaProvider
                          │
                ┌─────────┴──────────┐
                ▼                    ▼
        chat agent loop      keyword classifier
        (LLM picks tool)     (fallback when no LLM)
                │                    │
                └─────────┬──────────┘
                          ▼
       ┌──────┬──────┬──────┬──────┬──────┬──────┐
       ▼      ▼      ▼      ▼      ▼      ▼      ▼
      gen   edit  region multi  bgremove inpaint outpaint
                  edit   ref
       │      │      │      │      │      │      │
       └──────┴──────┴──────┴──────┴──────┴──────┘
                          │
                          ▼
                    ComfyUIClient
                  (HTTP transport)
                          │
                          ▼
                       ComfyUI
            (Qwen-Image-2512 / Qwen-Image-Edit-2511 /
                 BiRefNet / RMBG-2.0 / SAM 3)

The chat agent is the default; the keyword classifier kicks in only when PROTOBANANA_AGENT_BASE is unset or the LLM endpoint fails. Either path calls the same seven route modules shown in the diagram.
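
The ComfyUIClient box at the bottom of the diagram is the HTTP transport: submit to /prompt, poll /history/<id>, fetch /view. A rough sketch of that loop follows. The `run_workflow` function and the injected `http_json` / `http_bytes` callables are hypothetical, chosen so the sketch runs without a server. The request and response field names reflect ComfyUI's REST API as I understand it and should be checked against the ComfyUI version in use.

```python
import time
from typing import Callable, Optional
from urllib.parse import urlencode


def run_workflow(
    base: str,
    workflow: dict,
    http_json: Callable[[str, str, Optional[dict]], dict],
    http_bytes: Callable[[str], bytes],
    poll_seconds: float = 1.0,
    timeout_seconds: float = 300.0,
) -> list[bytes]:
    """Submit a workflow, wait for it to finish, and return the output images."""
    submitted = http_json("POST", f"{base}/prompt", {"prompt": workflow})
    prompt_id = submitted["prompt_id"]

    # The history entry appears once the prompt has finished executing.
    deadline = time.monotonic() + timeout_seconds
    while True:
        history = http_json("GET", f"{base}/history/{prompt_id}", None)
        if prompt_id in history:
            break
        if time.monotonic() > deadline:
            raise TimeoutError(f"prompt {prompt_id} did not finish in time")
        time.sleep(poll_seconds)

    images = []
    for node_output in history[prompt_id].get("outputs", {}).values():
        for image in node_output.get("images", []):
            query = urlencode({
                "filename": image["filename"],
                "subfolder": image.get("subfolder", ""),
                "type": image.get("type", "output"),
            })
            images.append(http_bytes(f"{base}/view?{query}"))
    return images


# Offline demo with fake transports: the first history poll reports "not done".
history_calls = {"n": 0}
fetched_urls = []


def fake_json(method: str, url: str, body: Optional[dict]) -> dict:
    if method == "POST":
        return {"prompt_id": "abc"}
    history_calls["n"] += 1
    if history_calls["n"] < 2:
        return {}
    image = {"filename": "out.png", "subfolder": "", "type": "output"}
    return {"abc": {"outputs": {"9": {"images": [image]}}}}


def fake_bytes(url: str) -> bytes:
    fetched_urls.append(url)
    return b"PNG"


result = run_workflow(
    "http://comfy:8188", {"1": {}}, fake_json, fake_bytes, poll_seconds=0
)
```

Injecting the transport keeps the polling logic testable. A real client would pass functions backed by an HTTP library instead of the fakes.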

See docs/architecture.md for the full breakdown and docs/agent.md for the agent loop in detail.

Test/eval UI

The Gradio app at app/ is a reference consumer + quick-test surface, not the product. It exists for two specific reasons:

  1. Dogfooding the bridge. Every tab posts to the gateway exactly the way any other OpenAI client would. If a workflow regresses or a provider change breaks the request shape, Gradio surfaces it before downstream consumers do.
  2. Fast iteration on workflows. Author a workflow in ComfyUI, drop the JSON into the gateway's mount, add a model alias to LiteLLM, hit the Gradio tab to validate end-to-end — usually 30 seconds.

The app has five tabs: Generate, Edit, Multi-ref, Sticker, and Chat (multi-turn auto-routing). It runs anywhere with Python 3.11+ and is intended for both local debugging and Hugging Face Space deployment. See app/README.md and docs/gradio-app.md.

pip install -e ".[gradio]"
GATEWAY_URL=http://your-gateway:4000/v1 GATEWAY_API_KEY=sk-... python -m app

If you only want the bridge in your gateway — e.g. you'll consume it from Open WebUI, your own CLI, or a non-Python client — skip the [gradio] extra and the app/ directory entirely. The provider package is self-contained.

Documentation

| File | What it covers |
|---|---|
| PROPOSAL.md | The strategic system design + why-this-shape |
| PHASES.md | The 7-phase roadmap with status, models needed, acceptance criteria |
| JOURNEY.md | How we got here: the full backfill (research → broken integrations → gateway → agent) |
| HOWTO.md | User-facing guide: prompting recipes, multi-ref tricks, intent keywords |
| app/README.md | Gradio test/eval UI (local + HF Space) |
| docs/installation.md | Full setup from a clean machine |
| docs/operating.md | Day-2 ops: GPU planning, model swaps, troubleshooting |
| docs/architecture.md | Component breakdown + extension points |
| docs/agent.md | The tool-use chat agent: loop, tools, env, fallback, multi-step examples |
| docs/workflows-cookbook.md | How to add a new ComfyUI workflow |
| docs/intent-router.md | How the keyword fallback path routes requests |
| docs/gradio-app.md | Test/eval UI architecture + HF Space deploy |
| docs/api.md | Client-facing API reference |
| docs/observability.md | Langfuse tracing: what's captured, env, recommended views |
| docs/validating-workflows.md | Static workflow validator + e2e smoke (pre-merge gate) |
| docs/benchmarks.md | Quality + latency methodology |
| DECISIONS.md | Architectural decision records |
| CHANGELOG.md | Per-version log |

Building on prior art

protoBanana is a synthesis, not an invention. Credit:

| Component | Source |
|---|---|
| Image gen + edit | Qwen-Image, Qwen-Image-Edit-2511 (Alibaba) |
| Background removal | BiRefNet, RMBG-2.0 (BRIA) |
| Region segmentation (Phase 4) | Florence-2 (Microsoft), SAM 2.1 (Meta) |
| Universal inpaint (Phase 5) | LanPaint |
| Bundled ComfyUI nodes | ComfyUI-RMBG (1038lab): RMBG/BiRefNet/SAM/Grounding |
| LLM gateway | LiteLLM (BerriAI) |
| Image runtime | ComfyUI |
| Original paradigm | Nano-Banana 2 (Google), GPT-Image-2 (OpenAI) |

License

Apache-2.0. Workflows and node-pack dependencies retain their original licenses (see workflows/<name>.json's _doc field for per-workflow notes — RMBG-2.0 is CC BY-NC 4.0 / non-commercial).

Citation

See CITATION.cff.
