Moleculyst: Agentic Image Editing

An agentic image editing pipeline inspired by the Agent Banana paper (arXiv:2602.09084). It implements Image Layer Decomposition (ILD) for high-fidelity, localized edits that blend seamlessly into the original image.

✨ Features

Ground-First Local Inpainting — Locates the target object before editing, crops a local patch, and sends it to Gemini for context-aware editing
Laplacian Pyramid Blending — Multi-band blending (Burt & Adelson, 1983) seamlessly fuses edited patches back into the original
LLM Grounding Advisor — Gemini 2.5 Flash reasons about spatial context and disambiguates targets (e.g., "glasses" → drinking glasses vs eyewear)
Interactive BBox Editor — Draw custom bounding boxes on the original image to fine-tune the edit region
Custom Reconstruction Instructions — Type specific instructions (e.g., "fill with table texture") for each recompose
Iterative Editing Loop — Each output becomes the input for the next round — refine endlessly
Agentic Timeline UI — Full transparency: reasoning, grounding phrases, spatial guidance, quality scores

🏗️ Architecture

User Instruction
    │
    ▼
┌──────────────────┐
│   LLM Planner    │ ── Parse instruction → plan steps
└────────┬─────────┘
         │
    ┌────▼────────────────────────────────────┐
    │           For each step:                │
    │                                         │
    │  1. GROUND (Florence-2 + LLM Advisor)   │
    │     └─ Find target on original image    │
    │                                         │
    │  2. CROP LOCAL PATCH                    │
    │     └─ bbox + 50% padding from original │
    │                                         │
    │  3. EDIT LOCALLY (Gemini)               │
    │     └─ Model sees surrounding context   │
    │     └─ Acts as inpainter                │
    │                                         │
    │  4. BLEND BACK (Laplacian Pyramid)      │
    │     └─ Multi-band frequency blending    │
    │     └─ Low-freq: wide color smoothing   │
    │     └─ High-freq: crisp edges           │
    └─────────────────────────────────────────┘
         │
         ▼
    Final Image (original pixels preserved outside edit region)
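
Step 2 of the diagram (bbox + 50% padding) can be sketched as below. This is a minimal illustration, not the code in pipeline.py: the `Box` class and `pad_box` function are hypothetical names, and the sketch assumes "50% padding" means growing each side by half the box's width or height, clamped to the image bounds.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned box in pixel coordinates (hypothetical helper type)."""
    left: int
    top: int
    right: int
    bottom: int


def pad_box(box: Box, image_width: int, image_height: int, padding: float = 0.5) -> Box:
    """Expand `box` by `padding` times its size on every side, clamped to the image.

    Assumption: padding is relative to the box's own width and height.
    """
    pad_x = int((box.right - box.left) * padding)
    pad_y = int((box.bottom - box.top) * padding)
    return Box(
        left=max(0, box.left - pad_x),
        top=max(0, box.top - pad_y),
        right=min(image_width, box.right + pad_x),
        bottom=min(image_height, box.bottom + pad_y),
    )


if __name__ == "__main__":
    # A 100x80 target inside a 640x480 image gains 50 px left/right and 40 px top/bottom.
    print(pad_box(Box(100, 100, 200, 180), 640, 480))
```

The padded box is the region that would be cropped and sent to the editing model, so the model sees the target together with its surroundings.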

🚀 Quick Start

Prerequisites

Python 3.10+
Gemini API key

Setup

# Enter the project directory (after cloning the repository)
cd agent-crop

# Create virtual environment
python -m venv .venv
source .venv/bin/activate

# Install dependencies
pip install -e .

# Configure API key
echo "GEMINI_API_KEY=your-key-here" > .env

Run

python -m agent_banana.server --host 127.0.0.1 --port 8011

Open http://127.0.0.1:8011 in your browser.

🎯 Usage

Upload an image and type an instruction (e.g., "remove the glasses from the table")
Review the agentic timeline: LLM reasoning → grounding → local edit → composition
Adjust the bounding box by drawing on the original image
Type custom instructions in the text field (e.g., "fill the area with wooden texture")
Click Re-compose — a new editor appears on the result for further refinement
Iterate until satisfied

📁 Project Structure

src/agent_banana/
├── server.py                 # Web UI + API endpoints
├── pipeline.py               # ILD pipeline: ground → crop → edit → blend
├── vision.py                 # Laplacian pyramid blending + image utilities
├── nano_banana.py            # Gemini API client
├── llm_grounding_advisor.py  # LLM spatial reasoning advisor
├── vlm_localizer.py          # Florence-2 grounding
├── targeting.py              # Target classification + bbox refinement
├── planning.py               # RL-based edit planner
├── quality.py                # Quality evaluation judge
├── models.py                 # Data models (BoundingBox, StepResult, etc.)
├── memory.py                 # Context folding + session storage
└── config.py                 # Environment configuration

🔧 Configuration

| Environment Variable | Description | Default |
|---|---|---|
| `GEMINI_API_KEY` | Google Gemini API key | required |
| `AGENT_BANANA_IMAGE_MODEL` | Image editing model | `gemini-2.5-flash-preview-04-17` |
| `AGENT_BANANA_ADVISOR_MODEL` | Grounding advisor model | `gemini-2.5-flash-preview-04-17` |
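
As a rough illustration of how these variables and defaults fit together, the sketch below reads them from the environment. The `load_settings` function and the returned dictionary keys are hypothetical; the actual logic in config.py may differ.

```python
import os

# Default taken from the configuration table above.
DEFAULT_MODEL = "gemini-2.5-flash-preview-04-17"


def load_settings(env=None):
    """Read the three documented variables (hypothetical helper, not config.py).

    `env` defaults to os.environ; passing a plain dict makes the function easy to test.
    """
    env = os.environ if env is None else env
    api_key = env.get("GEMINI_API_KEY")
    if not api_key:
        # The key has no default, so fail early with a clear message.
        raise RuntimeError("GEMINI_API_KEY is required")
    return {
        "api_key": api_key,
        "image_model": env.get("AGENT_BANANA_IMAGE_MODEL", DEFAULT_MODEL),
        "advisor_model": env.get("AGENT_BANANA_ADVISOR_MODEL", DEFAULT_MODEL),
    }


if __name__ == "__main__":
    print(load_settings({"GEMINI_API_KEY": "your-key-here"}))
```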

📄 Key Concepts from the Paper

Image Layer Decomposition (ILD)

Instead of editing the full image (which causes color drift and detail loss), ILD:

Crops the target region with context padding
Edits only the local patch (model naturally matches surrounding pixels)
Blends back using Gaussian/Laplacian pyramids
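
The blend-back step can be illustrated with a deliberately simplified, dependency-free sketch. It works on a 1D signal rather than an image and uses a box filter (pair averaging) in place of the Gaussian kernel from Burt & Adelson, so it shows the structure of multi-band blending rather than the implementation in vision.py. All function names here are hypothetical.

```python
def downsample(signal):
    """Halve the length by averaging adjacent pairs (box filter, standing in for a Gaussian)."""
    return [(signal[i] + signal[i + 1]) / 2.0 for i in range(0, len(signal), 2)]


def upsample(signal):
    """Double the length by repeating each sample."""
    return [value for value in signal for _ in range(2)]


def smooth_pyramid(signal, levels):
    """Successively downsampled copies of the signal, finest first."""
    pyramid = [list(signal)]
    for _ in range(levels):
        pyramid.append(downsample(pyramid[-1]))
    return pyramid


def laplacian_pyramid(signal, levels):
    """Band-pass detail at each level, plus the coarsest low-pass residual."""
    smooth = smooth_pyramid(signal, levels)
    bands = [
        [f - u for f, u in zip(fine, upsample(coarse))]
        for fine, coarse in zip(smooth, smooth[1:])
    ]
    bands.append(smooth[-1])
    return bands


def blend_1d(original, edited, mask, levels=2):
    """Blend `edited` into `original` where mask is 1, mixing each frequency band separately."""
    if not (len(original) == len(edited) == len(mask)):
        raise ValueError("inputs must have the same length")
    if len(original) % (2 ** levels) != 0:
        raise ValueError("length must be divisible by 2**levels")
    bands_o = laplacian_pyramid(original, levels)
    bands_e = laplacian_pyramid(edited, levels)
    masks = smooth_pyramid(mask, levels)  # the mask gets softer at coarser levels
    blended = [
        [m * e + (1.0 - m) * o for o, e, m in zip(bo, be, bm)]
        for bo, be, bm in zip(bands_o, bands_e, masks)
    ]
    # Collapse the pyramid: start from the coarsest level and add detail back.
    result = blended[-1]
    for band in reversed(blended[:-1]):
        result = [u + b for u, b in zip(upsample(result), band)]
    return result


if __name__ == "__main__":
    print(blend_1d([0.0] * 8, [8.0] * 8, [0, 0, 0, 1, 1, 1, 1, 1]))
```

Because the mask is downsampled along with the signals, coarse bands are mixed over a wide transition while fine bands switch sharply at the mask edge, which matches the "low-freq: wide color smoothing, high-freq: crisp edges" behavior described in the architecture diagram.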

Context Folding

Compresses interaction history into structured memory across three levels:

Asset Level: Lightweight image state nodes
Execution Level: Transient tool context for error recovery
Planning Level: Persistent memory of verified edit paths
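
One way to picture the three levels is as a small data structure in which only the planning level survives between steps. The sketch below is an assumption-laden illustration: the `FoldedMemory` class, its fields, and its methods are hypothetical and are not taken from memory.py or from the paper.

```python
from dataclasses import dataclass, field


@dataclass
class FoldedMemory:
    """Hypothetical three-level memory; names are illustrative only."""
    assets: list = field(default_factory=list)     # asset level: lightweight image state nodes
    execution: list = field(default_factory=list)  # execution level: transient tool context
    planning: list = field(default_factory=list)   # planning level: verified edit paths

    def record_image(self, image_id, parent=None):
        """Store a reference to an image state, not the pixels themselves."""
        self.assets.append({"id": image_id, "parent": parent})

    def log_tool_event(self, message):
        """Keep tool output around only while the current step is running."""
        self.execution.append(message)

    def fold_step(self, instruction, verified):
        """Close a step: keep verified edits, then drop the transient context."""
        if verified:
            self.planning.append(instruction)
        self.execution.clear()


if __name__ == "__main__":
    memory = FoldedMemory()
    memory.record_image("img-0")
    memory.log_tool_event("grounding returned 2 candidate boxes")
    memory.fold_step("remove the glasses from the table", verified=True)
    print(memory)
```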

📜 License

MIT

Acknowledgments

Agent Banana Paper — Ye et al., 2026
Florence-2 — Microsoft
Gemini API — Google DeepMind

