Inspiration
Every time you prompt an AI agent, you're guessing. You tweak the system prompt, test it, tweak again — manually iterating until it works. We asked: what if the agent could improve itself?
What it does
AgentForge is a self-improving agent builder. You describe what you want an agent to do — "build me a cybersecurity threat briefing agent" — and AgentForge automatically:
- Scouts real-time data sources via Tavily
- Builds an agent with an optimized system prompt
- Runs the agent against live data
- Evaluates the output with a harsh scoring system (1-10)
- Extracts lessons, categorized into four areas: Prompt Design, APIs & Data, Architecture, and Output Quality
- Evolves the agent using those lessons and iterates until it hits the target score
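The loop above can be sketched in a few lines. This is a hypothetical skeleton, not AgentForge's actual code: the real stages each call GPT-4o, but making them pluggable functions shows the build → run → evaluate → learn → evolve control flow on its own.

```typescript
// Illustrative sketch of the forge loop; stage names and signatures are assumptions.
type Lesson = { category: string; text: string };

interface ForgeStages {
  build: (task: string, lessons: Lesson[], lastPrompt?: string) => string; // returns a system prompt
  run: (systemPrompt: string, liveData: string) => string;                 // returns the agent's output
  evaluate: (output: string) => { score: number; suggestions: string[] };  // harsh 1-10 score
  extractLessons: (suggestions: string[]) => Lesson[];
}

function forge(
  task: string,
  liveData: string,
  stages: ForgeStages,
  priorLessons: Lesson[] = [],
  targetScore = 8,
  maxIterations = 5,
) {
  const lessons = [...priorLessons];
  let systemPrompt: string | undefined;
  let best = { score: 0, output: "", prompt: "" };

  for (let i = 0; i < maxIterations; i++) {
    systemPrompt = stages.build(task, lessons, systemPrompt); // evolve from the last prompt
    const output = stages.run(systemPrompt, liveData);
    const { score, suggestions } = stages.evaluate(output);
    if (score > best.score) best = { score, output, prompt: systemPrompt };
    if (score >= targetScore) break;                     // hit the target, stop iterating
    lessons.push(...stages.extractLessons(suggestions)); // feed lessons into the next build
  }
  return { ...best, lessons };
}
```

Because `priorLessons` is an input, the same loop naturally supports seeding a new task with lessons from earlier builds.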
The key breakthrough: lessons transfer across tasks. After building an AI news briefing agent (6→8/10 in 2 iterations), we forged a cybersecurity threat agent — and it scored 9/10 on the first try because it applied lessons learned from the previous build.
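Cross-task transfer boils down to pooling lessons from every finished run and prepending them to the next Builder prompt. A minimal sketch of that idea, with hypothetical names (`LessonStore`, `toBuilderPreamble` are illustrative, not AgentForge's actual API):

```typescript
// Assumed lesson categories, from the four areas described above.
type Category = "Prompt Design" | "APIs & Data" | "Architecture" | "Output Quality";

class LessonStore {
  private byCategory = new Map<Category, string[]>();

  record(category: Category, lesson: string) {
    const list = this.byCategory.get(category) ?? [];
    if (!list.includes(lesson)) list.push(lesson); // dedupe lessons learned more than once
    this.byCategory.set(category, list);
  }

  // Format pooled lessons as a preamble for the Builder's prompt, so an
  // agent for a brand-new task starts from everything learned so far.
  toBuilderPreamble(): string {
    const sections: string[] = [];
    for (const [category, lessons] of this.byCategory) {
      sections.push(`${category}:\n` + lessons.map((l) => `- ${l}`).join("\n"));
    }
    return sections.length
      ? `Apply these lessons from previous agent builds:\n${sections.join("\n")}`
      : "";
  }
}
```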
How we built it
- OpenAI GPT-4o powers all five core components (Scout, Builder, Runner, Evaluator, Lesson Engine)
- Tavily provides real-time web search so agents work with today's data, not stale training knowledge
- Neo4j Aura stores execution traces as a graph: ForgeRun nodes, IterationStep nodes connected by EVOLVED_INTO relationships, and Lesson nodes linked via LEARNED edges
- React + Vite frontend with D3 knowledge graph visualization, markdown rendering, and a live test harness
The entire system is a single ~1100-line React component with no backend server — all API calls happen client-side.
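The trace schema can be made concrete with a small helper that turns one forge run into a single Cypher query. `traceToCypher` is a hypothetical sketch, and it assumes the LEARNED edges hang off the ForgeRun node; real code should send values as query parameters rather than interpolating strings (interpolation breaks on quotes and invites injection).

```typescript
// Illustrative shapes for one run's trace; field names are assumptions.
interface IterationTrace { step: number; score: number; systemPrompt: string }
interface RunTrace { runId: string; task: string; iterations: IterationTrace[]; lessons: string[] }

// Emit one Cypher query (variables are shared within a single query) that
// MERGEs the run, its iteration chain, and its lessons into the graph.
function traceToCypher(run: RunTrace): string {
  const parts = [
    `MERGE (r:ForgeRun {id: '${run.runId}'})`,
    `SET r.task = '${run.task}'`,
  ];
  run.iterations.forEach((it, i) => {
    parts.push(`MERGE (s${i}:IterationStep {runId: '${run.runId}', step: ${it.step}})`);
    parts.push(`SET s${i}.score = ${it.score}, s${i}.systemPrompt = '${it.systemPrompt}'`);
    // Chain steps so the D3 view can draw the evolution path.
    if (i > 0) parts.push(`MERGE (s${i - 1})-[:EVOLVED_INTO]->(s${i})`);
  });
  run.lessons.forEach((text, i) => {
    parts.push(`MERGE (l${i}:Lesson {text: '${text}'})`);
    parts.push(`MERGE (r)-[:LEARNED]->(l${i})`);
  });
  return parts.join("\n");
}
```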
Challenges we ran into
CORS killed our original architecture. We started with Anthropic Claude for evaluation and OpenAI for execution, but browser CORS policies blocked all Anthropic API calls from localhost. We had to migrate the entire system to OpenAI-only mid-hackathon.
Agents kept describing plans instead of executing them. The Builder would design agents that said "call the CoinGecko API" — but agents can't call APIs. They receive pre-fetched Tavily data. We had to rewrite the Builder prompt to explicitly teach it the runtime architecture: "the agent receives data, it doesn't fetch data."
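The fix is architectural: the Runner fetches live data first (Tavily, in our case), then injects it into the agent's messages, so the agent never needs, or gets, a way to call APIs. A minimal sketch, where `buildRunnerMessages` is a hypothetical helper:

```typescript
// Shape of one pre-fetched search result; fields are illustrative.
interface SearchResult { title: string; url: string; content: string }

function buildRunnerMessages(systemPrompt: string, results: SearchResult[]) {
  // Flatten the pre-fetched results into a numbered context block the agent can cite.
  const context = results
    .map((r, i) => `[${i + 1}] ${r.title} (${r.url})\n${r.content}`)
    .join("\n\n");
  return [
    { role: "system", content: systemPrompt },
    {
      role: "user",
      content:
        `Here is today's pre-fetched data. You cannot fetch anything else; ` +
        `work only from these sources:\n\n${context}`,
    },
  ];
}
```

The user message states the constraint explicitly, mirroring the "the agent receives data, it doesn't fetch data" rule we had to teach the Builder.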
Self-improvement was going backwards. Our evaluator gave vague feedback ("lacks real-time data") that the Builder couldn't act on. Scores went 6→5→4. We fixed this by making the Builder see its previous system prompts and forcing the evaluator to give specific, actionable suggestions.
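One way to force specific feedback is to require the evaluator to return structured JSON and reject responses whose suggestions are too vague to act on. The schema and the word-count threshold below are illustrative, not AgentForge's exact implementation:

```typescript
interface Evaluation { score: number; suggestions: string[] }

function parseEvaluation(raw: string): Evaluation {
  const parsed = JSON.parse(raw);
  if (typeof parsed.score !== "number" || parsed.score < 1 || parsed.score > 10)
    throw new Error("score must be a number from 1 to 10");
  if (!Array.isArray(parsed.suggestions) || parsed.suggestions.length === 0)
    throw new Error("evaluator must return at least one suggestion");
  for (const s of parsed.suggestions) {
    // Crude vagueness check: terse feedback like "lacks real-time data"
    // gave the Builder nothing to act on, so demand concrete detail.
    if (typeof s !== "string" || s.trim().split(/\s+/).length < 8)
      throw new Error(`suggestion too vague to act on: "${s}"`);
  }
  return { score: parsed.score, suggestions: parsed.suggestions };
}
```

A failed parse can simply trigger a retry of the evaluator call, so vague feedback never reaches the Builder.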
What we learned
- The gap between "describing what an agent would do" and "actually doing it" is the core challenge of agent design
- Lesson transfer across tasks is more powerful than within-task iteration — each agent you build makes the next one smarter
- Harsh, specific evaluation feedback matters more than clever prompting
What's next
- Persistent lesson storage across sessions via Neo4j
- Multi-agent collaboration (agents that build agents that build agents)
- Community lesson sharing — imagine a marketplace of learned agent-building strategies