Inspiration
Every time you prompt an AI agent, you're guessing. You tweak the system prompt, test it, tweak again — manually iterating until it works. We asked: what if the agent could improve itself?
What it does
AgentForge is a self-improving agent builder. You describe what you want an agent to do — "build me a cybersecurity threat briefing agent" — and AgentForge automatically:
- Scouts real-time data sources via Tavily
- Builds an agent with an optimized system prompt
- Runs the agent against live data
- Evaluates the output with a harsh scoring system (1-10)
- Extracts lessons, categorized into four areas: Prompt Design, APIs & Data, Architecture, and Output Quality
- Evolves the agent using those lessons and iterates until it hits the target score
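The loop above can be sketched in a few lines. This is a hypothetical skeleton, not AgentForge's actual code: the real stages each call GPT-4o, but making them pluggable functions shows the build → run → evaluate → learn → evolve control flow on its own.

```typescript
// Illustrative sketch of the forge loop; stage names and signatures are assumptions.
type Lesson = { category: string; text: string };

interface ForgeStages {
  build: (task: string, lessons: Lesson[], lastPrompt?: string) => string; // returns a system prompt
  run: (systemPrompt: string, liveData: string) => string;                 // returns the agent's output
  evaluate: (output: string) => { score: number; suggestions: string[] };  // harsh 1-10 score
  extractLessons: (suggestions: string[]) => Lesson[];
}

function forge(
  task: string,
  liveData: string,
  stages: ForgeStages,
  priorLessons: Lesson[] = [],
  targetScore = 8,
  maxIterations = 5,
) {
  const lessons = [...priorLessons];
  let systemPrompt: string | undefined;
  let best = { score: 0, output: "", prompt: "" };

  for (let i = 0; i < maxIterations; i++) {
    systemPrompt = stages.build(task, lessons, systemPrompt); // evolve from the last prompt
    const output = stages.run(systemPrompt, liveData);
    const { score, suggestions } = stages.evaluate(output);
    if (score > best.score) best = { score, output, prompt: systemPrompt };
    if (score >= targetScore) break;                     // hit the target, stop iterating
    lessons.push(...stages.extractLessons(suggestions)); // feed lessons into the next build
  }
  return { ...best, lessons };
}
```

Because `priorLessons` is an input, the same loop naturally supports seeding a new task with lessons from earlier builds.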
The key breakthrough: lessons transfer across tasks. After building an AI news briefing agent (6→8/10 in 2 iterations), we forged a cybersecurity threat agent — and it scored 9/10 on the first try because it applied lessons learned from the previous build.
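Cross-task transfer boils down to pooling lessons from every finished run and prepending them to the next Builder prompt. A minimal sketch of that idea, with hypothetical names (`LessonStore`, `toBuilderPreamble` are illustrative, not AgentForge's actual API):

```typescript
// Assumed lesson categories, from the four areas described above.
type Category = "Prompt Design" | "APIs & Data" | "Architecture" | "Output Quality";

class LessonStore {
  private byCategory = new Map<Category, string[]>();

  record(category: Category, lesson: string) {
    const list = this.byCategory.get(category) ?? [];
    if (!list.includes(lesson)) list.push(lesson); // dedupe lessons learned more than once
    this.byCategory.set(category, list);
  }

  // Format pooled lessons as a preamble for the Builder's prompt, so an
  // agent for a brand-new task starts from everything learned so far.
  toBuilderPreamble(): string {
    const sections: string[] = [];
    for (const [category, lessons] of this.byCategory) {
      sections.push(`${category}:\n` + lessons.map((l) => `- ${l}`).join("\n"));
    }
    return sections.length
      ? `Apply these lessons from previous agent builds:\n${sections.join("\n")}`
      : "";
  }
}
```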
How we built it
- OpenAI GPT-4o powers all five core components (Scout, Builder, Runner, Evaluator, Lesson Engine)
- Tavily provides real-time web search so agents work with today's data, not stale training knowledge
- Neo4j Aura stores execution traces as a graph: ForgeRun nodes, IterationStep nodes connected by EVOLVED_INTO relationships, and Lesson nodes linked via LEARNED edges
- React + Vite frontend with D3 knowledge graph visualization, markdown rendering, and a live test harness
The entire system is a single ~1100-line React component with no backend server — all API calls happen client-side.
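The trace schema can be made concrete with a small helper that turns one forge run into a single Cypher query. `traceToCypher` is a hypothetical sketch, and it assumes the LEARNED edges hang off the ForgeRun node; real code should send values as query parameters rather than interpolating strings (interpolation breaks on quotes and invites injection).

```typescript
// Illustrative shapes for one run's trace; field names are assumptions.
interface IterationTrace { step: number; score: number; systemPrompt: string }
interface RunTrace { runId: string; task: string; iterations: IterationTrace[]; lessons: string[] }

// Emit one Cypher query (variables are shared within a single query) that
// MERGEs the run, its iteration chain, and its lessons into the graph.
function traceToCypher(run: RunTrace): string {
  const parts = [
    `MERGE (r:ForgeRun {id: '${run.runId}'})`,
    `SET r.task = '${run.task}'`,
  ];
  run.iterations.forEach((it, i) => {
    parts.push(`MERGE (s${i}:IterationStep {runId: '${run.runId}', step: ${it.step}})`);
    parts.push(`SET s${i}.score = ${it.score}, s${i}.systemPrompt = '${it.systemPrompt}'`);
    // Chain steps so the D3 view can draw the evolution path.
    if (i > 0) parts.push(`MERGE (s${i - 1})-[:EVOLVED_INTO]->(s${i})`);
  });
  run.lessons.forEach((text, i) => {
    parts.push(`MERGE (l${i}:Lesson {text: '${text}'})`);
    parts.push(`MERGE (r)-[:LEARNED]->(l${i})`);
  });
  return parts.join("\n");
}
```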
Challenges we ran into
CORS killed our original architecture. We started with Anthropic Claude for evaluation and OpenAI for execution, but browser CORS policies blocked all Anthropic API calls from localhost. We had to migrate the entire system to OpenAI-only mid-hackathon.
Agents kept describing plans instead of executing them. The Builder would design agents that said "call the CoinGecko API" — but agents can't call APIs. They receive pre-fetched Tavily data. We had to rewrite the Builder prompt to explicitly teach it the runtime architecture: "the agent receives data, it doesn't fetch data."
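The fix is architectural: the Runner fetches live data first (Tavily, in our case), then injects it into the agent's messages, so the agent never needs, or gets, a way to call APIs. A minimal sketch, where `buildRunnerMessages` is a hypothetical helper:

```typescript
// Shape of one pre-fetched search result; fields are illustrative.
interface SearchResult { title: string; url: string; content: string }

function buildRunnerMessages(systemPrompt: string, results: SearchResult[]) {
  // Flatten the pre-fetched results into a numbered context block the agent can cite.
  const context = results
    .map((r, i) => `[${i + 1}] ${r.title} (${r.url})\n${r.content}`)
    .join("\n\n");
  return [
    { role: "system", content: systemPrompt },
    {
      role: "user",
      content:
        `Here is today's pre-fetched data. You cannot fetch anything else; ` +
        `work only from these sources:\n\n${context}`,
    },
  ];
}
```

The user message states the constraint explicitly, mirroring the "the agent receives data, it doesn't fetch data" rule we had to teach the Builder.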
Self-improvement was going backwards. Our evaluator gave vague feedback ("lacks real-time data") that the Builder couldn't act on. Scores went 6→5→4. We fixed this by making the Builder see its previous system prompts and forcing the evaluator to give specific, actionable suggestions.
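One way to force specific feedback is to require the evaluator to return structured JSON and reject responses whose suggestions are too vague to act on. The schema and the word-count threshold below are illustrative, not AgentForge's exact implementation:

```typescript
interface Evaluation { score: number; suggestions: string[] }

function parseEvaluation(raw: string): Evaluation {
  const parsed = JSON.parse(raw);
  if (typeof parsed.score !== "number" || parsed.score < 1 || parsed.score > 10)
    throw new Error("score must be a number from 1 to 10");
  if (!Array.isArray(parsed.suggestions) || parsed.suggestions.length === 0)
    throw new Error("evaluator must return at least one suggestion");
  for (const s of parsed.suggestions) {
    // Crude vagueness check: terse feedback like "lacks real-time data"
    // gave the Builder nothing to act on, so demand concrete detail.
    if (typeof s !== "string" || s.trim().split(/\s+/).length < 8)
      throw new Error(`suggestion too vague to act on: "${s}"`);
  }
  return { score: parsed.score, suggestions: parsed.suggestions };
}
```

A failed parse can simply trigger a retry of the evaluator call, so vague feedback never reaches the Builder.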
What we learned
- The gap between "describing what an agent would do" and "actually doing it" is the core challenge of agent design
- Lesson transfer across tasks is more powerful than within-task iteration — each agent you build makes the next one smarter
- Harsh, specific evaluation feedback matters more than clever prompting
What's next
- Persistent lesson storage across sessions via Neo4j
- Multi-agent collaboration (agents that build agents that build agents)
- Community lesson sharing — imagine a marketplace of learned agent-building strategies