## Inspiration
Building tools for AI agents by hand is backwards. Writing MCP servers and API integrations takes weeks of work. Why not let AI write its own tools on demand? Wouldn't that be faster, and maybe even more secure? But here's the problem: testing tool quality is nearly impossible.
How do you know if generated code is secure? Fast? Correct?
Then it hit us: Everyone believes in natural selection. Species don't get "tested"—they compete, and the fittest survive. What if we applied that to AI tool generation? Let models compete. Scan for vulnerabilities. Weak code dies. Strong code survives. The system learns naturally.
Natural selection works because it applies constant survival pressure. We brought that pressure to AI.
## What it does
Darwin runs parallel competitions between AI models to generate tools on demand.
The Arena:
- Submit a task ("create email validator")
- Three models compete simultaneously (Claude, GPT-4, Llama via AWS Bedrock)
- Each generates working tool code
- Semgrep scans every output for security vulnerabilities
- Fitness scoring: security (40%), success (30%), speed (20%), quality (10%)
- Winner survives. Losers go extinct.
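The weighted fitness score above can be sketched in a few lines. This is a minimal illustration, not the actual implementation: the function names and the 0-to-1 component scores are assumptions; only the weights (40/30/20/10) come from our scoring.

```python
def fitness(security: float, success: float, speed: float, quality: float) -> float:
    """Combine per-tool scores (each in [0, 1]) using the Darwin weights:
    security 40%, success 30%, speed 20%, quality 10%."""
    return 0.40 * security + 0.30 * success + 0.20 * speed + 0.10 * quality

def pick_winner(candidates: dict[str, dict[str, float]]) -> str:
    """Return the model whose generated tool scored the highest fitness."""
    return max(candidates, key=lambda model: fitness(**candidates[model]))

# Hypothetical round: model-a wrote secure but slow code, model-b the reverse.
scores = {
    "model-a": {"security": 1.0, "success": 1.0, "speed": 0.6, "quality": 0.8},
    "model-b": {"security": 0.5, "success": 1.0, "speed": 0.9, "quality": 0.9},
}
winner = pick_winner(scores)  # security weighs heaviest, so model-a survives
```

Weighting security at 40% means an insecure tool almost never wins, no matter how fast it runs.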
Evolution Tracking:
- Tracks every generation in `data/generations.json`
- Calculates which models win which task types
- Shows fitness curves climbing over time
- Leaderboard reveals natural specialization
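A minimal sketch of how that generation log might be appended and tallied. The field names (`task_type`, `winner`, `fitness`) are assumptions for illustration; only the `data/generations.json` path comes from the write-up.

```python
import json
from collections import Counter
from pathlib import Path

LOG = Path("data/generations.json")

def record_generation(task_type: str, winner: str, fitness: float,
                      log: Path = LOG) -> None:
    """Append one competition result to the JSON generation log."""
    log.parent.mkdir(parents=True, exist_ok=True)
    history = json.loads(log.read_text()) if log.exists() else []
    history.append({"task_type": task_type, "winner": winner, "fitness": fitness})
    log.write_text(json.dumps(history, indent=2))

def leaderboard(log: Path = LOG) -> Counter:
    """Count wins per model to reveal natural specialization over time."""
    history = json.loads(log.read_text()) if log.exists() else []
    return Counter(entry["winner"] for entry in history)
```

A flat JSON file is enough here: every round appends one small record, and the leaderboard is a single pass over the history.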
The Result: You don't pick the best model. The best model emerges through competition.
Tech Stack:
- AWS Bedrock for multi-model access
- Semgrep for vulnerability scanning
- FastAPI backend + Next.js frontend
- JSON file storage (fast iteration, no DB overhead)
## Challenges
- Parallel model invocation - Bedrock async patterns for simultaneous generation
- Security scanning speed - Semgrep is thorough but slow; built heuristic fallback
- Fitness scoring balance - Weighting security vs speed without killing innovation
- Real-time coordination - Four people, zero merge conflicts, perfect handoffs
- Time pressure - 6 hours from concept to working demo
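The parallel-invocation challenge above boils down to an `asyncio.gather` fan-out. Here is a sketch with a stubbed coroutine standing in for the real Bedrock `InvokeModel` call (which we'd make through an async client such as `aioboto3`); the stub only simulates latency.

```python
import asyncio

async def generate_tool(model_id: str, task: str) -> dict:
    """Stand-in for a Bedrock InvokeModel call; swap the sleep for the real request."""
    await asyncio.sleep(0.01)  # simulate network latency
    return {"model": model_id, "code": f"# tool for {task!r} from {model_id}"}

async def run_arena(task: str, models: list[str]) -> list[dict]:
    """Fan the same task out to every model at once and collect the outputs."""
    results = await asyncio.gather(
        *(generate_tool(m, task) for m in models),
        return_exceptions=True,  # one model failing shouldn't kill the round
    )
    return [r for r in results if not isinstance(r, Exception)]

outputs = asyncio.run(run_arena("create email validator",
                                ["claude", "gpt-4", "llama"]))
```

`return_exceptions=True` is the key detail: a timeout from one model still leaves the other competitors in the round.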
## Accomplishments
✅ Vertical integration model - Nobody blocked, everyone shipped
✅ Natural selection actually works - Models specialize without manual tuning
✅ Semgrep integration - Automated vulnerability detection in generated code
✅ AWS Bedrock mastery - Multi-model orchestration in production
✅ Complete system in 6 hours - Backend, frontend, evolution engine, security
✅ Real generational learning - System tracks which models win which tasks
The biggest win: We proved you can test AI-generated tools through competition instead of manual review.
## What we learned
AWS Bedrock is a cheat code - Access to Claude, Llama, GPT-4 without managing infrastructure. Parallel invocation is seamless. This would've taken much longer with raw APIs.
Semgrep catches what humans miss - Regex DoS, eval() abuse, shell injection. The security scanning is incredibly good. Every generated tool should be scanned.
Natural selection is instantly understandable - No one questions the metaphor. Everyone gets it. Biological framing made the technical system memorable.
Tight teams ship fast - Vertical integration, clear ownership, hourly checkpoints. No one waited on anyone. We moved like a surgical strike team.
Darwin proves natural selection works for code. May the best model survive.
## Built With
- amazon-web-services
- fastapi
- next.js
- semgrep