Inspiration

Building tools for AI agents by hand is backwards. Writing MCP servers and API integrations takes weeks. Why not let the AI write its own tools on demand? It would be faster. But here's the problem: judging the quality of generated code is nearly impossible at scale.

How do you know if generated code is secure? Fast? Correct?

Then it hit us: natural selection. Species aren't "tested"; they compete, and the fittest survive. What if we applied that to AI tool generation? Let models compete. Scan every output for vulnerabilities. Weak code dies. Strong code survives. The system improves on its own.

Natural selection works because of relentless survival pressure. We brought that pressure to AI-generated code.

What it does

Darwin runs parallel competitions between AI models to generate tools on demand.

The Arena:

  • Submit a task ("create email validator")
  • Three models compete simultaneously (Claude, GPT-4, Llama via AWS Bedrock)
  • Each generates working tool code
  • Semgrep scans every output for security vulnerabilities
  • Fitness scoring: security (40%), success (30%), speed (20%), quality (10%)
  • Winner survives. Losers go extinct.
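The fitness formula above can be sketched in a few lines. The weights come straight from the write-up (security 40%, success 30%, speed 20%, quality 10%); the component names, the 0-to-1 scaling, and the `fitness` function itself are illustrative assumptions, not the actual implementation.

```python
# Weighted fitness, as described in the Arena section.
# Assumption: each component score is normalized to the range [0, 1].
WEIGHTS = {"security": 0.40, "success": 0.30, "speed": 0.20, "quality": 0.10}

def fitness(scores: dict[str, float]) -> float:
    """Combine per-component scores into one number; higher is fitter."""
    return sum(WEIGHTS[k] * scores.get(k, 0.0) for k in WEIGHTS)

# A tool with perfect security but middling speed still scores well:
print(round(fitness({"security": 1.0, "success": 1.0, "speed": 0.5, "quality": 0.8}), 2))  # → 0.88
```

With security weighted at 40%, a vulnerability finding drags a candidate down far more than slow generation does, which is the survival pressure the design intends.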

Evolution Tracking:

  • Tracks every generation in data/generations.json
  • Calculates which models win which task types
  • Shows fitness curves climbing over time
  • Leaderboard reveals natural specialization
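A minimal sketch of what appending to `data/generations.json` might look like. The file path is from the write-up; the record schema (`timestamp`, `task`, `winner`, `scores`) is an assumption for illustration.

```python
import json
import time
from pathlib import Path

GENERATIONS = Path("data/generations.json")  # path named in the write-up

def record_generation(task: str, winner: str, scores: dict) -> None:
    """Append one competition result so fitness curves and the
    leaderboard can be computed from the full history later."""
    history = json.loads(GENERATIONS.read_text()) if GENERATIONS.exists() else []
    history.append({
        "timestamp": time.time(),
        "task": task,
        "winner": winner,
        "scores": scores,  # per-model fitness for this generation
    })
    GENERATIONS.parent.mkdir(parents=True, exist_ok=True)
    GENERATIONS.write_text(json.dumps(history, indent=2))
```

Rewriting the whole file on each generation is fine at hackathon scale and keeps the storage layer to plain JSON, matching the "no DB overhead" choice below.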

The Result: You don't pick the best model. The best model emerges through competition.

Tech Stack:

  • AWS Bedrock for multi-model access
  • Semgrep for vulnerability scanning
  • FastAPI backend + Next.js frontend
  • JSON file storage (fast iteration, no DB overhead)

Challenges

  1. Parallel model invocation - Bedrock async patterns for simultaneous generation
  2. Security scanning speed - Semgrep is thorough but slow; built heuristic fallback
  3. Fitness scoring balance - Weighting security vs speed without killing innovation
  4. Real-time coordination - Four people, zero merge conflicts, perfect handoffs
  5. Time pressure - 6 hours from concept to working demo
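The fan-out pattern behind challenge 1 can be sketched as follows. `generate_with_model` is a stand-in for a real `boto3` bedrock-runtime call, and the model identifiers are illustrative; only the concurrency structure is the point here.

```python
import asyncio

# Model names from the write-up; the exact IDs are illustrative.
MODELS = ["claude", "gpt-4", "llama"]

def generate_with_model(model_id: str, task: str) -> dict:
    # In the real system this would be a blocking boto3 bedrock-runtime
    # call; stubbed here so the pattern is runnable on its own.
    return {"model": model_id, "code": f"# tool for {task!r} by {model_id}"}

async def run_arena(task: str) -> list[dict]:
    # asyncio.to_thread keeps each blocking SDK call off the event loop,
    # and gather() waits for all competitors at once, preserving order.
    return await asyncio.gather(
        *(asyncio.to_thread(generate_with_model, m, task) for m in MODELS)
    )

results = asyncio.run(run_arena("create email validator"))
print([r["model"] for r in results])  # → ['claude', 'gpt-4', 'llama']
```

Running the three generations concurrently means a round takes roughly as long as the slowest model rather than the sum of all three.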

Accomplishments

  • Vertical integration model - Nobody blocked, everyone shipped
  • Natural selection actually works - Models specialize without manual tuning
  • Semgrep integration - Automated vulnerability detection in generated code
  • AWS Bedrock mastery - Multi-model orchestration in production
  • Complete system in 6 hours - Backend, frontend, evolution engine, security
  • Real generational learning - System tracks which models win which tasks

The biggest win: We proved you can test AI-generated tools through competition instead of manual review.

What we learned

AWS Bedrock is a cheat code - Access to Claude, Llama, GPT-4 without managing infrastructure. Parallel invocation is seamless. This would've taken much longer with raw APIs.

Semgrep catches what humans miss - Regex DoS, eval() abuse, shell injection. The security scanning is incredibly good. Every generated tool should be scanned.
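The Challenges section mentions a heuristic fallback for when a full Semgrep pass is too slow. A fallback like that might look something like the sketch below; the pattern set is illustrative and deliberately tiny, and it is far weaker than what Semgrep actually catches.

```python
import re

# Illustrative danger patterns only; a real fallback (and Semgrep) covers far more.
RISKY_PATTERNS = {
    "eval-abuse": re.compile(r"\beval\s*\(|\bexec\s*\("),
    "shell-injection": re.compile(r"os\.system\s*\(|subprocess\..*shell\s*=\s*True"),
    "pickle-load": re.compile(r"\bpickle\.loads?\s*\("),
}

def quick_scan(source: str) -> list[str]:
    """Fast, best-effort check used only when a full scan doesn't fit the budget."""
    return [name for name, pat in RISKY_PATTERNS.items() if pat.search(source)]

print(quick_scan("import os\nos.system(user_input)"))  # → ['shell-injection']
```

The trade-off is the usual one: regex heuristics are instant but shallow, so they gate the fast path while Semgrep remains the authoritative check.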

Natural selection is instantly understandable - No one questions the metaphor. Everyone gets it. Biological framing made the technical system memorable.

Tight teams ship fast - Vertical integration, clear ownership, hourly checkpoints. No one waited on anyone. We moved like a surgical strike team.

Darwin proves natural selection works for code. May the best model survive.
