Inspiration

Building tools for AI agents by hand is backwards. Writing MCP servers and API integrations takes weeks. Why not let the AI write its own tools on demand? It would be faster. But here's the problem: judging the quality of generated code is nearly impossible at scale.

How do you know if generated code is secure? Fast? Correct?

Then it hit us: natural selection. Species aren't "tested"; they compete, and the fittest survive. What if we applied that to AI tool generation? Let models compete. Scan every output for vulnerabilities. Weak code dies. Strong code survives. The system improves on its own.

Natural selection works because of relentless survival pressure. We brought that pressure to AI-generated code.

What it does

Darwin runs parallel competitions between AI models to generate tools on demand.

The Arena:

  • Submit a task ("create email validator")
  • Three models compete simultaneously (Claude, GPT-4, Llama via AWS Bedrock)
  • Each generates working tool code
  • Semgrep scans every output for security vulnerabilities
  • Fitness scoring: security (40%), success (30%), speed (20%), quality (10%)
  • Winner survives. Losers go extinct.
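The fitness formula above can be sketched in a few lines. The weights come straight from the write-up (security 40%, success 30%, speed 20%, quality 10%); the component names, the 0-to-1 scaling, and the `fitness` function itself are illustrative assumptions, not the actual implementation.

```python
# Weighted fitness, as described in the Arena section.
# Assumption: each component score is normalized to the range [0, 1].
WEIGHTS = {"security": 0.40, "success": 0.30, "speed": 0.20, "quality": 0.10}

def fitness(scores: dict[str, float]) -> float:
    """Combine per-component scores into one number; higher is fitter."""
    return sum(WEIGHTS[k] * scores.get(k, 0.0) for k in WEIGHTS)

# A tool with perfect security but middling speed still scores well:
print(round(fitness({"security": 1.0, "success": 1.0, "speed": 0.5, "quality": 0.8}), 2))  # → 0.88
```

With security weighted at 40%, a vulnerability finding drags a candidate down far more than slow generation does, which is the survival pressure the design intends.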

Evolution Tracking:

  • Tracks every generation in data/generations.json
  • Calculates which models win which task types
  • Shows fitness curves climbing over time
  • Leaderboard reveals natural specialization
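A minimal sketch of what appending to `data/generations.json` might look like. The file path is from the write-up; the record schema (`timestamp`, `task`, `winner`, `scores`) is an assumption for illustration.

```python
import json
import time
from pathlib import Path

GENERATIONS = Path("data/generations.json")  # path named in the write-up

def record_generation(task: str, winner: str, scores: dict) -> None:
    """Append one competition result so fitness curves and the
    leaderboard can be computed from the full history later."""
    history = json.loads(GENERATIONS.read_text()) if GENERATIONS.exists() else []
    history.append({
        "timestamp": time.time(),
        "task": task,
        "winner": winner,
        "scores": scores,  # per-model fitness for this generation
    })
    GENERATIONS.parent.mkdir(parents=True, exist_ok=True)
    GENERATIONS.write_text(json.dumps(history, indent=2))
```

Rewriting the whole file on each generation is fine at hackathon scale and keeps the storage layer to plain JSON, matching the "no DB overhead" choice below.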

The Result: You don't pick the best model. The best model emerges through competition.

Tech Stack:

  • AWS Bedrock for multi-model access
  • Semgrep for vulnerability scanning
  • FastAPI backend + Next.js frontend
  • JSON file storage (fast iteration, no DB overhead)

Challenges

  1. Parallel model invocation - Bedrock async patterns for simultaneous generation
  2. Security scanning speed - Semgrep is thorough but slow; built heuristic fallback
  3. Fitness scoring balance - Weighting security vs speed without killing innovation
  4. Real-time coordination - Four people, zero merge conflicts, perfect handoffs
  5. Time pressure - 6 hours from concept to working demo
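The fan-out pattern behind challenge 1 can be sketched as follows. `generate_with_model` is a stand-in for a real `boto3` bedrock-runtime call, and the model identifiers are illustrative; only the concurrency structure is the point here.

```python
import asyncio

# Model names from the write-up; the exact IDs are illustrative.
MODELS = ["claude", "gpt-4", "llama"]

def generate_with_model(model_id: str, task: str) -> dict:
    # In the real system this would be a blocking boto3 bedrock-runtime
    # call; stubbed here so the pattern is runnable on its own.
    return {"model": model_id, "code": f"# tool for {task!r} by {model_id}"}

async def run_arena(task: str) -> list[dict]:
    # asyncio.to_thread keeps each blocking SDK call off the event loop,
    # and gather() waits for all competitors at once, preserving order.
    return await asyncio.gather(
        *(asyncio.to_thread(generate_with_model, m, task) for m in MODELS)
    )

results = asyncio.run(run_arena("create email validator"))
print([r["model"] for r in results])  # → ['claude', 'gpt-4', 'llama']
```

Running the three generations concurrently means a round takes roughly as long as the slowest model rather than the sum of all three.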

Accomplishments

  • Vertical integration model - Nobody blocked, everyone shipped
  • Natural selection actually works - Models specialize without manual tuning
  • Semgrep integration - Automated vulnerability detection in generated code
  • AWS Bedrock mastery - Multi-model orchestration in production
  • Complete system in 6 hours - Backend, frontend, evolution engine, security
  • Real generational learning - System tracks which models win which tasks

The biggest win: We proved you can test AI-generated tools through competition instead of manual review.

What we learned

AWS Bedrock is a cheat code - Access to Claude, Llama, GPT-4 without managing infrastructure. Parallel invocation is seamless. This would've taken much longer with raw APIs.

Semgrep catches what humans miss - Regex DoS, eval() abuse, shell injection. The security scanning is incredibly good. Every generated tool should be scanned.
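The Challenges section mentions a heuristic fallback for when a full Semgrep pass is too slow. A fallback like that might look something like the sketch below; the pattern set is illustrative and deliberately tiny, and it is far weaker than what Semgrep actually catches.

```python
import re

# Illustrative danger patterns only; a real fallback (and Semgrep) covers far more.
RISKY_PATTERNS = {
    "eval-abuse": re.compile(r"\beval\s*\(|\bexec\s*\("),
    "shell-injection": re.compile(r"os\.system\s*\(|subprocess\..*shell\s*=\s*True"),
    "pickle-load": re.compile(r"\bpickle\.loads?\s*\("),
}

def quick_scan(source: str) -> list[str]:
    """Fast, best-effort check used only when a full scan doesn't fit the budget."""
    return [name for name, pat in RISKY_PATTERNS.items() if pat.search(source)]

print(quick_scan("import os\nos.system(user_input)"))  # → ['shell-injection']
```

The trade-off is the usual one: regex heuristics are instant but shallow, so they gate the fast path while Semgrep remains the authoritative check.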

Natural selection is instantly understandable - No one questions the metaphor. Everyone gets it. Biological framing made the technical system memorable.

Tight teams ship fast - Vertical integration, clear ownership, hourly checkpoints. No one waited on anyone. We moved like a surgical strike team.

Darwin proves natural selection works for code. May the best model survive.
