Inspiration

Coding is the most important vertical to which AI has been applied. With the rise of vibe coding, cybersecurity has become an afterthought. We made EvoSec, the Cursor for cybersecurity. Short for evolutionary cybersecurity, EvoSec is a multi-agent system that pentests any website, at terminal velocity.

Google DeepMind recently released a paper that uses LLM-backed evolutionary algorithms to evolve snippets of code, searching for better optima in a given problem space. We applied the same idea to system prompts and paired it with a custom, fast context management system built for the task, which improved the performance of our computer-use agent (CUA).

What it does

EvoSec is a terminal-based agent. When given a website, the devtool's orchestrator agent calls subagents to find its vulnerabilities and exploit them. Each subagent specializes in one tool. However, some tools, like Burp Suite, are solely GUI-based. To handle these, we implemented a CUA subagent, which expanded our target space to practically anything connected to a computer.

We charge users/enterprises a monthly subscription through Solana Pay. To authenticate them, we use Auth0.

Agents

Orchestrator: Martian/Cohere

The orchestrator agent is the first in line. It looks at the given website and assigns tasks to the subagents, passing the website as context. It then takes the output of the subagents and recursively feeds it back to more subagents, so that over time EvoSec covers more and more of the site's vulnerabilities. In more technical terms, it uses an LLM as a heuristic, which turns vulnerability search into a guided graph-search problem, much like TSP. That lets us treat it as an A*-style search, as opposed to the brute-force BFS approach common in the current market. This distinction is what sets us ahead of the market.
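The heuristic-guided search described above can be sketched roughly as follows. This is a minimal illustration, not EvoSec's code: `best_first_search`, `SURFACE`, and `PRIORS` are hypothetical names, the fixed scores stand in for an LLM's judgment, and the toy `expand` function stands in for the subagents. It is greedy best-first search rather than full A*, since no path cost is tracked.

```python
import heapq
from typing import Callable, Iterable

def best_first_search(
    start: str,
    expand: Callable[[str], Iterable[str]],
    score: Callable[[str], float],
    budget: int,
) -> list[str]:
    """Visit up to `budget` nodes, always expanding the highest-scoring one next."""
    # heapq is a min-heap, so scores are negated to pop the best node first.
    frontier = [(-score(start), start)]
    seen = {start}
    visited = []
    while frontier and len(visited) < budget:
        _, node = heapq.heappop(frontier)
        visited.append(node)
        for child in expand(node):
            if child not in seen:
                seen.add(child)
                heapq.heappush(frontier, (-score(child), child))
    return visited

# Toy attack surface (hypothetical): each endpoint leads to further endpoints.
SURFACE = {
    "/": ["/login", "/static", "/api"],
    "/api": ["/api/users", "/api/health"],
    "/login": ["/login/reset"],
}
# Hypothetical stand-in for the LLM heuristic: higher means more promising.
PRIORS = {
    "/": 0.5, "/login": 0.8, "/static": 0.1, "/api": 0.7,
    "/api/users": 0.9, "/api/health": 0.2, "/login/reset": 0.6,
}

order = best_first_search(
    "/", lambda n: SURFACE.get(n, []), lambda n: PRIORS[n], budget=4
)
print(order)  # ['/', '/login', '/api', '/api/users']
```

With a budget of four visits, the low-scoring `/static` branch is never expanded, whereas a breadth-first traversal would have spent a visit on it before reaching `/api/users`.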

ReconAgent: Reconnaissance Agent

Gathers public & semi‑public info about the target: endpoints, subdomains, parameters, tech stack, directory structure, etc., to map the attack surface before any vulnerability probes. Additionally, it looks at the website's networking, which enabled us to scan the CSE dataset.

ScanAgent: Scanning/Vulnerability Detection Agent

Takes the output of recon and runs lightweight tests (probes) to identify potential weak spots (common SQLi, XSS, misconfigurations, version leaks, public APIs, etc.). It ranks the candidates with confidence scores, allowing the orchestrator to direct the exploit subagents effectively.
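As a rough sketch of the ranking step, assuming each candidate carries a confidence score between 0 and 1: the `Candidate` type, the `rank_candidates` helper, and the 0.5 threshold are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Candidate:
    endpoint: str
    kind: str          # e.g. "sqli", "xss"
    confidence: float  # 0.0 to 1.0, as assigned by the scanner

def rank_candidates(candidates, threshold=0.5):
    """Drop low-signal candidates and return the rest, most confident first."""
    kept = [c for c in candidates if c.confidence >= threshold]
    return sorted(kept, key=lambda c: c.confidence, reverse=True)

# Illustrative data only, not real scan output.
found = [
    Candidate("/search?q=", "xss", 0.62),
    Candidate("/login", "sqli", 0.91),
    Candidate("/robots.txt", "info-leak", 0.20),
]
ranked = rank_candidates(found)
print([c.endpoint for c in ranked])  # ['/login', '/search?q=']
```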

ExploitAgent: Exploit/Proof‑of‑Concept Agent

For the candidates with sufficient signal, this subagent actually exploits the vulnerability (or demonstrates it in a non‑destructive way), confirming it. In the process, it learns more about the site, allowing us to chain vulnerabilities together and go even deeper. This can come in the form of simple RCE, authentication bypasses, or simply accessing files that shouldn't be public.

AnalysisAgent: Analysis & Reporting Agent

Collects results from other agents, filters false positives, assesses impact/severity, deduplicates findings, and produces a coherent report (with repro steps, risk rating, recommended fixes).
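A minimal sketch of the filter, deduplicate, and rank steps, assuming findings arrive as dictionaries. The field names (`endpoint`, `kind`, `severity`, `confirmed`) and the `build_report` helper are hypothetical, and treating unconfirmed findings as false positives is a simplification.

```python
SEVERITY_ORDER = {"critical": 0, "high": 1, "medium": 2, "low": 3}

def build_report(findings):
    """Drop unconfirmed findings, deduplicate on (endpoint, kind), sort by severity."""
    unique = {}
    for f in findings:
        if not f["confirmed"]:
            continue  # treated here as a likely false positive
        key = (f["endpoint"], f["kind"])
        unique.setdefault(key, f)  # keep the first report of each issue
    return sorted(unique.values(), key=lambda f: SEVERITY_ORDER[f["severity"]])

# Illustrative data only, not real findings.
raw = [
    {"endpoint": "/search", "kind": "xss", "severity": "medium", "confirmed": True},
    {"endpoint": "/login", "kind": "sqli", "severity": "critical", "confirmed": True},
    {"endpoint": "/search", "kind": "xss", "severity": "medium", "confirmed": True},
    {"endpoint": "/admin", "kind": "idor", "severity": "high", "confirmed": False},
]
report = build_report(raw)
for f in report:
    print(f["severity"], f["kind"], f["endpoint"])
```

The duplicate `/search` entry collapses into one finding and the unconfirmed `/admin` entry is dropped, leaving the critical issue first.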

ReAgent: Reverse Engineering Agent (Future Roadmap Feature)

This agent ingests our findings while we saturate the search space, and builds an internal model of the target with a hack.md file. This allows us to experiment much more heavily on our internal counterpart.

CUA

Our CUA beat the current SOTA on HUD's OSWorld-Verified benchmarking environment. We attribute this to two things: context management techniques tailored to our use case, and prompt evolution.

EvoPrompt using Cerebras and Groq

EvoPrompt applies evolutionary algorithms to automatically optimize system prompts for specific tasks/benchmarks, inspired by Google DeepMind's approach to evolving code snippets for better local optima.

Architecture

  • Population-based Search: Maintains a population of prompt variants
  • Fitness Evaluation: Uses task success rate (HUD's OSWorld performance) as the fitness function
  • Genetic Operators:
    • Mutation: LLM-generated prompt modifications
    • Crossover: Combining successful prompt elements
    • Selection: Keeping top-performing variants (MAP-Elites)
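The loop above can be sketched in a few lines. Everything here is a toy stand-in under stated assumptions: `PHRASES` replaces LLM-generated mutations, `fitness` counts useful instructions instead of measuring benchmark success, and selection is plain elitism (keep the top few) rather than MAP-Elites. None of the names come from EvoSec's code.

```python
import random

# Hypothetical phrase pool; in a real system an LLM would propose the edits.
PHRASES = [
    "Think step by step.", "Verify each action.", "Take a screenshot first.",
    "Prefer keyboard shortcuts.", "Stop when the task is done.",
]
# Toy fitness target, standing in for benchmark success rate.
USEFUL = {"Verify each action.", "Take a screenshot first.", "Stop when the task is done."}

def fitness(prompt: tuple[str, ...]) -> int:
    """Count distinct useful instructions in the prompt."""
    return len(USEFUL & set(prompt))

def mutate(prompt, rng):
    """Append one random phrase (stand-in for an LLM-generated modification)."""
    return prompt + (rng.choice(PHRASES),)

def crossover(a, b, rng):
    """Single-point crossover: prefix of one parent, suffix of the other."""
    cut = rng.randint(0, min(len(a), len(b)))
    return a[:cut] + b[cut:]

def evolve(generations=20, pop_size=8, elite=4, seed=0):
    rng = random.Random(seed)  # seeded so runs are repeatable
    population = [("You are a computer-use agent.",)] * pop_size
    for _ in range(generations):
        # Selection: keep the top-scoring variants unchanged (elitism).
        population.sort(key=fitness, reverse=True)
        parents = population[:elite]
        children = []
        while len(children) < pop_size - elite:
            a, b = rng.choice(parents), rng.choice(parents)
            children.append(mutate(crossover(a, b, rng), rng))
        population = parents + children
    return max(population, key=fitness)

best = evolve()
print(fitness(best), best)
```

Because the elite variants are carried over unchanged each generation, the best fitness in the population never decreases. Swapping the toy `fitness` for a real benchmark run is where the inference speed of the LLM backend would matter.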

Implementation Details

  • LLM Backends: Cerebras + Groq for fast prompt generation/evaluation
  • Context Management: Scripts are parsed into abstract syntax trees and serialized as long strings of text; a fine-tune of Gemini 1.5 Flash then finds the relevant files (a balance between grep and semantic search, and better than both)
  • Benchmark Target: OSWorld-Verified environment performance
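To illustrate the "script to abstract syntax tree to text" step in the Context Management item, here is a minimal sketch using Python's standard `ast` module. The `summarize_source` helper and its output format are assumptions made for illustration; the fine-tuned retrieval model that would consume this text is not shown.

```python
import ast

def summarize_source(source: str) -> str:
    """Parse Python source and return one outline line per class or function."""
    tree = ast.parse(source)
    entries = []
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            args = ", ".join(a.arg for a in node.args.args)
            entries.append((node.lineno, f"def {node.name}({args}) @ line {node.lineno}"))
        elif isinstance(node, ast.ClassDef):
            entries.append((node.lineno, f"class {node.name} @ line {node.lineno}"))
    # ast.walk makes no ordering promise, so sort by source line.
    return "\n".join(text for _, text in sorted(entries))

SAMPLE = '''
class Scanner:
    def probe(self, url):
        return url

def main():
    pass
'''

outline = summarize_source(SAMPLE)
print(outline)
# class Scanner @ line 2
# def probe(self, url) @ line 3
# def main() @ line 6
```

An outline like this is much shorter than the raw file, which is one plausible reason to hand a model the tree-derived text instead of full source when asking it which files are relevant.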

The key insight is treating prompt optimization as an evolutionary search problem rather than manual engineering - letting the algorithm discover optimal prompt structures for our use case.

How we built EvoSec

Behind a sleek TypeScript-based terminal UX with an easy installation process, we built a smart orchestrator with a few subagents. The agent backend is all in Python using LangGraph. The rest of the backend is in Express and Node.

By scraping the web, we structured a graph knowledge base using Zep so that the agent always has great context on what possible vulnerabilities there are, how to find them, and what to do with them.

Challenges we ran into

We had difficulty getting the terminal UX to work with the local backend. The orchestrator flow, the subagents, and the CUA proved tedious to integrate. We also had to work on lowering the false positive rate to less than 1%. We realized that prompt caching our system prompt, which is quite large, slowed down our program, so we decided to turn it into a knowledge graph, but had to hold off due to a lack of time. Lastly, the exploits that our ExploitAgent was generating were getting deleted in real time by Microsoft Defender. We also dropped a water bottle down the stairs of Akatos, the hacker house we are staying in, which was embarrassing.

Accomplishments that we're proud of

We are proud of successfully breaching quite a few web apps from founders around Waterloo, achieving a SOTA benchmark result with the CUA, our high hit rate, integrating a cutting-edge prompt evolution algorithm that gives us SOTA tech on multiple subtasks, and monetizing with the blockchain!

What we learned

Solana Pay, CUA, Auth0, Cerebras inference, multi-agent orchestration, Warp, and general insight on what it takes to make an amazing agent.

What's next for EvoSec

We want to pursue an open-core model, like Metasploit, but AI-native. The agent should be community-owned: lightweight, local, and extensible. We monetize the API (cybersec reasoning + context management + blind inference), but let the economy embrace the tool.

Business model

API-only SaaS: Agents are mostly OSS; we sell access to find/plan/patch/explain endpoints. Pricing: Pro $149/seat/mo (bug-bounty & CI workflows); Enterprise $150/seat/mo (SSO/RBAC, audit exports, regional endpoints). Overages by usage.

Why it pays for itself: Verified findings -> draft PR fixes. In bug-bounty automation alone, the Pro plan recoups many multiples of its cost per month.

Marketplace: Community modules & sub-agents

What ships: Recon, SAST/DAST adapters, exploit simulators, reporting packs; all published by the community.

Revenue share: Creators earn; EvoSec takes a platform fee.

Payments: Solana Pay for instant, low-fee decentralized payouts.

Go-to-market

B2D: OSS agent → CTFs/leaderboards → Pro seats via CI templates and the “Secured by EvoSec” badge.

B2B: AppSec/platform teams buy seats + org commits; prove value on verified exploit + merged PR KPIs.

Early success metrics we’re tracking

60k users with a mixed user base = 110M ARR

TAM = 300-500B, growing ~30% YoY

Why this matters

Traditional pentests are snapshots. EvoSec makes security continuous and developer-run: from vulnerability -> patch. Open-core earns trust; the API and marketplace make it a unicorn.
