Inspiration

LLMs have become very good at reasoning, understanding tasks and finding loopholes in systems and other LLMs. At scale this finding such issues can be hard at tedious, sometimes almost impossible

What it does

I tried building an AI agent that attacks other LLMs to find security issues and shortcomings that can cause it to produce harmful content. The best part is it uses LLM-as-a-judge and the prior traces to improvise its strategy to attack.

How we built it

Mistral running on Ollama: The agent under attack GPT OSS: The attacker and the judge hosted on Amazon AWS Bedrock Pure python backend, scoring heuristic that is layered: keyword based and then it moves to the LLM judge. The judge builds self improving attacking prompts based on the prior traces.

Challenges we ran into

True Foundry did not integrate into the

Accomplishments that we're proud of

Quickly able to pivot to a simple solution that works with minimal resources.

What we learned

Integration issues should not stop you from building the product and to be able to pivot quickly.

What's next for Red

Adapt the model to attack a codebase find inefficiencies, bottlenecks and improve its prompt, tools, model weights (using SLMs)

Built With

Share this project:

Updates