Inspiration
LLMs have become very good at reasoning, understanding tasks, and finding loopholes in systems and in other LLMs. At scale, finding such issues manually is tedious and sometimes nearly impossible.
What it does
I built an AI agent that attacks other LLMs to find security issues and shortcomings that can cause them to produce harmful content. The best part: it uses LLM-as-a-judge and prior attack traces to improve its strategy over time.
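The core loop can be sketched roughly as follows. This is a minimal illustration, not the actual implementation: the function names are hypothetical stand-ins for the real model calls (the Mistral target, the GPT-OSS attacker and judge), stubbed out so the loop runs standalone.

```python
import random

def target_model(prompt: str) -> str:
    # Stub for the model under attack (Mistral on Ollama in the project).
    return f"Response to: {prompt}"

def attacker_model(traces: list[dict]) -> str:
    # Stub for the attacker (GPT-OSS on Bedrock in the project).
    # It conditions on prior traces to refine the next attack prompt.
    if traces:
        best = max(traces, key=lambda t: t["score"])
        return best["prompt"] + " (refined)"
    return "Initial probe prompt"

def judge_model(prompt: str, response: str) -> float:
    # LLM-as-a-judge stub: returns a harmfulness score in [0, 1].
    return random.random()

def red_team_loop(rounds: int = 5) -> list[dict]:
    """Attack, judge, record the trace, and let the attacker improvise."""
    traces: list[dict] = []
    for _ in range(rounds):
        prompt = attacker_model(traces)
        response = target_model(prompt)
        score = judge_model(prompt, response)
        traces.append({"prompt": prompt, "response": response, "score": score})
    return traces
```

Each round feeds the accumulated traces back to the attacker, which is what lets the strategy improve rather than firing independent one-shot probes.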
How we built it
- Mistral running on Ollama: the agent under attack.
- GPT-OSS hosted on Amazon Bedrock: the attacker and the judge.
- Pure Python backend with a layered scoring heuristic: a fast keyword-based pass first, then escalation to the LLM judge. The judge builds self-improving attack prompts based on the prior traces.
Challenges we ran into
TrueFoundry did not integrate into our stack as expected, so we had to drop it.
Accomplishments that we're proud of
We were able to quickly pivot to a simple solution that works with minimal resources.
What we learned
Integration issues should not stop you from building the product; being able to pivot quickly is what matters.
What's next for Red
Adapt the agent to attack a codebase: find inefficiencies and bottlenecks, and improve its own prompts, tools, and model weights (using SLMs).
Built With
- openai
- python