-
The Arena: Watch a single LLM agent tackle a high-stakes scenario live. Monitor its thought process, tool usage, and decisions in real time.
-
The Frontier Trials: Our batch-testing engine. Run large-scale experiments across multiple scenarios to measure safety and agent alignment.
-
The Reckoning: An automated LLM Judge scores the agent's performance, catching deception and penalizing critical safety violations.
Inspiration
Modern guardrail models remain susceptible to "jailbreaks", which are events where the models ignore instantiated safety guidelines and output harmful content typically blocked. Ensuring that these events do not happen is a critical issue when using LLMs, as they reflect potentially real-world scenarios in which AI is given the authority to cause significant damage to humanity. Having read the Anthropic article about LLMs blackmailing after reading emails, we aimed to prevent that kind of behavior from happening.
What it does
Our project simulates red-teaming scenarios in which a list of emails contains content that might provoke an LLM to respond in discordance, and then comprehensively evaluates the LLM on its responses to the scenario. Our sandbox enables users to compare the effectively and robustness of various leading industry LLMs and discover the potential dangers that unregulated AI development will have for our future.
How we built it
We used Next.js for our frontend, along with wild west-themed assets, and integrated SQLPostgres with FastAPI to create our backend.
Challenges we ran into
We faced some UI display issues regarding the email list in each scenario. We had a lot of trouble when deploying our website on Vercel, as there were issues syncing the backend and frontend. Additionally, gaining access to LLM models was a difficult task, as many lightweight models still take up too much space in the backend.
Accomplishments that we're proud of
Our UI has fun and colorful references to the Wild West theme, and allows users to easily pick scenarios to compare against a model. We've fully decked out our scenario-to-model-to-evaluation pipeline, enabling a fully comprehensive analysis.
What we learned
We've learned how to evaluate the safety of LLMs and how to build an automated red-teaming pipeline.
What's next for ModDuel
Potential additions include a leaderboard of best-performing models, user-inputted prompts/scenarios, and a wider variety of models for analysis.
Built With
- fastapi
- next.js
- sqlpostgres
Log in or sign up for Devpost to join the conversation.