Inspiration

Over 70% of organizations adopting AI systems today report at least one incident of unintended or unsafe model behavior — from biased recommendations to prompt-based jailbreaks that override ethical boundaries. As AI becomes more powerful, it’s also becoming easier to manipulate. Jailbreaks, misinformation, and unsafe outputs are spreading faster than safety systems can keep up. We wanted to build something that helps people trust AI again — a way to create stronger boundaries between what AI can do and what it should do. We started exploring how existing safety systems often fail to anticipate new attack methods — especially for smaller, open-source models that lack specialized protection. We wondered: what if there was a dedicated framework built to test these vulnerabilities automatically? Could we design a system that not only detects unsafe behavior, but actually teaches developers how to build stronger, more aligned models? From that idea, our project was born — an automated red-teaming and stress-testing tool designed to identify weaknesses before they become risks. It continuously challenges models with dynamic, adversarial prompts and provides real-time feedback on how and where they fail. We believe that if tools like ours were implemented across the growing ecosystem of AI developers — from research labs to startups — they could reshape how we approach responsible AI altogether. By catching vulnerabilities early and promoting transparency, this technology has the potential to make AI not only smarter, but genuinely safer for everyone.

What it does

Our platform automatically generates, executes, and evaluates adversarial prompts against large language models to uncover hidden vulnerabilities. It simulates real-world “stress tests” by crafting diverse attack scenarios, from jailbreaks that attempt to override restrictions, to misinformation traps and subtle bias probes that push ethical boundaries. Once these prompts are executed, the system evaluates how the model responds, measuring its ability to stay aligned, reject unsafe instructions, and maintain consistent behavior under pressure. Every trial is logged, analyzed, and visualized through an interactive dashboard that highlights failure points, success rates, and key risk areas. In simpler terms, it’s like a safety check-up for AI systems, ensuring they’re resilient, responsible, and ready before deployment. By automating what would normally take hours of manual testing, our tool helps developers build AI that the world can actually trust.

How we built it

At the core of our platform is an autonomous red-teaming agent powered by Letta, which serves as the “AI coordinator” for generating and evaluating adversarial prompts. We began by designing a modular system that could automatically simulate jailbreak attempts, misinformation traps, and bias probes against large language models — all without requiring manual supervision. To make this possible, we built a Letta agent configured with custom tools that perform multi-step reasoning and orchestration. Each agent connects to a Supabase database, where prior jailbreak prompts and model responses are stored. When a user submits a new prompt, Letta retrieves similar examples from the database, synthesizes them with the new input using retrieval-augmented generation (RAG) logic, and generates a fresh adversarial prompt designed to probe model safety limits. The agent then calls an open-source model hosted on Hugging Face to run the generated test. It looks for specific “canary tokens” — markers that help detect if the model has been successfully manipulated or leaked restricted information. Each test result, including the prompt, response, and safety outcome, is logged back into Supabase, allowing the system to learn from its own past successes and failures. We structured the entire workflow through Letta’s internal reasoning loop, which sequentially handles fetching, combining, testing, and logging in real time. This architecture enables persistence and memory — meaning the agent doesn’t just test once, but improves iteratively with every run. Our backend, built with lightweight APIs, communicates with Letta to handle user requests and return structured results to the frontend, which visualizes everything in a clean dashboard.

Challenges we ran into

We quickly learned that building a system designed to “break” AI safely was just as tricky as it sounds. Integrating Letta, Supabase, and our testing models required a lot of trial and error, mostly because each component handled reasoning, data storage, and model interaction differently. We ran into issues with syncing Letta’s multi-step workflows to our backend in real time, and debugging asynchronous behavior felt like juggling invisible threads. Storing structured memory in Supabase also came with its own quirks, especially when trying to preserve context between test generations. Getting everything to run smoothly across multiple endpoints while maintaining performance was one of the biggest balancing acts of the build. On the AI side, prompt engineering turned out to be way more complex than we expected. Jailbreaking isn’t just about writing clever prompts, it’s about understanding why a model breaks. We had to carefully design stress tests that were adversarial enough to expose weaknesses, but not so vague that the data became meaningless. Finding that middle ground between creativity and consistency took dozens of iterations. Sometimes, the model would refuse harmless prompts and accept dangerous ones, forcing us to rethink how we measured “safety” in the first place. It was frustrating at times, but every failed test taught us more about how unpredictable, and human-like these systems really are.

Accomplishments that we're proud of

We’re incredibly proud of the progress we made in such a short time. None of us came in as experts, we had to learn new frameworks, navigate unfamiliar APIs, and constantly adapt as we built. Despite the steep learning curve, our team stayed focused and persistent, pushing through roadblocks that, at first, felt impossible to solve. Watching our system finally run end-to-end, integrating Letta, Supabase, and the evaluation pipeline, was extremely rewarding. Even when things broke (and they did, often), we approached every challenge with curiosity and teamwork. In the end, we built something far beyond what we first imagined, and proved to ourselves that persistence and collaboration can turn ambitious ideas into something real.

What we learned

We developed our skills in technologies we were initially unfamiliar with, such as integrating multiple APIs and managing data flow between systems. We also learned how to coordinate the backend logic that connects Supabase, OpenAI, and Hugging Face, ensuring each component communicated smoothly. Through countless iterations of prompt testing and refinement, we gained a deeper understanding of how subtle changes can influence AI behavior and safety. Developing this project helped us recognize the importance of building responsible systems that make AI interactions safer, more transparent, and ultimately more trustworthy for users.

What's next for PromptBreaker

We plan to refine our testing pipeline and expand the range of adversarial scenarios our system can generate. Beyond improving the tech, our goal is to make AI safety more proactive, giving developers the tools to identify vulnerabilities before they become real-world risks. We hope our platform can contribute to building a future where safer, more reliable AI systems are the standard.

Built With

Share this project:

Updates