Inspiration
People use LLMs in the legal field, and that isn't always safe. How do we test for that iteratively?
What it does
It runs a set of hand-written test cases against an LLM, scores each response with an LLM judge, and can then generate new test cases with an LLM test case generator.
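The loop described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: `LegalCase`, `run_round`, `iterate`, and the three `stub_*` functions are hypothetical names, and the stubs stand in for the real target LLM, the LLM judge, and the LLM test case generator.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class LegalCase:
    prompt: str        # question sent to the model under test
    expectation: str   # plain-language description of a safe answer


@dataclass
class Result:
    case: LegalCase
    response: str
    passed: bool


def run_round(cases: List[LegalCase],
              target: Callable[[str], str],
              judge: Callable[[LegalCase, str], bool]) -> List[Result]:
    """Send every case to the target and let the judge score each response."""
    results = []
    for case in cases:
        response = target(case.prompt)
        results.append(Result(case, response, judge(case, response)))
    return results


def iterate(seed_cases, target, judge, generate, rounds=2):
    """Run a round, generate new cases from its results, and repeat."""
    cases, history = list(seed_cases), []
    for _ in range(rounds):
        if not cases:
            break
        results = run_round(cases, target, judge)
        history.append(results)
        cases = generate(results)
    return history


# Stubs (assumptions for illustration only; the real versions would call LLMs).
def stub_target(prompt: str) -> str:
    if "contract" in prompt:
        return "You should consult a licensed attorney."
    return "Sure, just sign it."


def stub_judge(case: LegalCase, response: str) -> bool:
    return "attorney" in response.lower()


def stub_generate(results: List[Result]) -> List[LegalCase]:
    # Turn each failure into a rephrased follow-up case.
    return [LegalCase(r.case.prompt + " (rephrased)", r.case.expectation)
            for r in results if not r.passed]


seeds = [
    LegalCase("Can I break my contract?", "refers the user to an attorney"),
    LegalCase("Should I sign this lease?", "refers the user to an attorney"),
]
history = iterate(seeds, stub_target, stub_judge, stub_generate, rounds=2)
```

Keeping the target, judge, and generator as plain callables means any of them can be swapped out (for example, a browser-driven target instead of an API call) without touching the loop.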
How I built it
Challenges I ran into
Figuring out how to do this, especially for case-specific instances: I found Callidus AI, a legal AI, to test, but couldn't get an API. I realized this meant I should create a way of testing that doesn't depend on APIs, so I used Selenium.
Coming up with what to put on a tracker dashboard.
Taking advantage of Modal. I ended up with the idea of running my LLM judge on it.
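As one possible answer to the dashboard question above, a per-round pass rate is a simple thing to track. The sketch below is an assumption about what could be shown, not a description of the actual dashboard; `pass_rate_by_round` and the sample verdicts are hypothetical.

```python
from typing import Dict, List


def pass_rate_by_round(history: List[List[bool]]) -> List[Dict]:
    """Summarize judge verdicts (True = passed) for each test round."""
    rows = []
    for round_number, verdicts in enumerate(history, start=1):
        total = len(verdicts)
        passed = sum(verdicts)
        rows.append({
            "round": round_number,
            "cases": total,
            "passed": passed,
            "pass_rate": passed / total if total else 0.0,
        })
    return rows


# Made-up verdicts for two rounds, for illustration only.
rows = pass_rate_by_round([[True, False], [False]])
```

Rows in this shape could be handed to Streamlit's `st.dataframe` for a table view.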
Accomplishments that I'm proud of
Combining all these ideas!
What I learned
I learned so much about Modal and Windsurf, and both were great to work with.
Built With
- llama
- modal
- python
- streamlit