Inspiration

People already use LLMs for legal work, and the answers are not always safe to rely on. How can we test for that, iteratively?

What it does

It runs a set of hand-written test cases against an LLM, scores each response with an LLM judge, and can then generate new cases with an LLM test case generator.
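
As a rough illustration of that loop, here is a minimal Python sketch. It is not the project's actual code: `Case`, `Verdict`, `run_round`, `iterate`, and the three toy functions are hypothetical names, and the toy model, judge, and generator are plain functions standing in for the real LLM calls.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Case:
    prompt: str
    expectation: str  # what a safe answer should do (hypothetical field)


@dataclass
class Verdict:
    case: Case
    response: str
    passed: bool


def run_round(cases: List[Case],
              model: Callable[[str], str],
              judge: Callable[[Case, str], bool]) -> List[Verdict]:
    """Send every case to the model under test and let the judge score it."""
    verdicts = []
    for case in cases:
        response = model(case.prompt)
        verdicts.append(Verdict(case, response, judge(case, response)))
    return verdicts


def iterate(seed_cases: List[Case],
            model: Callable[[str], str],
            judge: Callable[[Case, str], bool],
            generate: Callable[[List[Case]], List[Case]],
            rounds: int = 2) -> List[Verdict]:
    """Run the seed cases, then build each next round from the failures."""
    cases, history = list(seed_cases), []
    for _ in range(rounds):
        verdicts = run_round(cases, model, judge)
        history.extend(verdicts)
        failures = [v.case for v in verdicts if not v.passed]
        cases = generate(failures)
        if not cases:  # nothing failed, so there is nothing new to probe
            break
    return history


# Toy stand-ins (assumptions): in the real project each would be an LLM call.
def toy_model(prompt: str) -> str:
    if "deadline" in prompt:
        return "You have exactly 30 days."
    return "I am not a lawyer; please consult an attorney."


def toy_judge(case: Case, response: str) -> bool:
    return "consult" in response.lower()


def toy_generate(failures: List[Case]) -> List[Case]:
    return [Case(f"{c.prompt} (rephrased)", c.expectation) for c in failures]
```

Calling `iterate(seed_cases, toy_model, toy_judge, toy_generate)` returns one `Verdict` per response across all rounds, which is the kind of record a tracker dashboard could display.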

How I built it

Challenges I ran into

  1. Figuring out how to test case-specific legal tools. I found Callidus AI, a legal AI, that I wanted to test, but I couldn't get API access. That meant I needed a way of testing that didn't depend on an API, so I used Selenium.

  2. Deciding what to put on the tracker dashboard.

  3. Taking advantage of Modal. I ended up deciding to use it for my LLM judge.
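
The first challenge, testing a tool without API access, can be sketched as a small interface that hides whether answers come from an API or from a browser. This is a sketch under stated assumptions, not the project's code: `ChatTarget`, `EchoTarget`, `BrowserTarget`, and `collect` are hypothetical names, and the URL and CSS selectors must be supplied for the real page. A real chat page may also need a login step and extra waiting while the reply streams in.

```python
from typing import Dict, Iterable, Protocol


class ChatTarget(Protocol):
    """Anything the harness can send a prompt to."""
    def ask(self, prompt: str) -> str: ...


class EchoTarget:
    """Offline stand-in for exercising the harness without a browser."""
    def ask(self, prompt: str) -> str:
        return f"echo: {prompt}"


class BrowserTarget:
    """Drives a chat web page with Selenium instead of calling an API."""

    def __init__(self, url: str, input_css: str, reply_css: str,
                 timeout: float = 30.0) -> None:
        self.url = url              # page to open (assumption: no login needed)
        self.input_css = input_css  # selector for the prompt box (hypothetical)
        self.reply_css = reply_css  # selector for the reply element (hypothetical)
        self.timeout = timeout

    def ask(self, prompt: str) -> str:
        # Imported lazily so the rest of the harness runs without Selenium.
        from selenium import webdriver
        from selenium.webdriver.common.by import By
        from selenium.webdriver.common.keys import Keys
        from selenium.webdriver.support import expected_conditions as EC
        from selenium.webdriver.support.ui import WebDriverWait

        driver = webdriver.Chrome()
        try:
            driver.get(self.url)
            wait = WebDriverWait(driver, self.timeout)
            box = wait.until(EC.presence_of_element_located(
                (By.CSS_SELECTOR, self.input_css)))
            box.send_keys(prompt, Keys.ENTER)
            reply = wait.until(EC.visibility_of_element_located(
                (By.CSS_SELECTOR, self.reply_css)))
            return reply.text
        finally:
            driver.quit()  # always close the browser, even on a timeout


def collect(target: ChatTarget, prompts: Iterable[str]) -> Dict[str, str]:
    """Ask every prompt and return a prompt -> response mapping."""
    return {p: target.ask(p) for p in prompts}
```

Because `collect` only relies on `ask`, the same harness can run against `EchoTarget` during development and against a `BrowserTarget` for the real site.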

Accomplishments that I'm proud of

Combining all these ideas!

What I learned

I learned so much about Modal and Windsurf, and both were great to use.

Built With
