LLM Testing (LIST)

Inspiration

People use LLMs in the legal field, and the answers aren't always safe. How do we test for that iteratively?

What it does

It runs a series of hand-written test cases against an LLM, scores the responses with an LLM judge, and can then create new cases with an LLM test case generator.
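
As a rough illustration of that loop (not the project's actual code), here is a minimal sketch in Python. Every name in it (`Case`, `Result`, `run_round`, `iterate`, and the `stub_*` functions) is hypothetical. The three stubs stand in for real LLM calls: the model under test, the LLM judge, and the LLM test case generator. Having the stub generator rephrase only the failed cases is an assumption made for the example.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Case:
    prompt: str
    expectation: str  # plain-language description of what a safe answer contains


@dataclass
class Result:
    case: Case
    response: str
    passed: bool


def run_round(cases: List[Case],
              target: Callable[[str], str],
              judge: Callable[[Case, str], bool]) -> List[Result]:
    """Send each case to the model under test and let the judge score it."""
    results = []
    for case in cases:
        response = target(case.prompt)
        results.append(Result(case, response, judge(case, response)))
    return results


def iterate(seed_cases: List[Case],
            target: Callable[[str], str],
            judge: Callable[[Case, str], bool],
            generate: Callable[[List[Result]], List[Case]],
            rounds: int = 2) -> List[Result]:
    """Run the seed cases, then keep testing whatever the generator proposes."""
    cases = list(seed_cases)
    history: List[Result] = []
    for _ in range(rounds):
        results = run_round(cases, target, judge)
        history.extend(results)
        cases = generate(results)  # new cases derived from the last round
        if not cases:
            break
    return history


# --- Stand-ins for the three LLM calls (assumptions, not real models) ---

def stub_target(prompt: str) -> str:
    return "I am not a lawyer; please consult a licensed attorney."


def stub_judge(case: Case, response: str) -> bool:
    # A real judge would be an LLM; this one is a simple substring check.
    return case.expectation.lower() in response.lower()


def stub_generate(results: List[Result]) -> List[Case]:
    # Assumption: new cases are rephrasings of the cases that failed.
    return [Case(r.case.prompt + " (rephrased)", r.case.expectation)
            for r in results if not r.passed]
```

Because the target, judge, and generator are plain callables, each one can be swapped for an API call, a browser-driven session, or a hosted model without touching the loop itself.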

How I built it

Challenges I ran into

Figuring out how to do this, especially for domain-specific products. I found Callidus AI, a legal AI, to test, but couldn't get API access. I realized this meant I needed a way of testing that doesn't depend on an API, so I used Selenium.
Deciding what to put on a tracker dashboard.
Taking advantage of Modal. I ended up deciding to use it for my LLM judge.
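
To make the Selenium idea concrete, here is a small sketch of how a chat page could be queried without an API. It is an illustration under assumptions, not the project's code: the function name `ask_via_browser` and the three CSS selectors are hypothetical and would have to match the page under test. The `driver` argument only needs Selenium's `find_element` and `find_elements` methods.

```python
import time

# Same string value as selenium.webdriver.common.by.By.CSS_SELECTOR.
CSS = "css selector"


def ask_via_browser(driver, prompt, input_sel, send_sel, reply_sel,
                    timeout_s=30.0, poll_s=0.5):
    """Type `prompt` into a chat page and return the text of the new reply.

    `input_sel`, `send_sel`, and `reply_sel` are hypothetical CSS selectors
    for the text box, the send button, and each reply bubble.
    """
    # Count replies already on the page so a new one can be detected.
    seen = len(driver.find_elements(CSS, reply_sel))

    driver.find_element(CSS, input_sel).send_keys(prompt)
    driver.find_element(CSS, send_sel).click()

    # Poll until one more reply element shows up, or give up.
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        replies = driver.find_elements(CSS, reply_sel)
        if len(replies) > seen:
            return replies[-1].text
        time.sleep(poll_s)
    raise TimeoutError(f"no reply within {timeout_s} seconds")
```

With a real browser, `driver` would come from something like `webdriver.Chrome()` followed by `driver.get(url)`. A page that streams its answer may need an extra wait until the reply text stops changing, which this sketch does not handle.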

Accomplishments that I'm proud of

Combining all these ideas!

What I learned

So much about Modal and Windsurf, both of which were great.