Inspiration
The rise of DeepSeek inspired me to put the model to the test. The 1.5b model left me very unimpressed, but the 8b model really blew my mind. How well would it stack up against the other LLMs out there? And how well would a Snapdragon processor handle them? That led me to develop this little project.
What it does
The user selects two LLMs to face off, as well as a third LLM to act as the referee (it can be the same model as one of the contestants, since all three run as separate instances). The contestants are given context retrieved from a vector store built from a PDF (in my case, Homer's "The Odyssey") and asked the same questions (listed in the questions.csv file). The referee then verifies each model's answer against the correct answer (also included in questions.csv) and classifies it as correct or incorrect.
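For context on the plumbing: the PDF is chunked and indexed so that only the relevant passages get handed to each contestant. Here is a minimal sketch of that setup with langchain and Ollama embeddings; the file name, chunk sizes, embedding model, and the choice of FAISS (which needs the faiss-cpu package) are illustrative assumptions rather than my exact configuration.

```python
# Minimal RAG setup sketch: chunk the PDF and index it in a local vector store.
# File name, chunk sizes, embedding model, and vector store choice are illustrative.
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.embeddings import OllamaEmbeddings
from langchain_community.vectorstores import FAISS

pages = PyPDFLoader("the_odyssey.pdf").load()

splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
chunks = splitter.split_documents(pages)

# Requires an Ollama embedding model (e.g. nomic-embed-text) pulled locally.
embeddings = OllamaEmbeddings(model="nomic-embed-text")
vector_store = FAISS.from_documents(chunks, embeddings)

# Retrieve the chunks most relevant to a question and join them into context.
docs = vector_store.similarity_search("Who is Telemachus?", k=4)
context = "\n\n".join(d.page_content for d in docs)
```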
The program tracks how long each model took to answer and whether the answer was judged correct. Once both LLMs have answered every question, a line graph pops up displaying the results, and the data is saved as a CSV file in case the user wants to refer to it later.
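The bookkeeping around that is straightforward. Below is a rough sketch of the timing, CSV logging, and matplotlib plot; ask_model() and referee_says_correct() are hypothetical stand-ins for the actual Ollama calls, and questions.csv is assumed to hold one question,answer pair per row.

```python
# Sketch of per-question timing, CSV logging, and the results plot.
# ask_model() and referee_says_correct() are hypothetical helpers standing in
# for the real Ollama calls; the CSV layout is assumed, not my exact format.
import csv
import time
import matplotlib.pyplot as plt

with open("questions.csv", newline="") as f:
    questions = [(row[0], row[1]) for row in csv.reader(f)]

models = ("deepseek-r1:8b", "llama3.1:8b")
results = []  # (model, question number, seconds, correct)

for model in models:
    for i, (question, expected) in enumerate(questions, start=1):
        start = time.perf_counter()
        reply = ask_model(model, question)                # hypothetical helper
        elapsed = time.perf_counter() - start
        correct = referee_says_correct(reply, expected)   # hypothetical helper
        results.append((model, i, elapsed, correct))

# Save the raw numbers so they can be revisited later.
with open("results.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["model", "question", "seconds", "correct"])
    writer.writerows(results)

# Plot response time per question for each contestant.
for model in models:
    times = [r[2] for r in results if r[0] == model]
    plt.plot(range(1, len(times) + 1), times, marker="o", label=model)
plt.xlabel("Question")
plt.ylabel("Response time (s)")
plt.legend()
plt.show()
```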
How I built it
The project uses Ollama as the backend to run the models and Python to interface with them. I noticed how spotty the 1b and 1.5b models were across the board, and the 8b models were way too slow to run locally. However, when I set them up on a Snapdragon X Plus 8-core device, they executed much faster, even without a GPU. I ended up running the 1.5b models locally and the 8b models on the Qualcomm Device Cloud (QDC).
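Talking to a running Ollama server from Python boils down to one chat call per question. Here is a minimal sketch of how a grounded question can be sent to a contestant model; the model tag and prompt wording are illustrative, not my exact code.

```python
# Sending one grounded question to a model running on the local Ollama server.
# The model tag, context string, and prompt wording are illustrative placeholders.
import ollama

context = "...chunks retrieved from the vector store..."
question = "Who is Telemachus?"

response = ollama.chat(
    model="deepseek-r1:8b",
    messages=[
        {"role": "system",
         "content": f"Answer using only this context:\n\n{context}"},
        {"role": "user", "content": question},
    ],
)
print(response["message"]["content"])
```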
Challenges I ran into
The inaccuracy of the 1.5b models was such a disappointment that I almost gave up on the project altogether. However, once I had a QDC device running and tested the 8b models, the results were much more in line with what I expected.
Judging the correctness of a model's response was also a challenge, since a response may contain text found in the expected answer while still reaching the wrong conclusion. I solved that problem by adding a third LLM to act as the referee and decide whether an answer is correct.
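Concretely, the referee is just a third chat call with a judging prompt. A rough sketch of the idea, with the prompt wording and the one-word verdict format as assumptions rather than my exact implementation:

```python
# Referee sketch: a third model judges a contestant's answer against the
# expected answer from questions.csv. Prompt wording is an assumption.
import ollama

def referee_says_correct(model_answer: str, expected_answer: str,
                         referee_model: str = "llama3.1:8b") -> bool:
    prompt = (
        "You are a strict referee. Compare the model's answer with the expected "
        "answer and reply with exactly one word: CORRECT or INCORRECT.\n\n"
        f"Expected answer: {expected_answer}\n"
        f"Model's answer: {model_answer}"
    )
    response = ollama.chat(
        model=referee_model,
        messages=[{"role": "user", "content": prompt}],
    )
    # "CORRECT" is a substring of "INCORRECT", so test for the negative verdict.
    return "INCORRECT" not in response["message"]["content"].upper()

# An answer that reuses words from the expected text but draws the wrong
# conclusion should still be judged incorrect.
print(referee_says_correct("Odysseus never returns to Ithaca.",
                           "Odysseus returns to Ithaca and reclaims his home."))
```

Asking for a single-word verdict keeps the parsing on the Python side trivial, even when a response quotes the expected answer but gets the conclusion wrong.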
Accomplishments that I'm proud of
I got to pit DeepSeek against Llama and other LLMs to see just how good this new model really is. I learned a lot about chunks, vector stores, and indexes, and got to play around with various devices in the QDC. This was probably the most fulfilling project I've worked on through Devpost.
What I learned
I learned about retrieval-augmented generation (RAG) and how to feed LLMs specific context in order to get precise responses that focus solely on the document provided. I also learned that the DeepSeek and Llama 3.1 8b models are on par as far as correctness is concerned, but that Llama is much faster in execution. However, the most important lesson was to stay away from the 1b and 1.5b models - their answers were awful across the board.
What's next for LLMatchup
Some features I'd love to add include matchups with more than two LLMs, the ability for the referee to generate the questions and answers itself, and automatic downloading of whichever models the user specifies.
Built With
- langchain
- matplotlib
- ollama
- python