Inspiration
We have created a document analyzer app using an open-source large language model, Microsoft Phi-3.5-mini-instruct quantized to 5 bits, together with the embedding model mxbai-embed-large-v1-f16. Essentially, we built a Retrieval-Augmented Generation (RAG) system around this LLM. Users can create an account, upload their PDFs, download or delete them, and ask questions about the PDFs they select.
What it does
It works as an information retrieval system that lets users extract relevant information from lengthy documents with ease. It can help with studying research papers, extracting information from legal documents, or simply increasing productivity.
How we built it
We first downloaded the language model from -> https://huggingface.co/bartowski/Phi-3.5-mini-instruct-GGUF and the embedding model from -> https://huggingface.co/mixedbread-ai/mxbai-embed-large-v1
Then we ran the model with llama.cpp as an HTTP web server. The code for this lives in flask-backend/MLmodel/project_convex/model.py
Llama.cpp repo -> https://github.com/ggerganov/llama.cpp
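For reference, here is a minimal sketch of how a GGUF model can be served over HTTP with llama.cpp and health-checked from Python. The binary name, flags, file names, and port are assumptions (they differ between llama.cpp versions), not the exact command from our model.py.

```python
import subprocess
import time

import requests

# Assumed paths and flags; recent llama.cpp builds ship the server as `llama-server`
# (older builds call it `server`). Adjust to match your checkout.
LLAMA_SERVER_BIN = "./llama-server"               # hypothetical path to the llama.cpp server binary
MODEL_PATH = "Phi-3.5-mini-instruct-Q5_K_M.gguf"  # hypothetical 5-bit quantized GGUF file
PORT = 8080

# Launch the model as an HTTP web server.
server = subprocess.Popen([
    LLAMA_SERVER_BIN,
    "-m", MODEL_PATH,
    "--host", "127.0.0.1",
    "--port", str(PORT),
    "-c", "4096",          # context window size
])

# Wait until the server reports it is ready to accept requests.
for _ in range(60):
    try:
        if requests.get(f"http://127.0.0.1:{PORT}/health", timeout=2).status_code == 200:
            break
    except requests.ConnectionError:
        time.sleep(1)
```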
In model.py we convert the PDFs into text using PyMuPDF and then split that text into chunks. These chunks are converted into vector embeddings with our embedding model, and the embeddings are stored in our vector database, Qdrant, which runs inside Docker.
Qdrant VectorDB -> https://qdrant.tech/
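The ingestion step roughly follows the pattern sketched below. This is a simplified illustration, not our exact code: the chunking parameters, collection name, and the assumption that the embedding model is also served by llama.cpp behind an OpenAI-compatible /v1/embeddings endpoint are ours; the 1024-dimension vector size matches mxbai-embed-large-v1.

```python
import fitz  # PyMuPDF
import requests
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, PointStruct, VectorParams

EMBED_URL = "http://127.0.0.1:8081/v1/embeddings"  # assumed embedding server endpoint
COLLECTION = "user_pdfs"                           # hypothetical collection name

def pdf_to_chunks(path: str, chunk_size: int = 1000, overlap: int = 200) -> list[str]:
    """Extract text from a PDF with PyMuPDF and split it into overlapping chunks."""
    doc = fitz.open(path)
    text = "".join(page.get_text() for page in doc)
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size - overlap)]

def embed(text: str) -> list[float]:
    """Get a vector embedding for a piece of text from the embedding server."""
    resp = requests.post(EMBED_URL, json={"input": text, "model": "mxbai-embed-large-v1"})
    resp.raise_for_status()
    return resp.json()["data"][0]["embedding"]

client = QdrantClient(host="localhost", port=6333)  # Qdrant running in Docker
client.recreate_collection(
    collection_name=COLLECTION,
    vectors_config=VectorParams(size=1024, distance=Distance.COSINE),
)

chunks = pdf_to_chunks("example.pdf")
client.upsert(
    collection_name=COLLECTION,
    points=[
        PointStruct(id=i, vector=embed(chunk), payload={"text": chunk})
        for i, chunk in enumerate(chunks)
    ],
)
```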
Whenever the user asks a question about a PDF, we fetch the content related to that question by running a vector search in Qdrant, then build a prompt that contains the retrieved context and the user's query. This prompt is passed to the large language model, which answers the question, and the answer is displayed in the frontend.
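At query time the flow looks roughly like the sketch below (reusing the embed helper and Qdrant client from the ingestion example above); the prompt template and the /v1/chat/completions endpoint of the llama.cpp server are our assumptions, not the exact ones used in model.py.

```python
LLM_URL = "http://127.0.0.1:8080/v1/chat/completions"  # assumed llama.cpp chat endpoint

def answer(question: str, top_k: int = 5) -> str:
    # 1. Vector search: fetch the chunks most similar to the question.
    hits = client.search(
        collection_name=COLLECTION,
        query_vector=embed(question),
        limit=top_k,
    )
    context = "\n\n".join(hit.payload["text"] for hit in hits)

    # 2. Build a prompt that contains the retrieved context plus the user's query.
    messages = [
        {"role": "system", "content": "Answer using only the provided context."},
        {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
    ]

    # 3. Ask the LLM and return its answer to the frontend.
    resp = requests.post(LLM_URL, json={"messages": messages, "temperature": 0.2})
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]
```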
Challenges we ran into
We had a hard time choosing the right size of open-source LLM due to hardware constraints. We initially used Mistral-7B-Instruct, but it took around three minutes to answer a single question on our 16 GB RAM, 8-CPU Linux server. We then switched to Microsoft Phi-3.5-mini, which answers in less than 40 seconds.
While developing the user login/authentication system we faced a lot of CORS (Cross-Origin Resource Sharing) issues. Our app uses cookies to check whether the user is logged in, and the browser was rejecting those cookies on cross-origin requests, so we had to configure HTTPS to set the cookies properly.
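For anyone hitting the same problem, the fix boils down to Flask configuration along these lines. This is a hedged sketch using the Flask-CORS extension; the frontend origin below is a placeholder, not our real domain.

```python
from flask import Flask
from flask_cors import CORS

app = Flask(__name__)

# Browsers only accept cross-origin cookies that are marked SameSite=None and
# Secure, which is why the backend has to be served over HTTPS.
app.config.update(
    SESSION_COOKIE_SAMESITE="None",
    SESSION_COOKIE_SECURE=True,
)

# Allow the React frontend's origin and let it send credentials (cookies).
CORS(
    app,
    origins=["https://example-frontend-domain.com"],  # placeholder origin
    supports_credentials=True,
)
```

The frontend's fetch calls also have to opt in to sending cookies (e.g. `credentials: "include"`), otherwise the session cookie never reaches the backend.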
Accomplishments that we're proud of
We created a fully working cookie-based user authentication system and deployed it successfully over HTTPS. We also didn't get discouraged by the lack of hardware resources and adapted the design of the app accordingly. Deploying the app with Nginx, systemd, and Docker was another major hurdle we overcame.
We built a RAG system using only open-source models, just by reading online documentation and watching various tutorials.
What we learned
We learned
- How to connect a React frontend to a Flask backend using fetch calls
- How to integrate large language models into our app
- How to use llama.cpp
- What vector databases and vector embeddings are
- How RAG systems are developed
- How to expose a model as a web API
- How to deploy full-stack apps using HTTPS, a domain, React, Flask, Docker, systemd services, Nginx, etc.
What's next for Transformo Docs
We will try to expand the app's features, make responses even faster, and improve response quality. We also plan to add language translation, response streaming, and text-to-speech to the main application.
