A multi-tenant Flask app that turns a stack of PDFs into an embeddable, retrieval-augmented chatbot. Upload your documents, generate a knowledge base, and drop a one-line <script> tag into any website to render a ChatGPT-style widget that answers questions grounded in your PDFs.
Built in late 2023 as an early exploration of the RAG pattern over the (then-new) LangChain + OpenAI + FAISS stack, with a real account system and S3-backed storage so multiple users could maintain their own private assistants.
I wanted to feel the full RAG loop end to end, not just the "retrieve + answer" toy version: real user accounts, per-tenant document isolation, persisted vector stores, and an embed widget so the chatbot could actually live on someone else's page. The original use case was a chatbot for a small institute's website (see the footer of templates/index.html) that could answer questions about their programs from a handful of brochures.
PDF upload -> S3 (uploads/{user_id}/)
|
v
PyPDF2 text extract
|
v
CharacterTextSplitter (1000 / 200 overlap)
|
v
OpenAI embeddings (batched, 5000 / call)
|
v
FAISS vector store
|
v
pickle.dumps -> Postgres (User.assistant_data column)

Question -> ConversationalRetrievalChain (ChatOpenAI + ConversationBufferMemory)
|
v
Grounded answer
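The chunking and batching steps in the ingest flow can be approximated in plain Python. This is a simplified sketch, not the repo's code: the app uses LangChain's CharacterTextSplitter, which splits on a separator and merges pieces, while the hypothetical chunk_text below slices fixed character windows. The 1000 / 200 and 5000 figures come from the diagram; the function names are made up for illustration.

```python
from typing import Iterator, List


def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 200) -> List[str]:
    """Split text into fixed-size character windows that overlap by `overlap`.

    Simplification: the real splitter works on separator boundaries.
    """
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    chunks: List[str] = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


def batched(items: List[str], batch_size: int = 5000) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `batch_size` items (one embeddings call each)."""
    for i in range(0, len(items), batch_size):
        yield items[i:i + batch_size]
```

Each chunk repeats the last 200 characters of the previous one, so a sentence cut at a window boundary still appears whole in at least one chunk when it is shorter than the overlap.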
Each registered user gets a random 16-byte pin, which is the only key the embed widget needs. The chatbot is generated as a customized embedChatbot.js (with the user's pin, name, icon, greeting baked in), uploaded to S3, and served through CloudFront so any external page can include it with one tag.
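A minimal sketch of how such a pin could be generated with a bounded collision retry. The function name generate_pin, the is_taken callback, the hex encoding, and the max_attempts limit are all assumptions for illustration; the repo's own loop is not reproduced here.

```python
import secrets
from typing import Callable


def generate_pin(is_taken: Callable[[str], bool], max_attempts: int = 5) -> str:
    """Return a random pin that `is_taken` reports as unused.

    `is_taken` is a hypothetical stand-in for a database lookup on the pin column.
    """
    for _ in range(max_attempts):
        pin = secrets.token_hex(16)  # 16 random bytes, hex-encoded to 32 characters
        if not is_taken(pin):
            return pin
    # With 128 bits of randomness a collision is extremely unlikely, so
    # repeated collisions point at a bug rather than bad luck.
    raise RuntimeError(f"could not generate a unique pin in {max_attempts} attempts")
```

Because the pin is the only credential the widget presents, drawing it from the secrets module rather than random keeps it unpredictable.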
In-memory assistant instances are reaped after 10 minutes of inactivity via a threading.Timer to keep RAM bounded; they rehydrate lazily from Postgres on the next chat request.
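The reaping behaviour can be sketched roughly as follows. AssistantCache and its loader callback are hypothetical names, not the repo's code; the loader stands in for whatever rebuilds an assistant from the stored Postgres data. The sketch assumes one timer per pin that is restarted on every request.

```python
import threading
from typing import Any, Callable, Dict, Tuple


class AssistantCache:
    """Keeps loaded assistants in memory and drops them after `ttl_seconds` idle."""

    def __init__(self, loader: Callable[[str], Any], ttl_seconds: float = 600.0) -> None:
        self._loader = loader  # hypothetical: rebuilds an assistant from Postgres
        self._ttl = ttl_seconds
        self._lock = threading.Lock()
        self._items: Dict[str, Any] = {}
        self._timers: Dict[str, Tuple[threading.Timer, object]] = {}

    def get(self, pin: str) -> Any:
        """Return the assistant for `pin`, loading it lazily, and restart its idle timer."""
        with self._lock:
            if pin not in self._items:
                # Loading under the lock keeps the sketch simple but blocks other pins.
                self._items[pin] = self._loader(pin)
            self._restart_timer(pin)
            return self._items[pin]

    def _restart_timer(self, pin: str) -> None:
        previous = self._timers.pop(pin, None)
        if previous is not None:
            previous[0].cancel()
        token = object()  # identifies this timer so a stale one cannot evict
        timer = threading.Timer(self._ttl, self._evict, args=(pin, token))
        timer.daemon = True  # do not keep the process alive just for reaping
        self._timers[pin] = (timer, token)
        timer.start()

    def _evict(self, pin: str, token: object) -> None:
        with self._lock:
            current = self._timers.get(pin)
            if current is None or current[1] is not token:
                return  # a newer request restarted the timer; leave the entry alone
            del self._timers[pin]
            self._items.pop(pin, None)
```

The token check covers the case where a timer fires at the same moment a new request arrives: the stale timer sees that it is no longer the current one and leaves the freshly used entry alone.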
- Backend: Flask, Flask-Login, Flask-SQLAlchemy, Postgres (psycopg2)
- LLM: OpenAI ChatOpenAI via LangChain ConversationalRetrievalChain
- Embeddings: OpenAIEmbeddings (text-embedding-ada-002 era)
- Vector store: FAISS (CPU), pickled into a LargeBinary column per user
- Memory: ConversationBufferMemory (full history, no summarization)
- Storage: S3 for PDFs and generated JS, CloudFront for widget delivery
- PDF parsing: PyPDF2
git clone https://github.com/dparikh79/AI-PDF-Reader.git
cd AI-PDF-Reader
python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
cp .env.example .env # then fill in real values
python app.py # http://127.0.0.1:5000

# LLM
OPENAI_API_KEY=sk-...
OPENAI_API_ENDPOINT=https://api.openai.com/v1 # optional override
# Flask
FLASK_SECRET_KEY=replace-with-a-long-random-string
# Postgres
DATABASE_URL=postgresql://user:pass@host:5432/aipdfreader
# Local working dir for the embed JS template
BASE_UPLOAD_FOLDER=./uploads
ALLOWED_EXTENSIONS=pdf
# AWS (S3 + CloudFront for storage + widget delivery)
AWS_ACCESS_KEY_ID=...
AWS_SECRET_ACCESS_KEY=...
AWS_REGION=us-east-1
AWS_BUCKET_NAME=your-bucket
CLOUDFRONT_DOMAIN=https://xxxx.cloudfront.net/

There is no .env.example checked in, so the cp step above has nothing to copy; create .env by hand and treat the block above as the source of truth.
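One way to catch a missing variable at startup, rather than at the first S3 or OpenAI call, is a small loader like the sketch below. It is not code from this repo: load_settings and REQUIRED_VARS are hypothetical names, and treating every variable except OPENAI_API_ENDPOINT as required is an assumption based on the comments in the block above.

```python
import os
from typing import Dict, Mapping, Optional

# Assumption: everything not marked "optional" in the env block is required.
REQUIRED_VARS = (
    "OPENAI_API_KEY",
    "FLASK_SECRET_KEY",
    "DATABASE_URL",
    "BASE_UPLOAD_FOLDER",
    "ALLOWED_EXTENSIONS",
    "AWS_ACCESS_KEY_ID",
    "AWS_SECRET_ACCESS_KEY",
    "AWS_REGION",
    "AWS_BUCKET_NAME",
    "CLOUDFRONT_DOMAIN",
)


def load_settings(env: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Collect settings from `env` (defaults to os.environ) and fail fast on gaps."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError("Missing required environment variables: " + ", ".join(missing))
    settings = {name: env[name] for name in REQUIRED_VARS}
    # Optional override, defaulting to the public OpenAI endpoint.
    settings["OPENAI_API_ENDPOINT"] = env.get(
        "OPENAI_API_ENDPOINT", "https://api.openai.com/v1"
    )
    return settings
```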
This was a 2023 build. With hindsight:
- Drop the pickle-into-Postgres pattern. Storing FAISS indexes via pickle.dumps in a LargeBinary column is convenient but couples deserialization to the exact LangChain/FAISS version, and pickle.loads on user-scoped data is a footgun. A dedicated vector store (pgvector, Pinecone, Qdrant) is the right answer.
- Replace CharacterTextSplitter with RecursiveCharacterTextSplitter and chunk on semantic boundaries rather than raw \n. Current chunking can split mid-sentence on densely formatted PDFs.
- Swap ConversationBufferMemory for a summarizing or windowed memory. Long sessions will eventually blow past context limits.
- Pin the LangChain version explicitly and migrate to langchain-openai + langchain-community. This repo is on langchain==0.0.312, which predates the v0.1 split.
- Move PDF parsing off PyPDF2. pypdf, pdfplumber, or unstructured handle layout and tables better.
- Rate-limit the chat endpoint and add per-user spend caps. Right now a single rogue embed page could pull on someone's OpenAI key indefinitely.
- Designed for short brochures and handbooks, not large corpora. Batching sleeps 60 s between 5000-chunk batches to stay under rate limits, so very large uploads are slow.
- A single global in-memory dict means horizontal scaling needs a shared cache (Redis) and a redesigned inactivity timer.
- No SSE / streaming on the chat endpoint; the widget waits for the full response.
- The while True pin generation loop retries on collision (vanishingly unlikely) with no upper bound.
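For the rate-limiting point, a per-pin sliding-window limiter is one possible starting place. The sketch below is an illustration under assumptions, not something the repo contains: SlidingWindowLimiter is a hypothetical class, the limits are arbitrary, and the state lives in process memory, so it shares the single-process limitation noted above.

```python
import threading
import time
from collections import defaultdict, deque
from typing import Callable, Deque, Dict


class SlidingWindowLimiter:
    """Allow at most `max_requests` per key within any `window_seconds` span."""

    def __init__(
        self,
        max_requests: int,
        window_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._max = max_requests
        self._window = window_seconds
        self._clock = clock  # injectable so the limiter can be tested without sleeping
        self._lock = threading.Lock()
        self._hits: Dict[str, Deque[float]] = defaultdict(deque)

    def allow(self, key: str) -> bool:
        """Record a request for `key` and report whether it is within the limit."""
        now = self._clock()
        with self._lock:
            hits = self._hits[key]
            # Drop timestamps that have aged out of the window.
            while hits and now - hits[0] >= self._window:
                hits.popleft()
            if len(hits) >= self._max:
                return False
            hits.append(now)
            return True
```

In a chat route this would be checked with the widget's pin as the key before the chain is invoked, returning HTTP 429 when allow reports False. A spend cap would need token accounting on top, which this sketch does not attempt.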
app.py Flask routes, auth, S3 + Postgres glue, lifecycle of in-memory assistants
assistant.py VectorStore (PDF -> chunks -> FAISS) and Assistant (LangChain chain)
templates/ Jinja templates for auth pages and the admin / chatbot-creation UI
static/ embedChatbot.js template + base stylesheet
requirements.txt Pinned dependency set from late 2023
MIT. See LICENSE.