dparikh79/AI-PDF-Reader

AI PDF Reader

A multi-tenant Flask app that turns a stack of PDFs into an embeddable, retrieval-augmented chatbot. Upload your documents, generate a knowledge base, and drop a one-line <script> tag into any website to render a ChatGPT-style widget that answers questions grounded in your PDFs.

Built in late 2023 as an early exploration of the RAG pattern over the (then-new) LangChain + OpenAI + FAISS stack, with a real account system and S3-backed storage so multiple users could maintain their own private assistants.

Why I built it

I wanted to feel the full RAG loop end to end, not just the "retrieve + answer" toy version: real user accounts, per-tenant document isolation, persisted vector stores, and an embed widget so the chatbot could actually live on someone else's page. The original use case was a chatbot for a small institute's website (see the footer of templates/index.html) that could answer questions about their programs from a handful of brochures.

How it works

PDF upload  ->  S3 (uploads/{user_id}/)
                     |
                     v
              PyPDF2 text extract
                     |
                     v
        CharacterTextSplitter (1000 / 200 overlap)
                     |
                     v
          OpenAI embeddings (batched, 5000 / call)
                     |
                     v
                FAISS vector store
                     |
                     v
   pickle.dumps -> Postgres (User.assistant_data column)

Question  ->  ConversationalRetrievalChain (ChatOpenAI + ConversationBufferMemory)
                     |
                     v
                Grounded answer
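The chunking step in the diagram can be illustrated without LangChain. The sketch below is a plain sliding-window splitter, not CharacterTextSplitter itself (which first splits on a separator and then merges the pieces). The function name chunk_text is hypothetical; only the 1000 / 200 sizes come from the diagram above.

```python
def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 200) -> list[str]:
    """Split text into fixed-size windows that overlap by `overlap` characters."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap  # each window starts this far after the previous one
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks
```

With the defaults, consecutive chunks share 200 characters, so a sentence cut at one chunk boundary still appears whole in the neighbouring chunk as long as it is shorter than the overlap.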

Each registered user gets a random 16-byte pin, which is the only key the embed widget needs. The chatbot is generated as a customized embedChatbot.js (with the user's pin, name, icon, and greeting baked in), uploaded to S3, and served through CloudFront so any external page can include it with one tag.
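As a rough illustration of pin generation, here is a minimal sketch. It assumes the pin is hex-encoded and that uniqueness is checked against a set of existing pins; the README states neither, and in the app the check would presumably be a database lookup. The function name generate_pin and the max_attempts cap are hypothetical, and the cap is a deliberate variation rather than what the repo does.

```python
import secrets


def generate_pin(existing_pins: set[str], max_attempts: int = 10) -> str:
    """Return a random 16-byte pin (32 hex characters) not present in `existing_pins`."""
    for _ in range(max_attempts):
        pin = secrets.token_hex(16)  # 16 random bytes, hex-encoded
        if pin not in existing_pins:
            return pin
    # Assumption: fail loudly instead of retrying forever.
    raise RuntimeError(f"no unused pin found after {max_attempts} attempts")
```

With 128 bits of randomness a collision is not a practical concern, so the cap mostly guards against a broken uniqueness check rather than bad luck.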

In-memory assistant instances are reaped after 10 minutes of inactivity via a threading.Timer to keep RAM bounded; they rehydrate lazily from Postgres on the next chat request.
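The reap-and-rehydrate behaviour can be sketched as follows. This is an illustration under assumptions, not the code in app.py: the class name AssistantCache and the loader callable (standing in for "unpickle from Postgres") are hypothetical.

```python
import threading


class AssistantCache:
    """Keeps assistants in memory and evicts each one after a period of inactivity."""

    def __init__(self, loader, idle_seconds: float = 600.0):
        self._loader = loader          # hypothetical callable: user_id -> assistant
        self._idle_seconds = idle_seconds
        self._items = {}               # user_id -> assistant
        self._timers = {}              # user_id -> pending threading.Timer
        self._lock = threading.Lock()

    def get(self, user_id):
        with self._lock:
            if user_id not in self._items:
                # Lazy rehydrate on first use or after eviction.
                self._items[user_id] = self._loader(user_id)
            self._restart_timer(user_id)
            return self._items[user_id]

    def _restart_timer(self, user_id):
        old = self._timers.pop(user_id, None)
        if old is not None:
            old.cancel()
        # The lambda captures `timer` so _evict can tell a stale timer from the current one.
        timer = threading.Timer(self._idle_seconds, lambda: self._evict(user_id, timer))
        timer.daemon = True
        self._timers[user_id] = timer
        timer.start()

    def _evict(self, user_id, timer):
        with self._lock:
            if self._timers.get(user_id) is timer:  # ignore timers that were superseded
                self._items.pop(user_id, None)
                self._timers.pop(user_id, None)
```

Every call to get resets that user's timer, so an assistant is dropped only after idle_seconds with no chat requests. The sketch holds the lock while loading, which keeps it simple but would serialize slow loads.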

Stack

  • Backend: Flask, Flask-Login, Flask-SQLAlchemy, Postgres (psycopg2)
  • LLM: OpenAI ChatOpenAI via LangChain ConversationalRetrievalChain
  • Embeddings: OpenAIEmbeddings (text-embedding-ada-002 era)
  • Vector store: FAISS (CPU), pickled into a LargeBinary column per user
  • Memory: ConversationBufferMemory (full history, no summarization)
  • Storage: S3 for PDFs and generated JS, CloudFront for widget delivery
  • PDF parsing: PyPDF2

Quickstart

git clone https://github.com/dparikh79/AI-PDF-Reader.git
cd AI-PDF-Reader
python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
touch .env             # no .env.example is checked in; add the variables listed below
python app.py          # http://127.0.0.1:5000

Required environment variables

# LLM
OPENAI_API_KEY=sk-...
OPENAI_API_ENDPOINT=https://api.openai.com/v1   # optional override

# Flask
FLASK_SECRET_KEY=replace-with-a-long-random-string

# Postgres
DATABASE_URL=postgresql://user:pass@host:5432/aipdfreader

# Local working dir for the embed JS template
BASE_UPLOAD_FOLDER=./uploads
ALLOWED_EXTENSIONS=pdf

# AWS (S3 + CloudFront for storage + widget delivery)
AWS_ACCESS_KEY_ID=...
AWS_SECRET_ACCESS_KEY=...
AWS_REGION=us-east-1
AWS_BUCKET_NAME=your-bucket
CLOUDFRONT_DOMAIN=https://xxxx.cloudfront.net/

There is no .env.example checked in; treat the block above as the source of truth.
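Since there is no template file to copy, a startup check for missing variables can save some debugging. The sketch below is not part of the repo; the helper name missing_env_vars is hypothetical, and treating every variable except OPENAI_API_ENDPOINT as required is an assumption based on the block above.

```python
import os

# Assumption: everything in the block above is required except OPENAI_API_ENDPOINT,
# which is marked as an optional override.
REQUIRED_VARS = [
    "OPENAI_API_KEY",
    "FLASK_SECRET_KEY",
    "DATABASE_URL",
    "BASE_UPLOAD_FOLDER",
    "ALLOWED_EXTENSIONS",
    "AWS_ACCESS_KEY_ID",
    "AWS_SECRET_ACCESS_KEY",
    "AWS_REGION",
    "AWS_BUCKET_NAME",
    "CLOUDFRONT_DOMAIN",
]


def missing_env_vars(environ=os.environ) -> list[str]:
    """Return the names of required variables that are unset or empty."""
    return [name for name in REQUIRED_VARS if not environ.get(name)]


if __name__ == "__main__":
    missing = missing_env_vars()
    if missing:
        raise SystemExit("Missing environment variables: " + ", ".join(missing))
    print("All required environment variables are set.")
```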

What I would change today

This was a 2023 build. With hindsight:

  • Drop the pickle-into-Postgres pattern. Storing FAISS indexes as pickle.dumps in a LargeBinary column is convenient but couples deserialization to the exact LangChain/FAISS version, and pickle.loads is a footgun: it can execute arbitrary code if the stored bytes are ever tampered with. A dedicated vector store (pgvector, Pinecone, Qdrant) is the right answer.
  • Replace CharacterTextSplitter with RecursiveCharacterTextSplitter and chunk on semantic boundaries rather than raw \n. Current chunking can split mid-sentence on densely formatted PDFs.
  • Swap ConversationBufferMemory for a summarizing or windowed memory. Long sessions will eventually blow past context limits.
  • Pin the LangChain version explicitly and migrate to langchain-openai + langchain-community. This repo is on langchain==0.0.312, which predates the v0.1 split.
  • Move PDF parsing off PyPDF2. pypdf, pdfplumber, or unstructured handle layout and tables better.
  • Rate-limit the chat endpoint and add per-user spend caps. Right now a single rogue embed page could pull on someone's OpenAI key indefinitely.
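To make the rate-limiting item concrete, here is a minimal in-process sliding-window limiter keyed by pin. It is a sketch of one possible approach, not something the repo contains; the class name SlidingWindowLimiter and its parameters are hypothetical. It is not thread-safe as written and, like the in-memory assistant dict, it would need shared storage to work across multiple processes.

```python
import time
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most `max_requests` per `window_seconds` for each pin."""

    def __init__(self, max_requests: int, window_seconds: float, clock=time.monotonic):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._clock = clock              # injectable so tests need not wait
        self._hits = defaultdict(deque)  # pin -> timestamps of recent allowed requests

    def allow(self, pin: str) -> bool:
        now = self._clock()
        hits = self._hits[pin]
        # Drop timestamps that have fallen out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False
        hits.append(now)
        return True
```

A chat route could call allow(pin) before invoking the chain and return HTTP 429 when it is False. A spend cap would additionally need token or cost accounting per user, which this sketch does not attempt.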

Known limits

  • Designed for short brochures and handbooks, not large corpora. Batching sleeps 60 s between 5000-chunk batches to stay under rate limits, so very large uploads are slow.
  • A single global in-memory dict holds the live assistants, so horizontal scaling would need a shared cache (Redis) and a redesigned inactivity timer.
  • No SSE / streaming on the chat endpoint; the widget waits for the full response.
  • Pin generation retries in a while True loop on collision with no upper bound on attempts. A collision is vanishingly unlikely, but the loop is technically unbounded.
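The batching limit above is easier to reason about with a sketch. The function below pauses between batches the way the first bullet describes; it is an illustration under assumptions, not the code in assistant.py. The name embed_in_batches and the embed_batch callable (standing in for the real OpenAI embeddings call) are hypothetical.

```python
import time


def embed_in_batches(chunks, embed_batch, batch_size=5000, pause_seconds=60.0, sleep=time.sleep):
    """Embed `chunks` in batches, sleeping between batches to stay under rate limits.

    `embed_batch` is a hypothetical callable: list[str] -> list of vectors.
    """
    vectors = []
    for start in range(0, len(chunks), batch_size):
        if start > 0:
            sleep(pause_seconds)  # pause before every batch except the first
        vectors.extend(embed_batch(chunks[start:start + batch_size]))
    return vectors
```

The number of pauses is one less than the number of batches, so 12,000 chunks means three batches and at least 120 seconds of waiting on top of the API calls themselves.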

Repo layout

app.py            Flask routes, auth, S3 + Postgres glue, lifecycle of in-memory assistants
assistant.py      VectorStore (PDF -> chunks -> FAISS) and Assistant (LangChain chain)
templates/        Jinja templates for auth pages and the admin / chatbot-creation UI
static/           embedChatbot.js template + base stylesheet
requirements.txt  Pinned dependency set from late 2023

License

MIT. See LICENSE.
