Self-hosted RAG pipeline with Ollama and pgvector. No API keys, no telemetry, no SaaS dependencies. Companion code for the Self-Hosted RAG Pipeline with Ollama and pgvector guide on computingforgeeks.com.
Ingests a folder of Markdown or text files into PostgreSQL + pgvector, indexes them with HNSW (cosine) for vector retrieval and GIN (tsvector) for full-text retrieval, then answers questions using a hybrid Reciprocal Rank Fusion query and a local Ollama chat model. Source citations included.
The whole thing is one Python file (~290 lines).
| Component | Purpose | Why |
|---|---|---|
| PostgreSQL 17 + pgvector 0.8 | Vector + full-text store | ACID, backups, joins, one ops story |
| Ollama 0.23+ | Embedding + chat models | One service, open weights, no API |
nomic-embed-text (768 dim) |
Embedding model | Fast, accurate, small (274 MB) |
llama3.1:8b |
Chat model | Fits 6 GB VRAM at Q4_K_M |
| Python 3.10+ | Glue | psycopg, pgvector, ollama |
Prerequisites: PostgreSQL 17 with pgvector, Ollama running. See the guide for install instructions on Rocky Linux and Ubuntu.
# Database role + DB
sudo -u postgres psql <<'SQL'
CREATE ROLE rag LOGIN;
CREATE DATABASE rag OWNER rag;
\c rag
CREATE EXTENSION vector;
SQL
# Models
ollama pull nomic-embed-text
ollama pull llama3.1:8b
# Python
python3 -m venv .venv
. .venv/bin/activate
pip install "psycopg[binary]>=3.2" "pgvector>=0.3.6" "ollama>=0.4"
# Initialise schema and run
export RAG_DSN="postgresql://rag@/rag?host=/var/run/postgresql"
python rag.py init
python rag.py ingest path/to/docs/
python rag.py query "What does the documentation say about X?"
python rag.py stats| Command | What it does |
|---|---|
python rag.py init |
Create schema (documents, chunks) and HNSW + GIN indexes |
python rag.py ingest <dir> |
Recursively ingest *.md and *.txt from <dir> (idempotent via sha256) |
python rag.py query "..." |
Hybrid retrieve top-5, generate answer with citations |
python rag.py stats |
Document count, chunk count, table and index sizes |
All via environment variables:
| Variable | Default | Purpose |
|---|---|---|
RAG_DSN |
postgresql:///rag |
PostgreSQL connection string |
RAG_EMBED |
nomic-embed-text |
Embedding model name |
RAG_EMBED_DIM |
768 |
Embedding dimension (must match the model) |
RAG_CHAT |
llama3.1:8b |
Chat model name |
RAG_CHUNK_TOKENS |
800 |
Target tokens per chunk |
RAG_OVERLAP |
100 |
Overlap tokens between adjacent chunks |
RAG_TOP_K |
5 |
How many chunks to retrieve per query |
OLLAMA_HOST |
http://127.0.0.1:11434 |
Ollama API URL |
Pure cosine retrieval fails on technical documentation. Ask the demo corpus
"How do I install MetalLB on Kubernetes?" using only embedding <=> query
and the top 5 are all unrelated install guides — the embedding model matches
the "install on platform" pattern more strongly than the rare token "metallb".
The RRF fusion of cosine top-k + tsvector top-k surfaces the right document in 50 ms. See the full explanation in the article.
MIT. See LICENSE.
- Ollama Models Cheat Sheet — picking the right embedding and chat models
- Ollama Commands Cheat Sheet — full CLI and API reference
- Install pgvector on PostgreSQL — database side
- Install Ollama on Rocky Linux or Ubuntu