cfg-rag-demo

Self-hosted RAG pipeline with Ollama and pgvector. No API keys, no telemetry, no SaaS dependencies. Companion code for the Self-Hosted RAG Pipeline with Ollama and pgvector guide on computingforgeeks.com.

What it does

Ingests a folder of Markdown or text files into PostgreSQL + pgvector, indexes them with HNSW (cosine) for vector retrieval and GIN (tsvector) for full-text retrieval, then answers questions using a hybrid Reciprocal Rank Fusion query and a local Ollama chat model. Source citations included.

The whole thing is one Python file (~290 lines).

Stack

Component	Purpose	Why
PostgreSQL 17 + pgvector 0.8	Vector + full-text store	ACID, backups, joins, one ops story
Ollama 0.23+	Embedding + chat models	One service, open weights, no API
`nomic-embed-text` (768 dim)	Embedding model	Fast, accurate, small (274 MB)
`llama3.1:8b`	Chat model	Fits 6 GB VRAM at Q4_K_M
Python 3.10+	Glue	`psycopg`, `pgvector`, `ollama`

Quick start

Prerequisites: PostgreSQL 17 with pgvector, Ollama running. See the guide for install instructions on Rocky Linux and Ubuntu.

# Database role + DB
sudo -u postgres psql <<'SQL'
CREATE ROLE rag LOGIN;
CREATE DATABASE rag OWNER rag;
\c rag
CREATE EXTENSION vector;
SQL

# Models
ollama pull nomic-embed-text
ollama pull llama3.1:8b

# Python
python3 -m venv .venv
. .venv/bin/activate
pip install "psycopg[binary]>=3.2" "pgvector>=0.3.6" "ollama>=0.4"

# Initialise schema and run
export RAG_DSN="postgresql://rag@/rag?host=/var/run/postgresql"
python rag.py init
python rag.py ingest path/to/docs/
python rag.py query "What does the documentation say about X?"
python rag.py stats

Commands

Command	What it does
`python rag.py init`	Create schema (documents, chunks) and HNSW + GIN indexes
`python rag.py ingest <dir>`	Recursively ingest `.md` and `.txt` from `<dir>` (idempotent via sha256)
`python rag.py query "..."`	Hybrid retrieve top-5, generate answer with citations
`python rag.py stats`	Document count, chunk count, table and index sizes

Configuration

All via environment variables:

Variable	Default	Purpose
`RAG_DSN`	`postgresql:///rag`	PostgreSQL connection string
`RAG_EMBED`	`nomic-embed-text`	Embedding model name
`RAG_EMBED_DIM`	`768`	Embedding dimension (must match the model)
`RAG_CHAT`	`llama3.1:8b`	Chat model name
`RAG_CHUNK_TOKENS`	`800`	Target tokens per chunk
`RAG_OVERLAP`	`100`	Overlap tokens between adjacent chunks
`RAG_TOP_K`	`5`	How many chunks to retrieve per query
`OLLAMA_HOST`	`http://127.0.0.1:11434`	Ollama API URL

Why hybrid retrieval

Pure cosine retrieval fails on technical documentation. Ask the demo corpus "How do I install MetalLB on Kubernetes?" using only embedding <=> query and the top 5 are all unrelated install guides — the embedding model matches the "install on platform" pattern more strongly than the rare token "metallb".

The RRF fusion of cosine top-k + tsvector top-k surfaces the right document in 50 ms. See the full explanation in the article.

License

MIT. See LICENSE.

Name		Name	Last commit message	Last commit date
Latest commit History 1 Commit
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
rag.py		rag.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

cfg-rag-demo

What it does

Stack

Quick start

Commands

Configuration

Why hybrid retrieval

License

See also

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

cfg-rag-demo

What it does

Stack

Quick start

Commands

Configuration

Why hybrid retrieval

License

See also

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages