{"id":167402,"date":"2026-05-06T21:48:19","date_gmt":"2026-05-06T18:48:19","guid":{"rendered":"https:\/\/computingforgeeks.com\/?p=167402"},"modified":"2026-05-06T21:48:20","modified_gmt":"2026-05-06T18:48:20","slug":"self-hosted-rag-ollama-pgvector","status":"publish","type":"post","link":"https:\/\/computingforgeeks.com\/self-hosted-rag-ollama-pgvector\/","title":{"rendered":"Self-Hosted RAG Pipeline with Ollama and pgvector (No API Keys)"},"content":{"rendered":"\n<p class=\"wp-block-paragraph\">This guide builds a Retrieval Augmented Generation pipeline that runs entirely on your hardware. No OpenAI key, no Anthropic key, no telemetry. PostgreSQL with pgvector stores the embeddings and the retrieval index. Ollama runs both the embedding model and the chat model. Around 250 lines of Python wires it together.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Most &#8220;self-hosted RAG&#8221; tutorials quietly call OpenAI for embeddings. That ships every chunk of every document you ingest to a third party and locks you into pricing and model choices you do not control. The pipeline here uses <code>nomic-embed-text<\/code> (768 dimensions, 274 MB) for embeddings and <code>llama3.1:8b<\/code> (Q4_K_M, 4.9 GB) for generation. Both run under Ollama. The vector store is plain Postgres 17 with the pgvector extension and a tsvector full text column for hybrid retrieval.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">What you will end up with: a small Python program that ingests a folder of Markdown or text files, embeds and stores them, and answers questions with citations to the source documents. We test it on 50 articles from this site as the corpus.<\/p>\n\n\n\n<p><strong>Tested on:<\/strong> Ubuntu 22.04 LTS \u00b7 PostgreSQL 17.9 \u00b7 pgvector 0.8.0 \u00b7 Ollama 0.23.1 \u00b7 Python 3.10 \u00b7 NVIDIA RTX 4090 (CPU inference also covered) \u00b7 Verified 2026-05.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Why self host the whole RAG stack<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">RAG with a hosted embedding API is the default in most tutorials because it is the path of least friction. The hidden costs add up quickly once you move past the toy demo:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Privacy.<\/strong> Every chunk of every document goes to a third party for embedding. Customer support tickets, internal wikis, code review history, contracts. Once it leaves your network, you have lost control of what happens next.<\/li>\n<li><strong>Cost at scale.<\/strong> A modest 50,000 document corpus at 800 tokens per chunk runs about 40 million tokens. OpenAI text-embedding-3-small at $0.02 per 1M tokens is $0.80 to ingest once. Re-embedding when you switch chunk size, or running ingestion in dev and prod, multiplies that. Then every query embeds the question too.<\/li>\n<li><strong>Lock in.<\/strong> The hosted embedding model gets deprecated on the provider&#8217;s schedule. You re-embed everything to migrate. Self hosting means you choose when to upgrade, and you can run two models side by side during the transition.<\/li>\n<li><strong>Latency.<\/strong> A round trip to the API for every retrieval adds 100 to 300 ms before any DB work. Local Ollama on the same host returns embeddings in 30 to 60 ms.<\/li>\n<\/ul>\n\n\n\n<p class=\"wp-block-paragraph\">The pipeline below has no API keys anywhere. You can airgap the host and it keeps working.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Architecture<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">RAG is four stages in a row.<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\"><code>  Documents (md\/pdf\/html)\n       \u2502\n       \u2502  chunk + embed via Ollama\n       \u25bc\n  PostgreSQL + pgvector\n   \u251c\u2500 chunks(content, embedding vector(768))\n   \u251c\u2500 HNSW index on embedding (cosine)\n   \u2514\u2500 GIN index on tsvector(content) for full text\n       \u2502\n       \u2502  hybrid retrieve: cosine top-K + tsvector top-K, fuse with RRF\n       \u25bc\n  Llama 3.1 8B via Ollama\n       \u2502  prompt = system rules + numbered context + question\n       \u25bc\n  Answer with citations [1] [2] [3]<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">The choices are deliberate:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>pgvector, not Chroma or FAISS.<\/strong> Postgres is already the operational database in many environments. pgvector lets you join embeddings to your application data in one query. Backups, replication, point in time recovery, and pg_stat_statements all work the same way they always have. A separate vector DB is a separate operational story to learn and own.<\/li>\n<li><strong>Ollama for both embed and chat.<\/strong> One service to run, one set of models to keep up to date, one auth surface, one observability target. The Python SDK is the same for both calls.<\/li>\n<li><strong>nomic-embed-text plus Llama 3.1 8B.<\/strong> The smallest pair that holds up on a real technical corpus. Both are open source. nomic-embed-text produces 768 dimensions and runs at hundreds of texts per second on a small GPU. Llama 3.1 8B at Q4 quant fits in 6 GB of VRAM. We discuss when to graduate to bigger models near the end.<\/li>\n<li><strong>Hybrid retrieval, not pure vector.<\/strong> Pure cosine similarity has a known weakness on technical corpora: it matches semantic patterns better than exact tokens like product names, flag names, or version numbers. Section 6 shows a real failure on this corpus and section 7 fixes it.<\/li>\n<\/ul>\n\n\n\n<h2 class=\"wp-block-heading\">Prerequisites<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">Three components that this guide does not install from scratch, with links to the dedicated guides:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>PostgreSQL 17.<\/strong> Follow <a href=\"https:\/\/computingforgeeks.com\/install-postgresql-17-rocky-linux-10\/\">Install PostgreSQL 17 on Rocky Linux 10<\/a> or <a href=\"https:\/\/computingforgeeks.com\/install-postgresql-17-ubuntu-2404\/\">on Ubuntu 24.04<\/a>.<\/li>\n<li><strong>pgvector 0.8.0 or newer.<\/strong> The HNSW index and the operators we rely on are in 0.5.0 and later, but 0.8.0 ships meaningful build performance improvements. See <a href=\"https:\/\/computingforgeeks.com\/install-pgvector-postgresql-linux\/\">Install pgvector on PostgreSQL<\/a>.<\/li>\n<li><strong>Ollama 0.23.1 or newer.<\/strong> Earlier versions had a different embed API surface. <a href=\"https:\/\/computingforgeeks.com\/install-ollama-rocky-ubuntu\/\">Install Ollama on Rocky Linux or Ubuntu<\/a>.<\/li>\n<\/ul>\n\n\n\n<p class=\"wp-block-paragraph\">Pull the two models you need:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>ollama pull nomic-embed-text\nollama pull llama3.1:8b<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">Verify the embed API is reachable and returns the expected dimension:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>$ curl -s http:\/\/127.0.0.1:11434\/api\/embed \\\n    -d '{\"model\":\"nomic-embed-text\",\"input\":\"hello\"}' | jq '.embeddings[0] | length'\n768<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">Create a database and role for the RAG app, and enable the vector extension:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>sudo -u postgres psql &lt;&lt;'SQL'\nCREATE ROLE rag LOGIN;\nCREATE DATABASE rag OWNER rag;\n\\c rag\nCREATE EXTENSION vector;\nSQL<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">Set up a Python virtual environment and install the four dependencies we need:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>python3 -m venv .venv\n. .venv\/bin\/activate\npip install \"psycopg[binary]>=3.2\" \"pgvector>=0.3.6\" \"ollama>=0.4\" tqdm<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">That is the whole stack. No LangChain, no LlamaIndex, no separate vector database, no API keys.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Database schema and indexes<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">Two tables. <code>documents<\/code> tracks one row per source file with a content hash for idempotent re ingestion. <code>chunks<\/code> holds one row per text fragment with its embedding and a generated tsvector column for full text search.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>CREATE TABLE IF NOT EXISTS documents (\n    id          BIGSERIAL PRIMARY KEY,\n    source      TEXT NOT NULL UNIQUE,\n    title       TEXT,\n    sha256      TEXT NOT NULL,\n    bytes       INTEGER NOT NULL,\n    ingested_at TIMESTAMPTZ DEFAULT NOW()\n);\n\nCREATE TABLE IF NOT EXISTS chunks (\n    id          BIGSERIAL PRIMARY KEY,\n    document_id BIGINT NOT NULL REFERENCES documents(id) ON DELETE CASCADE,\n    chunk_idx   INTEGER NOT NULL,\n    content     TEXT NOT NULL,\n    tokens      INTEGER NOT NULL,\n    embedding   vector(768) NOT NULL,\n    content_tsv tsvector GENERATED ALWAYS AS (to_tsvector('english', content)) STORED,\n    UNIQUE (document_id, chunk_idx)\n);\n\nCREATE INDEX IF NOT EXISTS chunks_embedding_hnsw\n    ON chunks USING hnsw (embedding vector_cosine_ops)\n    WITH (m = 16, ef_construction = 64);\n\nCREATE INDEX IF NOT EXISTS chunks_content_tsv_gin\n    ON chunks USING gin (content_tsv);<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">A few choices worth calling out:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Generated tsvector column, not a tsvector index over content directly.<\/strong> Postgres maintains <code>content_tsv<\/code> automatically on every insert and update. The GIN index sees a stable column and stays compact. The application never has to remember to keep them in sync.<\/li>\n<li><strong>HNSW with <code>m=16<\/code> and <code>ef_construction=64<\/code>.<\/strong> These are sane defaults from the pgvector docs. <code>m<\/code> is the maximum number of bidirectional links per node in the HNSW graph. Higher <code>m<\/code> improves recall but enlarges the index roughly linearly. <code>ef_construction<\/code> controls how many candidates are considered when inserting each vector. Higher gives better quality at the cost of build time.<\/li>\n<li><strong>UNIQUE on <code>(document_id, chunk_idx)<\/code>.<\/strong> Idempotent re ingestion. If a source file changes, the parent <code>documents<\/code> row is deleted and re inserted. The CASCADE drops the old chunks. Re ingestion of an unchanged file is a no op (we compare sha256 first).<\/li>\n<li><strong>Vector dimension 768.<\/strong> Matches nomic-embed-text. If you switch to a different embedding model, change this and re ingest. There is no graceful migration. Embeddings from different models are not interchangeable.<\/li>\n<\/ul>\n\n\n\n<p class=\"wp-block-paragraph\">You can confirm the schema landed correctly with <code>\\d chunks<\/code>:<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img decoding=\"async\" src=\"https:\/\/computingforgeeks.com\/wp-content\/uploads\/2026\/05\/rag-schema.png\" alt=\"psql \\\\d chunks output showing the chunks table with id, document_id, chunk_idx, content, tokens, embedding vector(768), and the content_tsv generated tsvector column, plus the four indexes including chunks_embedding_hnsw and chunks_content_tsv_gin\" title=\"\"><\/figure>\n\n\n\n<h2 class=\"wp-block-heading\">Ingestion pipeline<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">Ingestion does three things: chunk a document, embed each chunk, and insert the chunks. The whole pipeline lives in <code>rag.py<\/code>. The chunking function is the only piece that requires care.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Chunking<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">The aim is to split a document into pieces that are small enough that retrieval can be precise but large enough that each piece carries enough context to be useful in an answer. Around 800 tokens with 100 token overlap is a good default. We split paragraphs first, then sentences if a single paragraph is too big, then characters as a last resort. The overlap copies the tail of each chunk into the head of the next so an answer that straddles a chunk boundary still has continuous context.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>TOKEN_RE = re.compile(r\"\\w+|[^\\w\\s]\")\n\ndef tokens(text):\n    return len(TOKEN_RE.findall(text))\n\ndef split_text(text, target=800, overlap=100):\n    paras = re.split(r\"\\n\\s*\\n\", text)\n    out, buf, buf_tokens = [], [], 0\n    for p in paras:\n        pt = tokens(p)\n        if buf_tokens + pt &lt;= target:\n            buf.append(p); buf_tokens += pt\n            continue\n        if buf:\n            out.append(\"\\n\\n\".join(buf))\n        if pt &gt; target:\n            sents = re.split(r\"(?&lt;=[.!?])\\s+\", p)\n            sb, st = [], 0\n            for s in sents:\n                sti = tokens(s)\n                if st + sti &gt; target and sb:\n                    out.append(\" \".join(sb)); sb, st = [], 0\n                sb.append(s); st += sti\n            if sb: out.append(\" \".join(sb))\n            buf, buf_tokens = [], 0\n        else:\n            buf, buf_tokens = [p], pt\n    if buf:\n        out.append(\"\\n\\n\".join(buf))\n    if overlap and len(out) &gt; 1:\n        bridged = [out[0]]\n        for prev, cur in zip(out, out[1:]):\n            tail = \" \".join(TOKEN_RE.findall(prev)[-overlap:])\n            bridged.append(tail + \" \" + cur)\n        return bridged\n    return out<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">The token counter uses a simple regex, not a real tokenizer. For chunking purposes the count just needs to be approximately right. Off by 5 percent is fine. Off by a factor of two is a bug.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Embedding and inserting<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">Ollama exposes an <code>\/api\/embed<\/code> endpoint that the Python SDK wraps. We embed one chunk per call here for clarity. In production, batch up to 32 or 64 chunks per call and parallelize across a few workers (see the production hardening section).<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>oclient = ollama.Client(host=os.environ.get(\"OLLAMA_HOST\", \"http:\/\/127.0.0.1:11434\"))\n\ndef embed(texts):\n    out = []\n    for t in texts:\n        r = oclient.embed(model=\"nomic-embed-text\", input=t)\n        out.append(r[\"embeddings\"][0])\n    return out<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">The ingest function reads each file, computes a sha256, skips it if the database already has a row with the same source and the same hash, otherwise re inserts:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>def cmd_ingest(directory):\n    files = [p for p in Path(directory).rglob(\"*.md\") if not p.name.startswith(\".\")]\n    files += [p for p in Path(directory).rglob(\"*.txt\") if not p.name.startswith(\".\")]\n    with db() as c, c.cursor() as cur:\n        for fp in sorted(files):\n            data = fp.read_bytes()\n            sha = hashlib.sha256(data).hexdigest()\n            source = str(fp.resolve())\n            cur.execute(\"SELECT id, sha256 FROM documents WHERE source = %s\", (source,))\n            row = cur.fetchone()\n            if row and row[1] == sha:\n                continue                          # unchanged, skip\n            if row:\n                cur.execute(\"DELETE FROM documents WHERE id = %s\", (row[0],))\n            cur.execute(\n                \"INSERT INTO documents (source, title, sha256, bytes) \"\n                \"VALUES (%s, %s, %s, %s) RETURNING id\",\n                (source, fp.stem, sha, len(data)),\n            )\n            doc_id = cur.fetchone()[0]\n            text = data.decode(\"utf-8\", \"replace\").replace(\"\\x00\", \"\")\n            chs = split_text(text)\n            embs = embed(chs)\n            cur.executemany(\n                \"INSERT INTO chunks (document_id, chunk_idx, content, tokens, embedding) \"\n                \"VALUES (%s, %s, %s, %s, %s)\",\n                [(doc_id, i, ck, tokens(ck), e) for i, (ck, e) in enumerate(zip(chs, embs))],\n            )\n        c.commit()<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">The <code>register_vector<\/code> call from the <code>pgvector<\/code> Python package teaches psycopg how to send Python lists as the <code>vector<\/code> type. Without it you get type errors on the executemany call.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Run it on a corpus. We use 50 articles from this site as a test set:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>$ python rag.py ingest articles\/\nscanning articles\/: 50 files\n  [5\/50] ansible-lamp-lemp-stack.md  (5 chunks)\n  [10\/50] argocd-ingress-metallb.md  (7 chunks)\n  ...\n  [50\/50] weave-gitops-install-migration-flux.md  (5 chunks)\n\ningest done: 311 chunks, ~246147 tokens in 320.5s (1.0 chunks\/s)<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">320 seconds to embed 246k tokens of real technical writing on a CPU. The Ollama embed call is the bottleneck. On a GPU the same workload takes about 30 seconds. If you plan to run this on a GPU, our <a href=\"https:\/\/computingforgeeks.com\/best-gpu-for-llm\/\">GPU buyer guide for LLMs<\/a> helps you pick one by VRAM. The number we care about more is the per query embedding time, which we will measure in the next section.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Pure vector retrieval, and where it breaks<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">The naive retrieval is one SQL query. Embed the question, ORDER BY cosine distance, LIMIT to the top K:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>def retrieve(question, k=5):\n    qe = embed([question])[0]\n    with db() as c, c.cursor() as cur:\n        cur.execute(\n            \"\"\"\n            SELECT d.title, c.content,\n                   c.embedding &lt;=&gt; %s::vector AS distance\n            FROM chunks c JOIN documents d ON d.id = c.document_id\n            ORDER BY c.embedding &lt;=&gt; %s::vector\n            LIMIT %s\n            \"\"\",\n            (qe, qe, k),\n        )\n        return cur.fetchall()<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">The <code>&lt;=&gt;<\/code> operator is cosine distance, defined by pgvector. It returns 0 for identical vectors and 2 for opposite vectors. Lower is more similar. The HNSW index we created uses <code>vector_cosine_ops<\/code>, so this query uses the index.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">That looks complete. Run it against the corpus with a real question:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>$ python rag.py query \"How do I install MetalLB on Kubernetes?\"\n\n=== Retrieved 5 chunks in 65ms\n  [1] Install Coolify Ubuntu 2604                  cosine_dist=0.399\n  [2] Install Coolify Ubuntu 2604                  cosine_dist=0.406\n  [3] Gemini Cli Cheat Sheet                       cosine_dist=0.418\n  [4] How To Install Rethinkdb On Ubuntu Debian    cosine_dist=0.419\n  [5] Install Home Assistant Ubuntu 2604           cosine_dist=0.421<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">None of those articles are about MetalLB. The corpus contains an article literally titled &#8220;Install MetalLB on Kubernetes&#8221; and a second one called &#8220;Expose ArgoCD with MetalLB and NGINX Ingress&#8221; with seven chunks each. They both ranked below an article about Coolify. The MetalLB article does show up if you look further down the ranking, at distance 0.469 versus 0.399 for the wrong top result.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">This is not a bug in pgvector, the HNSW index, or the chunking. It is a known weakness of pure semantic search on technical documentation. nomic-embed-text learns that &#8220;How do I install X on Y&#8221; looks structurally a lot like other &#8220;install X on Y&#8221; pages. The embedding for our question lands closer to a generic &#8220;install on Ubuntu&#8221; cluster than to the cluster that contains the rare token &#8220;metallb&#8221;. Product names, command flag names, and version numbers are exactly the high information tokens that semantic embeddings tend to under weight.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">The fix is hybrid retrieval.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Hybrid retrieval with Reciprocal Rank Fusion<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">Two retrievers run in parallel. One ranks chunks by cosine distance against the question embedding. The other ranks chunks by Postgres full text search against the question text. Each produces a top K list. The two lists are fused into one final ranking using Reciprocal Rank Fusion.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">RRF is the simplest robust fusion algorithm I know. It only looks at rank, not score, so it does not need the two retrievers to produce comparable values. The formula:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>score(chunk) = sum over retrievers of  1 \/ (k + rank_in_retriever)<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">The constant <code>k<\/code> (different from the top K we retrieve) is conventionally 60. It dampens the contribution of low ranked items so a chunk that appears in the top 5 of both lists beats a chunk that appears at rank 1 of one list but not at all in the other.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">The full text retriever uses the <code>content_tsv<\/code> generated column we created earlier. Postgres handles stemming and stop word removal for English automatically through the <code>'english'<\/code> text search configuration. The <code>plainto_tsquery<\/code> function converts a free form question into a query without crashing on punctuation.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">The whole thing is one SQL query with three CTEs:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>WITH vec AS (\n    SELECT c.id,\n           ROW_NUMBER() OVER (ORDER BY c.embedding &lt;=&gt; %s::vector) AS r,\n           c.embedding &lt;=&gt; %s::vector AS distance\n    FROM chunks c\n    ORDER BY c.embedding &lt;=&gt; %s::vector\n    LIMIT 20\n),\nfts AS (\n    SELECT c.id,\n           ROW_NUMBER() OVER (\n               ORDER BY ts_rank_cd(c.content_tsv, plainto_tsquery('english', %s)) DESC\n           ) AS r\n    FROM chunks c\n    WHERE c.content_tsv @@ plainto_tsquery('english', %s)\n    ORDER BY ts_rank_cd(c.content_tsv, plainto_tsquery('english', %s)) DESC\n    LIMIT 20\n),\nfused AS (\n    SELECT id, SUM(score) AS rrf_score\n    FROM (\n        SELECT id, 1.0 \/ (60 + r) AS score FROM vec\n        UNION ALL\n        SELECT id, 1.0 \/ (60 + r) AS score FROM fts\n    ) u\n    GROUP BY id\n)\nSELECT d.title, d.source, c.content, v.distance, f.rrf_score\nFROM fused f\nJOIN chunks c ON c.id = f.id\nJOIN documents d ON d.id = c.document_id\nLEFT JOIN vec v ON v.id = f.id\nORDER BY f.rrf_score DESC\nLIMIT 5;<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">Run the same MetalLB query through hybrid retrieval:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>$ python rag.py query \"How do I install MetalLB on Kubernetes?\"\n\n=== Retrieved 5 chunks in 50ms (hybrid RRF)\n  [1] Install Coolify Ubuntu 2604       rrf=0.0164  cosine=0.399\n  [2] Install Metallb Kubernetes        rrf=0.0164  cosine=  -\n  [3] Flux Vs Argocd Multi Cluster      rrf=0.0161  cosine=  -\n  [4] Install Coolify Ubuntu 2604       rrf=0.0161  cosine=0.406\n  [5] Argocd Ingress Metallb            rrf=0.0159  cosine=  -<\/code><\/pre>\n\n\n\n<figure class=\"wp-block-image size-large\"><img decoding=\"async\" src=\"https:\/\/computingforgeeks.com\/wp-content\/uploads\/2026\/05\/rag-hybrid-query.png\" alt=\"Terminal output of the hybrid retrieval query for How do I install MetalLB on Kubernetes, showing five chunks with their RRF scores and cosine distances. The Install Metallb Kubernetes article is now at position 2.\" title=\"\"><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">The MetalLB article moves from absent in the top 5 to position 2, and the related ArgoCD with MetalLB article also surfaces. The chunks with <code>cosine = -<\/code> are ones that came in via the full text retriever only. The Coolify chunk still appears because it ranks highly in cosine, but it now competes with the lexically matched chunks instead of dominating.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Total retrieval latency stays around 50 ms on this corpus. The HNSW lookup and the GIN lookup run independently against indexes built specifically for them. The fusion happens in memory on at most 40 rows.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Hybrid retrieval should be your default for technical documentation, code, support tickets, or any corpus where exact tokens carry meaning that paraphrase loses.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Generation with citations<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">The retrieved chunks become the context for the chat model. Two prompt rules matter more than the rest: force grounding and force refusal.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>system = (\n    \"You answer technical questions strictly from the provided context. Rules:\\n\"\n    \"1. Only state facts that appear in the context. Never invent commands, \"\n    \"version numbers, file paths, or library names.\\n\"\n    \"2. Quote command lines verbatim from the context. Do not paraphrase code.\\n\"\n    \"3. After every fact you state, cite the source number in brackets like [1].\\n\"\n    \"4. If the context does not contain the answer, reply exactly: \"\n    '\"The provided context does not answer this question.\" Then stop.\\n'\n    \"5. Do not add unrelated information from other context blocks.\\n\"\n    \"6. Do not give multi-part answers covering things the user did not ask.\"\n)\n\nprompt = f\"Context:\\n{context}\\n\\nQuestion: {question}\\n\\nAnswer:\"<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">The &#8220;If the context does not contain the answer&#8221; rule is the most important one. Without it, a small model will produce confident answers entirely from its training data. The user has no way to tell which sentences came from your documents and which the model invented. With the rule, the model refuses cleanly when the corpus does not cover the question.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Run it through Ollama with low temperature. Streaming is enabled so the answer appears progressively:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>for chunk in oclient.chat(\n    model=\"llama3.1:8b\",\n    messages=[\n        {\"role\": \"system\", \"content\": system},\n        {\"role\": \"user\", \"content\": prompt},\n    ],\n    options={\"temperature\": 0.1, \"num_ctx\": 8192},\n    stream=True,\n):\n    sys.stdout.write(chunk[\"message\"][\"content\"])\n    sys.stdout.flush()<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">Temperature 0.1 keeps the model close to the context instead of drifting into creative paraphrase. <code>num_ctx<\/code> is the working context window. 8192 is large enough to fit the system prompt, five chunks of around 800 tokens each, and a few hundred tokens of answer headroom.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Demo: refusal on an off corpus question<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">The corpus has nothing about FreeBSD or RAID 6. Asking anyway:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>$ python rag.py query \"How do I configure RAID 6 on FreeBSD?\"\n\n=== Retrieved 5 chunks in 65ms (hybrid RRF)\n  [1] Install Home Assistant Ubuntu 2604        rrf=0.0164  cosine=0.369\n  [2] Install Coolify Ubuntu 2604               rrf=0.0161  cosine=0.396\n  [3] How To Install Rethinkdb On Ubuntu Debian rrf=0.0159  cosine=0.404\n  [4] Configure Static Ip Ubuntu 2604 Netplan   rrf=0.0156  cosine=0.406\n  [5] Ansible Lamp Lemp Stack                   rrf=0.0154  cosine=0.414\n\n=== Answer\nUnfortunately, none of the provided articles cover configuring\nRAID 6 on FreeBSD. They are all related to Ubuntu or other Linux\ndistributions.<\/code><\/pre>\n\n\n\n<figure class=\"wp-block-image size-large\"><img decoding=\"async\" src=\"https:\/\/computingforgeeks.com\/wp-content\/uploads\/2026\/05\/rag-refusal.png\" alt=\"Terminal output of an off-corpus query about RAID 6 on FreeBSD. Five unrelated Ubuntu install chunks are retrieved, and the model refuses with a one-line answer that the provided articles do not cover the question.\" title=\"\"><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">The retriever returns its best guesses but every chunk has a high cosine distance and zero lexical overlap with &#8220;RAID&#8221; or &#8220;FreeBSD&#8221;. The model recognises that the context does not answer the question and refuses, instead of inventing a fake FreeBSD configuration. That refusal behaviour is the most important property of a production RAG system. Without it, every off topic question becomes a plausible looking hallucination.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">When small models still confabulate<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">Strong prompts reduce hallucination but do not eliminate it on small models. Llama 3.1 8B Q4 has parameters memorised from its training set that occasionally leak into answers even when the retrieval is perfect. Ask &#8220;Which embedding model should I use with Ollama for RAG and what dimension does it produce?&#8221; against this corpus and the retriever returns four of five chunks from the Ollama Models Cheat Sheet (which discusses nomic-embed-text and 768 dimensions). The model still references &#8220;sentence-transformers\/all-MiniLM-L6-v2&#8221; and &#8220;384 dimensions&#8221; in the answer. Those strings exist in its weights, not in the context.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Three mitigations, in order of cost:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Stricter prompt language. Adding &#8220;Do not use any knowledge from outside the context&#8221; helps a little.<\/li>\n<li>Lower temperature. We are already at 0.1. There is not much further to drop.<\/li>\n<li>Bigger or better grounded model. Mistral Small 24B (16 GB VRAM at Q4) and Qwen 2.5 32B (20 GB VRAM at Q4) both grade noticeably better on grounding benchmarks than Llama 3.1 8B. If your VRAM allows, use one of them for the chat model. The retrieval and ingestion stack does not change.<\/li>\n<\/ul>\n\n\n\n<h2 class=\"wp-block-heading\">Production hardening<\/h2>\n\n\n\n<h3 class=\"wp-block-heading\">HNSW pays off at scale<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">On 311 rows, Postgres correctly chooses a sequential scan with sort over the HNSW index. The seq scan is faster on tiny tables. EXPLAIN ANALYZE confirms it:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>rag=&gt; EXPLAIN (ANALYZE, BUFFERS)\n      SELECT id, embedding &lt;=&gt; (SELECT embedding FROM chunks LIMIT 1) AS d\n      FROM chunks ORDER BY d LIMIT 5;\n   Sort Method: top-N heapsort  Memory: 25kB\n   ->  Seq Scan on chunks  (cost=0.00..14.89 rows=311 width=16)\n                           (actual time=0.076..2.102 rows=311 loops=1)\n Execution Time: 2.195 ms<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">The HNSW index becomes the right choice somewhere between 5,000 and 50,000 rows depending on dimensionality and ef_search. You can force it for benchmarking with <code>SET enable_seqscan = OFF;<\/code>. In production you do not need to tune this. Postgres picks the right plan once the table grows.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">If you do build very large indexes, set <code>maintenance_work_mem<\/code> high during the CREATE INDEX. HNSW build is memory hungry. 4 GB is reasonable for tens of millions of vectors.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Embedding throughput<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">One Ollama embed call per chunk is fine for the 50 article demo. For a 50,000 document corpus, two changes matter:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Batch via the API. Pass an array to the <code>input<\/code> field. nomic-embed-text on a single 4090 hits over 1,000 chunks per second when batched.<\/li>\n<li>Set <code>OLLAMA_NUM_PARALLEL=4<\/code> in the environment that runs Ollama and run multiple Python workers feeding it. The model stays loaded; you amortize launch and tokenizer cost across more chunks.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Backups and migrations<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\"><code>pg_dump<\/code> handles the <code>vector<\/code> type natively as of pgvector 0.4. Restore is the same. <code>pg_basebackup<\/code> and replication work without any vector-specific configuration. There is no separate vector database to back up, no extra cron job, no second restore drill.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">The only migration that requires care is changing the embedding model. nomic-embed-text 768 dim and Snowflake&#8217;s arctic-embed 1024 dim are not interchangeable. To migrate, add a second embedding column with the new dimension, dual write during ingest until you are confident, then drop the old column.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Concurrency and isolation<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>pgbouncer in transaction mode<\/strong> in front of Postgres for the RAG app, separate from any application traffic. Embedding inserts and HNSW reads are short transactions and benefit from pooling.<\/li>\n<li><strong>Bind Ollama to 127.0.0.1.<\/strong> The default in newer versions. The RAG app is the only thing that needs to call it. Do not expose port 11434 publicly.<\/li>\n<li><strong>RBAC for the rag role.<\/strong> Grant only INSERT, UPDATE, DELETE, SELECT on <code>documents<\/code> and <code>chunks<\/code>. The role does not need superuser, CREATEDB, or LOGIN to other databases.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Observability<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">Three numbers track per query: embed time, retrieve time, generate time. The retrieve query is short and benefits from <code>pg_stat_statements<\/code>. The generate time is dominated by the chat model and depends on hardware. On a single RTX 4090 expect around 80 to 120 tokens per second on Llama 3.1 8B Q4. On a 16 core CPU expect 5 to 10 tokens per second.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">The cheapest useful evaluation is to take 30 questions you know the answer to, run them through the pipeline, and grade top 1 retrieval accuracy by hand. Aim for 0.85 or higher. If you fall below, check whether your tsvector index is being used (<code>EXPLAIN<\/code> the fts CTE) and whether your chunks are too large or too small.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">When to graduate from pgvector<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">Rarely, and not before 10 million vectors. The reasons people switch are usually fixable inside Postgres: missing index (use HNSW), wrong distance metric, no full text fallback, or shared instance contention. A dedicated vector database earns its operational tax when you have multi-tenant isolation needs, hundred millions of vectors, or specific GPU acceleration in the index path. None of those apply to a normal team.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Troubleshooting and FAQ<\/h2>\n\n\n\n<h3 class=\"wp-block-heading\">Retrieval misses obvious matches<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">You almost certainly need hybrid retrieval. See section 7. Pure cosine on technical text systematically under weights product names and flag tokens.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Embedding is slow<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">Three checks. First, <code>nvidia-smi<\/code> while ingesting. If utilization stays at 0, Ollama is not using the GPU. Restart Ollama and look for &#8220;discovered GPU&#8221; in <code>journalctl -u ollama<\/code> or wherever its logs go. Second, batch the embed calls. Single requests are dominated by HTTP overhead. Third, set <code>OLLAMA_NUM_PARALLEL=4<\/code> and run multiple Python workers.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">HNSW index build runs out of memory<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">Increase <code>maintenance_work_mem<\/code> for the session that builds the index. <code>SET maintenance_work_mem = '4GB'<\/code> handles tens of millions of vectors at 768 dimensions on a server with 32 GB RAM.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Generated answers cite sources that do not exist<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">Tighten the system prompt (rule 1: only state facts in the context, rule 4: refuse). Lower the temperature to 0.1. If problems persist, switch to a stronger model. Mistral Small 24B follows context grounding much better than Llama 3.1 8B for the same VRAM footprint.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">How do I evaluate retrieval quality?<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">Manual top 1 accuracy on 30 hand picked questions is the fastest signal. <code>ragas<\/code> automates this on bigger eval sets if you have ground truth answers.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Can I use this with PDFs and HTML?<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">Yes. Add a parser layer in front of <code>cmd_ingest<\/code>. <code>pypdf<\/code> handles most PDFs, <code>beautifulsoup4<\/code> handles HTML, and <code>markitdown<\/code> handles a wide range of formats. The chunking and embedding code does not change.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Companion code is on GitHub: <a href=\"https:\/\/github.com\/c4geeks\/cfg-rag-demo\" rel=\"noopener\" target=\"_blank\" rel=\"noreferrer noopener\">github.com\/c4geeks\/cfg-rag-demo<\/a> (MIT licensed, the full <code>rag.py<\/code> from this guide plus a one-page README). See also the <a href=\"https:\/\/computingforgeeks.com\/ollama-models-cheat-sheet\/\">Ollama Models Cheat Sheet<\/a> for a deeper comparison of models you can swap in for embedding or generation, the <a href=\"https:\/\/computingforgeeks.com\/ollama-commands-cheat-sheet\/\">Ollama Commands Cheat Sheet<\/a> for the full CLI and API surface, and the <a href=\"https:\/\/computingforgeeks.com\/install-pgvector-postgresql-linux\/\">pgvector install guide<\/a> for the database side.<\/p>\n\n","protected":false},"excerpt":{"rendered":"<p>This guide builds a Retrieval Augmented Generation pipeline that runs entirely on your hardware. No OpenAI key, no Anthropic key, no telemetry. PostgreSQL with pgvector stores the embeddings and the retrieval index. Ollama runs both the embedding model and the chat model. Around 250 lines of Python wires it together. Most &#8220;self-hosted RAG&#8221; tutorials quietly &#8230; <a title=\"Self-Hosted RAG Pipeline with Ollama and pgvector (No API Keys)\" class=\"read-more\" href=\"https:\/\/computingforgeeks.com\/self-hosted-rag-ollama-pgvector\/\" aria-label=\"Read more about Self-Hosted RAG Pipeline with Ollama and pgvector (No API Keys)\">Read more<\/a><\/p>\n","protected":false},"author":3,"featured_media":167398,"comment_status":"open","ping_status":"","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[329,690],"tags":[17245,39833,39832,39835,688,39834],"cfg_series":[39813],"class_list":["post-167402","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-automation","category-dev","tag-ai","tag-llm","tag-ollama","tag-pgvector","tag-postgresql","tag-rag","cfg_series-local-llms-self-hosted-ai"],"_links":{"self":[{"href":"https:\/\/computingforgeeks.com\/wp-json\/wp\/v2\/posts\/167402","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/computingforgeeks.com\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/computingforgeeks.com\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/computingforgeeks.com\/wp-json\/wp\/v2\/users\/3"}],"replies":[{"embeddable":true,"href":"https:\/\/computingforgeeks.com\/wp-json\/wp\/v2\/comments?post=167402"}],"version-history":[{"count":1,"href":"https:\/\/computingforgeeks.com\/wp-json\/wp\/v2\/posts\/167402\/revisions"}],"predecessor-version":[{"id":167403,"href":"https:\/\/computingforgeeks.com\/wp-json\/wp\/v2\/posts\/167402\/revisions\/167403"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/computingforgeeks.com\/wp-json\/wp\/v2\/media\/167398"}],"wp:attachment":[{"href":"https:\/\/computingforgeeks.com\/wp-json\/wp\/v2\/media?parent=167402"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/computingforgeeks.com\/wp-json\/wp\/v2\/categories?post=167402"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/computingforgeeks.com\/wp-json\/wp\/v2\/tags?post=167402"},{"taxonomy":"cfg_series","embeddable":true,"href":"https:\/\/computingforgeeks.com\/wp-json\/wp\/v2\/cfg_series?post=167402"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}