Part 2 of the Semantic Caching in PostgreSQL series that’ll take you from a working demo to a production-ready system.

From Demo to Production

In Part 1, we set up pg_semantic_cache in a Docker container and demonstrated how semantic similarity matching works. In short, semantic caching stores an embedding vector alongside each query, which lets us search the cache by meaning instead of by exact query text. We demonstrated a cache hit at 99.9% similarity and a cache miss at 68% (against the configurable default threshold), and discussed why this matters for LLM-powered applications.

Now let's make it production-ready. A cache that can store and retrieve is useful, but a cache you can organize, monitor, evict, and integrate into your application is what you actually need for a production deployment. In this post, we'll cover all of that.

We'll continue using the same Docker environment from Part 1. If you need to set it up again, refer to the Dockerfile and setup instructions in that post.

Organizing with Tags

Tags let you group and manage cache entries by category; the ARRAY argument at the end of a cache_query call lists the tags associated with that entry. For example, here's our first query:

-- Cache entries with different tags
SELECT semantic_cache.cache_query(
    'What is the weather in New York?',
    '[0.45, 0.67, 0.23, 0.89, 0.12, 0.56, 0.78, 0.34]',
    '{"temperature": "42F", "conditions": "Partly cloudy"}'::jsonb,
    1800,  -- 30-minute TTL for weather data
    ARRAY['weather', 'location-ny']
);

These tags identify the query as being associated with weather and location-ny.

In our second query, the tags identify the query as being associated with devops and docker:

SELECT semantic_cache.cache_query(
    'Explain Docker networking',
    '[0.22, 0.91, 0.55, 0.13, 0.77, 0.38, 0.64, 0.09]',
    '{"answer": "Docker provides bridge, host, overlay, and macvlan
    network drivers..."}'::jsonb,
    86400,  -- 24-hour TTL for stable technical content
    ARRAY['devops', 'docker']
);

We can view the tags with the following SELECT statement:

-- View entries by tag
SELECT * FROM semantic_cache.cache_by_tag;
     tag     | entry_count | total_size | avg_access_count
-------------+-------------+------------+------------------
 postgres    |           3 | 1872 bytes |                1
 concepts    |           1 |  624 bytes |                1
 indexing    |           1 |  621 bytes |                1
 replication |           1 |  627 bytes |                1
 weather     |           1 |  589 bytes |                0
 location-ny |           1 |  589 bytes |                0
 devops      |           1 |  614 bytes |                0
 docker      |           1 |  614 bytes |                0
(8 rows)

When the underlying data changes in our backing database, we can use the tags to invalidate stale content in the cache:

-- Weather data is stale, clear it
SELECT semantic_cache.invalidate_cache(tag := 'weather');

-- Or invalidate by query pattern
SELECT semantic_cache.invalidate_cache(pattern := '%Docker%');
invalidate_cache
------------------
                1
(1 row)

Eviction Strategies

Caches need boundaries; enforcing them keeps your data set fresh and your cache at a predictable size:

-- Remove expired entries (past their TTL)
SELECT semantic_cache.evict_expired();

-- Keep only the 1000 most recently accessed entries
SELECT semantic_cache.evict_lru(1000);

-- Keep only the 1000 most frequently accessed entries
SELECT semantic_cache.evict_lfu(1000);

-- Or let the extension decide based on your configured policy
SELECT semantic_cache.auto_evict();

-- Nuclear option: clear everything
SELECT semantic_cache.clear_cache();

pg_semantic_cache provides eviction strategies you can use to set those boundaries for your cache:

  • semantic_cache.evict_expired() removes expired data.

  • semantic_cache.evict_lru() removes the least recently used data.

  • semantic_cache.evict_lfu() removes the least frequently used data.

  • semantic_cache.auto_evict() removes data based on a configured policy.

  • semantic_cache.clear_cache() clears the complete cache.
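To see how LRU and LFU differ, here is a toy in-memory model of the two policies. This is our own illustration of the concepts, not the extension's implementation:

```python
from dataclasses import dataclass

@dataclass
class Entry:
    query: str
    last_accessed: float  # timestamp of the most recent read
    access_count: int     # total number of reads

# A toy cache of five entries with different usage patterns
entries = [
    Entry("q1", last_accessed=100.0, access_count=9),
    Entry("q2", last_accessed=500.0, access_count=1),
    Entry("q3", last_accessed=300.0, access_count=4),
    Entry("q4", last_accessed=50.0,  access_count=7),
    Entry("q5", last_accessed=400.0, access_count=2),
]

def keep_lru(entries, keep):
    """Keep the `keep` most recently accessed entries (the evict_lru policy)."""
    survivors = sorted(entries, key=lambda e: e.last_accessed, reverse=True)[:keep]
    return {e.query for e in survivors}

def keep_lfu(entries, keep):
    """Keep the `keep` most frequently accessed entries (the evict_lfu policy)."""
    survivors = sorted(entries, key=lambda e: e.access_count, reverse=True)[:keep]
    return {e.query for e in survivors}

# Recency keeps q2, q3, q5; frequency keeps q1, q3, q4 -- different survivors
print(keep_lru(entries, 3))
print(keep_lfu(entries, 3))
```

Notice that the two policies disagree: a once-popular entry that has gone cold survives LFU but not LRU. If your traffic is bursty, LRU usually tracks it better; if a stable core of questions dominates, LFU protects them.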

For easy maintenance in a production environment, you can schedule automatic cache cleanup with pg_cron:

-- Evict expired entries every 15 minutes
SELECT cron.schedule('cache-eviction', '*/15 * * * *',
    $$SELECT semantic_cache.evict_expired()$$);

-- Run full auto-eviction daily at 3 AM
SELECT cron.schedule('cache-maintenance', '0 3 * * *',
    $$SELECT semantic_cache.auto_evict()$$);

Monitoring

The extension provides built-in views for observability. The semantic_cache.cache_health view gives an at-a-glance summary of entry counts, cache size, and hit rate:

-- Overall health
SELECT * FROM semantic_cache.cache_health;
total_entries | expired_entries | total_size | avg_access_count | total_hits | total_misses | hit_rate_pct
--------------+-----------------+------------+------------------+------------+--------------+--------------
            3 |               0 | 1872 bytes |         1.000000 |          1 |            1 |        50.00
(1 row)

The semantic_cache.recent_cache_activity view provides insight into the queries that are coming in for your cache to resolve:

-- Recent activity
SELECT * FROM semantic_cache.recent_cache_activity;
id |                    query_preview                       | access_count |         created_at         |       last_accessed_at       | result_size
----+--------------------------------------------------------+--------------+----------------------------+------------------------------+-------------
  1 | How does PostgreSQL handle transactions?               |            2 | 2025-05-10 14:23:12.441203 | 2025-05-10 14:23:20.112038   | 624 bytes
  3 | How do I set up replication in PostgreSQL?             |            1 | 2025-05-10 14:23:12.441203 | 2025-05-10 14:23:18.551024   | 627 bytes
  2 | What types of indexes does PostgreSQL support?         |            1 | 2025-05-10 14:23:12.441203 | 2025-05-10 14:23:16.337201   | 621 bytes
(3 rows)

The semantic_cache.cache_stats() function returns hit-rate statistics for your cache:

-- Hit rate and statistics
SELECT * FROM semantic_cache.cache_stats();
total_entries | total_hits | total_misses | hit_rate_percent
--------------+------------+--------------+------------------
            3 |          1 |            1 |               50
(1 row)

To track costs, log each cache access along with the API cost it incurred (on a miss) or avoided (on a hit):

-- Log a cache hit (no cost incurred)
SELECT semantic_cache.log_cache_access(
    md5('How does PostgreSQL handle transactions?'),
    true,    -- cache_hit
    0.9991,  -- similarity_score
    0.03     -- the cost that would have been incurred
);

-- Log a cache miss (cost was incurred)
SELECT semantic_cache.log_cache_access(
    md5('Completely new question about Kubernetes'),
    false,   -- cache_hit = false: this lookup was a miss
    0.68,    -- closest similarity score
    0.03     -- cost incurred
);
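If your application layer drives the cache, you might wrap this call in a small helper so every lookup gets logged consistently. The sketch below assumes a psycopg2-style (DB-API) connection; the record_access name and signature are ours, and only semantic_cache.log_cache_access is the extension's:

```python
import hashlib

def record_access(conn, query: str, hit: bool, similarity: float, cost: float) -> None:
    """Log one cache lookup so the cost-savings views have data to report.

    `conn` is any DB-API connection (e.g. psycopg2). The helper name and
    signature are ours; the SQL function is the extension's.
    """
    # The extension keys access logs by an md5 hash of the query text
    query_hash = hashlib.md5(query.encode()).hexdigest()
    with conn.cursor() as cur:
        cur.execute(
            "SELECT semantic_cache.log_cache_access(%s, %s, %s, %s)",
            (query_hash, hit, similarity, cost),
        )
```

Calling this from both the hit and miss branches of your lookup code keeps the cost-savings reports below accurate without scattering raw SQL around.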

The semantic_cache.get_cost_savings() function estimates the cost savings over a given number of days:

-- View cost savings report for the last 7 days
SELECT * FROM semantic_cache.get_cost_savings(7);
total_queries | cache_hits | cache_misses | hit_rate | total_cost_saved | avg_cost_per_hit | total_cost_if_no_cache
--------------+------------+--------------+----------+------------------+------------------+------------------------
            2 |          1 |            1 |     0.50 |         0.030000 |         0.030000 |               0.060000
(1 row)

-- Daily breakdown
SELECT * FROM semantic_cache.cost_savings_daily;
date    | total_queries | cache_hits | cache_misses | hit_rate_pct | total_cost_saved | avg_cost_per_hit
------------+---------------+------------+--------------+--------------+------------------+------------------
 2025-05-10 |             2 |          1 |            1 |        50.00 |         0.030000 |         0.030000
(1 row)
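The report's headline numbers are straightforward arithmetic, which makes them easy to sanity-check. This sketch reproduces them for the two-query example above:

```python
def cost_savings(cache_hits: int, cache_misses: int, avg_cost_per_hit: float) -> dict:
    """Reproduce the headline numbers from get_cost_savings() by hand."""
    total_queries = cache_hits + cache_misses
    return {
        # Fraction of lookups served from cache
        "hit_rate": cache_hits / total_queries,
        # Each hit avoids one LLM call at the average cost
        "total_cost_saved": cache_hits * avg_cost_per_hit,
        # What the same traffic would cost with no cache at all
        "total_cost_if_no_cache": total_queries * avg_cost_per_hit,
    }

report = cost_savings(cache_hits=1, cache_misses=1, avg_cost_per_hit=0.03)
print(report)
```

With one hit and one miss at $0.03 per call, that yields a 0.50 hit rate, $0.03 saved, and $0.06 as the no-cache cost, matching the report above.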

Configuring Vector Dimensions

The extension works with any embedding model. Each dimension of an embedding vector (for example, [0.45, 0.67, 0.23, 0.89, 0.12, 0.56, 0.78, 0.34]) contributes a small part of the semantic meaning, and together the dimensions position text by similarity rather than by keywords. Configure the vector dimension to match the number of components your embedding model produces; the similarity threshold is a separate setting that controls how close two vectors must be to count as a match:

-- OpenAI text-embedding-3-small / ada-002 (default)
SELECT semantic_cache.set_vector_dimension(1536);

-- Ollama nomic-embed-text or BERT-based models
SELECT semantic_cache.set_vector_dimension(768);

-- Ollama mxbai-embed-large
SELECT semantic_cache.set_vector_dimension(1024);

-- Apply the change (clears cache, alters column, rebuilds index)
SELECT semantic_cache.rebuild_index();

-- Verify configuration
SELECT semantic_cache.get_vector_dimension();
SELECT semantic_cache.get_index_type();

Note: pgvector limits IVFFlat and HNSW indexes to a maximum of 2,000 dimensions. Models that produce higher-dimensional embeddings (e.g., OpenAI text-embedding-3-large at 3,072 dimensions) will work for storage and retrieval via sequential scan, but cannot use a vector index. For large caches with high-dimensional embeddings, consider using OpenAI's built-in dimension reduction (e.g., dimensions=1536) to stay within the indexable range.

If your cache has more than 100,000 entries, switch to the HNSW index for better performance:

SELECT semantic_cache.set_index_type('hnsw');

-- Apply the index type change
SELECT semantic_cache.rebuild_index();

Integration Pattern: Python Example

The following example demonstrates how pg_semantic_cache fits into a typical Python application:

import psycopg2
import json
from openai import OpenAI

client = OpenAI()
conn = psycopg2.connect("dbname=semcache_demo user=postgres")
conn.autocommit = True

def get_embedding(text: str) -> list[float]:
    """Generate embedding using OpenAI."""
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=text
    )
    return response.data[0].embedding

def ask(question: str) -> str:
    """Ask a question with semantic caching."""
    embedding = get_embedding(question)
    embedding_str = '[' + ','.join(str(x) for x in embedding) + ']'

    # Check the cache (typically 2-3ms)
    with conn.cursor() as cur:
        cur.execute(
            "SELECT * FROM semantic_cache.get_cached_result(%s, 0.95)",
            (embedding_str,)
        )
        row = cur.fetchone()

        if row and row[0]:  # found == True
            print(f"Cache HIT (similarity: {row[2]:.4f})")
            return row[1]["answer"]

    # Cache miss - call the LLM (typically 500ms-2s)
    print("Cache MISS - calling LLM...")
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": question}]
    )
    answer = response.choices[0].message.content

    # Store in cache for future similar queries
    result_json = json.dumps({"answer": answer})
    with conn.cursor() as cur:
        cur.execute(
            "SELECT semantic_cache.cache_query(%s, %s, %s::jsonb, 3600, %s)",
            (question, embedding_str, result_json, ['llm'])
        )

    return answer

# First call: cache miss, full LLM round trip
print(ask("How does PostgreSQL handle transactions?"))

# Second call: different words, same meaning - cache hit!
print(ask("Explain transaction management in Postgres"))

# Third call: still a hit
print(ask("Tell me about MVCC and transactions in PG"))

The first call to our application takes 1-2 seconds (embedding + LLM). The second and third return in under 5ms. Same answer, a fraction of the time, zero additional API cost.
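The embedding_str formatting in the example above is worth pulling into a small utility if you build vector literals in more than one place. It follows pgvector's text input format: a bracketed, comma-separated list of floats:

```python
def to_vector_literal(embedding: list[float]) -> str:
    """Format a Python list of floats as a pgvector text literal."""
    return '[' + ','.join(str(x) for x in embedding) + ']'

print(to_vector_literal([0.45, 0.67, 0.23]))  # [0.45,0.67,0.23]
```

Pass the result as an ordinary string parameter, as the ask() function does; pgvector parses it on the server side.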

Why a PostgreSQL Extension?

You could build semantic caching as a standalone service, as a Redis layer, or as application-level code. We built it as a PostgreSQL extension for practical reasons:

  • Zero new infrastructure. If you already run PostgreSQL (and you probably do), pg_semantic_cache is a CREATE EXTENSION away. No new services to deploy, monitor, or page on at 3 AM.

  • ACID compliance. Cache operations participate in PostgreSQL transactions. No split-brain scenarios between your cache and your database.

  • Inherited operations. Backup, replication, authentication, monitoring - your cache gets this functionality for free from your existing PostgreSQL setup.

  • Language-agnostic. Anything that speaks SQL can use the cache, including Python, Node.js, Go, Java, and Ruby. No client SDK required.

Tuning for Your Workload

Similarity Threshold

The threshold balances the tradeoff between hit rate and accuracy:

  • 0.98+: Very conservative. Only near-identical rephrasings match. Low hit rate but zero risk of serving wrong answers.

  • 0.95 (default): Good balance for most applications. Catches obvious rephrasings while maintaining accuracy.

  • 0.90-0.93: Aggressive. Higher hit rate but increased risk of matching queries with subtly different intent.

As a rule, start with the default of 0.95 and adjust based on the hits and misses you observe. The extension reports the closest similarity score even on misses, so you can see exactly where to move the threshold.
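To make those numbers concrete, here is cosine similarity computed by hand on toy 3-dimensional vectors. Real embeddings have hundreds or thousands of dimensions, and pgvector performs this comparison for you; this sketch only illustrates what the threshold is being compared against:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity: 1.0 for identical directions, near 0 for unrelated ones."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

query = [0.9, 0.1, 0.4]
rephrase = [0.88, 0.12, 0.41]  # same meaning, slightly different wording
unrelated = [0.1, 0.9, 0.2]    # different topic entirely

THRESHOLD = 0.95
print(cosine_similarity(query, rephrase) >= THRESHOLD)   # hit: similarity ~0.9996
print(cosine_similarity(query, unrelated) >= THRESHOLD)  # miss: similarity ~0.28
```

A rephrasing lands within a fraction of a percent of the original vector, while an unrelated question falls far below any sensible threshold, which is why the 0.90-0.98 band is where all the tuning happens.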

TTL Strategy

Time-to-live (TTL) controls how long an entry stays cached. Set it based on how frequently the underlying data changes:

-- Real-time data (weather, stock prices): short TTL
SELECT semantic_cache.cache_query('...', '...', '...'::jsonb, 300);    -- 5 min

-- Daily reports: medium TTL
SELECT semantic_cache.cache_query('...', '...', '...'::jsonb, 86400);  -- 24 hours

-- Stable reference data: long TTL
SELECT semantic_cache.cache_query('...', '...', '...'::jsonb, 604800); -- 7 days

The fourth argument to semantic_cache.cache_query (the last one in these examples) is the TTL: the number of seconds the entry stays in the cache.
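One way to keep TTL policy consistent across a codebase is a single lookup table. The category names and default below are our own convention, mirroring the three tiers above:

```python
# TTLs in seconds, keyed by content category (names are ours, not the extension's)
TTL_POLICY = {
    "realtime": 300,      # weather, stock prices: 5 minutes
    "daily": 86400,       # daily reports: 24 hours
    "reference": 604800,  # stable reference data: 7 days
}

def ttl_for(category: str) -> int:
    """Return the TTL for a category, defaulting to one hour for anything unlisted."""
    return TTL_POLICY.get(category, 3600)

print(ttl_for("realtime"))  # 300
print(ttl_for("unknown"))   # 3600
```

Your application then passes ttl_for(category) as the TTL argument to cache_query, so changing a tier's freshness policy is a one-line edit.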

PostgreSQL Settings

For production workloads with large caches, tune these postgresql.conf settings. The values below are sized for a 16GB server. Scale the values proportionally for your environment (shared_buffers ~25% of RAM, effective_cache_size ~75% of RAM):

-- More shared buffers for frequently accessed entries (~25% of RAM)
ALTER SYSTEM SET shared_buffers = '4GB';

-- Help the query planner estimate costs accurately (~75% of RAM)
ALTER SYSTEM SET effective_cache_size = '12GB';

-- Moderate working memory — applies per-operation, not per-connection
-- For vector-heavy queries, use SET LOCAL work_mem = '256MB' in the session
ALTER SYSTEM SET work_mem = '64MB';

SELECT pg_reload_conf();

Cleaning Up

When you're done experimenting with your container environment, you can use the following commands to clean up:

docker stop semcache
docker rm semcache
docker rmi pg-semantic-cache

What's Next

pg_semantic_cache is open source and distributed under the PostgreSQL license.

Repository: github.com/pgedge/pg_semantic_cache

The repo includes:

  • Interactive demos with Ollama (local, free) and OpenAI

  • Docker Compose setup for quick evaluation

  • Benchmark scripts for measuring performance in your environment

  • Full API documentation for all 18+ functions

If you're running LLM-powered features in production, semantic caching is one of the highest-leverage optimizations you can make. A 60-80% hit rate means 60-80% fewer API calls, 60-80% lower costs, and sub-5ms response times for the majority of your traffic. And it takes about 10 minutes to set up.

Your database already has the answer. Let it remember.