🪷 स्मृति · `smriti`

Semantic and exact caching for LLM calls — cut API costs by catching duplicate and similar prompts. Zero dependencies, pure Python.

Part of the Vedic Arsenal — 100 production-grade Python libraries for LLM agents.

Why `smriti` Exists

Every LLM application pays twice for the same question. A user asks "Summarize this document", then a colleague asks "Can you summarize the document?" — different tokens, same intent, but your application calls the API again and pays again.

`smriti` solves this with a two-layer cache: an exact hash cache catches byte-for-byte duplicates instantly, and a semantic TF-IDF cache catches differently worded prompts that mean the same thing. Both layers use only the Python standard library (no numpy, no sentence-transformers, no Redis), yet the semantic layer is intended to be precise enough for production workloads, with its strictness controlled by a single similarity threshold.

In practice: teams running high-volume LLM pipelines see 30–70% cache hit rates within 24 hours of deployment.

Installation

pip install smriti

Or from source:

git clone https://github.com/darshjme/smriti.git
cd smriti && pip install -e .

Quick Start

# The distribution installs as `smriti`; the import package is `agent_cache`.
from agent_cache import ExactCache, SemanticCache, CacheLayer

# Build the two-layer cache
exact    = ExactCache(max_size=1000, ttl_seconds=3600)
semantic = SemanticCache(threshold=0.85, max_size=500, ttl_seconds=7200)
cache    = CacheLayer(exact, semantic)

# Store a response
cache.set("What is Python?", "Python is a high-level programming language.")

# Exact hit
result = cache.get("What is Python?")
print(result.cache_type)   # "exact"
print(result.response)     # "Python is a high-level programming language."

# Semantic hit — different wording, same meaning
result = cache.get("Can you explain what Python is?")
print(result.cache_type)   # "semantic"
print(result.similarity)   # 0.91

# Cache miss → returns None
result = cache.get("How do I reverse a list in Rust?")
print(result)              # None

Decorator Pattern

from agent_cache import ExactCache, SemanticCache, CacheLayer

exact    = ExactCache(max_size=5000, ttl_seconds=1800)
semantic = SemanticCache(threshold=0.88, max_size=2000, ttl_seconds=3600)
cache    = CacheLayer(exact, semantic)

@cache.wrap
def call_llm(prompt: str) -> str:
    # Runs only on a cache miss.
    # `openai_client` is a placeholder for your own LLM client object.
    return openai_client.chat(prompt)

# First call: cache miss, LLM invoked
response = call_llm("Explain gradient descent")

# Second call (same prompt): exact cache hit, LLM NOT invoked
response = call_llm("Explain gradient descent")

# Third call (similar prompt): semantic cache hit, LLM NOT invoked
response = call_llm("What is gradient descent and how does it work?")

API Reference

`ExactCache`

class ExactCache:
    """SHA-256 fingerprint cache with TTL and LRU eviction.

    Args:
        max_size:     Maximum entries (LRU eviction when exceeded). Default: 1000.
        ttl_seconds:  Per-entry TTL. 0 = no expiry. Default: 3600.
    """

    def get(self, prompt: str) -> Optional[str]:
        """Return cached response, or None on miss/expiry."""

    def set(self, prompt: str, response: str) -> None:
        """Store response. Evicts LRU entry if at capacity."""

    def stats(self) -> dict:
        """Return {'hits': int, 'misses': int, 'evictions': int, 'size': int}."""

    def clear(self) -> None:
        """Evict all entries."""

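The docstring above names three mechanisms: a SHA-256 fingerprint, a per-entry TTL, and LRU eviction. The sketch below shows one way these can fit together using only the standard library. `TinyExactCache` and its injectable `now` clock are hypothetical names chosen for illustration; this is not smriti's actual implementation.

```python
import hashlib
import time
from collections import OrderedDict
from typing import Callable, Optional


class TinyExactCache:
    """Illustrative SHA-256 + TTL + LRU cache (not smriti's real code)."""

    def __init__(self, max_size: int = 1000, ttl_seconds: float = 3600,
                 now: Callable[[], float] = time.monotonic) -> None:
        self.max_size = max_size
        self.ttl_seconds = ttl_seconds
        self._now = now                 # injectable clock, handy for tests
        self._entries = OrderedDict()   # fingerprint -> (response, stored_at)

    @staticmethod
    def _key(prompt: str) -> str:
        # Byte-for-byte identical prompts map to the same fingerprint.
        return hashlib.sha256(prompt.encode("utf-8")).hexdigest()

    def get(self, prompt: str) -> Optional[str]:
        key = self._key(prompt)
        entry = self._entries.get(key)
        if entry is None:
            return None
        response, stored_at = entry
        # ttl_seconds == 0 means "never expire", as in the docstring above.
        if self.ttl_seconds and self._now() - stored_at > self.ttl_seconds:
            del self._entries[key]
            return None
        self._entries.move_to_end(key)  # mark as most recently used
        return response

    def set(self, prompt: str, response: str) -> None:
        key = self._key(prompt)
        self._entries[key] = (response, self._now())
        self._entries.move_to_end(key)
        while len(self._entries) > self.max_size:
            self._entries.popitem(last=False)  # evict least recently used
```

An `OrderedDict` keeps insertion order and can move a key to the end in constant time, which is why it is a common stdlib choice for small LRU structures. The sketch uses `time.monotonic` so that system clock adjustments cannot expire or revive entries.
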
`SemanticCache`

class SemanticCache:
    """TF-IDF cosine-similarity cache (pure stdlib, zero deps).

    Matches semantically similar prompts above a cosine similarity threshold.
    Uses LRU eviction and an incrementally-updated IDF corpus.

    Args:
        threshold:    Minimum cosine similarity for a cache hit. Default: 0.85.
        max_size:     Maximum entries (LRU eviction). Default: 500.
        ttl_seconds:  Per-entry TTL. 0 = no expiry. Default: 7200.
    """

    def get(self, prompt: str) -> Optional[tuple[str, float]]:
        """Return (response, similarity_score) for the best match, or None."""

    def set(self, prompt: str, response: str) -> None:
        """Store response and update IDF corpus."""

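To make "TF-IDF cosine similarity" with an "incrementally-updated IDF corpus" concrete, here is a minimal stdlib-only sketch. The tokenizer, the smoothed IDF formula, and the names `TinyTfidf`, `tokenize`, and `cosine` are all assumptions made for illustration; smriti's own weighting may differ, so scores from this sketch will not reproduce the library's values.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    # Lowercase word tokens; a real implementation may also drop stop words.
    return re.findall(r"[a-z0-9]+", text.lower())


class TinyTfidf:
    """Illustrative TF-IDF weighting over a corpus that grows one prompt at a time."""

    def __init__(self) -> None:
        self.doc_count = 0
        self.doc_freq: Counter = Counter()  # term -> number of prompts containing it

    def add(self, text: str) -> None:
        # Update document frequencies as each new prompt is stored.
        self.doc_count += 1
        self.doc_freq.update(set(tokenize(text)))

    def idf(self, term: str) -> float:
        # Smoothed so terms never seen before still get a finite weight.
        return math.log((1 + self.doc_count) / (1 + self.doc_freq[term])) + 1.0

    def vector(self, text: str) -> dict[str, float]:
        counts = Counter(tokenize(text))
        total = sum(counts.values()) or 1
        return {term: (n / total) * self.idf(term) for term, n in counts.items()}


def cosine(a: dict[str, float], b: dict[str, float]) -> float:
    # Dot product over shared terms, divided by the two vector lengths.
    dot = sum(w * b.get(term, 0.0) for term, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0
```

A lookup would compute `cosine` between the incoming prompt's vector and each stored prompt's vector, then return the best match only if its score reaches `threshold`. Because the weights depend on the corpus seen so far, the same pair of prompts can score differently as more entries are added.
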
`CacheLayer`

class CacheLayer:
    """Unified two-layer cache: exact first, semantic second.

    Args:
        exact:    Configured ExactCache instance.
        semantic: Configured SemanticCache instance.
    """

    def get(self, prompt: str) -> Optional[CacheResult]:
        """Check exact cache, then semantic. Returns CacheResult or None."""

    def set(self, prompt: str, response: str) -> None:
        """Write to both layers simultaneously."""

    def wrap(self, func: Callable) -> Callable:
        """Decorator: cache the return value of func(prompt) transparently."""

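The `wrap` decorator follows a familiar check-then-call pattern. The sketch below shows that pattern on a hypothetical exact-match stand-in named `DictCache`; it illustrates the control flow only and is not `CacheLayer`'s real code, which also consults the semantic layer.

```python
import functools
from typing import Callable, Dict, Optional


class DictCache:
    """Hypothetical stand-in exposing get/set/wrap; exact-match only."""

    def __init__(self) -> None:
        self._store: Dict[str, str] = {}

    def get(self, prompt: str) -> Optional[str]:
        return self._store.get(prompt)

    def set(self, prompt: str, response: str) -> None:
        self._store[prompt] = response

    def wrap(self, func: Callable[[str], str]) -> Callable[[str], str]:
        @functools.wraps(func)  # keep the wrapped function's name and docstring
        def wrapper(prompt: str) -> str:
            cached = self.get(prompt)
            if cached is not None:
                return cached          # hit: func is never called
            response = func(prompt)    # miss: pay for the call once
            self.set(prompt, response)
            return response
        return wrapper


calls = []
demo_cache = DictCache()


@demo_cache.wrap
def fake_llm(prompt: str) -> str:
    calls.append(prompt)               # record every real invocation
    return f"answer to: {prompt}"


fake_llm("Explain gradient descent")
fake_llm("Explain gradient descent")
print(len(calls))  # 1
```

Comparing against `None` rather than relying on truthiness means an empty-string response still counts as a hit.
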
`CacheResult`

@dataclass
class CacheResult:
    response:    str            # The cached LLM response
    cache_type:  str            # "exact" or "semantic"
    similarity:  Optional[float]  # Cosine similarity (None for exact hits)
    age_seconds: float          # Approximate age of the cached entry

Real-World Example

High-volume document QA pipeline cutting OpenAI costs by ~55%:

from agent_cache import ExactCache, SemanticCache, CacheLayer

# Tune for production: larger caches, longer TTLs
exact    = ExactCache(max_size=10_000, ttl_seconds=86_400)   # 24h exact cache
semantic = SemanticCache(
    threshold=0.87,       # cosine similarity >= 0.87 counts as a hit
    max_size=5_000,
    ttl_seconds=43_200,   # 12h semantic cache
)
cache = CacheLayer(exact, semantic)

def answer_question(doc_id: str, question: str) -> str:
    cache_key = f"{doc_id}::{question}"
    cached = cache.get(cache_key)
    if cached:
        print(f"  [{cached.cache_type} hit, {cached.similarity or 1.0:.2f}] skipping LLM")
        return cached.response

    # LLM call, made only on a cache miss.
    # `call_expensive_llm` is a placeholder for your own client function.
    response = call_expensive_llm(question)
    cache.set(cache_key, response)
    return response

# Simulate 10 queries: 4 new questions, 2 exact repeats, 4 paraphrases
queries = [
    "What are the key risks?",
    "Summarize the executive section.",
    "What are the main risks identified?",     # semantic hit
    "List the financial highlights.",
    "What are the key risks?",                 # exact hit
    "Give me the executive summary.",          # semantic hit
    "What is the revenue forecast?",
    "What revenue is projected?",              # semantic hit
    "List the identified risk factors.",       # semantic hit
    "What is the revenue forecast?",           # exact hit
]

hits = misses = 0
for q in queries:
    r = cache.get(f"doc-001::{q}")
    if r:
        hits += 1
    else:
        cache.set(f"doc-001::{q}", f"[LLM response for: {q}]")
        misses += 1

print(f"\nCache stats: {hits} hits / {misses} misses ({hits/len(queries)*100:.0f}% hit rate)")
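# Example output: Cache stats: 5 hits / 5 misses (50% hit rate), on a cold cache.
# The count varies with the threshold: 6 hits if every annotated paraphrase clears 0.87.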

Monitoring Cache Performance

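# `_exact` is a private attribute and may change between releases; if you kept the
# `exact` instance you passed to CacheLayer, calling `exact.stats()` avoids relying on it.
exact_stats = cache._exact.stats()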
print(f"Exact cache: {exact_stats['hits']} hits, {exact_stats['size']} entries")
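
The `stats()` dictionary reports raw counts, while a hit rate is usually the number worth tracking over time. Below is a small helper that assumes the keys documented for `ExactCache.stats()` above; `hit_rate` and the sample numbers are made up for illustration.

```python
def hit_rate(stats: dict) -> float:
    """Fraction of lookups served from cache, given a stats()-style dict."""
    lookups = stats["hits"] + stats["misses"]
    # Guard against division by zero before any lookup has happened.
    return stats["hits"] / lookups if lookups else 0.0


sample = {"hits": 30, "misses": 70, "evictions": 2, "size": 98}  # made-up numbers
print(f"{hit_rate(sample):.0%}")  # 30%
```

A steadily rising `evictions` count alongside a flat hit rate is one sign that `max_size` may be too small for the workload.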

The Vedic Principle

The eighteen Smritis are sacred texts of remembered law — the dharma-shastra — preserved across millennia without writing, only memory. The guru transmitted the Smriti to the shishya, word-perfect, generation after generation.

smriti brings this sacred memory architecture to LLM agents. What the agent has learned once, it remembers forever (up to TTL). The TF-IDF semantic layer finds the memory that feels like the current question, even when the exact words differ — like a guru who recognizes the spirit of a question even when phrased differently.

The Vedic Arsenal

smriti is one of 100 libraries in darshjme/arsenal:

Library	Source	Purpose
`smriti`	Vedic Smriti tradition	LLM caching
`niti`	Chanakya / Nitishastra	Policy enforcement
`duta`	Ramayana — Sundarakanda	Task dispatch
`kala`	Mahabharata BG 11.32	Timeout management
`raksha`	Ramayana — Sundarakanda	Agent security

Contributing

1. Fork the repo
2. Create a feature branch (`git checkout -b fix/your-fix`)
3. Add tests (zero external dependencies only)
4. Submit a PR

License

🪷 Built by Darshankumar Joshi · @thedarshanjoshi

"कर्मण्येवाधिकारस्ते मा फलेषु कदाचन" Your right is to action alone, never to its fruits. — Bhagavad Gita 2.47

Vedic Arsenal · GitHub · Twitter
