
darshjme/herald

The right agent for the right task. Every time. Without burning tokens.



The Problem

You built a multi-agent system. Now every task hits the same LLM for routing — burning tokens on a decision that doesn't need intelligence. It needs math.

Keyword rules break the moment a user rephrases. Prompt-based routers cost 200–500 tokens per call, every call, forever. At scale, that's not overhead — it's a tax on competence.

There is no reason a routing decision should consume a single LLM token.
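To put a number on that tax, a back-of-envelope sketch — the per-route token count is a mid-range figure from above; the traffic volume and per-token price are purely illustrative assumptions:

```python
# Back-of-envelope cost of an LLM-based router (illustrative numbers).
tokens_per_route = 300          # mid-range of the 200-500 tokens/call figure
routes_per_day = 100_000        # hypothetical traffic volume
price_per_million_tokens = 3.0  # hypothetical input-token price in USD

# Tokens spent purely on routing, converted to dollars per day.
daily_cost = routes_per_day * tokens_per_route / 1_000_000 * price_per_million_tokens
print(f"${daily_cost:.2f}/day spent deciding which agent to call")
```

At these assumed rates, that is $90 a day on a decision no model needed to make.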


The Solution

herald embeds your task once using sentence-transformers (all-MiniLM-L6-v2), computes cosine similarity against pre-registered specialist profiles, and dispatches to the best match in a single pass of vector math. No API call. No prompt. No tokens.

from herald import LLMRouter

router = LLMRouter()

# Register specialists with representative phrases — embedded once, cached forever
router.register(
    name="coder",
    examples=["write a function", "fix this bug", "implement the algorithm", "debug the error"],
    handler=lambda task: call_claude_opus(task)
)

router.register(
    name="analyst",
    examples=["analyze the dataset", "summarize trends", "compare metrics", "explain the numbers"],
    handler=lambda task: call_sonnet(task)
)

router.register(
    name="writer",
    examples=["draft the email", "write the announcement", "create the blog post"],
    handler=lambda task: call_haiku(task)
)

# Route a task — zero LLM tokens spent on this decision
result = router.route("write a Python function to parse JSON")
print(result.specialist)   # → "coder"
print(result.confidence)   # → 0.847

# Route and execute in one call
output = router.route_and_run("draft a LinkedIn announcement")

That's it. No config files. No YAML. No prompt engineering for routing.


How It Works

flowchart LR
    A[User Task] --> B[herald]
    B -->|embed query| C[Cosine Similarity]
    C -->|0.94| D[🧑‍💻 Coder]
    C -->|0.21| E[🔬 Researcher]
    C -->|0.08| F[✍️ Writer]
    C -->|0.11| G[📊 Analyst]
    D -->|dispatched| H[Result]

    style D fill:#1f6feb,color:#fff
  1. Registration — each specialist provides representative phrases. herald embeds them once and caches the vectors.
  2. Routing — the incoming task is embedded and compared via cosine similarity against all specialist profiles.
  3. Dispatch — the highest-scoring specialist's handler is invoked. The routing decision costs zero LLM tokens.
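The three steps above reduce to a few lines of vector math. A minimal sketch of the comparison herald performs, using invented toy vectors in place of real sentence-transformer embeddings (specialist names and values are illustrative only):

```python
import numpy as np

# Toy stand-ins for cached specialist profile vectors. In herald, each
# specialist's example phrases are embedded once at registration time.
profiles = {
    "coder":   np.array([0.9, 0.1, 0.0]),
    "analyst": np.array([0.1, 0.9, 0.1]),
    "writer":  np.array([0.0, 0.2, 0.9]),
}

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def route(task_vec):
    # Compare the task embedding against every cached profile and pick
    # the highest cosine similarity -- no LLM call anywhere in this path.
    scores = {name: cosine(task_vec, vec) for name, vec in profiles.items()}
    best = max(scores, key=scores.get)
    return best, scores[best]

task = np.array([0.85, 0.15, 0.05])   # pretend embedding of "fix this bug"
best, score = route(task)
print(best)  # → "coder"
```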

Multi-Specialist Architecture

flowchart TD
    T[Task Input] --> H[herald router]
    H -->|similarity ≥ 0.7| S1[🧑‍💻 Coder Agent\nGPT-4o]
    H -->|similarity ≥ 0.7| S2[🔬 Research Agent\nClaude Opus]
    H -->|similarity ≥ 0.7| S3[✍️ Writer Agent\nGemini Flash]
    H -->|similarity < 0.7| FB[🔄 Default Agent\nfallback]
    S1 --> R[Result]
    S2 --> R
    S3 --> R
    FB --> R
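The fallback branch in the diagram is just a confidence check before dispatch. A sketch with toy similarity scores — the 0.7 threshold mirrors the diagram, the handler names are illustrative, and nothing here calls herald itself:

```python
# Threshold-based fallback, sketched with toy similarity scores rather
# than real embeddings. In herald this corresponds to checking the
# routing confidence before invoking a specialist's handler.
THRESHOLD = 0.7

def dispatch(scores, handlers, fallback):
    best = max(scores, key=scores.get)
    if scores[best] < THRESHOLD:
        return fallback()          # no specialist is a good enough fit
    return handlers[best]()

handlers = {"coder": lambda: "ran coder", "writer": lambda: "ran writer"}
default = lambda: "ran default"

hit = dispatch({"coder": 0.94, "writer": 0.08}, handlers, default)
miss = dispatch({"coder": 0.31, "writer": 0.22}, handlers, default)
print(hit)   # → "ran coder"
print(miss)  # → "ran default"
```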

Installation

pip install sentence-transformers
# Clone and use directly
git clone https://github.com/darshjme/herald.git
cd herald
pip install -r requirements.txt

PyPI package (pip install herald) — coming soon.

For sub-50ms routing on CPU, install the ONNX backend:

pip install "sentence-transformers[onnx]"

Performance

Routing is pure vector math. No LLM call. No network round-trip.

| Metric | CPU | GPU / ONNX |
|---|---|---|
| Routing latency | ~300–600 ms | < 50 ms |
| LLM tokens per route | 0 | 0 |
| Model size | ~80 MB | same |
| Works offline | ✅ | ✅ |

The ~300 ms CPU figure is dominated by the one-time model load, amortised across every subsequent request. After warmup, repeated routing calls cost microseconds.


API Reference

LLMRouter

from herald import LLMRouter
router = LLMRouter(model="all-MiniLM-L6-v2", threshold=0.0)
| Method | Signature | Returns | Description |
|---|---|---|---|
| `register` | `(name, examples, handler)` | `None` | Register a specialist with example phrases and a callable handler |
| `route` | `(task, top_k=1)` | `RouteResult` | Compute similarity and return the routing decision without executing |
| `route_and_run` | `(task)` | `Any` | Route the task and invoke the matched specialist's handler |
| `benchmark` | `(task, n=50)` | `dict` | Measure mean/p95/p99 routing latency over `n` iterations |
| `list_specialists` | `()` | `list[str]` | Return the names of all registered specialists |
| `unregister` | `(name)` | `None` | Remove a specialist by name |

RouteResult

| Field | Type | Description |
|---|---|---|
| `specialist` | `str` | Name of the matched specialist |
| `confidence` | `float` | Cosine similarity score (0.0–1.0) |
| `handler` | `callable` | The matched specialist's handler function |
| `alternatives` | `list[RouteResult]` | Next-best matches when `top_k > 1` |
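A sketch of the ranking behind a `top_k > 1` result — toy scores and illustrative specialist names; herald's actual `RouteResult` construction may differ in detail:

```python
# Toy similarity scores standing in for real cosine comparisons.
scores = {"coder": 0.84, "analyst": 0.41, "writer": 0.12}

# Rank all specialists by similarity: the best match becomes the primary
# result, the rest populate the `alternatives`-style list.
ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
best, alternatives = ranked[0], ranked[1:]
print(best)          # → ("coder", 0.84)
print(alternatives)  # next-best matches, highest first
```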

Constructor Parameters

| Parameter | Default | Description |
|---|---|---|
| `model` | `"all-MiniLM-L6-v2"` | sentence-transformers model name or path |
| `threshold` | `0.0` | Minimum confidence to accept a match (`0.0` = always match best) |
| `device` | `"cpu"` | Torch device: `"cpu"`, `"cuda"`, or `"mps"` |

Why herald

Every serious multi-agent system eventually needs task routing. Here is how the options compare:

| Approach | Accuracy | Cost per Route | Maintainability | Offline |
|---|---|---|---|---|
| One LLM for everything | ⚠️ Mediocre | 💸 High | ⚠️ Brittle | ❌ |
| `if/elif` keyword matching | ❌ Fragile | ✅ Free | ❌ Breaks on rephrase | ✅ |
| LLM-as-router | ✅ High | 💸 200–500 tokens/call | ✅ Flexible | ❌ |
| Function calling router | ✅ High | 💸 Tokens + latency | ✅ Flexible | ❌ |
| herald | ✅ High | ✅ Zero tokens | ✅ Flexible | ✅ |

herald is not a replacement for LLM intelligence. It is the infrastructure layer beneath it — handling dispatch so your expensive models spend their tokens on tasks that require reasoning, not on deciding which box to put the task in.


Philosophy

"Shreyan swadharmo vigunah paradharmat svanushthitat." — Bhagavad Gita 3.35

Better to do one's own dharma imperfectly than another's dharma well. Every specialist in your system has a dharma — the domain they were built for. herald ensures each task reaches the agent whose dharma matches it. No generalist pretending. No token waste on indirection.


Part of Arsenal

verdict · sentinel · herald · engram · arsenal
| Repo | Purpose |
|---|---|
| verdict | Score and evaluate your agents |
| sentinel | Stop runaway agents |
| herald ← | Route tasks to the right agent |
| engram | Agent memory and recall |
| arsenal | The full production pipeline |

License

MIT © Darshankumar Joshi · Built as part of the Arsenal toolkit.
