You built a multi-agent system. Now every task hits the same LLM for routing — burning tokens on a decision that doesn't need intelligence. It needs math.
Keyword rules break the moment a user rephrases. Prompt-based routers cost 200–500 tokens per call, every call, forever. At scale, that's not overhead — it's a tax on competence.
There is no reason a routing decision should consume a single LLM token.
herald embeds your task once using sentence-transformers (all-MiniLM-L6-v2), computes cosine similarity against pre-registered specialist profiles, and dispatches to the best match in O(1) time. No API call. No prompt. No tokens.
```python
from herald import LLMRouter

router = LLMRouter()

# Register specialists with representative phrases — embedded once, cached forever
router.register(
    name="coder",
    examples=["write a function", "fix this bug", "implement the algorithm", "debug the error"],
    handler=lambda task: call_claude_opus(task),
)
router.register(
    name="analyst",
    examples=["analyze the dataset", "summarize trends", "compare metrics", "explain the numbers"],
    handler=lambda task: call_sonnet(task),
)
router.register(
    name="writer",
    examples=["draft the email", "write the announcement", "create the blog post"],
    handler=lambda task: call_haiku(task),
)

# Route a task — zero LLM tokens spent on this decision
result = router.route("write a Python function to parse JSON")
print(result.specialist)  # → "coder"
print(result.confidence)  # → 0.847

# Route and execute in one call
output = router.route_and_run("draft a LinkedIn announcement")
```

That's it. No config files. No YAML. No prompt engineering for routing.
```mermaid
flowchart LR
    A[User Task] --> B[herald]
    B -->|embed query| C[Cosine Similarity]
    C -->|0.94| D[🧑💻 Coder]
    C -->|0.21| E[🔬 Researcher]
    C -->|0.08| F[✍️ Writer]
    C -->|0.11| G[📊 Analyst]
    D -->|dispatched| H[Result]
    style D fill:#1f6feb,color:#fff
```
- Registration — each specialist provides representative phrases. herald embeds them once and caches the vectors.
- Routing — the incoming task is embedded and compared via cosine similarity against all specialist profiles.
- Dispatch — the highest-scoring specialist's handler is invoked. The routing decision costs zero LLM tokens.
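The three steps above can be sketched in plain Python. This is a minimal, dependency-free illustration of the embed-once/compare/dispatch loop, not herald's actual implementation: `toy_embed` is a hypothetical stand-in for the real all-MiniLM-L6-v2 embedding, used here only to make the flow runnable.

```python
import math

def toy_embed(text: str) -> list[float]:
    """Hypothetical stand-in for a sentence-transformers embedding:
    a tiny bag-of-characters vector, just enough to make routing runnable."""
    vec = [0.0] * 26
    for ch in text.lower():
        if ch.isalpha():
            vec[ord(ch) - ord("a")] += 1.0
    return vec

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

# 1. Registration — embed each specialist's examples once and cache the vectors
profiles = {
    "coder": [toy_embed(p) for p in ["write a function", "fix this bug"]],
    "writer": [toy_embed(p) for p in ["draft the email", "write the blog post"]],
}

def route(task: str) -> tuple[str, float]:
    # 2. Routing — embed the task, compare against every cached profile
    q = toy_embed(task)
    scores = {
        name: max(cosine(q, v) for v in vecs)
        for name, vecs in profiles.items()
    }
    # 3. Dispatch — highest-scoring specialist wins; no LLM is consulted
    best = max(scores, key=scores.get)
    return best, scores[best]
```

With real sentence embeddings the scores separate far more cleanly than this toy version; the point is that no step in the loop touches an LLM.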
```mermaid
flowchart TD
    T[Task Input] --> H[herald router]
    H -->|similarity ≥ 0.7| S1[🧑💻 Coder Agent\nGPT-4o]
    H -->|similarity ≥ 0.7| S2[🔬 Research Agent\nClaude Opus]
    H -->|similarity ≥ 0.7| S3[✍️ Writer Agent\nGemini Flash]
    H -->|similarity < 0.7| FB[🔄 Default Agent\nfallback]
    S1 --> R[Result]
    S2 --> R
    S3 --> R
    FB --> R
```
```bash
pip install sentence-transformers

# Clone and use directly
git clone https://github.com/darshjme/herald.git
cd herald
pip install -r requirements.txt
```

PyPI package (`pip install herald`) — coming soon.

For sub-50ms routing on CPU, install the ONNX backend:

```bash
pip install "sentence-transformers[onnx]"
```

Routing is pure vector math. No LLM call. No network round-trip.
| Metric | CPU | GPU / ONNX |
|---|---|---|
| Routing latency | ~300–600ms | < 50ms |
| LLM tokens per route | 0 | 0 |
| Model size | ~80MB | same |
| Works offline | ✅ | ✅ |
The ~300–600ms CPU figure is dominated by the one-time model load, amortised across every subsequent request. After warmup, repeated routing calls cost microseconds.
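herald's `benchmark(task, n=50)` method reports latency percentiles over warm calls. A generic version of that measurement, with assumed result-key names, could look like this:

```python
import time
import statistics

def benchmark(fn, arg, n=50):
    """Measure mean/p95/p99 latency of fn(arg) over n warm iterations, in ms."""
    fn(arg)  # warmup: the first call pays any one-time setup cost
    timings = []
    for _ in range(n):
        start = time.perf_counter()
        fn(arg)
        timings.append((time.perf_counter() - start) * 1000.0)
    timings.sort()
    return {
        "mean_ms": statistics.mean(timings),
        "p95_ms": timings[int(0.95 * (n - 1))],
        "p99_ms": timings[int(0.99 * (n - 1))],
    }
```

Timing only post-warmup iterations is what lets a ~300ms cold start coexist with microsecond steady-state routing.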
```python
from herald import LLMRouter

router = LLMRouter(model="all-MiniLM-L6-v2", threshold=0.0)
```

| Method | Signature | Returns | Description |
|---|---|---|---|
| `register` | `(name, examples, handler)` | `None` | Register a specialist with example phrases and a callable handler |
| `route` | `(task, top_k=1)` | `RouteResult` | Compute similarity and return the routing decision without executing |
| `route_and_run` | `(task)` | `Any` | Route the task and invoke the matched specialist's handler |
| `benchmark` | `(task, n=50)` | `dict` | Measure mean/p95/p99 routing latency over `n` iterations |
| `list_specialists` | `()` | `list[str]` | Return names of all registered specialists |
| `unregister` | `(name)` | `None` | Remove a specialist by name |
| Field | Type | Description |
|---|---|---|
| `specialist` | `str` | Name of the matched specialist |
| `confidence` | `float` | Cosine similarity score (0.0–1.0) |
| `handler` | `callable` | The matched specialist's handler function |
| `alternatives` | `list[RouteResult]` | Next-best matches when `top_k > 1` |
| Parameter | Default | Description |
|---|---|---|
| `model` | `"all-MiniLM-L6-v2"` | sentence-transformers model name or path |
| `threshold` | `0.0` | Minimum confidence to accept a match (`0.0` = always match best) |
| `device` | `"cpu"` | Torch device: `"cpu"`, `"cuda"`, or `"mps"` |
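The `threshold` semantics can be shown in a few lines. This is a sketch of the rule described above, not herald's code: with a non-negative best score, `threshold=0.0` always accepts, while a higher cutoff sends low-confidence tasks to a fallback (the `accept` helper is hypothetical):

```python
def accept(scores: dict[str, float], threshold: float = 0.0):
    """Apply the threshold rule: return the best specialist if its
    similarity clears the threshold, else None so the caller can fall back."""
    best = max(scores, key=scores.get)
    if scores[best] >= threshold:
        return best
    return None
```

A `None` result corresponds to the "similarity < 0.7 → Default Agent" branch in the architecture diagram above.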
Every serious multi-agent system eventually needs task routing. Here is how the options compare:
| Approach | Accuracy | Cost per Route | Maintainability | Offline |
|---|---|---|---|---|
| One LLM for everything | — | 💸 High | ❌ | ❌ |
| `if/elif` keyword matching | ❌ Fragile | ✅ Free | ❌ Breaks on rephrase | ✅ |
| LLM-as-router | ✅ High | 💸 200–500 tokens/call | ✅ Flexible | ❌ |
| Function calling router | ✅ High | 💸 Tokens + latency | ✅ Flexible | ❌ |
| herald | ✅ High | ✅ Zero tokens | ✅ Flexible | ✅ |
herald is not a replacement for LLM intelligence. It is the infrastructure layer beneath it — handling dispatch so your expensive models spend their tokens on tasks that require reasoning, not on deciding which box to put the task in.
"Shreyan swadharmo vigunah paradharmat svanushthitat." — Bhagavad Gita 3.35
Better to do one's own dharma imperfectly than another's dharma well. Every specialist in your system has a dharma — the domain they were built for. herald ensures each task reaches the agent whose dharma matches it. No generalist pretending. No token waste on indirection.
verdict · sentinel · herald · engram · arsenal
| Repo | Purpose |
|---|---|
| verdict | Score and evaluate your agents |
| sentinel | Stop runaway agents |
| herald | ← Route tasks to the right agent |
| engram | Agent memory and recall |
| arsenal | The full production pipeline |
MIT © Darshankumar Joshi · Built as part of the Arsenal toolkit.