You built a multi-agent system. Now every task hits the same LLM for routing — burning tokens on a decision that doesn't need intelligence. It needs math.
Keyword rules break the moment a user rephrases. Prompt-based routers cost 200–500 tokens per call, every call, forever. At scale, that's not overhead — it's a tax on competence.
There is no reason a routing decision should consume a single LLM token.
herald embeds your task once using sentence-transformers (all-MiniLM-L6-v2), computes cosine similarity against pre-registered specialist profiles, and dispatches to the best match in O(1) time. No API call. No prompt. No tokens.
```python
from herald import LLMRouter

router = LLMRouter()

# Register specialists with representative phrases — embedded once, cached forever
router.register(
    name="coder",
    examples=["write a function", "fix this bug", "implement the algorithm", "debug the error"],
    handler=lambda task: call_claude_opus(task),
)
router.register(
    name="analyst",
    examples=["analyze the dataset", "summarize trends", "compare metrics", "explain the numbers"],
    handler=lambda task: call_sonnet(task),
)
router.register(
    name="writer",
    examples=["draft the email", "write the announcement", "create the blog post"],
    handler=lambda task: call_haiku(task),
)

# Route a task — zero LLM tokens spent on this decision
result = router.route("write a Python function to parse JSON")
print(result.specialist)  # → "coder"
print(result.confidence)  # → 0.847

# Route and execute in one call
output = router.route_and_run("draft a LinkedIn announcement")
```

That's it. No config files. No YAML. No prompt engineering for routing.
```mermaid
flowchart LR
    A[User Task] --> B[herald]
    B -->|embed query| C[Cosine Similarity]
    C -->|0.94| D[🧑💻 Coder]
    C -->|0.21| E[🔬 Researcher]
    C -->|0.08| F[✍️ Writer]
    C -->|0.11| G[📊 Analyst]
    D -->|dispatched| H[Result]
    style D fill:#1f6feb,color:#fff
```
- Registration — each specialist provides representative phrases. herald embeds them once and caches the vectors.
- Routing — the incoming task is embedded and compared via cosine similarity against all specialist profiles.
- Dispatch — the highest-scoring specialist's handler is invoked. The routing decision costs zero LLM tokens.
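The three steps above can be sketched in plain Python. This is a minimal, dependency-free illustration of the embed-once/compare/dispatch loop, not herald's actual implementation: `toy_embed` is a hypothetical stand-in for the real all-MiniLM-L6-v2 embedding, used here only to make the flow runnable.

```python
import math

def toy_embed(text: str) -> list[float]:
    """Hypothetical stand-in for a sentence-transformers embedding:
    a tiny bag-of-characters vector, just enough to make routing runnable."""
    vec = [0.0] * 26
    for ch in text.lower():
        if ch.isalpha():
            vec[ord(ch) - ord("a")] += 1.0
    return vec

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

# 1. Registration — embed each specialist's examples once and cache the vectors
profiles = {
    "coder": [toy_embed(p) for p in ["write a function", "fix this bug"]],
    "writer": [toy_embed(p) for p in ["draft the email", "write the blog post"]],
}

def route(task: str) -> tuple[str, float]:
    # 2. Routing — embed the task, compare against every cached profile
    q = toy_embed(task)
    scores = {
        name: max(cosine(q, v) for v in vecs)
        for name, vecs in profiles.items()
    }
    # 3. Dispatch — highest-scoring specialist wins; no LLM is consulted
    best = max(scores, key=scores.get)
    return best, scores[best]
```

With real sentence embeddings the scores separate far more cleanly than this toy version; the point is that no step in the loop touches an LLM.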
```mermaid
flowchart TD
    T[Task Input] --> H[herald router]
    H -->|similarity ≥ 0.7| S1[🧑💻 Coder Agent\nGPT-4o]
    H -->|similarity ≥ 0.7| S2[🔬 Research Agent\nClaude Opus]
    H -->|similarity ≥ 0.7| S3[✍️ Writer Agent\nGemini Flash]
    H -->|similarity < 0.7| FB[🔄 Default Agent\nfallback]
    S1 --> R[Result]
    S2 --> R
    S3 --> R
    FB --> R
```
```bash
pip install sentence-transformers

# Clone and use directly
git clone https://github.com/darshjme/herald.git
cd herald
pip install -r requirements.txt
```

PyPI package (`pip install herald`) — coming soon.

For sub-50ms routing on CPU, install the ONNX backend:

```bash
pip install "sentence-transformers[onnx]"
```

Routing is pure vector math. No LLM call. No network round-trip.
| Metric | CPU | GPU / ONNX |
|---|---|---|
| Routing latency | ~300–600ms | < 50ms |
| LLM tokens per route | 0 | 0 |
| Model size | ~80MB | same |
| Works offline | ✅ | ✅ |
The ~300–600ms CPU figure is dominated by the one-time model load, amortised across every subsequent request. After warmup, repeated routing calls cost microseconds.
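herald's `benchmark(task, n=50)` method reports latency percentiles over warm calls. A generic version of that measurement, with assumed result-key names, could look like this:

```python
import time
import statistics

def benchmark(fn, arg, n=50):
    """Measure mean/p95/p99 latency of fn(arg) over n warm iterations, in ms."""
    fn(arg)  # warmup: the first call pays any one-time setup cost
    timings = []
    for _ in range(n):
        start = time.perf_counter()
        fn(arg)
        timings.append((time.perf_counter() - start) * 1000.0)
    timings.sort()
    return {
        "mean_ms": statistics.mean(timings),
        "p95_ms": timings[int(0.95 * (n - 1))],
        "p99_ms": timings[int(0.99 * (n - 1))],
    }
```

Timing only post-warmup iterations is what lets a ~300ms cold start coexist with microsecond steady-state routing.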
```python
from herald import LLMRouter

router = LLMRouter(model="all-MiniLM-L6-v2", threshold=0.0)
```

| Method | Signature | Returns | Description |
|---|---|---|---|
| `register` | `(name, examples, handler)` | `None` | Register a specialist with example phrases and a callable handler |
| `route` | `(task, top_k=1)` | `RouteResult` | Compute similarity and return the routing decision without executing |
| `route_and_run` | `(task)` | `Any` | Route the task and invoke the matched specialist's handler |
| `benchmark` | `(task, n=50)` | `dict` | Measure mean/p95/p99 routing latency over `n` iterations |
| `list_specialists` | `()` | `list[str]` | Return names of all registered specialists |
| `unregister` | `(name)` | `None` | Remove a specialist by name |
| Field | Type | Description |
|---|---|---|
| `specialist` | `str` | Name of the matched specialist |
| `confidence` | `float` | Cosine similarity score (0.0–1.0) |
| `handler` | `callable` | The matched specialist's handler function |
| `alternatives` | `list[RouteResult]` | Next-best matches when `top_k > 1` |
| Parameter | Default | Description |
|---|---|---|
| `model` | `"all-MiniLM-L6-v2"` | sentence-transformers model name or path |
| `threshold` | `0.0` | Minimum confidence to accept a match (`0.0` = always match best) |
| `device` | `"cpu"` | Torch device: `"cpu"`, `"cuda"`, or `"mps"` |
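The `threshold` semantics can be shown in a few lines. This is a sketch of the rule described above, not herald's code: with a non-negative best score, `threshold=0.0` always accepts, while a higher cutoff sends low-confidence tasks to a fallback (the `accept` helper is hypothetical):

```python
def accept(scores: dict[str, float], threshold: float = 0.0):
    """Apply the threshold rule: return the best specialist if its
    similarity clears the threshold, else None so the caller can fall back."""
    best = max(scores, key=scores.get)
    if scores[best] >= threshold:
        return best
    return None
```

A `None` result corresponds to the "similarity < 0.7 → Default Agent" branch in the architecture diagram above.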
Every serious multi-agent system eventually needs task routing. Here is how the options compare:
| Approach | Accuracy | Cost per Route | Maintainability | Offline |
|---|---|---|---|---|
| One LLM for everything | — | 💸 High | ❌ | ❌ |
| `if/elif` keyword matching | ❌ Fragile | ✅ Free | ❌ Breaks on rephrase | ✅ |
| LLM-as-router | ✅ High | 💸 200–500 tokens/call | ✅ Flexible | ❌ |
| Function calling router | ✅ High | 💸 Tokens + latency | ✅ Flexible | ❌ |
| herald | ✅ High | ✅ Zero tokens | ✅ Flexible | ✅ |
herald is not a replacement for LLM intelligence. It is the infrastructure layer beneath it — handling dispatch so your expensive models spend their tokens on tasks that require reasoning, not on deciding which box to put the task in.
"Shreyan swadharmo vigunah paradharmat svanushthitat." — Bhagavad Gita 3.35
Better to do one's own dharma imperfectly than another's dharma well. Every specialist in your system has a dharma — the domain they were built for. herald ensures each task reaches the agent whose dharma matches it. No generalist pretending. No token waste on indirection.
verdict · sentinel · herald · engram · arsenal
| Repo | Purpose |
|---|---|
| verdict | Score and evaluate your agents |
| sentinel | Stop runaway agents |
| herald | ← Route tasks to the right agent |
| engram | Agent memory and recall |
| arsenal | The full production pipeline |
MIT © Darshankumar Joshi · Built as part of the Arsenal toolkit.