Right-Sizing AI Agents
Why specialisation beats scale
The company that sells the world’s most powerful GPUs just told developers to use smaller models.
At GTC 2026, NVIDIA unveiled the Nemotron 3 family, not one monolithic model, but a coordinated stack of purpose-built models designed to work together in production agentic AI systems.
The reasoning model activates only 12B of its 120B parameters per inference call.
The safety classifier is 4B parameters.
The embedding model is 1.7B.
The Cost of “One Model to Rule Them All”
Here’s a pattern I see constantly in production AI systems…
A development team builds an agent that needs to reason about user queries, retrieve relevant documents, check safety constraints, and generate responses.
They route every single one of these tasks through the same massive model…often a 400B+ parameter behemoth.
The logic seems sound… bigger models are more capable, so use the biggest one for everything.
But production agents don’t make one model call.
They make dozens, sometimes hundreds, per user interaction.
An agent that plans a multi-step task, retrieves context at each step, validates safety at each output boundary and synthesises a final response might rack up 50+ inference calls for a single user query.
At that scale, the economics change dramatically:
Single query through 400B model: ~$0.03
Agent workflow (50 calls × 400B): ~$1.50 per interaction
Agent workflow (specialised stack): ~$0.15 per interaction
At 100K daily interactions:
Monolith: ~$150,000/day
Specialized: ~$15,000/day
Savings: ~$135,000/day → $49M/yearThese numbers are only illustrative, but the order-of-magnitude difference is real.
When you’re paying per token and your agents are chatty, model size directly hits your bottom line.
NVIDIA’s answer is the specialised stack
The Nemotron 3 family isn’t just a collection of models, it’s an architectural thesis.
Each model is purpose-built for a specific role in the agentic pipeline…
Nemotron 3 Super is the reasoning engine (12B active / 120B total)
The flagship model uses a hybrid Mamba-Transformer architecture with mixture-of-experts routing.
Despite containing 120B total parameters, only 12B activate per inference pass.
This isn’t a compromise, it’s by design.
Key capabilities:
1M-token context window for maintaining extensive agent histories across multi-step tasks
Configurable thinking budget that lets you dial chain-of-thought reasoning up or down based on task complexity
Multi-token prediction for faster generation
The model trained across 10+ reinforcement learning environments and ranks among top open-weight models under 250B on the Artificial Analysis Intelligence Index.
Sitting in the “most attractive” quadrant that combines intelligence with throughput.
Nemotron 3 Content Safety is the guardrail (4B parameters)
This is where the specialisation thesis becomes most compelling.
Most teams implement safety by prompting their main model: “Before responding, check if this content is safe…”
Nemotron 3 Content Safety is a dedicated 4B-parameter multimodal classifier built on the Gemma-3–4B backbone:
23-category safety taxonomy covering hate, harassment, violence, sexual content, and unauthorised advice
Multimodal detection across text and images
12-language support with zero-shot generalisation beyond
~84% accuracy on multimodal, multilingual benchmarks
Fast binary classification or full taxonomy reporting mode
At 4B parameters, this model runs incredibly fast, fast enough to serve as an inline guardrail on every single agent output without meaningfully impacting latency or cost.
Llama Nemotron Embed VL & Rerank VL is the Retrieval Specialists (1.7B each)
For retrieval-augmented generation, NVIDIA built two complementary models:
Embed VL (1.7B parameters):
Encodes page images and text into single-dimensional vectors
Built on NVIDIA Eagle vision-language model with Llama 3.2 1B backbone
Sits on the Pareto frontier of the ViDoRe V3/MTEB benchmark for accuracy vs. throughput
Rerank VL (1.7B parameters):
Cross-encoder that scores query-page relevance
Paired with Embed VL for a two-stage retrieval pipeline
Further increases accuracy when reranking chunks and images
These models don’t try to reason or generate, they do one thing exceptionally well…find the right information.
Nemotron 3 VoiceChat is the Conversationalist (12B parameters)
Rather than cascading separate ASR to LLM to TTS pipelines, VoiceChat is an end-to-end speech model:
Sub-300ms end-to-end latency processing 80ms audio chunks
Full-duplex, interruptible conversations … it can listen while it speaks
Unified streaming architecture eliminates inter-pipeline latency
The routing pattern
The core architectural pattern is an intent-aware router that dispatches to specialised models:
class AgentRouter:
def __init__(self):
self.models = {
“reasoning”: NemotronSuper(active_params=”12B”),
“safety”: NemotronContentSafety(params=”4B”),
“embedding”: LlamaNemotronEmbedVL(params=”1.7B”),
“reranking”: LlamaNemotronRerankVL(params=”1.7B”),
“voice”: NemotronVoiceChat(params=”12B”),
}
def route(self, task):
model = self.models[task.type]
return model.infer(task.payload)This isn’t complicated and I guess that’s the point.
The complexity lives in choosing the right model for each task, not in the routing logic itself.
The safety-as-a-service pattern
Instead of embedding safety prompts into your main model calls, treat safety as an independent service:
# Before: Safety as a prompt hack
response = big_model.generate(
f”Check if this is safe, then answer: {user_query}”
)
# After: Safety as an independent, specialised check
response = reasoning_model.generate(user_query)
safety = safety_model.classify(response) # 4B params, ~5ms
if not safety.is_safe:
response = apply_guardrail(response, safety.categories)The specialised approach is faster, cheaper, more reliable, and independently tunable.
Configurable thinking budget
One of the most practical features in Nemotron 3 Super is the configurable thinking budget.
Not every reasoning task requires deep chain-of-thought…
This is right-sizing at the inference level, not just choosing the right model, but choosing the right amount of reasoning within that model.
Open weights and open architecture
All Nemotron 3 models ship under permissive NVIDIA open model licenses.
Customisation
Fine-tune the safety classifier on your domain’s specific hazards
Deployment flexibility
Run on-premises, in your VPC, or at the edge
Transparency
Audit model behaviour rather than trusting a black-box API
No vendor lock-in
The architectural pattern works with any combination of specialised models
Thinking of the future
The Nemotron 3 stack represents a maturation of the AI industry.
We’re moving past the era of throw a bigger model at it and into an era of thoughtful, production-grade architecture where:
Every model earns its parameter count
Efficiency is a feature, not a limitation
Specialisation enables capabilities that generalisation can’t (like sub-300ms voice or inline safety at every boundary)
The question for developers isn’t “which single model should I use?” anymore. It’s “what does my agent stack look like?”
Chief AI Evangelist @ Kore.ai | I’m passionate about exploring the intersection of AI and language. From Language Models, AI Agents to Agentic Applications, Development Frameworks & Data-Centric Productivity Tools, I share insights and ideas on how these technologies are shaping the future.
GitHub - cobusgreyling/right-sizing-ai-agents: Right-Sizing AI Agents: Why Specialization Beats…
Right-Sizing AI Agents: Why Specialization Beats Scale - Blog, code examples, and architecture diagrams exploring…github.com
Building NVIDIA Nemotron 3 Agents for Reasoning, Multimodal RAG, Voice, and Safety | NVIDIA…
Agentic AI is an ecosystem where specialized models work together to handle planning, reasoning, retrieval, and safety…developer.nvidia.com


