Microsoft Foundry Blog

Foundry IQ: Improve recall by up to 54% with knowledge bases

MattGotteiner
Jun 02, 2026

Foundry IQ has improved its agentic retrieval engine, resulting in better answer quality and greater token cost savings.

By Abhishree Shetty, Alec Berntson, Alina Stoica Beck, Amaia Salvador Aguilera, Arnau Quindós Sánchez, Jieyu Cheng, Jim Singh, Jing Zhu, Lihang Li, Matt Gotteiner, Mike Kim, Thibault Gisselbrecht

Agents are only valuable for the enterprise when they have access to your organization's knowledge. A common starting point is to give your agent a retrieval tool to find relevant context for its task. What do you do when your agent can’t find relevant context using the tool?

Foundry IQ knowledge bases deliver strong search performance while keeping your agent focused on the broader task. Knowledge bases orchestrate agentic retrieval by connecting agents to one or more knowledge sources, retrieving the necessary context, and producing accurate and faithful answers. We compared standalone retrieval tools to knowledge bases using the challenging BrowseComp-Plus benchmark and found:

  • Replacing single-shot RAG with a knowledge base improves evidence recall by up to 46%.
  • Combining a smaller agent model with agentic retrieval improves evidence recall by up to 54% while controlling costs and increasing agent responsiveness.
  • In both cases, the number of retrieval tool calls your agent makes is reduced, resulting in 34% token cost savings.

| Outer Model | Retrieval Configuration | Evidence Recall | Tool Calls (Mean) | Normalized Tokens / Query |
| --- | --- | --- | --- | --- |
| gpt-5.4-mini (low) | Hybrid search | 42.8 | 7.2 | 12.7k |
| gpt-5.4-full (low) | Hybrid search | 48.9 | 4.9 | 44.3k |
| gpt-5.4-nano (low) | Knowledge base (minimal) | 50.1 | 1.7 | 3.9k |
| gpt-5.4-mini (low) | Knowledge base (minimal) | 62.6 | 4.3 | 7.3k |
| gpt-5.4 (low) | Knowledge base (minimal) | 74.9 | 10.2 | 32.5k |
| gpt-5.4-mini (low) | Knowledge base (medium, gpt-5.4-mini) | 71.1 | 2.3 | 13.3k |
| gpt-5.4-mini (low) | Knowledge base (medium, gpt-5.4) | 75 | 2.1 | 28k |

Table 1: Comparison of agent orchestrated retrieval using either knowledge bases or a standalone hybrid search tool
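The headline recall percentages are relative changes between rows of Table 1. The short sketch below recomputes two of them from the rounded table values. The helper name `relative_change` is ours, and the pairing of rows for the second comparison (larger model with hybrid search versus smaller model with a medium knowledge base) is an assumption about which rows the summary bullet refers to. Because the table is rounded, results can differ by about a point from figures computed on unrounded data.

```python
def relative_change(before: float, after: float) -> float:
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100.0

# Evidence recall values copied from Table 1.
hybrid_mini = 42.8       # gpt-5.4-mini + hybrid search
kb_minimal_mini = 62.6   # gpt-5.4-mini + knowledge base (minimal)
hybrid_full = 48.9       # gpt-5.4-full + hybrid search
kb_medium_mini = 75.0    # gpt-5.4-mini + knowledge base (medium, gpt-5.4)

# Same outer model, swapping hybrid search for a minimal knowledge base.
same_model_lift = relative_change(hybrid_mini, kb_minimal_mini)

# Assumed pairing: larger model with hybrid search vs. smaller model with a medium KB.
smaller_model_lift = relative_change(hybrid_full, kb_medium_mini)

print(f"Hybrid -> KB minimal (same model): {same_model_lift:.1f}%")
print(f"Full + hybrid -> mini + KB medium: {smaller_model_lift:.1f}%")
```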

Since our last agentic retrieval release, we’ve made improvements in evidence recall as measured on multilingual enterprise content across all retrieval reasoning effort levels: 10% in minimal, 8% in low, and 9% in medium.

 

Figure 1. Average evidence recall across single knowledge source datasets using gpt-5.4-mini, reported at the minimal, low, and medium retrieval-reasoning-effort tiers. Higher scores are better.

We’ve also made improvements in answer quality: 20% in minimal, 8% in low, and 10% in medium.

 

Figure 2. Average answer completeness across single knowledge source datasets using gpt-5.4-mini, reported at the minimal, low, and medium retrieval-reasoning-effort tiers. Higher scores are better.

What’s driving the improvements

  • Improved knowledge source retrieval. A static retrieval orchestration workflow is replaced with a dynamic agentic retrieval loop. Now the agent batches requests and customizes queries for each knowledge source. The agent can review the retrieved content and determine whether follow-up queries are required.
  • Enhanced semantic ranker. We retrained our reranking model to surface more relevant passages from large document collections. This returns higher-quality context with fewer irrelevant passages from knowledge sources.
  • Improved answer synthesis. Knowledge base answer synthesis output mode now returns better and more complete answers to a wider variety of questions. We improved answer-generation guidelines so the model synthesizes retrieved content into better-structured responses, while staying grounded on the retrieved evidence.
  • Token caching and prompt efficiency. We reduced token cost by making LLM calls cache-friendly to benefit from model provider caching. This reduces token processing without sacrificing answer quality.
  • Schema specialization for MCP calls. The MCP tool description for KB automatically adapts based on the model and configuration being used.
    • Model-size variants adapt the schema to nano, mini, and full-size models. Smaller models receive tighter guidance to avoid redundant queries and preserve key terms. Larger models have more flexibility while still being guided away from excessive query splitting and unnecessary tool calls.
    • Retrieval-mode variants adapt the schema for how retrieval is performed. At minimal retrieval effort, the calling model is directed to generate keyword-preserving queries. At low and medium effort, the schema shifts to require full task delegation, including all constraints.

Metrics

In our previous work, we relied on LLM-as-a-judge relevance as the primary answer and content quality metric. We improve upon this with ground-truth-based metrics, which take inspiration from an evaluation report for the TREC 2024 RAG Track and define ground-truth nuggets as pieces of information that a perfect answer to a query should contain. This unlocks a range of questions we couldn't previously answer: how good the result is relative to the gold answer, what is missing for it to be perfect, or what share of the ground truth is covered. Throughout this blog post, we use Evidence Recall, which is the share of ground-truth nuggets that are covered by the documents the agent retrieved.
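To make the definition concrete, here is a minimal sketch of Evidence Recall as a ratio over nuggets. The function names, the toy nuggets and documents, and the substring-based `substring_covers` check are all illustrative assumptions; the post does not describe how nugget coverage is judged, and a real evaluation would use a much stronger judge than string matching.

```python
from typing import Callable, Sequence

def evidence_recall(
    nuggets: Sequence[str],
    retrieved_docs: Sequence[str],
    covers: Callable[[str, str], bool],
) -> float:
    """Share of ground-truth nuggets covered by at least one retrieved document."""
    if not nuggets:
        return 0.0
    covered = sum(
        1 for nugget in nuggets
        if any(covers(nugget, doc) for doc in retrieved_docs)
    )
    return covered / len(nuggets)

def substring_covers(nugget: str, doc: str) -> bool:
    # Placeholder judge (assumption): case-insensitive substring match.
    return nugget.lower() in doc.lower()

# Toy data, invented for illustration only.
nuggets = ["swaptions", "interest rate risk", "hedge accounting"]
docs = [
    "The company uses swaptions to manage interest rate risk.",
    "Derivative instruments are recorded at fair value.",
]
recall = evidence_recall(nuggets, docs, substring_covers)
print(f"Evidence recall: {recall:.2f}")
```

Answer Completeness follows the same shape, with the generated answer taking the place of the retrieved documents.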

For agentic evaluation, our system simulates an answer-blinded user with a specific information need. The simulated user requests information from a search agent connected to a search index or knowledge base. An independent judge observes the interaction and generates key metrics such as evidence recall and answer correctness. We’ve found that a simulated user without access to the answer is critical to replicating user interactions where a user is truly attempting to achieve a search task.

Knowledge bases offer superior recall and cost tradeoffs

We expect developers will have different tradeoffs for how much incremental token cost they are willing to spend on incremental recall (measured as evidence recall). We can compare retrieval systems based on how efficiently they make the tradeoff. The best system, at any given operating point, will operate on the evidence recall and cost frontier. We ran extensive tests of both knowledge bases and standalone search indexes with different model and parameter settings to demonstrate that knowledge bases operate on the frontier compared to standalone retrieval tools:

  • Lowest cost: A minimal knowledge base with gpt-5.4-mini for agent orchestration (low reasoning) is cheaper and higher quality than any standalone hybrid search tool.
  • Highest Evidence Recall: A minimal knowledge base with gpt-5.4 for agent orchestration (medium reasoning) offers the most evidence recall of any solution we tested.
  • Balanced: A medium knowledge base that uses gpt-5.4 for agentic retrieval combined with gpt-5.4-mini (low reasoning) for agent orchestration provides a balance between the lowest cost and highest evidence retrieval solutions.

The graph below shows this tradeoff at different points.

Figure 3. Pareto frontier diagram showing cost (tokens) vs. evidence coverage across agent-orchestrated retrieval using either a knowledge base or a standalone search index. Knowledge bases (green) sit on the efficiency frontier, while the standalone tool options (red) lag significantly. Dataset is BrowseComp-Plus.
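The frontier idea can be reproduced on a small scale. The sketch below computes the Pareto-optimal configurations (lower tokens, higher recall) using only the seven rows of Table 1, not the full set of runs behind Figure 3. The labels and the `pareto_frontier` helper are ours.

```python
# (label, normalized tokens per query in thousands, evidence recall), from Table 1.
POINTS = [
    ("mini + hybrid", 12.7, 42.8),
    ("full + hybrid", 44.3, 48.9),
    ("nano + KB minimal", 3.9, 50.1),
    ("mini + KB minimal", 7.3, 62.6),
    ("gpt-5.4 + KB minimal", 32.5, 74.9),
    ("mini + KB medium (gpt-5.4-mini)", 13.3, 71.1),
    ("mini + KB medium (gpt-5.4)", 28.0, 75.0),
]

def pareto_frontier(points):
    """Labels of points that no other point beats on both cost and recall."""
    frontier = []
    best_recall = float("-inf")
    # Walk from cheapest to most expensive; keep a point only if it raises recall.
    for label, cost, recall in sorted(points, key=lambda p: (p[1], -p[2])):
        if recall > best_recall:
            frontier.append(label)
            best_recall = recall
    return frontier

frontier = pareto_frontier(POINTS)
for label in frontier:
    print(label)
```

On these rows, neither hybrid search configuration lands on the frontier: each is both more expensive and lower in recall than at least one knowledge base configuration.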

Improved MCP schemas for Agentic Retrieval

Knowledge bases can be queried over a REST API or as an MCP tool. Our latest preview has improved how the knowledge base is presented to agents over MCP. Instead of serving a single fixed schema, the knowledge base dynamically specializes the schema for the orchestrating model’s size and retrieval settings. This improves how efficiently MCP clients can use the knowledge base:

  • Model-size variants adapt the schema to nano, mini, and full-size models. Smaller models receive tighter guidance to avoid redundant queries and preserve key terms. Larger models have more flexibility while still being guided away from excessive query splitting and unnecessary tool calls.
  • Retrieval-mode variants adapt the schema for how retrieval is performed. At minimal retrieval effort, the calling model is directed to generate keyword-preserving queries. At low and medium effort, the schema shifts to require full task delegation, including all constraints. Because even smaller models can delegate retrieval tasks well, developers can choose a more responsive user-facing model while still accessing stronger agentic search capabilities through the knowledge base.
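As a rough mental model of schema specialization, the sketch below assembles a tool description from two lookup tables, one keyed by model size and one by retrieval effort. Everything here is hypothetical: the function name `build_tool_description`, the dictionary keys, and the guidance strings (which only paraphrase the two bullets above) are our own and do not reflect the actual schema text or selection logic used by the service.

```python
# Hypothetical guidance strings paraphrasing the post; not the service's real schema text.
SIZE_GUIDANCE = {
    "nano": "Avoid redundant queries and preserve key terms.",
    "mini": "Avoid redundant queries and preserve key terms.",
    "full": "Avoid excessive query splitting and unnecessary tool calls.",
}

EFFORT_GUIDANCE = {
    "minimal": "Write keyword-preserving queries.",
    "low": "Delegate the full task, including all constraints.",
    "medium": "Delegate the full task, including all constraints.",
}

def build_tool_description(model_size: str, retrieval_effort: str) -> str:
    """Combine size-specific and effort-specific guidance into one description."""
    if model_size not in SIZE_GUIDANCE:
        raise ValueError(f"unknown model size: {model_size!r}")
    if retrieval_effort not in EFFORT_GUIDANCE:
        raise ValueError(f"unknown retrieval effort: {retrieval_effort!r}")
    return (
        "Search the knowledge base. "
        f"{SIZE_GUIDANCE[model_size]} {EFFORT_GUIDANCE[retrieval_effort]}"
    )

print(build_tool_description("mini", "minimal"))
print(build_tool_description("full", "medium"))
```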

Model Query Generation Can Distort Retrieval Intent

Consider this accounting-focused user question:

“What are swaptions used for in the context of undesignated derivatives and derivative instruments?”

The outer model generates the following search query:

“swaptions used to achieve a targeted mix of fixed and variable rate debt”

The missing term “undesignated derivatives” and the added term “debt” may look like a minor substitution, but together they change the retrieval target. The model has injected a debt-related answer hypothesis into the query. Instead of targeting documents about the accounting treatment of derivatives, retrieval is pulled toward interest rates. To the user, this looks like poor retrieval. But the hidden failure happened when the outer model distorted the retrieval task based on its own assumptions. This is surprisingly common, particularly when the model believes it can answer the user’s question directly. It’s a core behavior that this update aims to improve.

Measuring Subagent Delegation Accuracy

To measure delegation accuracy, we labeled a dataset of user questions and follow-ups with the retrieval-critical semantics they contained. Examples include entities, keywords, units of measure, alphanumeric tokens, filters, and structural requests. For follow-up questions, we resolved the full conversational context to produce a specific retrieval request. If the initial user question was “What are the copays for the high-deductible plan?” with follow-up “what about PPO?”, the resolved question would be “What are the copays for the PPO plan?”.

We calculate delegation accuracy as a function of the retrieval-critical constraints preserved in the delegated task. These constraints are conditions in the query that change the scope of what context is relevant (e.g., “Financial reports from 2026”, “Part number XYZ2B manual”). The handoff from the orchestrating agent to the retriever must maintain these constraints. We penalize both dropped constraints and unsupported model-added constraints. Semantic equivalents are not penalized when they represent the same retrieval constraint, for example “grouped by” vs “group by”.
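The post does not give the exact scoring formula, so the sketch below uses one plausible choice as a stand-in: the Jaccard overlap between the labeled constraints and the constraints found in the delegated task. Dropped constraints shrink the intersection and model-added constraints grow the union, so both are penalized. The `ALIASES` table, the function names, and the hand-labeled constraint sets for the swaptions example are all assumptions made for illustration.

```python
# Assumed alias table so that semantic equivalents count as the same constraint.
ALIASES = {"grouped by": "group by"}

def normalize(constraint: str) -> str:
    """Lowercase, collapse whitespace, and map known equivalents to one form."""
    text = " ".join(constraint.lower().split())
    return ALIASES.get(text, text)

def delegation_accuracy(expected, delegated) -> float:
    """Jaccard overlap of constraint sets (a stand-in, not the authors' formula)."""
    exp = {normalize(c) for c in expected}
    got = {normalize(c) for c in delegated}
    if not exp and not got:
        return 1.0
    return len(exp & got) / len(exp | got)

# Hand-labeled constraints for the swaptions example (our labeling).
expected = ["swaptions", "undesignated derivatives", "derivative instruments"]
delegated = ["swaptions", "debt"]  # two constraints dropped, one added

score = delegation_accuracy(expected, delegated)
print(f"Delegation accuracy: {score:.2f}")
```

With this stand-in metric, the distorted swaptions query keeps one of three labeled constraints and adds one unsupported constraint, giving one shared constraint out of four distinct ones.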

Results: Knowledge Base delegation improves retrieval across datasets

To measure impact, we compared medium knowledge bases with an agent using a standalone search index. The evaluation uses a two-agent setup, with a simulated user agent presenting questions to either the knowledge base or the search agent. We measured constraint-preservation rates to determine if preservation correlates with higher evidence recall and better answer quality:

| Turn | Search Agent | Knowledge Base | Δ Acc |
| --- | --- | --- | --- |
| 1 | 0.85 | 1.00 | +0.15 |
| 2 | 0.36 | 0.82 | +0.47 |
| 3 | 0.32 | 0.74 | +0.42 |
| 4 | 0.31 | 0.72 | +0.40 |
| 5 | 0.30 | 0.69 | +0.39 |
| 6 | 0.26 | 0.65 | +0.39 |

Figure 4: Delegation accuracy by conversational turn. Knowledge bases preserve more retrieval-critical semantics than search agents across all six turns, with an average gap of +0.370.

| Query Set | Evidence Recall Lift | Tool Call savings |
| --- | --- | --- |
| Customer | +2.8 | -2.3% |
| MIML | +2.7 | -5.4% |
| SEC | +2.1 | -6.5% |
| BrowseComp-Plus | +2.8 | -12.2% |

Table 2: Comparing a search agent and knowledge base across four datasets, 4,100 scenarios, 3x (gpt-5.4-mini using a medium knowledge base with gpt-5.4)

Across four datasets and more than 4,000 scenarios, Knowledge Base consistently improves evidence recall and answer match against a Search Agent while also reducing the number of tool calls. The largest answer-quality gains and tool call savings appear on more demanding tasks such as those presented by BrowseComp-Plus.

Knowledge Bases Improve Answer Quality

Knowledge bases can respond with either granular extractive content or a synthesized answer. We compared the quality and completeness of synthesized answers between traditional RAG with basic relevance configurations and knowledge bases. The test spanned approximately 3,300 enterprise queries across domains and languages.

The evaluated answer synthesis retrieval configurations were:

  • BM25: keyword-based retrieval.
  • Hybrid: BM25 plus vector retrieval.
  • Knowledge Bases:
    • Minimal: Hybrid retrieval with semantic reranking, irrelevant content filtering, and cross-index content merging.
    • Low: Minimal plus query decomposition via a single agentic retrieval turn.
    • Medium: Performs up to two iterative agentic retrieval turns.

Each additional tier and system component improves the RAG stack, producing strong improvements in answer quality that are visible in metrics and typical usage. One of the main benefits of using a knowledge base over a simple search is the ability to scale retrieval effort with a single setting. Overall, when compared with commonly used BM25 search, the complete system lowers no-answer rate by 94.5% while improving evidence recall by 37.9%.

| Config | Evidence Recall | No-Answer | LLM calls |
| --- | --- | --- | --- |
| BM25 | 57.5 | 21.3 | 1.0 |
| Hybrid | 67.1 | 13.7 | 1.0 |
| KB Min | 72.1 | 7.6 | 2.1 |
| KB Low | 76.8 | 2.3 | 3.0 |
| KB Med | 79.3 | 1.0 | 5.4 |

Table 3. Absolute metric values for BM25, Hybrid, and knowledge bases at minimal, low, and medium retrieval effort, with gpt-5.4-mini used for all LLM calls.
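The relative figures quoted for the complete system against BM25 can be recomputed from the rounded values in Table 3, along with the absolute recall gained at each step up the stack. This is a sketch with our own variable names. Since the table is rounded to one decimal, the recomputed no-answer reduction comes out near 95% rather than exactly matching the 94.5% quoted, which was presumably derived from unrounded rates.

```python
# (config, evidence recall, no-answer rate, LLM calls), copied from Table 3.
TABLE_3 = [
    ("BM25", 57.5, 21.3, 1.0),
    ("Hybrid", 67.1, 13.7, 1.0),
    ("KB Min", 72.1, 7.6, 2.1),
    ("KB Low", 76.8, 2.3, 3.0),
    ("KB Med", 79.3, 1.0, 5.4),
]

def relative_change(before: float, after: float) -> float:
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100.0

baseline, complete = TABLE_3[0], TABLE_3[-1]
recall_gain = relative_change(baseline[1], complete[1])
no_answer_drop = -relative_change(baseline[2], complete[2])

# Absolute evidence recall gained by each successive configuration.
step_gains = [
    (later[0], round(later[1] - earlier[1], 1))
    for earlier, later in zip(TABLE_3, TABLE_3[1:])
]

print(f"Recall gain, BM25 -> KB Med: {recall_gain:.1f}%")
print(f"No-answer reduction, BM25 -> KB Med: {no_answer_drop:.1f}%")
for name, gain in step_gains:
    print(f"{name}: +{gain} recall over the previous row")
```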

Heterogeneous Retrieval

Many enterprise scenarios involve questions across structured and unstructured content. To evaluate our new structured knowledge sources, we issued a series of challenging queries that required combining structured and unstructured content. The table below shows that knowledge bases can effectively combine content across different types of knowledge sources to boost evidence recall:

| New knowledge source type | Structured Only | Unstructured Only | Combined |
| --- | --- | --- | --- |
| MCP | 31.2 | 48.4 | 54.3 |
| Fabric Ontology | 18.6 | 36.1 | 42.4 |
| Fabric Data Agent | 17.1 | 36.1 | 48.8 |
| Indexed SQL | 71.0 | 70.0 | 83.5 |

Table 4. Evidence recall for HFWIKI (MCP), FABRICIQ-SEC (Fabric), SEC-SQL-Hybrid (SQL)
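One way to read Table 4 is to ask how much the combined configuration adds over the better of the two single-source options. The sketch below computes that gain in evidence recall points from the table values; the `combined_gain` helper is ours.

```python
# (structured only, unstructured only, combined) evidence recall, from Table 4.
TABLE_4 = {
    "MCP": (31.2, 48.4, 54.3),
    "Fabric Ontology": (18.6, 36.1, 42.4),
    "Fabric Data Agent": (17.1, 36.1, 48.8),
    "Indexed SQL": (71.0, 70.0, 83.5),
}

def combined_gain(structured: float, unstructured: float, combined: float) -> float:
    """Recall points gained by combining sources over the best single source."""
    return round(combined - max(structured, unstructured), 1)

gains = {name: combined_gain(*row) for name, row in TABLE_4.items()}
for name, gain in gains.items():
    print(f"{name}: +{gain} points over the best single source")
```

In every row the combined configuration beats both single-source options, which is the pattern the table is meant to show.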

Get Started

This new functionality is available today in the latest preview API version. To try it, create a knowledge base, connect one or more knowledge sources, and call the retrieve API with your preferred retrieval reasoning effort. You can learn more about knowledge bases and agentic retrieval here.

Appendix

We evaluated our new release on several benchmark query sets spanning single-source and routing-heavy workloads. Two of them (SEC and MIML) include both a single knowledge source and a routing variant, giving us five single knowledge source configurations and two routing configurations in total. Three of them (HFWIKI, FABRICIQ-SEC, SEC-SQL-Hybrid) evaluate a knowledge base's capability on heterogeneous knowledge source retrieval:

  • Customer datasets: customer-provided corporate and member-document collections in English, covering domains such as oil and gas corporate reports and health-insurance member documents. We evaluate them as single knowledge source workloads over chunked PDF indexes.
  • DAYI: a Chinese medical QA corpus in Simplified Chinese, evaluated as a single knowledge source workload over a chunked PDF index.
  • FDA: FDA drug-label and clinical documents in English, evaluated as a single knowledge source workload over a chunked PDF index.
  • SEC: SEC filings of US public companies in English, evaluated in two variants: a single knowledge source setup spanning all sectors, and a routing setup with one KS per GICS sector.
  • MIML: a multi-industry, multi-language corporate document benchmark in English, French, and Simplified Chinese. We evaluate both single knowledge source variants restricted to one language-industry slice and a routing variant spanning multiple languages and industries.
  • FreshQA: a freshness-sensitive benchmark in English, evaluated as a web-routing workload where questions often require up-to-date external evidence.
  • HFWIKI: a heterogeneous benchmark in English combining Hugging Face model documentation and Wikipedia.
  • FABRICIQ-SEC: a heterogeneous benchmark in English combining financial loan records and SEC filings.
  • SEC-SQL-Hybrid: a heterogeneous benchmark in English that requires both structured SEC metadata and narrative SEC filings.
  • BrowseComp-Plus: we use the 830 human-verified BrowseComp-Plus queries and index the full corpus, including distractor documents, in 512-token chunks embedded with OpenAI text-embedding-3-large. To support continuous evidence-recall measurement, we decompose each question into multiple atomic factoid evidence nuggets. Evaluations use a two-agent configuration (a neutral user plus a search agent) with a standard system prompt and per-question effort capped at 40 tool calls.
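For readers unfamiliar with fixed-size chunking, the sketch below splits a token sequence into 512-token chunks. It is a simplification under stated assumptions: whitespace-separated words stand in for model tokens, there is no overlap between chunks, and the `chunk_tokens` helper is ours. The post does not specify the tokenizer or whether chunks overlap.

```python
from typing import List, Sequence

def chunk_tokens(tokens: Sequence[str], chunk_size: int = 512) -> List[List[str]]:
    """Split tokens into consecutive, non-overlapping chunks of at most chunk_size."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [
        list(tokens[start:start + chunk_size])
        for start in range(0, len(tokens), chunk_size)
    ]

# Assumption: whitespace splitting as a stand-in for a real tokenizer.
document = " ".join(f"word{i}" for i in range(1200))
chunks = chunk_tokens(document.split())

print([len(chunk) for chunk in chunks])
```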

Metrics

| Metric | What it measures |
| --- | --- |
| Evidence Recall (Evidence Coverage) | Share of ground-truth nuggets that are covered by the documents the agent retrieved. |
| Answer Completeness | Share of ground-truth nuggets that are present in the generated answer. |
| No Answer Rate | Whether the model declined to answer / returned an "I don't know"-style response. |

Updated Jun 03, 2026
Version 3.0