Publications

SemBench: A Benchmark for Semantic Query Processing Engines

Published in VLDB, 2026

We present a benchmark targeting a novel class of systems: semantic query processing engines. These systems rely on the generative and reasoning capabilities of state-of-the-art large language models (LLMs). They extend SQL with semantic operators, configured with natural-language instructions and evaluated via LLMs, which enable users to perform a wide range of operations on multimodal data. Our benchmark introduces diversity across three key dimensions: scenarios, modalities, and operators. Its scenarios range from movie review analysis to medical question answering. Within these scenarios, we cover different data modalities, including images, audio, and text. Finally, the queries involve a diverse set of operators, including semantic filter, join, mapping, ranking, and classification operators. We evaluate our benchmark on three academic systems (LOTUS, Palimpzest, and ThalamusDB) and one industrial system, Google BigQuery. Although these results reflect a snapshot of systems under continuous development, our study offers crucial insights into their current strengths and weaknesses, illuminating promising directions for future research.
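
The semantic-operator idea can be sketched in a few lines of Python. Everything here is illustrative: the operator names and the `judge` callback are hypothetical stand-ins for a real LLM call, not the API of any of the evaluated systems.

```python
def sem_filter(rows, instruction, llm_judge):
    """Keep rows for which the LLM judge answers yes to the instruction."""
    return [row for row in rows if llm_judge(instruction, row)]

def sem_map(rows, instruction, llm_transform):
    """Apply a natural-language transformation to every row."""
    return [llm_transform(instruction, row) for row in rows]

if __name__ == "__main__":
    reviews = [
        {"text": "A triumph of modern cinema."},
        {"text": "I walked out after twenty minutes."},
    ]
    # Stub judge: pretend the LLM flags positive reviews.
    judge = lambda instr, row: "triumph" in row["text"]
    positive = sem_filter(reviews, "Is this review positive?", judge)
    print(len(positive))
```

In a real engine the judge would be an LLM invocation, and operators like these would be composed inside SQL queries over multimodal columns.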

Recommended citation: Lao, Jiale; Zimmerer, Andreas; Ovcharenko, Olga; Cong, Tianji; Russo, Matthew; Vitagliano, Gerardo; Cochez, Michael; Özcan, Fatma; Gupta, Gautam; Hottelier, Thibaud; Jagadish, HV; Kissel, Kris; Schelter, Sebastian; Kipf, Andreas; Trummer, Immanuel. (2026). "SemBench: A Benchmark for Semantic Query Processing Engines." VLDB. https://arxiv.org/pdf/2511.01716

Abacus: A Cost-Based Optimizer for Semantic Operator Systems

Published in VLDB, 2026

LLMs enable an exciting new class of data processing applications over large collections of unstructured documents. Several new programming frameworks have enabled developers to build these applications by composing them out of semantic operators: a declarative set of AI-powered data transformations with natural language specifications. These include LLM-powered maps, filters, joins, etc. used for document processing tasks such as information extraction, summarization, and more. While systems of semantic operators have achieved strong performance on benchmarks, they can be difficult to optimize. An optimizer for this setting must determine how to physically implement each semantic operator in a way that optimizes the system globally. Existing optimizers are limited in the number of optimizations they can apply, and most (if not all) cannot optimize system quality, cost, or latency subject to constraint(s) on the other dimensions. In this paper we present Abacus, an extensible, cost-based optimizer that searches for the best implementation of a semantic operator system given a (possibly constrained) optimization objective. Abacus estimates operator performance by leveraging a minimal set of validation examples, prior beliefs about operator performance, and/or an LLM judge. We evaluate Abacus on document processing workloads in the biomedical and legal domains (BioDEX; CUAD) and multi-modal question answering (MMQA). We demonstrate that, on average, systems optimized by Abacus achieve 6.7%-39.4% better quality and are 10.8x cheaper and 3.4x faster than the next best system.
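
The constrained search at the heart of a cost-based optimizer can be illustrated with a toy example (hypothetical plan names and numbers; the real system estimates quality and cost from validation examples, priors, or an LLM judge): among candidate physical plans, pick the cheapest one whose estimated quality meets a floor.

```python
from dataclasses import dataclass

@dataclass
class Plan:
    name: str
    est_quality: float  # expected output quality in [0, 1]
    est_cost: float     # expected dollars per 1K records

def best_plan(plans, min_quality):
    """Cheapest plan whose estimated quality satisfies the constraint."""
    feasible = [p for p in plans if p.est_quality >= min_quality]
    if not feasible:
        raise ValueError("no plan meets the quality constraint")
    return min(feasible, key=lambda p: p.est_cost)

plans = [
    Plan("frontier model for every operator", 0.92, 8.00),
    Plan("small model + frontier fallback", 0.88, 1.10),
    Plan("small model only", 0.71, 0.40),
]
print(best_plan(plans, min_quality=0.85).name)
```

The same skeleton works with the objective and constraint swapped (e.g., maximize quality subject to a cost budget); the hard part, which Abacus addresses, is producing trustworthy estimates for each candidate.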

Recommended citation: Russo, Matthew; Liu, Chunwei; Sudhir, Sivaprasad; Vitagliano, Gerardo; Cafarella, Michael; Kraska, Tim; Madden, Samuel. (2026). "Abacus: A Cost-Based Optimizer for Semantic Operator Systems." VLDB. https://arxiv.org/pdf/2505.14661

KramaBench: A Benchmark for AI Systems on Data-to-Insight Pipelines over Data Lakes

Published in ICLR, 2026

Constructing real-world data-to-insight pipelines often involves data extraction from data lakes, data integration across data sources, and diverse operations from data cleaning to analysis. The design and implementation of data science pipelines require domain knowledge, technical expertise, and even project-specific insights. AI systems have shown remarkable reasoning, coding, and understanding capabilities. However, it remains unclear to what extent these capabilities translate into successful design and execution of such complex pipelines. We introduce KRAMABENCH: a benchmark composed of 104 manually-curated real-world data science pipelines spanning 1700 data files from 24 data sources in 6 different domains. We show that these pipelines test the end-to-end capabilities of AI systems on data processing, requiring data discovery, wrangling and cleaning, efficient processing, statistical reasoning, and orchestrating data processing steps given a high-level task.

Recommended citation: Lai, Eugenie; Vitagliano, Gerardo; Zhang, Ziyu; Chabra, Om; Sudhir, Sivaprasad; Zeng, Anna; Zabreyko, Anton A.; Li, Chenning; Kossmann, Ferdi; Ding, Jialin; Chen, Jun; Markakis, Markos; Russo, Matthew; Wang, Weiyang; Wu, Ziniu; Cafarella, Michael J.; Cao, Lei; Madden, Samuel; Kraska, Tim. (2026). "KramaBench: A Benchmark for AI Systems on Data-to-Insight Pipelines over Data Lakes." ICLR. https://arxiv.org/pdf/2506.06541

Deep Research is the New Analytics System: Towards Building the Runtime for AI-Driven Analytics

Published in CIDR, 2026

With advances in large language models (LLMs), researchers are creating new systems that can perform AI-driven analytics over large unstructured datasets. Recent work has explored executing such analytics queries using semantic operators – a declarative set of AI-powered data transformations with natural language specifications. However, even when optimized, these operators can be expensive to execute on millions of records, and their iterator execution semantics make them ill-suited for interactive data analytics tasks. In another line of work, Deep Research systems have demonstrated an ability to answer natural language question(s) over large datasets. These systems use one or more LLM agent(s) to plan their execution, process the dataset(s), and iteratively refine their answer. However, these systems do not explicitly optimize their query plans, which can lead to inefficient plan execution. For AI-driven analytics to excel, we need a runtime that combines the optimized execution of semantic operators with the flexibility and dynamism of Deep Research systems. As a first step towards this vision, we build a prototype which enables Deep Research agents to write and execute optimized semantic operator programs. We evaluate our prototype and demonstrate that it can outperform a handcrafted semantic operator program and open Deep Research systems on two basic queries.

Recommended citation: Russo, Matthew; Kraska, Tim. (2026). "Deep Research is the New Analytics System: Towards Building the Runtime for AI-Driven Analytics." CIDR. https://arxiv.org/pdf/2509.02751

Palimpzest: Optimizing AI-Powered Analytics with Declarative Query Processing

Published in CIDR, 2025

A long-standing goal of data management systems has been to build systems which can compute quantitative insights over large corpora of unstructured data in a cost-effective manner. Until recently, it was difficult and expensive to extract facts from company documents, data from scientific papers, or metrics from image and video corpora. Today’s models can accomplish these tasks with high accuracy. However, a programmer who wants to answer a substantive AI-powered query must orchestrate large numbers of models, prompts, and data operations. For even a single query, the programmer has to make a vast number of decisions such as the choice of model, the right inference method, the most cost-effective inference hardware, the ideal prompt design, and so on. The optimal set of decisions can change as the query changes and as the rapidly-evolving technical landscape shifts. In this paper we present PALIMPZEST, a system that enables anyone to process AI-powered analytical queries simply by defining them in a declarative language. The system uses its cost optimization framework to implement the query plan with the best trade-offs between runtime, financial cost, and output data quality. We describe the workload of AI-powered analytics tasks, the optimization methods that PALIMPZEST uses, and the prototype system itself. We evaluate PALIMPZEST on tasks in Legal Discovery, Real Estate Search, and Medical Schema Matching. We show that even our simple prototype offers a range of appealing plans, including one that is 3.3x faster and 2.9x cheaper than the baseline method, while also offering better data quality. With parallelism enabled, PALIMPZEST can produce plans with up to a 90.3x speedup at 9.1x lower cost relative to a single-threaded GPT-4 baseline, while obtaining an F1-score within 83.5% of the baseline. These gains require no additional work from the user.
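
The declarative style described here can be pictured roughly as follows (a hypothetical, simplified API, not Palimpzest's actual interface): the user builds a lazy logical plan stating what to compute, and the system is then free to choose models, prompts, and hardware when it physically executes it.

```python
class Dataset:
    """Toy lazy dataset: records logical operations without executing them.

    A cost optimizer would later map each logical op to a physical
    implementation (model choice, prompt, batching, hardware)."""
    def __init__(self, source, ops=()):
        self.source, self.ops = source, list(ops)

    def filter(self, condition):
        return Dataset(self.source, self.ops + [("filter", condition)])

    def extract(self, fields):
        return Dataset(self.source, self.ops + [("extract", fields)])

    def logical_plan(self):
        return [name for name, _ in self.ops]

emails = (Dataset("company-emails")
          .filter("mentions a meeting about energy trading")
          .extract(["sender", "meeting_date"]))
print(emails.logical_plan())
```

The key property is that nothing runs until the plan is complete, so the optimizer sees the whole query before committing to any expensive model calls.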

Recommended citation: Liu, Chunwei; Russo, Matthew; Cafarella, Michael; Cao, Lei; Chen, Peter Baile; Chen, Zui; Franklin, Michael; Kraska, Tim; Madden, Samuel; Shahout, Rana; Vitagliano, Gerardo. (2025). "Palimpzest: Optimizing AI-Powered Analytics with Declarative Query Processing." CIDR. https://arxiv.org/pdf/2405.14696

Accelerating Aggregation Queries on Unstructured Streams of Data

Published in VLDB, 2023

Analysts and scientists are interested in querying streams of video, audio, and text to extract quantitative insights. For example, an urban planner may wish to measure congestion by querying the live feed from a traffic camera. Prior work has used deep neural networks (DNNs) to answer such queries in the batch setting. However, much of this work is not suited for the streaming setting because it requires access to the entire dataset before a query can be submitted or is specific to video. Thus, to the best of our knowledge, no prior work addresses the problem of efficiently answering queries over multiple modalities of streams. In this work we propose InQuest, a system for accelerating aggregation queries on unstructured streams of data with statistical guarantees on query accuracy. InQuest leverages inexpensive approximation models (“proxies”) and sampling techniques to limit the execution of an expensive high-precision model (an “oracle”) to a subset of the stream. It then uses the oracle predictions to compute an approximate query answer in real-time. We theoretically analyze InQuest and show that the expected error of its query estimates converges on stationary streams at a rate inversely proportional to the oracle budget. We evaluate our algorithm on six real-world video and text datasets and show that InQuest achieves the same root mean squared error (RMSE) as two streaming baselines with up to 5.0x fewer oracle invocations. We further show that InQuest can achieve up to 1.9x lower RMSE at a fixed number of oracle invocations than a state-of-the-art batch setting algorithm.
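
The proxy/oracle division of labor can be sketched as follows. This is a simplified batch analogue of the idea, not InQuest's streaming algorithm: cheap proxy scores partition the records into strata, and the expensive oracle is invoked only on a sample drawn from each stratum, with the sample mean scaled back up by the stratum size.

```python
import random

def stratified_estimate(records, proxy, oracle, thresholds, budget, seed=0):
    """Estimate sum(oracle(r)) while calling the oracle at most ~budget times."""
    rng = random.Random(seed)
    # Assign each record to a stratum based on its cheap proxy score.
    strata = [[] for _ in range(len(thresholds) + 1)]
    for r in records:
        s = sum(proxy(r) >= t for t in thresholds)  # stratum index
        strata[s].append(r)
    per_stratum = max(1, budget // len(strata))
    total = 0.0
    for stratum in strata:
        if not stratum:
            continue
        sample = rng.sample(stratum, min(per_stratum, len(stratum)))
        mean = sum(oracle(r) for r in sample) / len(sample)
        total += mean * len(stratum)  # scale sample mean to stratum size
    return total
```

For example, in the traffic-camera setting the oracle might be a large detector counting cars per frame and the proxy a tiny model approximating that count; a larger oracle budget shrinks the estimate's variance, consistent with the convergence behavior the paper proves for the streaming case.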

Recommended citation: Russo, Matthew; Hashimoto, Tatsunori; Kang, Daniel; Sun, Yi; Zaharia, Matei. (2023). "Accelerating Aggregation Queries on Unstructured Streams of Data." VLDB. 16(11). https://arxiv.org/pdf/2308.09157