Towards a shared infrastructure for assembling web search engines

Web search engines are essential for navigating the web. Suppose we look at the web as a service provided by public utility companies, similar to electricity, water, or telephone service. To make sure that everyone has access to the web, public utility companies have to be subject to public control and regulation. Without regulation, a single firm may abuse its natural monopoly, for instance by raising prices, by letting the service deteriorate, by delivering unequal quality to different groups, or by pushing advertisements and propaganda. Equity requires that all citizens can access the web at a fair price and at a sufficient level of quality, via transparent, well-regulated, community-based or government-based control.

OpenWebSearch.eu is a European Union funded project that researches what a transparent, well-regulated, community-based web search engine would look like. The project builds the index for a web search engine on open infrastructure that is distributed over four data centers in four different European countries. The data centers cooperatively crawl the web, cooperatively preprocess and enrich the web data, and cooperatively build an inverted index that is shared with the world. We envision a future where a search engine is “assembled” from parts provided by many different companies, based on public standards. I will discuss public standards for search engine indexes, such as the common index file format (CIFF) and approaches based on open data formats like Parquet and open cloud object storage like S3. Furthermore, I will show how researchers can query the Open Web Index remotely using a low-cost local machine, without the need to download the full index, even though it currently consists of more than 10 billion web pages.

Invited talk to be presented at the European Conference on Information Retrieval (ECIR 2026) IR 4Good track on 30 March 2026 in Delft, the Netherlands

Open Web Indexes for Remote Querying

by Gijs Hendriksen, Djoerd Hiemstra, and Arjen P. de Vries

We propose to redesign the access to Web-scale indexes. Instead of using custom search engine software and hiding access behind an API or a user interface, we store the inverted file in a standard, open source file format (Parquet) on publicly accessible (and cheap) object storage. Users can perform retrieval by fetching the relevant postings for the query terms and performing ranking locally. By using standard data formats and cloud infrastructure, we (a) natively support a wide range of downstream clients, and (b) can directly benefit from improvements in analytical query processing engines. We show the viability of our approach through a series of experiments using the ClueWeb corpora. While our approach (naturally) has a higher latency than dedicated search APIs, we show that we can still obtain results in reasonable time (usually within 10-20 seconds). Therefore, we argue that the increased accessibility and decreased deployment costs make this a suitable setup for cooperation in IR research by sharing large indexes publicly.
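The local ranking step can be sketched in plain Python. Assuming the postings for the query terms have already been fetched from the remote Parquet files, classic BM25 scoring is all the client needs to run; the postings, document lengths, and collection statistics below are made-up toy values, not Open Web Index data.

```python
import math

# Toy postings "fetched" for the query terms only; in the real setup these
# would be read from Parquet postings files on cloud object storage.
# postings[term] = list of (doc_id, term_frequency)
postings = {
    "open": [(1, 3), (2, 1), (4, 2)],
    "search": [(1, 1), (3, 4), (4, 1)],
}
doc_lengths = {1: 100, 2: 50, 3: 80, 4: 120}  # toy document lengths
N = 1000                                      # assumed collection size
AVGDL = 90.0                                  # assumed average document length

def bm25_rank(query_terms, k1=1.2, b=0.75):
    """Rank documents locally from the fetched postings with classic BM25."""
    scores = {}
    for term in query_terms:
        plist = postings.get(term, [])
        idf = math.log((N - len(plist) + 0.5) / (len(plist) + 0.5) + 1)
        for doc_id, tf in plist:
            norm = tf * (k1 + 1) / (
                tf + k1 * (1 - b + b * doc_lengths[doc_id] / AVGDL))
            scores[doc_id] = scores.get(doc_id, 0.0) + idf * norm
    return sorted(scores.items(), key=lambda item: -item[1])

ranking = bm25_rank(["open", "search"])
print(ranking)
```

In the actual setup, the hard-coded `postings` dictionary would be replaced by a remote fetch of the relevant Parquet row groups; only the postings for the query terms cross the network, which is why ranking remains feasible on a low-cost local machine.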

To be presented at the 48th European Conference on Information Retrieval (ECIR 2026), 30 March to 1 April 2026 in Delft, The Netherlands.

[download pdf]

The 3rd International Workshop on Open Web Search (WOWS)

We organize the third International Workshop on Open Web Search (#WOWS) at ECIR 2026 with two calls for contributions: The first call targets scientific contributions on collaborative search engine building, including crawling, deployment, evaluation, and use of the web as a resource by researchers and innovators. The second call is for the WOWS-Eval shared task, which aims to gain practical experience with collaborative, cooperative search engine evaluation by focusing on using the Open Web Index (OWI) for retrieval and retrieval-augmented generation (RAG) experiments.

https://opensearchfoundation.org/events-osf/wows2026/

IRRJ Volume 1 Number 2

In the second issue of IRRJ, Paul Kantor writes an editorial arguing for a more critical adoption of generative AI in information retrieval (IR). He puts his concerns under three distinct headings: consistency, confidence, and completeness. Kantor is the founder of the predecessor Information Retrieval Journal, and together with co-founder Stephen Robertson, his advice helped to make IRRJ’s first year a great success. Also in this issue: a survey of Inclusive Information Access by Yue Zheng and colleagues; Catarina Pires and colleagues discussing Multimodal Medical Case Retrieval; Madhukar Dwivedi and Jaap Kamps presenting a reproducibility study of Identifying Passages for Due Diligence; Massimo Melucci’s latest study of achieving fair rankings; Meng Yuan exploring the Correspondences Between Topic Models and Text Embeddings; and finally, Bhaskar Mitra’s provocation on the role of information retrieval in emancipatory struggles. We hope this work will deepen your knowledge of IR. Thank you for reading IRRJ.

Read the full issue at: https://irrj.org/issue/view/vol1no2

Chris Kamphuis defends PhD thesis on Exploring Relations and Graphs for Information Retrieval

by Chris Kamphuis

Finding relevant information in a large collection of documents can be challenging, especially when only text is considered when determining relevance. This research leverages graph data to express information needs that consider more information than just text data. In parts of our work, instead of using inverted indexes to represent the data, we use database management systems to store it.
First, we show that relational database systems are suited for retrieval experiments. A prototype system we built implements many proposed improvements to the BM25 ranking algorithm. In a large-scale reproduction study, we compare these improvements and find that the differences in effectiveness are smaller than we would expect, given the literature. We can easily change between versions of BM25 by rewriting the SQL query slightly, validating the usefulness of relational databases for reproducible IR research.
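The idea of swapping BM25 variants by rewriting SQL can be illustrated with a tiny self-contained example. This is a sketch in SQLite, not the thesis prototype; the schema, data, and constants below are made up for illustration.

```python
import math
import sqlite3

# Illustrative relational inverted index: a dictionary table, a postings
# table, and a document-length table.
con = sqlite3.connect(":memory:")
con.create_function("LOG", 1, math.log)  # expose log() to SQL
con.executescript("""
    CREATE TABLE dict(termid INTEGER, term TEXT, df INTEGER);
    CREATE TABLE terms(termid INTEGER, docid INTEGER, tf INTEGER);
    CREATE TABLE docs(docid INTEGER, len INTEGER);
""")
con.executemany("INSERT INTO dict VALUES (?,?,?)",
                [(1, "graph", 2), (2, "retrieval", 3)])
con.executemany("INSERT INTO terms VALUES (?,?,?)",
                [(1, 10, 2), (1, 11, 1), (2, 10, 1), (2, 11, 3), (2, 12, 1)])
con.executemany("INSERT INTO docs VALUES (?,?)", [(10, 20), (11, 40), (12, 30)])

# BM25 as a single SQL query (k1 = 1.2, b = 0.75, N = 3, avgdl = 30).
# A different BM25 variant amounts to rewriting only the scoring expression.
query = """
SELECT t.docid,
       SUM(LOG((3 - d.df + 0.5) / (d.df + 0.5) + 1) *
           t.tf * 2.2 / (t.tf + 1.2 * (0.25 + 0.75 * doc.len / 30.0))) AS score
FROM terms t
JOIN dict d ON t.termid = d.termid
JOIN docs doc ON t.docid = doc.docid
WHERE d.term IN ('graph', 'retrieval')
GROUP BY t.docid
ORDER BY score DESC
"""
ranking = con.execute(query).fetchall()
print(ranking)
```

Switching to another BM25 version only requires editing the scoring expression inside `SUM(...)`; the surrounding query, schema, and data stay the same, which is what makes such experiments easy to reproduce.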
Then, we extend the data model to a graph data model. Using a graph data model, we can include more diverse data than just text. We show that we can more easily express complex information needs with a corresponding graph query language than with a relational language. This model is built on top of an embedded database system, allowing fast materialization of output data and its use in further steps.
One of the aspects we capture in the graph is information about entities. We use the Radboud Entity Linking (REL) system to connect entity information with documents. To annotate a large document collection with REL, we first improved its efficiency. After these improvements, we used REL to create annotations for the MS MARCO document and passage collections. Using these annotations, we can significantly improve recall for harder MS MARCO queries. The entities also power an interactive demonstration that uses their geographical data.

[more information]

Clause-Driven Automated Grading of SQL’s DDL and DML Statements

by Benard Wanjiru, Patrick van Bommel and Djoerd Hiemstra

Automated grading systems for SQL courses can significantly reduce instructor workload while ensuring consistency and objectivity in assessment. At our university, an automated SQL grading tool has become essential for evaluating assignments. Initially, we focused on grading Data Query Language (SELECT) statements, which constitute the core content of assignments in our first-year computer science course. SELECT statements produce a results table, which makes automatic grading relatively easy. However, other SQL statements, such as CREATE TABLE, INSERT, DELETE, and UPDATE, do not produce a results table, which makes them more difficult to grade. Recognizing the need to cover broader course material, we have extended our system to evaluate advanced Data Definition Language (DDL) and Data Manipulation Language (DML) statements. In this paper, we describe our approach to automated DDL/DML grading and illustrate our method of clause-driven tailored feedback generation. We explain how our system generates precise, targeted feedback based on specific SQL clauses or components. In addition, we present a practical example to highlight the benefits of our approach. Finally, we benchmark our grading tool against existing systems. Our extended tool can parse and provide feedback on most student SQL submissions. It consistently provides targeted feedback, generating nearly one suggestion per error. It generates shorter feedback for simpler DML queries, while more complex syntax leads to longer feedback. It can pinpoint precise SQL errors. Lastly, it generates precise and actionable suggestions, with each message directly tied to the specific component that caused the error.
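As a rough sketch of what clause-driven feedback could look like (the actual tool uses a proper SQL parser; the naive regex splitter and the message texts below are simplified, hypothetical stand-ins), feedback can be tied to the specific clause in which a submission differs from a reference solution:

```python
import re

# Clauses recognized by this toy splitter, in canonical order.
CLAUSES = ["SELECT", "FROM", "WHERE", "GROUP BY", "HAVING", "ORDER BY"]

def split_clauses(sql):
    """Split a SELECT statement into its clauses (naive, for illustration)."""
    pattern = "(" + "|".join(CLAUSES) + ")"
    parts = re.split(pattern, sql.strip().rstrip(";"), flags=re.IGNORECASE)
    clauses, i = {}, 1
    while i < len(parts) - 1:
        clauses[parts[i].upper()] = parts[i + 1].strip()
        i += 2
    return clauses

def feedback(student_sql, reference_sql):
    """One targeted suggestion per clause that differs from the reference."""
    student = split_clauses(student_sql)
    reference = split_clauses(reference_sql)
    messages = []
    for clause, expected in reference.items():
        got = student.get(clause)
        if got is None:
            messages.append(f"Missing {clause} clause.")
        elif got.lower() != expected.lower():
            messages.append(f"Check your {clause} clause: found '{got}'.")
    return messages

msgs = feedback("SELECT name FROM students",
                "SELECT name FROM students WHERE grade > 5")
print(msgs)
```

Because every message is generated from a single mismatching clause, the feedback stays short for simple queries and grows only with the number of components that are actually wrong, mirroring the behavior described above.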

To be presented at the SIGCSE Technical Symposium on Computer Science Education (SIGCSE 2026) on 18-21 February 2026 in St. Louis, United States of America.

Welcome to Databases 2025!

Welcome to Part B, Databases! We will resume Tuesday 4 November with a lecture at 15:30h. in HG00.304

The Databases part contains mandatory, individual quizzes, for which the following honour code applies:

  • You do not share the solutions;
  • The solutions to the quizzes should be your own work;
  • You do not post the quizzes, nor the solutions anywhere online;
  • You do not use instruction-tuned large language models like GitHub Copilot or ChatGPT;
  • You are allowed, and encouraged, to discuss the quizzes and to ask your fellow students clarifying questions. Please use the Brightspace Discussion Forum to reach out to me, the teaching assistants, and your fellow students.

New this year are the online Socoles SQL exercises. Please register with the Socoles Autograder (see the previous announcement). Socoles automatically gives feedback on open questions that require SQL solutions, and helps us grade the assignments of about 200 students in the course. Of course, you will get human feedback too, during the tutorials on Thursday mornings.

Wishing you a fruitful Part B!
Best wishes,  Djoerd Hiemstra and Benard Wanjiru

Fatemeh Sarvi defends PhD thesis on Learning to Rank for e-Commerce Search

by Fatemeh Sarvi

Ranking is at the core of information retrieval, from search engines to recommendation systems. The objective of a ranking model is to order items based on their degree of relevance to the user’s information need, which is often expressed by a textual query. In product search, customers search through numerous options using brief, unstructured phrases, and the goal is to find not only relevant but also appealing products that match their preferences and lead to purchases. On the other side are the providers of the products, who expect the ranking model to expose their items to customers fairly. These complications introduce unique characteristics that set product search apart from other types of search.
This thesis investigates the specific challenges of applying learning to rank models in product search and presents methods to improve relevance, fairness, and effectiveness in this setting. We start by focusing on query-product matching based on textual data, as traditional information retrieval methods rely heavily on text to determine relevance. The vocabulary gap, the difference between the language used in queries and the terms found in product descriptions, has been shown to be larger in product search, mainly due to the limited and unstructured nature of queries and product descriptions. In Chapter 2, we conduct a comprehensive evaluation of state-of-the-art supervised learning to match models, comparing their performance in product search. Our findings identify models that balance both accuracy and efficiency, offering practical insights for real-world applications.
Next, in Chapters 3 and 4 we address fairness in ranking on two-sided platforms, where the goal is to satisfy both groups of product search users at the same time. Accurate exposure estimation is crucial to achieve this balance. To this end, we introduce the phenomenon of outlierness in ranking as a factor that can influence exposure-based fair ranking algorithms. Outlier items are products that deviate from others in a ranked list due to distinct presentational features. We show empirically that these items attract more user attention and can impact the exposure distribution in a list. To account for this effect, we propose OMIT, a method that reduces outlierness without compromising user utility or fairness towards providers. In the next chapter, we investigate whether outlier items influence user clicks. We introduce outlier bias as a new type of click bias, and propose OPBM, an outlier-aware click model designed to account for both outlier and position bias. Our experiments show that, in the worst case, OPBM performs similarly to the well-known position-based model, making it a more reliable choice.
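To make the notion of exposure concrete: under a common position-based examination model, the exposure an item receives is often assumed to decay with rank, for instance proportional to 1/log2(rank + 1), the discount familiar from DCG. The sketch below sums this exposure per provider group for one ranked list; the items, groups, and the choice of discount are illustrative assumptions, not the OMIT or OPBM methods themselves.

```python
import math

def exposure_per_group(ranking, groups):
    """Sum position-based exposure per provider group for one ranked list."""
    totals = {}
    for rank, item in enumerate(ranking, start=1):
        exposure = 1.0 / math.log2(rank + 1)  # assumed examination discount
        group = groups[item]
        totals[group] = totals.get(group, 0.0) + exposure
    return totals

# Hypothetical items mapped to two provider groups.
groups = {"p1": "A", "p2": "A", "p3": "B", "p4": "B"}
print(exposure_per_group(["p1", "p3", "p2", "p4"], groups))
```

Even in this tiny example the group holding ranks 1 and 3 accumulates noticeably more exposure than the group holding ranks 2 and 4, which is why exposure-based fairness methods reason about such per-group totals rather than raw rank positions.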
Finally, in Chapter 5 we explore how different presentational features influence user attention and perception of outliers in product search results. Through visual search and eye-tracking experiments, along with visual saliency modeling, we identify user scanning patterns and determine the role of bottom-up and top-down factors in guiding attention and shaping the perception of outliers.

[download pdf]

Team OpenWebSearch at LongEval

Using Historical Data for Scientific Search

by Daria Alexander, Maik Fröbe, Gijs Hendriksen, Matthias Hagen, Djoerd Hiemstra, Martin Potthast and Arjen de Vries

We describe the submissions of the OpenWebSearch team for the CLEF 2025 LongEval Sci-Retrieval track. Our approaches explore how historical data can be reused to build effective rankings. The Sci-Retrieval track uses click data and documents from the CORE search engine. We start all our submissions from rankings of the CORE search engine that we crawled for all queries of the track. This has two motivations: first, we hypothesize that a good practical search engine should only make minor improvements to the ranking at a time (i.e., we would like to make only small adjustments to the production ranking); and, second, we hypothesize that only documents in the top ranks of the CORE ranking can be relevant in the setup of LongEval, where relevance is derived from clicks (i.e., we try to incorporate the position bias of the clicks into our rankings). Based on this crawled CORE ranking, we try to make improvements via qrel-boosting, RM3 keyqueries, clustering, monoT5 re-ranking and user intent prediction. Our evaluation shows that qrel-boosting, RM3 keyqueries, clustering and intent prediction improve the CORE ranking that we re-rank.

To be presented at the 16th Conference and Labs of the Evaluation Forum (CLEF 2025) on 9-12 September in Madrid, Spain.

[download pdf]

Join us at DIR 2025!

Be part of the 22nd Dutch-Belgian Information Retrieval Workshop at Radboud University, Nijmegen. We warmly invite you to register and to share your latest research with the community.

  • Submission deadline: Friday 10 October 2025, 23:59 CEST
  • Notification: Monday 13 October 2025
  • Registration deadline: Monday 20 October 2025, 23:59 CEST

Sponsored by SIGIR (ACM Special Interest Group on Information Retrieval) and SIKS (School of Information and Knowledge Systems)

More information at: https://informagus.nl/dir2025/