LARA is a flexible framework that leverages large language models (LLMs) for active learning and relevance annotation calibration to efficiently create high-quality test collections.
Paper: LLM-Assisted Relevance Assessments: When Should We Ask LLMs for Help? (SIGIR 2025 Full Paper).
See here for proofs omitted in the paper.
Run the evaluation scripts as follows:
```
python src/test-al.py Ours Random Naive Pool MTF OursGroup[N]
python src/test-al-graded.py Ours Random Naive Pool MTF OursGroup[N]
```

- `Ours`: Proposed LARA method.
- `OursGroup[N]`: LARA with N annotators grouped.
Provide the LLM prediction results in a .csv file.
For binary relevance:

```
topic_id,doc_id,annotation,prob_yes,prob_no
```

- `topic_id`: ID of the query/topic.
- `doc_id`: Document identifier.
- `annotation`: Human-annotated label (0 or 1).
- `prob_yes`: LLM's confidence that the label is relevant (1).
- `prob_no`: LLM's confidence that the label is not relevant (0).
For graded relevance:

```
topic_id,doc_id,annotation,prob_0,...,prob_k
```

- `annotation`: Human-annotated label (0–k).
- `prob_n`: LLM's confidence for label n.
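As a sanity check before running the evaluation scripts, a prediction file in the graded format can be parsed and validated with a few lines of Python. This is a minimal sketch (not part of the repo); the sample rows and the `load_predictions` helper are hypothetical, and it assumes the `prob_*` columns of each row should sum to 1.

```python
import csv
import io

# Hypothetical graded-relevance predictions with k = 2 (labels 0..2),
# following the column layout described above.
sample = """topic_id,doc_id,annotation,prob_0,prob_1,prob_2
1,doc-a,2,0.1,0.2,0.7
1,doc-b,0,0.8,0.15,0.05
"""

def load_predictions(fileobj):
    """Parse rows and check that label probabilities sum to ~1."""
    rows = []
    for row in csv.DictReader(fileobj):
        probs = [float(v) for k, v in row.items() if k.startswith("prob_")]
        assert abs(sum(probs) - 1.0) < 1e-6, f"probabilities must sum to 1: {row}"
        rows.append((row["topic_id"], row["doc_id"], int(row["annotation"]), probs))
    return rows

preds = load_predictions(io.StringIO(sample))
print(len(preds), preds[0][3])  # 2 [0.1, 0.2, 0.7]
```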
📌 Example Implementations (using TREC-COVID from BEIR on Llama-3.1-8B-Instruct):
- Binary relevance example: `src/example-covid-itachi.py`
- Graded relevance example: `src/example-covid-itachi-graded.py`
We tested the following prompts in the paper:
- 📘 Simple Prompt
- 🔧 Utility Prompt based on [Thomas et al., 2024]
- 📙 Rational Prompt based on [Upadhyay et al., 2024]
{description} and {narrative} are given by the NIST assessors.
These prompts produce Yes (1) or No (0) responses.
```
Consider the following web page content:
—BEGIN WEB PAGE CONTENT—
{text}
—END WEB PAGE CONTENT—

Setting:
A person has typed "{query}" into a search engine.
Intent: "{description}".

Instruction:
Answer if the web content is relevant to the query. {narrative}
Answer yes or no.
Your answer:
```
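For illustration, the placeholders in the Simple Prompt can be filled with Python's `str.format`. This is a sketch with hypothetical topic and document values, not text from an actual TREC topic:

```python
# Simple Prompt template with named placeholders, as described above.
SIMPLE_PROMPT = (
    "Consider the following web page content:\n"
    "—BEGIN WEB PAGE CONTENT—\n{text}\n—END WEB PAGE CONTENT—\n\n"
    "Setting:\nA person has typed \"{query}\" into a search engine.\n"
    "Intent: \"{description}\".\n\n"
    "Instruction:\nAnswer if the web content is relevant to the query. {narrative}\n"
    "Answer yes or no.\nYour answer:"
)

# Hypothetical field values standing in for a topic and candidate document.
prompt = SIMPLE_PROMPT.format(
    text="Masks reduce droplet transmission in clinical settings...",
    query="coronavirus masks",
    description="What is the effectiveness of masks in preventing infection?",
    narrative="Relevant documents discuss evidence on mask efficacy.",
)
print(prompt.splitlines()[0])  # Consider the following web page content:
```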
```
Given a query and web page, provide a Yes or No answer indicating relevance.
Imagine writing a report on the topic.
If the web page contains vital information, answer Yes; otherwise, answer No.

Query: "{query}"
Intent: "{description} {narrative}"

Web Page:
—BEGIN WEB PAGE CONTENT—
{text}
—END WEB PAGE CONTENT—

Instruction:
Is the web content relevant to the query?
Answer yes or no.
Your answer:
```
Based on the utility framing by Thomas et al. (2024).
```
You're an expert content judge. Use commonsense reasoning to determine relevance.
- Yes: Passage dedicated to query, contains answer.
- No: Passage unrelated to query.
Only return Yes or No.

Query: "{query}"
Intent: "{description} {narrative}"
Passage: "{text}"

Instruction:
Is the web content relevant to the query?
Your answer:
```
Based on rational prompt style from Upadhyay et al. (2024).
```
Consider the following web page content:
—BEGIN WEB PAGE CONTENT—
{text}
—END WEB PAGE CONTENT—

A person typed "{query}" into a search engine.
Intent: "{description}".

Instruction:
Answer relevance of web content. {narrative}
Answer 2 (highly relevant), 1 (partially relevant), or 0 (not relevant).
Your answer:
```
```
Given a query and web page, provide a relevance score:
- 2: Highly relevant, very helpful.
- 1: Relevant, partially useful.
- 0: Not relevant.

Imagine writing a report.
If vital information is present, score 2; if partly useful, 1; else 0.

Query: "{query}"
Intent: "{description} {narrative}"

Web Page:
—BEGIN WEB PAGE CONTENT—
{text}
—END WEB PAGE CONTENT—

Instructions:
Provide a relevance score (no explanation).

Relevance Score: {
```
Based on the utility approach from Thomas et al. (2024).
```
You're an expert judge. Use commonsense reasoning to determine relevance:
- 2: Highly relevant
- 1: Somewhat relevant
- 0: Not relevant

Provide only the categorical value, no explanation.

Query: "{query}"
Intent: "{description} {narrative}"
Passage: "{text}"

Instruction:
Is the web content relevant to the query?

Relevance Score: {
```
Based on rational prompts in Upadhyay et al. (2024).
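Whichever binary prompt is used, the `prob_yes` and `prob_no` columns in the input CSV can be derived from the model's log-probabilities for the "Yes" and "No" answer tokens. This is a sketch under that assumption, not the repo's exact code; the log-probability values below are hypothetical:

```python
import math

def yes_no_probs(logprob_yes, logprob_no):
    """Renormalize the 'Yes'/'No' answer-token log-probabilities so that
    prob_yes + prob_no == 1 (a two-way softmax)."""
    m = max(logprob_yes, logprob_no)  # subtract the max for numerical stability
    e_yes = math.exp(logprob_yes - m)
    e_no = math.exp(logprob_no - m)
    return e_yes / (e_yes + e_no), e_no / (e_yes + e_no)

# Hypothetical log-probabilities read off the first generated token:
p_yes, p_no = yes_no_probs(-0.2, -1.8)
print(round(p_yes, 3), round(p_no, 3))  # 0.832 0.168
```

The same renormalization extends to the graded case by applying a softmax over the log-probabilities of the `0`, `1`, and `2` answer tokens.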
- Shivani Upadhyay et al. (2024). "A Large-Scale Study of Relevance Assessments with Large Language Models: An Initial Look." arXiv:2411.08275.
- Paul Thomas et al. (2024). "Large language models can accurately predict searcher preferences." arXiv:2309.10621.