
EvalAgent

EvalAgent is a framework for extracting evaluation criteria from instructional web documents. It consists of several components that work together to generate grounded evaluation criteria. At a high level, given a user prompt, EvalAgent:

  1. Generates search queries
  2. Retrieves relevant instructional web documents, generates answers to the queries, and summarizes each answer into query-specific criteria
  3. Aggregates the per-query criteria into a final set of evaluation criteria grounded in instructional content
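
The three steps above can be sketched as a small pipeline. The function names and data shapes here are illustrative only, not the repository's actual API:

```python
# Illustrative sketch of the EvalAgent pipeline; function names and
# return values are hypothetical, not the repository's actual API.

def generate_queries(prompt: str) -> list[str]:
    # Step 1: an LLM turns the user prompt into search queries.
    return [f"how to write a {prompt}", f"what makes a good {prompt}"]

def answer_query(query: str) -> str:
    # Step 2: retrieve instructional documents for the query, answer it,
    # and summarize the answer into query-specific criteria.
    return f"criteria derived from documents about: {query}"

def aggregate(answers: list[str]) -> list[str]:
    # Step 3: aggregate per-query criteria into one grounded criteria set.
    return sorted(set(answers))

prompt = "persuasive essay"
queries = generate_queries(prompt)
criteria = aggregate([answer_query(q) for q in queries])
print(len(criteria))  # → 2
```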

🛠️ Installation

First, install the required dependencies:

pip install -r requirements.txt

Then, set up the necessary API keys and environment variables:

  1. Model API keys — for query generation, answering, and criteria aggregation.
  2. Search API credentials — to retrieve web documents.

🔧 Our search is backed by the Google Search API. Setup instructions can be found here.

📘 Optional - Reddit Setup: Some queries may return Reddit URLs. If you'd like to scrape Reddit content, set up praw by following the Reddit setup guide. This is optional; if not configured, Reddit URLs will simply be skipped without errors.

🔐 Environment variables

Create and populate a file named environment_variables.sh with the following:

## OpenAI key if you are using their models; otherwise set up Anthropic keys as needed
export OPENAI_API_KEY=
## Google search credentials 
export GOOGLE_SEARCH_API_KEY=
export CSE_ID=
## OPTIONAL reddit credentials
export reddit_client_id=
export reddit_client_secret=
export reddit_user_agent=
export reddit_username=
export reddit_password=
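
A quick sanity check that the required variables are exported before running. This helper is a sketch, not part of the repo; it covers only the required keys and treats the Reddit variables as optional:

```python
import os

# Required for query generation, answering, aggregation, and search.
# The Reddit variables are optional and are not checked here.
REQUIRED = ["OPENAI_API_KEY", "GOOGLE_SEARCH_API_KEY", "CSE_ID"]

def missing_vars(env: dict) -> list[str]:
    # Return the required variables that are unset or empty.
    return [name for name in REQUIRED if not env.get(name)]

missing = missing_vars(dict(os.environ))
if missing:
    print("Missing required variables:", ", ".join(missing))
else:
    print("All required variables are set.")
```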

🚀 Running EvalAgent

EvalAgent generates evaluation criteria using two complementary sources:

(1) LLM-n: generated directly via prompting an LLM (n criteria).

(2) EA-Web: extracted from instructional web documents through retrieval and aggregation.

These are then combined to produce the final output, EA-Full: a merged set of evaluation criteria sorted by relevance to the user prompt.
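
The combination step can be sketched as a merge-and-sort. The criteria strings and relevance scores below are hypothetical; the actual merging and scoring logic lives in the repo:

```python
# Hypothetical (criterion, relevance) pairs from the two sources;
# scores in [0, 1] stand in for relevance to the user prompt.
llm_n = [("has a clear thesis", 0.9), ("uses formal tone", 0.4)]
ea_web = [("addresses counterarguments", 0.8)]

# EA-Full merges both sets and sorts by relevance, descending.
ea_full = sorted(llm_n + ea_web, key=lambda pair: pair[1], reverse=True)
print([c for c, _ in ea_full])
# → ['has a clear thesis', 'addresses counterarguments', 'uses formal tone']
```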

Quick Setup

TBA

Detailed Setup

To generate EA-Full, use the given criteria_gen_args.yaml file:

input_file: "data/sample.jsonl"
output_file: "data/sample_criteria.jsonl"
ea: true 
llm: true
search: true
score: true
query_model: gpt-4o-mini-2024-07-18
aggregator_model: gpt-4o-mini-2024-07-18
answer_model: gpt-4o-mini-2024-07-18
scoring_model: gpt-4o-mini-2024-07-18
n: 10 
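
For reference, the config above expressed as a Python dict, with a quick validity check. This is a sketch; the actual config parsing lives in `evaluation_criteria_generator.py`:

```python
# The YAML config above, as a Python dict (a sketch for illustration).
cfg = {
    "input_file": "data/sample.jsonl",
    "output_file": "data/sample_criteria.jsonl",
    "ea": True, "llm": True, "search": True, "score": True,
    "query_model": "gpt-4o-mini-2024-07-18",
    "aggregator_model": "gpt-4o-mini-2024-07-18",
    "answer_model": "gpt-4o-mini-2024-07-18",
    "scoring_model": "gpt-4o-mini-2024-07-18",
    "n": 10,
}

def is_ea_full(cfg: dict) -> bool:
    # EA-Full combines LLM-n and EA-Web, so both sources must be enabled
    # and search must be on for the web-grounded half.
    return bool(cfg["ea"] and cfg["llm"] and cfg["search"])

assert is_ea_full(cfg)
assert isinstance(cfg["n"], int) and cfg["n"] > 0
```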

Then source your environment variables and run:

source environment_variables.sh
python evaluation_criteria_generator.py --config criteria_gen_args.yaml

The output file will contain three types of criteria:

  1. llm_criteria: n criteria generated by directly prompting an LLM
  2. ea_criteria: EA-Web criteria generated from instructional web documents
  3. ea_full_criteria: merged criteria that combines LLM and EA-Web
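
Each line of the output file is a JSON object containing these fields. A small sketch of pulling them out of one line (field names follow the list above; the record shape and criteria values are otherwise assumed for illustration):

```python
import json

# One hypothetical output line; criteria values are made up for illustration.
line = json.dumps({
    "prompt": "write a persuasive essay",
    "llm_criteria": ["has a clear thesis"],
    "ea_criteria": ["addresses counterarguments"],
    "ea_full_criteria": ["has a clear thesis", "addresses counterarguments"],
})

record = json.loads(line)
for key in ("llm_criteria", "ea_criteria", "ea_full_criteria"):
    # Each field holds a list of criterion strings.
    print(key, len(record[key]))
```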

📊 Evaluating with EvalAgent

EvalAgent can also evaluate model responses using the generated criteria. First, create an evaluation config (evaluation_args.yaml):

input_file: "data/sample_with_responses.jsonl"
output_file: "data/sample_test_evaluation.jsonl"
generate_criteria: true
criteria_generator_config: "criteria_gen_args.yaml"
criteria_column: "ea_criteria"
evaluation_model: gpt-4o-mini-2024-07-18

Ensure your criteria_gen_args.yaml is configured as described above. Then run:

python evaluate.py --config evaluation_args.yaml

You may use any of the three generated criteria types (llm_criteria, ea_criteria, or ea_full_criteria) in the evaluation.

⚙️ Alternate modes for criteria gen with EvalAgent

You can flexibly run EvalAgent criteria generation in different modes. In all cases, the rest of the config (models, n, input/output files) remains the same; only the flags need to be modified as follows:

🔹 LLM-n (only LLM-prompted criteria, no retrieval):

ea: false
llm: true
search: false
score: false
n: 10 

🔹 EvalAgent-LLM (LLM-generated answers to queries, no search):

ea: true
llm: false
search: false
score: false

🔹 EvalAgent-Web (search + document-based):

ea: true
llm: false
search: true
score: false
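
The modes differ only in their boolean flags. A sketch mapping each mode to its overrides, which you would merge into the shared config (the mapping itself is from this README; the merge helper is hypothetical):

```python
# Flag overrides for each criteria-generation mode, per the README.
MODES = {
    "LLM-n":         {"ea": False, "llm": True,  "search": False, "score": False},
    "EvalAgent-LLM": {"ea": True,  "llm": False, "search": False, "score": False},
    "EvalAgent-Web": {"ea": True,  "llm": False, "search": True,  "score": False},
    "EA-Full":       {"ea": True,  "llm": True,  "search": True,  "score": True},
}

def with_mode(base_cfg: dict, mode: str) -> dict:
    # Return a copy of the config with the mode's flags applied.
    return {**base_cfg, **MODES[mode]}

cfg = with_mode({"n": 10, "input_file": "data/sample.jsonl"}, "EvalAgent-Web")
print(cfg["search"], cfg["llm"])  # → True False
```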

🖥️ Visualization

EvalAgent includes a Flask-based UI to visualize criteria generated via search.

cd data/
python app.py --data ../data/sample_data_criteria_search.jsonl

Here is a sample visualization of the criteria generated for sample.jsonl:

sample_visualization.mp4.mov

Data

We have uploaded the data for different datasets to Hugging Face: wadhma/evalagent

Paper and citation

You can find the arXiv paper here.

@misc{wadhwa2025evalagent,
  title = {EvalAgent: Discovering Implicit Evaluation Criteria from the Web},
  author = {Manya Wadhwa and Zayne Sprague and Chaitanya Malaviya and Philippe Laban and Junyi Jessy Li and Greg Durrett},
  year = {2025},
  eprint = {2504.15219},
  archivePrefix = {arXiv},
  url = {https://arxiv.org/abs/2504.15219},
}
