EvalAgent is a framework for extracting evaluation criteria from instructional web documents. It consists of several components that work together to generate grounded evaluation criteria. At a high level, given a user prompt, EvalAgent:
- Generates search queries
- Retrieves relevant instructional web documents, generates answers to the queries, and summarizes them into query-specific criteria
- Aggregates the query-specific criteria into evaluation criteria grounded in instructional content
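The sketch below illustrates this flow at a high level. All function names are hypothetical placeholders for illustration, not EvalAgent's actual API:

```python
# A minimal sketch of the EvalAgent flow. Every helper below is a
# hypothetical stub for illustration, not EvalAgent's actual API.

def generate_search_queries(prompt: str, n: int) -> list[str]:
    ...  # prompt an LLM to propose n search queries for the user prompt

def retrieve_documents(query: str) -> list[str]:
    ...  # fetch instructional web pages for the query via a search API

def summarize_to_criteria(query: str, documents: list[str]) -> str:
    ...  # answer the query from the documents, then summarize the
         # answer into query-specific criteria

def aggregate_criteria(prompt: str, per_query_criteria: list[str]) -> list[str]:
    ...  # merge the query-specific criteria into evaluation criteria
         # grounded in the retrieved instructional content

def evalagent(prompt: str, n: int = 10) -> list[str]:
    queries = generate_search_queries(prompt, n)
    per_query = [summarize_to_criteria(q, retrieve_documents(q)) for q in queries]
    return aggregate_criteria(prompt, per_query)
```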
First, install the required dependencies:

```bash
pip install -r requirements.txt
```
Then, set up the necessary API keys and environment variables:
- Model API keys — for query generation, answering, and criteria aggregation.
- Search API credentials — to retrieve web documents.
🔧 Our search is backed by the Google Search API. Setup instructions can be found here.
📘 Optional - Reddit Setup: Some queries may return Reddit URLs. If you'd like to scrape Reddit content, set up praw by following the Reddit setup guide. This is optional; if not configured, Reddit URLs will simply be skipped without errors.
Create and populate a file named `environment_variables.sh` with the following:

```bash
## OpenAI key if you are using their models; otherwise set up Anthropic keys as needed
export OPENAI_API_KEY=

## Google Search credentials
export GOOGLE_SEARCH_API_KEY=
export CSE_ID=

## OPTIONAL: Reddit credentials
export reddit_client_id=
export reddit_client_secret=
export reddit_user_agent=
export reddit_username=
export reddit_password=
```
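Once the file is sourced, these keys are read from the environment. For example, in Python:

```python
import os

# Required keys exported by environment_variables.sh.
openai_key = os.environ["OPENAI_API_KEY"]
google_key = os.environ["GOOGLE_SEARCH_API_KEY"]
cse_id = os.environ["CSE_ID"]

# Reddit credentials are optional; fall back to None if unset.
reddit_client_id = os.environ.get("reddit_client_id")
reddit_client_secret = os.environ.get("reddit_client_secret")
```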
EvalAgent generates evaluation criteria using two complementary sources:
(1) LLM-n: generated directly via prompting an LLM (n criteria).
(2) EA-Web: extracted from instructional web documents through retrieval and aggregation.
These are then combined to produce the final output, EA-Full: a combined set of evaluation criteria sorted by relevance to the user prompt.
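As a rough illustration, the combination step can be thought of as merging the two lists and ranking by a relevance score. This is a hypothetical sketch, not the repository's implementation; the deduplication and the `score_fn` signature are assumptions:

```python
from collections.abc import Callable

# Hypothetical sketch of assembling EA-Full: merge LLM-n and EA-Web
# criteria, drop duplicates, and sort by a relevance score (e.g., from
# the scoring model). Not the repository's actual implementation.
def build_ea_full(llm_criteria: list[str],
                  ea_web_criteria: list[str],
                  score_fn: Callable[[str], float]) -> list[str]:
    merged = list(dict.fromkeys(llm_criteria + ea_web_criteria))
    return sorted(merged, key=score_fn, reverse=True)
```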
TBA
To generate EA-Full, use the given `criteria_gen_args.yaml` file:

```yaml
input_file: "data/sample.jsonl"
output_file: "data/sample_criteria.jsonl"
ea: true
llm: true
search: true
score: true
query_model: gpt-4o-mini-2024-07-18
aggregator_model: gpt-4o-mini-2024-07-18
answer_model: gpt-4o-mini-2024-07-18
scoring_model: gpt-4o-mini-2024-07-18
n: 10
```
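The input file is JSONL with one record per user prompt. The exact schema should match the repository's `data/sample.jsonl`; the single `prompt` field below is an assumption for illustration:

```python
import json

# Hypothetical input record; match the field names used in the
# repository's data/sample.jsonl.
record = {"prompt": "Write a persuasive essay about renewable energy."}

with open("data/sample.jsonl", "w") as f:
    f.write(json.dumps(record) + "\n")
```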
Then source your environment variables and run:

```bash
source environment_variables.sh
python evaluation_criteria_generator.py --config criteria_gen_args.yaml
```
The output file will contain three types of criteria:
- `llm_criteria`: the n criteria generated by directly prompting an LLM (LLM-n)
- `ea_criteria`: the EA-Web criteria generated from instructional web documents
- `ea_full_criteria`: the merged criteria combining LLM-n and EA-Web (EA-Full)
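Assuming the output is JSONL with these three fields (names taken from the list above), the results can be inspected like this:

```python
import json

# Print the criteria stored in each output record.
with open("data/sample_criteria.jsonl") as f:
    for line in f:
        row = json.loads(line)
        print("LLM-n:  ", row.get("llm_criteria"))
        print("EA-Web: ", row.get("ea_criteria"))
        print("EA-Full:", row.get("ea_full_criteria"))
```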
EvalAgent can also evaluate model responses using the generated criteria.
First, create an evaluation config (`evaluation_args.yaml`):

```yaml
input_file: "data/sample_with_responses.jsonl"
output_file: "data/sample_test_evaluation.jsonl"
generate_criteria: true
criteria_generator_config: "criteria_gen_args.yaml"
criteria_column: "ea_criteria"
evaluation_model: gpt-4o-mini-2024-07-18
```
Ensure your `criteria_gen_args.yaml` is configured as described above. Then run:

```bash
python evaluate.py --config evaluation_args.yaml
```
You may use any of the three generated criteria types (`llm_criteria`, `ea_criteria`, or `ea_full_criteria`) in the evaluation.
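For example, to switch the evaluation to the merged EA-Full criteria, update `criteria_column` (a small PyYAML helper; the config keys are those shown above):

```python
import yaml

# Point the evaluation at the merged EA-Full criteria.
with open("evaluation_args.yaml") as f:
    config = yaml.safe_load(f)

config["criteria_column"] = "ea_full_criteria"

with open("evaluation_args.yaml", "w") as f:
    yaml.safe_dump(config, f)
```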
You can flexibly run EvalAgent criteria generation in different modes. In all cases, the rest of the config (models, n, input/output files) stays the same; only the flags need to change, as follows:

Only generating LLM-prompted criteria:

```yaml
ea: false
llm: true
search: false
score: false
n: 10
```

Only generating EA-Web criteria, skipping search:

```yaml
ea: true
llm: false
search: false
score: false
```

Only generating EA-Web criteria, including search:

```yaml
ea: true
llm: false
search: true
score: false
```
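If you switch between these modes often, a small helper can rewrite the flags for you (uses PyYAML; the flag names are those from `criteria_gen_args.yaml`, while the mode names are our own labels):

```python
import yaml

# Flag presets for the three modes described above.
MODES = {
    "llm_only":       {"ea": False, "llm": True,  "search": False, "score": False},
    "ea_no_search":   {"ea": True,  "llm": False, "search": False, "score": False},
    "ea_with_search": {"ea": True,  "llm": False, "search": True,  "score": False},
}

def set_mode(config_path: str, mode: str) -> None:
    with open(config_path) as f:
        config = yaml.safe_load(f)
    config.update(MODES[mode])
    with open(config_path, "w") as f:
        yaml.safe_dump(config, f)

set_mode("criteria_gen_args.yaml", "llm_only")
```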
EvalAgent includes a Flask-based UI to visualize criteria generated via search:

```bash
cd data/
python app.py --data ../data/sample_data_criteria_search.jsonl
```
Here is a sample visualization of the criteria generated for `sample.jsonl` (video: sample_visualization.mp4.mov).
We have uploaded data for several datasets to Hugging Face: wadhma/evalagent
You can find the arXiv paper here: https://arxiv.org/abs/2504.15219
```bibtex
@InProceedings{wadhwa2025evalagent,
  title     = {EvalAgent: Discovering Implicit Evaluation Criteria from the Web},
  author    = {Manya Wadhwa and Zayne Sprague and Chaitanya Malaviya and Philippe Laban and Junyi Jessy Li and Greg Durrett},
  booktitle = {arXiv},
  year      = {2025},
  url       = {https://arxiv.org/abs/2504.15219},
}
```
