This is the code repository for AgenticShop: Benchmarking Agentic Product Curation for Personalized Web Shopping
📄 Paper Link: AgenticShop
🎉 Our paper has been accepted at The Web Conference 2026
AgenticShop is a benchmark for evaluating how well agentic systems curate personalized products in open-web shopping environments. It captures realistic shopping intents, diverse user profiles, and fine-grained preferences, and introduces a checklist-driven evaluation framework grounded in verifiable product evidence to measure true personalization beyond simple product search.
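The checklist-driven evaluation can be pictured as scoring each curated product against the user's checklist, counting only items backed by verifiable evidence from the product page. The sketch below is purely illustrative: the function name, data shapes, and exact matching logic are assumptions, not the repository's actual implementation.

```python
# Illustrative sketch of checklist-driven scoring (hypothetical API, not the
# repository's real code): each checklist item counts only if it is supported
# by evidence verified on the product page.
def checklist_score(checklist, product_evidence):
    """Return the fraction of checklist items satisfied by a curated product.

    checklist: list of preference strings, e.g. "machine washable"
    product_evidence: set of evidence strings verified on the product page
    """
    if not checklist:
        return 0.0
    satisfied = sum(1 for item in checklist if item in product_evidence)
    return satisfied / len(checklist)

# 2 of 3 checklist items are backed by evidence -> score of 2/3
score = checklist_score(
    ["organic cotton", "under $50", "machine washable"],
    {"organic cotton", "machine washable"},
)
```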
The workflow consists of two main phases:
Benchmark Construction: Generate user contexts, queries, and evaluation checklists that form the foundation of the benchmark dataset.
# Step 1: Generate diverse user contexts with shopping preferences and behaviors
python src/benchmark_construction/1_gen_user_context.py --domain clothing --samples 1
# Step 2: Create realistic user queries based on the generated contexts
python src/benchmark_construction/2_gen_user_query.py --domain clothing --samples 1
# Step 3: Build evaluation checklists tailored to each user's preferences
python src/benchmark_construction/3_gen_user_checklist.py --domain clothing --samples 1

Benchmark Evaluation: Run the evaluation pipeline to test different models and approaches against the constructed benchmark, measuring their performance on product curation tasks.
# Run complete evaluation pipeline with example parameters:
# --model-type: Type of model (search_llms or web_agents)
# --model-name: Specific model name (gpt, claude, etc.)
# --category: Product category (clothing, electronics, home, etc.)
# --num-users: Number of users to evaluate
python src/benchmark_evaluation/run_pipeline.py \
--model-type search_llms \
--model-name gpt \
--category clothing \
--num-users 1

Setup: Create a conda environment and install the dependencies:
conda create -n agenticshop python=3.10.13
conda activate agenticshop
pip install -r requirements.txt

Copy the environment template and add your API keys:
cp env.example .env.local
# Edit .env.local and add your OpenAI API key

Repository structure:
- src/benchmark_construction/ - Scripts for generating benchmark data
- src/benchmark_evaluation/ - Evaluation framework and modules
- eval_inputs/ - Input data for evaluation
- eval_results/ - Evaluation outputs and results
Sample user profile data is available in data/user_profiles/ to help you get started with the benchmark. This includes:
- Pre-generated user contexts and queries
- Sample evaluation checklists
