This is the code repository for AgenticShop: Benchmarking Agentic Product Curation for Personalized Web Shopping
📄 Paper Link: AgenticShop
🎉 Our paper has been accepted at The Web Conference 2026
AgenticShop is a benchmark for evaluating how well agentic systems curate personalized products in open-web shopping environments. It captures realistic shopping intents, diverse user profiles, and fine-grained preferences, and introduces a checklist-driven evaluation framework grounded in verifiable product evidence to measure true personalization beyond simple product search.
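The checklist-driven evaluation can be pictured as scoring each curated product against the user's checklist, counting only items backed by verifiable evidence from the product page. The sketch below is purely illustrative: the function name, data shapes, and exact matching logic are assumptions, not the repository's actual implementation.

```python
# Illustrative sketch of checklist-driven scoring (hypothetical API, not the
# repository's real code): each checklist item counts only if it is supported
# by evidence verified on the product page.
def checklist_score(checklist, product_evidence):
    """Return the fraction of checklist items satisfied by a curated product.

    checklist: list of preference strings, e.g. "machine washable"
    product_evidence: set of evidence strings verified on the product page
    """
    if not checklist:
        return 0.0
    satisfied = sum(1 for item in checklist if item in product_evidence)
    return satisfied / len(checklist)

# 2 of 3 checklist items are backed by evidence -> score of 2/3
score = checklist_score(
    ["organic cotton", "under $50", "machine washable"],
    {"organic cotton", "machine washable"},
)
```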
The workflow consists of two main phases:
Benchmark Construction: Generate user contexts, queries, and evaluation checklists that form the foundation of the benchmark dataset.
# Step 1: Generate diverse user contexts with shopping preferences and behaviors
python src/benchmark_construction/1_gen_user_context.py --domain clothing --samples 1
# Step 2: Create realistic user queries based on the generated contexts
python src/benchmark_construction/2_gen_user_query.py --domain clothing --samples 1
# Step 3: Build evaluation checklists tailored to each user's preferences
python src/benchmark_construction/3_gen_user_checklist.py --domain clothing --samples 1

Benchmark Evaluation: Run the evaluation pipeline to test different models and approaches against the constructed benchmark, measuring their performance on product curation tasks.
# Run complete evaluation pipeline with example parameters:
# --model-type: Type of model (search_llms or web_agents)
# --model-name: Specific model name (gpt, claude, etc.)
# --category: Product category (clothing, electronics, home, etc.)
# --num-users: Number of users to evaluate
python src/benchmark_evaluation/run_pipeline.py \
--model-type search_llms \
--model-name gpt \
--category clothing \
--num-users 1

Setup: Create a conda environment and install the dependencies:
conda create -n agenticshop python=3.10.13
conda activate agenticshop
pip install -r requirements.txt

Copy the environment template and add your API keys:
cp env.example .env.local
# Edit .env.local and add your OpenAI API key

Repository structure:
- src/benchmark_construction/ - Scripts for generating benchmark data
- src/benchmark_evaluation/ - Evaluation framework and modules
- eval_inputs/ - Input data for evaluation
- eval_results/ - Evaluation outputs and results
Sample user profile data is available in data/user_profiles/ to help you get started with the benchmark. This includes:
- Pre-generated user contexts and queries
- Sample evaluation checklists
