
GIO Pilot Study -- Experiment Code

Replication code and materials for the pilot annotation study of the Generative Intent Operationalization (GIO) framework, as described in:

Spriestersbach, K. & Vollmer, S. (2025). From Search Intent to Retrieval Demand: A Pre-Generation Framework for Generative Engine Optimization (GEO) — Proposing the Generative Intent Operationalization (GIO). RPTU Kaiserslautern-Landau.

The pilot study tests whether the eight GIO modes and the four Grounding Necessity (GN) variables can be reliably annotated by human raters on real-world LLM prompts drawn from the WildChat-1M dataset.

The current recommended annotation frontend is the static web app in annotation_app/. It supports an Expert mode for the two-rater pilot workflow and a simplified Participant mode for Prolific-style runs.


Overview

| Work Package | Script | Description |
|---|---|---|
| AP1 | ap1_filter_wildchat.py | Download and filter WildChat-1M (EN/DE, 5-150 words, no code) |
| AP2 | ap2_stratified_sampling.py | Stratified sampling: pre-tag, generate candidate lists, validate |
| AP3 | ap3_keyword_baseline.py | Keyword + NER baseline for retrieval prediction |
| AP4 | ap4_create_annotation.py | Generate the legacy Excel annotation spreadsheet |
| AP5 | ap5_evaluate.py | Compute Cohen's kappa, bootstrap CI, F1, and disagreement analysis from JSON exports |

Study Design

  • Exploratory pilot: codebook sharpening, feasibility check, and first reliability estimate
  • 50 study prompts + 5 calibration prompts drawn from a fixed WildChat-1M sample
  • 2 expert raters annotate independently; reliability is reported with Cohen's kappa
  • Actual pilot sample: 21 low_gn, 16 high_gn, 13 edge (4 parametric_trap, 5 implicit_demand, 4 creative_volatile)
  • Current pilot sample language: all 55 prompts are English, although the filtered source pool is EN/DE
  • Annotation dimensions: GIO mode, I_gap, T_decay, E_spec, V_volatility, GN level, retrieval judgment, confidence, notes
  • Pilot operationalization: I_gap is collected as Low/Medium/High; for H1-style analyses, High is recoded as gate-open and Low|Medium as gate-closed
  • Known gap: Mode 3.1 (Transactional) is systematically absent from WildChat. An exhaustive candidate review over the ~230k filtered prompts surfaced only one genuine transactional prompt, so the final pilot set contains a single Mode 3.1 item and claims about this mode should remain cautious (see Sampling Documentation)
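The I_gap recode described above can be sketched in a few lines of Python (an illustrative helper, not part of the pipeline scripts):

```python
# Illustrative sketch (not part of the pipeline scripts): recoding the
# 3-level pilot I_gap rating into the binary gate used in H1-style analyses.
def i_gap_gate(rating: str) -> str:
    # High opens the retrieval gate; Low and Medium leave it closed.
    return "gate-open" if rating == "High" else "gate-closed"

print([i_gap_gate(r) for r in ("Low", "Medium", "High")])
# ['gate-closed', 'gate-closed', 'gate-open']
```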

Annotation Frontends

  • annotation_app/: current static web app used for annotation, calibration compare, backup/import, JSON export, and optional Prolific submission
  • output/annotation_spreadsheet.xlsx: legacy spreadsheet generated by AP4; still useful for archival comparison and some manual checks, but no longer the primary workflow

For the current app behavior and workflow, see docs/annotation_app_current.md.


Quick Start

Prerequisites

  • Docker (with Docker Compose)
  • A HuggingFace account with an access token (the WildChat-1M dataset requires authentication)

1. Build the Docker image

make build

2. Set your HuggingFace token

export HF_TOKEN=hf_your_token_here

3. Run the full pipeline

# AP1: Filter WildChat-1M (~10 min, downloads ~3 GB)
make ap1

# AP2: Generate candidate lists for manual sampling
make ap2-tag
make ap2-template

# --- Manual step: select 55 prompts into data/sampled_prompts.csv ---

# AP2: Validate selection
make ap2-validate

# AP3: Compute keyword baseline
make ap3

# AP2: Export sampling documentation
make ap2-export

# Optional: create the legacy Excel spreadsheet
make ap4

# --- Manual step: expert raters annotate in Experiment/annotation_app/ ---
# Export the final files as:
#   output/annotations_rater_a.json
#   output/annotations_rater_b.json

# AP5: Evaluate inter-rater agreement
make ap5

Pipeline Test (automated, no manual steps)

To run the entire pipeline with automatically generated samples and simulated annotations (for testing/verification):

make build
export HF_TOKEN=hf_your_token_here
make ap1
make ap2-tag
make ap2-auto-sample
make ap3
make ap4
make ap5-simulate
make ap5

Project Structure

Experiment/
|-- config.py                  # Central configuration (GIO definitions, paths, keywords)
|-- Dockerfile                 # Python 3.11 + spaCy + DuckDB + pyarrow
|-- docker-compose.yml         # Container orchestration with volume mounts
|-- Makefile                   # All make targets (run `make help`)
|-- requirements.txt           # Python dependencies
|-- README.md                  # This file
|-- LICENSE                    # MIT License
|-- CITATION.cff               # Machine-readable citation metadata
|-- annotation_app/            # Static annotation web app (expert + participant modes)
|
|-- scripts/
|   |-- ap1_filter_wildchat.py       # WildChat download + filtering
|   |-- ap2_stratified_sampling.py   # Stratified sampling helper
|   |-- ap2_auto_sample.py           # Automated sampling (pipeline test)
|   |-- ap3_keyword_baseline.py      # Keyword + NER baseline
|   |-- ap4_create_annotation.py     # Legacy Excel spreadsheet generator
|   |-- ap5_evaluate.py              # Evaluation (expects JSON exports in output/)
|   |-- ap5_simulate_annotations.py  # Simulated JSON exports (+ optional legacy XLSX)
|
|-- data/                      # Study data (included in repository + Zenodo)
|   |-- sampled_prompts.csv         # 55 selected prompts (study + calibration)
|   |-- baseline_predictions.csv    # Keyword baseline retrieval predictions
|   |-- evaluation_results.csv      # Inter-rater agreement metrics
|   |-- disagreements.csv           # Rater disagreement details
|   |-- candidates/                 # Candidate lists per sampling block
|   |-- filtered_pool.csv           # ~230k filtered prompts (Zenodo only, 59 MB)
|   |-- raw/                        # AP1 shard checkpoints (not published)
|   |-- hf_cache/                   # HuggingFace download cache (not published)
|
|-- output/
|   |-- annotation_spreadsheet.xlsx  # Legacy annotation workbook (optional)
|   |-- annotations_rater_a.json     # Final export from web app / simulation
|   |-- annotations_rater_b.json     # Final export from web app / simulation
|
|-- docs/
|   |-- sampling_documentation.md    # Sampling methodology and decisions
|   |-- annotation_app_current.md    # Current app workflow and export reference

Web App Workflow

The current expert workflow is:

  1. Open annotation_app/index.html without URL parameters.
  2. Choose Rater A or Rater B.
  3. Complete the 5 calibration prompts.
  4. Click Download calibration JSON.
  5. Use Open calibration compare and load both calibration files.
  6. Continue via Continue as Rater A / Continue as Rater B.
  7. Complete the 50 study prompts.
  8. Export the final files as annotations_rater_a.json and annotations_rater_b.json.
  9. Place both files in output/ and run make ap5.

The participant / Prolific workflow is documented in annotation_app/DEPLOY.md.


Data Availability

Included in this repository

The following data files are included directly:

| File | Description | Size |
|---|---|---|
| data/sampled_prompts.csv | 55 selected prompts (50 study + 5 calibration) | 24 KB |
| data/baseline_predictions.csv | Keyword baseline retrieval predictions | ~10 KB |
| data/evaluation_results.csv | Inter-rater agreement metrics (latest local run; may be provisional) | ~1 KB |
| data/disagreements.csv | Detailed rater disagreement analysis (latest local run; may be provisional) | ~15 KB |
| data/candidates/*.csv | Candidate lists per sampling block (5 files, 100 prompts each) | ~200 KB |
| output/annotation_spreadsheet.xlsx | Legacy annotation workbook with dropdowns | 48 KB |
| docs/sampling_documentation.md | Sampling methodology documentation | 8 KB |

Available on Zenodo

The full filtered prompt pool is published separately on Zenodo due to its size:

| File | Description | Size |
|---|---|---|
| filtered_pool.csv | 230,289 filtered WildChat prompts (EN/DE) | 59 MB |


Download from: https://zenodo.org/records/18593414

To use the Zenodo data, download filtered_pool.csv and place it in the data/ directory. Alternatively, regenerate it with make ap1.

Source dataset

This study uses the WildChat-1M dataset (Zhao et al., 2024), which contains 1 million real-world ChatGPT conversations collected through a free chat interface hosted on Hugging Face.

  • License: The dataset is released under the ODC-BY license (changed from AI2 ImpACT on 2024-06-26, retroactively applied).
  • Access: Requires a HuggingFace account and acceptance of the dataset terms.
  • Download: AP1 downloads Parquet files via huggingface_hub (CDN/Git-LFS) and filters locally. Only English and German conversations are retained.
  • Statistics: Of 1M conversations, 495,363 are EN/DE. After filtering (word count, code removal, deduplication), 230,289 prompts remain (226,042 EN / 4,247 DE).
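The filtering criteria above can be sketched roughly as follows (illustrative only; the real logic lives in scripts/ap1_filter_wildchat.py and its heuristics differ):

```python
import re

# Crude code/markup heuristic; the actual AP1 filter is more involved.
CODE_RE = re.compile(r"\bdef |\bimport |</?\w+>|`")

def keep_prompt(text: str, language: str) -> bool:
    """Mirror the AP1 criteria: EN/DE only, 5-150 words, no code-like content."""
    if language not in {"English", "German"}:
        return False
    if not 5 <= len(text.split()) <= 150:
        return False
    return not CODE_RE.search(text)
```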

GIO Framework Reference

The eight GIO modes (from the paper):

| Mode | Name | Category | GN Level |
|---|---|---|---|
| 1.1 | Fact Retrieval | ASKING | Low |
| 1.2 | Real-Time Synthesis | ASKING | High |
| 1.3 | Advisory | ASKING | High |
| 2.1 | Utility | DOING | None |
| 2.2 | Ungrounded Generation | DOING | Low |
| 2.3 | Grounded Generation | DOING | N/A |
| 3.1 | Transactional | ACTING | High |
| 3.2 | Open-Ended Investigation | ACTING | High |
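Transcribed as a lookup (an illustrative mapping for downstream analysis; config.py holds the canonical GIO definitions):

```python
# GN level by GIO mode, transcribed from the table above
# (Mode 2.3 is grounded from context, hence "N/A").
GIO_MODE_GN = {
    "1.1": "Low",  "1.2": "High", "1.3": "High",
    "2.1": "None", "2.2": "Low",  "2.3": "N/A",
    "3.1": "High", "3.2": "High",
}

# Every ASKING/ACTING mode except Fact Retrieval carries High grounding necessity.
high_gn_modes = sorted(m for m, gn in GIO_MODE_GN.items() if gn == "High")
print(high_gn_modes)  # ['1.2', '1.3', '3.1', '3.2']
```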

The four Grounding Necessity (GN) variables:

| Variable | Description | Anchors |
|---|---|---|
| I_gap | Information demand density; the theory treats it as a binary gate, while the pilot collects it on 3 levels | Low: poem / Medium: explain concept / High: clinical trial data |
| T_decay | Temporal distance from training cutoff | Low: historical / Medium: recent / High: post-cutoff |
| E_spec | Entity specificity | Low: abstract / Medium: category / High: named entity |
| V_volatility | Answer change frequency | Low: physical constant / Medium: census / High: stock price |

Evaluation Metrics

AP5 computes the following metrics:

  • Cohen's kappa (binary): retrieval judgment (Yes/No)
  • Cohen's kappa (nominal): GIO mode (8 categories)
  • Cohen's kappa (linear-weighted ordinal): GN level (None/Grounding from Context/Low/Medium/High)
  • Bootstrap 95% CI: 1000 iterations for all kappa values
  • Per-field percent agreement: i_gap, t_decay, e_spec, v_volatility
  • Exploratory baseline comparison: F1, Precision, Recall of keyword baseline vs. agreement-case expert labels
  • Disagreement analysis: per-field, per-prompt, and by edge-case subtype

This pilot is descriptive rather than confirmatory: it does not use McNemar's tests or make strong significance claims from n=50.
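For reference, unweighted Cohen's kappa with a percentile-bootstrap CI can be sketched as a stand-alone snippet (illustrative only; ap5_evaluate.py is the authoritative implementation):

```python
import random

def cohen_kappa(a, b):
    # Unweighted Cohen's kappa for two raters over the same items.
    labels = sorted(set(a) | set(b))
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n          # observed agreement
    p_e = sum((a.count(l) / n) * (b.count(l) / n) for l in labels)  # chance agreement
    return (p_o - p_e) / (1 - p_e) if p_e < 1 else 1.0

def bootstrap_ci(a, b, iters=1000, seed=42):
    # Percentile bootstrap over items, mirroring the AP5 setup (1000 iterations).
    rng = random.Random(seed)
    n = len(a)
    kappas = sorted(
        cohen_kappa([a[i] for i in idx], [b[i] for i in idx])
        for idx in ([rng.randrange(n) for _ in range(n)] for _ in range(iters))
    )
    return kappas[int(0.025 * iters)], kappas[int(0.975 * iters)]
```

Perfect agreement yields kappa = 1.0, while agreement at chance level yields kappa = 0.0.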


Reproducibility

All generated data files can be reproduced from scratch:

make clean   # Remove all generated files
make build   # Rebuild Docker image
make ap1     # Re-download and filter WildChat-1M

The pipeline is fully deterministic (random seed = 42) except for:

  • The WildChat-1M dataset itself (immutable on HuggingFace)
  • The manual prompt selection step (AP2)
  • The expert annotations (AP5 input)

System Requirements

  • Docker with at least 4 GB RAM allocated
  • ~3 GB disk space for HuggingFace cache
  • ~500 MB for filtered data
  • Internet connection for initial WildChat download

License

This code is released under the MIT License.

The WildChat-1M dataset is subject to the Open Data Commons Attribution License (ODC-BY).


Citation

If you use this code or data, please cite both the paper and the dataset:

@article{spriestersbach2025gio,
  title       = {From Search Intent to Retrieval Demand: A Pre-Generation
                 Framework for Generative Engine Optimization ({GEO}) --
                 Proposing the Generative Intent Operationalization ({GIO})},
  author      = {Spriestersbach, Kai and Vollmer, Sebastian},
  year        = {2025},
  institution = {RPTU Kaiserslautern-Landau},
  note        = {Department of Computer Science}
}
@dataset{spriestersbach2025gio_data,
  title     = {WildChat-GIO: Filtered English/German Prompt Pool
               for the GIO Pilot Annotation Study},
  author    = {Spriestersbach, Kai and Vollmer, Sebastian},
  year      = {2025},
  publisher = {Zenodo},
  doi       = {10.5281/zenodo.18593413}
}

Please also cite the WildChat dataset:

@article{zhao2024wildchat,
  title   = {WildChat: 1M ChatGPT Interaction Logs in the Wild},
  author  = {Zhao, Wenting and others},
  year    = {2024}
}
