
GIO Pilot Study -- Experiment Code

Replication code and materials for the pilot annotation study of the Generative Intent Operationalization (GIO) framework, as described in:

Spriestersbach, K. & Vollmer, S. (2025). From Search Intent to Retrieval Demand: A Pre-Generation Framework for Generative Engine Optimization (GEO) — Proposing the Generative Intent Operationalization (GIO). RPTU Kaiserslautern-Landau.

The pilot study tests whether the eight GIO modes and the four Grounding Necessity (GN) variables can be reliably annotated by human raters on real-world LLM prompts drawn from the WildChat-1M dataset.

The current recommended annotation frontend is the static web app in annotation_app/. It supports an Expert mode for the two-rater pilot workflow and a simplified Participant mode for Prolific-style runs.


Overview

| Work Package | Script | Description |
|---|---|---|
| AP1 | ap1_filter_wildchat.py | Download and filter WildChat-1M (EN/DE, 5-150 words, no code) |
| AP2 | ap2_stratified_sampling.py | Stratified sampling: pre-tag, generate candidate lists, validate |
| AP3 | ap3_keyword_baseline.py | Keyword + NER baseline for retrieval prediction |
| AP4 | ap4_create_annotation.py | Generate the legacy Excel annotation spreadsheet |
| AP5 | ap5_evaluate.py | Compute Cohen's kappa, bootstrap CI, F1, and disagreement analysis from JSON exports |

Study Design

  • Exploratory pilot: codebook sharpening, feasibility check, and first reliability estimate
  • 50 study prompts + 5 calibration prompts drawn from a fixed WildChat-1M sample
  • 2 expert raters annotate independently; reliability is reported with Cohen's kappa
  • Actual pilot sample: 21 low_gn, 16 high_gn, 13 edge (4 parametric_trap, 5 implicit_demand, 4 creative_volatile)
  • Current pilot sample language: all 55 prompts are English, although the filtered source pool is EN/DE
  • Annotation dimensions: GIO mode, I_gap, T_decay, E_spec, V_volatility, GN level, retrieval judgment, confidence, notes
  • Pilot operationalization: I_gap is collected as Low/Medium/High; for H1-style analyses, High is recoded as gate-open and Low|Medium as gate-closed
  • Known gap: Mode 3.1 (Transactional) is systematically absent from WildChat. An exhaustive candidate review over the ~230k filtered prompts surfaced only one genuine transactional prompt, so the final pilot set contains a single Mode 3.1 item and claims about this mode should remain cautious (see Sampling Documentation)
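The I_gap recode described above can be sketched in a few lines of Python (an illustrative helper, not part of the pipeline scripts):

```python
# Illustrative sketch (not part of the pipeline scripts): recoding the
# 3-level pilot I_gap rating into the binary gate used in H1-style analyses.
def i_gap_gate(rating: str) -> str:
    # High opens the retrieval gate; Low and Medium leave it closed.
    return "gate-open" if rating == "High" else "gate-closed"

print([i_gap_gate(r) for r in ("Low", "Medium", "High")])
# ['gate-closed', 'gate-closed', 'gate-open']
```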

Annotation Frontends

  • annotation_app/: current static web app used for annotation, calibration compare, backup/import, JSON export, and optional Prolific submission
  • output/annotation_spreadsheet.xlsx: legacy spreadsheet generated by AP4; still useful for archival comparison and some manual checks, but no longer the primary workflow

For the current app behavior and workflow, see docs/annotation_app_current.md.


Quick Start

Prerequisites

  • Docker (with Docker Compose)
  • A HuggingFace account with an access token (the WildChat-1M dataset requires authentication)

1. Build the Docker image

make build

2. Set your HuggingFace token

export HF_TOKEN=hf_your_token_here

3. Run the full pipeline

# AP1: Filter WildChat-1M (~10 min, downloads ~3 GB)
make ap1

# AP2: Generate candidate lists for manual sampling
make ap2-tag
make ap2-template

# --- Manual step: select 55 prompts into data/sampled_prompts.csv ---

# AP2: Validate selection
make ap2-validate

# AP3: Compute keyword baseline
make ap3

# AP2: Export sampling documentation
make ap2-export

# Optional: create the legacy Excel spreadsheet
make ap4

# --- Manual step: expert raters annotate in Experiment/annotation_app/ ---
# Export the final files as:
#   output/annotations_rater_a.json
#   output/annotations_rater_b.json

# AP5: Evaluate inter-rater agreement
make ap5

Pipeline Test (automated, no manual steps)

To run the entire pipeline with automatically generated samples and simulated annotations (for testing/verification):

make build
export HF_TOKEN=hf_your_token_here
make ap1
make ap2-tag
make ap2-auto-sample
make ap3
make ap4
make ap5-simulate
make ap5

Project Structure

Experiment/
|-- config.py                  # Central configuration (GIO definitions, paths, keywords)
|-- Dockerfile                 # Python 3.11 + spaCy + DuckDB + pyarrow
|-- docker-compose.yml         # Container orchestration with volume mounts
|-- Makefile                   # All make targets (run `make help`)
|-- requirements.txt           # Python dependencies
|-- README.md                  # This file
|-- LICENSE                    # MIT License
|-- CITATION.cff               # Machine-readable citation metadata
|-- annotation_app/            # Static annotation web app (expert + participant modes)
|
|-- scripts/
|   |-- ap1_filter_wildchat.py       # WildChat download + filtering
|   |-- ap2_stratified_sampling.py   # Stratified sampling helper
|   |-- ap2_auto_sample.py           # Automated sampling (pipeline test)
|   |-- ap3_keyword_baseline.py      # Keyword + NER baseline
|   |-- ap4_create_annotation.py     # Legacy Excel spreadsheet generator
|   |-- ap5_evaluate.py              # Evaluation (expects JSON exports in output/)
|   |-- ap5_simulate_annotations.py  # Simulated JSON exports (+ optional legacy XLSX)
|
|-- data/                      # Study data (included in repository + Zenodo)
|   |-- sampled_prompts.csv         # 55 selected prompts (study + calibration)
|   |-- baseline_predictions.csv    # Keyword baseline retrieval predictions
|   |-- evaluation_results.csv      # Inter-rater agreement metrics
|   |-- disagreements.csv           # Rater disagreement details
|   |-- candidates/                 # Candidate lists per sampling block
|   |-- filtered_pool.csv           # ~230k filtered prompts (Zenodo only, 59 MB)
|   |-- raw/                        # AP1 shard checkpoints (not published)
|   |-- hf_cache/                   # HuggingFace download cache (not published)
|
|-- output/
|   |-- annotation_spreadsheet.xlsx  # Legacy annotation workbook (optional)
|   |-- annotations_rater_a.json     # Final export from web app / simulation
|   |-- annotations_rater_b.json     # Final export from web app / simulation
|
|-- docs/
|   |-- sampling_documentation.md    # Sampling methodology and decisions
|   |-- annotation_app_current.md    # Current app workflow and export reference

Web App Workflow

The current expert workflow is:

  1. Open annotation_app/index.html without URL parameters.
  2. Choose Rater A or Rater B.
  3. Complete the 5 calibration prompts.
  4. Click Download calibration JSON.
  5. Use Open calibration compare and load both calibration files.
  6. Continue via Continue as Rater A / Continue as Rater B.
  7. Complete the 50 study prompts.
  8. Export the final files as annotations_rater_a.json and annotations_rater_b.json.
  9. Place both files in output/ and run make ap5.

The participant / Prolific workflow is documented in annotation_app/DEPLOY.md.


Data Availability

Included in this repository

The following data files are included directly:

| File | Description | Size |
|---|---|---|
| data/sampled_prompts.csv | 55 selected prompts (50 study + 5 calibration) | 24 KB |
| data/baseline_predictions.csv | Keyword baseline retrieval predictions | ~10 KB |
| data/evaluation_results.csv | Inter-rater agreement metrics (latest local run; may be provisional) | ~1 KB |
| data/disagreements.csv | Detailed rater disagreement analysis (latest local run; may be provisional) | ~15 KB |
| data/candidates/*.csv | Candidate lists per sampling block (5 files, 100 prompts each) | ~200 KB |
| output/annotation_spreadsheet.xlsx | Legacy annotation workbook with dropdowns | 48 KB |
| docs/sampling_documentation.md | Sampling methodology documentation | 8 KB |

Available on Zenodo

The full filtered prompt pool is published separately on Zenodo due to its size:

| File | Description | Size |
|---|---|---|
| filtered_pool.csv | 230,289 filtered WildChat prompts (EN/DE) | 59 MB |


Download from: https://zenodo.org/records/18593414

To use the Zenodo data, download filtered_pool.csv and place it in the data/ directory. Alternatively, regenerate it with make ap1.

Source dataset

This study uses the WildChat-1M dataset (Zhao et al., 2024), which contains 1 million real-world ChatGPT conversations collected through a free chat interface hosted on Hugging Face.

  • License: The dataset is released under the ODC-BY license (changed from AI2 ImpACT on 2024-06-26, retroactively applied).
  • Access: Requires a HuggingFace account and acceptance of the dataset terms.
  • Download: AP1 downloads Parquet files via huggingface_hub (CDN/Git-LFS) and filters locally. Only English and German conversations are retained.
  • Statistics: Of 1M conversations, 495,363 are EN/DE. After filtering (word count, code removal, deduplication), 230,289 prompts remain (226,042 EN / 4,247 DE).
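The filtering criteria above can be sketched roughly as follows (illustrative only; the real logic lives in scripts/ap1_filter_wildchat.py and its heuristics differ):

```python
import re

# Crude code/markup heuristic; the actual AP1 filter is more involved.
CODE_RE = re.compile(r"\bdef |\bimport |</?\w+>|`")

def keep_prompt(text: str, language: str) -> bool:
    """Mirror the AP1 criteria: EN/DE only, 5-150 words, no code-like content."""
    if language not in {"English", "German"}:
        return False
    if not 5 <= len(text.split()) <= 150:
        return False
    return not CODE_RE.search(text)
```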

GIO Framework Reference

The eight GIO modes (from the paper):

| Mode | Name | Category | GN Level |
|---|---|---|---|
| 1.1 | Fact Retrieval | ASKING | Low |
| 1.2 | Real-Time Synthesis | ASKING | High |
| 1.3 | Advisory | ASKING | High |
| 2.1 | Utility | DOING | None |
| 2.2 | Ungrounded Generation | DOING | Low |
| 2.3 | Grounded Generation | DOING | N/A |
| 3.1 | Transactional | ACTING | High |
| 3.2 | Open-Ended Investigation | ACTING | High |
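Transcribed as a lookup (an illustrative mapping for downstream analysis; config.py holds the canonical GIO definitions):

```python
# GN level by GIO mode, transcribed from the table above
# (Mode 2.3 is grounded from context, hence "N/A").
GIO_MODE_GN = {
    "1.1": "Low",  "1.2": "High", "1.3": "High",
    "2.1": "None", "2.2": "Low",  "2.3": "N/A",
    "3.1": "High", "3.2": "High",
}

# Every ASKING/ACTING mode except Fact Retrieval carries High grounding necessity.
high_gn_modes = sorted(m for m, gn in GIO_MODE_GN.items() if gn == "High")
print(high_gn_modes)  # ['1.2', '1.3', '3.1', '3.2']
```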

The four Grounding Necessity (GN) variables:

| Variable | Description | Anchors |
|---|---|---|
| I_gap | Information demand density; the theory treats it as a binary gate, while the pilot collects it on 3 levels | Low: poem / Medium: explain concept / High: clinical trial data |
| T_decay | Temporal distance from training cutoff | Low: historical / Medium: recent / High: post-cutoff |
| E_spec | Entity specificity | Low: abstract / Medium: category / High: named entity |
| V_volatility | Answer change frequency | Low: physical constant / Medium: census / High: stock price |

Evaluation Metrics

AP5 computes the following metrics:

  • Cohen's kappa (binary): retrieval judgment (Yes/No)
  • Cohen's kappa (nominal): GIO mode (8 categories)
  • Cohen's kappa (linear-weighted ordinal): GN level (None/Grounding from Context/Low/Medium/High)
  • Bootstrap 95% CI: 1000 iterations for all kappa values
  • Per-field percent agreement: i_gap, t_decay, e_spec, v_volatility
  • Exploratory baseline comparison: F1, Precision, Recall of keyword baseline vs. agreement-case expert labels
  • Disagreement analysis: per-field, per-prompt, and by edge-case subtype

This pilot is descriptive rather than confirmatory: it does not use McNemar's tests or make strong significance claims from n=50.
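For reference, unweighted Cohen's kappa with a percentile-bootstrap CI can be sketched as a stand-alone snippet (illustrative only; ap5_evaluate.py is the authoritative implementation):

```python
import random

def cohen_kappa(a, b):
    # Unweighted Cohen's kappa for two raters over the same items.
    labels = sorted(set(a) | set(b))
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n          # observed agreement
    p_e = sum((a.count(l) / n) * (b.count(l) / n) for l in labels)  # chance agreement
    return (p_o - p_e) / (1 - p_e) if p_e < 1 else 1.0

def bootstrap_ci(a, b, iters=1000, seed=42):
    # Percentile bootstrap over items, mirroring the AP5 setup (1000 iterations).
    rng = random.Random(seed)
    n = len(a)
    kappas = sorted(
        cohen_kappa([a[i] for i in idx], [b[i] for i in idx])
        for idx in ([rng.randrange(n) for _ in range(n)] for _ in range(iters))
    )
    return kappas[int(0.025 * iters)], kappas[int(0.975 * iters)]
```

Perfect agreement yields kappa = 1.0, while agreement at chance level yields kappa = 0.0.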


Reproducibility

All generated data files can be reproduced from scratch:

make clean   # Remove all generated files
make build   # Rebuild Docker image
make ap1     # Re-download and filter WildChat-1M

The pipeline is fully deterministic (random seed = 42) except for:

  • The WildChat-1M dataset itself (immutable on HuggingFace)
  • The manual prompt selection step (AP2)
  • The expert annotations (AP5 input)

System Requirements

  • Docker with at least 4 GB RAM allocated
  • ~3 GB disk space for HuggingFace cache
  • ~500 MB for filtered data
  • Internet connection for initial WildChat download

License

This code is released under the MIT License.

The WildChat-1M dataset is subject to the Open Data Commons Attribution License (ODC-BY).


Citation

If you use this code or data, please cite both the paper and the dataset:

@article{spriestersbach2025gio,
  title       = {From Search Intent to Retrieval Demand: A Pre-Generation
                 Framework for Generative Engine Optimization ({GEO}) --
                 Proposing the Generative Intent Operationalization ({GIO})},
  author      = {Spriestersbach, Kai and Vollmer, Sebastian},
  year        = {2025},
  institution = {RPTU Kaiserslautern-Landau},
  note        = {Department of Computer Science}
}
@dataset{spriestersbach2025gio_data,
  title     = {WildChat-GIO: Filtered English/German Prompt Pool
               for the GIO Pilot Annotation Study},
  author    = {Spriestersbach, Kai and Vollmer, Sebastian},
  year      = {2025},
  publisher = {Zenodo},
  doi       = {10.5281/zenodo.18593413}
}

Please also cite the WildChat dataset:

@article{zhao2024wildchat,
  title   = {WildChat: 1M ChatGPT Interaction Logs in the Wild},
  author  = {Zhao, Wenting and others},
  year    = {2024}
}
