Published February 10, 2026 | Version v1
Dataset Open

WildChat-GIO: Filtered English/German Prompt Pool for the GIO Pilot Annotation Study

  • 1. Rheinland-Pfälzische Technische Universität Kaiserslautern-Landau

Description

This dataset contains a filtered subset of the WildChat-1M dataset (Zhao et al., 2024), prepared for the pilot annotation study of the Generative Intent Operationalization (GIO) framework.

Filtering pipeline (AP1):
Starting from 1,048,576 conversations in WildChat-1M, the following filters were applied to first-turn user prompts:

  • Language: English and German only
  • Minimum length: at least 5 words
  • Code removal: prompts dominated by code blocks excluded
  • Deduplication: near-duplicate prompts removed

Result: 230,289 filtered prompts (226,042 English / 4,247 German)
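
The four filters listed above could be sketched roughly as follows. This is only an illustration under stated assumptions, not the code from the study repository: the record keys `prompt` and `language`, the language labels "English" and "German", the 50% code-share threshold, and the use of whitespace-normalized exact matching as a stand-in for near-duplicate detection are all hypothetical.

```python
import re

# Assumed threshold: a prompt counts as "dominated by code" when more than
# half of its characters sit inside ``` fences. The actual rule is not stated.
CODE_SHARE_LIMIT = 0.5
FENCE = re.compile(r"```.*?```", re.DOTALL)


def code_share(prompt: str) -> float:
    """Fraction of characters that fall inside fenced code blocks."""
    if not prompt:
        return 0.0
    fenced = sum(len(m.group(0)) for m in FENCE.finditer(prompt))
    return fenced / len(prompt)


def dedup_key(prompt: str) -> str:
    """Crude stand-in for near-duplicate detection: case-fold, collapse whitespace."""
    return " ".join(prompt.casefold().split())


def filter_prompts(records):
    """Yield records that pass all four filters.

    Each record is a dict with the hypothetical keys "prompt" and "language".
    """
    seen = set()
    for rec in records:
        prompt = rec["prompt"]
        if rec["language"] not in {"English", "German"}:
            continue  # language filter
        if len(prompt.split()) < 5:
            continue  # minimum word count
        if code_share(prompt) > CODE_SHARE_LIMIT:
            continue  # code removal
        key = dedup_key(prompt)
        if key in seen:
            continue  # deduplication
        seen.add(key)
        yield rec
```

A real near-duplicate step would more likely use a fuzzy method such as shingling or MinHash; the exact-match key here only shows where that step sits in the pipeline.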

File format: CSV with columns: conversation_id, prompt, language, word_count, block
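
A minimal sketch of reading the file with Python's standard library, assuming the header row contains the five column names conversation_id, prompt, language, word_count, and block, and that the file is UTF-8 encoded. The helper name `count_by_language` is hypothetical.

```python
import csv
from collections import Counter

# Prompts can be long; raise the csv module's default per-field size limit.
csv.field_size_limit(10_000_000)


def count_by_language(path):
    """Count rows per value of the `language` column."""
    counts = Counter()
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            counts[row["language"]] += 1
    return counts
```

If the downloaded file matches the description above, the counts returned by `count_by_language("filtered_pool.csv")` should sum to 230,289.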

Usage: This pool serves as input for AP2 (stratified sampling), where 55 prompts are selected for expert annotation of GIO modes and grounding necessity variables. The full experiment pipeline is available at: https://github.com/kaispriestersbach/gio-pilot-study
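
The record does not state how AP2 stratifies the pool, so the following is only a sketch of one possible scheme; the actual procedure is in the repository linked above. It assumes a single stratification column (for example `language`), a round-robin allocation that spreads the quota as evenly as possible across strata, and a fixed seed for reproducibility. The function name `stratified_sample` is hypothetical.

```python
import random
from collections import defaultdict


def stratified_sample(rows, key, n, seed=0):
    """Draw up to n rows, spreading the quota as evenly as possible across strata."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for row in rows:
        strata[row[key]].append(row)

    # Shuffle each stratum once; sorted key order keeps the result reproducible.
    pools = []
    for name in sorted(strata):
        pool = strata[name][:]
        rng.shuffle(pool)
        pools.append(pool)

    # Take one row per stratum in turn until n rows are collected
    # or every stratum is exhausted.
    sample = []
    while len(sample) < n and any(pools):
        for pool in pools:
            if pool and len(sample) < n:
                sample.append(pool.pop())
    return sample
```

With rows loaded as dictionaries, `stratified_sample(rows, key="language", n=55)` would return 55 rows. Equal allocation deliberately over-represents the much smaller German stratum; proportional allocation would be the alternative if the sample should mirror the pool.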

Source dataset: Zhao, W. et al. (2024). WildChat-1M. Licensed under ODC-BY. https://huggingface.co/datasets/allenai/WildChat-1M

Files

filtered_pool.csv (62.1 MB)
md5:56e477ea9e531f3df67121d0cacc39b2
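
After downloading, the file can be checked against the MD5 checksum listed for it. The sketch below uses only the standard library; the local path passed to `md5_of_file` is an assumption about where the file was saved, and the helper name is hypothetical.

```python
import hashlib

# Checksum as listed on this record for filtered_pool.csv.
EXPECTED_MD5 = "56e477ea9e531f3df67121d0cacc39b2"


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

A download is intact if `md5_of_file("filtered_pool.csv") == EXPECTED_MD5` evaluates to `True`.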

Additional details

Dates

Submitted
2026-02-10
Initial Upload

Software

Repository URL
https://github.com/kaispriestersbach/gio-pilot-study
Programming language
Python
Development Status
Active