WildChat-GIO: Filtered English/German Prompt Pool for the GIO Pilot Annotation Study
Authors/Creators
Description
This dataset contains a filtered subset of the WildChat-1M dataset (Zhao et al., 2024), prepared for the pilot annotation study of the Generative Intent Operationalization (GIO) framework.
Filtering pipeline (AP1):
Starting from 1,048,576 conversations in WildChat-1M, the following filters were applied to first-turn user prompts:
- Language: English and German only
- Minimum word count: at least 5 words
- Code removal: prompts dominated by code blocks excluded
- Deduplication: near-duplicate prompts removed
Result: 230,289 filtered prompts (226,042 English / 4,247 German)
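The four filters above can be sketched as a single pass over the first-turn prompts. This is a minimal illustration, not the AP1 implementation: the language labels ("English", "German"), the 50% threshold for "dominated by code blocks", and the whitespace/case normalization standing in for near-duplicate detection are all assumptions, since the record does not specify them.

```python
import re

# Built with string repetition so the marker does not clash with this listing's own fence.
FENCE = "`" * 3
CODE_BLOCK = re.compile(re.escape(FENCE) + r".*?" + re.escape(FENCE), re.DOTALL)


def code_fraction(prompt):
    """Share of characters that sit inside fenced code blocks."""
    if not prompt:
        return 0.0
    fenced = sum(len(match.group(0)) for match in CODE_BLOCK.finditer(prompt))
    return fenced / len(prompt)


def normalize(prompt):
    """Lowercase and collapse whitespace (assumed stand-in for near-duplicate detection)."""
    return " ".join(prompt.lower().split())


def filter_prompts(records, languages=("English", "German"),
                   min_words=5, max_code_fraction=0.5):
    """Apply the language, word-count, code, and deduplication filters in order.

    `records` is an iterable of dicts with "prompt" and "language" keys.
    `max_code_fraction` is a hypothetical threshold, not taken from the record.
    """
    seen = set()
    kept = []
    for record in records:
        prompt = record["prompt"]
        if record["language"] not in languages:
            continue
        if len(prompt.split()) < min_words:
            continue
        if code_fraction(prompt) > max_code_fraction:
            continue
        key = normalize(prompt)
        if key in seen:
            continue
        seen.add(key)
        kept.append(record)
    return kept
```

The first occurrence of each normalized prompt is kept, so the output order follows the input order.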
File format: CSV with columns: conversation_id, prompt, language, word_count, block
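A quick way to check a downloaded copy against the column list above is to read it with the standard library and tally rows per language. The sketch below assumes UTF-8 encoding and a header row, neither of which is stated in the record; `language_counts` and `load_counts` are illustrative helper names.

```python
import csv
from collections import Counter

EXPECTED_COLUMNS = ["conversation_id", "prompt", "language", "word_count", "block"]


def language_counts(lines):
    """Count rows per value of the `language` column.

    `lines` is any iterable of CSV text lines, for example an open file handle.
    """
    reader = csv.DictReader(lines)
    missing = set(EXPECTED_COLUMNS) - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")
    return Counter(row["language"] for row in reader)


def load_counts(path):
    """Open the CSV at `path` (UTF-8 assumed) and count rows per language."""
    # newline="" lets the csv module handle line breaks inside quoted prompts.
    with open(path, newline="", encoding="utf-8") as handle:
        return language_counts(handle)
```

Calling `load_counts("filtered_pool.csv")` on the full file should yield two groups whose sizes match the 226,042 and 4,247 reported above, under whatever labels the `language` column actually uses.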
Usage: This pool serves as input for AP2 (stratified sampling), where 55 prompts are selected for expert annotation of GIO modes and grounding necessity variables. The full experiment pipeline is available at: https://github.com/kaispriestersbach/gio-pilot-study
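The record does not state which variables define the strata in AP2 or how the 55 slots are allocated among them. As a generic illustration only, the sketch below draws a seeded stratified sample with proportional allocation (largest-remainder rounding) over an arbitrary column; the function name, the choice of key, and the allocation rule are assumptions, not the study's procedure.

```python
import random
from collections import defaultdict


def stratified_sample(rows, key, n_total, seed=0):
    """Draw `n_total` rows, allocating slots to strata in proportion to their size.

    `rows` is a list of dicts, `key` is the column that defines the strata.
    Proportional allocation is an assumed rule, not the documented AP2 design.
    """
    rng = random.Random(seed)
    strata = defaultdict(list)
    for row in rows:
        strata[row[key]].append(row)

    total = sum(len(members) for members in strata.values())
    if n_total > total:
        raise ValueError("sample size exceeds pool size")

    # Integer arithmetic avoids floating-point rounding in the quotas.
    alloc = {k: n_total * len(v) // total for k, v in strata.items()}
    remainder = {k: n_total * len(v) % total for k, v in strata.items()}
    leftover = n_total - sum(alloc.values())
    for k in sorted(remainder, key=remainder.get, reverse=True)[:leftover]:
        alloc[k] += 1

    sample = []
    for k in sorted(strata, key=str):
        sample.extend(rng.sample(strata[k], alloc[k]))
    return sample
```

With strictly proportional allocation a small stratum can receive very few slots, which is one reason a study might instead fix quotas per stratum; the same function structure works if `alloc` is supplied directly.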
Source dataset: Zhao, W. et al. (2024). WildChat-1M. Licensed under ODC-BY. https://huggingface.co/datasets/allenai/WildChat-1M
Files
| Name | Size | MD5 |
|---|---|---|
| filtered_pool.csv | 62.1 MB | 56e477ea9e531f3df67121d0cacc39b2 |
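The published MD5 checksum can be used to confirm that a download of filtered_pool.csv is complete. The sketch below hashes the file in chunks so the 62.1 MB file is never held in memory at once; the helper names and the 1 MiB chunk size are illustrative choices.

```python
import hashlib

EXPECTED_MD5 = "56e477ea9e531f3df67121d0cacc39b2"  # value listed for filtered_pool.csv


def md5_of_chunks(chunks):
    """Return the hex MD5 digest of an iterable of byte strings."""
    digest = hashlib.md5()
    for chunk in chunks:
        digest.update(chunk)
    return digest.hexdigest()


def md5_of_file(path, chunk_size=1 << 20):
    """Hash the file at `path`, reading `chunk_size` bytes at a time."""
    with open(path, "rb") as handle:
        return md5_of_chunks(iter(lambda: handle.read(chunk_size), b""))


def verify_download(path, expected=EXPECTED_MD5):
    """True if the file's MD5 digest matches the published checksum."""
    return md5_of_file(path) == expected
```

For example, `verify_download("filtered_pool.csv")` returns `False` for a truncated or modified download. MD5 is adequate for detecting accidental corruption, not deliberate tampering.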
Additional details
Dates
- Submitted: 2026-02-10 (Initial Upload)
Software
- Repository URL: https://github.com/kaispriestersbach/gio-pilot-study
- Programming language: Python
- Development Status: Active