Welcome to DataNarrative, a dataset and benchmark for automated data-driven storytelling using visualizations and text. This repository contains structured datasets from diverse sources including Tableau, Pew, and GapMinder for training and evaluating LLMs and VLMs in visual-textual narrative generation.
├── Train
│ ├── Pew
│ │ ├── pew_train_images_final/
│ │ └── pew_train.zip
│ ├── Tableau
│ │ ├── tab_001/
│ │ ├── tab_005/
│ │ ├── ... (more tab folders)
│ │ └── tableau_train.json
├── Test
│ ├── GapMinder
│ │ ├── gap_001/
│ │ ├── gap_002/
│ │ ├── ... (more gap folders)
│ │ └── gapminder_test.json
│ ├── Pew
│ │ ├── multiColumn/
│ │ ├── singleColumn/
│ │ └── pew_test.json
│ ├── Tableau
│ │ ├── tab_002/
│ │ ├── tab_003/
│ │ ├── ... (more tab folders)
│ │ └── tableau_test.json
- Pew (Train):
  - Contains images in `pew_train_images_final/`.
  - A zipped version of the metadata is available as `pew_train.zip`.
- Tableau (Train):
  - Folders named `tab_001`, `tab_005`, ..., each of which includes:
    - Chart images
    - Tableau workbooks
    - Associated datasets
  - Metadata linked via `tableau_train.json`.
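Since the Pew training metadata is distributed as a zip archive, it has to be unpacked before use. The following is a minimal sketch using only the Python standard library. The helper name `extract_archive` and the output folder `pew_train_extracted` are illustrative choices, not part of the repository, and the archive path simply follows the directory tree shown in this README.

```python
import zipfile
from pathlib import Path


def extract_archive(zip_path, out_dir):
    """Extract a zip archive into out_dir and return its sorted member names."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        names = sorted(archive.namelist())
        archive.extractall(out_dir)
    return names


if __name__ == "__main__":
    # Assumed location, taken from the directory tree in this README.
    zip_path = Path("Train/Pew/pew_train.zip")
    if zip_path.exists():
        # Hypothetical output folder; any writable directory works.
        for name in extract_archive(zip_path, "Train/Pew/pew_train_extracted"):
            print(name)
```

Listing the member names first is a cheap way to see what the archive actually contains before relying on any particular file name inside it.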
- GapMinder (Test):
  - Contains folders like `gap_001`, `gap_002`, ..., with charts and data.
  - Metadata is provided in `gapminder_test.json`.
- Pew (Test):
  - `multiColumn/` and `singleColumn/`: Contain test charts.
  - Metadata and narrative intents are stored in `pew_test.json`.
- Tableau (Test):
  - Folders like `tab_002`, `tab_003`, ..., similar in structure to the training Tableau set.
  - Metadata is available in `tableau_test.json`.
- Metadata (Pew and GapMinder): The `.json` files (e.g., `pew_test.json`, `gapminder_test.json`) contain comprehensive metadata for each data point. Key fields include:
  - `topic_name`: The general topic of the article/data (e.g., "Politics & Policy").
  - `topic_link`: URL to the topic page on the source website.
  - `intent`: The title or main subject of the specific article or visualization.
  - `article_link`: Direct URL to the source article.
  - `paragraph_table_pair`: An array linking narrative paragraphs (`paragraph`) to associated data visualizations. Each entry contains:
    - `table_id` or `table_path`: Identifier for the source data file.
    - `chart_image`: Path to the corresponding chart image (e.g., `multiColumn/imgs/128.png`).
    - `table`: Raw text data associated with the chart.
    - `chart_type`: Type of the visualization (e.g., "bar", "line").
    - `title`: Title of the chart.
    - `vis_spec`: Visualization specifications (often in a structured format like JSON within the string).
    - `gem_table`, `gpt_table`: Processed or alternative table formats generated by Gemini-1.5-Pro and GPT-4o, respectively (present in Pew data).
- Chart Images: Located in the `imgs/` subdirectories within `pew/` or `gap/`. These are typically `.png` files named according to their `table_id`.
- Data Files: Located in the `data/` subdirectories within `pew/` or `gap/`. These are usually `.txt` or `.csv` files containing the raw data used for the charts.
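The fields listed above can be walked with a few lines of Python. This is a sketch under one stated assumption: the top level of each metadata file is either a list of entries or a dict whose values are entries. The README does not specify this, so `load_records` may need adjusting against the real files. The function names `load_records` and `iter_pairs` are illustrative.

```python
import json


def load_records(json_path):
    """Load a metadata file and return its entries as a list of dicts.

    Assumption: the top level is a list of entries, or a dict whose
    values are entries. Adjust if the real files are shaped differently.
    """
    with open(json_path, encoding="utf-8") as handle:
        data = json.load(handle)
    return list(data.values()) if isinstance(data, dict) else list(data)


def iter_pairs(records):
    """Yield (intent, paragraph, chart_image) for every paragraph/chart pair."""
    for record in records:
        # Entries without paragraph_table_pair are skipped silently.
        for pair in record.get("paragraph_table_pair", []):
            yield record.get("intent"), pair.get("paragraph"), pair.get("chart_image")
```

A typical call would be `for intent, paragraph, chart in iter_pairs(load_records("Test/Pew/pew_test.json")): ...`, where the path follows the directory tree in this README. Missing keys come back as `None` rather than raising, which makes it easier to spot incomplete entries.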
- Metadata (Tableau): The `.json` files (e.g., `tableau_train.json`) contain metadata for training stories. Key fields include:
  - `tab_id`: Unique identifier for the Tableau story (e.g., "tab_001").
  - `topic_name`: The general topic (e.g., "Environment", "Economy").
  - `story_link`: A link to a Google Drive folder containing associated files.
  - `intent`: The title of the Tableau story.
  - `paragraph_table_pair`: Links narrative paragraphs (`paragraph`) to visualizations. Includes:
    - `table_path`: Path to the data file.
    - `chart_image`: Path to the chart image.
    - `chart_type`: Type of visualization.
    - `vis_spec`: Visualization specifications.
- Files: Located within the `tab/` subdirectory:
  - `imgs/`: Contains chart images (`.png`).
  - `workbooks/`: Contains Tableau workbook files (`.twbx`).
  - `data/`: Contains the source data files (often `.csv`).
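The `vis_spec` field is described as often, but not always, holding JSON inside a string, so a tolerant decoder is safer than calling `json.loads` directly. The helper below is a hedged sketch: `parse_vis_spec` is an illustrative name, and the sample spec in the usage note is made up rather than taken from the dataset.

```python
import json


def parse_vis_spec(raw):
    """Return the decoded spec if raw is JSON text, otherwise return raw unchanged."""
    if not isinstance(raw, str):
        return raw  # already decoded, or missing (None)
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        return raw  # not valid JSON; keep the original string
```

For example, `parse_vis_spec('{"mark": "bar"}')` returns a dict, while a free-form string is passed through untouched, so callers can check `isinstance(result, dict)` before reading keys.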
- Each `tab_*` folder contains:
  - `.png` chart images.
  - `.csv` data files and `.twbx` Tableau workbook files.
  - Metadata that links the narrative to visualizations.
- JSON Files (`tableau_test.json`, `gapminder_test.json`, `pew_test.json`, `pew_train.json`) contain relevant mappings:
  - Topic categories
  - Chart-specific narrative intents
  - Data paths for visualization and narrative context
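To check what a given `tab_*` folder actually contains, the files can be grouped by extension. This is a minimal sketch using `pathlib`. The function name `group_by_suffix` is illustrative, and no particular folder layout is assumed beyond files living somewhere under the folder.

```python
from collections import defaultdict
from pathlib import Path


def group_by_suffix(folder):
    """Map each lowercase file extension to the sorted relative paths under folder."""
    folder = Path(folder)
    groups = defaultdict(list)
    for path in sorted(folder.rglob("*")):
        if path.is_file():
            groups[path.suffix.lower()].append(path.relative_to(folder).as_posix())
    return dict(groups)
```

Calling `group_by_suffix("Train/Tableau/tab_001")` (path assumed from the directory tree in this README) would return a dict keyed by extensions such as `.png`, `.csv`, and `.twbx`, which can then be compared against the paths referenced in the metadata.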
You can find an example implementation of the multi-step LLM-Agent framework with GPT-Agent in the following notebook: Sample LLM-Agentic framework implementation.
For a detailed description of the framework, please refer to the paper.
For any queries, feel free to reach out at saidulis@yorku.ca ✉️
If you use this dataset in your research, please cite the following paper:
@inproceedings{islam-etal-2024-datanarrative,
title = "{D}ata{N}arrative: Automated Data-Driven Storytelling with Visualizations and Texts",
author = "Islam, Mohammed Saidul and
Laskar, Md Tahmid Rahman and
Parvez, Md Rizwan and
Hoque, Enamul and
Joty, Shafiq",
editor = "Al-Onaizan, Yaser and
Bansal, Mohit and
Chen, Yun-Nung",
booktitle = "Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing",
month = nov,
year = "2024",
address = "Miami, Florida, USA",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2024.emnlp-main.1073/",
doi = "10.18653/v1/2024.emnlp-main.1073",
pages = "19253--19286",
}