saidul-islam98/DataNarrative

DataNarrative 📊📖

Welcome to DataNarrative, a dataset and benchmark for automated data-driven storytelling using visualizations and text. This repository contains structured datasets from diverse sources including Tableau, Pew, and GapMinder for training and evaluating LLMs and VLMs in visual-textual narrative generation.

📁 Repository Structure

├── Train
│   ├── Pew
│   │   ├── pew_train_images_final/
│   │   └── pew_train.zip
│   └── Tableau
│       ├── tab_001/
│       ├── tab_005/
│       ├── ... (more tab folders)
│       └── tableau_train.json
└── Test
    ├── GapMinder
    │   ├── gap_001/
    │   ├── gap_002/
    │   ├── ... (more gap folders)
    │   └── gapminder_test.json
    ├── Pew
    │   ├── multiColumn/
    │   ├── singleColumn/
    │   └── pew_test.json
    └── Tableau
        ├── tab_002/
        ├── tab_003/
        ├── ... (more tab folders)
        └── tableau_test.json

Dataset Overview

1. Train Set

  • Pew:

    • Contains images in pew_train_images_final/.
    • A zipped version of the metadata is available as pew_train.zip.
  • Tableau:

    • Folders named tab_001, tab_005, ..., each includes:
      • Chart images
      • Tableau workbooks
      • Associated datasets
      • Metadata linked via tableau_train.json
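Because the Pew training metadata ships as an archive, it can be read without unpacking it to disk. The sketch below is a minimal example that assumes pew_train.zip contains one or more .json members; the helper name `load_json_members` is hypothetical and not part of this repository.

```python
import json
import zipfile


def load_json_members(zip_path):
    """Return {member_name: parsed_json} for every .json member of an archive.

    Assumption: the archive stores its metadata as JSON files. Members with
    other extensions are skipped rather than guessed at.
    """
    loaded = {}
    with zipfile.ZipFile(zip_path) as archive:
        for name in archive.namelist():
            if name.lower().endswith(".json"):
                # archive.open() returns a binary file object, which
                # json.load() accepts directly.
                with archive.open(name) as member:
                    loaded[name] = json.load(member)
    return loaded
```

Calling `load_json_members("Train/Pew/pew_train.zip")` from the repository root would return a dictionary keyed by the member names found inside the archive; inspect those keys first, since the internal layout of the zip is not documented here.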

2. Test Set

  • GapMinder:

    • Contains folders like gap_001, gap_002, ..., with charts and data.
    • Metadata is provided in gapminder_test.json.
  • Pew:

    • multiColumn/ and singleColumn/: Contain test charts.
    • Metadata and narrative intents are stored in pew_test.json.
  • Tableau:

    • Folders like tab_002, tab_003, ..., similar in structure to the training Tableau set.
    • Metadata is available in tableau_test.json.

File Access and Metadata 🔑

Test Sets

  • Metadata: The .json files (e.g., pew_test.json, gapminder_test.json) contain comprehensive metadata for each data point. Key fields include:
    • topic_name: The general topic of the article/data (e.g., "Politics & Policy").
    • topic_link: URL to the topic page on the source website.
    • intent: The title or main subject of the specific article or visualization.
    • article_link: Direct URL to the source article.
    • paragraph_table_pair: An array linking narrative paragraphs (paragraph) to associated data visualizations. Each entry contains:
      • table_id or table_path: Identifier for the source data file.
      • chart_image: Path to the corresponding chart image (e.g., multiColumn/imgs/128.png).
      • table: Raw text data associated with the chart.
      • chart_type: Type of the visualization (e.g., "bar", "line").
      • title: Title of the chart.
      • vis_spec: Visualization specifications (often in a structured format like JSON within the string).
      • gem_table, gpt_table: Processed or alternative table formats generated by GPT-4o and Gemini-1.5-Pro (present in Pew data).
  • Chart Images: Located in the imgs/ subdirectories within pew/ or gap/. These are typically .png files named according to their table_id.
  • Data Files: Located in the data/ subdirectories within pew/ or gap/. These are usually .txt or .csv files containing the raw data used for the charts.
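The fields above can be flattened into one row per chart. This is a minimal sketch that assumes each metadata file is a JSON list of records and that every entry of paragraph_table_pair is an object carrying the paragraph and the chart fields side by side; the function name `iter_chart_entries` is hypothetical, and the key handling may need adjusting if the real layout differs.

```python
import json


def iter_chart_entries(metadata_path):
    """Yield one flat dict per chart described in a metadata file.

    Assumptions (not confirmed by the README): the top level is a list of
    records, and each paragraph_table_pair entry is a dict. Missing keys
    come back as None instead of raising.
    """
    with open(metadata_path, encoding="utf-8") as f:
        records = json.load(f)

    for record in records:
        for pair in record.get("paragraph_table_pair", []):
            yield {
                "topic_name": record.get("topic_name"),
                "intent": record.get("intent"),
                "paragraph": pair.get("paragraph"),
                # Entries carry either table_id or table_path.
                "table_ref": pair.get("table_id") or pair.get("table_path"),
                "chart_image": pair.get("chart_image"),
                "chart_type": pair.get("chart_type"),
            }
```

For example, `list(iter_chart_entries("Test/Pew/pew_test.json"))` would collect every paragraph and chart pairing in the Pew test set, which is convenient for counting chart types or pairing images with their narrative text.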

Train Sets

  • Metadata: The .json files (e.g., tableau_train.json) contain metadata for training stories. Key fields include:
    • tab_id: Unique identifier for the Tableau story (e.g., "tab_001").
    • topic_name: The general topic (e.g., "Environment", "Economy").
    • story_link: A link to a Google Drive folder containing associated files.
    • intent: The title of the Tableau story.
    • paragraph_table_pair: Links narrative paragraphs (paragraph) to visualizations. Includes:
      • table_path: Path to the data file.
      • chart_image: Path to the chart image.
      • chart_type: Type of visualization.
      • vis_spec: Visualization specifications.
  • Files: Located within each tab_* folder:
    • imgs/: Contains chart images (.png).
    • workbooks/: Contains Tableau workbook files (.twbx).
    • data/: Contains the source data files (often .csv).
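Before training, it can help to confirm that every table_path and chart_image referenced in the metadata actually exists on disk. The sketch below assumes those paths are relative to a root directory you supply (the README does not state what they are relative to) and that records follow the field names listed above; `missing_assets` is a hypothetical helper.

```python
from pathlib import Path


def missing_assets(records, root):
    """Return (tab_id, field, relative_path) for every referenced file
    that is not found under root.

    Assumption: table_path and chart_image are relative to `root`.
    """
    root = Path(root)
    missing = []
    for record in records:
        for pair in record.get("paragraph_table_pair", []):
            for field in ("table_path", "chart_image"):
                rel = pair.get(field)
                if rel and not (root / rel).is_file():
                    missing.append((record.get("tab_id"), field, rel))
    return missing
```

A typical call would load tableau_train.json with `json.load` and pass the resulting list together with the directory that holds the tab_* folders; an empty result means every referenced file was found under that root.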

How to Access Files

  • Each tab_* folder contains:

    • .png chart images.
    • .csv data files and .twbx Tableau workbook files.
    • Metadata that links the narrative to visualizations.
  • JSON Files (tableau_test.json, gapminder_test.json, pew_test.json, pew_train.json) contain relevant mappings:

    • Topic categories
    • Chart-specific narrative intents
    • Data paths for visualization and narrative context
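To get an overview of what each story folder holds, the files can be grouped by extension. This is a minimal sketch using only the standard library; the function names and the default `tab_*` pattern are illustrative choices, and the code makes no assumption about the subfolder layout beyond files living somewhere under each folder.

```python
from collections import defaultdict
from pathlib import Path


def index_story_folder(folder):
    """Group every file under `folder` by lowercase extension."""
    by_ext = defaultdict(list)
    for path in sorted(Path(folder).rglob("*")):
        if path.is_file():
            by_ext[path.suffix.lower()].append(path)
    return dict(by_ext)


def index_split(split_dir, pattern="tab_*"):
    """Map each matching folder name to its per-extension file index."""
    return {
        d.name: index_story_folder(d)
        for d in sorted(Path(split_dir).glob(pattern))
        if d.is_dir()
    }
```

For instance, `index_split("Test/Tableau")` would return one entry per tab_* folder, and `index_split("Test/GapMinder", pattern="gap_*")` would do the same for the GapMinder folders, so the .png, .csv, and .twbx files of a story can be looked up by extension.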

Example Usage

You can find an example implementation of the multi-step LLM-Agent framework with GPT-Agent in the following notebook: Sample LLM-Agentic framework implementation.

For a detailed description of the framework, please refer to the paper.

📬 Contact

For any queries, feel free to reach out at saidulis@yorku.ca ✉️

Citation 📜

If you use this dataset in your research, please cite the following paper:

@inproceedings{islam-etal-2024-datanarrative,
    title = "{D}ata{N}arrative: Automated Data-Driven Storytelling with Visualizations and Texts",
    author = "Islam, Mohammed Saidul  and
      Laskar, Md Tahmid Rahman  and
      Parvez, Md Rizwan  and
      Hoque, Enamul  and
      Joty, Shafiq",
    editor = "Al-Onaizan, Yaser  and
      Bansal, Mohit  and
      Chen, Yun-Nung",
    booktitle = "Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing",
    month = nov,
    year = "2024",
    address = "Miami, Florida, USA",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2024.emnlp-main.1073/",
    doi = "10.18653/v1/2024.emnlp-main.1073",
    pages = "19253--19286",
}

About

This is the repository for our work titled 'DataNarrative: Automated Data-Driven Storytelling with Visualizations and Texts'
