Leading large language models (LLMs) are trained on public data. However, the majority of the world’s data is dark data that is not publicly accessible, mainly in the form of private organizational data or enterprise data. We show that the performance of LLM-based methods seriously degrades when tested on real-world enterprise datasets. Current benchmarks, based on public data, overestimate the performance of LLMs. We release a benchmark of enterprise data, the Goby benchmark, to the scientific community to advance discovery in the area of enterprise data management. Based on our experience with this enterprise benchmark, we propose techniques to uplift the performance of LLMs on this more challenging data distribution: (1) hierarchical annotation, (2) runtime class-learning, and (3) ontology synthesis. We show that once these techniques are deployed, performance on enterprise data is on par with public data.
The GOBY Benchmark Dataset is designed to aid in evaluating data integration methods on structured enterprise data. This dataset includes categories, entities, and results derived from various data sources, represented in a unified schema. Key components include:
- Tags: Attributes within the unified schema.
- Results: Records sourced from wrappers, often web scrapers targeting specific sites.
- Entities: Records from original data sources, represented with "tags" that correspond to unified attributes.
- Wrappers: Data sources, typically web scrapers generating structured output.
The primary data archive, goby.tar.gz, contains the following key directories:
- dump/: PostgreSQL dump files that include:
  - doit_categories: Data categories with record counts.
  - doit_data: Triple-based data representing (category_id, source_id, entity_id, name, value).
  - Additional mapping and result files.
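To make the doit_data layout concrete, here is a minimal sketch of how such rows could be pivoted into one record per entity. It assumes the rows have already been loaded from the dump as tuples in the column order listed above; the function name `pivot_rows` and the sample rows are hypothetical and are not taken from the benchmark.

```python
from collections import defaultdict


def pivot_rows(rows):
    """Group (category_id, source_id, entity_id, name, value) rows by entity.

    Returns a dict keyed by (category_id, source_id, entity_id) whose values
    map attribute names to attribute values.
    """
    entities = defaultdict(dict)
    for category_id, source_id, entity_id, name, value in rows:
        key = (category_id, source_id, entity_id)
        # If an attribute name repeats for one entity, the last value wins.
        entities[key][name] = value
    return dict(entities)


# Hypothetical rows, for illustration only.
sample_rows = [
    (1, 10, 100, "title", "Jazz Night"),
    (1, 10, 100, "city", "Boston"),
    (1, 11, 200, "title", "Open Mic"),
]
records = pivot_rows(sample_rows)
```

With the sample rows above, `records` holds two entities, and the first one carries both its `title` and `city` attributes in a single dictionary. If the real data contains multi-valued attributes, collecting values into lists instead of overwriting them would be the safer choice.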
To access the GOBY dataset:
- Download the goby.zip file from the repository (link forthcoming).
- Extract it using a tool like:
unzip -P your_password goby.zip -d /path/to/extract/
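The same extraction step can be sketched in Python for scripted setups. This is a sketch under stated assumptions, not part of the benchmark tooling: the helper name `extract_archive` is hypothetical, and it accepts either a zip or a tar archive because both goby.zip and goby.tar.gz are mentioned on this page. Python's standard `zipfile` module can only decrypt the legacy zip encryption scheme, so an AES-encrypted archive would still need the `unzip` command or a similar tool.

```python
import tarfile
import zipfile
from pathlib import Path


def extract_archive(archive_path, dest_dir, password=None):
    """Extract a zip or tar archive into dest_dir and list the extracted files.

    password is only used for zip archives (legacy zip encryption only).
    """
    archive_path = Path(archive_path)
    dest_dir = Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)

    if zipfile.is_zipfile(archive_path):
        pwd = password.encode("utf-8") if password else None
        with zipfile.ZipFile(archive_path) as zf:
            zf.extractall(dest_dir, pwd=pwd)
    elif tarfile.is_tarfile(archive_path):
        # tarfile detects gzip compression automatically in the default mode.
        with tarfile.open(archive_path) as tf:
            tf.extractall(dest_dir)
    else:
        raise ValueError(f"Unsupported archive format: {archive_path}")

    # Return relative paths of all files now present under dest_dir.
    return sorted(
        p.relative_to(dest_dir).as_posix()
        for p in dest_dir.rglob("*")
        if p.is_file()
    )
```

A call such as `extract_archive("goby.zip", "/path/to/extract/", password="your_password")` mirrors the command above and returns the list of extracted files, which makes it easy to check that the dump/ directory is present.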
The benchmark mentioned in the abstract can be downloaded here using the download button with the passcode:
GOBY2025
Parts of this project page were adopted from the Nerfies page.
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
