This repository provides labeled data for training abbreviation expansion models, as described in:
Gorman, K., Kirov, C., Roark, B., and Sproat, R. 2021. Structured abbreviation in context. In Findings of the Association for Computational Linguistics: EMNLP 2021, pages 995-1005.
If you use this data in a publication, we would appreciate it if you cite this paper.
Sentences were extracted from English Wikipedia articles, then filtered as described in the paper. Annotators were then asked to introduce abbreviations to the sentences.
The data, with the original 80%/10%/10% split, can be found in the data directory. The data are text-format Protocol Buffers using the protocol described in abbreviation.proto.
To load this data into Python, install the Protocol Buffers compiler protoc, then:
    pip install -r requirements.txt
    make
Then, see textproto.py.
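For orientation, reading a text-format Protocol Buffer file in Python might look roughly like the sketch below. This is not the contents of textproto.py. The helper name read_textproto is hypothetical, and the only library call used is text_format.Parse from the protobuf package.

```python
from pathlib import Path


def read_textproto(path, message):
    """Parses the text-format Protocol Buffer file at `path` into `message`.

    `read_textproto` is a hypothetical helper written for illustration; it is
    not taken from textproto.py.
    """
    # Requires the `protobuf` package. It is imported here so that the helper
    # can be defined even where protobuf is not installed.
    from google.protobuf import text_format

    text = Path(path).read_text(encoding="utf-8")
    # text_format.Parse merges the parsed fields into `message` in place and
    # returns that same message object.
    return text_format.Parse(text, message)
```

Here, message would be an empty instance of a message type defined in abbreviation.proto. Assuming make runs protoc with Python output, the generated module would conventionally be named abbreviation_pb2. Check abbreviation.proto for the actual message names, and the data directory for the file names.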
This data was collected by Kyle Gorman with help from the annotators and Brian Roark, Richard Sproat, Olivia Redfield, Caterina Golner, and Katherine Wang.
See LICENSE.
See CONTRIBUTING.
This is not an official Google product.