This repository was archived by the owner on Dec 7, 2023. It is now read-only.

google-research-datasets/WikipediaAbbreviationData

Abbreviation data

This repository provides labeled data for training abbreviation expansion models, as described in:

Gorman, K., Kirov, C., Roark, B., and Sproat, R. 2021. Structured abbreviation in context. In Findings of the Association for Computational Linguistics: EMNLP 2021, pages 995-1005.

If you use this data in a publication, we would appreciate it if you cited this paper.

Annotation

Sentences were extracted from English Wikipedia articles, then filtered as described in the paper. Annotators were then asked to introduce abbreviations into the sentences.

Organization

The data, in the original 80%/10%/10% split, are in the data directory. The files are text-format Protocol Buffers that follow the schema defined in abbreviation.proto. To load the data in Python, install the Protocol Buffers compiler, protoc, then run:

pip install -r requirements.txt
make

Then, see textproto.py.
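
As a rough illustration of what reading a text-format Protocol Buffer file looks like in Python, here is a minimal sketch. It is not the contents of textproto.py. The helper name load_textproto is made up for this example, and the message type and file name in the commented usage are placeholders, since the actual names are defined in abbreviation.proto and the data directory. The only library call used is text_format.Parse from the protobuf package.

```python
from google.protobuf import text_format


def load_textproto(path, message):
    """Parses the text-format file at `path` into `message` and returns it.

    `message` must be an empty instance of the protobuf message type that
    the file contains.
    """
    with open(path, "r", encoding="utf-8") as source:
        text_format.Parse(source.read(), message)
    return message


# Hypothetical usage, assuming `make` has generated abbreviation_pb2.py.
# `SomeMessage` and the file path are placeholders, not names taken from
# this repository; check abbreviation.proto and the data directory.
#
#   import abbreviation_pb2
#   data = load_textproto("data/<split file>", abbreviation_pb2.SomeMessage())
```

The helper takes an empty message instance rather than hard-coding a type, so the same function can be reused for any message defined in the schema.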

Authors

The data were collected by Kyle Gorman with help from the annotators and from Brian Roark, Richard Sproat, Olivia Redfield, Caterina Golner, and Katherine Wang.

License

See LICENSE.

Contributing

See CONTRIBUTING.

Mandatory disclaimer

This is not an official Google product.

About

This data set consists of 24,000 English sentences, extracted from Wikipedia in 2017, annotated to support development of an abbreviation expansion system for text-to-speech synthesis (e.g., a systm tht cn prnounc txt lk ths).
