# outspoke-data — License and Attribution
This repository contains derived data files used by the
[Outspoke](https://github.com/brgrdev/outspoke) Android app for offline
word suggestion and spelling correction. The files are **not** part of the
Outspoke application source code; they are optional downloads that users
initiate from within the app.
---
## dict_*.txt — Frequency Word Lists
**License: Creative Commons Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)**
https://creativecommons.org/licenses/by-sa/4.0/
These files are derived from the
[FrequencyWords](https://github.com/hermitdave/FrequencyWords) project by
Hermit Dave, which in turn is based on the
[OpenSubtitles 2018](http://opus.nlpl.eu/OpenSubtitles2018.php) corpus made
available through the OPUS collection (Tiedemann, 2012).
The word lists were processed as follows: the top 100,000 words by frequency
were extracted, filtered to alphabetic-only tokens of three or more characters,
and re-weighted as base-10 logarithms of their relative frequencies. The
resulting files are a derivative work under CC BY-SA 4.0. Redistribution of
these files, including in this repository, is permitted under the same license
with attribution.
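The processing steps described above can be sketched roughly as follows. This is an illustration only, not the script that produced the files. The function name `build_dict`, its parameters, the `word count` input layout, and the choice to normalise over the words that survive filtering are all assumptions.

```python
import math

def build_dict(lines, top_n=100_000, min_len=3):
    """Return (word, log10 relative frequency) pairs from 'word count' lines.

    Hypothetical helper: the input layout and normalisation are assumptions.
    """
    counts = []
    for line in lines:
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        try:
            counts.append((parts[0], int(parts[1])))
        except ValueError:
            continue  # count column is not an integer
    # Step 1: keep the top_n entries by raw frequency.
    counts.sort(key=lambda wc: wc[1], reverse=True)
    top = counts[:top_n]
    # Step 2: keep alphabetic-only tokens of at least min_len characters.
    kept = [(w, c) for w, c in top if w.isalpha() and len(w) >= min_len]
    # Step 3: re-weight as log10 relative frequency.
    # ASSUMPTION: frequencies are normalised over the kept words only.
    total = sum(c for _, c in kept)
    return [(w, math.log10(c / total)) for w, c in kept]
```

Under these assumptions, passing an open file handle (or any iterable of lines) yields pairs sorted from most to least frequent, with every weight at or below zero.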
**Attribution:**
- Hermit Dave, FrequencyWords (https://github.com/hermitdave/FrequencyWords)
- Jörg Tiedemann (2012). Parallel Data, Tools and Interfaces in OPUS.
In Proceedings of the 8th International Conference on Language Resources
and Evaluation (LREC 2012).
---
## lm_*.arpa — Bigram Language Models
**License: Creative Commons Attribution 4.0 International (CC BY 4.0)**
https://creativecommons.org/licenses/by/4.0/
These files are statistical bigram language models (in ARPA format) derived
from news text corpora from the
[Leipzig Corpora Collection](https://wortschatz.uni-leipzig.de/en/download),
made available by Wortschatz Leipzig under CC BY 4.0.
The models were built by extracting up to 200,000 sentences from each corpus,
tokenising, applying add-1 (Laplace) smoothing, and serialising unigram and
bigram log probabilities in standard ARPA format. The resulting models are a
derivative work of the source corpora. Redistribution under CC BY 4.0 is
permitted with attribution.
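As a rough illustration of the pipeline described above, the sketch below counts unigrams and bigrams, applies add-1 smoothing, and writes base-10 log probabilities in an ARPA-style layout. It is not the script used to build the files. The function name `train_bigram_arpa`, the lowercase whitespace tokenisation, the `<s>`/`</s>` sentence markers, and the omission of back-off weights are assumptions.

```python
import math
from collections import Counter
from itertools import islice

def train_bigram_arpa(sentences, max_sentences=200_000):
    """Return an ARPA-style string for an add-1 smoothed bigram model.

    Hypothetical helper: tokenisation and sentence markers are assumptions.
    """
    unigrams, bigrams = Counter(), Counter()
    for sentence in islice(sentences, max_sentences):
        # ASSUMPTION: lowercase whitespace tokenisation with boundary markers.
        tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    vocab_size = len(unigrams)
    total = sum(unigrams.values())
    out = ["\\data\\", f"ngram 1={len(unigrams)}",
           f"ngram 2={len(bigrams)}", "", "\\1-grams:"]
    for word, count in sorted(unigrams.items()):
        # Add-1 smoothing: P(w) = (c(w) + 1) / (N + V)
        prob = (count + 1) / (total + vocab_size)
        out.append(f"{math.log10(prob):.4f}\t{word}")
    out += ["", "\\2-grams:"]
    for (w1, w2), count in sorted(bigrams.items()):
        # Add-1 smoothing: P(w2 | w1) = (c(w1, w2) + 1) / (c(w1) + V)
        prob = (count + 1) / (unigrams[w1] + vocab_size)
        out.append(f"{math.log10(prob):.4f}\t{w1} {w2}")
    out += ["", "\\end\\"]
    return "\n".join(out) + "\n"
```

Back-off weights are left out here because add-1 smoothing assigns probability mass to unseen bigrams directly; some ARPA readers may still expect a back-off column on the unigram lines.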
**Attribution:**
- D. Goldhahn, T. Eckart & U. Quasthoff (2012). Building Large Monolingual
Dictionaries at the Leipzig Corpora Collection: From 100 to 200 Languages.
In Proceedings of the 8th International Conference on Language Resources
and Evaluation (LREC 2012). https://wortschatz.uni-leipzig.de
---
## Files in this repository
| File | Source | License |
|---------------|-------------------------------|--------------|
| dict_de.txt | FrequencyWords / OpenSubtitles 2018 | CC BY-SA 4.0 |
| dict_en.txt | FrequencyWords / OpenSubtitles 2018 | CC BY-SA 4.0 |
| dict_es.txt | FrequencyWords / OpenSubtitles 2018 | CC BY-SA 4.0 |
| dict_fr.txt | FrequencyWords / OpenSubtitles 2018 | CC BY-SA 4.0 |
| dict_it.txt | FrequencyWords / OpenSubtitles 2018 | CC BY-SA 4.0 |
| dict_nl.txt | FrequencyWords / OpenSubtitles 2018 | CC BY-SA 4.0 |
| dict_pl.txt | FrequencyWords / OpenSubtitles 2018 | CC BY-SA 4.0 |
| lm_de.arpa | Leipzig Corpora Collection | CC BY 4.0 |
| lm_en.arpa | Leipzig Corpora Collection | CC BY 4.0 |
| lm_es.arpa | Leipzig Corpora Collection | CC BY 4.0 |
| lm_fr.arpa | Leipzig Corpora Collection | CC BY 4.0 |
| lm_it.arpa | Leipzig Corpora Collection | CC BY 4.0 |
| lm_nl.arpa | Leipzig Corpora Collection | CC BY 4.0 |
| lm_pl.arpa | Leipzig Corpora Collection | CC BY 4.0 |