GitHub - minburg/outspoke-data

# outspoke-data — License and Attribution

This repository contains derived data files used by the
[Outspoke](https://github.com/brgrdev/outspoke) Android app for offline
word suggestion and spelling correction. The files are **not** part of the
Outspoke application source code; they are optional downloads that users
initiate from within the app.

---

## dict_*.txt — Frequency Word Lists

**License: Creative Commons Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)**
https://creativecommons.org/licenses/by-sa/4.0/

These files are derived from the
[FrequencyWords](https://github.com/hermitdave/FrequencyWords) project by
Hermit Dave, which in turn is based on the
[OpenSubtitles 2018](http://opus.nlpl.eu/OpenSubtitles2018.php) corpus made
available through the OPUS collection (Tiedemann, 2012).

The word lists were processed as follows: the top 100,000 words by frequency
were extracted, filtered to alphabetic-only tokens of three or more characters,
and re-weighted as log10 relative frequencies. This constitutes a derivative
work under CC BY-SA 4.0. Redistribution of these files, including in this
repository, is permitted under the same license with attribution.
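
The processing steps above can be sketched roughly as follows. This is a minimal illustration, not the script actually used to build the files: the function name `build_frequency_list` is hypothetical, the input is assumed to be `word count` pairs separated by whitespace, and the relative frequency is assumed to be normalised over the words that survive filtering (the text above does not say which total was used).

```python
import math


def build_frequency_list(lines, top_n=100_000, min_len=3):
    """Turn 'word count' lines into (word, log10 relative frequency) pairs.

    Hypothetical helper. Assumptions: whitespace-separated 'word count'
    input, and normalisation over the counts of the kept words only.
    """
    entries = []
    for line in lines:
        parts = line.split()
        # Skip blank or malformed lines instead of failing on them.
        if len(parts) != 2 or not parts[1].isdigit():
            continue
        entries.append((parts[0], int(parts[1])))

    # Step 1: keep the top_n entries by raw frequency.
    entries.sort(key=lambda entry: entry[1], reverse=True)
    top = entries[:top_n]

    # Step 2: keep alphabetic-only tokens of at least min_len characters.
    kept = [(w, c) for w, c in top if w.isalpha() and len(w) >= min_len]

    # Step 3: re-weight as log10 of the relative frequency.
    total = sum(c for _, c in kept)
    return [(w, math.log10(c / total)) for w, c in kept]
```

Note that the filter runs after the top-100,000 cut, as described above, so the resulting list can contain fewer than 100,000 entries.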

**Attribution:**
- Hermit Dave, FrequencyWords (https://github.com/hermitdave/FrequencyWords)
- Jörg Tiedemann (2012). Parallel Data, Tools and Interfaces in OPUS.
  In Proceedings of the 8th International Conference on Language Resources
  and Evaluation (LREC 2012).

---

## lm_*.arpa — Bigram Language Models

**License: Creative Commons Attribution 4.0 International (CC BY 4.0)**
https://creativecommons.org/licenses/by/4.0/

These files are statistical bigram language models (in ARPA format) derived
from news text corpora from the
[Leipzig Corpora Collection](https://wortschatz.uni-leipzig.de/en/download),
made available by Wortschatz Leipzig under CC BY 4.0.

The models were built by extracting up to 200,000 sentences from each corpus,
tokenising them, applying add-one (Laplace) smoothing, and serialising the
unigram and bigram log probabilities in standard ARPA format. This constitutes
a derivative work. Redistribution under CC BY 4.0 is permitted with attribution.
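
As an illustration of the build steps above, the following sketch trains an add-1 smoothed bigram model and writes it out in ARPA layout. It is not the actual build script: the function name `train_bigram_arpa`, the lowercase whitespace tokenisation, the `<s>` and `</s>` sentence markers, and the omission of back-off weights are all assumptions made for this example.

```python
import math
from collections import Counter


def train_bigram_arpa(sentences):
    """Return an ARPA-format string for an add-1 smoothed bigram model.

    Hypothetical helper. Assumptions: lowercase whitespace tokenisation,
    <s>/</s> sentence markers, and no back-off weights in the output.
    """
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))

    vocab_size = len(unigrams)
    total_tokens = sum(unigrams.values())

    lines = [
        "\\data\\",
        f"ngram 1={len(unigrams)}",
        f"ngram 2={len(bigrams)}",
        "",
        "\\1-grams:",
    ]
    for word, count in sorted(unigrams.items()):
        # Add-1 smoothing: P(w) = (c(w) + 1) / (N + V)
        logp = math.log10((count + 1) / (total_tokens + vocab_size))
        lines.append(f"{logp:.6f}\t{word}")

    lines += ["", "\\2-grams:"]
    for (w1, w2), count in sorted(bigrams.items()):
        # Add-1 smoothing: P(w2 | w1) = (c(w1, w2) + 1) / (c(w1) + V)
        logp = math.log10((count + 1) / (unigrams[w1] + vocab_size))
        lines.append(f"{logp:.6f}\t{w1} {w2}")

    lines += ["", "\\end\\"]
    return "\n".join(lines) + "\n"
```

Only bigrams observed in the training sentences are written out. With add-1 smoothing, an unseen bigram starting with `w1` would receive probability `1 / (c(w1) + V)`, which a consumer can compute from the unigram counts or approximate however it prefers.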

**Attribution:**
- D. Goldhahn, T. Eckart & U. Quasthoff (2012). Building Large Monolingual
  Dictionaries at the Leipzig Corpora Collection: From 100 to 200 Languages.
  In Proceedings of the 8th International Conference on Language Resources
  and Evaluation (LREC 2012). https://wortschatz.uni-leipzig.de

---

## Files in this repository

| File        | Source                              | License      |
|-------------|-------------------------------------|--------------|
| dict_de.txt | FrequencyWords / OpenSubtitles 2018 | CC BY-SA 4.0 |
| dict_en.txt | FrequencyWords / OpenSubtitles 2018 | CC BY-SA 4.0 |
| dict_es.txt | FrequencyWords / OpenSubtitles 2018 | CC BY-SA 4.0 |
| dict_fr.txt | FrequencyWords / OpenSubtitles 2018 | CC BY-SA 4.0 |
| dict_it.txt | FrequencyWords / OpenSubtitles 2018 | CC BY-SA 4.0 |
| dict_nl.txt | FrequencyWords / OpenSubtitles 2018 | CC BY-SA 4.0 |
| dict_pl.txt | FrequencyWords / OpenSubtitles 2018 | CC BY-SA 4.0 |
| lm_de.arpa  | Leipzig Corpora Collection          | CC BY 4.0    |
| lm_en.arpa  | Leipzig Corpora Collection          | CC BY 4.0    |
| lm_es.arpa  | Leipzig Corpora Collection          | CC BY 4.0    |
| lm_fr.arpa  | Leipzig Corpora Collection          | CC BY 4.0    |
| lm_it.arpa  | Leipzig Corpora Collection          | CC BY 4.0    |
| lm_nl.arpa  | Leipzig Corpora Collection          | CC BY 4.0    |
| lm_pl.arpa  | Leipzig Corpora Collection          | CC BY 4.0    |