Wiktionary2Dict

Command-line tool for extracting International Phonetic Alphabet (IPA) word pronunciations from a XML Wiktionary dump (<LANG>wiktionary-<DATE>-pages-articles-multistream.xml.bz2).

Inspired by https://github.com/Kyubyong/pron_dictionaries

See gruut for a pre-compiled set of lexicons.

Dependencies

Python 3.7 or higher
lxml

Supported Languages

Each language's wiktionary is slightly different, so custom processing functions are needed. The following languages have been tested:

de-de (dewiktionary)
- German
cs-cz (cswiktionary)
- Czech
en-us (enwiktionary)
- U.S. English
fr-fr (frwiktionary)
- French
it-it (itwiktionary)
- Italian
nl-nl (nlwiktionary)
- Dutch

Installation

The use of a virtual environment is recommended for installation. The create-venv.sh script in the scripts/ directory will handle most of the details for you:

$ git clone https://github.com/rhasspy/wiktionary2dict
$ cd wiktionary2dict
$ scripts/create-venv.sh

Once installation is completed, you should be able to run:

$ bin/wiktionary2dict --help

Usage

First, download <LANG>wiktionary/<LANG>wiktionary-<DATE>-pages-articles-multistream.xml.bz2 from a mirror where <LANG> is the language and <DATE> is the date of the dump. For example:

$ wget enwiktionary-20200801-pages-articles-multistream.xml.bz2

Next, extract pronunciations and clean/sort/compress the resulting lexicon:

$ bzcat enwiktionary-20200801-pages-articles-multistream.xml.bz2 | \
    bin/wiktionary2dict en-us --casing lower | \
    sort | uniq | gzip -9 --to-stdout > en-us_lexicon.txt.gz

After it's finished, you can view a few entries with:

$ zcat en-us_lexicon.txt.gz | head -n 25
0 oʊ
101 ˈwʌnˌoʊ̯ˈwʌn
1080 tɛnˈʔeɪ̯ti
2.0 tupɔɪntoʊ
470 ˌfɔɹˈsɛvəɾ̃i
6x6 sɪksbaɪsɪks
747 ˌsɛvənfoɹtiˈsɛvən
aabomycin ˌæboʊˈmaɪsɪn
aachen ˈɑkən
aalborg ˈɔlbɔɹ
aam ɑm
aandblom ˈɑntblɑm
aardvark ˈɑɹdvɑɹk
aasvoel ˈɑsˌfoʊəl
ababda əˈbæbdə
ababdeh əˈbæbdɛ
abacisci ˌæbəˈsɪˌsaɪ
abacost ˈæbəkɑst
abakan ˌɑbəˈkɑn
abarognosis ˌæbəɹəɡˈnoʊsɪs
abarthroses ˌæˌbɑɹˈθɹoʊsiz
abatises ˈæbətəsəz
abatjour ˌɑˌbɑˈʒʊɚ
abatsons ˌɑˌbɑˈsɔ̃
abatsons ˌɑˌbɑˈsɔ̃z

Adding a New Language

To add support for a new language, you must:

Create a process_<LANG> function in __init__.py
Add an entry to __LANG_PROCESS in __main__.py

Your process_<LANG> function should have the following form:

def process_my_lang(text: str) -> typing.Iterable[str]:
    # Process content of <revision><text>...</text></revision>
    # yield IPA pronunciations

Because each language's Wiktionary is slightly different, you will need to inspect the XML yourself and determine the best way to extract pronunciations. Look at the existing process_ functions for ideas of how to start.

Most of the <text> sections that you need to process contain MediaWiki markup. The desired IPA pronunciation is typically under a specific section/heading, and often begins with "IPA". Be careful to handle cases where multiple pronunciations from different regions (or even different languages) are all mixed together!

Name		Name	Last commit message	Last commit date
Latest commit History 5 Commits
bin		bin
scripts		scripts
wiktionary2dict		wiktionary2dict
.gitignore		.gitignore
.projectile		.projectile
.python_version		.python_version
LICENSE		LICENSE
Makefile		Makefile
README.md		README.md
mypy.ini		mypy.ini
pylintrc		pylintrc
requirements.txt		requirements.txt
requirements_dev.txt		requirements_dev.txt
setup.cfg		setup.cfg

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Wiktionary2Dict

Dependencies

Supported Languages

Installation

Usage

Adding a New Language

About

Uh oh!

Releases 1

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Wiktionary2Dict

Dependencies

Supported Languages

Installation

Usage

Adding a New Language

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases 1

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages