List of common stop words in various languages.
The words are normalized to Unicode Normalization Form C (NFC).
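As a small illustration of what that normalization means, the sketch below uses Python's standard `unicodedata` module to compose a word written with combining accents. The helper name `to_nfc` is hypothetical and is not taken from this repository.

```python
import unicodedata


def to_nfc(word: str) -> str:
    """Return the word in Unicode Normalization Form C (hypothetical helper)."""
    return unicodedata.normalize("NFC", word)


# "été" spelled with a plain "e" followed by a combining acute accent (U+0301).
decomposed = "e\u0301te\u0301"
composed = to_nfc(decomposed)

# The two strings render the same but differ in length and code points.
print(len(decomposed), len(composed))
```

Normalizing before comparison keeps visually identical words from being treated as distinct entries.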
There is a manage.py script useful for maintaining the word lists.
To merge the English word list with new lists, you can use the following:

    python -m manage merge en /tmp/new_list.txt /tmp/another_new_list.txt

The language code above is used for two purposes:
- Determining the source file based on languages.json
- Determining the libICU locale to use when comparing words
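Conceptually, a merge of this kind combines the existing words with the new ones, normalizes them, and removes duplicates. The following is only a rough sketch under stated assumptions: `merge_word_lists` is a hypothetical function, not the code in manage.py, and it sorts by plain code-point order as a stand-in for the libICU comparison.

```python
import unicodedata
from typing import Iterable, List


def merge_word_lists(existing: Iterable[str], *new_lists: Iterable[str]) -> List[str]:
    """Merge word lists into one NFC-normalized, de-duplicated, sorted list.

    Hypothetical helper: sorting uses code-point order, not libICU collation.
    """
    merged = set()
    for words in (existing, *new_lists):
        for word in words:
            # Strip surrounding whitespace (such as newlines) and normalize.
            word = unicodedata.normalize("NFC", word.strip())
            if word:  # skip empty lines
                merged.add(word)
    return sorted(merged)


result = merge_word_lists(["the", "a"], ["an", "the\n"], ["", "of"])
print(result)
```

Any iterable of strings works as input, so open file objects can be passed directly.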
If new words are added manually, you can use the following to maintain the sorting order:

    python -m manage sort en

or simply:

    python -m manage sort-all

The management script contains code that can be used as a library. See the LanguageDataIndex class and the sort_word_list function for more details.
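To show what locale-aware sorting can look like, here is a minimal sketch that assumes the optional third-party PyICU package and falls back to code-point order when it is not installed. The function `sort_words_for_locale` is hypothetical; it does not reproduce the signature or behavior of the script's own sort_word_list.

```python
from typing import Iterable, List

try:
    import icu  # PyICU, an optional third-party binding for libICU
except ImportError:
    icu = None


def sort_words_for_locale(words: Iterable[str], language_code: str) -> List[str]:
    """Sort words using the collation rules of the given locale when available."""
    if icu is not None:
        # Build a collator for the locale and sort by its binary sort keys.
        collator = icu.Collator.createInstance(icu.Locale(language_code))
        return sorted(words, key=collator.getSortKey)
    # Fallback (assumption): plain code-point order, which ignores locale rules.
    return sorted(words)


sorted_words = sort_words_for_locale(["banana", "cherry", "apple"], "en")
print(sorted_words)
```

The two paths agree for plain ASCII lowercase words, but they can differ for accented or mixed-case words, which is where a locale-specific collator matters.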
- Arabic
- Bulgarian
- Catalan
- Chinese
- Czech
- Danish
- Dutch
- English
- Finnish
- French
- German
- Greek
- Gujarati
- Hindi
- Hebrew
- Hungarian
- Indonesian
- Malaysian
- Italian
- Japanese
- Korean
- Norwegian
- Polish
- Portuguese
- Romanian
- Russian
- Slovak
- Spanish
- Swedish
- Turkish
- Ukrainian
- Vietnamese
- Persian/Farsi
You know how ;)