Alir3z4/stop-words

Stop Words

List of common stop words in various languages.

All words are normalized to Unicode Normalization Form C (NFC).
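As a rough illustration of what NFC normalization means for a word list, the sketch below uses Python's standard `unicodedata` module. The helper name `to_nfc` is hypothetical and is not part of this repository.

```python
import unicodedata


def to_nfc(word: str) -> str:
    """Return the word in Unicode Normalization Form C (composed form)."""
    return unicodedata.normalize("NFC", word)


# "e" followed by a combining acute accent (two code points for one letter)
decomposed = "cafe\u0301"
composed = to_nfc(decomposed)

# After normalization the accented letter is a single code point.
print(len(decomposed), len(composed))  # prints "5 4"
```

Normalizing before comparison means two visually identical words with different underlying code points are treated as the same entry.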

Maintaining the lists

The manage.py script provides commands for maintaining the word lists.

To merge the English word list with new lists, you can use the following:

python -m manage merge en /tmp/new_list.txt /tmp/another_new_list.txt

The language code above is used for two purposes:

  1. Determining the source file based on languages.json
  2. Determining the libICU locale to use when comparing words
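Conceptually, a merge combines the existing list with the new ones, drops duplicates, and re-sorts the result. The sketch below is only an approximation under stated assumptions: `merge_word_lists` is a hypothetical helper, not the function used by manage.py, and it sorts by plain code point order rather than with a libICU locale.

```python
import unicodedata
from typing import Iterable, List


def merge_word_lists(existing: Iterable[str], *new_lists: Iterable[str]) -> List[str]:
    """Merge word lists into one sorted list of unique, NFC-normalized words."""
    merged = set()
    for words in (existing, *new_lists):
        for word in words:
            # Strip surrounding whitespace and normalize to NFC.
            word = unicodedata.normalize("NFC", word.strip())
            if word:  # skip empty lines
                merged.add(word)
    # Assumption: plain code point ordering. The real script compares
    # words using a libICU locale chosen from the language code.
    return sorted(merged)


current = ["a", "the", "an"]
incoming = ["the", "of ", "", "an"]
print(merge_word_lists(current, incoming))  # ['a', 'an', 'of', 'the']
```

Code point ordering happens to work for plain ASCII words, but it can misplace accented or non-Latin words, which is why a locale-aware comparison matters for most of the languages listed here.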

If new words are added manually, you can use the following to maintain the sorting order:

python -m manage sort en

or, to sort every word list at once:

python -m manage sort-all

The management script contains code that can be used as a library. See the LanguageDataIndex class and the sort_word_list function for more details.

Available languages

  • Arabic
  • Bulgarian
  • Catalan
  • Chinese
  • Czech
  • Danish
  • Dutch
  • English
  • Finnish
  • French
  • German
  • Greek
  • Gujarati
  • Hindi
  • Hebrew
  • Hungarian
  • Indonesian
  • Malaysian
  • Italian
  • Japanese
  • Korean
  • Norwegian
  • Polish
  • Portuguese
  • Romanian
  • Russian
  • Slovak
  • Spanish
  • Swedish
  • Turkish
  • Ukrainian
  • Vietnamese
  • Persian/Farsi

Contributing

You know how ;)

Programming languages support

License

Attribution 4.0 International (CC BY 4.0)