Many of the supported languages defined in bm25.py, including chinese and bengali, do not work and cause a thread panic.
I am not sure whether the language list is outdated or whether a bug elsewhere is causing the issue. I see that on the first run of SparseTextEmbedding some files, such as chinese.txt, are downloaded.
supported_languages = [
    "arabic",
    "azerbaijani",
    "basque",
    "bengali",
    "catalan",
    "chinese",
    "danish",
    "dutch",
    "english",
    "finnish",
    "french",
    "german",
    "greek",
    "hebrew",
    "hinglish",
    "hungarian",
    "indonesian",
    "italian",
    "kazakh",
    "nepali",
    "norwegian",
    "portuguese",
    "romanian",
    "russian",
    "slovene",
    "spanish",
    "swedish",
    "tajik",
    "turkish",
]
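To narrow down which entries in the list above actually fail, a probe along these lines might help. This is only a sketch and has not been verified against FastEmbed v0.6.0. The helper names `probe_languages` and `run_against_fastembed` are hypothetical, the import path `fastembed.sparse.bm25` is assumed from the file name bm25.py, and it assumes the panic reaches Python as a catchable exception rather than aborting the process.

```python
from typing import Callable, Dict, Iterable, Optional


def probe_languages(
    factory: Callable[[str], object], languages: Iterable[str]
) -> Dict[str, Optional[str]]:
    """Call factory(lang) for each language.

    Maps each language to None on success, or to a short error summary on failure.
    """
    results: Dict[str, Optional[str]] = {}
    for lang in languages:
        try:
            factory(lang)
            results[lang] = None
        except KeyboardInterrupt:
            # Let the user stop the probe with Ctrl+C.
            raise
        except BaseException as exc:
            # Assumption: the panic may surface as a BaseException subclass
            # (as PyO3's PanicException does), so `except Exception` could miss it.
            results[lang] = f"{type(exc).__name__}: {exc}"
    return results


def run_against_fastembed() -> None:
    """Probe every listed language with the constructor call from this report."""
    # Imports are local so the helper above can be used without fastembed installed.
    from fastembed.sparse import SparseTextEmbedding
    from fastembed.sparse.bm25 import supported_languages  # assumed module path

    def make_bm25(lang: str) -> object:
        return SparseTextEmbedding("Qdrant/BM25", language=lang)

    for lang, err in probe_languages(make_bm25, supported_languages).items():
        print(f"{lang}: {'ok' if err is None else err}")
```

Calling `run_against_fastembed()` prints one line per language. If the panic kills the interpreter instead of raising, each language would have to be probed in a separate subprocess.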
What is the expected behaviour?
Either add support for these languages or remove them from the supported_languages list.
A minimal reproducible example
from fastembed.sparse import SparseTextEmbedding
bm25_embedding_model = SparseTextEmbedding("Qdrant/BM25", language="chinese")
What Python version are you on? e.g. python --version
Python 3.11.9
FastEmbed version
v0.6.0
What os are you seeing the problem on?
No response
Relevant stack traces and/or logs