[Bug]: BM25 Inaccurate Supported Languages #505

Description

@charlaie

What happened?

Many of the languages listed in supported_languages in bm25.py, including chinese and bengali, do not work: passing them causes a thread panic that aborts the process.

I am not sure whether the language list is outdated or whether a bug elsewhere is causing the issue. I see that on the first run of SparseTextEmbedding some files, such as chinese.txt, are downloaded.

supported_languages = [
    "arabic",
    "azerbaijani",
    "basque",
    "bengali",
    "catalan",
    "chinese",
    "danish",
    "dutch",
    "english",
    "finnish",
    "french",
    "german",
    "greek",
    "hebrew",
    "hinglish",
    "hungarian",
    "indonesian",
    "italian",
    "kazakh",
    "nepali",
    "norwegian",
    "portuguese",
    "romanian",
    "russian",
    "slovene",
    "spanish",
    "swedish",
    "tajik",
    "turkish",
]

What is the expected behaviour?

Either add support for these languages or remove them from the supported_languages list.

A minimal reproducible example

from fastembed.sparse import SparseTextEmbedding
bm25_embedding_model = SparseTextEmbedding("Qdrant/BM25", language="chinese")

What Python version are you on? e.g. python --version

Python 3.11.9

FastEmbed version

v0.6.0

What os are you seeing the problem on?

No response

Relevant stack traces and/or logs

thread '<unnamed>' panicked at src/lib.rs:36:18:
Unsupported language: chinese
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
Aborted (core dumped)
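
Since the panic aborts the process, a try/except around the constructor probably cannot catch it. As a possible stopgap, the language can be checked before the model is created. This is only a sketch: check_bm25_language and ASSUMED_STEMMER_LANGUAGES are hypothetical names, not part of FastEmbed, and the set below is my assumption about which languages the underlying Rust stemmer accepts, so it should be verified against the installed version.

```python
# Hypothetical pre-check, not part of FastEmbed.
# ASSUMPTION: this set mirrors the languages the Rust Snowball stemmer
# accepts. It is not taken from bm25.py and may not match your version.
ASSUMED_STEMMER_LANGUAGES = frozenset({
    "arabic", "danish", "dutch", "english", "finnish", "french",
    "german", "greek", "hungarian", "italian", "norwegian", "portuguese",
    "romanian", "russian", "spanish", "swedish", "tamil", "turkish",
})


def check_bm25_language(language, stemmer_languages=ASSUMED_STEMMER_LANGUAGES):
    """Return the normalized language name, or raise ValueError.

    Raising a Python exception here avoids reaching the stemmer with a
    name it would reject, which is where the abort appears to come from.
    """
    normalized = language.strip().lower()
    if normalized not in stemmer_languages:
        raise ValueError(
            f"Language {language!r} is assumed to be unsupported by the stemmer"
        )
    return normalized


# Intended usage (requires fastembed, so it is left commented out):
# from fastembed.sparse import SparseTextEmbedding
# language = check_bm25_language("english")
# bm25_embedding_model = SparseTextEmbedding("Qdrant/BM25", language=language)
```

With this guard, "chinese" would raise a ValueError that the caller can handle, instead of taking down the interpreter.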
