Fix Chinese stopwords #6765

Merged
generall merged 2 commits into implement_new_multilingual_tokenizer from
fix_chinese_stopwords
Jun 26, 2025

@JojiiOfficial JojiiOfficial commented Jun 26, 2025

Depends on #6762

This PR removes pinyin conversion (e.g. 中国 => ["zhong", "guo"]) from our new multilingual tokenizer implementation. This fixes Chinese stopword filtering: the tokens were in pinyin, but our stopword list is written in Chinese characters, so the lookup could never match (basically we did assert!(["是", "上去", ...].contains("shì"))).
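To illustrate the mismatch, here is a minimal sketch, not the actual tokenizer code. `filter_stopwords` is a hypothetical helper, and the hand-written token lists stand in for tokenizer output (they are assumptions, not real segmentation results).

```rust
use std::collections::HashSet;

/// Hypothetical helper: keeps only the tokens that are not in the stopword set.
fn filter_stopwords<'a>(tokens: &[&'a str], stopwords: &HashSet<&str>) -> Vec<&'a str> {
    tokens
        .iter()
        .copied()
        .filter(|token| !stopwords.contains(*token))
        .collect()
}

fn main() {
    // Stopword list written in Chinese characters (two entries from the PR text).
    let stopwords: HashSet<&str> = ["是", "上去"].iter().copied().collect();

    // Assumed tokens after pinyin conversion: "是" has become "shì".
    let pinyin_tokens = ["shì", "zhong", "guo"];
    // Assumed tokens when the original characters are kept.
    let chinese_tokens = ["是", "中国"];

    // The pinyin token never matches the list, so nothing is filtered.
    assert_eq!(
        filter_stopwords(&pinyin_tokens, &stopwords),
        vec!["shì", "zhong", "guo"]
    );
    // With the original characters, the stopword is removed as intended.
    assert_eq!(filter_stopwords(&chinese_tokens, &stopwords), vec!["中国"]);
}
```

The filter itself is the same in both calls; only the representation of the tokens decides whether the stopword list has any effect.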

Since pinyin is just the romanized phonetic form of a Chinese word, we can use the original Chinese word without losing any information about it. We even improve precision, because multiple Chinese words can map to the same pinyin, e.g. 忘记 and 旺季 both become ["wang", "ji"] (https://www.quora.com/Is-pinyin-romanization-a-bijective-map).
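The precision argument can be sketched as follows. `to_pinyin` is a hypothetical, hard-coded stand-in for a real pinyin converter (tone marks dropped) and only knows the words mentioned in this PR; it is not the conversion code that was removed.

```rust
use std::collections::HashSet;

/// Hypothetical stand-in for a pinyin converter; covers only the PR's examples.
fn to_pinyin(word: &str) -> Option<Vec<&'static str>> {
    match word {
        "忘记" => Some(vec!["wang", "ji"]),
        "旺季" => Some(vec!["wang", "ji"]),
        "中国" => Some(vec!["zhong", "guo"]),
        _ => None,
    }
}

fn main() {
    let words = ["忘记", "旺季"];

    // Indexing the original words keeps the two terms distinct.
    let raw_terms: HashSet<&str> = words.iter().copied().collect();
    // Indexing their pinyin collapses both words into one term.
    let pinyin_terms: HashSet<Vec<&str>> =
        words.iter().filter_map(|w| to_pinyin(w)).collect();

    assert_eq!(raw_terms.len(), 2);
    assert_eq!(pinyin_terms.len(), 1);
}
```

Under this assumption, a query for one word would also match documents that contain only the other, which is the loss of precision described above.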

@generall generall merged commit 606b840 into implement_new_multilingual_tokenizer Jun 26, 2025
17 checks passed
@generall generall deleted the fix_chinese_stopwords branch June 26, 2025 12:46
generall added a commit that referenced this pull request Jun 26, 2025
* Implement new multilingual tokenizer

* Remove unnecessary clones

* Codespell

* filter stopwords before stemmer

* Fix Chinese stopwords (#6765)

* Fix Chinese stopwords

* remove todo

---------

Co-authored-by: Andrey Vasnetsov <andrey@vasnetsov.com>
generall added a commit that referenced this pull request Jul 17, 2025