CJK tokenizer by zarkone · Pull Request #2023 · qdrant/qdrant

zarkone · 2023-06-04T20:40:07Z

/claim #1909

Adds a new type of tokenizer suitable for CJK languages, using meilisearch/charabia.

All Submissions:

Have you followed the guidelines in our Contributing document?
Have you checked to ensure there aren't other open Pull Requests for the same update/change?

New Feature Submissions:

Does your submission pass tests?
Have you formatted your code locally using cargo fmt command prior to submission?
Have you checked your code using cargo clippy command?

Changes to Core Features:

Have you added an explanation of what your changes do and why you'd like us to include them?
Have you written new tests for your core changes, as applicable?
Have you successfully run tests with your changes locally?

zarkone · 2023-06-04T20:41:11Z

Looking forward to feedback on the naming and the approach in general, as well as on what is missing if this approach is applicable 🙏

agourlay · 2023-06-05T09:01:23Z

You need to update the TokenizerType enum in the collections.proto file to propagate your changes to the gRPC API.

EDIT: Maybe you simply forgot to commit the file?

zarkone · 2023-06-05T18:59:47Z

No, I thought the .proto file was generated from the .rs file, not the other way around 🥲

agourlay · 2023-06-06T13:25:22Z

Current status: we are struggling to get a green build for Windows on CI 😬
There is possibly an issue with the charabia crate for our setup.

generall · 2023-06-06T13:26:37Z

Also, https://github.com/qdrant/qdrant/actions/runs/5180834118/jobs/9335607637?pr=2023 fails. It requires an update to the generated gRPC docs.

zarkone · 2023-06-06T21:14:51Z

@agourlay thanks for checking! The Windows build failure doesn't give many clues: https://github.com/qdrant/qdrant/actions/runs/5180834122/jobs/9353240767?pr=2023. I'll look into it and will also think about alternative options.

agourlay · 2023-06-07T07:52:58Z

@zarkone please rebase your PR, I have pushed a change to reduce the amount of artifacts generated at compile time.
It should make the Windows CI job happy 🤞

zarkone · 2023-06-07T20:43:30Z

thanks @agourlay! rebased ✔️

agourlay · 2023-06-08T08:29:22Z

docs/grpc/docs.md

 | Prefix | 1 |  |
 | Whitespace | 2 |  |
 | Word | 3 |  |
+| Charabia | 4 |  |


I think my first concern is the naming of the API, surfacing charabia seems like an implementation details leak.

What do you think about multilingual instead?

zarkone · 2023-06-08

Right, this was my concern as well, and multilingual is much better.

agourlay · 2023-06-08T08:32:39Z

lib/segment/Cargo.toml

 chrono = { version = "0.4.26", features = ["serde"] }

 sysinfo = "0.29"
+charabia = "0.7.2"


By default, the crate enables the following tokenizers:

- chinese
- hebrew
- japanese
- thai
- korean
- greek
- latin-camelcase

We initially just wanted to support CJK but I guess it is fine.

What I would like to see is the difference in binary size before and after this PR.

zarkone · 2023-06-08

Right, I was also thinking that this might potentially help with situations like the Windows build failures. If we want to focus only on CJK, it makes sense to disable the others.

OK, so I've measured the binary size on a fresh dev build and on this branch. The results don't look great.

| setup          | commit   | size  |
|----------------|----------|-------|
| dev            | edf3d88e | 52.3M |
| charabia: full | 68266067 | 92.7M |
| charabia: cjk  | --       | 92.7M |

Platform: Linux (NixOS), x86_64

These are the sizes of target/release/qdrant after executing cargo build --release.

I also expected that removing features from charabia would exclude some crates from the dependencies and reduce the size. However, the size was the same as with full charabia. I checked a couple of times and did a cleanup just to make sure.

This is the command I've used to limit features:

cargo add charabia --no-default-features --features "chinese,japanese,korean"
    Updating crates.io index
      Adding charabia v0.7.2 to dependencies.
             Features:
             + chinese
             + japanese
             + korean
             - greek
             - hebrew
             - japanese-transliteration
             - latin-camelcase
             - lindera
             - thai

generall · 2023-06-08

It feels like too much for one type of tokenizer to take up half of the binary size.

It looks like Meilisearch somehow fits in 50 MB.

Jesse-Bakker · 2023-06-09

It's not too surprising that the CJK tokenizers take so much space, as they each include a full dictionary for their respective language.

zarkone · 2023-06-16T21:05:07Z

Ok, so tokenizers are heavy because of dictionaries, that is correct.

Why do they need dictionaries? To tokenize languages that have no
notion of spaces between words.

Without a dictionary, one of the simplest ways of tokenizing is the
n-gram method: the text is divided into tokens of n adjacent symbols.
The token vocabulary in this case grows too fast, which forces a
limit on n (digrams, trigrams, etc.). In addition, the tokenized
combinations very often don't make sense semantically, something like:

"hello big world" => ["hel", "llobi", "obigwo", ...]

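As a rough illustration of the n-gram method described above, here is
a minimal Rust sketch. The function name `char_ngrams` and the choice
to drop whitespace before windowing are assumptions made for this
example; it is not taken from charabia or Qdrant.

```rust
/// Splits `text` into overlapping character n-grams, ignoring whitespace.
/// Works on Unicode scalar values, so each CJK character counts as one symbol.
fn char_ngrams(text: &str, n: usize) -> Vec<String> {
    // Drop whitespace so that n-grams can span what used to be word boundaries.
    let chars: Vec<char> = text.chars().filter(|c| !c.is_whitespace()).collect();
    // `windows(0)` would panic, and a text shorter than n has no n-grams.
    if n == 0 || chars.len() < n {
        return Vec::new();
    }
    chars
        .windows(n)
        .map(|w| w.iter().collect::<String>())
        .collect()
}
```

For instance, `char_ngrams("现代中国", 2)` yields 现代, 代中 and 中国.
The middle token straddles two words, which is exactly the kind of
semantically meaningless token mentioned above.
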
To mitigate this issue, tokenizers use various stochastic models in
various combinations. These methods come from sequence labeling
studies and are very popular in DNA analysis. With DNA the task is
similar: to "tokenize" the parts of DNA, put "spaces" in the proper
locations and minimize situations like "hel", "llobi",
"obigwo", …

One of the most popular methods in this field is the Hidden Markov
Model (HMM).

It can be used, for example, with jieba-rs, which serves as one of
the backends for the charabia lib. However, charabia uses the no_hmm
method. no_hmm still uses a dictionary, traversing a DAG where
frequencies serve as edge weights.

The size of the default dictionary in jieba-rs (the backend for
processing Chinese text) is around 5M. This dictionary contains
frequencies and labels for the most popular combinations of symbols.
Both frequencies and labels are derived statistically and are used to
estimate probabilities (see the Viterbi algorithm).
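
The dictionary-driven route described above can be sketched roughly as
follows. This is a toy illustration, not the actual code of jieba-rs
or charabia: the function name `segment`, the dictionary passed in,
and the fallback frequency of 1 for unknown single characters are all
assumptions made for the example. The DAG is implicit (there is an
edge i -> j whenever chars[i..j] is a dictionary word or a single
character), and the path with the highest summed log-frequency is
chosen by dynamic programming.

```rust
use std::collections::HashMap;

/// Segments `text` into the word sequence with the highest total
/// log-probability, given a word -> frequency dictionary.
fn segment(text: &str, dict: &HashMap<&str, u64>) -> Vec<String> {
    let chars: Vec<char> = text.chars().collect();
    let n = chars.len();
    let total: u64 = dict.values().sum::<u64>().max(1);
    let log_total = (total as f64).ln();
    // Longest dictionary entry, in characters; bounds the inner loop.
    let max_len = dict.keys().map(|w| w.chars().count()).max().unwrap_or(1).max(1);

    // best[i] = (score of the best segmentation of chars[i..],
    //            end index of the first word in that segmentation)
    let mut best: Vec<(f64, usize)> = vec![(0.0, n); n + 1];
    for i in (0..n).rev() {
        let mut choice = (f64::NEG_INFINITY, i + 1);
        for j in (i + 1)..=n.min(i + max_len) {
            let word: String = chars[i..j].iter().collect();
            let freq = match dict.get(word.as_str()) {
                Some(&f) => f.max(1),
                // Assumption: unknown single characters get frequency 1.
                None if j == i + 1 => 1,
                // Unknown multi-character spans are not edges in the DAG.
                None => continue,
            };
            let score = (freq as f64).ln() - log_total + best[j].0;
            if score > choice.0 {
                choice = (score, j);
            }
        }
        best[i] = choice;
    }

    // Follow the stored choices from the start to recover the words.
    let mut words = Vec::new();
    let mut i = 0;
    while i < n {
        let j = best[i].1;
        words.push(chars[i..j].iter().collect::<String>());
        i = j;
    }
    words
}
```

With a dictionary in which 现代 and 中国 have higher frequencies than
the single characters 现, 代, 中 and 国, `segment("现代中国", &dict)`
returns 现代 and 中国 instead of four single-character tokens.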

This is how charabia tokenizes Chinese text. The other backends are
essentially different libraries with (maybe) different approaches, but
most of them use heavy dictionaries.

Charabia adds around 50M when all the languages are included.

Why does Meilisearch weigh 75M, then?

I think that's because Meilisearch doesn't enable them by default,
exposing them as features instead:

charabia = { version = "0.7.2", default-features = false }

If I use charabia without default features, the change in binary size
is insignificant.

Even without dictionaries, charabia does a lot of things, such as:

- language detection,
- normalization (e.g. `Thé` becomes `the`),
- exposing extended info about tokens, like TokenKind,

which can be used to improve the querying mechanism. For example,
Meilisearch uses TokenKind to search with "quotes", where the quoted
text must appear in the result as-is (similar to the Google search
feature). Arabic is always enabled and doesn't require dictionaries.

That being said, I think it still makes sense to use charabia even
if we decide to disable all features: it can still handle CJK
languages (check & compare), and we can also forward these features
(like Meilisearch does) so that users can enable them if needed.
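
A minimal sketch of what such forwarding could look like in
lib/segment/Cargo.toml. The feature names on the left (`multiling-*`)
are hypothetical and chosen only for illustration; the names on the
right are the charabia features listed in the `cargo add` output
earlier in this thread.

```toml
[dependencies]
# Dictionaries stay out of the build unless a feature below is enabled.
charabia = { version = "0.7.2", default-features = false }

[features]
# Hypothetical feature names; each one forwards to a charabia feature.
multiling-chinese = ["charabia/chinese"]
multiling-japanese = ["charabia/japanese"]
multiling-korean = ["charabia/korean"]
```

A build that needs Chinese support would then opt in with something
like `--features multiling-chinese`, assuming the top-level crate
re-exports the feature.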

This example also shows that, with the backends enabled, not only are
the tokens different, but searching CJK text with ASCII characters
also becomes possible:

in:

序在这本书里，我想写现代中国某一部分社会、某一类人物。写这类人，我没忘记他们是人类

default-features: false:

Word: 序
Word: 在
Word: 这
Word: 本
Word: 书
Word: 里
Separator(Hard): ,
Word: 我
Word: 想
Word: 写
Word: 现
Word: 代
Word: 中
Word: 国
Word: 某
Word: 一
Word: 部
Word: 分
Word: 社
Word: 会
Separator(Hard): 、
Word: 某
Word: 一
Word: 类
Word: 人
Word: 物
Separator(Hard): 。
Word: 写
Word: 这
Word: 类
Word: 人
Separator(Hard): ,
Word: 我
Word: 没
Word: 忘
Word: 记
Word: 他
Word: 们
Word: 是
Word: 人
Word: 类

default-features: true:

Word: xù
Word: zài
Word: zhè
Word: běnshū
Word: lǐ
Separator(Hard): ,
Word: wǒ
Word: xiǎng
Word: xiě
Word: xiàndài
Word: zhōngguó
Word: mǒu
Word: yībùfēn
Word: shèhuì
Separator(Hard): 、
Word: mǒu
Word: yīlèi
Word: rénwù
Separator(Hard): 。
Word: xiě
Word: zhè
Word: lèirén
Separator(Hard): ,
Word: wǒ
Word: méi
Word: wàngjì
Word: tāmen
Word: shì
Word: rénlèi

Bonus material: enabled hmm in charabia src

no_hmm.txt
hmm.txt

diff hmm.txt no_hmm.txt > hmm_diff.txt:
hmm_diff.txt


zarkone · 2023-06-20T20:18:57Z

What do you think about forwarding build features?
What else do you feel I am missing that should belong to the scope of this PR?
Looking forward to your feedback.

timvisee · 2023-07-13T12:47:06Z

Thanks a lot for your efforts on this!

Unfortunately, we ended up deciding not to merge this into Qdrant. Enabling the CJK dictionaries makes the binary too big, as you've found. Not including the dictionaries makes the benefits too insignificant. We want to keep the core Qdrant product as simple as possible without too many bells and whistles.

Nevertheless, your contribution has been very valuable and provided us with new insights. We would therefore still like to reward you with the bounty.

We hope you understand why we made this decision. If there's a clear use case for this in the future, we will revisit it. Thanks again for picking up this task and working on it thoroughly.

zarkone · 2023-07-13T16:04:02Z

@timvisee That sounds good, totally fine, no worries. It's a totally valid decision. I've actually had a lot of fun exploring the Qdrant codebase and tokenizer problems.

In case you come across a use case where you need a complex tokenizer for a certain cluster/client, I feel that it would be relatively easy to forward these build flags from charabia.

I just decided to repeat that from my conclusion so that it is documented here in the summary.

generall · 2023-07-13T16:36:06Z

/tip $250 @zarkone

algora-pbc · 2023-07-13T16:36:10Z

👉 @generall: Click here to proceed

algora-pbc · 2023-07-13T16:43:24Z

@zarkone: You just got a $250 tip! 👉 Complete your Algora onboarding to collect your payment.

algora-pbc · 2023-07-13T18:54:23Z

🎉🎈 @zarkone has been awarded $250! 🎈🎊

zarkone · 2023-07-13T19:27:57Z

Thank you 🙏

ken0x0a · 2023-07-13T23:29:53Z

> Unfortunately, we ended up deciding not to merge this into Qdrant. Enabling the CJK dictionaries make the binary too big, as you've found. Not including the dictionaries make the benefits too insignificant. We want to keep the core Qdrant product as simple as possible without too many bells and whistles.

How about putting all CJK related code under a new feature? ("cjk" maybe?)

generall · 2023-07-13T23:40:03Z

I am actually having second thoughts on it @zarkone @ken0x0a.
I can't change this PR, so I opened another one: #2260

Same code with only minor changes.

I think we will merge charabia support without default features after all.

timvisee · 2023-07-25T07:04:52Z

Since #2260 is merged, I think we can close this one. 😃

algora-pbc bot mentioned this pull request Jun 4, 2023

Support tokenizers for CJK languages #1909

Closed

zarkone force-pushed the zarkone/charabia-tokenizer branch from 6cca910 to 010908c, June 6, 2023 21:02

zarkone force-pushed the zarkone/charabia-tokenizer branch from 010908c to c1d7da1, June 7, 2023 14:36

agourlay reviewed Jun 8, 2023


zarkone added 3 commits June 8, 2023 20:26

charabia tokenizer for CJK, init

d94b382

add charabia to .proto

e5abfb0

generate grpc docs

6826606

zarkone force-pushed the zarkone/charabia-tokenizer branch from c1d7da1 to 6826606, June 8, 2023 19:22

rename charabia to multilingual

957156d

zarkone added 2 commits June 20, 2023 21:51

disable charabia default features

46bfcd6

Merge remote-tracking branch 'origin/dev' into zarkone/charabia-token…

d554b6f

…izer

timvisee closed this Jul 13, 2023

algora-pbc bot added the 💰 Rewarded label Jul 13, 2023

generall reopened this Jul 13, 2023

generall mentioned this pull request Jul 13, 2023

Second thoughts on CJK tokenizer #2260

Merged

timvisee closed this Jul 25, 2023

yangboz mentioned this pull request Apr 15, 2024

ingesting fail on MPS Otman404/local-rag-llamaindex#3

Open
