Skip to content

CJK tokenizer#2023

Closed
zarkone wants to merge 6 commits intoqdrant:devfrom
zarkone:zarkone/charabia-tokenizer
Closed

CJK tokenizer#2023
zarkone wants to merge 6 commits intoqdrant:devfrom
zarkone:zarkone/charabia-tokenizer

Conversation

@zarkone
Copy link
Copy Markdown
Contributor

@zarkone zarkone commented Jun 4, 2023

/claim #1909

Adds new type of tokenizer suitable for CJK languages using meilisearch/charabia

All Submissions:

  • Have you followed the guidelines in our Contributing document?
  • Have you checked to ensure there aren't other open Pull Requests for the same update/change?

New Feature Submissions:

  1. Does your submission pass tests?
  2. Have you formatted your code locally using cargo fmt command prior to submission?
  3. Have you checked your code using cargo clippy command?

Changes to Core Features:

  • Have you added an explanation of what your changes do and why you'd like us to include them?
  • Have you written new tests for your core changes, as applicable?
  • Have you successfully ran tests with your changes locally?

@zarkone
Copy link
Copy Markdown
Contributor Author

zarkone commented Jun 4, 2023

looking forward for a feedback on naming and approach in general, as well as what is missing if this approach is applicable 🙏

@agourlay
Copy link
Copy Markdown
Member

agourlay commented Jun 5, 2023

You need to update the TokenizerType enum in the collections.proto file to propagate your changes to the gRPC API.

EDIT: Maybe you simply forgot to commit the file?

@zarkone
Copy link
Copy Markdown
Contributor Author

zarkone commented Jun 5, 2023

no, I've thought .proto generates from .rs file and not the other way around 🥲

@agourlay
Copy link
Copy Markdown
Member

agourlay commented Jun 6, 2023

Current status: we are struggling to get a green build for Windows on CI 😬
There is possibly an issue with the charabia crate for our setup.

@generall
Copy link
Copy Markdown
Member

generall commented Jun 6, 2023

Also https://github.com/qdrant/qdrant/actions/runs/5180834118/jobs/9335607637?pr=2023 fails. It requires updates in generated grpc docs

@zarkone zarkone force-pushed the zarkone/charabia-tokenizer branch from 6cca910 to 010908c Compare June 6, 2023 21:02
@zarkone
Copy link
Copy Markdown
Contributor Author

zarkone commented Jun 6, 2023

@agourlay thanks for checking! windows build fail doesn't give much clues https://github.com/qdrant/qdrant/actions/runs/5180834122/jobs/9353240767?pr=2023 -- I'll look into it, and also will think about alternative options.

@agourlay
Copy link
Copy Markdown
Member

agourlay commented Jun 7, 2023

@zarkone please rebase your PR, I have pushed a change to reduce the amount of artifacts generated at compile time.
It should make the Windows CI job happy 🤞

@zarkone zarkone force-pushed the zarkone/charabia-tokenizer branch from 010908c to c1d7da1 Compare June 7, 2023 14:36
@zarkone
Copy link
Copy Markdown
Contributor Author

zarkone commented Jun 7, 2023

thanks @agourlay! rebased ✔️

| Prefix | 1 | |
| Whitespace | 2 | |
| Word | 3 | |
| Charabia | 4 | |
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think my first concern is the naming of the API, surfacing charabia seems like an implementation details leak.

What do you think about multilingual instead?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

right, this was my concern as well -- and multilingual is much better.

chrono = { version = "0.4.26", features = ["serde"] }

sysinfo = "0.29"
charabia = "0.7.2"
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By default, the crates enables the following tokenizers:

  • chinese
  • hebrew
  • japanese
  • thai
  • korean
  • greek
  • latin-camelcase

We initially just wanted to support CJK but I guess it is fine.

What I would like to see is the difference in binary size before and after this PR.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

right, I was also thinking that this might help with situations like windows build fails, potentially. If we want to focus only on CJK it makes sense to disable others.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, so I've measured the binary size on fresh dev and here. Results doesn't look great..

| setup          | commit   | size  |
|----------------+----------+-------|
| dev            | edf3d88e | 52.3M |
| charabia: full | 68266067 | 92.7M |
| charabia: cjk  | --       | 92.7M |

Linux(NixOS), x86_64

These are the sizes of target/release/qdrant after executing cargo build --release.

I've also expected that removing features from charabia would exclude some crates from deps and the size will be reduced. However, -- I've checked couple of times and did a cleanup just to make sure -- the size was the same as with full charabia.

This is the command I've used to limit features:

cargo add charabia --no-default-features --features "chinese,japanese,korean"
    Updating crates.io index
      Adding charabia v0.7.2 to dependencies.
             Features:
             + chinese
             + japanese
             + korean
             - greek
             - hebrew
             - japanese-transliteration
             - latin-camelcase
             - lindera
             - thai

It feels too much for one type of tokenizer to take half of binary size

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it looks like meilisearch somehow fits 50mb

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not too surprising that the CJK tokenizers take so much space, as they each include a full dictionary for their respective language.

@zarkone zarkone force-pushed the zarkone/charabia-tokenizer branch from c1d7da1 to 6826606 Compare June 8, 2023 19:22
@zarkone
Copy link
Copy Markdown
Contributor Author

zarkone commented Jun 16, 2023

Ok, so tokenizers are heavy because of dictionaries, that is correct.

Why they need dictionaries? To tokenize languages without the notion
of space.

Without dictionary, one of the simplest ways of tokenizing is n-gram
method – a text is divided into tokens by n amount of adjacent
symbols. Tokenizer vector in this case grows too fast, which forces to
limit n (digrams, trigrams, etc). In addition, tokenized combinations
very often might not make sense semantically, something like:

"hello big world" => ["hel", "llobi", "obigwo", ...

In order to mitigate this issue, tokenizers are using various
stochastic modeling in various combinations. These methods came from
sequence labeling studies and very popular in DNA analyzis. In case of
DNA the task is similar: to "tokenize" the parts of DNA, put "spaces"
in proper locations and minimize situations like "hel", "llobi",
"obigwo", …

On of the most popular methods in this field is Hidden Markov
Model.

This one can be used for example with jiebra-rs, which serves as one
of the backends for charabia lib, however, charabia uses no_hmm
method. no_hmm still uses dictionary, traversing DAG where frequencies
serves as edges.

The size of default dictionary in jiebra-rs(a backend for processing
Chinese text) is around 5M. This dictionary contains frequencies and
labels for the most popular combinations of symbols. Both frequencies
and labels are derived statistically and used in order to estimate
probabilities (see Viterbi algorithm).

This is how charabia tokenizes Chinese text: other backends are
essentially different libraries with (maybe) different approaches. But
most of them are using heavy dictionaries.

Charabia adds around 50M when all the languages are included.

Why Meilisearch weights 75M then?

I think because Meilisearch doesn't enable it by default, exposing as
features instead:

charabia = { version = "0.7.2", default-features = false }

If I use charabia without default features, binary size changes are
insignificant.

Even without dictionaries, charabia does a lot of things like

  • language detection,
  • normalization(e.g. `Thé` becomes `the`)
  • and exposing extended info about tokens like TokenKind

which can be used to improve querying mechanism: for example,
Meilisearch uses TokenKind to search with "quotes", where quoted
text must appear in result as-is (similar to google search
feature). Arabic is always enabled and doesn't require dictionaries.

That being said, I think that it still makes sense to use charabia,
even if we decide to disable all features: it still can handle CJK
languages (check & compare), and we can also forward these features
(like Meilisearch) for users to enable it if needed.

This example also shows that with enabled backends not only tokens are
different, but it also enables searching CJK with ascii characters:

in:

序在这本书里,我想写现代中国某一部分社会、某一类人物。写这类人,我没忘记他们是人类

default-features: false:

Word: 序
Word: 在
Word: 这
Word: 本
Word: 书
Word: 里
Separator(Hard): ,
Word: 我
Word: 想
Word: 写
Word: 现
Word: 代
Word: 中
Word: 国
Word: 某
Word: 一
Word: 部
Word: 分
Word: 社
Word: 会
Separator(Hard): 、
Word: 某
Word: 一
Word: 类
Word: 人
Word: 物
Separator(Hard): 。
Word: 写
Word: 这
Word: 类
Word: 人
Separator(Hard): ,
Word: 我
Word: 没
Word: 忘
Word: 记
Word: 他
Word: 们
Word: 是
Word: 人
Word: 类

default-features: true:

Word: xù
Word: zài
Word: zhè
Word: běnshū
Word: lǐ
Separator(Hard): ,
Word: wǒ
Word: xiǎng
Word: xiě
Word: xiàndài
Word: zhōngguó
Word: mǒu
Word: yībùfēn
Word: shèhuì
Separator(Hard): 、
Word: mǒu
Word: yīlèi
Word: rénwù
Separator(Hard): 。
Word: xiě
Word: zhè
Word: lèirén
Separator(Hard): ,
Word: wǒ
Word: méi
Word: wàngjì
Word: tāmen
Word: shì
Word: rénlèi

Bonus material: enabled hmm in charabia src

no_hmm.txt
hmm.txt

diff hmm.txt no_hmm.txt > hmm_diff.txt:
hmm_diff.txt

@zarkone
Copy link
Copy Markdown
Contributor Author

zarkone commented Jun 20, 2023

What do you think about forwarding build features?
What else do you feel I am missing which should belong to the scope of this PR?
Looking forward for a feedback

@timvisee
Copy link
Copy Markdown
Member

Thanks a lot for your efforts on this!

Unfortunately, we ended up deciding not to merge this into Qdrant. Enabling the CJK dictionaries make the binary too big, as you've found. Not including the dictionaries make the benefits too insignificant. We want to keep the core Qdrant product as simple as possible without too many bells and whistles.

Nevertheless, your contribution has been very valuable and provided us with new insights. We would therefore still like to reward you with the bounty.

We hope you understand why we made this decision. If there's a clear use case for this in the future, we well revisit it again. Thanks again for picking up this task and working on it thoroughly.

@timvisee timvisee closed this Jul 13, 2023
@zarkone
Copy link
Copy Markdown
Contributor Author

zarkone commented Jul 13, 2023

@timvisee no, that sounds good, totally fine, no worries -- totally valid. I've actually had a lot of fun exploring qdrant codebase and tokenizer problems.

In case you come across such use case where you need a complex tokenizer for a certain cluster/client, I feel that it would be relatively easy to forward these build flags from charabia.

Just decided to repeat that from my conclusion so it will be documented here in summary.

@generall
Copy link
Copy Markdown
Member

/tip $250 @zarkone

@algora-pbc
Copy link
Copy Markdown

algora-pbc bot commented Jul 13, 2023

👉 @generall: Click here to proceed

@algora-pbc
Copy link
Copy Markdown

algora-pbc bot commented Jul 13, 2023

@zarkone: You just got a $250 tip! 👉 Complete your Algora onboarding to collect your payment.

@algora-pbc
Copy link
Copy Markdown

algora-pbc bot commented Jul 13, 2023

🎉🎈 @zarkone has been awarded $250! 🎈🎊

@zarkone
Copy link
Copy Markdown
Contributor Author

zarkone commented Jul 13, 2023

Thank you 🙏

1689276327-13-07-23_21:25:27

@generall generall reopened this Jul 13, 2023
@ken0x0a
Copy link
Copy Markdown

ken0x0a commented Jul 13, 2023

Unfortunately, we ended up deciding not to merge this into Qdrant. Enabling the CJK dictionaries make the binary too big, as you've found. Not including the dictionaries make the benefits too insignificant. We want to keep the core Qdrant product as simple as possible without too many bells and whistles.

How about putting all CJK related code under a new feature? ("cjk" maybe?)

@generall
Copy link
Copy Markdown
Member

I am actually having second thoughts on it @zarkone @ken0x0a.
I can't change this PR, so I opened another one: #2260

Same code with only minor changes.

I think we will merge charabia support without default features after all.

@timvisee
Copy link
Copy Markdown
Member

Since #2260 is merged, I think we can close this one. 😃

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants