
Second thoughts on CJK tokenizer#2260

Merged
generall merged 16 commits into dev from zarkone/charabia-tokenizer on Jul 19, 2023

Conversation

@generall
Member

Another attempt to think about CJK tokenizer: #2023

Created this PR cause I don't have access to change original one. (all authorship preserved)

CC
@zarkone

@generall generall mentioned this pull request Jul 13, 2023
@zarkone
Contributor

zarkone commented Jul 14, 2023

👍 let me know if you want me to add anything!

@generall generall force-pushed the zarkone/charabia-tokenizer branch from ebcc45a to b8eef94 Compare July 14, 2023 12:36
@generall generall requested a review from agourlay July 14, 2023 12:55
@agourlay
Member

Can I please have a short recap of what it means for our users to use Charabia without the default features enabled?

What would the documentation look like regarding:

  • which languages are supported
  • what capabilities it brings
  • what it can't support

@generall
Member Author

generall commented Jul 14, 2023

The main thing it does is understand in which cases we need to tokenize based on whitespace, and in which cases we need to take individual characters:

hello, мир!

and

本日の日付は

both will be tokenized in an acceptable way (not ideal, with the expensive features disabled), while
the word tokenizer can only handle the first case and the character tokenizer only the second
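The behavior described above can be sketched in a few lines of plain Rust. This is a simplified std-only illustration of the idea, not charabia's actual logic, and the Unicode ranges below are a rough approximation of "scripts without word boundaries":

```rust
/// Rough check for characters that should become individual tokens
/// (CJK ideographs, kana, and Hangul). A real segmenter uses proper
/// Unicode script data; this range list is a simplification.
fn is_cjk(c: char) -> bool {
    matches!(c,
        '\u{4E00}'..='\u{9FFF}'   // CJK Unified Ideographs
        | '\u{3040}'..='\u{309F}' // Hiragana
        | '\u{30A0}'..='\u{30FF}' // Katakana
        | '\u{AC00}'..='\u{D7AF}' // Hangul syllables
    )
}

/// Split whitespace-delimited scripts on whitespace/punctuation,
/// and emit CJK characters as individual tokens.
fn tokenize(text: &str) -> Vec<String> {
    let mut tokens = Vec::new();
    let mut word = String::new();
    for c in text.chars() {
        if is_cjk(c) {
            if !word.is_empty() {
                tokens.push(std::mem::take(&mut word));
            }
            tokens.push(c.to_string());
        } else if c.is_alphanumeric() {
            word.push(c);
        } else if !word.is_empty() {
            tokens.push(std::mem::take(&mut word));
        }
    }
    if !word.is_empty() {
        tokens.push(word);
    }
    tokens
}

fn main() {
    // Whitespace-delimited scripts split on word boundaries...
    assert_eq!(tokenize("hello, мир!"), vec!["hello", "мир"]);
    // ...while CJK text falls back to per-character tokens.
    assert_eq!(
        tokenize("本日の日付は"),
        vec!["本", "日", "の", "日", "付", "は"]
    );
}
```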

assert_eq!(tokens.get(3), Some(&"日".to_owned()));
assert_eq!(tokens.get(4), Some(&"付".to_owned()));
assert_eq!(tokens.get(5), Some(&"は".to_owned()));
}

@mocobeta mocobeta Jul 15, 2023


Hi, I just noticed this test case, and to me, there might be something wrong.
It is clearly Japanese text, but I don't think this is correct tokenization.
本日の日付は should be split into four tokens: 本日 / の / 日付 / は.

I am a contributor to https://github.com/lindera-morphology/lindera (the Japanese tokenizer used internally by the charabia tokenizer), but I'm not familiar with charabia itself.

@mosuka, do you know anything about the behavior of charabia?


Ok, I missed the previous comment #2260 (comment) so never mind.

I'd like to have clear documentation for the change; while it may be acceptable for general "multilingual" support, it would not be valid "CJK support" without the full Charabia features (language-specific dictionaries).

Member Author


We will document it for those who want to have full support.


Thank you, I fully understand the design choice here not to bloat the Qdrant itself. Just wanted to clarify what we will have with this change.


@mosuka mosuka Jul 15, 2023


@mocobeta @generall
Charabia performs language detection on the input string and then splits it into words using a tokenizer prepared for the detected language.
UniDic is used by default for Japanese.

https://github.com/meilisearch/charabia/blob/5f8abfe561bd985f7288a9a196b7e3ebeb773dc4/charabia/src/segmenter/japanese.rs#L25


charabia = { version = "0.7.2", default-features = false }

The above is specified in Cargo.toml, but I believe that would not enable all the language-specific tokenizers available in Charabia.
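For illustration, individual languages could be opted back in via Cargo features. The feature names below match the `cargo add` invocation quoted later in this thread, but they should be checked against the charabia version actually in use:

```toml
# Sketch: keep the heavy default dictionaries out,
# re-enable only the languages you need.
charabia = { version = "0.7.2", default-features = false, features = [
    "japanese",  # pulls in lindera with UniDic
    "korean",
    "chinese",
] }
```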

Contributor

@ffuugoo ffuugoo left a comment


LGTM. 👌

Was kinda surprised there's so little information regarding CJK tokenization available online (at least in English; I guess there are more in, well, CJK?). And one of the top articles is by Meilisearch, who created charabia.

@generall Should we expose a set of charabia-* features so that it would be trivial to build Qdrant with required dependencies/dictionaries enabled? We already provide a way to enable custom features in our Dockerfile (e.g., --build-arg FEATURES=optional-feature).

@mocobeta

mocobeta commented Jul 17, 2023

According to the Cargo.toml of the charabia library, you'll drop Japanese, Korean, and Chinese Pinyin support with default-features = false. I guess the main reason is the size of their dictionaries. I believe all other languages, such as Chinese (not Pinyin), Thai, and Greek, are still supported.

@generall
Member Author

generall commented Jul 17, 2023

@generall Should we expose a set of charabia-* features so that it would be trivial to build Qdrant with required dependencies/dictionaries enabled?

@ffuugoo could you please suggest the fix for that?

@ffuugoo
Contributor

ffuugoo commented Jul 17, 2023

@generall Should we expose a set of charabia-* features so that it would be trivial to build Qdrant with required dependencies/dictionaries enabled?

@ffuugoo could you please suggest the fix for that?

Done. The only downside is that cargo check --all-features takes a bit longer now, because it enables all optional CJK features.

@generall generall force-pushed the zarkone/charabia-tokenizer branch from 68f1721 to 34c8663 Compare July 17, 2023 22:15
@generall
Member Author

The new version, https://github.com/meilisearch/charabia/releases/tag/v0.8.0, breaks this default per-character tokenization for disabled languages 😭

@mocobeta @mosuka do you know if this is expected behavior? It looks like something related to the default settings of AhoCorasick.

@mocobeta

mocobeta commented Jul 18, 2023

@generall Unfortunately, I have little information about Charabia internals (I only know lindera, the tokenizer for Japanese and Korean.) I don't know if it is the expected behavior of Charabia, but I suspect there could be behavior changes in the language detection or tokenizer selection mechanism. Problems often come from the language detection part, not from tokenizers. Will take a look.

@generall
Member Author

Created an issue there: meilisearch/charabia#229

@mosuka

mosuka commented Jul 18, 2023

@generall
This may be an effect of this change.
https://github.com/meilisearch/charabia/releases/tag/v0.8.0

@mocobeta

It looks like an intentional breaking change to the default tokenizer in charabia. It might not be a showstopper for integrating it into qdrant. At least developers will have the option to build qdrant with charabia's full features if they want 🤔

@generall
Member Author

It might not be a showstopper to integrate it into qdrant.

Agreed, but the motivation for this decision is not clear to me. It turns a partially usable version into a completely unusable one.

@mocobeta

It turns a partially usable version into a completely unusable one

Yes, the latest charabia default tokenizer shouldn't be applied to non-Latin languages like CJK.

Now I wonder if it is worth trying another approach: a character n-gram tokenizer. It is well documented in Lucene and is a generalization of charabia's previous default behavior (character unigrams).
It is an orthogonal approach to this PR. I'm not sure if it makes sense to you or how difficult it is within the context of qdrant; can I have a try?
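For concreteness, a character n-gram tokenizer can be sketched in a few lines of Rust. This is only an illustration of the approach, not Lucene's or charabia's implementation:

```rust
/// Character n-gram tokenizer: slide a window of `n` characters over
/// the text. With n = 1 this reproduces the previous per-character
/// default behavior; n = 2 yields bi-grams, n = 3 tri-grams.
fn char_ngrams(text: &str, n: usize) -> Vec<String> {
    assert!(n > 0, "n-gram size must be at least 1");
    let chars: Vec<char> = text.chars().collect();
    if chars.is_empty() {
        return Vec::new();
    }
    if chars.len() < n {
        // Text shorter than the window: emit it as a single token.
        return vec![chars.iter().collect()];
    }
    chars.windows(n).map(|w| w.iter().collect()).collect()
}

fn main() {
    // Bi-grams of 本日の日付は
    assert_eq!(
        char_ngrams("本日の日付は", 2),
        vec!["本日", "日の", "の日", "日付", "付は"]
    );
    // Uni-grams reproduce the per-character fallback.
    assert_eq!(char_ngrams("日付", 1), vec!["日", "付"]);
}
```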

@generall
Member Author

@mocobeta A character n-gram tokenizer usually only works well with a ranking function designed to prioritize matches with more overlapping n-grams, something like BM25. Qdrant only uses full-text for filtering, and in this case n-gram tokenizers might create too much noise.

@generall generall merged commit 5d2ec88 into dev Jul 19, 2023
@ffuugoo ffuugoo deleted the zarkone/charabia-tokenizer branch July 19, 2023 13:27
@mocobeta

@generall Sorry, I didn't explain the use cases of the n-gram tokenizer. Actually, we Japanese (and "CJK" people) often use character n-gram tokenizers not only for ranking but also for simple document filtering. It works in situations where a dictionary-based tokenizer is not available or does not work well with real-world text.
The trick is, for example, we tokenize 明日の天気は曇りのち晴れです into bi-grams 明日 / 日の / の天 / 天気 / 気は / は曇 / 曇り / りの / のち / ち晴 / 晴れ / れで / です. Then when we have a search query 曇りのち晴れ, we also tokenize it into 曇り / りの / のち / ち晴 / 晴れ and perform a conjunction search. The combination of bi-grams or tri-grams and conjunction search generally works well for us, and if a phrase search query is available we don't have any noise.

This is a typical workaround for searching CJK languages; I hope this explanation makes sense to you. The downside is the index and vocabulary size, and obviously, it doesn't work without conjunction search (an "and" operation).
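The workaround described above can be sketched as follows. This is only an illustration of the bi-gram plus conjunction-search pattern, not qdrant's filtering code:

```rust
use std::collections::HashSet;

/// Character bi-grams of the input text.
fn bigrams(text: &str) -> Vec<String> {
    let chars: Vec<char> = text.chars().collect();
    chars.windows(2).map(|w| w.iter().collect()).collect()
}

/// Conjunction ("and") filtering: a document matches only if it
/// contains every bi-gram of the query.
fn matches(doc: &str, query: &str) -> bool {
    let index: HashSet<String> = bigrams(doc).into_iter().collect();
    bigrams(query).iter().all(|g| index.contains(g))
}

fn main() {
    let doc = "明日の天気は曇りのち晴れです";
    // All bi-grams of the query appear in the document's bi-gram set.
    assert!(matches(doc, "曇りのち晴れ"));
    // 晴れのち曇り contains the bi-gram れの, which is absent from the
    // document, so the conjunction filter rejects it.
    assert!(!matches(doc, "晴れのち曇り"));
}
```

As the comment notes, false positives are possible when the matched bi-grams are scattered rather than contiguous; a phrase check on top of the conjunction filter removes that noise.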

IvanPleshkov pushed a commit that referenced this pull request Jul 19, 2023
* charabia tokenizer for CJK, init

* add charabia to .proto

* generate grpc docs

* rename charabia to multilingual

* disable charabia default features

* ignore stopwords and separators in tokenizer

* fix rebase conflict

* fix test

* fix test

* Bump `charabia` crate version...

...and enable `charabia` features that does not require any additional dependencies by default

* Expose optional `charabia` features in the `segment` crate

* Expose optional `charabia` features in the `qdrant` crate

* fix codespell

* more regression tests

* more regression tests

* make language-specific tests feature-flagged

---------

Co-authored-by: Anatolii Smolianinov <zarkonesmall@gmail.com>
Co-authored-by: Roman Titov <ffuugoo@users.noreply.github.com>
generall added a commit that referenced this pull request Jul 31, 2023
@ghost

ghost commented Sep 19, 2023

Is it possible to provide a Dockerfile or Docker image for building with CJK tokenizer support?
@generall

@ffuugoo
Contributor

ffuugoo commented Sep 19, 2023

is it possible to provide a dockerfile or docker image for building with CJK tokenizer support? @generall

Currently, we don't provide prebuilt Docker image with CJK support, but you can build one yourself with the default Dockerfile in the repo:

  • docker build buildx --build-arg FEATURES=multiling-chinese,multiling-japanese,multiling-korean
  • I recommend using docker build buildx, but docker build --build-arg FEATURES=... should also work
  • and, of course, you could specify any combination of multiling-chinese/multiling-japanese/multiling-korean that you want (e.g., --build-arg FEATURES=multiling-chinese,multiling-korean or --build-arg FEATURES=multiling-japanese)

@yangboz

yangboz commented Apr 16, 2024

is it possible to provide a dockerfile or docker image for building with CJK tokenizer support? @generall

Currently, we don't provide prebuilt Docker image with CJK support, but you can build one yourself with the default Dockerfile in the repo:

  • docker build buildx --build-arg FEATURES=multiling-chinese,multiling-japanese,multiling-korean
  • I recommend using docker build buildx, but docker build --build-arg FEATURES=... should also work
  • and, of course, you could specify any combination of multiling-chinese/multiling-japanese/multiling-korean that you want (e.g., --build-arg FEATURES=multiling-chinese,multiling-korean or --build-arg FEATURES=multiling-japanese)

Would it be possible to publish the Qdrant Docker image with those features under different tags?

@timvisee
Member

timvisee commented Apr 16, 2024

is it possible to provide a dockerfile or docker image for building with CJK tokenizer support? @generall

Currently, we don't provide prebuilt Docker image with CJK support, but you can build one yourself with the default Dockerfile in the repo:

  • docker build buildx --build-arg FEATURES=multiling-chinese,multiling-japanese,multiling-korean
  • I recommend using docker build buildx, but docker build --build-arg FEATURES=... should also work
  • and, of course, you could specify any combination of multiling-chinese/multiling-japanese/multiling-korean that you want (e.g., --build-arg FEATURES=multiling-chinese,multiling-korean or --build-arg FEATURES=multiling-japanese)

Would it be possible to publish the Qdrant Docker image with those features under different tags?

Yes, it is possible to build one as described above. But we don't provide one by default at this time.

@yangboz

yangboz commented Apr 16, 2024

  • docker build buildx --build-arg FEATURES=multiling-chinese

After trying this example command, the following error happens:

 unable to prepare context: path "buildx" not found

Any ideas? Thanks.

@timvisee
Member

  • docker build buildx --build-arg FEATURES=multiling-chinese

After trying this example command, the following error happens:

 unable to prepare context: path "buildx" not found

Any ideas? Thanks.

You might have to install buildx: https://stackoverflow.com/a/77359279/1000145
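For anyone hitting this: the error also suggests that `docker build buildx ...` is being parsed as `docker build` with `buildx` as the context path. A sketch of the likely intended invocation, assuming the qdrant repository root as the build context (the `qdrant-cjk` tag is just an example name):

```shell
# Run from the qdrant repository root, where the Dockerfile lives.
# "buildx build" is the subcommand; the trailing "." is the build context.
docker buildx build \
  --build-arg FEATURES=multiling-chinese,multiling-japanese,multiling-korean \
  --tag qdrant-cjk .
```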

@yangboz

yangboz commented Apr 17, 2024

After running

cargo add charabia --no-default-features --features "chinese,japanese,korean"

how do I use Docker to build it? Is there any tutorial on a customized Docker image with the CJK tokenizers? Thanks a lot!

@yangboz

yangboz commented Apr 19, 2024

Are there any easier-to-follow tutorials on this?
