
Second thoughts on CJK tokenizer#2260

Merged
generall merged 16 commits into dev from zarkone/charabia-tokenizer on Jul 19, 2023

Conversation

@generall
Member

Another attempt to think about CJK tokenizer: #2023

Created this PR cause I don't have access to change original one. (all authorship preserved)

CC
@zarkone

@generall generall mentioned this pull request Jul 13, 2023
@zarkone
Contributor

zarkone commented Jul 14, 2023

👍 let me know if you want me to add anything!

@generall generall force-pushed the zarkone/charabia-tokenizer branch from ebcc45a to b8eef94 Compare July 14, 2023 12:36
@generall generall requested a review from agourlay July 14, 2023 12:55
@agourlay
Member

Can I please have a short recap of what it means for our users to use Charabia without the default features enabled?

What would the documentation look like regarding:

  • which languages are supported
  • what capabilities it brings
  • what it can't support

@generall
Member Author

generall commented Jul 14, 2023

The main thing it does is understand in which cases we need to tokenize based on whitespace, and in which cases we need to take individual characters:

hello, мир!

and

本日の日付は

both will be tokenized in an acceptable way (not ideal, with the expensive features disabled), while
the word tokenizer can only handle the first case and the character tokenizer only the second
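The behavior described above can be sketched in a few lines of plain Rust. This is a simplified std-only illustration of the idea, not charabia's actual logic, and the Unicode ranges below are a rough approximation of "scripts without word boundaries":

```rust
/// Rough check for characters that should become individual tokens
/// (CJK ideographs, kana, and Hangul). A real segmenter uses proper
/// Unicode script data; this range list is a simplification.
fn is_cjk(c: char) -> bool {
    matches!(c,
        '\u{4E00}'..='\u{9FFF}'   // CJK Unified Ideographs
        | '\u{3040}'..='\u{309F}' // Hiragana
        | '\u{30A0}'..='\u{30FF}' // Katakana
        | '\u{AC00}'..='\u{D7AF}' // Hangul syllables
    )
}

/// Split whitespace-delimited scripts on whitespace/punctuation,
/// and emit CJK characters as individual tokens.
fn tokenize(text: &str) -> Vec<String> {
    let mut tokens = Vec::new();
    let mut word = String::new();
    for c in text.chars() {
        if is_cjk(c) {
            if !word.is_empty() {
                tokens.push(std::mem::take(&mut word));
            }
            tokens.push(c.to_string());
        } else if c.is_alphanumeric() {
            word.push(c);
        } else if !word.is_empty() {
            tokens.push(std::mem::take(&mut word));
        }
    }
    if !word.is_empty() {
        tokens.push(word);
    }
    tokens
}

fn main() {
    // Whitespace-delimited scripts split on word boundaries...
    assert_eq!(tokenize("hello, мир!"), vec!["hello", "мир"]);
    // ...while CJK text falls back to per-character tokens.
    assert_eq!(
        tokenize("本日の日付は"),
        vec!["本", "日", "の", "日", "付", "は"]
    );
}
```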

assert_eq!(tokens.get(3), Some(&"日".to_owned()));
assert_eq!(tokens.get(4), Some(&"付".to_owned()));
assert_eq!(tokens.get(5), Some(&"は".to_owned()));
}

@mocobeta mocobeta Jul 15, 2023


Hi, I just noticed this test case, and to me, there might be something wrong.
It is clearly Japanese text, but I don't think this is correct tokenization.
本日の日付は should be split into four tokens: 本日 / の / 日付 / は.

I am a contributor to https://github.com/lindera-morphology/lindera (the Japanese tokenizer used internally by the charabia tokenizer), but I'm not familiar with charabia itself.

@mosuka, do you know anything about the behavior of charabia?


Ok, I missed the previous comment #2260 (comment) so never mind.

I'd like to have clear documentation for the change; while it may be acceptable for general "multilingual" support, it would not be valid "CJK support" without the full Charabia features (language-specific dictionaries).

Member Author


We will document it for those who want to have full support.


Thank you, I fully understand the design choice here not to bloat the Qdrant itself. Just wanted to clarify what we will have with this change.


@mosuka mosuka Jul 15, 2023


@mocobeta @generall
Charabia performs language detection on the input string and then splits it into words using a tokenizer prepared for the detected language.
UniDic is used by default for Japanese.

https://github.com/meilisearch/charabia/blob/5f8abfe561bd985f7288a9a196b7e3ebeb773dc4/charabia/src/segmenter/japanese.rs#L25


charabia = { version = "0.7.2", default-features = false }

The above is specified in Cargo.toml, but I believe that would not enable all the language-specific tokenizers available in Charabia.
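For illustration, individual languages could be opted back in via Cargo features. The feature names below match the `cargo add` invocation quoted later in this thread, but they should be checked against the charabia version actually in use:

```toml
# Sketch: keep the heavy default dictionaries out,
# re-enable only the languages you need.
charabia = { version = "0.7.2", default-features = false, features = [
    "japanese",  # pulls in lindera with UniDic
    "korean",
    "chinese",
] }
```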

Contributor

@ffuugoo ffuugoo left a comment


LGTM. 👌

Was kinda surprised there's so little information regarding CJK tokenization available online (at least in English; I guess there are more in, well, CJK?). And one of the top articles is by Meilisearch, who created charabia.

@generall Should we expose a set of charabia-* features so that it would be trivial to build Qdrant with required dependencies/dictionaries enabled? We already provide a way to enable custom features in our Dockerfile (e.g., --build-arg FEATURES=optional-feature).

@mocobeta

mocobeta commented Jul 17, 2023

According to the Cargo.toml of the charabia library, you'll drop Japanese, Korean, and Chinese Pinyin support with default-features = false. I guess the main reason is the size of their dictionaries. I believe all other languages, such as Chinese (not Pinyin), Thai, and Greek, are still supported.

@generall
Member Author

generall commented Jul 17, 2023

@generall Should we expose a set of charabia-* features so that it would be trivial to build Qdrant with required dependencies/dictionaries enabled?

@ffuugoo could you please suggest the fix for that?

@ffuugoo
Contributor

ffuugoo commented Jul 17, 2023

@generall Should we expose a set of charabia-* features so that it would be trivial to build Qdrant with required dependencies/dictionaries enabled?

@ffuugoo could you please suggest the fix for that?

Done. The only downside is that cargo check --all-features takes a bit longer now, because it enables all optional CJK features.

@generall generall force-pushed the zarkone/charabia-tokenizer branch from 68f1721 to 34c8663 Compare July 17, 2023 22:15
@generall
Member Author

The new version, https://github.com/meilisearch/charabia/releases/tag/v0.8.0, breaks this default per-character tokenization for disabled languages 😭

@mocobeta @mosuka do you know if this is expected behavior? It looks like something related to the default settings of AhoCorasick.

@mocobeta

mocobeta commented Jul 18, 2023

@generall Unfortunately, I have little information about Charabia internals (I only know lindera, the tokenizer for Japanese and Korean.) I don't know if it is the expected behavior of Charabia, but I suspect there could be behavior changes in the language detection or tokenizer selection mechanism. Problems often come from the language detection part, not from tokenizers. Will take a look.

@generall
Member Author

Created an issue there: meilisearch/charabia#229

@mosuka

mosuka commented Jul 18, 2023

@generall
This may be an effect of this change.
https://github.com/meilisearch/charabia/releases/tag/v0.8.0

@mocobeta

It looks like an intentional breaking change to the default tokenizer in charabia. It might not be a showstopper for integrating it into qdrant. At least developers will have the option to build qdrant with charabia's full features if they want 🤔

@generall
Member Author

It might not be a showstopper to integrate it into qdrant.

Agreed, but the motivation for this decision is not clear to me. It turns a partially usable version into a completely unusable one.

@mocobeta

It turns a partially usable version into a completely unusable one

Yes, the latest charabia default tokenizer shouldn't be applied to non-Latin languages like CJK.

Now I wonder if it is worth trying another approach: a character n-gram tokenizer. It is well documented in Lucene and is a generalization of charabia's previous default behavior (character unigrams).
It is an orthogonal approach to this PR. I'm not sure if it makes sense to you or how difficult it is within the context of qdrant; can I have a try?
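For concreteness, a character n-gram tokenizer can be sketched in a few lines of Rust. This is only an illustration of the approach, not Lucene's or charabia's implementation:

```rust
/// Character n-gram tokenizer: slide a window of `n` characters over
/// the text. With n = 1 this reproduces the previous per-character
/// default behavior; n = 2 yields bi-grams, n = 3 tri-grams.
fn char_ngrams(text: &str, n: usize) -> Vec<String> {
    assert!(n > 0, "n-gram size must be at least 1");
    let chars: Vec<char> = text.chars().collect();
    if chars.is_empty() {
        return Vec::new();
    }
    if chars.len() < n {
        // Text shorter than the window: emit it as a single token.
        return vec![chars.iter().collect()];
    }
    chars.windows(n).map(|w| w.iter().collect()).collect()
}

fn main() {
    // Bi-grams of 本日の日付は
    assert_eq!(
        char_ngrams("本日の日付は", 2),
        vec!["本日", "日の", "の日", "日付", "付は"]
    );
    // Uni-grams reproduce the per-character fallback.
    assert_eq!(char_ngrams("日付", 1), vec!["日", "付"]);
}
```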

@generall
Member Author

@mocobeta A character n-gram tokenizer usually only works well with a ranking function designed to prioritize matches with more overlapping n-grams, something like BM25. Qdrant only uses full-text for filtering, and in this case n-gram tokenizers might create too much noise.

@generall generall merged commit 5d2ec88 into dev Jul 19, 2023
@ffuugoo ffuugoo deleted the zarkone/charabia-tokenizer branch July 19, 2023 13:27
@mocobeta

@generall Sorry, I didn't explain the use cases of the n-gram tokenizer. Actually, we Japanese (and "CJK" people) often use character n-gram tokenizers not only for ranking but also for simple document filtering. It works in situations where a dictionary-based tokenizer is not available or does not work well with real-world text.
The trick is, for example, we tokenize 明日の天気は曇りのち晴れです into bi-grams 明日 / 日の / の天 / 天気 / 気は / は曇 / 曇り / りの / のち / ち晴 / 晴れ / れで / です. Then when we have a search query 曇りのち晴れ, we also tokenize it into 曇り / りの / のち / ち晴 / 晴れ and perform a conjunction search. The combination of bi-grams or tri-grams and conjunction search generally works well for us, and if a phrase search query is available we don't have any noise.

This is a typical workaround for searching CJK languages; I hope this explanation makes sense to you. The downside is the index and vocabulary size, and obviously, it doesn't work without conjunction search (an "and" operation).
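The workaround described above can be sketched as follows. This is only an illustration of the bi-gram plus conjunction-search pattern, not qdrant's filtering code:

```rust
use std::collections::HashSet;

/// Character bi-grams of the input text.
fn bigrams(text: &str) -> Vec<String> {
    let chars: Vec<char> = text.chars().collect();
    chars.windows(2).map(|w| w.iter().collect()).collect()
}

/// Conjunction ("and") filtering: a document matches only if it
/// contains every bi-gram of the query.
fn matches(doc: &str, query: &str) -> bool {
    let index: HashSet<String> = bigrams(doc).into_iter().collect();
    bigrams(query).iter().all(|g| index.contains(g))
}

fn main() {
    let doc = "明日の天気は曇りのち晴れです";
    // All bi-grams of the query appear in the document's bi-gram set.
    assert!(matches(doc, "曇りのち晴れ"));
    // 晴れのち曇り contains the bi-gram れの, which is absent from the
    // document, so the conjunction filter rejects it.
    assert!(!matches(doc, "晴れのち曇り"));
}
```

As the comment notes, false positives are possible when the matched bi-grams are scattered rather than contiguous; a phrase check on top of the conjunction filter removes that noise.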

IvanPleshkov pushed a commit that referenced this pull request Jul 19, 2023
* charabia tokenizer for CJK, init

* add charabia to .proto

* generate grpc docs

* rename charabia to multilingual

* disable charabia default features

* ignore stopwords and separators in tokenizer

* fix rebase conflict

* fix test

* fix test

* Bump `charabia` crate version...

...and enable `charabia` features that does not require any additional dependencies by default

* Expose optional `charabia` features in the `segment` crate

* Expose optional `charabia` features in the `qdrant` crate

* fix codespell

* more regression tests

* more regression tests

* make language-specific tests feature-flagged

---------

Co-authored-by: Anatolii Smolianinov <zarkonesmall@gmail.com>
Co-authored-by: Roman Titov <ffuugoo@users.noreply.github.com>
generall added a commit that referenced this pull request Jul 31, 2023
@ghost

ghost commented Sep 19, 2023

Is it possible to provide a Dockerfile or Docker image for building with CJK tokenizer support?
@generall

@ffuugoo
Contributor

ffuugoo commented Sep 19, 2023

is it possible to provide a dockerfile or docker image for building with CJK tokenizer support? @generall

Currently, we don't provide prebuilt Docker image with CJK support, but you can build one yourself with the default Dockerfile in the repo:

  • docker build buildx --build-arg FEATURES=multiling-chinese,multiling-japanese,multiling-korean
  • I recommend using docker build buildx, but docker build --build-arg FEATURES=... should also work
  • and, of course, you could specify any combination of multiling-chinese/multiling-japanese/multiling-korean that you want (e.g., --build-arg FEATURES=multiling-chinese,multiling-korean or --build-arg FEATURES=multiling-japanese)

@yangboz

yangboz commented Apr 16, 2024

is it possible to provide a dockerfile or docker image for building with CJK tokenizer support? @generall

Currently, we don't provide prebuilt Docker image with CJK support, but you can build one yourself with the default Dockerfile in the repo:

  • docker build buildx --build-arg FEATURES=multiling-chinese,multiling-japanese,multiling-korean
  • I recommend using docker build buildx, but docker build --build-arg FEATURES=... should also work
  • and, of course, you could specify any combination of multiling-chinese/multiling-japanese/multiling-korean that you want (e.g., --build-arg FEATURES=multiling-chinese,multiling-korean or --build-arg FEATURES=multiling-japanese)

Would it be possible to publish the Qdrant Docker image with those features under different tags?

@timvisee
Member

timvisee commented Apr 16, 2024

is it possible to provide a dockerfile or docker image for building with CJK tokenizer support? @generall

Currently, we don't provide prebuilt Docker image with CJK support, but you can build one yourself with the default Dockerfile in the repo:

  • docker build buildx --build-arg FEATURES=multiling-chinese,multiling-japanese,multiling-korean
  • I recommend using docker build buildx, but docker build --build-arg FEATURES=... should also work
  • and, of course, you could specify any combination of multiling-chinese/multiling-japanese/multiling-korean that you want (e.g., --build-arg FEATURES=multiling-chinese,multiling-korean or --build-arg FEATURES=multiling-japanese)

Would it be possible to publish the Qdrant Docker image with those features under different tags?

Yes, it is possible to build one as described above. But we don't provide one by default at this time.

@yangboz

yangboz commented Apr 16, 2024

  • docker build buildx --build-arg FEATURES=multiling-chinese

After trying this example command, the following error happens:

 unable to prepare context: path "buildx" not found

Any ideas? Thanks.

@timvisee
Member

  • docker build buildx --build-arg FEATURES=multiling-chinese

After trying this example command, the following error happens:

 unable to prepare context: path "buildx" not found

Any ideas? Thanks.

You might have to install buildx: https://stackoverflow.com/a/77359279/1000145
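For anyone hitting this: the error also suggests that `docker build buildx ...` is being parsed as `docker build` with `buildx` as the context path. A sketch of the likely intended invocation, assuming the qdrant repository root as the build context (the `qdrant-cjk` tag is just an example name):

```shell
# Run from the qdrant repository root, where the Dockerfile lives.
# "buildx build" is the subcommand; the trailing "." is the build context.
docker buildx build \
  --build-arg FEATURES=multiling-chinese,multiling-japanese,multiling-korean \
  --tag qdrant-cjk .
```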

@yangboz

yangboz commented Apr 17, 2024

After running

cargo add charabia --no-default-features --features "chinese,japanese,korean"

how do I use Docker to build it? Is there any tutorial on a customized Docker image with the CJK tokenizers? Thanks a lot!

@yangboz

yangboz commented Apr 19, 2024

Are there any easier-to-follow tutorials on this?
