👍 let me know if you want me to add anything!
Can I please have a short recap of what it means for our users to use Charabia without the default features enabled? What would the documentation look like regarding:
The main thing it does is figure out in which cases we need to tokenize on whitespace, and in which cases we need to take individual characters. Both cases are tokenized in an acceptable way (not ideal, with the expensive features disabled).
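The whitespace-vs-characters fallback described above could be sketched roughly like this (illustrative only, not charabia's actual implementation; `is_cjk` and `naive_tokenize` are hypothetical helpers and the Unicode ranges are a simplification):

```rust
// Illustrative sketch, NOT charabia's real code: fall back to whitespace
// splitting for space-delimited scripts and to per-character tokens for
// CJK text, where words are not separated by spaces.
fn is_cjk(c: char) -> bool {
    matches!(c,
        '\u{4E00}'..='\u{9FFF}'   // CJK Unified Ideographs
        | '\u{3040}'..='\u{30FF}' // Hiragana and Katakana
        | '\u{AC00}'..='\u{D7AF}' // Hangul Syllables
    )
}

fn naive_tokenize(text: &str) -> Vec<String> {
    if text.chars().any(is_cjk) {
        // Per-character tokenization: acceptable for filtering, not ideal.
        text.chars()
            .filter(|c| !c.is_whitespace())
            .map(|c| c.to_string())
            .collect()
    } else {
        text.split_whitespace().map(str::to_owned).collect()
    }
}
```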
```rust
assert_eq!(tokens.get(3), Some(&"日".to_owned()));
assert_eq!(tokens.get(4), Some(&"付".to_owned()));
assert_eq!(tokens.get(5), Some(&"は".to_owned()));
}
```
Hi, I just noticed this test case, and to me, there might be something wrong.
It is clearly Japanese text, but I don't think it is correct tokenization.
本日の日付は should be split into four tokens: 本日 / の / 日付 / は.
I am a contributor to https://github.com/lindera-morphology/lindera (the Japanese tokenizer used internally by the charabia tokenizer), but I'm not familiar with charabia.
@mosuka, do you know anything about the behavior of charabia?
Ok, I missed the previous comment #2260 (comment) so never mind.
I'd like to have clear documentation for the change; while it may be acceptable for general "multilingual" support, it would not be valid "CJK support" without the full Charabia features (language-specific dictionaries).
We will document it for those who want to have full support.
Thank you, I fully understand the design choice here not to bloat Qdrant itself. I just wanted to clarify what we will have with this change.
```toml
charabia = { version = "0.7.2", default-features = false }
```
The above is specified in Cargo.toml, but I believe that would not allow all the language-specific tokenizers available in Charabia.
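To get the language-specific tokenizers back, the corresponding Cargo features would have to be re-enabled explicitly. A sketch only; the feature names below are assumptions and should be checked against charabia's own `Cargo.toml`:

```toml
# Sketch: feature names are assumed, verify against charabia's Cargo.toml
charabia = { version = "0.7.2", default-features = false, features = [
    "chinese",
    "japanese",
    "korean",
] }
```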
ffuugoo left a comment:
LGTM. 👌
Was kinda surprised there's so little information regarding CJK tokenization available online (at least in English; I guess there are more in, well, CJK?). And one of the top articles is by Meilisearch, who created charabia.
@generall Should we expose a set of charabia-* features so that it would be trivial to build Qdrant with required dependencies/dictionaries enabled? We already provide a way to enable custom features in our Dockerfile (e.g., --build-arg FEATURES=optional-feature).
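If such features were exposed, a custom build could look something like the following. This is a hypothetical sketch: the `charabia-*` feature names are not an existing Qdrant interface, only the `FEATURES` build arg mentioned above is:

```shell
# Hypothetical: feature names are assumptions, only the FEATURES
# build arg is known to exist in the Dockerfile.
docker build --build-arg FEATURES="charabia-chinese,charabia-japanese,charabia-korean" -t qdrant-cjk .
```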
According to Cargo.toml of …
Done. The only downside is that …
...and enable `charabia` features that do not require any additional dependencies by default
The new version of https://github.com/meilisearch/charabia/releases/tag/v0.8.0 breaks this default per-character tokenization for disabled languages 😭 @mocobeta @mosuka do you know if this is expected behavior? It looks like something related to the default settings of AhoCorasick.
@generall Unfortunately, I have little information about Charabia internals (I only know …
Created an issue there: meilisearch/charabia#229
@generall It looks like an intentional breaking change in the default tokenizer in charabia. It might not be a showstopper for integrating it into qdrant. At least developers will have the option to build qdrant with charabia's full features if they want 🤔
Agreed, but the motivation for this decision is not clear to me. It turns a partially usable version into a completely unusable one.
Yes, the latest charabia default tokenizer shouldn't be applied to non-Latin languages like CJK. Now I wonder if it is worth trying another approach: a character n-gram tokenizer. It is well documented in Lucene and is a generalization of charabia's previous default behavior (character unigrams).
@mocobeta A character n-gram tokenizer usually only works well with a ranking function designed to prioritize matches with more overlapping n-grams, something like BM25. Qdrant only uses full-text search for filtering, and in this case n-gram tokenizers might create too much noise.
@generall Sorry, I didn't explain the use cases of the n-gram tokenizer. Actually, we Japanese (and "CJK" people) often use a character n-gram tokenizer not only for ranking but also for simply filtering documents. It works in situations where a dictionary-based tokenizer is not available or does not work well with real-world text. This is a typical workaround for searching CJK languages; I hope this explanation makes sense to you. The downsides are the index and vocabulary size, and obviously, it doesn't work without conjunction search ("and" operation).
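The character n-gram approach discussed above can be sketched in a few lines (illustrative only; `char_ngrams` is a hypothetical helper, not charabia or Lucene code):

```rust
// Minimal character n-gram tokenizer sketch (n >= 1): emits every run of
// n consecutive chars. This is the usual CJK workaround when no
// dictionary-based tokenizer is available; n = 1 reproduces charabia's
// previous per-character default.
fn char_ngrams(text: &str, n: usize) -> Vec<String> {
    let chars: Vec<char> = text.chars().collect();
    if chars.len() < n {
        // Shorter than n: emit the whole text as a single token (if any).
        return if chars.is_empty() {
            vec![]
        } else {
            vec![chars.iter().collect()]
        };
    }
    chars
        .windows(n)
        .map(|w| w.iter().collect())
        .collect()
}

// char_ngrams("本日の日付は", 2) -> ["本日", "日の", "の日", "日付", "付は"]
```

Note the filtering caveat from the comment above: every bigram of a query must match (conjunction / "and" semantics), otherwise partial overlaps produce false positives.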
* charabia tokenizer for CJK, init
* add charabia to .proto
* generate grpc docs
* rename charabia to multilingual
* disable charabia default features
* ignore stopwords and separators in tokenizer
* fix rebase conflict
* fix test
* fix test
* Bump `charabia` crate version... ...and enable `charabia` features that does not require any additional dependencies by default
* Expose optional `charabia` features in the `segment` crate
* Expose optional `charabia` features in the `qdrant` crate
* fix codespell
* more regression tests
* more regression tests
* make language-specific tests feature-flagged

Co-authored-by: Anatolii Smolianinov <zarkonesmall@gmail.com>
Co-authored-by: Roman Titov <ffuugoo@users.noreply.github.com>
Is it possible to provide a Dockerfile or Docker image for building with CJK tokenizer support?
Currently, we don't provide a prebuilt Docker image with CJK support, but you can build one yourself with the default …
Would it be possible to publish Qdrant Docker images with different tags for those builds?
Yes, it is possible to build one as described above, but we don't provide one by default at this time.
After trying this example command, the following error happened. Any idea? Thanks.
You might have to install …
After that, how do I use Docker to build it? Is there any tutorial on building a customized Docker image with this CJK tokenizer stuff? Thanks a lot!
Are there any easier-to-follow tutorials on it?
Another attempt to think about CJK tokenizer: #2023
Created this PR because I don't have access to change the original one (all authorship preserved).
CC @zarkone