Conversation
|
looking forward for a feedback on naming and approach in general, as well as what is missing if this approach is applicable 🙏 |
|
You need to update the EDIT: Maybe you simply forgot to commit the file? |
|
no, I've thought .proto generates from .rs file and not the other way around 🥲 |
|
Current status: we are struggling to get a green build for Windows on CI 😬 |
|
Also https://github.com/qdrant/qdrant/actions/runs/5180834118/jobs/9335607637?pr=2023 fails. It requires updates in generated grpc docs |
6cca910 to
010908c
Compare
|
@agourlay thanks for checking! windows build fail doesn't give much clues https://github.com/qdrant/qdrant/actions/runs/5180834122/jobs/9353240767?pr=2023 -- I'll look into it, and also will think about alternative options. |
|
@zarkone please rebase your PR, I have pushed a change to reduce the amount of artifacts generated at compile time. |
010908c to
c1d7da1
Compare
|
thanks @agourlay! rebased ✔️ |
docs/grpc/docs.md
Outdated
| | Prefix | 1 | | | ||
| | Whitespace | 2 | | | ||
| | Word | 3 | | | ||
| | Charabia | 4 | | |
There was a problem hiding this comment.
I think my first concern is the naming of the API, surfacing charabia seems like an implementation details leak.
What do you think about multilingual instead?
There was a problem hiding this comment.
right, this was my concern as well -- and multilingual is much better.
lib/segment/Cargo.toml
Outdated
| chrono = { version = "0.4.26", features = ["serde"] } | ||
|
|
||
| sysinfo = "0.29" | ||
| charabia = "0.7.2" |
There was a problem hiding this comment.
By default, the crates enables the following tokenizers:
- chinese
- hebrew
- japanese
- thai
- korean
- greek
- latin-camelcase
We initially just wanted to support CJK but I guess it is fine.
What I would like to see is the difference in binary size before and after this PR.
There was a problem hiding this comment.
right, I was also thinking that this might help with situations like windows build fails, potentially. If we want to focus only on CJK it makes sense to disable others.
There was a problem hiding this comment.
OK, so I've measured the binary size on fresh dev and here. Results doesn't look great..
| setup | commit | size |
|----------------+----------+-------|
| dev | edf3d88e | 52.3M |
| charabia: full | 68266067 | 92.7M |
| charabia: cjk | -- | 92.7M |
Linux(NixOS), x86_64
These are the sizes of target/release/qdrant after executing cargo build --release.
I've also expected that removing features from charabia would exclude some crates from deps and the size will be reduced. However, -- I've checked couple of times and did a cleanup just to make sure -- the size was the same as with full charabia.
This is the command I've used to limit features:
cargo add charabia --no-default-features --features "chinese,japanese,korean"
Updating crates.io index
Adding charabia v0.7.2 to dependencies.
Features:
+ chinese
+ japanese
+ korean
- greek
- hebrew
- japanese-transliteration
- latin-camelcase
- lindera
- thaiIt feels too much for one type of tokenizer to take half of binary size
There was a problem hiding this comment.
it looks like meilisearch somehow fits 50mb
There was a problem hiding this comment.
It's not too surprising that the CJK tokenizers take so much space, as they each include a full dictionary for their respective language.
c1d7da1 to
6826606
Compare
|
Ok, so tokenizers are heavy because of dictionaries, that is correct. Why they need dictionaries? To tokenize languages without the notion Without dictionary, one of the simplest ways of tokenizing is n-gram
In order to mitigate this issue, tokenizers are using various On of the most popular methods in this field is Hidden Markov This one can be used for example with jiebra-rs, which serves as one The size of default dictionary in jiebra-rs(a backend for processing This is how charabia tokenizes Chinese text: other backends are Charabia adds around 50M when all the languages are included.
I think because Meilisearch doesn't enable it by default, exposing as
If I use charabia without default features, binary size changes are Even without dictionaries, charabia does a lot of things like
which can be used to improve querying mechanism: for example, That being said, I think that it still makes sense to use charabia, This example also shows that with enabled backends not only tokens are in:
Bonus material: enabled hmm in charabia src
|
|
What do you think about forwarding build features? |
|
Thanks a lot for your efforts on this! Unfortunately, we ended up deciding not to merge this into Qdrant. Enabling the CJK dictionaries make the binary too big, as you've found. Not including the dictionaries make the benefits too insignificant. We want to keep the core Qdrant product as simple as possible without too many bells and whistles. Nevertheless, your contribution has been very valuable and provided us with new insights. We would therefore still like to reward you with the bounty. We hope you understand why we made this decision. If there's a clear use case for this in the future, we well revisit it again. Thanks again for picking up this task and working on it thoroughly. |
|
@timvisee no, that sounds good, totally fine, no worries -- totally valid. I've actually had a lot of fun exploring qdrant codebase and tokenizer problems. In case you come across such use case where you need a complex tokenizer for a certain cluster/client, I feel that it would be relatively easy to forward these build flags from charabia. Just decided to repeat that from my conclusion so it will be documented here in summary. |
|
/tip $250 @zarkone |
|
@zarkone: You just got a $250 tip! 👉 Complete your Algora onboarding to collect your payment. |
|
🎉🎈 @zarkone has been awarded $250! 🎈🎊 |
How about putting all CJK related code under a new feature? ("cjk" maybe?) |
|
Since #2260 is merged, I think we can close this one. 😃 |

/claim #1909
Adds new type of tokenizer suitable for CJK languages using meilisearch/charabia
All Submissions:
New Feature Submissions:
cargo fmtcommand prior to submission?cargo clippycommand?Changes to Core Features: