𝐕𝐢𝐬𝐢𝐨𝐧 𝐑𝐀𝐆 𝐨𝐧 𝐂𝐨𝐦𝐩𝐥𝐞𝐱 𝐆𝐫𝐚𝐩𝐡𝐢𝐜𝐬🖼️
RAG is mostly text-only, even though we have so much data available as charts/figures.
Combine @cohere's latest Embed v4 embedding model with a vision LLM like @GoogleDeepMind Gemini to get 𝐕𝐢𝐬𝐢𝐨𝐧 𝐑𝐀𝐆
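A minimal sketch of the retrieval half of that idea: embed every chart or figure once, embed the question, and pick the closest images by cosine similarity. The file names and the three-dimensional vectors are made up for illustration; in a real setup they would come from an image-capable embedding model, and the top hits would then go to a vision LLM together with the question. No actual Cohere or Gemini API calls are shown.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def retrieve(query_vec, image_index, top_k=1):
    """Return the paths of the top_k images most similar to the query."""
    ranked = sorted(
        image_index,
        key=lambda item: cosine(query_vec, item[1]),
        reverse=True,
    )
    return [path for path, _ in ranked[:top_k]]


# Hypothetical index: (image path, embedding). The vectors are hand-written
# stand-ins for what an image embedding model would return.
image_index = [
    ("revenue_chart.png", [0.9, 0.1, 0.0]),
    ("org_diagram.png", [0.1, 0.9, 0.1]),
    ("heatmap.png", [0.0, 0.2, 0.9]),
]

# Stand-in for the embedded user question.
query_vec = [0.8, 0.2, 0.1]

top = retrieve(query_vec, image_index)
print(top)

# Next step (not shown): send the question plus the retrieved image(s)
# to a vision LLM and let it answer from the graphic itself.
```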
GPT-3 Embeddings by @OpenAI were announced this week.
📈 I was excited and tested them on 20 datasets
😢 Sadly, they are worse than open models that are 1000x smaller
💰 Running @OpenAI models can be 1 million times more expensive
tinyurl.com/gpt3-emb
𝐒𝐞𝐦𝐚𝐧𝐭𝐢𝐜 𝐒𝐞𝐚𝐫𝐜𝐡 𝐨𝐧 𝟏𝟎𝟎𝐌 𝐝𝐨𝐜𝐬 - 𝐖𝐢𝐭𝐡 𝟏𝟎𝟎𝐌𝐁 𝐨𝐟 𝐌𝐞𝐦𝐨𝐫𝐲
GPU-poor and memory-poor, without 500GB of memory to embed & index 100M docs?
Still want to participate in TREC-RAG 2024?
Introducing 𝐃𝐢𝐬𝐤𝐕𝐞𝐜𝐭𝐨𝐫𝐈𝐧𝐝𝐞𝐱
🇺🇳𝟐𝟓𝟎𝐌 𝐖𝐢𝐤𝐢𝐩𝐞𝐝𝐢𝐚 𝐄𝐦𝐛𝐞𝐝𝐝𝐢𝐧𝐠𝐬 𝐢𝐧 𝟑𝟎𝟎+ 𝐋𝐚𝐧𝐠𝐮𝐚𝐠𝐞𝐬 🇺🇳
What could you build if your RAG has access to Wikipedia in all 300+ languages?
Available for anyone to use, using our state-of-the-art multilingual embedding model:
huggingface.co/datasets/Coher…
🚀 𝐂𝐨𝐡𝐞𝐫𝐞 𝐄𝐦𝐛𝐞𝐝 𝐕𝟑 - 𝐢𝐧𝐭𝟖 & 𝐛𝐢𝐧𝐚𝐫𝐲 𝐒𝐮𝐩𝐩𝐨𝐫𝐭🚀
I'm excited to launch our native support for int8 & binary embeddings for Cohere Embed V3.
They slash your vector DB cost 4x - 32x while keeping 95% - 100% of the search quality.
txt.cohere.com/int8-binary-em…
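Where the 4x and 32x come from, as a rough sketch: a float32 value takes 4 bytes per dimension, int8 takes 1 byte, and binary takes 1 bit. The code below is a generic illustration, not Cohere's actual quantization. The dimension (1024), the calibration range of [-1, 1], and the helper names are assumptions, and it only shows the storage math, not the search-quality numbers.

```python
import random
import struct


def to_int8(vec, lo, hi):
    """Map floats from [lo, hi] onto int8 values in [-128, 127], one byte each."""
    scale = 255.0 / (hi - lo)
    quantized = []
    for x in vec:
        x = min(max(x, lo), hi)  # clip to the calibration range
        quantized.append(round((x - lo) * scale) - 128)
    return struct.pack(f"{len(quantized)}b", *quantized)


def to_binary(vec):
    """Keep only the sign of each dimension, packed 8 dimensions per byte."""
    packed = bytearray((len(vec) + 7) // 8)
    for i, x in enumerate(vec):
        if x > 0:
            packed[i // 8] |= 1 << (7 - i % 8)
    return bytes(packed)


def hamming(a, b):
    """Number of differing bits; the usual distance for binary embeddings."""
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))


random.seed(0)
dim = 1024  # assumed embedding dimension
vec = [random.uniform(-1.0, 1.0) for _ in range(dim)]

float32_bytes = len(struct.pack(f"{dim}f", *vec))
int8_bytes = len(to_int8(vec, -1.0, 1.0))
binary_bytes = len(to_binary(vec))

print(float32_bytes, int8_bytes, binary_bytes)
print(float32_bytes // int8_bytes, float32_bytes // binary_bytes)
```

Binary vectors are compared with Hamming distance (the `hamming` helper), which is cheap enough to scan very large collections.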
Happy to announce that today is my first day at @huggingface. Looking forward to meeting the new team.
First project will be on better integration of the huggingface hub into SentenceTransformers - Sharing your own SBERT.net models will become super easy!
𝐁𝐌𝟒𝟐 - 𝐓𝐡𝐞 𝐌𝐢𝐬𝐬𝐢𝐧𝐠 𝐁𝐞𝐧𝐜𝐡𝐦𝐚𝐫𝐤
This week, Qdrant released an interesting new approach that claims to replace BM25/lexical search.
Sadly, they forgot to do proper benchmarking.
As it turns out: BM42 is way worse than BM25.
Hey all!
We actually did find a discrepancy with our previous benchmarks of bm42. Please don't trust us and always check performance on your own data.
Our best effort to correct it is here: github.com/qdrant/bm42_ev…
𝗖𝗼𝗵𝗲𝗿𝗲 𝗘𝗺𝗯𝗲𝗱 𝗩𝟯 - 𝗢𝘂𝗿 𝗡𝗲𝘄 𝗘𝗺𝗯𝗲𝗱𝗱𝗶𝗻𝗴 𝗠𝗼𝗱𝗲𝗹
Our team has been hard at work to ship the best embedding model for noisy & complex data.
After great feedback, it is finally publicly available.
MTEB - Massive Text Embedding Benchmark 🧨
Text embeddings are useful for many applications 💻, but their evaluation is still often done rather poorly, on trivial datasets 🙁. MTEB is here to change that.
We collected 58 datasets across 8 tasks and evaluated many public models.
🇺🇳Semantic Search finally works across languages! 🇺🇳
Semantic Search gives great search results, but so far it has worked only for English😰
Glad to share our new Cohere multilingual embedding model for 100+ languages. And the results are amazing 📈
Details:
txt.cohere.ai/multilingual/
🚨Sentence-Embeddings Model Alert🚨
Significantly better sentence embeddings models are now available in Sentence-Transformers: sbert.net/docs/pretraine…
Models have been evaluated on 14 challenging datasets including data from Twitter, Reddit, biomedical domain, e-mails and more
📺How to train state-of-the-art sentence embeddings? 📺
Just uploaded my 3-part video series on the theory of how to train state-of-the-art sentence embedding models: