[Bug/embeddings]: big difference between embeddings in sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2 #368

Description

@SardorLut

What happened?

import numpy as np
from numpy import dot
from numpy.linalg import norm
from fastembed import TextEmbedding
from sentence_transformers import SentenceTransformer

text = ["Я помню чудное мгновенье:\nПередо мной явилась ты,\nКак мимолетное виденье,\nКак гений чистой красоты."]

embedding_model = TextEmbedding(
    model_name="sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2",
    providers=["CPUExecutionProvider"],
)
embed_from_fastembed = np.array(list(embedding_model.embed(documents=text)))

model = SentenceTransformer('sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2')
embeddings_from_sentence_transformer = np.array(model.encode(text))

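# Take the single sentence's vector from each library so both operands are 1-D.
a = embeddings_from_sentence_transformer[0]
b = embed_from_fastembed[0]
cos_sim = dot(a, b) / (norm(a) * norm(b))
print(float(cos_sim))  # reported output: 0.6093958020210266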

The difference is even greater when I pass a longer text.
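To make the comparison above less sensitive to array shapes, here is a minimal sketch of a cosine-similarity helper that rejects mismatched or zero-length inputs. The name `cosine_similarity` is hypothetical and belongs to neither library. It uses only the standard library and assumes both inputs are 1-D sequences of numbers (plain lists or 1-D NumPy arrays).

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two 1-D vectors (hypothetical helper, not a library API)."""
    # A (1, 384) array compared with a (384,) array fails here instead of
    # silently broadcasting.
    if len(a) != len(b):
        raise ValueError(f"length mismatch: {len(a)} vs {len(b)}")

    # fsum limits floating-point accumulation error on long vectors.
    dot_product = math.fsum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(math.fsum(x * x for x in a))
    norm_b = math.sqrt(math.fsum(y * y for y in b))

    # Cosine similarity is undefined for a zero vector.
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")

    return dot_product / (norm_a * norm_b)
```

With the variables from the snippet above, the call would be `cosine_similarity(embeddings_from_sentence_transformer[0], embed_from_fastembed[0])`, so each library contributes one 1-D vector.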

What Python version are you on? e.g. python --version

manager: poetry
Python=3.10
fastembed-gpu="^0.3.6"
onnxruntime-gpu==1.18.0

Version

0.2.7 (Latest)

What os are you seeing the problem on?

No response

Relevant stack traces and/or logs

No response
