[Bug/Model Request]: intfloat/multilingual-e5-large should use average pooling #384

Description

@ITHwang

What happened?

Hi, I'm using intfloat/multilingual-e5-large for a retrieval task, and I found that when E5OnnxEmbedding embeds texts with this model, the model output is pooled with CLS pooling (only the first token's hidden state is kept):

class E5OnnxEmbedding(OnnxTextEmbedding):
    ...

class OnnxTextEmbedding(TextEmbeddingBase, OnnxTextModel[np.ndarray]):
    """Implementation of the Flag Embedding model."""
    ...

    def _post_process_onnx_output(self, output: OnnxOutputContext) -> Iterable[np.ndarray]:
        embeddings = output.model_output
        return normalize(embeddings[:, 0]).astype(np.float32)

I think it would be better to use average pooling, as the paper does when pretraining the model:

Following the popular biencoder architecture, we use a pre-trained Transformer encoder and average pooling over the output layer to get fixed-size text embeddings Eq and Ep. The score is the cosine similarity scaled by a temperature hyperparameter τ : ...

As a workaround, I'm currently getting average pooling by overriding E5OnnxEmbedding:

def average_pool(last_hidden_states: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    ...
    return avg_hidden

class CustomE5OnnxEmbedding(E5OnnxEmbedding):
    ...

    def _post_process_onnx_output(self, output: OnnxOutputContext) -> Iterable[np.ndarray]:
        embeddings, attention_masks = output.model_output, output.attention_mask

        pooled_embeddings = average_pool(embeddings, attention_masks)
        normalized_embeddings = normalize(pooled_embeddings).astype(np.float32)

        return normalized_embeddings

TextEmbedding.EMBEDDINGS_REGISTRY.append(CustomE5OnnxEmbedding)
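
For reference, the body of average_pool is elided above. A minimal NumPy sketch of masked mean pooling is shown below; it is one common way to write it and not necessarily the exact code I run. It assumes last_hidden_states has shape (batch, seq_len, hidden) and attention_mask has shape (batch, seq_len) with 1 for real tokens and 0 for padding. The toy input at the end is made up for illustration.

```python
import numpy as np


def average_pool(last_hidden_states: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """Average token vectors over non-padding positions (illustrative sketch).

    Assumed shapes:
      last_hidden_states: (batch, seq_len, hidden)
      attention_mask:     (batch, seq_len), 1 = real token, 0 = padding
    """
    # Add a trailing axis so the mask broadcasts over the hidden dimension.
    mask = attention_mask[:, :, None].astype(last_hidden_states.dtype)
    # Zero out padded positions, then sum over the sequence axis.
    summed = (last_hidden_states * mask).sum(axis=1)
    # Number of real tokens per row; clip to avoid division by zero.
    counts = np.clip(mask.sum(axis=1), 1e-9, None)
    return summed / counts


# Toy batch: one sequence of three tokens, the last one is padding.
hidden = np.array([[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]])
mask = np.array([[1, 1, 0]])

pooled = average_pool(hidden, mask)  # mean of the two real tokens
cls = hidden[:, 0]                   # what CLS pooling would keep instead
```

On this toy input the padded token does not affect the pooled vector, while CLS pooling only ever looks at the first position.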

Would you consider changing the pooling method to average pooling?

Separately, I'm really enjoying using FastEmbed and I appreciate your work on it!

Thanks for your time and consideration!

What Python version are you on? e.g. python --version

  • Python 3.11
  • FastEmbed 0.4.1

Version

0.2.7 (Latest)

What os are you seeing the problem on?

MacOS

Relevant stack traces and/or logs

No response
