Skip to content

MMR: Reduce cache size by reusing previous MMR scores #140710

@ioanatia

Description

@ioanatia

The MMR algorithm follows a simple but effective process:

  1. Start by selecting the most relevant item
  2. For each remaining item, calculate an MMR score that combines:
    • Its relevance to the query
    • Its dissimilarity to already selected items
  3. Select the item with the highest MMR score
  4. Repeat until you have enough results

The MMR score formula at step 2 is:

MMR(Di) = lambda * similarity(Q,Di)) - (1-lambda)(max(similarity(Di, Dj))

where:

  • Di is the current doc we compute the MMR score
  • Dj represent the docs from the diversified result set
  • Q is the query vector (which can be optional)

Let's assume we have N docs that we need to diversify.
In our current implementation of MMR, we maintain two sets of caches:

  • a cache that holds the similarity scores between the query Q and all documents (N size)
  • a cache that holds the similarity scores between pairs of documents Di,Dj (N*N size)

At step, for each document that we have not selected we recompute the MMR score, using these two caches.

We can reduce the number of computations, the cache size and the number of cache lookups by maintaining:

  • a cache that holds the similarity scores between the query Q and all documents (N size)
  • a cache that holds max similarity between the current doc and the diversified set of docs (N size)

We no longer need a cache for all pairs of similarity(Di, Dj).

Assume we have selected document Dx at step 2 and we are now looking at adding a new document to the diversified set.
The new MMR score becomes:

// update the cache for maxSimilarityToSelected 
maxSimilarityToSelected(Di) = max(similarity(Di, Dx), maxSimilarityToSelected(Di))

MMR(Di) = lambda * similarity(Q,Di)) - (1-lambda) * maxSimilarityToSelected(Di)

With this method we no longer have to iterate through all the already selected documents, when we compute a new MMR score.
We only need to take into account the last document we added to the list of diversified results.

Metadata

Metadata

Assignees

Type

No type
No fields configured for issues without a type.

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions