MMR: Reduce cache size by reusing previous MMR scores

The MMR algorithm follows a simple but effective process:

1. Start by selecting the most relevant item
2. For each remaining item, calculate an MMR score that combines:
   - Its relevance to the query
   - Its dissimilarity to already selected items
3. Select the item with the highest MMR score
4. Repeat until you have enough results


The MMR score formula at step 2 is:

```
MMR(Di) = lambda * similarity(Q,Di)) - (1-lambda)(max(similarity(Di, Dj))
```

where:
- Di is the current doc we compute the MMR score
- Dj represent the docs from the diversified result set
- Q is the query vector (which can be optional)

Let's assume we have N docs that we need to diversify.
In our current implementation of MMR, we maintain two sets of caches:
- a cache that holds the similarity scores between the query Q and all documents (N size)
- a cache that holds the similarity scores between pairs of documents Di,Dj (N*N size)

At step, for each document that we have not selected we recompute the MMR score, using these two caches.

We can reduce the number of computations, the cache size and the number of cache lookups by maintaining:
- a cache that holds the similarity scores between the query Q and all documents (N size)
- a cache that holds max similarity between the current doc and the diversified set of docs (N size)

We no longer need a cache for all pairs of `similarity(Di, Dj)`.

Assume we have selected document `Dx` at step 2 and we are now looking at adding a new document to the diversified set.
The new MMR score becomes:

```
// update the cache for maxSimilarityToSelected 
maxSimilarityToSelected(Di) = max(similarity(Di, Dx), maxSimilarityToSelected(Di))

MMR(Di) = lambda * similarity(Q,Di)) - (1-lambda) * maxSimilarityToSelected(Di)
```

With this method we no longer have to iterate through all the already selected documents, when we compute a new MMR score.
We only need to take into account the last document we added to the list of diversified results.


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

MMR: Reduce cache size by reusing previous MMR scores #140710

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

MMR: Reduce cache size by reusing previous MMR scores #140710

Description

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions