The MMR algorithm follows a simple but effective process:
- Start by selecting the most relevant item
- For each remaining item, calculate an MMR score that combines:
- Its relevance to the query
- Its dissimilarity to already selected items
- Select the item with the highest MMR score
- Repeat until you have enough results
The MMR score formula at step 2 is:
MMR(Di) = lambda * similarity(Q,Di)) - (1-lambda)(max(similarity(Di, Dj))
where:
- Di is the current doc we compute the MMR score
- Dj represent the docs from the diversified result set
- Q is the query vector (which can be optional)
Let's assume we have N docs that we need to diversify.
In our current implementation of MMR, we maintain two sets of caches:
- a cache that holds the similarity scores between the query Q and all documents (N size)
- a cache that holds the similarity scores between pairs of documents Di,Dj (N*N size)
At step, for each document that we have not selected we recompute the MMR score, using these two caches.
We can reduce the number of computations, the cache size and the number of cache lookups by maintaining:
- a cache that holds the similarity scores between the query Q and all documents (N size)
- a cache that holds max similarity between the current doc and the diversified set of docs (N size)
We no longer need a cache for all pairs of similarity(Di, Dj).
Assume we have selected document Dx at step 2 and we are now looking at adding a new document to the diversified set.
The new MMR score becomes:
// update the cache for maxSimilarityToSelected
maxSimilarityToSelected(Di) = max(similarity(Di, Dx), maxSimilarityToSelected(Di))
MMR(Di) = lambda * similarity(Q,Di)) - (1-lambda) * maxSimilarityToSelected(Di)
With this method we no longer have to iterate through all the already selected documents, when we compute a new MMR score.
We only need to take into account the last document we added to the list of diversified results.
The MMR algorithm follows a simple but effective process:
The MMR score formula at step 2 is:
where:
Let's assume we have N docs that we need to diversify.
In our current implementation of MMR, we maintain two sets of caches:
At step, for each document that we have not selected we recompute the MMR score, using these two caches.
We can reduce the number of computations, the cache size and the number of cache lookups by maintaining:
We no longer need a cache for all pairs of
similarity(Di, Dj).Assume we have selected document
Dxat step 2 and we are now looking at adding a new document to the diversified set.The new MMR score becomes:
With this method we no longer have to iterate through all the already selected documents, when we compute a new MMR score.
We only need to take into account the last document we added to the list of diversified results.