
feat: Add evaluation runtime for indexing and retrieval #4639

Merged
KennethEnevoldsen merged 67 commits into embeddings-benchmark:main from ayush1298:add_evaluation_runtime
Jun 9, 2026

Conversation

@ayush1298

@ayush1298 ayush1298 commented May 8, 2026

Collaborator

closes #4177

This PR adds a TimingStack class that records the start and end times of the different phases of indexing and retrieval, and provides a plot() method to plot the timings.

Copilot AI review requested due to automatic review settings May 8, 2026 17:56

Copilot AI left a comment

Contributor


Pull request overview

This PR introduces per-phase runtime tracking for evaluation (especially retrieval), and persists those phase timings into TaskResult so downstream consumers can inspect indexing/search/scoring time breakdowns.

Changes:

  • Added a new TimingStack utility (with a quick_plot() logger-based visualization) to record named phase start/end times.
  • Extended TaskResult with an evaluation_phases field and wired evaluation code to populate it.
  • Instrumented task data loading/transform and retrieval evaluation (indexing/search/scoring) to record phase timings.
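The idea of recording named phase start/end times can be sketched as a small context manager. This is only an illustration: `PhaseTimer` and its `phase()` method are hypothetical names chosen for this example and are not the `TimingStack` API added by the PR. Start and end are stored as offsets from the first recorded phase, which matches the shape of the `evaluation_phases` entries shown later in this conversation.

```python
import time
from contextlib import contextmanager


class PhaseTimer:
    """Hypothetical stand-in for a phase timer; not the mteb implementation."""

    def __init__(self):
        self._origin = None  # perf_counter value when the first phase started
        self.phases = []  # one dict per finished phase

    @contextmanager
    def phase(self, name, **tags):
        now = time.perf_counter()
        if self._origin is None:
            self._origin = now
        # Offsets are relative to the first phase, so the first start is 0.0.
        record = {"name": name, "start": now - self._origin, **tags}
        try:
            yield record
        finally:
            # Record the end even if the wrapped code raises.
            record["end"] = time.perf_counter() - self._origin
            self.phases.append(record)


timer = PhaseTimer()
with timer.phase("data_loading"):
    time.sleep(0.01)  # stand-in for real work
with timer.phase("scoring", split="test", subset="ar"):
    time.sleep(0.01)

for p in timer.phases:
    print(p["name"], round(p["end"] - p["start"], 3))
```

Because a phase is appended when it ends, nested phases in this sketch would be stored inner-first; a real implementation may order them differently.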

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 8 comments.

Show a summary per file
File | Description
mteb/timing.py | Adds TimingStack/TimingContext and quick_plot() for recording and displaying phase timings.
mteb/results/task_result.py | Adds evaluation_phases field and passes it through constructors / historic conversion.
mteb/evaluate.py | Populates TaskResult.evaluation_phases from the task timer at the end of evaluation.
mteb/abstasks/retrieval.py | Wraps retrieval data loading, dataset transform, and scoring in timer phases; passes timer to evaluator.
mteb/abstasks/abstask.py | Instantiates a timer on all tasks and records data_loading / dataset_transform phases in load_data().
mteb/_evaluators/retrieval_evaluator.py | Adds optional timer support to measure indexing/search phases inside the retrieval evaluator.


Comment thread mteb/evaluate.py Outdated
Comment thread mteb/results/task_result.py
Comment thread mteb/results/task_result.py Outdated
Comment thread mteb/abstasks/retrieval.py Outdated
Comment thread mteb/_evaluators/retrieval_evaluator.py
Comment thread mteb/timing.py Outdated
Comment thread mteb/timing.py Outdated
Comment thread mteb/timing.py Outdated
@ayush1298

Collaborator Author

@KennethEnevoldsen @Samoed When I tested this with the script below:

import json
import logging

import mteb
logging.basicConfig(level=logging.INFO, format="%(message)s")


def main():
    model = mteb.get_model("sentence-transformers/all-MiniLM-L6-v2")
    task = mteb.get_task("SciFact")
    results = mteb.evaluate(
        model, task, overwrite_strategy="always", encode_kwargs={"batch_size": 32}
    )
    task.timer.quick_plot()
    task_result = results.task_results[0]
    task_result_dict = task_result.to_dict()
    print("evaluation_time:", task_result_dict.get("evaluation_time"))

    phases = task_result_dict.get("evaluation_phases")
    if phases:
        print("evaluation_phases:", json.dumps(phases, indent=2))
    else:
        print("No evaluation_phases found!")


if __name__ == "__main__":
    main()

I got the following output:

data_loading      |█████████████████████████████                     | 17.2s
dataset_transform |                             █                    | 0.0s
encode_corpus     |                             █                    | 0.0s
encode_queries    |                             ████████████████████ | 11.7s
scoring           |                                                 █| 0.2s
                   29.1s (untracked: 0.0s)
evaluation_time: 11.84852123260498
evaluation_phases: [
  {
    "name": "data_loading",
    "start": 2.86102294921875e-06,
    "end": 17.240598917007446
  },
  {
    "name": "dataset_transform",
    "start": 17.240612983703613,
    "end": 17.24061393737793
  },
  {
    "name": "encode_corpus",
    "start": 17.2497079372406,
    "end": 17.249716997146606
  },
  {
    "name": "encode_queries",
    "start": 17.2497341632843,
    "end": 28.928431034088135
  },
  {
    "name": "scoring",
    "start": 28.92854905128479,
    "end": 29.086568117141724
  }
]
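For readers who want to check the plot against the raw numbers, per-phase durations can be derived directly from the `evaluation_phases` list above. The snippet below is a minimal sketch using the values from this output. The way "untracked" time is computed here (wall-clock span minus the sum of phase durations, assuming phases do not overlap) is an assumption about what the plot reports, not something the PR states.

```python
# Values copied from the evaluation_phases output above.
phases = [
    {"name": "data_loading", "start": 2.86102294921875e-06, "end": 17.240598917007446},
    {"name": "dataset_transform", "start": 17.240612983703613, "end": 17.24061393737793},
    {"name": "encode_corpus", "start": 17.2497079372406, "end": 17.249716997146606},
    {"name": "encode_queries", "start": 17.2497341632843, "end": 28.928431034088135},
    {"name": "scoring", "start": 28.92854905128479, "end": 29.086568117141724},
]

# Duration of each phase in seconds.
durations = {p["name"]: p["end"] - p["start"] for p in phases}

# Wall-clock span covered by all phases, and the gaps between them.
wall = max(p["end"] for p in phases) - min(p["start"] for p in phases)
untracked = wall - sum(durations.values())  # assumes phases do not overlap

for name, seconds in durations.items():
    print(f"{name:<18}{seconds:8.1f}s")
print(f"total {wall:.1f}s (untracked: {untracked:.1f}s)")
```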

So, looking at the plot or the raw timings, dataset_transform took nearly zero time (a difference in the 6th or 7th decimal place), and encode_corpus also took effectively zero time (about 9 microseconds).
The problem is that, for encoder models, corpus encoding doesn't happen in encode_corpus. When .index() is called, we simply store a reference to the dataset (self.task_corpus = corpus) without encoding any text, so it takes almost no time. The encoding is lazily deferred and happens inside encode_queries: when .search() is called and there is no dedicated index backend, it falls back to a function called _full_corpus_search(). Hence all of the time is attributed to encode_queries.
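The effect described here can be reproduced with a toy example. `LazySearcher` below is a hypothetical class written for this illustration, not mteb code: its `index()` only stores a reference, and all encoding happens on `search()`, so timing the two calls separately attributes nearly everything to the search step.

```python
import time


class LazySearcher:
    """Toy model of deferred corpus encoding; not mteb code."""

    def index(self, corpus):
        # Only keeps a reference, so this returns almost immediately.
        self.corpus = corpus

    def search(self, queries):
        # Corpus and queries are both encoded here, on first use.
        corpus_vecs = [self._encode(doc) for doc in self.corpus]
        query_vecs = [self._encode(q) for q in queries]
        # Return the index of the closest corpus item for each query.
        return [
            min(range(len(corpus_vecs)), key=lambda i: abs(corpus_vecs[i] - qv))
            for qv in query_vecs
        ]

    @staticmethod
    def _encode(text):
        time.sleep(0.005)  # stand-in for a model forward pass
        return float(len(text))  # one-dimensional "embedding"


def timed(fn, *args):
    """Run fn(*args) and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start


searcher = LazySearcher()
corpus = ["a short doc", "a somewhat longer document", "mid-sized text"]
_, index_s = timed(searcher.index, corpus)
hits, search_s = timed(searcher.search, ["a short one!"])
print(f"index: {index_s:.4f}s, search: {search_s:.4f}s")
```

A timer wrapped around `index()` in this setup will always report close to zero, regardless of corpus size.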

@Samoed Samoed left a comment

Member


I'm not sure we need a TimingStack class. Why not simply return two additional values?

@KennethEnevoldsen KennethEnevoldsen left a comment

Contributor


A few fundamental things to consider:

  1. What would the result look like with multiple splits or subsets (e.g. for sib200)?
  2. How do we handle merging two results?

We of course also need some tests for this. While I don't think we need to implement it for all classes, we should definitely implement it in a few to ensure that the implementation works for more than just retrieval.

Comment thread mteb/_evaluators/retrieval_evaluator.py Outdated
Comment thread mteb/evaluate.py Outdated
@ayush1298

Collaborator Author

I'm not sure we need a TimingStack class. Why not simply return two additional values?

But those values would need to be returned in many places and across multiple layers, forcing us to change many functions unnecessarily. Having this class simplifies all of that.

Comment thread mteb/abstasks/abstask.py Outdated
@ayush1298

ayush1298 commented May 9, 2026

Collaborator Author

A few fundamental things to consider:

  1. What would the result look like with multiple splits or subsets (e.g. for sib200)?

Updated to handle splits/subsets. The results when run with the SIB200Classification task:

data_loading      |█████████████████████████████████████████████████ | 305.6s
dataset_transform |                                                 █| 1.7s
                   307.3s (untracked: 0.0s)
evaluation_phases: [
  {
    "name": "data_loading",
    "start": 5.7220458984375e-06,
    "end": 305.56462478637695
  },
  {
    "name": "dataset_transform",
    "start": 305.5646319389343,
    "end": 307.3017909526825
  }
]

It took so long because there are around 200 languages. Also, since it is a classification task, it doesn't have any other fields. Maybe we can add different fields depending on the task, if we want to go beyond retrieval.

I tried with MintakaRetrieval, which has multiple splits/subsets, and the results are as follows:

image
Detailed logs
evaluation_phases: [
  {
    "name": "data_loading",
    "start": 1.0013580322265625e-05,
    "end": 138.45878195762634
  },
  {
    "name": "dataset_transform",
    "start": 138.4587881565094,
    "end": 138.45878911018372
  },
  {
    "name": "encode_corpus",
    "start": 138.47474312782288,
    "end": 138.4747519493103,
    "split": "test",
    "subset": "ar"
  },
  {
    "name": "encode_queries",
    "start": 138.47476720809937,
    "end": 141.96742820739746,
    "split": "test",
    "subset": "ar"
  },
  {
    "name": "scoring",
    "start": 141.96751809120178,
    "end": 142.97935509681702,
    "split": "test",
    "subset": "ar"
  },
  {
    "name": "encode_corpus",
    "start": 143.0113651752472,
    "end": 143.01136684417725,
    "split": "test",
    "subset": "de"
  },
  {
    "name": "encode_queries",
    "start": 143.0113821029663,
    "end": 145.17311215400696,
    "split": "test",
    "subset": "de"
  },
  {
    "name": "scoring",
    "start": 145.17320919036865,
    "end": 146.25091409683228,
    "split": "test",
    "subset": "de"
  },
  {
    "name": "encode_corpus",
    "start": 146.29281520843506,
    "end": 146.29281616210938,
    "split": "test",
    "subset": "es"
  },
  {
    "name": "encode_queries",
    "start": 146.29282903671265,
    "end": 148.33434295654297,
    "split": "test",
    "subset": "es"
  },
  {
    "name": "scoring",
    "start": 148.3344271183014,
    "end": 149.4387810230255,
    "split": "test",
    "subset": "es"
  },
  {
    "name": "encode_corpus",
    "start": 149.4747588634491,
    "end": 149.474760055542,
    "split": "test",
    "subset": "fr"
  },
  {
    "name": "encode_queries",
    "start": 149.47476983070374,
    "end": 151.43966007232666,
    "split": "test",
    "subset": "fr"
  },
  {
    "name": "scoring",
    "start": 151.43973898887634,
    "end": 152.55419206619263,
    "split": "test",
    "subset": "fr"
  },
  {
    "name": "encode_corpus",
    "start": 152.58365201950073,
    "end": 152.58365321159363,
    "split": "test",
    "subset": "hi"
  },
  {
    "name": "encode_queries",
    "start": 152.583664894104,
    "end": 153.75605392456055,
    "split": "test",
    "subset": "hi"
  },
  {
    "name": "scoring",
    "start": 153.7561240196228,
    "end": 154.22829222679138,
    "split": "test",
    "subset": "hi"
  },
  {
    "name": "encode_corpus",
    "start": 154.2488510608673,
    "end": 154.24885201454163,
    "split": "test",
    "subset": "it"
  },
  {
    "name": "encode_queries",
    "start": 154.24886298179626,
    "end": 156.1714689731598,
    "split": "test",
    "subset": "it"
  },
  {
    "name": "scoring",
    "start": 156.17156100273132,
    "end": 157.3480260372162,
    "split": "test",
    "subset": "it"
  },
  {
    "name": "encode_corpus",
    "start": 157.3810420036316,
    "end": 157.3810429573059,
    "split": "test",
    "subset": "ja"
  },
  {
    "name": "encode_queries",
    "start": 157.38105702400208,
    "end": 159.3182179927826,
    "split": "test",
    "subset": "ja"
  },
  {
    "name": "scoring",
    "start": 159.3182990550995,
    "end": 160.3883352279663,
    "split": "test",
    "subset": "ja"
  },
  {
    "name": "encode_corpus",
    "start": 160.4202799797058,
    "end": 160.4202811717987,
    "split": "test",
    "subset": "pt"
  },
  {
    "name": "encode_queries",
    "start": 160.4202938079834,
    "end": 162.29019498825073,
    "split": "test",
    "subset": "pt"
  },
  {
    "name": "scoring",
    "start": 162.29027605056763,
    "end": 163.3557801246643,
    "split": "test",
    "subset": "pt"
  }
]
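Since each retrieval phase in the log above carries `split` and `subset` tags, per-subset totals can be computed by grouping on those tags. The following is a minimal sketch using the `ar` and `de` entries from the log; the grouping logic is an illustration written for this thread, not part of the PR.

```python
from collections import defaultdict

# A few entries copied from the MintakaRetrieval log above.
phases = [
    {"name": "data_loading", "start": 1.0013580322265625e-05, "end": 138.45878195762634},
    {"name": "encode_corpus", "start": 138.47474312782288, "end": 138.4747519493103,
     "split": "test", "subset": "ar"},
    {"name": "encode_queries", "start": 138.47476720809937, "end": 141.96742820739746,
     "split": "test", "subset": "ar"},
    {"name": "scoring", "start": 141.96751809120178, "end": 142.97935509681702,
     "split": "test", "subset": "ar"},
    {"name": "encode_corpus", "start": 143.0113651752472, "end": 143.01136684417725,
     "split": "test", "subset": "de"},
    {"name": "encode_queries", "start": 143.0113821029663, "end": 145.17311215400696,
     "split": "test", "subset": "de"},
    {"name": "scoring", "start": 145.17320919036865, "end": 146.25091409683228,
     "split": "test", "subset": "de"},
]

per_subset = defaultdict(float)
for p in phases:
    if "subset" not in p:
        continue  # task-level phases such as data_loading carry no tags
    per_subset[(p["split"], p["subset"])] += p["end"] - p["start"]

for (split, subset), seconds in sorted(per_subset.items()):
    print(f"{split}/{subset}: {seconds:.2f}s")
```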
  2. How do we handle merging two results?

I have not yet updated the merging of results in TaskResult to handle evaluation_phases. There are two cases:

  1. We have results from two different machines, but for different splits/subsets of the same task. This is easy to handle now that each phase is tagged with its split and subset.
  2. The problem occurs when the same split/subset is run on two different hardware setups and we get two different results. I am not sure what exactly to do here. Should we add a hardware field and merge based on it? For example, if the hardware is the same, we overwrite the old results; if it is different, we just add the new ones.
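The first case can be sketched as a union keyed by phase name, split, and subset. `merge_phases` is a hypothetical helper written for this discussion, not the merge logic in TaskResult. For the second case it simply lets the incoming entry win, which is only one possible policy. Note also that start/end offsets from different runs are relative to different origins, so only the durations remain comparable after such a merge.

```python
def merge_phases(existing, incoming):
    """Union two phase lists keyed by (name, split, subset).

    Hypothetical helper: on a key collision the incoming entry replaces
    the existing one.
    """

    def key(p):
        return (p["name"], p.get("split"), p.get("subset"))

    merged = {key(p): p for p in existing}
    merged.update({key(p): p for p in incoming})
    return list(merged.values())


# Illustrative values only; offsets are relative to each run's own start.
machine_a = [
    {"name": "scoring", "start": 3.5, "end": 4.5, "split": "test", "subset": "ar"},
]
machine_b = [
    {"name": "scoring", "start": 2.2, "end": 3.3, "split": "test", "subset": "de"},
]

merged = merge_phases(machine_a, machine_b)
print(sorted(p["subset"] for p in merged))

# Same split/subset from a second run: the incoming entry wins here.
rerun = [
    {"name": "scoring", "start": 0.0, "end": 2.0, "split": "test", "subset": "ar"},
]
overwritten = merge_phases(machine_a, rerun)
```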

We of course also need some tests for this. While I don't think we need to implement it for all classes, we should definitely implement it in a few to ensure that the implementation works for more than just retrieval.

I will add tests, and also add support for other classes, once we are sure that it works as expected for retrieval.

Comment thread mteb/evaluate.py
Comment thread mteb/evaluate.py
Comment thread mteb/evaluate.py Outdated
Comment thread mteb/evaluate.py Outdated
Comment thread mteb/abstasks/retrieval.py
Comment thread mteb/abstasks/abstask.py Outdated
Comment thread mteb/_evaluators/retrieval_evaluator.py Outdated
Comment thread mteb/_evaluators/retrieval_evaluator.py Outdated
@ayush1298

ayush1298 commented May 10, 2026

Collaborator Author

@KennethEnevoldsen Can you check these comments as well?
Then I can update the merging of results, and make any changes needed to the plot for multiple splits/subsets.
Also, do we have to extend it to classification or any other task type?

Comment thread mteb/abstasks/abstask.py
Comment thread mteb/evaluate.py
Comment thread mteb/evaluate.py
Comment thread mteb/timing.py Outdated
Comment thread mteb/timing.py Outdated
Comment thread mteb/_evaluators/retrieval_evaluator.py Outdated
@Samoed

Samoed commented May 11, 2026

Member

The problem occurs when the same split/subset is run on two different hardware setups and we get two different results. I am not sure what exactly to do here. Should we add a hardware field and merge based on it? For example, if the hardware is the same, we overwrite the old results; if it is different, we just add the new ones.

You can average the times.
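Averaging could look roughly like the sketch below, which groups phases from several runs by name, split, and subset and takes the mean duration. `average_durations` is a hypothetical helper and the numbers are made up for illustration. It averages durations rather than start/end offsets, since offsets from different runs are not directly comparable; whether the PR does the same is not stated in this thread.

```python
from collections import defaultdict
from statistics import mean


def average_durations(*runs):
    """Mean duration per (name, split, subset) across several runs.

    Hypothetical helper; each run is a list of phase dicts with
    "name", "start", "end" and optional "split"/"subset" keys.
    """
    buckets = defaultdict(list)
    for run in runs:
        for p in run:
            key = (p["name"], p.get("split"), p.get("subset"))
            buckets[key].append(p["end"] - p["start"])
    return {key: mean(values) for key, values in buckets.items()}


# Made-up timings for the same split/subset on two machines.
run_a = [{"name": "encode_queries", "start": 0.0, "end": 4.0, "split": "test", "subset": "ar"}]
run_b = [{"name": "encode_queries", "start": 10.0, "end": 16.0, "split": "test", "subset": "ar"}]

averaged = average_durations(run_a, run_b)
print(averaged[("encode_queries", "test", "ar")])
```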

Comment thread mteb/timing.py Outdated
Comment thread tests/test_evaluate.py Outdated
Comment thread mteb/_evaluators/zeroshot_classification_evaluator.py Outdated
@ayush1298

Collaborator Author

I have resolved @Samoed's comments and added this to the docs. @KennethEnevoldsen, could you review it once again?

Copilot AI left a comment

Contributor


Pull request overview

Copilot reviewed 36 out of 36 changed files in this pull request and generated 5 comments.

Comment thread mteb/evaluate.py
Comment thread mteb/results/task_result.py Outdated
Comment thread mteb/evaluate.py
Comment thread mteb/_evaluators/image/imagetext_pairclassification_evaluator.py Outdated
Comment thread tests/test_timing.py Outdated
@ayush1298

Collaborator Author

@KennethEnevoldsen this has been pending for a while. Can you review it?

@KennethEnevoldsen KennethEnevoldsen left a comment

Contributor


Only a few minor things; otherwise this is good.

Comment thread mteb/_evaluators/retrieval_evaluator.py Outdated
Comment thread docs/api/results.md Outdated
Comment thread docs/whats_new.md Outdated
Comment thread docs/whats_new.md Outdated
Comment thread mteb/_evaluators/retrieval_evaluator.py Outdated
@ayush1298

Collaborator Author

@KennethEnevoldsen Could you look at the comments above related to the plots you asked about? Apart from that, I think we are good to merge this.

@KennethEnevoldsen

Contributor

@ayush1298 there are still some cases in the docs that need to be updated

@ayush1298

ayush1298 commented Jun 9, 2026

Collaborator Author

@ayush1298 there are still some cases in the docs that need to be updated

@KennethEnevoldsen I have updated the docs with the new plot. Can you check whether there is anything else?

@KennethEnevoldsen KennethEnevoldsen left a comment

Contributor


I think we are golden! Thanks for taking the time

@KennethEnevoldsen KennethEnevoldsen merged commit 6a3e816 into embeddings-benchmark:main Jun 9, 2026
12 checks passed
@ayush1298 ayush1298 deleted the add_evaluation_runtime branch June 9, 2026 17:48

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add evaluation runtime for indexing and retrieval time

4 participants