feat: Add evaluation runtime for indexing and retrieval by ayush1298 · Pull Request #4639 · embeddings-benchmark/mteb

ayush1298 · 2026-05-08T17:56:21Z

This PR adds a TimingStack class that records the start and end times of the different indexing and retrieval phases, and provides a plot() method to plot the timings.

Copilot

Pull request overview

This PR introduces per-phase runtime tracking for evaluation (especially retrieval), and persists those phase timings into TaskResult so downstream consumers can inspect indexing/search/scoring time breakdowns.

Changes:

Added a new TimingStack utility (with a quick_plot() logger-based visualization) to record named phase start/end times.
Extended TaskResult with an evaluation_phases field and wired evaluation code to populate it.
Instrumented task data loading/transform and retrieval evaluation (indexing/search/scoring) to record phase timings.
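
As a rough illustration of what a phase timer of this kind might look like, here is a minimal sketch. The `PhaseTimer` class and its `phase()` method are hypothetical names chosen for this example and are not the `TimingStack` API added by this PR; the sketch only assumes that each phase is stored as a name plus start and end offsets, with optional tags.

```python
import time
from contextlib import contextmanager


class PhaseTimer:
    """Hypothetical stand-in for a phase timer (not the mteb implementation)."""

    def __init__(self):
        # All offsets are measured relative to the moment the timer is created.
        self._origin = time.perf_counter()
        self.phases = []

    @contextmanager
    def phase(self, name, **tags):
        start = time.perf_counter() - self._origin
        try:
            yield
        finally:
            # Record the phase even if the wrapped block raises.
            end = time.perf_counter() - self._origin
            self.phases.append({"name": name, "start": start, "end": end, **tags})


timer = PhaseTimer()
with timer.phase("data_loading"):
    time.sleep(0.01)  # placeholder for real work
with timer.phase("scoring", split="test", subset="ar"):
    time.sleep(0.01)  # placeholder for real work

for recorded in timer.phases:
    print(recorded["name"], round(recorded["end"] - recorded["start"], 3))
```

Using a context manager keeps the start and end bookkeeping in one place, so instrumented code only needs a `with` block around each phase.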

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 8 comments.

| File | Description |
| --- | --- |
| mteb/timing.py | Adds `TimingStack`/`TimingContext` and `quick_plot()` for recording and displaying phase timings. |
| mteb/results/task_result.py | Adds `evaluation_phases` field and passes it through constructors / historic conversion. |
| mteb/evaluate.py | Populates `TaskResult.evaluation_phases` from the task timer at the end of evaluation. |
| mteb/abstasks/retrieval.py | Wraps retrieval data loading, dataset transform, and scoring in timer phases; passes timer to evaluator. |
| mteb/abstasks/abstask.py | Instantiates a timer on all tasks and records `data_loading` / `dataset_transform` phases in `load_data()`. |
| mteb/_evaluators/retrieval_evaluator.py | Adds optional timer support to measure indexing/search phases inside the retrieval evaluator. |
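
For intuition about how a text timeline of phases can be drawn, here is a small sketch. `render_timeline` is a hypothetical helper written for this example and is not the PR's `quick_plot()` implementation. It assumes each phase is a dict with `name`, `start`, and `end`, and that at least one phase ends after time zero. The phase values are copied from the SciFact run reported in this thread.

```python
def render_timeline(phases, width=50):
    """Draw one text bar per phase, positioned by its start and end offsets."""
    total = max(p["end"] for p in phases)  # assumed to be greater than zero
    label_width = max(len(p["name"]) for p in phases)
    lines = []
    for p in phases:
        # Map offsets to columns; every phase gets at least one block.
        first = min(int(p["start"] / total * width), width - 1)
        last = max(first + 1, min(int(p["end"] / total * width), width))
        bar = " " * first + "█" * (last - first) + " " * (width - last)
        duration = p["end"] - p["start"]
        lines.append(f"{p['name']:<{label_width}} |{bar}| {duration:.1f}s")
    return lines


phases = [
    {"name": "data_loading", "start": 2.86102294921875e-06, "end": 17.240598917007446},
    {"name": "dataset_transform", "start": 17.240612983703613, "end": 17.24061393737793},
    {"name": "encode_corpus", "start": 17.2497079372406, "end": 17.249716997146606},
    {"name": "encode_queries", "start": 17.2497341632843, "end": 28.928431034088135},
    {"name": "scoring", "start": 28.92854905128479, "end": 29.086568117141724},
]

lines = render_timeline(phases)
print("\n".join(lines))
```

Giving very short phases a minimum width of one block keeps them visible, at the cost of slightly overstating their share of the timeline.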

ayush1298 · 2026-05-08T18:05:25Z

@KennethEnevoldsen @Samoed I tested this with the script below:

import json
import logging

import mteb
logging.basicConfig(level=logging.INFO, format="%(message)s")


def main():
    model = mteb.get_model("sentence-transformers/all-MiniLM-L6-v2")
    task = mteb.get_task("SciFact")
    results = mteb.evaluate(
        model, task, overwrite_strategy="always", encode_kwargs={"batch_size": 32}
    )
    task.timer.quick_plot()
    task_result = results.task_results[0]
    task_result_dict = task_result.to_dict()
    print("evaluation_time:", task_result_dict.get("evaluation_time"))

    phases = task_result_dict.get("evaluation_phases")
    if phases:
        print("evaluation_phases:", json.dumps(phases, indent=2))
    else:
        print("No evaluation_phases found!")


if __name__ == "__main__":
    main()

I got the following output:

data_loading      |█████████████████████████████                     | 17.2s
dataset_transform |                             █                    | 0.0s
encode_corpus     |                             █                    | 0.0s
encode_queries    |                             ████████████████████ | 11.7s
scoring           |                                                 █| 0.2s
                   29.1s (untracked: 0.0s)
evaluation_time: 11.84852123260498
evaluation_phases: [
  {
    "name": "data_loading",
    "start": 2.86102294921875e-06,
    "end": 17.240598917007446
  },
  {
    "name": "dataset_transform",
    "start": 17.240612983703613,
    "end": 17.24061393737793
  },
  {
    "name": "encode_corpus",
    "start": 17.2497079372406,
    "end": 17.249716997146606
  },
  {
    "name": "encode_queries",
    "start": 17.2497341632843,
    "end": 28.928431034088135
  },
  {
    "name": "scoring",
    "start": 28.92854905128479,
    "end": 29.086568117141724
  }
]

In both the plot and the raw timings, dataset_transform takes nearly zero time (a difference in the sixth or seventh decimal place), and encode_corpus also takes effectively zero time (about 9 microseconds).
The problem is that, for encoder models, corpus encoding does not happen in encode_corpus. When .index() is called, we simply store a reference to the dataset (self.task_corpus = corpus) without encoding any text, so the phase takes almost no time. Encoding is lazily deferred to encode_queries: when .search() is called and there is no dedicated index backend, it falls back to _full_corpus_search(), so all of the encoding time is attributed to encode_queries.
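
The effect can be reproduced with a toy model. The `LazySearch` class below is a hypothetical illustration, not mteb code: its `index()` only stores a reference, and the simulated encoding cost (a short `time.sleep`) is paid inside `search()`.

```python
import time


class LazySearch:
    """Toy model of lazily deferred corpus encoding (not mteb code)."""

    def index(self, corpus):
        # Only a reference is stored; nothing is encoded yet.
        self.task_corpus = corpus

    def search(self, queries):
        # The corpus is encoded here, together with the queries.
        corpus_vectors = [self._encode(doc) for doc in self.task_corpus]
        query_vectors = [self._encode(query) for query in queries]
        return len(corpus_vectors), len(query_vectors)

    def _encode(self, text):
        time.sleep(0.005)  # simulated encoding cost
        return [float(len(text))]


def timed(fn, *args):
    """Return the wall-clock seconds spent in fn(*args)."""
    start = time.perf_counter()
    fn(*args)
    return time.perf_counter() - start


backend = LazySearch()
index_seconds = timed(backend.index, ["doc a", "doc b", "doc c"])
search_seconds = timed(backend.search, ["query"])
print(f"index: {index_seconds:.6f}s, search: {search_seconds:.6f}s")
```

A timer wrapped around `index()` therefore reports close to zero, while the phase wrapped around `search()` absorbs the corpus encoding time.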

Samoed

I'm not sure we need a TimingStack class. Why not simply return two additional values?

KennethEnevoldsen

Ah, a few fundamental things to consider:

What would the result look like with multiple splits or subsets (e.g. for sib200)?
How do we handle merging two results?

We of course also need some tests for this. While I don't think we need to implement it for all classes, we should definitely implement it in a few to ensure that the implementation works for more than just retrieval.

ayush1298 · 2026-05-09T07:29:47Z

> I'm not sure we need a TimingStack class. Why not simply return two additional values?

But those values would need to be returned in many places and through multiple layers, forcing us to change many functions unnecessarily. Having this class simplifies all of that.
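
A minimal sketch of this trade-off, under assumed names: `PhaseTimer`, `maybe_phase`, `load_queries`, `search`, and `evaluate` are all hypothetical and only illustrate the pattern. Each layer accepts an optional timer and its return value stays unchanged, whereas returning extra timing values would require changing every signature along the call chain.

```python
import time
from contextlib import contextmanager, nullcontext


class PhaseTimer:
    """Hypothetical timer that records (name, seconds) pairs."""

    def __init__(self):
        self.phases = []

    @contextmanager
    def phase(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            self.phases.append((name, time.perf_counter() - start))


def maybe_phase(timer, name):
    # Fall back to a no-op context manager when no timer was passed.
    return timer.phase(name) if timer is not None else nullcontext()


def load_queries(timer=None):
    with maybe_phase(timer, "data_loading"):
        return ["first query", "second query"]


def search(queries, timer=None):
    with maybe_phase(timer, "search"):
        return [query.upper() for query in queries]


def evaluate(timer=None):
    # Return values are identical with or without a timer.
    queries = load_queries(timer=timer)
    return search(queries, timer=timer)


timer = PhaseTimer()
with_timer = evaluate(timer=timer)
without_timer = evaluate()
print(with_timer == without_timer, [name for name, _ in timer.phases])
```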

ayush1298 · 2026-05-09T11:57:44Z

> Ah, a few fundamental things to consider:
>
> What would the result look like with multiple splits or subsets (e.g. for sib200)?

Updated to handle splits/subsets. The results when run with the SIB200Classification task:

data_loading      |█████████████████████████████████████████████████ | 305.6s
dataset_transform |                                                 █| 1.7s
                   307.3s (untracked: 0.0s)
evaluation_phases: [
  {
    "name": "data_loading",
    "start": 5.7220458984375e-06,
    "end": 305.56462478637695
  },
  {
    "name": "dataset_transform",
    "start": 305.5646319389343,
    "end": 307.3017909526825
  }
]

It took this long because there are around 200 languages. Also, because it is a classification task, there are no other fields. Maybe we can add different fields depending on the task if we want to go beyond retrieval.

I tried with MintakaRetrieval which has multiple splits/subsets and the results are as follows:

Detailed logs

evaluation_phases: [
  {
    "name": "data_loading",
    "start": 1.0013580322265625e-05,
    "end": 138.45878195762634
  },
  {
    "name": "dataset_transform",
    "start": 138.4587881565094,
    "end": 138.45878911018372
  },
  {
    "name": "encode_corpus",
    "start": 138.47474312782288,
    "end": 138.4747519493103,
    "split": "test",
    "subset": "ar"
  },
  {
    "name": "encode_queries",
    "start": 138.47476720809937,
    "end": 141.96742820739746,
    "split": "test",
    "subset": "ar"
  },
  {
    "name": "scoring",
    "start": 141.96751809120178,
    "end": 142.97935509681702,
    "split": "test",
    "subset": "ar"
  },
  {
    "name": "encode_corpus",
    "start": 143.0113651752472,
    "end": 143.01136684417725,
    "split": "test",
    "subset": "de"
  },
  {
    "name": "encode_queries",
    "start": 143.0113821029663,
    "end": 145.17311215400696,
    "split": "test",
    "subset": "de"
  },
  {
    "name": "scoring",
    "start": 145.17320919036865,
    "end": 146.25091409683228,
    "split": "test",
    "subset": "de"
  },
  {
    "name": "encode_corpus",
    "start": 146.29281520843506,
    "end": 146.29281616210938,
    "split": "test",
    "subset": "es"
  },
  {
    "name": "encode_queries",
    "start": 146.29282903671265,
    "end": 148.33434295654297,
    "split": "test",
    "subset": "es"
  },
  {
    "name": "scoring",
    "start": 148.3344271183014,
    "end": 149.4387810230255,
    "split": "test",
    "subset": "es"
  },
  {
    "name": "encode_corpus",
    "start": 149.4747588634491,
    "end": 149.474760055542,
    "split": "test",
    "subset": "fr"
  },
  {
    "name": "encode_queries",
    "start": 149.47476983070374,
    "end": 151.43966007232666,
    "split": "test",
    "subset": "fr"
  },
  {
    "name": "scoring",
    "start": 151.43973898887634,
    "end": 152.55419206619263,
    "split": "test",
    "subset": "fr"
  },
  {
    "name": "encode_corpus",
    "start": 152.58365201950073,
    "end": 152.58365321159363,
    "split": "test",
    "subset": "hi"
  },
  {
    "name": "encode_queries",
    "start": 152.583664894104,
    "end": 153.75605392456055,
    "split": "test",
    "subset": "hi"
  },
  {
    "name": "scoring",
    "start": 153.7561240196228,
    "end": 154.22829222679138,
    "split": "test",
    "subset": "hi"
  },
  {
    "name": "encode_corpus",
    "start": 154.2488510608673,
    "end": 154.24885201454163,
    "split": "test",
    "subset": "it"
  },
  {
    "name": "encode_queries",
    "start": 154.24886298179626,
    "end": 156.1714689731598,
    "split": "test",
    "subset": "it"
  },
  {
    "name": "scoring",
    "start": 156.17156100273132,
    "end": 157.3480260372162,
    "split": "test",
    "subset": "it"
  },
  {
    "name": "encode_corpus",
    "start": 157.3810420036316,
    "end": 157.3810429573059,
    "split": "test",
    "subset": "ja"
  },
  {
    "name": "encode_queries",
    "start": 157.38105702400208,
    "end": 159.3182179927826,
    "split": "test",
    "subset": "ja"
  },
  {
    "name": "scoring",
    "start": 159.3182990550995,
    "end": 160.3883352279663,
    "split": "test",
    "subset": "ja"
  },
  {
    "name": "encode_corpus",
    "start": 160.4202799797058,
    "end": 160.4202811717987,
    "split": "test",
    "subset": "pt"
  },
  {
    "name": "encode_queries",
    "start": 160.4202938079834,
    "end": 162.29019498825073,
    "split": "test",
    "subset": "pt"
  },
  {
    "name": "scoring",
    "start": 162.29027605056763,
    "end": 163.3557801246643,
    "split": "test",
    "subset": "pt"
  }
]
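
To read a log like this more easily, the phases can be grouped by split and subset. The sketch below uses a hypothetical helper, `durations_by_subset`, and a few entries copied from the log above; it assumes that phases without `split`/`subset` keys are task-level and groups them under `(None, None)`.

```python
from collections import defaultdict


def durations_by_subset(phases):
    """Sum phase durations per (split, subset) pair and phase name."""
    totals = defaultdict(dict)
    for phase in phases:
        key = (phase.get("split"), phase.get("subset"))
        duration = phase["end"] - phase["start"]
        totals[key][phase["name"]] = totals[key].get(phase["name"], 0.0) + duration
    return dict(totals)


# A few entries copied from the MintakaRetrieval log above.
phases = [
    {"name": "data_loading", "start": 1.0013580322265625e-05, "end": 138.45878195762634},
    {"name": "encode_queries", "start": 138.47476720809937, "end": 141.96742820739746,
     "split": "test", "subset": "ar"},
    {"name": "scoring", "start": 141.96751809120178, "end": 142.97935509681702,
     "split": "test", "subset": "ar"},
    {"name": "encode_queries", "start": 143.0113821029663, "end": 145.17311215400696,
     "split": "test", "subset": "de"},
    {"name": "scoring", "start": 145.17320919036865, "end": 146.25091409683228,
     "split": "test", "subset": "de"},
]

summary = durations_by_subset(phases)
for key, per_phase in summary.items():
    print(key, {name: round(seconds, 2) for name, seconds in per_phase.items()})
```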

> How do we handle merging two results?

I have not yet updated the merging of results in TaskResult to handle evaluation_phases. There are two cases:

When we have results from two different machines for different splits/subsets of the same task. This is easy to handle, since each phase is now tagged with its split and subset.
When the same split/subset is run on two different hardware setups and we get two different results. I am not sure exactly what to do here. Should we add a hardware field and merge based on it? For example, if the hardware is the same, we overwrite the old results; if it is different, we simply add the new ones.

> We of course also need some tests for this. While I don't think we need to implement it for all classes, we should definitely implement it in a few to ensure that the implementation works for more than just retrieval.

I will add tests, and also extend this to other classes, once we are sure that it works as expected for retrieval.

ayush1298 · 2026-05-10T07:35:34Z

@KennethEnevoldsen Can you check this comment as well?
Then I can update the result merging and make any changes needed to the plot for multiple splits/subsets.
Also, do we need to extend it to classification or any other task?

Samoed · 2026-05-11T08:08:00Z

> When the same split/subset is run on two different hardware setups and we get two different results. I am not sure exactly what to do here. Should we add a hardware field and merge based on it? For example, if the hardware is the same, we overwrite the old results; if it is different, we simply add the new ones.

You can average the times.
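
One way to read this suggestion is sketched below. It is not the merging logic that was later added to the PR. The hypothetical helper `merge_durations` assumes that phases are matched on name, split, and subset, and that durations (rather than start/end offsets) are what gets averaged; phases present in only one run are kept as they are.

```python
from collections import defaultdict
from statistics import mean


def phase_key(phase):
    """Identify a phase by its name, split, and subset."""
    return (phase["name"], phase.get("split"), phase.get("subset"))


def merge_durations(*runs):
    """Average the duration of phases that share a key across runs."""
    samples = defaultdict(list)
    for run in runs:
        for phase in run:
            samples[phase_key(phase)].append(phase["end"] - phase["start"])
    return {key: mean(values) for key, values in samples.items()}


# Made-up timings for two runs of the same task (illustrative only).
run_a = [
    {"name": "scoring", "start": 0.0, "end": 2.0, "split": "test", "subset": "ar"},
]
run_b = [
    {"name": "scoring", "start": 0.0, "end": 4.0, "split": "test", "subset": "ar"},
    {"name": "scoring", "start": 4.0, "end": 5.0, "split": "test", "subset": "de"},
]

merged = merge_durations(run_a, run_b)
print(merged)
```

Averaging durations sidesteps the question of how to combine start and end offsets that were measured on different timelines.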

ayush1298 · 2026-05-28T13:08:36Z

I have resolved @Samoed's comments and added this to the docs. @KennethEnevoldsen, could you review it once again?

Copilot

Pull request overview

Copilot reviewed 36 out of 36 changed files in this pull request and generated 5 comments.

ayush1298 · 2026-06-03T11:17:50Z

@KennethEnevoldsen This has been pending for a while. Can you review it?

KennethEnevoldsen

Only a few minor things; otherwise this is good.

ayush1298 · 2026-06-08T09:58:01Z

@KennethEnevoldsen Could you look at the comments above that you raised about the plots? I think apart from that, we are good to merge this.

KennethEnevoldsen · 2026-06-08T12:46:13Z

@ayush1298 There are still some cases in the docs that need to be updated.

ayush1298 · 2026-06-09T04:11:36Z

> @ayush1298 There are still some cases in the docs that need to be updated.

@KennethEnevoldsen I have updated the docs with the new plot. Can you check whether there is anything else?

KennethEnevoldsen

I think we are golden! Thanks for taking the time.

Add evaluation runtime for indexing and retrieval

48a7262

Copilot AI review requested due to automatic review settings May 8, 2026 17:56

Copilot started reviewing on behalf of ayush1298 May 8, 2026 17:57

Copilot AI reviewed May 8, 2026

Samoed reviewed May 8, 2026

KennethEnevoldsen reviewed May 8, 2026

Comment thread mteb/_evaluators/retrieval_evaluator.py Outdated

Comment thread mteb/evaluate.py Outdated

change timer from maintaining state to passed as an attribute

3015fa1

Samoed reviewed May 9, 2026

Comment thread mteb/abstasks/abstask.py Outdated

ayush1298 added 3 commits May 9, 2026 15:26

add timer argument to load_data

198be1d

add timer argument to evaluate

4a6b026

removed timer from kwargs

8cdcbcd

ayush1298 added 6 commits May 9, 2026 17:57

update plots for split/subsets

7bd2f0b

fix tests

1e0e3b3

fix typing errors

109d0f6

typing errors

42fc5f0

change typing

915b154

correct typing

5435781

ayush1298 commented May 9, 2026

Comment thread mteb/evaluate.py

KennethEnevoldsen reviewed May 9, 2026

apply changes from review

a9f294e

update to handle overwritten load_data

175c36d

Samoed reviewed May 10, 2026

Comment thread mteb/abstasks/abstask.py

Comment thread mteb/evaluate.py

Comment thread mteb/evaluate.py

Comment thread mteb/timing.py Outdated

Comment thread mteb/timing.py Outdated

Comment thread mteb/_evaluators/retrieval_evaluator.py Outdated

changes from review

241276a

Samoed reviewed May 11, 2026

Comment thread mteb/timing.py Outdated

added evaluation phases merging logic and fix typecheck

4fe550c

fix import

638338b

Samoed reviewed May 28, 2026

Comment thread tests/test_evaluate.py Outdated

ayush1298 added 3 commits May 28, 2026 15:58

update tests

be66384

change Scoring to aggregate level in Classification task

9b602ac

remove unwanted file

64ec036

Samoed reviewed May 28, 2026

Comment thread mteb/_evaluators/zeroshot_classification_evaluator.py Outdated

ayush1298 added 3 commits May 28, 2026 16:45

Merge branch 'main' into add_evaluation_runtime

c3ec306

fix lintter and typecheck errors after merge

cdddf71

revert changes in other classification task

7f5dfbb

ayush1298 requested a review from KennethEnevoldsen May 28, 2026 13:07

Samoed requested a review from Copilot May 28, 2026 15:27

Copilot started reviewing on behalf of Samoed May 28, 2026 15:27

Copilot AI reviewed May 28, 2026

Comment thread mteb/evaluate.py

Comment thread mteb/results/task_result.py Outdated

Comment thread mteb/evaluate.py

Comment thread mteb/_evaluators/image/imagetext_pairclassification_evaluator.py Outdated

Comment thread tests/test_timing.py Outdated

changes from copilot review and add new test

72db5a2

Samoed approved these changes Jun 4, 2026

KennethEnevoldsen approved these changes Jun 5, 2026

Comment thread mteb/_evaluators/retrieval_evaluator.py Outdated

Comment thread docs/api/results.md Outdated

Comment thread docs/whats_new.md Outdated

Comment thread docs/whats_new.md Outdated

changes from review

88a03b5

Samoed reviewed Jun 5, 2026

Comment thread mteb/_evaluators/retrieval_evaluator.py Outdated

ayush1298 added 3 commits June 8, 2026 17:51

update condition

1cab9fc

Merge branch 'main' into add_evaluation_runtime

38fbccb

make lint

d6f97ab

updated docs

7f931de

KennethEnevoldsen approved these changes Jun 9, 2026

KennethEnevoldsen merged commit 6a3e816 into embeddings-benchmark:main Jun 9, 2026
12 checks passed

ayush1298 deleted the add_evaluation_runtime branch June 9, 2026 17:48
