feat: Add evaluation runtime for indexing and retrieval#4639
feat: Add evaluation runtime for indexing and retrieval#4639KennethEnevoldsen merged 67 commits into
Conversation
There was a problem hiding this comment.
Pull request overview
This PR introduces per-phase runtime tracking for evaluation (especially retrieval), and persists those phase timings into TaskResult so downstream consumers can inspect indexing/search/scoring time breakdowns.
Changes:
- Added a new
TimingStackutility (with aquick_plot()logger-based visualization) to record named phase start/end times. - Extended
TaskResultwith anevaluation_phasesfield and wired evaluation code to populate it. - Instrumented task data loading/transform and retrieval evaluation (indexing/search/scoring) to record phase timings.
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 8 comments.
Show a summary per file
| File | Description |
|---|---|
| mteb/timing.py | Adds TimingStack/TimingContext and quick_plot() for recording and displaying phase timings. |
| mteb/results/task_result.py | Adds evaluation_phases field and passes it through constructors / historic conversion. |
| mteb/evaluate.py | Populates TaskResult.evaluation_phases from the task timer at the end of evaluation. |
| mteb/abstasks/retrieval.py | Wraps retrieval data loading, dataset transform, and scoring in timer phases; passes timer to evaluator. |
| mteb/abstasks/abstask.py | Instantiates a timer on all tasks and records data_loading / dataset_transform phases in load_data(). |
| mteb/_evaluators/retrieval_evaluator.py | Adds optional timer support to measure indexing/search phases inside the retrieval evaluator. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
|
@KennethEnevoldsen @Samoed When I tested these with the below script: I got below output: So here, if you saw the plot or even timings, then |
Samoed
left a comment
There was a problem hiding this comment.
I'm not sure if we need such TimingStack class. Why not simply returning 2 additional values?
KennethEnevoldsen
left a comment
There was a problem hiding this comment.
Ah few fundamental things to consider
- How would the result look like with multiple splits or subsets? (e.g. for sib200)
- How do we handle merging of two results?
We of course also need some sort of tests for this, while I don't think we need to implement it for all classes we should def. implement it in a few to ensure that the implementation work for not just retrieval.
But those values need to be returned at many places and multiple layers, causing us to unnecessarily change many functions. Having these class, just simplify all those things |
|
@KennethEnevoldsen Can you check these comment also. |
You can average time |
|
I have resolved @Samoed comments, added to docs, @KennethEnevoldsen Could you review it once again |
|
@KennethEnevoldsen this is pending for while, can you review it? |
KennethEnevoldsen
left a comment
There was a problem hiding this comment.
Only a few minor things otherwise this is good
|
@KennethEnevoldsen Could you look at the above comments that you asked related to plots. I think apart from that, we are good to merge this |
|
@ayush1298 there is still some cases in docs that needs to be updated |
@KennethEnevoldsen I have updated docs with new plot. Can you check is there anything else? |
KennethEnevoldsen
left a comment
There was a problem hiding this comment.
I think we are golden! Thanks for taking the time

closes #4177
This PR add
TimingStackclass to calculate start and end time of different phases of indexing and retrieval and also provide a methodplot()to plot timings.