Open Deep Research is an open-source agentic framework developed by Hugging Face that autonomously conducts web-based research. Built using the smolagents framework, it can browse the internet, extract and analyze information, and perform data manipulations to answer complex queries.
We optimized the Open Deep Research framework using our proposed SEW (Self-Evolving Workflow) optimizer, with a primary focus on improving the prompts within the framework. The performance of Open Deep Research with the original and optimized prompts on the full GAIA validation set is shown in the following figure:
Figure 1: Performance comparison between original and optimized Open Deep Research on the full GAIA validation set

The results indicate that our optimized prompts improve performance by 18.41% on average, with noticeable improvements on tasks from all three levels of the GAIA benchmark.
In our experiments, we leveraged the OpenAI o3 model to optimize the prompts and used `gpt-4o-mini` as the backbone model during evaluation. The total cost of this optimization process was approximately $45, with the majority (about $42) coming from running the model with `gpt-4o-mini` for validation. These results indicate that our optimization process is cost-effective and can achieve remarkable performance improvements.
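The cost figures above can be sanity-checked with simple arithmetic (all dollar amounts are taken directly from this report):

```python
# Cost breakdown of the optimization process (figures from this report).
total_cost = 45.0        # total cost: o3 prompt optimization + validation (USD)
validation_cost = 42.0   # gpt-4o-mini validation runs (USD)
optimizer_cost = total_cost - validation_cost
print(f"o3 prompt-optimization calls: ~${optimizer_cost:.0f}")  # ~$3

# Per-sample extrapolation for the original o1 backbone:
# $150 for 50 samples scales to the 165-example validation set.
o1_cost_per_sample = 150 / 50
o1_full_run = o1_cost_per_sample * 165
print(f"estimated o1 full-run cost: ~${o1_full_run:.0f}")       # ~$495
```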
We chose Open Deep Research because it is one of the few open-source, runnable frameworks on the GAIA leaderboard; most other submissions are either closed-source or lack runnable code. Alongside OWL, Open Deep Research offers a strong baseline for evaluating and improving web-based research agents. While OWL is optimized in a teammate's separate repository, this work focuses on optimizing Open Deep Research for the GAIA leaderboard.
Figure 2: GAIA Leaderboard showing Open Deep Research performance and ranking among other submissions

We made the following modifications to the original framework:
- **LLM Backbone**

  We replaced the original `o1` model (used in the leaderboard submission) with `gpt-4o-mini` in our experiments. The main reason is the extremely high token consumption of this framework: in our preliminary tests, running just 50 samples with `o1` incurred about $150 in API fees, so running the full validation set of 165 examples would cost approximately $495, making it impractical for iterative optimization. To reduce cost while preserving reasonable performance, we used the more cost-effective `gpt-4o-mini`. Even with this smaller model, running the full validation set still costs around $55, highlighting the inherently high token consumption of the Open Deep Research framework.

- **Optimized Prompts**

  We optimized the prompts within the Open Deep Research framework using our proposed `SEWOptimizer`. In our experiments, we randomly sampled 25 questions from the GAIA validation set and used them as a validation subset for optimization. The optimized prompts can be found in the `src/smolagents/prompts` folder:

  - `code_agent_4o_mini_optimized.yaml`
  - `toolcalling_agent_4o_mini_optimized.yaml`

- **Easier Run with the `--optimized` Flag**

  We modified the script `run_gaia.py` in the `examples/open_deep_research` folder to include the `--optimized` argument, which allows users to switch between the original and optimized prompts effortlessly. You can follow the instructions under the `examples/open_deep_research` folder to run the framework.

- **Evaluation Script**

  We added a new script, `evaluate.py`, in the `examples/open_deep_research` folder to facilitate the evaluation of model outputs.

- **Evaluation Results**

  To facilitate the evaluation, we provide the results of the original and optimized prompts on the full GAIA validation set in the `output/validation` folder:

  - results with original prompts: `gpt-4o-mini_results.jsonl`
  - results with optimized prompts: `gpt-4o-mini_optimized_results.jsonl`

  These files can be used directly for quick comparison and to reproduce the reported results.
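The `--optimized` switch could be wired up roughly as follows. This is a hypothetical sketch, not the actual `run_gaia.py` implementation: the default prompt file name `code_agent.yaml` and the `select_prompts` helper are assumptions, while the optimized file name and the CLI flags match those documented above.

```python
import argparse

# Assumed default prompt file name; the optimized name comes from
# src/smolagents/prompts as listed above.
ORIGINAL_PROMPTS = "code_agent.yaml"
OPTIMIZED_PROMPTS = "code_agent_4o_mini_optimized.yaml"

def parse_args(argv=None):
    parser = argparse.ArgumentParser(description="Run GAIA evaluation")
    parser.add_argument("--model-id", default="gpt-4o-mini")
    parser.add_argument("--concurrency", type=int, default=10)
    parser.add_argument("--run-name", required=True)
    parser.add_argument("--optimized", action="store_true",
                        help="use the SEW-optimized prompts instead of the originals")
    return parser.parse_args(argv)

def select_prompts(args):
    # Pick the prompt file based on the flag; hypothetical helper.
    return OPTIMIZED_PROMPTS if args.optimized else ORIGINAL_PROMPTS
```

With this wiring, `python run_gaia.py --run-name demo --optimized` would load the optimized YAML, and omitting the flag falls back to the original prompts.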
Follow the instructions in the `examples/open_deep_research` folder to set up the environment. Then create a `.env` file with the following content:
```
OPENAI_API_KEY=your_openai_api_key
SERPER_API_KEY=your_serper_api_key
```
Run the following command to reproduce the original Open Deep Research implementation on the validation set of the GAIA benchmark:

```bash
python run_gaia.py --concurrency 10 --model-id gpt-4o-mini --run-name gpt-4o-mini_results
```

Run the following command to reproduce our optimized implementation on the validation set of the GAIA benchmark:
```bash
cd examples/open_deep_research
python run_gaia.py --concurrency 10 --model-id gpt-4o-mini --run-name gpt-4o-mini_optimized_results --optimized
```

Run the following command to evaluate the performance:
```bash
python evaluate.py --output_file /path/to/your/results.jsonl
```
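The evaluation logic can be sketched roughly as follows. This is a minimal illustration of the kind of exact-match scoring such a script typically performs over a results JSONL file, not the actual `evaluate.py`; the field names `prediction` and `true_answer` are assumptions about the output schema.

```python
import json

def score_results(path):
    """Exact-match accuracy over a results JSONL file.

    Hypothetical sketch: field names "prediction" and "true_answer"
    are assumptions, not the actual schema written by run_gaia.py.
    """
    total = correct = 0
    with open(path) as f:
        for line in f:
            record = json.loads(line)
            total += 1
            pred = str(record["prediction"]).strip().lower()
            gold = str(record["true_answer"]).strip().lower()
            if pred == gold:
                correct += 1
    return correct / total if total else 0.0
```

Pointing such a function at `gpt-4o-mini_results.jsonl` and `gpt-4o-mini_optimized_results.jsonl` gives a quick side-by-side accuracy comparison.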
