Open Deep Research is an open-source agentic framework developed by Hugging Face that autonomously conducts web-based research. Built using the smolagents framework, it can browse the internet, extract and analyze information, and perform data manipulations to answer complex queries.
We optimized the Open Deep Research framework using our proposed SEW (Self-Evolving Workflow) optimizer, with a primary focus on improving the prompts within the framework. The performance of Open Deep Research with the original and optimized prompts on the full GAIA validation set is shown in the following figure:
Figure 1: Performance comparison between original and optimized Open Deep Research on the full GAIA validation set

The results indicate that our optimized prompts improve performance by 18.41% on average, with noticeable improvements on tasks from all three levels of the GAIA benchmark.
In our experiments, we leveraged the OpenAI o3 model to optimize the prompts and used `gpt-4o-mini` as the backbone model during evaluation. The total cost of this optimization process was approximately $45, with the majority (about $42) coming from running the model with `gpt-4o-mini` for validation. These results indicate that our optimization process is cost-effective and can achieve remarkable performance improvements.
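The cost figures above can be sanity-checked with simple arithmetic (all dollar amounts are taken directly from this report):

```python
# Cost breakdown of the optimization process (figures from this report).
total_cost = 45.0        # total cost: o3 prompt optimization + validation (USD)
validation_cost = 42.0   # gpt-4o-mini validation runs (USD)
optimizer_cost = total_cost - validation_cost
print(f"o3 prompt-optimization calls: ~${optimizer_cost:.0f}")  # ~$3

# Per-sample extrapolation for the original o1 backbone:
# $150 for 50 samples scales to the 165-example validation set.
o1_cost_per_sample = 150 / 50
o1_full_run = o1_cost_per_sample * 165
print(f"estimated o1 full-run cost: ~${o1_full_run:.0f}")       # ~$495
```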
We chose Open Deep Research because it is one of the few open-source, runnable frameworks on the GAIA leaderboard; most other submissions are either closed-source or lack runnable code. Alongside OWL, Open Deep Research offers a strong baseline for evaluating and improving web-based research agents. While OWL is optimized in a teammate's separate repository, this work focuses on optimizing Open Deep Research for the GAIA leaderboard.
Figure 2: GAIA Leaderboard showing Open Deep Research performance and ranking among other submissions

We made the following modifications to the original framework:
- **LLM Backbone**

  We replaced the original `o1` model (used in the leaderboard submission) with `gpt-4o-mini` in our experiments. The main reason is the extremely high token consumption of this framework: in our preliminary tests, running just 50 samples with `o1` incurred about $150 in API fees, so running the full validation set of 165 examples would cost approximately $495, making it impractical for iterative optimization. To reduce cost while preserving reasonable performance, we used the more cost-effective `gpt-4o-mini`. Even with this smaller model, running the full validation set still costs around $55, highlighting the inherently high token consumption of the Open Deep Research framework.

- **Optimized Prompts**

  We optimized the prompts within the Open Deep Research framework using our proposed `SEWOptimizer`. In our experiments, we randomly sampled 25 questions from the GAIA validation set and used them as a validation subset for optimization. The optimized prompts can be found in the `src/smolagents/prompts` folder:

  - `code_agent_4o_mini_optimized.yaml`
  - `toolcalling_agent_4o_mini_optimized.yaml`

- **Easier Run with the `--optimized` Flag**

  We modified the script `run_gaia.py` in the `examples/open_deep_research` folder to include the `--optimized` argument, which allows users to switch between the original and optimized prompts effortlessly. You can follow the instructions under the `examples/open_deep_research` folder to run the framework.

- **Evaluation Script**

  We added a new script, `evaluate.py`, in the `examples/open_deep_research` folder to facilitate the evaluation of model outputs.

- **Evaluation Results**

  To facilitate the evaluation, we provide the results of the original and optimized prompts on the full GAIA validation set in the `output/validation` folder:

  - results with original prompts: `gpt-4o-mini_results.jsonl`
  - results with optimized prompts: `gpt-4o-mini_optimized_results.jsonl`

  These files can be used directly for quick comparison and to reproduce the reported results.
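The `--optimized` switch could be wired up roughly as follows. This is a hypothetical sketch, not the actual `run_gaia.py` implementation: the default prompt file name `code_agent.yaml` and the `select_prompts` helper are assumptions, while the optimized file name and the CLI flags match those documented above.

```python
import argparse

# Assumed default prompt file name; the optimized name comes from
# src/smolagents/prompts as listed above.
ORIGINAL_PROMPTS = "code_agent.yaml"
OPTIMIZED_PROMPTS = "code_agent_4o_mini_optimized.yaml"

def parse_args(argv=None):
    parser = argparse.ArgumentParser(description="Run GAIA evaluation")
    parser.add_argument("--model-id", default="gpt-4o-mini")
    parser.add_argument("--concurrency", type=int, default=10)
    parser.add_argument("--run-name", required=True)
    parser.add_argument("--optimized", action="store_true",
                        help="use the SEW-optimized prompts instead of the originals")
    return parser.parse_args(argv)

def select_prompts(args):
    # Pick the prompt file based on the flag; hypothetical helper.
    return OPTIMIZED_PROMPTS if args.optimized else ORIGINAL_PROMPTS
```

With this wiring, `python run_gaia.py --run-name demo --optimized` would load the optimized YAML, and omitting the flag falls back to the original prompts.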
Follow the instructions in the `examples/open_deep_research` folder to set up the environment. Then create a `.env` file with the following content:
```
OPENAI_API_KEY=your_openai_api_key
SERPER_API_KEY=your_serper_api_key
```
Run the following command to reproduce the original Open Deep Research implementation on the validation set of the GAIA benchmark:

```bash
python run_gaia.py --concurrency 10 --model-id gpt-4o-mini --run-name gpt-4o-mini_results
```

Run the following command to reproduce our optimized implementation on the validation set of the GAIA benchmark:
```bash
cd examples/open_deep_research
python run_gaia.py --concurrency 10 --model-id gpt-4o-mini --run-name gpt-4o-mini_optimized_results --optimized
```

Run the following command to evaluate the performance:
```bash
python evaluate.py --output_file /path/to/your/results.jsonl
```
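The evaluation logic can be sketched roughly as follows. This is a minimal illustration of the kind of exact-match scoring such a script typically performs over a results JSONL file, not the actual `evaluate.py`; the field names `prediction` and `true_answer` are assumptions about the output schema.

```python
import json

def score_results(path):
    """Exact-match accuracy over a results JSONL file.

    Hypothetical sketch: field names "prediction" and "true_answer"
    are assumptions, not the actual schema written by run_gaia.py.
    """
    total = correct = 0
    with open(path) as f:
        for line in f:
            record = json.loads(line)
            total += 1
            pred = str(record["prediction"]).strip().lower()
            gold = str(record["true_answer"]).strip().lower()
            if pred == gold:
                correct += 1
    return correct / total if total else 0.0
```

Pointing such a function at `gpt-4o-mini_results.jsonl` and `gpt-4o-mini_optimized_results.jsonl` gives a quick side-by-side accuracy comparison.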
