SaaSHub helps you find the best software and product alternatives Learn more →
Top 23 Python Benchmark Projects
Project mention: Show HN: We built a cheaper, simpler 24/7 proactive agent than moltbot | news.ycombinator.com | 2026-01-28
https://github.com/NevaMind-AI/memU
Happy to hear thoughts from people building agents or infra in this space.
opencompass
OpenCompass is an LLM evaluation platform, supporting a wide range of models (Llama3, Mistral, InternLM2, GPT-4, LLaMA2, Qwen, GLM, Claude, etc.) over 100+ datasets.
Project mention: Quark Platform: SOTA Vector Search with 99.0% Recall@10 | news.ycombinator.com | 2025-06-10
Project mention: MiniMax M2.5 is beating Claude Opus 4.6 and MiniMax is 17x-20x cheaper | news.ycombinator.com | 2026-03-02
Late interaction models encode data into multi-vector outputs. In other words, multiple input tokens map to multiple output vectors. Then at search time, the maximum similarity algorithm is used to find the best matches between the corpus and a query. This algorithm has achieved excellent results on retrieval benchmarks such as MTEB.
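The maximum-similarity (MaxSim) step described above can be sketched in a few lines of NumPy. This is an illustrative, self-contained toy (random vectors, hypothetical shapes), not any particular library's implementation: each query token vector takes its best match over all document token vectors, and the per-token maxima are summed into the document score.

```python
import numpy as np

def maxsim_score(query_vecs: np.ndarray, doc_vecs: np.ndarray) -> float:
    """Late-interaction MaxSim: for each query token vector, take the
    maximum cosine similarity over all document token vectors, then
    sum across query tokens."""
    # Normalize rows so dot products are cosine similarities.
    q = query_vecs / np.linalg.norm(query_vecs, axis=1, keepdims=True)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    sim = q @ d.T  # shape: (num_query_tokens, num_doc_tokens)
    return float(sim.max(axis=1).sum())

# Toy example: 3 query token vectors, two documents with 5 and 4 token
# vectors; rank the documents by their MaxSim score.
rng = np.random.default_rng(0)
query = rng.normal(size=(3, 8))
docs = [rng.normal(size=(5, 8)), rng.normal(size=(4, 8))]
scores = [maxsim_score(query, d) for d in docs]
best = int(np.argmax(scores))
```

Because every query token is matched independently, a document only needs to cover each query token somewhere in its own token vectors to score well, which is what distinguishes late interaction from single-vector dense retrieval.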
OSWorld
[NeurIPS 2024] OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments
Some of Opus 4.6's standout results for me:
* GDPVal Elo: 1606 vs. GPT-5.2's 1462. OpenAI reported that GPT-5.2 has a 70.9% win-or-tie rate against human professionals. (https://openai.com/index/gdpval/) We can estimate Opus 4.6's win-or-tie rate against human pros at 80–88%.
* OSWorld: 72.7%, matching human performance at ~72.4% (https://os-world.github.io/). Since the human subjects were CS students, they were likely at least as competent as the average knowledge worker. The original OSWorld benchmark is somewhat noisy, but even if the model remains slightly inferior to humans, it is only a matter of time before it catches up or surpasses them.
* BrowseComp: At 84%, it is approaching human intersubject agreement of ~86% (https://openai.com/index/browsecomp/).
Taken together, this suggests that digital knowledge work will be transformed quite soon, possibly drastically if agent reliability improves beyond a certain threshold.
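The head-to-head expectation implied by the two GDPVal Elo ratings quoted above can be computed with the standard Elo expected-score formula. This is a back-of-the-envelope sketch under the assumption that GDPVal Elo behaves like conventional Elo; the benchmark's actual rating methodology may differ.

```python
def elo_expected_score(rating_a: float, rating_b: float) -> float:
    """Standard Elo expected score of A against B:
    E_A = 1 / (1 + 10 ** ((R_B - R_A) / 400))."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))

# Expectation implied by the reported ratings: 1606 (Opus 4.6)
# vs. 1462 (GPT-5.2) -- roughly a 70% expected score for Opus.
opus_vs_gpt = elo_expected_score(1606, 1462)
```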
beir
A Heterogeneous Benchmark for Information Retrieval. Easy to use: evaluate your models across 15+ diverse IR datasets.
Project mention: Gemini Embedding: Powering RAG and context engineering | news.ycombinator.com | 2025-07-31
It's always worth checking out the MTEB leaderboard: https://huggingface.co/spaces/mteb/leaderboard
There are some good open models there that have longer context limits and fewer dimensions.
The benchmarks are just a guide. It's best to build a test dataset with your own data. This is a good example of that: https://github.com/beir-cellar/beir/wiki/Load-your-custom-da...
Another benefit of having your own test dataset is that it can grow as your data grows, and you can quickly test new models to see how they perform with YOUR data.
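Evaluating on your own data can be as simple as pairing a retriever with hand-written relevance judgments and a recall@k metric. The sketch below is a minimal, dependency-free illustration (the corpus, queries, and word-overlap "retriever" are all toy stand-ins; in practice you would swap in a real embedding model and a BEIR-style loader for your dataset).

```python
def recall_at_k(results, qrels, k=3):
    """Fraction of relevant docs retrieved in the top k, averaged over
    queries. `results`: query id -> ranked list of doc ids;
    `qrels`: query id -> set of relevant doc ids."""
    scores = []
    for qid, relevant in qrels.items():
        top_k = set(results.get(qid, [])[:k])
        scores.append(len(top_k & relevant) / len(relevant))
    return sum(scores) / len(scores)

# Tiny hand-built test set (illustrative only).
corpus = {
    "d1": "python benchmark suite for web frameworks",
    "d2": "cooking recipes for pasta",
    "d3": "measuring python function runtime",
}
queries = {"q1": "python performance benchmark"}
qrels = {"q1": {"d1", "d3"}}

def rank(query, corpus):
    """Stand-in retriever: rank docs by word overlap with the query."""
    q_words = set(query.split())
    return sorted(corpus, key=lambda d: -len(q_words & set(corpus[d].split())))

results = {qid: rank(q, corpus) for qid, q in queries.items()}
print(recall_at_k(results, qrels, k=2))  # 1.0: both relevant docs outrank d2
```

As the comment in the thread suggests, adding a few new queries and judgments each time your data changes keeps the benchmark honest without any extra infrastructure.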
fastRAG
FastRAG is a minimal, no-frills solution to build retrieval-augmented generation (RAG) pipelines locally. It requires zero external infrastructure (no Pinecone or LangChain), letting you set up document-based Q&A in minutes.
pytest-benchmark – pytest fixture for benchmarking code
BEHAVIOR-1K
BEHAVIOR-1K: a platform for accelerating Embodied AI research. Join our Discord for support: https://discord.gg/bccR5vGFEx
Communication Over Confidence
Project: BEHAVIOR-1K
My first contribution taught me the most fundamental lesson of open source. I spent a full three days just setting up the project and understanding the codebase. When I finally identified the issue, I faced a dilemma: there was a line of code that seemed important but had to be removed to fix the issue. The function returned False if it found anything other than True in a list, but there was also an assert all(...) line, with the message child_values has NoneTypes, checking for NoneType values. Should I remove it or keep it?
Instead of making assumptions, I created a Pull Request with a [WIP] tag to open a conversation with the reviewers. This turned out to be the right call. In open source, especially as a newcomer, communication is the golden key. Nobody expects you to be perfect, but they do expect you to be thoughtful. Don't be afraid to ask questions; maintainers would much rather answer them than deal with a poor PR.
Python Benchmark related posts
- MiniMax M2.5 is beating Claude Opus 4.6 and MiniMax is 17x-20x cheaper
- I don't know how you get here from "predict the next word."
- I beat Grok 4 on ARC-AGI-2 using a CPU-only symbolic engine (18.1% score)
- Claude Sonnet 4.6 System Card
- Ed's JavaLand 2025 Session Picks
- Show HN: Mcpbr – does your MCP help? Test it on SWE-bench and 25 evals
- OTelBench: AI struggles with simple SRE tasks (Opus 4.5 scores only 29%)
A note from our sponsor - SaaSHub
www.saashub.com | 11 Mar 2026
Index
What are some of the best open-source Benchmark projects in Python? This list, ranked by GitHub stars, will help you find them:
| # | Project | Stars |
|---|---|---|
| 1 | fashion-mnist | 12,657 |
| 2 | memU | 12,642 |
| 3 | mmpose | 7,401 |
| 4 | opencompass | 6,744 |
| 5 | ann-benchmarks | 5,609 |
| 6 | mmaction2 | 4,917 |
| 7 | SWE-bench | 4,441 |
| 8 | Baichuan2 | 4,120 |
| 9 | mteb | 3,156 |
| 10 | Baichuan-13B | 2,947 |
| 11 | promptbench | 2,784 |
| 12 | OSWorld | 2,645 |
| 13 | InternVideo | 2,208 |
| 14 | beir | 2,097 |
| 15 | logparser | 1,939 |
| 16 | fastRAG | 1,756 |
| 17 | evalplus | 1,696 |
| 18 | inference | 1,537 |
| 19 | VBench | 1,522 |
| 20 | py-motmetrics | 1,479 |
| 21 | pytest-benchmark | 1,415 |
| 22 | BEHAVIOR-1K | 1,350 |
| 23 | Awesome-System2-Reasoning-LLM | 1,334 |