Python Benchmark

Open-source Python projects categorized as Benchmark

Top 23 Python Benchmark Projects

  1. fashion-mnist

    A MNIST-like fashion product database. Benchmark :point_down:

  2. InfluxDB

    InfluxDB – Database Purpose-Built for High-Resolution Data. Turn time series data into real-time intelligence. Manage high-volume, high-velocity data without sacrificing performance.

  3. memU

    Memory for 24/7 proactive agents like openclaw (moltbot, clawdbot).

    Project mention: Show HN: We built a cheaper, simpler 24/7 proactive agent than moltbot | news.ycombinator.com | 2026-01-28

    https://github.com/NevaMind-AI/memU

    Happy to hear thoughts from people building agents or infra in this space.

  4. mmpose

    OpenMMLab Pose Estimation Toolbox and Benchmark.

  5. opencompass

    OpenCompass is an LLM evaluation platform supporting a wide range of models (Llama3, Mistral, InternLM2, GPT-4, LLaMA2, Qwen, GLM, Claude, etc.) across 100+ datasets.

  6. ann-benchmarks

    Benchmarks of approximate nearest neighbor libraries in Python

    Project mention: Quark Platform: SOTA Vector Search with 99.0% Recall 10 | news.ycombinator.com | 2025-06-10
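
    Recall@k, the metric such vector-search claims refer to, is the fraction of the true k nearest neighbors that the approximate index actually returns. A minimal sketch (the function name and toy IDs are illustrative, not from ann-benchmarks):

```python
def recall_at_k(approx_ids, exact_ids, k):
    """Fraction of the true top-k neighbors returned by the approximate index."""
    return len(set(approx_ids[:k]) & set(exact_ids[:k])) / k

# Hypothetical result sets for one query: the approximate index
# missed one of the ten true nearest neighbors.
exact = [3, 7, 1, 9, 4, 0, 8, 2, 6, 5]
approx = [3, 7, 1, 9, 4, 0, 8, 2, 6, 11]
print(recall_at_k(approx, exact, 10))  # 0.9
```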
  7. mmaction2

    OpenMMLab's Next Generation Video Understanding Toolbox and Benchmark

  8. SWE-bench

    SWE-bench: Can Language Models Resolve Real-world GitHub Issues?

    Project mention: MiniMax M2.5 is beating Claude Opus 4.6 and MiniMax is 17x-20x cheaper | news.ycombinator.com | 2026-03-02
  9. SaaSHub

    SaaSHub - Software Alternatives and Reviews. SaaSHub helps you find the best software and product alternatives.

  10. Baichuan2

    A series of large language models developed by Baichuan Intelligent Technology

  11. mteb

    MTEB: Massive Text Embedding Benchmark

    Project mention: 💡 What's new in txtai 9.0 | dev.to | 2025-08-28

    Late interaction models encode data into multi-vector outputs. In other words, multiple input tokens map to multiple output vectors. Then at search time, the maximum similarity algorithm is used to find the best matches between the corpus and a query. This algorithm has achieved excellent results on retrieval benchmarks such as MTEB.
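
    The maximum-similarity (MaxSim) scoring described above can be sketched in a few lines: each query token vector takes its best match among the document token vectors, and the per-token maxima are summed. The vectors below are toy stand-ins for real token embeddings:

```python
import numpy as np

def maxsim_score(query_vecs, doc_vecs):
    """Late-interaction (MaxSim) score: for each query token vector, take its
    maximum dot-product similarity over all document token vectors, then sum."""
    sims = query_vecs @ doc_vecs.T          # (n_query_tokens, n_doc_tokens)
    return sims.max(axis=1).sum()

# Toy 2-D unit vectors standing in for token embeddings.
query = np.array([[1.0, 0.0], [0.0, 1.0]])
doc_a = np.array([[1.0, 0.0], [0.7, 0.7]])
doc_b = np.array([[0.0, 1.0], [-1.0, 0.0]])
print(maxsim_score(query, doc_a))  # ~1.7: both query tokens find a good match
print(maxsim_score(query, doc_b))  # 1.0: only one query token matches
```

    Because each query token is matched independently, a document scores well only if it covers all parts of the query, which is where late interaction gains over single-vector retrieval.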

  12. Baichuan-13B

    A 13B large language model developed by Baichuan Intelligent Technology

  13. promptbench

    A unified evaluation framework for large language models

  14. OSWorld

    [NeurIPS 2024] OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments

    Project mention: Claude Opus 4.6 | news.ycombinator.com | 2026-02-05

    Some of Opus 4.6's standout results for me:

    * GDPVal Elo: 1606 vs. GPT-5.2's 1462. OpenAI reported that GPT-5.2 has a 70.9% win-or-tie rate against human professionals. (https://openai.com/index/gdpval/) We can estimate Opus 4.6's win-or-tie rate against human pros at 80–88%.

    * OSWorld: 72.7%, matching human performance at ~72.4% (https://os-world.github.io/). Since the human subjects were CS students, they were likely at least as competent as the average knowledge worker. The original OSWorld benchmark is somewhat noisy, but even if the model remains slightly inferior to humans, it is only a matter of time before it catches up or surpasses them.

    * BrowseComp: At 84%, it is approaching human intersubject agreement of ~86% (https://openai.com/index/browsecomp/).

    Taken together, this suggests that digital knowledge work will be transformed quite soon, possibly drastically if agent reliability improves beyond a certain threshold.
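
    The commenter's 80–88% figure is their own extrapolation, but the head-to-head part of the Elo arithmetic follows the standard Elo expected-score formula, which can be checked directly (ratings taken from the comment above):

```python
def elo_expected(r_a, r_b):
    """Expected score of player A against player B under the standard Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

# Opus 4.6 (1606) vs. GPT-5.2 (1462) per the reported GDPVal ratings:
print(round(elo_expected(1606, 1462), 3))  # 0.696
```

    A 144-point gap thus implies roughly a 70% expected score head-to-head; translating that into a win-or-tie rate against human professionals additionally relies on the 70.9% baseline the commenter cites.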

  15. InternVideo

    [ECCV2024] Video Foundation Models & Data for Multimodal Understanding

  16. beir

    A Heterogeneous Benchmark for Information Retrieval. Easy to use: evaluate your models across 15+ diverse IR datasets.

    Project mention: Gemini Embedding: Powering RAG and context engineering | news.ycombinator.com | 2025-07-31

    It's always worth checking out the MTEB leaderboard: https://huggingface.co/spaces/mteb/leaderboard

    There are some good open models there that have longer context limits and fewer dimensions.

    The benchmarks are just a guide. It's best to build a test dataset with your own data. This is a good example of that: https://github.com/beir-cellar/beir/wiki/Load-your-custom-da...

    Another benefit of having your own test dataset is that it can grow as your data grows. And you can quickly test new models to see how they perform with YOUR data.

  17. logparser

    A machine learning toolkit for log parsing [ICSE'19, DSN'16]

  18. fastRAG

    Efficient Retrieval Augmentation and Generation Framework

    Project mention: 10 Open Source AI Tools Every Developer Should Know | dev.to | 2025-07-28

    FastRAG is a minimal, no-frills solution to build retrieval-augmented generation (RAG) pipelines locally. It requires zero external infrastructure—no Pinecone or LangChain—letting you set up document-based Q&A in minutes.

  19. evalplus

    Rigorous evaluation of LLM-synthesized code - NeurIPS 2023 & COLM 2024

  20. inference

    Reference implementations of MLPerf® inference benchmarks (by mlcommons)

  21. VBench

    [CVPR2024 Highlight] VBench - We Evaluate Video Generation

  22. py-motmetrics

    :bar_chart: Benchmark multiple object trackers (MOT) in Python

  23. pytest-benchmark

    pytest fixture for benchmarking code

    Project mention: This Week In Python | dev.to | 2025-05-23

    pytest-benchmark – pytest fixture for benchmarking code

  24. BEHAVIOR-1K

    BEHAVIOR-1K: a platform for accelerating Embodied AI research. Join our Discord for support: https://discord.gg/bccR5vGFEx

    Project mention: Open Source Journey | dev.to | 2025-11-01

    Communication Over Confidence. Project: BEHAVIOR-1K. My first contribution taught me the most fundamental lesson of open source. I spent a full 3 days just setting up the project and understanding the codebase. When I finally identified the issue, I faced a dilemma: there was a line of code that seemed very important, but that I had to remove to fix the issue. The function returned False if it found anything other than True in a list, but there was also an `assert all(...), "child_values has NoneTypes"` line checking for NoneType values. Should I remove it or keep it?

    Instead of making assumptions, I created a Pull Request with a [WIP] tag to open a conversation with the reviewers. This turned out to be the right call. In open source, especially as a newcomer, communication is the golden key. Nobody expects you to be perfect, but they do expect you to be thoughtful. Don't be afraid to ask questions. Maintainers would much rather answer your questions than deal with a poor PR.

  25. Awesome-System2-Reasoning-LLM

    Latest Advances on System-2 Reasoning

NOTE: The open source projects on this list are ordered by number of GitHub stars. The number of mentions indicates repo mentions in the last 12 months or since we started tracking (Dec 2020).


Python Benchmark related posts

  • MiniMax M2.5 is beating Claude Opus 4.6 and MiniMax is 17x-20x cheaper

    1 project | news.ycombinator.com | 2 Mar 2026
  • I don't know how you get here from "predict the next word."

    2 projects | news.ycombinator.com | 26 Feb 2026
  • I beat Grok 4 on ARC-AGI-2 using a CPU-only symbolic engine (18.1% score)

    3 projects | news.ycombinator.com | 24 Feb 2026
  • Claude Sonnet 4.6 System Card

    7 projects | news.ycombinator.com | 17 Feb 2026
  • Ed's JavaLand 2025 Session Picks

    1 project | dev.to | 16 Feb 2026
  • Show HN: Mcpbr – does your MCP help? Test it on SWE-bench and 25 evals

    1 project | news.ycombinator.com | 1 Feb 2026
  • OTelBench: AI struggles with simple SRE tasks (Opus 4.5 scores only 29%)

    4 projects | news.ycombinator.com | 29 Jan 2026

Index

What are some of the best open-source Benchmark projects in Python? This list will help you:

# Project Stars
1 fashion-mnist 12,657
2 memU 12,642
3 mmpose 7,401
4 opencompass 6,744
5 ann-benchmarks 5,609
6 mmaction2 4,917
7 SWE-bench 4,441
8 Baichuan2 4,120
9 mteb 3,156
10 Baichuan-13B 2,947
11 promptbench 2,784
12 OSWorld 2,645
13 InternVideo 2,208
14 beir 2,097
15 logparser 1,939
16 fastRAG 1,756
17 evalplus 1,696
18 inference 1,537
19 VBench 1,522
20 py-motmetrics 1,479
21 pytest-benchmark 1,415
22 BEHAVIOR-1K 1,350
23 Awesome-System2-Reasoning-LLM 1,334


Did you know that Python is the 2nd most popular programming language based on number of references?