MLS-Bench

MLS-Bench: A Holistic and Rigorous Assessment of AI Systems on Building Better AI

MLS-Bench evaluates whether AI systems can invent generalizable and scalable ML methods. It spans 140 tasks across 12 domains, including language models, vision and generation, reinforcement learning, robotics, ML systems, AI for science, optimization, time series, and causal reasoning.

UC Berkeley, Princeton University, Tsinghua University, Purdue University, University of Washington, Harvard University, University of Pennsylvania, Shanghai Jiao Tong University, UC San Diego, Carnegie Mellon University

Why this benchmark

Methods that stood the test of time and scale.

Modern AI progress is built on a small set of reusable ideas — convolutions, residual connections, attention, normalization — that generalize across architectures and survive every order-of-magnitude jump in scale.

Convolution

Weight-shared receptive fields that scaled vision models.

Word2Vec

Distributed embeddings transferable across NLP tasks.

GAN

Adversarial generator–discriminator game for sample synthesis.

Adam

Adaptive moment estimation that became the default optimizer.

U-Net

Encoder–decoder with skip links — vision and diffusion staple.

PPO

Clipped policy ratio that made deep RL stable enough to scale.

RMSNorm

Mean-free normalization, faster and surprisingly sufficient.

RoPE

Rotary position encoding that scales with context length.

FlashAttention

IO-aware exact attention that scaled context length.

LSTM

Gated recurrence enabling long-range sequence learning.

Dropout

Random unit masking that became the standard regularizer.

BatchNorm

Normalizing activations across the batch to stabilize training.

ResNet

Residual connections enabling 100+ layer training.

Transformer

Self-attention as the universal sequence operator.

Mixup

Input–label interpolation that improved generalization.

DDPM

Denoising diffusion: learn to invert a noise process.

LoRA

Low-rank adapters for parameter-efficient finetuning.

MLS-Bench tests whether AI agents can invent the next ones.

Each task isolates a well-defined research question and asks the agent to propose a single modular improvement — a new loss, an attention variant, a sampler, a routing rule — then measures whether the change transfers across models, datasets, and seeds.

140 executable tasks across 12 domains, each built around a targeted ML component, a controlled edit surface, and multi-setting evidence for transfer.
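
As a rough illustration of what "transfers across models, datasets, and seeds" could mean in practice, the sketch below averages scores over seeds and then checks, for each (model, dataset) pair, whether a candidate method beats the baseline. This is not the benchmark's actual scoring code: the record layout, the function name transfer_summary, and the example numbers are all hypothetical, and higher scores are assumed to be better.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, Tuple

# Hypothetical record layout (an assumption, not the MLS-Bench format):
# (model, dataset, seed) -> score, where higher is better.
Scores = Dict[Tuple[str, str, int], float]


def transfer_summary(baseline: Scores, candidate: Scores):
    """Average over seeds, then compare candidate vs. baseline per (model, dataset)."""
    cells = defaultdict(lambda: {"base": [], "cand": []})
    for (model, dataset, _seed), score in baseline.items():
        cells[(model, dataset)]["base"].append(score)
    for (model, dataset, _seed), score in candidate.items():
        cells[(model, dataset)]["cand"].append(score)

    # Seed-averaged gain for every setting that has both kinds of runs.
    gains = {
        cell: mean(s["cand"]) - mean(s["base"])
        for cell, s in cells.items()
        if s["base"] and s["cand"]
    }
    if not gains:
        raise ValueError("no (model, dataset) setting has both baseline and candidate runs")

    # Fraction of settings in which the candidate improves on the baseline.
    win_rate = sum(g > 0 for g in gains.values()) / len(gains)
    return gains, win_rate


# Made-up numbers, for illustration only.
baseline = {
    ("small", "A", 0): 0.70, ("small", "A", 1): 0.72,
    ("large", "B", 0): 0.80, ("large", "B", 1): 0.82,
}
candidate = {
    ("small", "A", 0): 0.74, ("small", "A", 1): 0.76,
    ("large", "B", 0): 0.79, ("large", "B", 1): 0.81,
}

gains, win_rate = transfer_summary(baseline, candidate)
print(gains, win_rate)
```

Averaging over seeds before comparing keeps a single lucky run from counting as a win. A real evaluation would likely also report variance across seeds rather than a bare win rate.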

Leaderboard

Score on the official 30-task MLS-Bench-Lite subset.

# | Model (access) | Harness (access) | Performance
1 | Claude Opus 4.8 (Closed) | Claude Code, max effort (Closed) | 42.8
2 | GPT-5.5 (Closed) | Codex, xhigh (Open) | 35.5
3 | Kimi K2.7 Code (Open) | Kimi-Code (Open) | 35.1
4 | Kimi K2.6 (Open) | Kimi-Code (Open) | 26.7

Results are taken from the Kimi K2.7 Code model card, and we have verified them. The evaluation runs on Harbor with a 5-hour exploration budget for each agent. This is not the native harness used for the main results in our paper, but evaluating with Harbor is highly encouraged.

Model Performance by Category

Each model's bar shows Vanilla as the darker lower portion and Agent as the lighter overlay, against a translucent grey Human SOTA reference computed from the reproduced human baselines. Scores use the paper's normalized task metric.

Models shown: Claude Opus 4.6, GPT-5.4, Gemini 3.1 Pro, DeepSeek-V3.2, Qwen 3.6 Plus
Legend: Vanilla (darker), Agent (lighter), Human SOTA