MLS-Bench: A Holistic and Rigorous Assessment of AI Systems on Building Better AI

MLS-Bench evaluates whether AI systems can invent generalizable and scalable ML methods. It spans 140 tasks across 12 domains, including language models, vision and generation, reinforcement learning, robotics, ML systems, AI for science, optimization, time series, and causal reasoning.

Why this benchmark

Methods that stood the test of time and scale.

Modern AI progress is built on a small set of reusable ideas — convolutions, residual connections, attention, normalization — that generalize across architectures and survive every order-of-magnitude jump in scale.

Convolution

Weight-shared receptive fields that scaled vision models.

Word2Vec

Distributed embeddings transferable across NLP tasks.

GAN

Adversarial generator–discriminator game for sample synthesis.

Adam

Adaptive moment estimation that became the default optimizer.

U-Net

Encoder–decoder with skip links — vision and diffusion staple.

PPO

Clipped policy ratio that made deep RL stable to scale.

RMSNorm

Mean-free normalization, faster and surprisingly sufficient.

RoPE

Rotary position encoding that scales with context length.

FlashAttention

IO-aware exact attention that scaled context length.

LSTM

Gated recurrence enabling long-range sequence learning.

Dropout

Random unit masking that became the standard regularizer.

BatchNorm

Normalizing activations across the batch to stabilize training.

ResNet

Residual connections enabling 100+ layer training.

Transformer

Self-attention as the universal sequence operator.

Mixup

Input–label interpolation that improved generalization.

DDPM

Denoising diffusion: learn to invert a noise process.

LoRA

Low-rank adapters for parameter-efficient finetuning.

MLS-Bench tests whether AI agents can invent the next ones.

Each task isolates a well-defined research question and asks the agent to propose a single modular improvement — a new loss, an attention variant, a sampler, a routing rule — then measures whether the change transfers across models, datasets, and seeds.
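As a rough illustration of what "transfers across models, datasets, and seeds" could mean in practice, the sketch below averages a candidate method's gain over seeds for each (model, dataset) setting and reports the fraction of settings that improve. This is not MLS-Bench's actual scoring code: the function name `transfer_summary`, the score dictionaries, and the "fraction of settings improved" summary are all hypothetical choices made for the example.

```python
from collections import defaultdict
from statistics import mean


def transfer_summary(baseline, candidate):
    """Summarize how consistently a candidate beats a baseline.

    Hypothetical helper, not part of MLS-Bench. Both arguments map
    (model, dataset, seed) -> score, where higher is assumed to be better.
    """
    gains = defaultdict(list)
    for (model, dataset, seed), base_score in baseline.items():
        # Pair each baseline run with the candidate run in the same setting.
        gains[(model, dataset)].append(candidate[(model, dataset, seed)] - base_score)

    # Average over seeds so one lucky seed does not count as transfer.
    mean_gain = {setting: mean(diffs) for setting, diffs in gains.items()}
    improved = sum(1 for gain in mean_gain.values() if gain > 0)
    return {
        "mean_gain": mean_gain,
        "improved_fraction": improved / len(mean_gain),
    }


# Made-up scores for two settings with two seeds each.
baseline = {
    ("small", "A", 0): 0.50, ("small", "A", 1): 0.52,
    ("large", "B", 0): 0.70, ("large", "B", 1): 0.68,
}
candidate = {
    ("small", "A", 0): 0.55, ("small", "A", 1): 0.56,
    ("large", "B", 0): 0.69, ("large", "B", 1): 0.68,
}
summary = transfer_summary(baseline, candidate)
print(summary["improved_fraction"])
```

In this toy data the change helps the small setting but not the large one, which is the kind of non-transferring improvement that a multi-setting evaluation is meant to expose.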

Quantization-Aware Language-Model Training

Attention Cache Structural Reduction

Trajectory Optimization for Model-Based Planning

3D Scene Densification Strategy

Diffusion-Prior Inverse Solver

Value-Based Visual Control

Gradient Compression for Distributed Training

Homophily-Heterophily Graph Filter

Score-Based Black-Box Linf Attack

Autoregressive Embedding Strategy

Efficient Diffusion Sampling for Robot Actions

Fixed-Budget Diffusion Sampler Updates

Atmospheric Column Emulator Architecture

Backbone-to-Sequence Inverse Folding

Constraint Handling for Safe RL

Evolutionary Operators for Continuous Black-Box Optimization

Discrete Causal Graph Discovery

Fused Causal Attention Kernel

140 executable tasks across 12 domains, each built around a targeted ML component, a controlled edit surface, and multi-setting evidence for transfer.

Leaderboard

Score on the official 30-task MLS-Bench-Lite subset.

#	Model	Harness	Performance
1	Claude Opus 4.8 [Closed]	Claude Code (max effort) [Closed]	42.8
2	GPT-5.5 [Closed]	Codex (xhigh) [Open]	35.5
3	Kimi K2.7 Code [Open]	Kimi-Code [Open]	35.1
4	Kimi K2.6 [Open]	Kimi-Code [Open]	26.7

Results are from the Kimi K2.7 Code model card, and we have verified them. The evaluation runs on Harbor with a 5-hour exploration budget for each agent. This is not the native harness used for our main paper results, but evaluating this way is highly encouraged.

Model Performance by Category

Each model's bar shows Vanilla as the darker lower portion and Agent as the lighter overlay, against a translucent grey Human SOTA reference computed from the reproduced human baselines. Scores use the paper's normalized task metric.

Models: Claude Opus 4.6, GPT-5.4, Gemini 3.1 Pro, DeepSeek-V3.2, Qwen 3.6 Plus

Legend: Vanilla (darker), Agent (lighter), Human SOTA

Task Categories

140 tasks across 12 flat categories. Open a category to browse its tasks.

Language Models: 18 tasks
Robotics: 12 tasks
Vision & Generation: 11 tasks
Reinforcement Learning: 13 tasks
ML Systems & Efficient ML: 10 tasks
AI for Science: 10 tasks
Optimization & Theory: 13 tasks
Classical & Adaptive Learning: 14 tasks
Deep Learning: 11 tasks
Time Series & Forecasting: 10 tasks
Structured & Causal Reasoning: 10 tasks
Trustworthy Learning: 8 tasks