
Sandbox Provider Leaderboard

Sandbox Benchmarks

A leaderboard of common benchmarks for each of our sandbox providers.

Last run: April 1, 2026

Performance Over Time

(Chart: composite score per provider over time.)

Detailed Metrics

| Provider    | Score | Median | P95    | P99    | Success |
|-------------|-------|--------|--------|--------|---------|
| Daytona     | 98.2  | 0.11s  | 0.28s  | 0.29s  | 100%    |
| E2B         | 93.8  | 0.44s  | 0.85s  | 0.99s  | 100%    |
| Blaxel      | 89.2  | 1.05s  | 1.12s  | 1.15s  | 100%    |
| Hopx        | 87.7  | 1.05s  | 1.42s  | 1.64s  | 100%    |
| Vercel      | 81.7  | 1.75s  | 1.93s  | 1.98s  | 100%    |
| Runloop     | 80.8  | 1.87s  | 1.99s  | 1.99s  | 100%    |
| Cloudflare  | 79.3  | 1.72s  | 2.48s  | 2.78s  | 100%    |
| Namespace   | 77.8  | 1.86s  | 2.43s  | 3.35s  | 100%    |
| CodeSandbox | 75.5  | 2.32s  | 2.60s  | 2.72s  | 100%    |
| Modal       | 43.2  | 2.66s  | 11.78s | 22.39s | 98%     |

Want to see a provider added?

Let us know on X

Methodology

What We Measure

Every benchmark measures Time to Interactive (TTI) — the elapsed time from calling compute.sandbox.create() to the first successful runCommand() inside the sandbox.

Each provider is tested with 100 iterations per run. Benchmarks run automatically via GitHub Actions on a recurring schedule. All results are committed to the public benchmarks repo.
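Taking one TTI sample amounts to timing the create-then-run round trip. In this sketch, the `compute.sandbox.create()` and `runCommand()` names come from the definition above, but the `measureTTI` helper and its generic `create` parameter are hypothetical, added for illustration:

```typescript
// One TTI sample: time from sandbox creation to first successful command.
interface Sandbox {
  runCommand(cmd: string): Promise<unknown>;
}

async function measureTTI(create: () => Promise<Sandbox>): Promise<number> {
  const start = performance.now();
  const sandbox = await create();         // e.g. compute.sandbox.create()
  await sandbox.runCommand("echo ready"); // first successful command inside the sandbox
  return performance.now() - start;       // elapsed milliseconds = one TTI sample
}
```

A full benchmark run would call this 100 times per provider and aggregate the samples into the median, P95, and P99 figures shown in the table.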

Sequential Test: Sandboxes are launched one at a time, waiting for each to become interactive before starting the next.

Staggered Test: Sandboxes are launched with 200ms delays between each.

Burst Test: All sandboxes are launched concurrently in a single burst.
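The three launch patterns above can be sketched with one dispatcher. The 200ms stagger comes from the description; `launchOne`, `runPattern`, and the rest of the harness are our own names, not the benchmark's actual code:

```typescript
const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

async function runPattern(
  launchOne: () => Promise<number>, // creates one sandbox, resolves with its TTI in ms
  n: number,
  mode: "sequential" | "staggered" | "burst"
): Promise<number[]> {
  if (mode === "sequential") {
    // Wait for each sandbox to become interactive before starting the next.
    const samples: number[] = [];
    for (let i = 0; i < n; i++) samples.push(await launchOne());
    return samples;
  }
  if (mode === "staggered") {
    // Fire each launch, pausing 200ms between starts, then await them all.
    const started: Promise<number>[] = [];
    for (let i = 0; i < n; i++) {
      started.push(launchOne());
      if (i < n - 1) await sleep(200);
    }
    return Promise.all(started);
  }
  // Burst: launch everything concurrently in a single wave.
  return Promise.all(Array.from({ length: n }, () => launchOne()));
}
```

The sequential mode isolates per-sandbox latency, while the staggered and burst modes surface how a provider behaves under ramping and spiky load.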

How We Score

The Composite Score is a weighted blend of timing metrics multiplied by the success rate. Each metric is scored against a fixed 10-second ceiling: 100 × (1 − value / 10,000ms), so a 200ms median scores 98 and anything ≥10s scores 0.

The weighted timing score is then multiplied by the success rate (0–1), so providers that fail frequently are penalized proportionally.

  • Median: 60% — primary signal for typical experience
  • P95: 25% — tail latency / consistency
  • P99: 15% — extreme tail latency
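Putting the ceiling formula and the weights together, the scoring can be sketched as follows. The constants are taken from the text; the function names are ours. Plugging in Daytona's row from the table (0.11s / 0.28s / 0.29s at 100% success) reproduces its published score:

```typescript
const CEILING_MS = 10_000;
const WEIGHTS = { median: 0.6, p95: 0.25, p99: 0.15 };

// 100 × (1 − value / 10,000ms), clamped at 0 for anything ≥ 10s.
function metricScore(ms: number): number {
  return Math.max(0, 100 * (1 - ms / CEILING_MS));
}

function compositeScore(
  timings: { median: number; p95: number; p99: number }, // milliseconds
  successRate: number // 0–1
): number {
  const weighted =
    WEIGHTS.median * metricScore(timings.median) +
    WEIGHTS.p95 * metricScore(timings.p95) +
    WEIGHTS.p99 * metricScore(timings.p99);
  return weighted * successRate; // frequent failures scale the score down
}

const daytona = compositeScore({ median: 110, p95: 280, p99: 290 }, 1.0);
console.log(daytona.toFixed(1)); // → 98.2, matching the leaderboard
```

Note how the 10-second ceiling explains Modal's score: its P95 and P99 both exceed 10s, so those terms score 0 and only the weighted median (further scaled by the 98% success rate) survives.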

Sandbox Benchmarks FAQs

Have another question? Email us.

What is a sandbox?

A sandbox is anywhere you can run code in isolation: a VM, bare metal, a container, or any other environment with compute resources.