
[Evals] Standardize Evals in Next.js #90883

Merged
gaojude merged 2 commits into canary from evals-in-repo
Mar 6, 2026

Conversation


@gaojude gaojude commented Mar 4, 2026

Fixtures now live next to the code they test, like e2e. pnpm eval <name> packs the local next build, generates baseline + agents-md experiment configs on the fly, and runs both in a sandbox. The agents-md variant drops an AGENTS.md that points the agent at the bundled docs in node_modules/next/dist/docs/ — comparing the two variants tells you whether shipping a doc actually changes agent behavior.

run-evals.js mirrors run-tests.js: pack once, pass the tarball path to the child via NEXT_EVAL_TARBALL env, forward flags. We only pack next, not the whole workspace — the sandbox is remote Linux, so a local @next/swc darwin binary wouldn't run there anyway; the sandbox downloads the right one at runtime. The experiment config uses sandbox: 'auto', which picks Vercel sandboxes when credentials are present and falls back to local Docker otherwise, so external contributors can run the same evals with just Docker + ANTHROPIC_API_KEY.

experiments/ is generated fresh each run and gitignored so we don't maintain N committed config files that differ by one line. Fixture code is excluded from eslint since it's deliberately imperfect code for agents to fix, and EVAL.ts uses vitest rather than jest. Fixture package.json files use "next": "^16" rather than a pinned canary so agents reading package.json to infer capabilities aren't misled by a stale version string; the tarball install overlays it regardless.
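A fixture `package.json` following that convention might look like this — the fixture name and React version ranges are illustrative; only the `"next": "^16"` range comes from this PR:

```json
{
  "name": "example-eval-fixture",
  "private": true,
  "dependencies": {
    "next": "^16",
    "react": "^19.0.0",
    "react-dom": "^19.0.0"
  }
}
```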

next-evals-oss stays as the full benchmark runner for nextjs.org/evals; it'll pull fixtures from here instead of keeping its own copy.

@nextjs-bot nextjs-bot added the created-by: Next.js team PRs by the Next.js team. label Mar 4, 2026

nextjs-bot commented Mar 4, 2026

Failing test suites

Commit: 72af8ff | About building and testing Next.js

pnpm test-dev test/development/app-dir/server-components-hmr-cache/server-components-hmr-cache.test.ts (job)

  • server-components-hmr-cache > should support reading from an infinite streaming fetch (DD)

```
● server-components-hmr-cache › should support reading from an infinite streaming fetch

thrown: "Exceeded timeout of 10000 ms for a test.
Add a timeout value to this test to increase the timeout, if this is a long-running test. See https://jestjs.io/docs/api#testname-fn-timeout."

  224 |   })
  225 |
> 226 |   it('should support reading from an infinite streaming fetch', async () => {
      |   ^
  227 |     const browser = await next.browser('/infinite-stream')
  228 |     const text = await browser.elementByCss('p').text()
  229 |     expect(text).toBe('data: chunk-1')

  at it (development/app-dir/server-components-hmr-cache/server-components-hmr-cache.test.ts:226:3)
  at Object.describe (development/app-dir/server-components-hmr-cache/server-components-hmr-cache.test.ts:6:1)
```


nextjs-bot commented Mar 4, 2026

Stats from current PR

✅ No significant changes detected

📊 All Metrics
📖 Metrics Glossary

Dev Server Metrics:

  • Listen = TCP port starts accepting connections
  • First Request = HTTP server returns successful response
  • Cold = Fresh build (no cache)
  • Warm = With cached build artifacts

Build Metrics:

  • Fresh = Clean build (no .next directory)
  • Cached = With existing .next directory

Change Thresholds:

  • Time: Changes < 50ms AND < 10%, OR < 2% are insignificant
  • Size: Changes < 1KB AND < 1% are insignificant
  • All other changes are flagged to catch regressions

⚡ Dev Server

| Metric | Canary | PR | Change | Trend |
| --- | --- | --- | --- | --- |
| Cold (Listen) | 456ms | 456ms | | ▁▁▁▁▁ |
| Cold (Ready in log) | 457ms | 455ms | | ▁▁▁▁▁ |
| Cold (First Request) | 1.019s | 1.003s | | ▂▂▁▁▂ |
| Warm (Listen) | 456ms | 456ms | | ▁▁▁▁▁ |
| Warm (Ready in log) | 454ms | 457ms | | ▁▁▁▁▁ |
| Warm (First Request) | 380ms | 382ms | | ▁▁▁▁▁ |
📦 Dev Server (Webpack) (Legacy)

| Metric | Canary | PR | Change | Trend |
| --- | --- | --- | --- | --- |
| Cold (Listen) | 506ms | 507ms | | ▁▁▁▁▁ |
| Cold (Ready in log) | 461ms | 463ms | | ▆▆▅▅▁ |
| Cold (First Request) | 1.976s | 1.966s | | ▄▃▃▇▁ |
| Warm (Listen) | 505ms | 505ms | | ▁▁▁▁▁ |
| Warm (Ready in log) | 461ms | 460ms | | ▅▄▅▅▁ |
| Warm (First Request) | 1.994s | 2.061s | | ▄▄▄▇▁ |

⚡ Production Builds

| Metric | Canary | PR | Change | Trend |
| --- | --- | --- | --- | --- |
| Fresh Build | 4.393s | 4.402s | | ▁▁▁▁▁ |
| Cached Build | 4.446s | 4.421s | | ▁▁▁▁▁ |
📦 Production Builds (Webpack) (Legacy)

| Metric | Canary | PR | Change | Trend |
| --- | --- | --- | --- | --- |
| Fresh Build | 14.477s | 14.598s | | ▁▁▂▅▁ |
| Cached Build | 14.599s | 14.704s | | ▁▁▁▅▁ |
| node_modules Size | 477 MB | 477 MB | | ▁▁▁▁▁ |
📦 Bundle Sizes


⚡ Turbopack

Client

Main Bundles: **402 kB** → **402 kB** ⚠️ +31 B

80 files with content-based hashes (individual files not comparable between builds)

Server

Middleware

| | Canary | PR | Change |
| --- | --- | --- | --- |
| middleware-b..fest.js gzip | 766 B | 767 B | |
| Total | 766 B | 767 B | ⚠️ +1 B |

Build Details

Build Manifests

| | Canary | PR | Change |
| --- | --- | --- | --- |
| _buildManifest.js gzip | 446 B | 450 B | |
| Total | 446 B | 450 B | ⚠️ +4 B |

📦 Webpack

Client

Main Bundles

| | Canary | PR | Change |
| --- | --- | --- | --- |
| 5528-HASH.js gzip | 5.54 kB | N/A | - |
| 6280-HASH.js gzip | 59.4 kB | N/A | - |
| 6335.HASH.js gzip | 169 B | N/A | - |
| 912-HASH.js gzip | 4.59 kB | N/A | - |
| e8aec2e4-HASH.js gzip | 62.6 kB | N/A | - |
| framework-HASH.js gzip | 59.7 kB | 59.7 kB | |
| main-app-HASH.js gzip | 255 B | 254 B | |
| main-HASH.js gzip | 39.1 kB | 39.1 kB | |
| webpack-HASH.js gzip | 1.68 kB | 1.68 kB | |
| 262-HASH.js gzip | N/A | 4.59 kB | - |
| 2889.HASH.js gzip | N/A | 169 B | - |
| 5602-HASH.js gzip | N/A | 5.55 kB | - |
| 6948ada0-HASH.js gzip | N/A | 62.6 kB | - |
| 9544-HASH.js gzip | N/A | 60.2 kB | - |
| Total | 233 kB | 234 kB | ⚠️ +734 B |

Polyfills

| | Canary | PR | Change |
| --- | --- | --- | --- |
| polyfills-HASH.js gzip | 39.4 kB | 39.4 kB | |
| Total | 39.4 kB | 39.4 kB | |

Pages

| | Canary | PR | Change |
| --- | --- | --- | --- |
| _app-HASH.js gzip | 194 B | 194 B | |
| _error-HASH.js gzip | 183 B | 180 B | 🟢 3 B (-2%) |
| css-HASH.js gzip | 331 B | 330 B | |
| dynamic-HASH.js gzip | 1.81 kB | 1.81 kB | |
| edge-ssr-HASH.js gzip | 256 B | 256 B | |
| head-HASH.js gzip | 351 B | 352 B | |
| hooks-HASH.js gzip | 384 B | 383 B | |
| image-HASH.js gzip | 580 B | 581 B | |
| index-HASH.js gzip | 260 B | 260 B | |
| link-HASH.js gzip | 2.51 kB | 2.51 kB | |
| routerDirect..HASH.js gzip | 320 B | 319 B | |
| script-HASH.js gzip | 386 B | 386 B | |
| withRouter-HASH.js gzip | 315 B | 315 B | |
| 1afbb74e6ecf..834.css gzip | 106 B | 106 B | |
| Total | 7.98 kB | 7.98 kB | ✅ -1 B |

Server

Edge SSR

| | Canary | PR | Change |
| --- | --- | --- | --- |
| edge-ssr.js gzip | 125 kB | 125 kB | |
| page.js gzip | 256 kB | 256 kB | |
| Total | 380 kB | 381 kB | ⚠️ +905 B |

Middleware

| | Canary | PR | Change |
| --- | --- | --- | --- |
| middleware-b..fest.js gzip | 618 B | 617 B | |
| middleware-r..fest.js gzip | 156 B | 155 B | |
| middleware.js gzip | 43.6 kB | 43.9 kB | |
| edge-runtime..pack.js gzip | 842 B | 842 B | |
| Total | 45.2 kB | 45.5 kB | ⚠️ +300 B |

Build Details

Build Manifests

| | Canary | PR | Change |
| --- | --- | --- | --- |
| _buildManifest.js gzip | 715 B | 718 B | |
| Total | 715 B | 718 B | ⚠️ +3 B |

Build Cache

| | Canary | PR | Change |
| --- | --- | --- | --- |
| 0.pack gzip | 4.07 MB | 4.07 MB | |
| index.pack gzip | 103 kB | 102 kB | |
| index.pack.old gzip | 103 kB | 103 kB | |
| Total | 4.27 MB | 4.28 MB | ⚠️ +1.13 kB |

🔄 Shared (bundler-independent)

Runtimes

| | Canary | PR | Change |
| --- | --- | --- | --- |
| app-page-exp...dev.js gzip | 322 kB | 322 kB | |
| app-page-exp..prod.js gzip | 171 kB | 171 kB | |
| app-page-tur...dev.js gzip | 322 kB | 322 kB | |
| app-page-tur..prod.js gzip | 171 kB | 171 kB | |
| app-page-tur...dev.js gzip | 318 kB | 318 kB | |
| app-page-tur..prod.js gzip | 169 kB | 169 kB | |
| app-page.run...dev.js gzip | 319 kB | 319 kB | |
| app-page.run..prod.js gzip | 169 kB | 169 kB | |
| app-route-ex...dev.js gzip | 70.9 kB | 70.9 kB | |
| app-route-ex..prod.js gzip | 49.3 kB | 49.3 kB | |
| app-route-tu...dev.js gzip | 70.9 kB | 70.9 kB | |
| app-route-tu..prod.js gzip | 49.3 kB | 49.3 kB | |
| app-route-tu...dev.js gzip | 70.5 kB | 70.5 kB | |
| app-route-tu..prod.js gzip | 49 kB | 49 kB | |
| app-route.ru...dev.js gzip | 70.4 kB | 70.4 kB | |
| app-route.ru..prod.js gzip | 49 kB | 49 kB | |
| dist_client_...dev.js gzip | 324 B | 324 B | |
| dist_client_...dev.js gzip | 326 B | 326 B | |
| dist_client_...dev.js gzip | 318 B | 318 B | |
| dist_client_...dev.js gzip | 317 B | 317 B | |
| pages-api-tu...dev.js gzip | 43.2 kB | 43.2 kB | |
| pages-api-tu..prod.js gzip | 32.9 kB | 32.9 kB | |
| pages-api.ru...dev.js gzip | 43.2 kB | 43.2 kB | |
| pages-api.ru..prod.js gzip | 32.9 kB | 32.9 kB | |
| pages-turbo....dev.js gzip | 52.6 kB | 52.6 kB | |
| pages-turbo...prod.js gzip | 38.5 kB | 38.5 kB | |
| pages.runtim...dev.js gzip | 52.6 kB | 52.6 kB | |
| pages.runtim..prod.js gzip | 38.5 kB | 38.5 kB | |
| server.runti..prod.js gzip | 62 kB | 62 kB | |
| Total | 2.84 MB | 2.84 MB | ⚠️ +5 B |
📎 Tarball URL
https://vercel-packages.vercel.app/next/commits/7f829743dc4590639419b8c9d403eae09cd737d7/next

@gaojude gaojude force-pushed the evals-in-repo branch 3 times, most recently from e5c9cae to a5a1ed2 Compare March 5, 2026 01:15
@gaojude gaojude changed the title Move agent evals into the repo [Evals] Standardize Evals in Next.js Mar 5, 2026

lubieowoce commented Mar 5, 2026

> We only pack `next`, not the whole workspace — the sandbox is remote Linux, so a local `@next/swc` darwin binary wouldn't run there anyway; the sandbox downloads the right one at runtime

we can iterate on this, but this can definitely cause spurious failures when we have rust changes on canary that weren't published yet, so we're gonna have to deal with it at some point

run-evals.js Outdated
Comment on lines +102 to +103
```js
const flags = argv.filter((a) => a.startsWith('-'))
const positional = argv.filter((a) => !a.startsWith('-'))
```

@lubieowoce lubieowoce Mar 5, 2026


can we please use some argument parsing package instead of this

Comment on lines +27 to +31
Then edit three files:

**`PROMPT.md`** — what you'd type into the agent. Write it like a real user would: describe the symptom or goal, not the API. "Navigating from `/a` to `/b` is slow, fix it" is a good prompt. "Use `unstable_instant`" is not — you're testing whether the agent understands the feature well enough to reach for it, not whether it can pattern-match a name you handed it.

**`EVAL.ts`** — vitest assertions against files the agent wrote. Regex the source, don't run it.
Member

this whole convention seems to come from @vercel-labs/agent-eval, which isn't mentioned anywhere in this README. for someone like me who hasn't worked with this stuff at all, it's not even clear that that's what we're doing without reading the runner code. perhaps this README should mention that this is what we're using, and link to the docs for @vercel-labs/agent-eval in addition to inlining the relevant parts here?

Contributor

Agreed — the README explains the eval convention (PROMPT.md, EVAL.ts, fixture dirs) without mentioning that it's all driven by @vercel/agent-eval. Adding a brief "How it works" section that names the package, links to its docs, and explains the relationship between the generated experiments and the runner would make this much more approachable for someone encountering it for the first time.

Contributor Author

Done

evals/README.md Outdated

## Running without Vercel sandbox access

If you don't have Vercel credentials, the runner falls back to local Docker. Have Docker running and provide your own model key in `.env.local` at the repo root:
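For example, a minimal `.env.local` for the Docker fallback might contain just the key named in this PR (the value shown is a placeholder):

```
ANTHROPIC_API_KEY=sk-ant-your-key-here
```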

@lubieowoce lubieowoce Mar 5, 2026


similar to above: inlining docs is nice, but reference links are nicer https://github.com/vercel-labs/agent-eval#direct-api-keys-no-vercel-account-required

as in, it'd be good to mention which part of this setup is the "runner" here. i went looking for a dockerfile in the nextjs repo because i didn't know who's doing that

@gaojude gaojude requested a review from lubieowoce March 5, 2026 19:12
Comment on lines +59 to +61
```js
// Eval fixtures are deliberately imperfect code for agents to fix; EVAL.ts
// uses vitest (not jest) and comes from an external repo.
'evals/evals/**/*',
```

@lubieowoce lubieowoce Mar 6, 2026


what does jest have to do with anything? this is an eslint config. and EVAL.ts is, uh, not in an external repo! anyway, we should still be linting EVAL.ts because that's not part of the "imperfect" code, no?

@elsigh elsigh self-assigned this Mar 6, 2026
@elsigh elsigh self-requested a review March 6, 2026 16:21
@gaojude gaojude marked this pull request as ready for review March 6, 2026 16:35
@gaojude gaojude merged commit 0921733 into canary Mar 6, 2026
155 of 158 checks passed
@gaojude gaojude deleted the evals-in-repo branch March 6, 2026 17:05
sokra pushed a commit that referenced this pull request Mar 6, 2026
