
[Evals] Standardize Evals in Next.js #90883

Merged
gaojude merged 2 commits into canary from evals-in-repo
Mar 6, 2026

Conversation


@gaojude gaojude commented Mar 4, 2026

Fixtures now live next to the code they test, like e2e. pnpm eval <name> packs the local next build, generates baseline + agents-md experiment configs on the fly, and runs both in a sandbox. The agents-md variant drops an AGENTS.md that points the agent at the bundled docs in node_modules/next/dist/docs/ — comparing the two variants tells you whether shipping a doc actually changes agent behavior.

run-evals.js mirrors run-tests.js: pack once, pass the tarball path to the child via NEXT_EVAL_TARBALL env, forward flags. We only pack next, not the whole workspace — the sandbox is remote Linux, so a local @next/swc darwin binary wouldn't run there anyway; the sandbox downloads the right one at runtime. The experiment config uses sandbox: 'auto', which picks Vercel sandboxes when credentials are present and falls back to local Docker otherwise, so external contributors can run the same evals with just Docker + ANTHROPIC_API_KEY.

experiments/ is generated fresh each run and gitignored so we don't maintain N committed config files that differ by one line. Fixture code is excluded from eslint since it's deliberately imperfect code for agents to fix, and EVAL.ts uses vitest rather than jest. Fixture package.json files use "next": "^16" rather than a pinned canary so agents reading package.json to infer capabilities aren't misled by a stale version string; the tarball install overlays it regardless.
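A fixture `package.json` following that convention might look like this — the fixture name and React version ranges are illustrative; only the `"next": "^16"` range comes from this PR:

```json
{
  "name": "example-eval-fixture",
  "private": true,
  "dependencies": {
    "next": "^16",
    "react": "^19.0.0",
    "react-dom": "^19.0.0"
  }
}
```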

next-evals-oss stays as the full benchmark runner for nextjs.org/evals; it'll pull fixtures from here instead of keeping its own copy.

@nextjs-bot nextjs-bot added the created-by: Next.js team PRs by the Next.js team. label Mar 4, 2026

nextjs-bot commented Mar 4, 2026

Failing test suites

Commit: 72af8ff | About building and testing Next.js

pnpm test-dev test/development/app-dir/server-components-hmr-cache/server-components-hmr-cache.test.ts (job)

  • server-components-hmr-cache > should support reading from an infinite streaming fetch (DD)

```
● server-components-hmr-cache › should support reading from an infinite streaming fetch

thrown: "Exceeded timeout of 10000 ms for a test.
Add a timeout value to this test to increase the timeout, if this is a long-running test. See https://jestjs.io/docs/api#testname-fn-timeout."

  224 |   })
  225 |
> 226 |   it('should support reading from an infinite streaming fetch', async () => {
      |   ^
  227 |     const browser = await next.browser('/infinite-stream')
  228 |     const text = await browser.elementByCss('p').text()
  229 |     expect(text).toBe('data: chunk-1')

  at it (development/app-dir/server-components-hmr-cache/server-components-hmr-cache.test.ts:226:3)
  at Object.describe (development/app-dir/server-components-hmr-cache/server-components-hmr-cache.test.ts:6:1)
```


nextjs-bot commented Mar 4, 2026

Stats from current PR

✅ No significant changes detected

📊 All Metrics
📖 Metrics Glossary

Dev Server Metrics:

  • Listen = TCP port starts accepting connections
  • First Request = HTTP server returns successful response
  • Cold = Fresh build (no cache)
  • Warm = With cached build artifacts

Build Metrics:

  • Fresh = Clean build (no .next directory)
  • Cached = With existing .next directory

Change Thresholds:

  • Time: Changes < 50ms AND < 10%, OR < 2% are insignificant
  • Size: Changes < 1KB AND < 1% are insignificant
  • All other changes are flagged to catch regressions

⚡ Dev Server

| Metric | Canary | PR | Change | Trend |
| --- | --- | --- | --- | --- |
| Cold (Listen) | 456ms | 456ms | | ▁▁▁▁▁ |
| Cold (Ready in log) | 457ms | 455ms | | ▁▁▁▁▁ |
| Cold (First Request) | 1.019s | 1.003s | | ▂▂▁▁▂ |
| Warm (Listen) | 456ms | 456ms | | ▁▁▁▁▁ |
| Warm (Ready in log) | 454ms | 457ms | | ▁▁▁▁▁ |
| Warm (First Request) | 380ms | 382ms | | ▁▁▁▁▁ |
📦 Dev Server (Webpack) (Legacy)

| Metric | Canary | PR | Change | Trend |
| --- | --- | --- | --- | --- |
| Cold (Listen) | 506ms | 507ms | | ▁▁▁▁▁ |
| Cold (Ready in log) | 461ms | 463ms | | ▆▆▅▅▁ |
| Cold (First Request) | 1.976s | 1.966s | | ▄▃▃▇▁ |
| Warm (Listen) | 505ms | 505ms | | ▁▁▁▁▁ |
| Warm (Ready in log) | 461ms | 460ms | | ▅▄▅▅▁ |
| Warm (First Request) | 1.994s | 2.061s | | ▄▄▄▇▁ |

⚡ Production Builds

| Metric | Canary | PR | Change | Trend |
| --- | --- | --- | --- | --- |
| Fresh Build | 4.393s | 4.402s | | ▁▁▁▁▁ |
| Cached Build | 4.446s | 4.421s | | ▁▁▁▁▁ |
📦 Production Builds (Webpack) (Legacy)

| Metric | Canary | PR | Change | Trend |
| --- | --- | --- | --- | --- |
| Fresh Build | 14.477s | 14.598s | | ▁▁▂▅▁ |
| Cached Build | 14.599s | 14.704s | | ▁▁▁▅▁ |
| node_modules Size | 477 MB | 477 MB | | ▁▁▁▁▁ |
📦 Bundle Sizes


⚡ Turbopack

Client

Main Bundles: **402 kB** → **402 kB** ⚠️ +31 B

80 files with content-based hashes (individual files not comparable between builds)

Server

Middleware

| | Canary | PR | Change |
| --- | --- | --- | --- |
| middleware-b..fest.js gzip | 766 B | 767 B | |
| Total | 766 B | 767 B | ⚠️ +1 B |

Build Details

Build Manifests

| | Canary | PR | Change |
| --- | --- | --- | --- |
| _buildManifest.js gzip | 446 B | 450 B | |
| Total | 446 B | 450 B | ⚠️ +4 B |

📦 Webpack

Client

Main Bundles

| | Canary | PR | Change |
| --- | --- | --- | --- |
| 5528-HASH.js gzip | 5.54 kB | N/A | - |
| 6280-HASH.js gzip | 59.4 kB | N/A | - |
| 6335.HASH.js gzip | 169 B | N/A | - |
| 912-HASH.js gzip | 4.59 kB | N/A | - |
| e8aec2e4-HASH.js gzip | 62.6 kB | N/A | - |
| framework-HASH.js gzip | 59.7 kB | 59.7 kB | |
| main-app-HASH.js gzip | 255 B | 254 B | |
| main-HASH.js gzip | 39.1 kB | 39.1 kB | |
| webpack-HASH.js gzip | 1.68 kB | 1.68 kB | |
| 262-HASH.js gzip | N/A | 4.59 kB | - |
| 2889.HASH.js gzip | N/A | 169 B | - |
| 5602-HASH.js gzip | N/A | 5.55 kB | - |
| 6948ada0-HASH.js gzip | N/A | 62.6 kB | - |
| 9544-HASH.js gzip | N/A | 60.2 kB | - |
| Total | 233 kB | 234 kB | ⚠️ +734 B |

Polyfills

| | Canary | PR | Change |
| --- | --- | --- | --- |
| polyfills-HASH.js gzip | 39.4 kB | 39.4 kB | |
| Total | 39.4 kB | 39.4 kB | |

Pages

| | Canary | PR | Change |
| --- | --- | --- | --- |
| _app-HASH.js gzip | 194 B | 194 B | |
| _error-HASH.js gzip | 183 B | 180 B | 🟢 3 B (-2%) |
| css-HASH.js gzip | 331 B | 330 B | |
| dynamic-HASH.js gzip | 1.81 kB | 1.81 kB | |
| edge-ssr-HASH.js gzip | 256 B | 256 B | |
| head-HASH.js gzip | 351 B | 352 B | |
| hooks-HASH.js gzip | 384 B | 383 B | |
| image-HASH.js gzip | 580 B | 581 B | |
| index-HASH.js gzip | 260 B | 260 B | |
| link-HASH.js gzip | 2.51 kB | 2.51 kB | |
| routerDirect..HASH.js gzip | 320 B | 319 B | |
| script-HASH.js gzip | 386 B | 386 B | |
| withRouter-HASH.js gzip | 315 B | 315 B | |
| 1afbb74e6ecf..834.css gzip | 106 B | 106 B | |
| Total | 7.98 kB | 7.98 kB | ✅ -1 B |

Server

Edge SSR

| | Canary | PR | Change |
| --- | --- | --- | --- |
| edge-ssr.js gzip | 125 kB | 125 kB | |
| page.js gzip | 256 kB | 256 kB | |
| Total | 380 kB | 381 kB | ⚠️ +905 B |

Middleware

| | Canary | PR | Change |
| --- | --- | --- | --- |
| middleware-b..fest.js gzip | 618 B | 617 B | |
| middleware-r..fest.js gzip | 156 B | 155 B | |
| middleware.js gzip | 43.6 kB | 43.9 kB | |
| edge-runtime..pack.js gzip | 842 B | 842 B | |
| Total | 45.2 kB | 45.5 kB | ⚠️ +300 B |

Build Details

Build Manifests

| | Canary | PR | Change |
| --- | --- | --- | --- |
| _buildManifest.js gzip | 715 B | 718 B | |
| Total | 715 B | 718 B | ⚠️ +3 B |

Build Cache

| | Canary | PR | Change |
| --- | --- | --- | --- |
| 0.pack gzip | 4.07 MB | 4.07 MB | |
| index.pack gzip | 103 kB | 102 kB | |
| index.pack.old gzip | 103 kB | 103 kB | |
| Total | 4.27 MB | 4.28 MB | ⚠️ +1.13 kB |

🔄 Shared (bundler-independent)

Runtimes

| | Canary | PR | Change |
| --- | --- | --- | --- |
| app-page-exp...dev.js gzip | 322 kB | 322 kB | |
| app-page-exp..prod.js gzip | 171 kB | 171 kB | |
| app-page-tur...dev.js gzip | 322 kB | 322 kB | |
| app-page-tur..prod.js gzip | 171 kB | 171 kB | |
| app-page-tur...dev.js gzip | 318 kB | 318 kB | |
| app-page-tur..prod.js gzip | 169 kB | 169 kB | |
| app-page.run...dev.js gzip | 319 kB | 319 kB | |
| app-page.run..prod.js gzip | 169 kB | 169 kB | |
| app-route-ex...dev.js gzip | 70.9 kB | 70.9 kB | |
| app-route-ex..prod.js gzip | 49.3 kB | 49.3 kB | |
| app-route-tu...dev.js gzip | 70.9 kB | 70.9 kB | |
| app-route-tu..prod.js gzip | 49.3 kB | 49.3 kB | |
| app-route-tu...dev.js gzip | 70.5 kB | 70.5 kB | |
| app-route-tu..prod.js gzip | 49 kB | 49 kB | |
| app-route.ru...dev.js gzip | 70.4 kB | 70.4 kB | |
| app-route.ru..prod.js gzip | 49 kB | 49 kB | |
| dist_client_...dev.js gzip | 324 B | 324 B | |
| dist_client_...dev.js gzip | 326 B | 326 B | |
| dist_client_...dev.js gzip | 318 B | 318 B | |
| dist_client_...dev.js gzip | 317 B | 317 B | |
| pages-api-tu...dev.js gzip | 43.2 kB | 43.2 kB | |
| pages-api-tu..prod.js gzip | 32.9 kB | 32.9 kB | |
| pages-api.ru...dev.js gzip | 43.2 kB | 43.2 kB | |
| pages-api.ru..prod.js gzip | 32.9 kB | 32.9 kB | |
| pages-turbo....dev.js gzip | 52.6 kB | 52.6 kB | |
| pages-turbo...prod.js gzip | 38.5 kB | 38.5 kB | |
| pages.runtim...dev.js gzip | 52.6 kB | 52.6 kB | |
| pages.runtim..prod.js gzip | 38.5 kB | 38.5 kB | |
| server.runti..prod.js gzip | 62 kB | 62 kB | |
| Total | 2.84 MB | 2.84 MB | ⚠️ +5 B |
📎 Tarball URL
https://vercel-packages.vercel.app/next/commits/7f829743dc4590639419b8c9d403eae09cd737d7/next

@gaojude gaojude force-pushed the evals-in-repo branch 3 times, most recently from e5c9cae to a5a1ed2 Compare March 5, 2026 01:15
@gaojude gaojude changed the title Move agent evals into the repo [Evals] Standardize Evals in Next.js Mar 5, 2026

lubieowoce commented Mar 5, 2026

> We only pack `next`, not the whole workspace — the sandbox is remote Linux, so a local `@next/swc` darwin binary wouldn't run there anyway; the sandbox downloads the right one at runtime

we can iterate on this, but this can definitely cause spurious failures when we have rust changes on canary that weren't published yet, so we're gonna have to deal with it at some point

run-evals.js Outdated
Comment on lines +102 to +103
```js
const flags = argv.filter((a) => a.startsWith('-'))
const positional = argv.filter((a) => !a.startsWith('-'))
```

@lubieowoce lubieowoce Mar 5, 2026


can we please use some argument parsing package instead of this

Comment on lines +27 to +31
Then edit three files:

**`PROMPT.md`** — what you'd type into the agent. Write it like a real user would: describe the symptom or goal, not the API. "Navigating from `/a` to `/b` is slow, fix it" is a good prompt. "Use `unstable_instant`" is not — you're testing whether the agent understands the feature well enough to reach for it, not whether it can pattern-match a name you handed it.

**`EVAL.ts`** — vitest assertions against files the agent wrote. Regex the source, don't run it.
Member

this whole convention seems to come from @vercel-labs/agent-eval, which isn't mentioned anywhere in this README. for someone like me who hasn't worked with this stuff at all, it's not even clear that that's what we're doing without reading the runner code. perhaps this README should mention that this is what we're using, and link to the docs for @vercel-labs/agent-eval in addition to inlining the relevant parts here?

Contributor

Agreed — the README explains the eval convention (PROMPT.md, EVAL.ts, fixture dirs) without mentioning that it's all driven by @vercel/agent-eval. Adding a brief "How it works" section that names the package, links to its docs, and explains the relationship between the generated experiments and the runner would make this much more approachable for someone encountering it for the first time.

Contributor Author

Done

evals/README.md Outdated

## Running without Vercel sandbox access

If you don't have Vercel credentials, the runner falls back to local Docker. Have Docker running and provide your own model key in `.env.local` at the repo root:
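For example, a minimal `.env.local` for the Docker fallback might contain just the key named in this PR (the value shown is a placeholder):

```
ANTHROPIC_API_KEY=sk-ant-your-key-here
```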

@lubieowoce lubieowoce Mar 5, 2026


similar to above: inlining docs is nice, but reference links are nicer https://github.com/vercel-labs/agent-eval#direct-api-keys-no-vercel-account-required

as in, it'd be good to mention which part of this setup is the "runner" here. i went looking for a dockerfile in the nextjs repo because i didn't know who's doing that

@gaojude gaojude requested a review from lubieowoce March 5, 2026 19:12
Comment on lines +59 to +61
```js
// Eval fixtures are deliberately imperfect code for agents to fix; EVAL.ts
// uses vitest (not jest) and comes from an external repo.
'evals/evals/**/*',
```

@lubieowoce lubieowoce Mar 6, 2026


what does jest have to do with anything? this is an eslint config. and EVAL.ts is, uh, not in an external repo! anyway, we should still be linting EVAL.ts because that's not part of the "imperfect" code, no?

@elsigh elsigh self-assigned this Mar 6, 2026
@elsigh elsigh self-requested a review March 6, 2026 16:21
@gaojude gaojude marked this pull request as ready for review March 6, 2026 16:35
@gaojude gaojude merged commit 0921733 into canary Mar 6, 2026
155 of 158 checks passed
@gaojude gaojude deleted the evals-in-repo branch March 6, 2026 17:05
sokra pushed a commit that referenced this pull request Mar 6, 2026
