perf(plugins,gateway): thread metadata snapshot + discovery through hot paths + plugin owner fixes#84649
Conversation
|
Codex review: needs maintainer review before merge. Reviewed May 24, 2026, 4:36 PM ET / 20:36 UTC. Summary PR surface: Source +156, Tests +47. Total +203 across 27 files. Reproducibility: no. live current-main reproduction was run in this read-only review. Source inspection and the PR's profiling/proof context show the repeated metadata/discovery seams being addressed, but the exact cold-start delta was not remeasured here. Merge readiness Overall follows the weaker of proof and patch quality, so missing proof can cap an otherwise strong patch. Rank-up moves:
Risk before merge
Maintainer options:
Next step before merge Security Review detailsBest possible solution: Land the snapshot/discovery threading after maintainers accept the provider-auth compatibility tradeoff and the current green checks/proof are considered enough for this hot path. Do we have a high-confidence way to reproduce the issue? No live current-main reproduction was run in this read-only review. Source inspection and the PR's profiling/proof context show the repeated metadata/discovery seams being addressed, but the exact cold-start delta was not remeasured here. Is this the best way to solve the issue? Yes, the approach fits the repository direction because it carries prepared snapshot and discovery facts through existing seams instead of adding broad filesystem caches. The remaining question is maintainer risk acceptance, not a different repair shape. Codex review notes: model gpt-5.5, reasoning high; reviewed against 8bf4f7d4a88d. Label changesLabel justifications:
Evidence reviewedPR surface: Source +156, Tests +47. Total +203 across 27 files. View PR surface stats
What I checked:
Likely related people:
What the crustacean ranks mean
Shiny media proof means a screenshot, video, or linked artifact directly shows the changed behavior. Runtime, network, CSP, and security claims still need visible diagnostics. How this review workflow works
|
|
ClawSweeper PR egg ✨ Hatched: 🥚 common Sunspot Proofling Hatch commandComment Hatchability rules:
Rarity: 🥚 common. What is this egg doing here?
|
|
Pulled the
That work eliminated ~264 Branch: |
|
@clawsweeper reconsider labels, explain the compatibility risk one or remove it. |
|
🦞👀 I asked ClawSweeper to answer this maintainer mention in the next review comment. Tiny claws, bounded scope: this is a read-only assist pass unless it produces one of the existing structured safe-action markers. Request: reconsider labels, explain the compatibility risk one or remove it. |
|
Addressing the three review concerns: F1 — derived-cache invalidation gap. Resolved with watched-files widening + post-get freshness check.
F2 — F4 — eager The eager fetch was net-positive (saved ~34 downstream discoveries by short-circuiting the bundled-channel cascade), so just deleting it gave back a chunk of the headline freeze win. Fixed instead by capturing the discovery during snapshot building and reusing it at the consumer:
Nothing is synthesized or derived; the discovery is the same object reference that was created during Verification: 93/93 plugin tests pass, typecheck clean. Local cold-start freeze re-checked. |
406acb4 to
728b97d
Compare
|
🦞👀 I queued a lightweight read-only assist pass. It will post a separate answer comment and will not edit the durable ClawSweeper review comment or trigger close, merge, repair, label, or branch changes. Request: update labels, provide explanation for any of the merge risk ones you added, or remove them |
|
ClawSweeper assist: the current labels already reflect the requested update: the merge-risk labels were removed after the re-review of commit Evidence:
Suggested next action: no label action is indicated from the supplied context; for a fresh correctness pass on head Source: #84649 (comment) |
728b97d to
bd58a70
Compare
RomneyDa
left a comment
There was a problem hiding this comment.
Refreshed inline notes after rebasing past #85201 (Vincent's derived-snapshot freshness fix). Previous review comments were anchored to pre-rebase SHAs and have been replaced.
The PR now coexists with #85201: persisted snapshots get the cache-busting + LRU treatment, derived snapshots stay out of the memo per #85201's correctness rule.
5a990ea to
3afed66
Compare
|
@clawsweeper re-review |
|
🦞🧹 I asked ClawSweeper to review this item again. Re-review progress:
|
With the snapshot memo now actually hitting, route the snapshot's manifestRegistry and discovery through the helper chains that already had fast paths for them. Eliminates redundant per-call rebuilds at two big amplifiers. - Provider resolve paths (resolvePluginProviders / isPluginProvidersLoadInFlight / resolveOwningPluginIdsForProvider / resolveExternalAuthProfilesWithPlugins) self-service a snapshot once at the public entry, then thread it as a separate required arg through resolvePluginProviderLoadBase, resolveExplicitProviderOwnerPluginIds, and the setup/runtime load state helpers. Inner reads change from 'params.pluginMetadataSnapshot?.x' to 'snapshot.x', no more enrichedParams clone. loadPluginManifestRegistryForInstalledIndex fires drop ~685 -> ~10 per cold start. - Bundled-channel / auto-enable chain accepts an optional PluginDiscoveryResult. discoverOpenClawPlugins is fired once during snapshot building (resolveInstalledPluginIndexRegistry already produced it internally; now bubbled up through loadInstalledPluginIndexWithDiscovery, PluginRegistrySnapshotResult, and onto PluginMetadataSnapshot.discovery). load-context reads metadataSnapshot.discovery and passes it through applyPluginAutoEnable, so the bundled-channel cascade (collectConfiguredChannelIds, listBundledChannelIdsWith*, listPotentialConfiguredChannelPresenceSignals) short-circuits instead of each leaf re-firing discovery. Persisted-cache path is unchanged: no discovery on the snapshot, downstream chain handles its own fallback (pre-PR behavior on that path).
…registry The snapshot memo is now process-scoped and effective (~98% hit rate). Three test files were depending on cache misses (because the broken cache returned them) — each test would set up its own loadPluginManifestRegistry mock and expect a fresh derive. With the cache fixed, an earlier test's mocked registry now leaks into later tests in the same file. - io.write-config.test.ts: afterEach now clears the snapshot memo so the 'demo' plugin mocked in the first test does not survive into 'keeps shipped plugin install config records when index migration fails', which expects an empty registry to surface the 'plugin not found: demo' warning. - gateway/model-pricing-cache.ts: resetGatewayModelPricingCacheForTest also clears the memo. Tests in model-pricing-cache.test.ts assert loadPluginManifestRegistryForInstalledIndex was called; the memo hit otherwise skips the call. - providers.test.ts: vi.doMock loadPluginMetadataSnapshot to wrap the existing loadPluginManifestRegistryMock fixture. The plumbing commit added an auto-fetch fall-through in resolveOwningPluginIdsForProvider; without the mock, providers tests hit real disk reads and return empty registries (which is what surfaced as 9 unrelated-looking failures in the prior CI run).
resolveOwningPluginIdsForProvider now also checks plugin.setup?.cliBackends. The pre-PR no-registry fallback used resolvePluginContributionOwners which includes both top-level cliBackends and setup.cliBackends; the PR's manifest scan replacement was missing the setup case.
…adata snapshot isPluginProvidersLoadInFlight and resolvePluginProviders now resolve env and workspaceDir once at the entry point (falling back to getActivePluginRegistryWorkspaceDir) and pass them into both loadPluginMetadataSnapshot and resolvePluginProviderLoadBase. Pre-fix the snapshot used params.workspaceDir raw while the load base inherited the active workspace, so workspace-scoped provider plugins could be absent from the snapshot manifest registry even though owner resolution expected them. Regression test asserts the snapshot mock receives the active workspaceDir when the caller omits it.
Every gateway applyPluginAutoEnable call now passes the snapshot's PluginDiscoveryResult so the bundled-channel cascade (collectConfiguredChannelIds → listBundledChannelIdsWith* → listPotentialConfiguredChannelPresenceSignals) short-circuits instead of each leaf re-firing discovery. Startup-time sites pull discovery from the snapshot/lookup-table they already hold: - server-plugin-bootstrap.ts (pluginLookUpTable) - server-startup-plugins.ts (pluginMetadataSnapshot) - server-startup-config.ts (pluginMetadataSnapshot) - server-plugins.ts (pluginLookUpTable, both call sites) Per-RPC sites (server.impl getRuntimeConfig callback, server-methods/channels status + start handlers, server-methods/send) source discovery via getCurrentPluginMetadataSnapshot using the runtime config to validate compatibility. Falls through to the original slow path when the snapshot is absent or incompatible.
Threading + correctness slice of the original plugin cold-start performance work.
Context
The original PR also rewrote the
loadPluginMetadataSnapshotmemo (LRU, symmetric memoKeys, watched-file freshness, eligibility broadening). That work has been superseded by two PRs from @steipete that landed on main:12f82270cperf: cache stable gateway metadata — process-scoped install-records cache, switcheschannel-catalog-registryto use discovery candidate metadata instead of re-loading full plugin manifests4314674054perf: reuse plugin metadata snapshots (perf: reuse plugin metadata snapshots #85843) — array-of-8 memo, removes per-call watched-file freshness machinery, addsresolvePluginMetadataSnapshotthat consults the current-snapshot handoff, drops ~370 lines fromplugin-metadata-snapshot.tsThose PRs implement the same architectural direction this branch was going (and aligned with the new
src/plugins/CLAUDE.mdrule: "gateway plugin metadata is stable while gateway runs. Reuse current snapshots ... avoid per-call stat/read/hash freshness"). The memo commits from this branch were dropped during rebase; only the orthogonal pieces remain.What's in the PR
1.
PluginMetadataSnapshot+discoverythreaded through hot pathsresolvePluginProviders,isPluginProvidersLoadInFlight,resolveOwningPluginIdsForProvider, andresolveExternalAuthProfilesWithPluginsall dispatch to helpers that have amanifestRegistryfast path. They now thread the snapshot through so the fast path actually fires instead of every helper rebuilding the manifest registry from disk.loadPluginManifestRegistryForInstalledIndexfires drop ~685 → ~10 per resolve chain.A second plumb adds
discovery?: PluginDiscoveryResulttoPluginMetadataSnapshotand threads it throughapplyPluginAutoEnable→detectPluginAutoEnableCandidates→resolvePluginAutoEnableReadiness→collectConfiguredChannelIds→ the bundled-channel leaf helpers, so the caller's already-resolved discovery short-circuits the cascade.This is independent of memoization — it works because we don't redundantly call
loadPluginMetadataSnapshotwithin a single resolve chain, and helpers reuse the manifest registry the entry-point already resolved.2. Discovery threading through gateway
applyPluginAutoEnablecall sites10+ gateway call sites now pass
discoveryintoapplyPluginAutoEnableso the bundled-channel cascade short-circuits:server-plugin-bootstrap,server-startup-plugins,server-startup-config,server-plugins(two sites)server-methods/channels(status + start),server-methods/send,server.implgetRuntimeConfig callbackPer-RPC sites source discovery via
getCurrentPluginMetadataSnapshotkeyed on the runtime config; falls through to the original slow path when the snapshot is absent or policy-incompatible.3.
setup.cliBackendsowner regression fix75ab995b— the rewritten single owner-scan inresolveOwningPluginIdsForProviderwas missingplugin.setup?.cliBackends. Pre-PR the no-registry fallback went throughresolvePluginContributionOwners({ contribution: "cliBackends" })which includes both top-level and setup CLI backends; the replacement scan needed the same coverage. Includes regression test inproviders.test.ts.4. Workspace inheritance fix
67ee6315—isPluginProvidersLoadInFlightandresolvePluginProvidersnow resolveenvandworkspaceDironce at entry (falling back togetActivePluginRegistryWorkspaceDir) and pass them into bothloadPluginMetadataSnapshotandresolvePluginProviderLoadBase. Pre-fix the snapshot usedparams.workspaceDirraw while the load base inherited the active workspace, so workspace-scoped provider plugins could be absent from the snapshot manifest registry. Includes regression test.5. Test isolation for the snapshot auto-fetch
1abe536f— the threading commit added an auto-fetch fall-through inresolveOwningPluginIdsForProvider. Without a mock, providers tests would hit real disk reads. Addsvi.doMock("./plugin-metadata-snapshot.js")to wrap the existing fixture. Also clears the memo inio.write-config.test.tsafterEach and inresetGatewayModelPricingCacheForTest.Verification
pnpm tsgo --noEmit --project tsconfig.core.json— cleannode scripts/run-vitest.mjs src/plugins/plugin-metadata-snapshot.memo.test.ts src/plugins/plugin-registry-snapshot.test.ts src/plugins/providers.test.ts src/plugins/runtime/load-context.test.ts src/config/io.write-config.test.ts src/gateway/server-plugin-bootstrap.test.ts src/gateway/server-startup-config.recovery.test.ts src/gateway/model-pricing-cache.test.ts— 201/201 passNot run: full
pnpm test/pnpm check, sibling-process plugin-install interleaving.Perf notes
Earlier measurements on this branch (pre-rebase, when the memo commits were still in) showed ~5s cold-start gateway improvement. With the memo work now provided by
4314674054on main, re-measurement should attribute the cold-start win primarily to the threading + discovery short-circuit (this PR) layered on top of the memo (already on main). Warm-start gateway boot is bottlenecked onrunProviderCatalog(network) and is unaffected by either set of changes.