Motivation
#12198 moves pnpr's hosted store (published packages + static content) into an S3-compatible object store (S3 / Cloudflare R2 / MinIO / …). That removes the biggest blocker to running pnpr as multiple stateless replicas, but it isn't sufficient on its own: pnpr still keeps other state on local disk that a horizontally-scaled or serverless runtime won't preserve or share.
This issue tracks making the remaining state pluggable so pnpr can run as a stateless, horizontally-scalable service (the realistic near-term target is container-serverless with autoscaling — Cloud Run / Fly / Container Apps / Fargate scale-to-zero; true FaaS is a larger lift, see below).
Organizing principle: every per-instance disk dependency becomes a config-selected backend, defaulting to today's local behavior, following the pattern #12198 established (build the client once at config-load time into an Arc<dyn …> handle so request handlers stay infallible).
Two backend "kinds"
Everything pnpr keeps locally is one of:
- … object_store; reuse the S3Store from #12198.
Components
- storage.rs … HostedStore {Fs|S3}
- storage.rs … cached: Store (fs-only) … {Fs|S3}, reuse S3Store
- … pnpr-store) … install_accelerator.rs … store_dir
- auth.rs … UserStore / TokenStore … UserBackend / TokenBackend traits + libsql/Turso impl
- install_accelerator/grant_table.rs
- install_accelerator/public_packages.rs
- install_accelerator/verdict_cache.rs
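Blob side (cache + accelerator CAS)
Reuses #12198. Two fs-isms to handle in the proxy cache:
- Store::read_fresh_packument reads file mtime; the S3 variant uses ObjectMeta.last_modified (object_store exposes it).
- tee_to_cache (streaming.rs) writes incrementally to a local file, then renames it. S3 can't append, so stage to a local temp file and upload on stream completion (multipart for large tarballs). This is the same stage-then-upload shape the publish path already uses.
No new external service, which makes this the cheapest large win.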
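The stage-then-upload shape can be sketched with the standard library alone. This is a minimal illustration, not pnpr's code: `BlobStore`, `MemoryBlobStore`, and `tee_then_upload` are hypothetical names, the trait is a stand-in for an object-store client rather than the object_store crate's API, and the upload reads the staged file in one piece where a real implementation would presumably switch to multipart for large tarballs.

```rust
use std::collections::HashMap;
use std::fs::{self, File};
use std::io::{self, Read, Write};
use std::path::Path;

/// Hypothetical stand-in for an object-store client (not the object_store API).
pub trait BlobStore {
    fn put(&mut self, key: &str, bytes: Vec<u8>) -> io::Result<()>;
}

/// In-memory store, used only to keep the sketch self-contained.
#[derive(Default)]
pub struct MemoryBlobStore {
    pub objects: HashMap<String, Vec<u8>>,
}

impl BlobStore for MemoryBlobStore {
    fn put(&mut self, key: &str, bytes: Vec<u8>) -> io::Result<()> {
        self.objects.insert(key.to_string(), bytes);
        Ok(())
    }
}

/// Copies every chunk read from `src` to both writers; returns the byte count.
fn copy_both<R: Read, A: Write, B: Write>(src: &mut R, a: &mut A, b: &mut B) -> io::Result<u64> {
    let mut buf = [0u8; 8192];
    let mut total = 0u64;
    loop {
        let n = src.read(&mut buf)?;
        if n == 0 {
            break;
        }
        a.write_all(&buf[..n])?;
        b.write_all(&buf[..n])?;
        total += n as u64;
    }
    b.flush()?;
    Ok(total)
}

/// Streams `upstream` to `client` while staging a copy at `staging_path`.
/// The object is uploaded only after the stream completes, so a truncated
/// download never becomes a cache entry. The staging file is removed either way.
pub fn tee_then_upload<R: Read, W: Write, S: BlobStore>(
    mut upstream: R,
    mut client: W,
    store: &mut S,
    key: &str,
    staging_path: &Path,
) -> io::Result<u64> {
    let mut staged = File::create(staging_path)?;
    let copied = copy_both(&mut upstream, &mut client, &mut staged);
    drop(staged); // close the file before reading it back
    let outcome = match copied {
        Ok(n) => fs::read(staging_path)
            .and_then(|bytes| store.put(key, bytes))
            .map(|()| n),
        Err(e) => Err(e),
    };
    let _ = fs::remove_file(staging_path); // best-effort cleanup
    outcome
}
```

The client still receives bytes as they arrive; only the cache write is deferred until the whole body is known to be good.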
Record side (auth + grants)
Narrow async traits, keeping current impls as defaults:
Impls: InMemory, LocalSqlite (today's rusqlite-backed stores), and a networked SQLite impl. The impl is selected by config and built into an Arc<dyn …> at load time. (Several impls sit behind dyn, hence async-trait.)
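As a rough sketch of what "narrow trait, impl chosen once at load time" could look like: every name below (`TokenBackend`, `InMemoryTokens`, `RecordBackendKind`, `build_token_backend`, `authenticate`) is hypothetical, and the trait is synchronous so the example needs only the standard library, whereas the traits described here are async. Only the in-memory impl is filled in; the other variants return an error in this sketch.

```rust
use std::collections::HashMap;
use std::sync::{Arc, RwLock};

/// Hypothetical narrow trait (synchronous here; the real traits are async).
pub trait TokenBackend: Send + Sync {
    /// Returns the user a token belongs to, if the token is valid.
    fn lookup(&self, token: &str) -> Option<String>;
    fn insert(&self, token: &str, user: &str);
    /// Returns true if the token existed and was removed.
    fn revoke(&self, token: &str) -> bool;
}

#[derive(Default)]
pub struct InMemoryTokens {
    tokens: RwLock<HashMap<String, String>>,
}

impl TokenBackend for InMemoryTokens {
    fn lookup(&self, token: &str) -> Option<String> {
        self.tokens.read().unwrap().get(token).cloned()
    }
    fn insert(&self, token: &str, user: &str) {
        self.tokens.write().unwrap().insert(token.to_string(), user.to_string());
    }
    fn revoke(&self, token: &str) -> bool {
        self.tokens.write().unwrap().remove(token).is_some()
    }
}

/// Hypothetical config selector for the record backend.
pub enum RecordBackendKind {
    InMemory,
    LocalSqlite { path: String },
    Networked { url: String },
}

/// All fallible work happens here, once, at config-load time.
pub fn build_token_backend(kind: &RecordBackendKind) -> Result<Arc<dyn TokenBackend>, String> {
    match kind {
        RecordBackendKind::InMemory => Ok(Arc::new(InMemoryTokens::default())),
        RecordBackendKind::LocalSqlite { path } => {
            Err(format!("LocalSqlite ({path}) is not implemented in this sketch"))
        }
        RecordBackendKind::Networked { url } => {
            Err(format!("networked backend ({url}) is not implemented in this sketch"))
        }
    }
}

/// A request handler only sees the trait object, never the config.
pub fn authenticate(backend: &dyn TokenBackend, bearer: &str) -> Option<String> {
    backend.lookup(bearer)
}
```

Because construction errors surface in `build_token_backend`, a handler like `authenticate` has no backend-selection failure path of its own.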
DB choice: SQLite-compatible, not Postgres
pnpr is already built on SQLite (rusqlite backs tokens, grants, public-packages, verdict cache), so the schemas and queries already exist. Standardizing on SQLite-compatible services keeps the SQL and makes an all-Cloudflare stack coherent: R2 (blobs) + D1 (records) + compute.
Important nuance: "SQLite" ≠ "rusqlite" once networked. rusqlite opens a local file; a networked SQLite service needs a different driver. Keep the SQL, swap the driver:
| Option | Reach it via | Works with current stack? |
| --- | --- | --- |
| Cloudflare D1 | Workers binding, or REST API (SQL over HTTP) | REST API: yes (Cloud Run/Lambda). Workers binding: only inside a Worker. |
| Turso / libSQL | libsql crate (network + embedded replicas) | Yes, on any tokio runtime |
| LiteFS (Fly) | FUSE-replicated file, rusqlite unchanged, single writer | Yes |
| rqlite | Raft + HTTP | Yes |
Caveats:
- D1 is happiest inside a Worker, and Workers can't run the current tokio/axum/rusqlite stack (that would take a WASM rewrite). From Cloud Run/Lambda you'd use D1's REST API, which means one HTTP request per query.
- Auth/token lookup is on the hot path (≈ every request), so a pure-HTTP-to-D1 backend taxes every call. Mitigate with a short-TTL in-process token cache or Turso embedded replicas (fast local reads).
- Replicas are eventually consistent. That is fine for grants/public-packages (the grant table is already best-effort, clear-on-discovery); the one to watch is token-revocation lag. A config knob (replica vs primary read for auth) covers it.
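The short-TTL token cache mentioned in the caveats might look roughly like the following. `TokenCache` and `get_or_load` are hypothetical names, and the clock is passed in as a parameter purely to make the behavior easy to check. The TTL is also the upper bound on revocation lag: a token revoked upstream keeps working from the cache until its entry expires.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Hypothetical in-process cache in front of a slow (networked) token lookup.
pub struct TokenCache {
    ttl: Duration,
    // token -> (lookup result, time it was fetched); misses are cached too.
    entries: HashMap<String, (Option<String>, Instant)>,
}

impl TokenCache {
    pub fn new(ttl: Duration) -> Self {
        TokenCache { ttl, entries: HashMap::new() }
    }

    /// Returns the cached result if it is younger than the TTL; otherwise
    /// calls `load` (the remote lookup) and caches whatever it returns.
    pub fn get_or_load<F>(&mut self, token: &str, now: Instant, load: F) -> Option<String>
    where
        F: FnOnce(&str) -> Option<String>,
    {
        if let Some((user, fetched_at)) = self.entries.get(token) {
            if now.duration_since(*fetched_at) < self.ttl {
                return user.clone();
            }
        }
        let user = load(token);
        self.entries.insert(token.to_string(), (user.clone(), now));
        user
    }
}
```

A deployment that cannot tolerate any revocation lag would bypass such a cache and read from the primary, which is what the replica-vs-primary knob is for.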
Recommendation: abstract behind the record-backend trait; default LocalSqlite. For the container-serverless target reachable today, Turso/libSQL is the pragmatic pick (works with the current stack; embedded replicas solve the hot-path auth read). D1 is the north star for the Workers/edge path, gated on the larger stack rewrite.
Config shape: each block absent ⇒ local default.
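A minimal sketch of the "absent block means local default" rule, expressed as plain Rust types rather than a config file. Only `replicaPath` and `syncIntervalSecs` (here `replica_path` and `sync_interval_secs`) are names taken from this issue; `PnprConfig`, `LibsqlBlock`, `RecordStore`, `Replica`, the `url` field, and the 60-second default are all assumptions made for illustration.

```rust
/// Hypothetical parsed form of a `backend.libsql:` block.
#[derive(Debug, Clone, PartialEq)]
pub struct LibsqlBlock {
    pub url: String,                     // assumed field
    pub replica_path: Option<String>,    // replicaPath
    pub sync_interval_secs: Option<u64>, // syncIntervalSecs
}

/// Hypothetical top-level config: every backend block is optional.
#[derive(Debug, Clone, PartialEq, Default)]
pub struct PnprConfig {
    pub libsql: Option<LibsqlBlock>,
}

#[derive(Debug, PartialEq)]
pub struct Replica {
    pub path: String,
    pub sync_interval_secs: u64,
}

/// What the rest of the server would be built from.
#[derive(Debug, PartialEq)]
pub enum RecordStore {
    LocalSqlite,
    Libsql { url: String, replica: Option<Replica> },
}

/// Assumed default, not a documented value.
const DEFAULT_SYNC_INTERVAL_SECS: u64 = 60;

/// No block: today's local behavior. Block present: networked backend,
/// with an embedded replica only if a replica path was given.
pub fn resolve_record_store(cfg: &PnprConfig) -> RecordStore {
    match &cfg.libsql {
        None => RecordStore::LocalSqlite,
        Some(block) => RecordStore::Libsql {
            url: block.url.clone(),
            replica: block.replica_path.as_ref().map(|path| Replica {
                path: path.clone(),
                sync_interval_secs: block
                    .sync_interval_secs
                    .unwrap_or(DEFAULT_SYNC_INTERVAL_SECS),
            }),
        },
    }
}
```

An empty config therefore resolves to `RecordStore::LocalSqlite`, so existing single-node deployments would keep working without edits.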
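Phased rollout
1. Proxy cache + accelerator CAS → S3: reuses object_store, no new external service. After this, the only local state left is the SQLite record stores. Biggest bang, lowest risk; proposed next step.
2. UserBackend / TokenBackend async traits; local (htpasswd + SQLite) / in-memory / libsql/Turso impls selected by a backend.libsql: block. Embedded-replica read acceleration (replicaPath / syncIntervalSecs) for the hot-path token lookup; strict atomic registration cap.
3. Deployment glue: Cloud Run needs roughly a Dockerfile plus a readiness probe; Lambda needs lambda_http plus response-streaming handling; Workers is a separate, larger effort.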
Cross-cutting caveat (independent of where bytes live)
With N replicas, the read-modify-write flows are the real distributed-systems risk, not storage:
- Publish merges the incoming manifest into the existing packument (publish.rs::merge_manifest); partial-unpublish rewrites it. Two replicas publishing the same package concurrently means last-write-wins on the packument object, so one of the versions is lost.
- Single-instance half ✅ done in #12206 (feat(pnpr): config-selectable networked-SQLite auth backend (#12199 phase 3)): a striped per-package lock serializes the read-modify-write packument flows (publish, dist-tag, partial-unpublish) on one instance. Cross-replica is still pending: S3 conditional writes (If-Match / ETag) or a DB-level conditional update. pnpm clients usually serialize per package, but a shared registry shouldn't rely on that.
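The cross-replica fix amounts to a compare-and-swap loop around the packument. The sketch below models it with an in-memory store whose integer etag plays the role that If-Match / ETag would play on S3; `VersionedStore`, `put_if_match`, `publish_version`, and `PreconditionFailed` are hypothetical names, and a packument is reduced to a list of version strings.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for an object store with conditional writes.
#[derive(Default)]
pub struct VersionedStore {
    // package -> (etag, published versions)
    objects: HashMap<String, (u64, Vec<String>)>,
}

#[derive(Debug, PartialEq)]
pub struct PreconditionFailed;

impl VersionedStore {
    /// Returns the current etag and versions; etag 0 means "no object yet".
    pub fn get(&self, key: &str) -> (u64, Vec<String>) {
        self.objects.get(key).cloned().unwrap_or((0, Vec::new()))
    }

    /// Writes only if the stored etag still equals `expected`.
    pub fn put_if_match(
        &mut self,
        key: &str,
        expected: u64,
        versions: Vec<String>,
    ) -> Result<u64, PreconditionFailed> {
        let current = self.objects.get(key).map(|(etag, _)| *etag).unwrap_or(0);
        if current != expected {
            return Err(PreconditionFailed);
        }
        let next = current + 1;
        self.objects.insert(key.to_string(), (next, versions));
        Ok(next)
    }
}

/// Read, merge, conditionally write; on conflict re-read and merge again.
pub fn publish_version(
    store: &mut VersionedStore,
    package: &str,
    version: &str,
    max_attempts: u32,
) -> Result<u64, PreconditionFailed> {
    for _ in 0..max_attempts {
        let (etag, mut versions) = store.get(package);
        if !versions.iter().any(|v| v == version) {
            versions.push(version.to_string());
        }
        match store.put_if_match(package, etag, versions) {
            Ok(new_etag) => return Ok(new_etag),
            Err(PreconditionFailed) => continue, // another writer won the race
        }
    }
    Err(PreconditionFailed)
}
```

A writer holding a stale etag gets `PreconditionFailed` instead of silently overwriting, and the retry merges its version into whatever the other replica wrote.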
Also: keep cold-start cost down — the install accelerator is already lazily built via OnceLock; keep that laziness and use a small connection pool.
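For reference, lazy construction through `std::sync::OnceLock` has roughly this shape. `Accelerator`, `accelerator()`, `build_count()`, and the pool size of 2 are placeholders for illustration, not pnpr's actual types or settings.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::OnceLock;

/// Hypothetical stand-in for the install accelerator.
pub struct Accelerator {
    pub pool_size: usize,
}

static ACCELERATOR: OnceLock<Accelerator> = OnceLock::new();
// Counts how many times the expensive constructor actually ran.
static BUILDS: AtomicUsize = AtomicUsize::new(0);

/// Builds the accelerator on first use only; later calls return the same value.
pub fn accelerator() -> &'static Accelerator {
    ACCELERATOR.get_or_init(|| {
        BUILDS.fetch_add(1, Ordering::SeqCst);
        Accelerator { pool_size: 2 } // small pool, assumed value
    })
}

pub fn build_count() -> usize {
    BUILDS.load(Ordering::SeqCst)
}
```

A replica that never serves an accelerator request never pays the construction cost, which is what keeps scale-from-zero starts cheap.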
Written by an agent (Claude Code, claude-opus-4-8).