Skip to content

[RFC] Use SQLite for package metadata within pnpm store #10826

@SukkaW

Description

@SukkaW

Contribution

Describe the user story

This RFC follows my previous RFC Increase default network-concurrency. It is another idea about improving pnpm's performance.

In Bun's blog Behind The Scenes of Bun Install, they described how they make bun install fast.

We will ignore all the BS like "async dns resolution only on macOS", and focus on what matters.

In their post, they describe a Cache-Friendly Data Layout, a special binary format for caching manifests.

Currently, pnpm stores all the package metadata in the content-addressable JSON files (e.g. $PNPM_HOME/store/v10/index/XX/XXXX-name@version.json). Later, pnpm will join the file path and filename, read the file, and call JSON.parse.

Describe the solution you'd like

pnpm doesn't really need to home-bake a custom binary data layout, because there is already a battle-tested solution: SQLite.

SQLite is very good at storing small pieces of data. SQLite claims that with blobs under 10 KiB, it can be up to 35% faster than File System, and almost never slower, most likely due to SQLite only needing to touch very few files, thus reducing syscalls overhead.

Another SQLite study shows that SQLite can store up to 100 KiB blob within the database, while still faster than storing blob externally. 100 KiB should be enough for the vast majority of the pnpm package metadata.

Also, SQLite is space-efficient with small blobs. With most of the filesystems using a default block size of 4 KiB (most of the time), having many small JSON files on the filesystem is not space-efficient. SQLite, on the other hand, stores small blobs in a compact, continuous manner.

Here is the find + awk command I used to analyze my pnpm index store file size distribution:

cd $PNPM_HOME/store/v10/index/

find . -type f -exec stat -f '%z' {} \; | awk '{s=$1; if(s<1024)c1++; else if(s<4096)c2++; else if(s<8192)c3++; else if(s<102400)c4++; else c5++} END{print "<1KiB:",c1; print "1-4KiB:",c2; print "4-8KiB:",c3; print "8-100KiB:",c4; print ">100KiB:",c5}'

...and here is the output:

<1KiB: 637
1-4KiB: 1176
4-8KiB: 150
8-100KiB: 255
>100KiB: 34

As shown above, most of the pnpm stored package metadata JSON files are smaller than 4 KiB; only a handful of JSON files exceed the SQLite external blob 100 KiB performance threshold.

Also, SQLite may potentially give a performance advantage. Instead of doing a content-addressable store, pnpm could simply store the package name and version in table rows, then create indexes with those two rows. And instead of reading JSON files then JSON.parse, pnpm could store some commonly used fields directly in the table, and later just SELECT those fields out w/o deserilization overhead.

Describe the drawbacks of your solution

Node.js built-in SQLite support only removes --experimental-sqlite since v22.13.0 and v23.4.0, and only out of experimental since v25.7.0.

@sqlite.org/sqlite-wasm currently only officially supports an in-memory database on Node.js w/o persistence. Manually reading the DB buffer from the filesystem, then allocFromTypedArray, and later export buffer via sqlite3_js_db_export and writing back to the filesystem is supported though.

Both sqlite3 and better-sqlite3 require node-gyp, which is not ideal.

There are a few N-API versions of SQLite bindings out in the wild, none of which are battle-tested and may not be trustworthy. This RFC only lists two of them for reference:

Describe alternatives you've considered

No response

Metadata

Metadata

Assignees

No one assigned

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions