[turbopack] add a new hasher implementation to eliminate allocations #89059
lukesandberg merged 1 commit into canary
Conversation
Merging this PR will not alter performance
📦 Bundle Sizes (Turbopack): Client Main Bundles **401 kB** → **401 kB** ✅ -21 B
Tests Passed
DeterministicHash is used for creating filenames, so it isn't appropriate for TaskInputs (hashing VCs is explicitly called out as verboten). I would need to produce a new trait, 'GoodEnoughForTurboPersistenceHash', and then derive it for TaskInput. This approach seemed simpler, but I'll admit that might be a bit nicer.
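As a rough sketch of what such a persistence-only hash trait could look like (the trait name, type names, and impls below are illustrative, not from the Turbopack codebase): each input type streams its fields into a `Hasher` directly, independently of `DeterministicHash`.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::Hasher;

// Hypothetical sketch of a persistence-only hashing trait, kept separate
// from DeterministicHash (which must stay stable because it feeds filenames).
trait PersistenceHash {
    fn persistence_hash<H: Hasher>(&self, state: &mut H);
}

impl PersistenceHash for u32 {
    fn persistence_hash<H: Hasher>(&self, state: &mut H) {
        state.write_u32(*self);
    }
}

impl PersistenceHash for str {
    fn persistence_hash<H: Hasher>(&self, state: &mut H) {
        state.write(self.as_bytes());
        // Writing the length as well avoids ambiguity between adjacent
        // variable-length fields.
        state.write_usize(self.len());
    }
}

// A TaskInput-like struct would implement (or derive) it field by field.
struct ExampleTaskInput {
    id: u32,
    name: String,
}

impl PersistenceHash for ExampleTaskInput {
    fn persistence_hash<H: Hasher>(&self, state: &mut H) {
        self.id.persistence_hash(state);
        self.name.as_str().persistence_hash(state);
    }
}

fn main() {
    let input = ExampleTaskInput { id: 7, name: "task".into() };
    let mut h1 = DefaultHasher::new();
    input.persistence_hash(&mut h1);
    let mut h2 = DefaultHasher::new();
    input.persistence_hash(&mut h2);
    // Hashing the same value twice with the same hasher state is deterministic.
    assert_eq!(h1.finish(), h2.finish());
}
```

A derive macro could generate the struct impl mechanically, the same way `DeterministicHash` is derived today.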
Save space in the persistent store by only recording TaskType once instead of twice.

## What?

Instead of using an encoded TaskType struct as a key in persistent storage, just use a 64-bit hash. Lookups now require a cascading read:

* read the task ids that match the hash
* restore tasks until we find one that matches our CachedTaskType

Full cache misses perform the same, and so do cache hits, since we always end up reading the TaskData anyway (in the `connect_child` operation that immediately follows). This just slightly changes _when_ that second read happens, so we shouldn't really expect it to slow things down. In the case of a hash collision we do end up doing strictly more work (reading and restoring TaskData entries for the wrong task), but this work is cached and should be _extremely_ rare assuming a good hash function.

From measuring vercel-site, this saves ~231M of data in the persistent cache (3846M -> 3615M, or about -6%).

Finally, this also addresses a narrow race condition where two racing calls to `get_persistent_task_id` for the same task could result in two entries being pushed to the `persistent_task_log`.

## Why?

Currently we encode 2 copies of every `CachedTaskType` in the database:

1. as the key of the `TaskType`->`TaskId` map (aka the `TaskCache` keyspace)
2. as part of the `TaskStorage` struct stored in the `TaskData` keyspace

This redundancy is wasteful. Instead we can make the `TaskCache` map much smaller and add a bit of complexity to lookups.

## Future Work

### Better hash functions

Right now, to compute the hashes we just run `encode` and then hash the bytes. This is not optimal, but we do not have a hash function suitable for this use case. We should create a new `PersistentHash` trait that TaskInputs implement in order to support this without encoding, or perhaps a custom _encoder_ that accumulates encoding data into a hasher. This will be addressed in #89059.

### New SST file heuristics

Now that the `TaskCache` keyspace is smaller, our heuristics on 'maximum number of keys in a file' need to be rethought, since we are now producing lots of 7Mb files for the TaskCache. This will be addressed in #89058.

### New compaction semantics

Right now we tolerate duplicates in the database, but compaction will delete them. This is not too harmful for us, since it means that if there is a hash collision we will tend to lose one of the results over time. A better option would be to change compaction semantics for this KeySpace to either tolerate duplicates, leverage values for comparison, or something wilder where we simply 'recompute' the TaskCache instead of compacting it. This will be addressed in #89075.
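The cascading lookup described above can be sketched roughly like this (the struct and field names are hypothetical stand-ins for the real keyspaces, not the actual Turbopack API): the `TaskCache` maps a 64-bit hash to one or more task ids, and the reader restores candidate tasks until the stored type matches.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

// Stand-in for the encoded CachedTaskType.
#[derive(Clone, PartialEq, Eq, Hash)]
struct CachedTaskType(String);

struct Store {
    // hash -> task ids; collisions mean multiple ids can share a hash.
    task_cache: HashMap<u64, Vec<u32>>,
    // task id -> full task data, which includes its CachedTaskType.
    task_data: HashMap<u32, CachedTaskType>,
}

fn key_hash(ty: &CachedTaskType) -> u64 {
    let mut h = DefaultHasher::new();
    ty.hash(&mut h);
    h.finish()
}

impl Store {
    fn insert(&mut self, id: u32, ty: CachedTaskType) {
        self.task_cache.entry(key_hash(&ty)).or_default().push(id);
        self.task_data.insert(id, ty);
    }

    fn lookup(&self, ty: &CachedTaskType) -> Option<u32> {
        // Step 1: read the task ids that match the hash.
        let candidates = self.task_cache.get(&key_hash(ty))?;
        // Step 2: restore each candidate until one's type matches ours.
        candidates
            .iter()
            .copied()
            .find(|id| self.task_data.get(id) == Some(ty))
    }
}

fn main() {
    let mut store = Store { task_cache: HashMap::new(), task_data: HashMap::new() };
    store.insert(1, CachedTaskType("module_a".into()));
    store.insert(2, CachedTaskType("module_b".into()));
    assert_eq!(store.lookup(&CachedTaskType("module_b".into())), Some(2));
    assert_eq!(store.lookup(&CachedTaskType("missing".into())), None);
}
```

On a cache hit the second read is work the caller was going to do anyway (restoring the task's data), which is why this scheme only shifts _when_ that read happens rather than adding one.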
We only store hashes for TaskTypes in the TaskCache keyspace, so this introduces an allocation-free path to compute them.
For writing out the TaskCache this won't make much of a difference, since we were already reusing a scratch buffer, but it should eliminate a source of allocations from the read path.
This does, of course, cause a small binary size regression:
157M (160,952K) -> 158M (161,800K) = +1M (+848K)
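To illustrate the difference between the two paths (a minimal sketch with made-up function and field names, not the actual Turbopack code): the old path encodes the value into a scratch buffer and hashes the buffer, while the allocation-free path streams the same bytes straight into the hasher.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::Hasher;

// Allocating path: encode the fields into a Vec<u8>, then hash the bytes.
fn hash_via_buffer(id: u32, name: &str) -> u64 {
    let mut buf = Vec::new();
    buf.extend_from_slice(&id.to_le_bytes());
    buf.extend_from_slice(name.as_bytes());
    let mut h = DefaultHasher::new();
    h.write(&buf);
    h.finish()
}

// Allocation-free path: feed the same byte stream to the hasher directly.
// As long as both paths observe identical bytes, the hashes agree.
fn hash_direct(id: u32, name: &str) -> u64 {
    let mut h = DefaultHasher::new();
    h.write(&id.to_le_bytes());
    h.write(name.as_bytes());
    h.finish()
}

fn main() {
    assert_eq!(hash_via_buffer(42, "task"), hash_direct(42, "task"));
}
```

The write path already amortized the buffer with a reusable scratch allocation, which is why the win here is mostly on the read path, where each lookup previously had to materialize the encoded key.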