A reproducible investigation into why N child processes doing concurrent large-buffer allocations (fs.readFileSync, Buffer.allocUnsafe, addon mallocs, etc.) are measurably faster than N Worker threads doing the same work in a single Node.js process — and a direct kernel-side measurement of the cause.
fs.readFileSync is used as the canonical reproducer because it's the easiest to instrument, but the finding generalizes to any allocation-heavy workload, including native addons. See the Scope section below.
Root cause: per-mm_struct mmap_lock contention in the Linux kernel. Worker threads share one address space (one mm_struct, one mmap_lock); separate processes each have their own. Confirmed by bpftrace: workers wait ~17× longer per mmap_lock acquisition than equivalent processes (~29 μs vs ~1.7 μs on average, with a fat tail to milliseconds), and accumulate ~40× more total wait time for the same workload.
The contention triggers every time glibc grows/shrinks an arena — which happens often when many threads hammer malloc/free with Buffer.allocUnsafe(size) from readFileSync. The 128KB M_MMAP_THRESHOLD is not the only thing that triggers mmap; arena growth does too.
readFileSync is just the easiest way to reproduce this. The underlying mechanism — mmap_lock write contention on a shared mm_struct — fires for any allocation hot path inside a multi-Worker process. If your code (or a dependency) allocates and frees buffers above the per-thread-cache threshold under concurrency, you will see the same slowdown.
Concrete examples of code paths that hit the same kernel lock:
- `Buffer.allocUnsafe(size)` / `Buffer.alloc(size)` for `size` ≥ ~4KB (at or above half the default 8KB `Buffer.poolSize`, so the allocation bypasses the internal Buffer pool). Anything that builds large buffers in a hot loop is affected: body parsers, stream chunkers, base64/hex codecs, framing protocols.
- `crypto`: `randomBytes`, `pbkdf2`, `scrypt`, `createHash`/`createHmac` final outputs, AES/GCM encrypt/decrypt buffers, key derivation, certificate parsing.
- `zlib`: `deflate`/`inflate`/`brotliCompress` allocate large output buffers per operation.
- `JSON.parse`/`JSON.stringify` on large objects: V8 allocates large strings and objects in its own heap spaces, which it grows with `mmap` in the same shared address space.
- HTTP / WebSocket frame assembly: body buffering, large header parsing, `req.body` materialization.
- WASM linear memory growth: `WebAssembly.Memory.grow()` is literally `mmap`/`mremap` and takes `mmap_lock` for write.
- V8 GC compaction / large-object-space allocation: V8 itself calls `mmap` for backing store and for trusted-space pages.
Native addons are equally vulnerable — often more so, because they bypass the V8 ArrayBufferAllocator's per-isolate accounting and call malloc/new/mmap directly:
- Anything using N-API's `napi_create_buffer`, `napi_create_arraybuffer`, or `napi_create_external_buffer` for large buffers.
- Common addons that allocate frequently: `sharp` (libvips image buffers), `canvas` (Cairo/pixman surfaces), `node-sass`/`sass-embedded`, `bcrypt` (work buffers), `@node-rs/*` crates, `better-sqlite3` result rows, `couchbase`/`mongodb` native BSON paths, ML/ONNX runtimes, FFmpeg/GStreamer bindings.
- Any addon that calls `std::vector::resize`, `new T[n]`, `malloc(n)`, or `mmap()` from a Worker: those calls hit the same `mmap_lock` from the same `mm_struct` as JS-side allocations, just through a different layer.
- Threadpool-using addons (those that call `napi_create_async_work`/`uv_queue_work`) execute on libuv threadpool threads, which share `mm_struct` with all Workers, so the contention is process-wide, not per-Worker.
This means the mitigation hierarchy is the same regardless of source:
- Reuse buffers wherever the hot path allows (preallocate once, reset between uses). Applies to both JS Buffers and addon-side allocations; many addons expose APIs that accept a destination buffer for exactly this reason (e.g. `crypto.randomFillSync(buf)` vs `crypto.randomBytes(n)`; `sharp(...).raw().toBuffer({resolveWithObject: true})` vs `.toBuffer()`).
- Process-shard hot workloads: use `node:cluster` or runtime-managed worker processes (Platformatic Watt, PM2, custom forks) instead of `worker_threads` for CPU+allocation-bound work.
- Cap concurrency to physical cores: adding more Workers past your physical-core count only adds `mmap_lock` contenders without adding I/O parallelism. Workers don't scale with allocation hotness the way independent processes do.
- Audit addons with `bpftrace` (the same `mmap-lock.bt` works against any process) to identify which dependency is the loudest `mmap_lock` writer. Counts per PID + comm reveal the culprit.
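As a minimal sketch of the first item, the loop below refills one preallocated Buffer with `crypto.randomFillSync` instead of allocating a new one per iteration with `crypto.randomBytes`. The 64 KiB size, the iteration count, and the `checksum` helper are illustrative assumptions, not part of the benchmark suite.

```javascript
const crypto = require('node:crypto');

const SIZE = 64 * 1024;   // illustrative, matches the 64KB benchmark case
const ITERATIONS = 100;   // illustrative

// Hypothetical consumer so the buffer contents are actually used.
function checksum(buf) {
  let sum = 0;
  for (let i = 0; i < buf.length; i++) sum = (sum + buf[i]) >>> 0;
  return sum;
}

// Allocating variant: one fresh 64 KiB Buffer per iteration.
function withFreshBuffers() {
  let total = 0;
  for (let i = 0; i < ITERATIONS; i++) {
    total = (total + checksum(crypto.randomBytes(SIZE))) >>> 0;
  }
  return total;
}

// Reusing variant: a single Buffer allocated once and refilled in place.
const scratch = Buffer.allocUnsafe(SIZE);
function withReusedBuffer() {
  let total = 0;
  for (let i = 0; i < ITERATIONS; i++) {
    crypto.randomFillSync(scratch); // overwrites every byte, no new allocation
    total = (total + checksum(scratch)) >>> 0;
  }
  return total;
}
```

The reusing variant is only safe when each iteration finishes with the buffer before the next refill; if a consumer keeps a reference, it needs its own copy.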
If you are building a Node.js addon or library that allocates buffers on a hot path: provide an API variant that accepts a pre-allocated buffer. This is a real performance lever for downstream multi-Worker consumers.
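One possible shape for such an API variant is sketched below. `encodeFrame` and its length-prefixed wire format are hypothetical, chosen only to illustrate an optional destination parameter; they do not come from any real library.

```javascript
// Hypothetical framing helper: 4-byte big-endian length prefix + payload.
// `dest` is optional; passing one lets hot paths avoid a per-call allocation.
function encodeFrame(payload, dest) {
  const needed = 4 + payload.length;
  const out = dest ?? Buffer.allocUnsafe(needed);
  if (out.length < needed) {
    throw new RangeError(`destination too small: need ${needed} bytes, got ${out.length}`);
  }
  out.writeUInt32BE(payload.length, 0);
  payload.copy(out, 4);
  // Return a view of exactly the bytes written; subarray() does not copy.
  return out.subarray(0, needed);
}

// Usage: allocate once, reuse across calls.
const frameScratch = Buffer.allocUnsafe(64 * 1024);
const frame = encodeFrame(Buffer.from('hello'), frameScratch);
console.log(frame.length, frame.readUInt32BE(0)); // 9 5
```

Keeping the allocating path as the default preserves the simple call for casual users, while multi-Worker consumers can opt in to reuse. The returned view aliases `dest`, so callers must finish with it before the next call.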
| Mitigation | Effect | Why it works |
|---|---|---|
| 1. Reuse a Buffer (`openSync` + `readSync` into a pre-allocated buffer) | Up to 5× throughput, gap nearly disappears | Removes per-iteration allocation → no `malloc` → no `mmap_lock` |
| 2. Raise `Buffer.poolSize` per Worker (e.g. 256KB for 64KB files) | +32% workers throughput, gap closes from 0.53 → 0.81 | Routes `Buffer.allocUnsafe` allocations through the in-realm slab pool; reduces `mmap_lock` write count by 2.5× (10,500 → 4,137). Only works when fileSize < poolSize/2 AND pool stays under glibc `M_MMAP_THRESHOLD` (~128KB-1MB sweet spot) |
| 3. Shard work across child processes instead of Workers | Restores procs-fast behavior | Each process has its own `mm_struct` → no shared lock |
| `MALLOC_MMAP_MAX_=0 MALLOC_TRIM_THRESHOLD_=-1` | Marginal (~5-7% in independent re-runs, within noise; an early single run showed +30% but didn't replicate) | In theory stops glibc from mmap/munmap'ing freed chunks back to kernel; in practice arena growth still dominates |
| jemalloc (`LD_PRELOAD`) | Marginal at 64KB, worse at 8MB | Not worth it |
| THP mode (`always`/`never`/`madvise`) | Within noise (~10% variance) | `mmap_lock` taken once per syscall regardless of page size |
| Async `readFile` | Workers unchanged; procs get slower (CPU oversubscription) | Async doesn't avoid the kernel lock; it moves the syscall to libuv threadpool threads, which still share `mm_struct` |
The only mitigation that genuinely eliminates the contention is buffer reuse. Everything else either dodges it (separate processes) or marginally reduces it.
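For the "separate processes" route, a minimal sketch of sharding an allocation-heavy job across child processes follows. The per-shard job (allocating 64 KiB buffers), the `SHARD_ITERS` environment variable, and the function names are assumptions made for illustration; the repository's `run-procs.js` may be structured differently.

```javascript
const { spawn } = require('node:child_process');

// Hypothetical per-shard job: allocate SHARD_ITERS fresh 64 KiB buffers and
// report the total number of bytes allocated on stdout.
const childSource = `
  const iters = Number(process.env.SHARD_ITERS);
  let bytes = 0;
  for (let i = 0; i < iters; i++) bytes += Buffer.allocUnsafe(64 * 1024).length;
  process.stdout.write(String(bytes));
`;

// Run one shard in its own process, i.e. its own mm_struct and mmap_lock.
function runShard(iters) {
  return new Promise((resolve, reject) => {
    const child = spawn(process.execPath, ['-e', childSource], {
      env: { ...process.env, SHARD_ITERS: String(iters) },
      stdio: ['ignore', 'pipe', 'inherit'],
    });
    let out = '';
    child.stdout.on('data', (chunk) => { out += chunk; });
    child.on('error', reject);
    child.on('close', (code) => {
      if (code === 0) resolve(Number(out));
      else reject(new Error(`shard exited with code ${code}`));
    });
  });
}

// Start k shards concurrently and collect their results.
function runSharded(k, itersPerShard) {
  return Promise.all(Array.from({ length: k }, () => runShard(itersPerShard)));
}
```

Calling `runSharded(8, 2000)` would mirror the k=8, iter=2000 shape used in the benchmarks, with each shard contending only with itself for its address-space lock.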
- OS: Linux 6.8 (Ubuntu)
- CPU: Intel i7-7700 (4C/8T, single NUMA node)
- Node: built from the `nodejs/node` main branch (`27.0.0-pre`); same results expected on stable Node 20+
- Allocator: glibc 2.39 (Ubuntu 24.04 default)
- bpftrace: 0.20.2 with `mmap_lock:*` tracepoints available
- Default kernel built without `CONFIG_LOCK_STAT`, so we use `bpftrace` rather than `/proc/lock_stat`
The benchmark targets are pre-generated random files at /tmp/fs-contention-files/{small,med,large,huge} (128B, 64KB, 1MB, 8MB). run.sh creates them.
Reading the Node source confirms the sync readFileSync path bypasses libuv's threadpool entirely (uv_fs_open/read/close(nullptr, ...) runs uv__fs_work inline on the calling thread). Each Worker has its own V8 Isolate, libuv loop, ArrayBuffer allocator, and Permission object. No process-global Node-level locks are taken on the sync read path.
So Node itself is innocent. The contention must be in something workers implicitly share: the kernel, libc, or hardware.
run-workers.js and run-procs.js spawn K parallel JS contexts (Worker threads or child processes), each doing N readFileSync iterations on the same file. A SharedArrayBuffer barrier (for workers) ensures the timed loops start in sync.
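A start barrier of this kind can be sketched with `Atomics` on a `SharedArrayBuffer`. This is an illustrative design under stated assumptions (a two-slot `Int32Array`, the last arrival releasing the others), not necessarily how `task.js` implements it; `workerSource` and `runWithBarrier` are hypothetical names.

```javascript
const { Worker } = require('node:worker_threads');

// Worker body, passed as a string so the sketch fits in a single file.
const workerSource = `
  const { parentPort, workerData } = require('node:worker_threads');
  const state = new Int32Array(workerData.sab); // [0] = arrived count, [1] = release flag

  const arrived = Atomics.add(state, 0, 1) + 1; // add() returns the previous value
  if (arrived === workerData.k) {
    Atomics.store(state, 1, 1);                 // last arrival opens the barrier
    Atomics.notify(state, 1);                   // wake every waiter
  } else {
    Atomics.wait(state, 1, 0);                  // sleep only while the flag is still 0
  }

  // The timed loop would start here, after all k Workers have arrived.
  parentPort.postMessage({ arrived, seenAtStart: Atomics.load(state, 0) });
`;

function runWithBarrier(k) {
  const sab = new SharedArrayBuffer(2 * Int32Array.BYTES_PER_ELEMENT);
  return new Promise((resolve, reject) => {
    const results = [];
    for (let i = 0; i < k; i++) {
      const worker = new Worker(workerSource, { eval: true, workerData: { sab, k } });
      worker.on('error', reject);
      worker.on('message', (msg) => {
        results.push(msg);
        if (results.length === k) resolve(results);
      });
    }
  });
}
```

Because `Atomics.wait` returns immediately when the flag is no longer 0, a Worker that arrives after the release cannot miss the wake-up.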
Results at k=8, file=64KB, iter=2000 (the worst case we found):
workers k=8 iter=2000 throughput=79,027 ops/s
procs k=8 iter=2000 throughput=146,526 ops/s ← procs are 1.85× faster
Reproduced. Worker/procs ratio: 0.54.
Test by toggling MALLOC_ARENA_MAX:
| Setting | workers ops/s | procs ops/s | ratio |
|---|---|---|---|
| default | 76,775 | 141,259 | 0.54 |
| `MALLOC_ARENA_MAX=1` | 96,403 | 138,477 | 0.70 |
| `MALLOC_ARENA_MAX=64` | 94,128 | 153,064 | 0.61 |
Arena count modulates total throughput slightly but the worker/procs ratio doesn't close. Arena lock is a contributor but not the dominant cause.
Test by lowering M_MMAP_THRESHOLD so every 64KB allocation goes through mmap:
MALLOC_MMAP_THRESHOLD_=4096:
workers k=8: 30,769 ops/s ← collapse
procs k=8: 94,045 ops/s
ratio: 0.33
When forced to mmap every allocation, workers slow down dramatically more than processes (ratio drops from 0.54 to 0.33). This is the smoking gun for mmap_lock contention.
64KB k=8:
glibc workers: 85,155 glibc procs: 141,957 (ratio 0.60)
jemalloc workers: 90,619 jemalloc procs: 124,621 (ratio 0.73)
8MB k=8:
glibc workers: 1,515 glibc procs: 1,640 (ratio 0.92)
jemalloc workers: 1,144 jemalloc procs: 1,210 (ratio 0.95) ← lower absolute throughput than glibc
jemalloc closes the small-file gap mostly by making processes slower, and it actively regresses on large files (its large-chunk strategy doesn't fit this workload). Not a recommended fix.
vm.max_map_count is already at 1M (not the limit). Tested THP always/madvise/never:
THP=always: workers 81,774 procs 139,966 ratio 0.58
THP=madvise: workers 90,206 procs 132,356 ratio 0.68
THP=never: workers 92,394 procs 145,795 ratio 0.63
All within run-to-run noise. THP doesn't help because mmap_lock is taken once per syscall regardless of the page size that backs the mapping.
Buffer.poolSize (default 8KB) is per-realm, not per-process — lib/buffer.js is loaded fresh in each Worker and the slab variables (allocPool, poolOffset, etc.) live in module scope. So raising it on a Worker is a zero-coordination, per-Worker change.
When Buffer.allocUnsafe(size) is called with size < (Buffer.poolSize >>> 1), it slices from a shared in-realm slab instead of allocating a fresh ArrayBuffer. The slab refills via one createUnsafeBuffer(poolSize) call per ~poolSize/avgAlloc operations — batching many small allocations into one large kernel-visible event.
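The slab behaviour can be observed from JavaScript, since a pooled Buffer is a view onto a larger `ArrayBuffer`. The sketch below assumes the pooling logic described above (as found in current `lib/buffer.js`) and uses the 256KB pool size from the sweep that follows.

```javascript
// Buffer.poolSize is consulted on each allocUnsafe() call, so the new value
// applies from the next allocation onward.
Buffer.poolSize = 256 * 1024;

const a = Buffer.allocUnsafe(64 * 1024);    // 64 KiB < 128 KiB: sliced from the slab
const b = Buffer.allocUnsafe(64 * 1024);    // next slice of the same slab
const big = Buffer.allocUnsafe(128 * 1024); // not < poolSize >>> 1: own ArrayBuffer

console.log(a.buffer === b.buffer);  // true: both are views onto one slab
console.log(a.buffer.byteLength);    // 262144: the slab size, not the Buffer size
console.log(b.byteOffset);           // 65536: b starts where a ends
console.log(big.buffer.byteLength);  // 131072: a dedicated allocation
```

Two 64 KiB requests were served by a single 256 KiB backing allocation, which is the batching effect measured in the sweep.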
Sweep at 64KB k=8 (workers / procs throughput in ops/s, ratio):
| `Buffer.poolSize` | workers | procs | ratio | notes |
|---|---|---|---|---|
| 8KB (default) | 79,708 | 151,100 | 0.53 | baseline — 64KB allocations bypass pool |
| 256KB | 105,031 | 130,364 | 0.81 | sweet spot — pool stays under glibc mmap threshold |
| 1MB | 90,101 | 120,722 | 0.75 | |
| 4MB | 76,891 | 114,283 | 0.67 | pool refills now mmap (4MB > 128KB threshold) |
| 16MB | 93,264 | 107,290 | 0.87 | |
bpftrace confirms the mechanism — workers with 256KB pool show:
- `mmap_lock` write count: 10,500 → 4,137 (2.5× reduction)
- Cumulative write wait: 360 ms → 139 ms (2.6× reduction)
- Per-acquisition wait unchanged (~34 μs): the few remaining acquisitions still contend just as hard; we just have fewer of them
This is a real fix that doesn't require restructuring the app: a single Buffer.poolSize = 256 * 1024 at Worker startup. Caveats: doesn't help for files > ~1MB (the allocation always exceeds poolSize/2 no matter what); too-large pools (≥ 4MB) become counterproductive because pool refills themselves mmap; the pool occupies RSS for the lifetime of any sliced Buffer that's still referenced.
Tested file sizes:
- 64KB: big win (above)
- 1MB: no significant change; 1MB allocations exceed `M_MMAP_THRESHOLD` regardless of pooling
- 8MB: regression with 32MB pool (workers 1,519 → 880 ops/s). Pool refills become a worse `mmap_lock` source than direct allocations.
The 8KB default was set in May 2015 (Trevor Norris, commit 63da0dfd3a44, "buffer: implement Uint8Array backed Buffer") and hasn't been touched since. In ten years, typical HTTP frame sizes (16KB-1MB for HTTP/2), JSON payload sizes, and machine RAM have all grown ~10×. The 8KB default predates the dominant modern allocation patterns.
There's also a more subtle issue: the pool check is size < (Buffer.poolSize >>> 1). With default poolSize=8KB, the threshold is 4KB — and the strict inequality means a 4KB allocation itself bypasses the pool. So the current default helps allocations from 1B to 3.99KB and abruptly stops helping at exactly 4KB — precisely where many real allocation sizes land (HTTP frames, page-aligned chunks, small file reads).
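The boundary is easy to check directly. In this sketch, `isPooled` is a hypothetical helper that relies on the observation that a pooled Buffer is a view onto a larger `ArrayBuffer`, while an unpooled one owns an `ArrayBuffer` of exactly its own length.

```javascript
// The default value, set explicitly so the sketch is self-describing.
Buffer.poolSize = 8 * 1024;

// Hypothetical helper: true when `buf` is a slice of a larger backing store.
function isPooled(buf) {
  return buf.buffer.byteLength !== buf.length;
}

const threshold = Buffer.poolSize >>> 1;             // 4096
const justUnder = Buffer.allocUnsafe(threshold - 1); // 4095 bytes
const exactly = Buffer.allocUnsafe(threshold);       // 4096 bytes

console.log(threshold, isPooled(justUnder), isPooled(exactly)); // 4096 true false
```

A one-byte difference in request size therefore decides whether the allocation is served from the slab or goes to the allocator on its own.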
Workers throughput sweep at k=8 across file sizes and pool sizes (ops/s):
| File | 8KB (default) | 16KB | 32KB | 64KB | 128KB | 256KB | procs reference |
|---|---|---|---|---|---|---|---|
| 512B | 404k | 460k | 453k | 431k | 455k | 446k | 496k |
| 2KB | 360k | 382k | 410k | 367k | 417k | 411k | 472k |
| 4KB | 326k | 332k | 326k | 360k | 372k | 382k | 413k |
| 8KB | 202k | 187k | 208k | 254k | 232k | 271k | 302k |
| 16KB | 148k | 147k | 150k | 181k | 189k | 243k | 284k |
| 64KB | 86k | 87k | 80k | 87k | 88k | 108k | 142k |
| 1MB | 12k | 13k | 12k | 12k | 12k | 13k | 12k |
Observations:
- 64KB pool is a Pareto-near improvement over 8KB. Wins at 8KB files (+26%) and 16KB files (+23%); ties everywhere else within noise. RSS cost: +56KB per realm.
- 256KB pool is more aggressive but still Pareto. Wins at every multi-KB file size — +17% at 4KB, +34% at 8KB, +64% at 16KB, +27% at 64KB. Never regresses. RSS cost: +248KB per realm (~2MB across 8 Workers).
- 1MB file row is unchanged across all pools. Large allocations bypass the pool regardless, so no risk.
A bump from 8KB → 64KB (or 128KB) would be a low-risk, near-Pareto improvement for the entire Node ecosystem at trivial RSS cost. Worth proposing upstream.
Use openSync + readSync into a pre-allocated buffer instead of readFileSync:
64KB k=8:
workers-reuse: 393,321 ops/s ← 5× faster than readFileSync workers
procs-reuse: 484,380 ops/s ← 3.3× faster than readFileSync procs
ratio: 0.81 ← gap nearly closes
1MB k=8:
workers-reuse: 39,133 ops/s
procs-reuse: 35,376 ops/s ← workers now FASTER
ratio: 1.11
Same syscalls (open/read/close), no allocation. Confirms that the allocation is the cost, not the I/O.
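A minimal sketch of the reuse pattern is shown below. It is written from the description above rather than copied from `task-readsync.js`, so the function name, the short-read loop, and the temporary demo file are assumptions.

```javascript
const fs = require('node:fs');
const os = require('node:os');
const path = require('node:path');

// Read `filePath` `iterations` times into one preallocated Buffer.
// Returns the number of bytes read on the last iteration.
function readLoopWithReuse(filePath, iterations, scratch) {
  let bytesRead = 0;
  for (let i = 0; i < iterations; i++) {
    const fd = fs.openSync(filePath, 'r');
    try {
      bytesRead = 0;
      let n;
      // readSync may return fewer bytes than requested, so loop until EOF
      // or until the scratch buffer is full.
      while (bytesRead < scratch.length &&
             (n = fs.readSync(fd, scratch, bytesRead, scratch.length - bytesRead, bytesRead)) > 0) {
        bytesRead += n;
      }
    } finally {
      fs.closeSync(fd);
    }
  }
  return bytesRead;
}

// Usage with a throwaway 64 KiB file (hypothetical path under os.tmpdir()).
const demoFile = path.join(os.tmpdir(), `reuse-demo-${process.pid}`);
fs.writeFileSync(demoFile, Buffer.alloc(64 * 1024, 0xab));
const readScratch = Buffer.allocUnsafe(64 * 1024); // allocated once, outside the loop
const lastRead = readLoopWithReuse(demoFile, 100, readScratch);
fs.unlinkSync(demoFile);
console.log(lastRead); // 65536
```

The scratch buffer must be at least as large as the file; anything beyond its length is silently left unread in this sketch.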
64KB k=8:
| Mode | workers | procs | ratio |
|---|---|---|---|
| sync | 79,027 | 146,526 | 0.54 |
| async serial (conc=1) | 35,449 | 41,392 | 0.86 |
| async parallel (conc=8) | 80,216 | 60,855 | 1.32 |
Async does not help workers (80k ops/s vs 79k sync, effectively the same). The ratio "closes" only because async drags procs down: each of the 8 processes adds libuv's default 4 threadpool threads to its main thread, so ~40 threads compete for 8 logical cores (CPU oversubscription).
Theoretically consistent: async only moves which thread issues the syscall — it doesn't change the per-mm_struct lock that the kernel takes.
mmap-lock.bt attaches to the mmap_lock:mmap_lock_start_locking and mmap_lock:mmap_lock_acquire_returned tracepoints to measure per-acquisition wait time (start → acquired). Filtered to Node-relevant comms (node-MainThread, WorkerThread, libuv-worker).
Sync, 64KB, k=8:
| Metric | workers | procs (×9 PIDs) | notes |
|---|---|---|---|
| write acquisitions | 10,572 | 4,711 | |
| total write wait | 311 ms | 8 ms | |
| avg write wait | 29,389 ns | 1,662 ns | workers wait ~17× longer |
| max write wait | 2-4 ms | 1-2 ms (rare) | |
Worker write-wait histogram (showing the contention shape):
[1K, 2K) 1760 ← uncontended floor
[16K, 32K) 1782 ← clearly blocking
[32K, 64K) 2177 ← heavy contention
[64K, 128K) 386
[128K, 256K) 232
[256K, 512K) 41
[512K, 1M) 43
[1M, 2M) 14
[2M, 4M) 1 ← worker slept 2-4 ms waiting for the lock
Procs are tightly clustered at 1-2K ns (the uncontended atomic-acquire floor). Workers show a bimodal distribution with a fat tail — classic rwsem blocking behavior.
Async, 64KB, k=8:
| Metric | workers | procs (×9 PIDs) | notes |
|---|---|---|---|
| write acquisitions | 4,332 | 2,492 | |
| total write wait | 89 ms | 4 ms | |
| avg write wait | 20,642 ns | 1,700 ns | workers still wait ~12× longer |
Async reduces contention by ~3× but does not eliminate it. Same kind of lock, fewer events because async staggers allocations across libuv threadpool threads.
The cost workers pay vs procs at 64KB sync (~311 ms of cumulative mmap_lock write waits vs ~8 ms in the sync trace above) is the same order of magnitude as the wall-clock difference. Theory and measurement agree.
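The order-of-magnitude comparison can be checked with a few lines of arithmetic on the figures reported above. Two assumptions are made here: that the reported throughput is aggregate (total operations divided by wall-clock time), and that the cumulative wait is spread evenly across the 8 Workers.

```javascript
// Figures copied from the sync 64KB k=8 results and the sync bpftrace trace.
const k = 8;
const iterations = 2000;
const totalOps = k * iterations;                // 16,000 reads per run

const workersOpsPerSec = 79027;
const procsOpsPerSec = 146526;
const workersWallMs = (totalOps / workersOpsPerSec) * 1000;
const procsWallMs = (totalOps / procsOpsPerSec) * 1000;
const wallGapMs = workersWallMs - procsWallMs;  // extra wall-clock time for workers

const workersWriteWaitMs = 311;                 // cumulative, summed over all threads
const perThreadWaitMs = workersWriteWaitMs / k; // assumed even spread across Workers

const avgWaitRatio = 29389 / 1662;              // per-acquisition wait, workers vs procs
const totalWaitRatio = 311 / 8;                 // cumulative wait, workers vs procs

console.log({
  wallGapMs: wallGapMs.toFixed(1),
  perThreadWaitMs: perThreadWaitMs.toFixed(1),
  avgWaitRatio: avgWaitRatio.toFixed(1),
  totalWaitRatio: totalWaitRatio.toFixed(1),
});
```

Under those assumptions the wall-clock gap comes out near 93 ms against roughly 39 ms of lock wait per Worker thread, and the two ratios land near 17.7× and 38.9×, in line with the ~17× and ~40× headline figures.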
Why so many mmap_lock writes at 64KB? glibc's arena segment growth, madvise(DONTNEED) on free chunks, and contention-fallback mmap calls — none of which we control from JS. The trace shows 10,572 writes from one shared mm_struct for the workers case vs ~560 per process for the 9-process case.
| File | Purpose |
|---|---|
| `task.js` | Sync `readFileSync` loop. Worker or child-process entry point. SharedArrayBuffer barrier when used as Worker. |
| `task-readsync.js` | Sync read using `openSync`/`readSync` into a pre-allocated buffer. Tests the buffer-reuse mitigation. |
| `task-readfile.js` | Async `fs.promises.readFile` loop. `CONCURRENCY=N` controls in-flight reads per worker/proc. |
| `run-workers.js` | Spawn K Worker threads running `task.js`, barrier-synchronize, report per-worker timings and aggregate throughput. |
| `run-procs.js` | Spawn K child processes running `task.js`. |
| `run-workers-reuse.js`, `run-procs-reuse.js` | Variants using `task-readsync.js`. |
| `run-workers-async.js`, `run-procs-async.js` | Variants using `task-readfile.js`. |
| `run.sh` | End-to-end suite: small/med/large/huge files × k=1/4/8 × default/`MALLOC_ARENA_MAX=1`/`=64`. |
| `run-thp.sh` | Toggles `/sys/kernel/mm/transparent_hugepage/enabled` between always/madvise/never and runs the 64KB/1MB/8MB k=8 cases. Requires sudo; restores original setting on exit. |
| `mmap-lock.bt` | bpftrace script: measures `mmap_lock` start→acquire latency, separated by write (mmap/munmap/brk) and read (page-fault) acquisitions. Filters on Node thread comms. |
All scripts are independent — pick whichever question you want to answer.
```bash
# Linux with mmap_lock tracepoints (kernel ≥ 5.8)
uname -r
# bpftrace and sudo (for tracing and THP toggling)
sudo apt install bpftrace        # Debian/Ubuntu
bpftrace --version               # need ≥ 0.16
# Node.js 20+ (for fs.promises and stable Worker threads)
node --version
# Optional: jemalloc for the allocator-swap test
sudo apt install libjemalloc2
# 8 cores recommended (we use k=8 in the canonical tests)
nproc
```

```bash
git clone git@github.com:platformatic/node-worker-mmap-lock-contention.git
cd node-worker-mmap-lock-contention
# Set NODE to whichever node binary you want to test.
# Defaults below assume system node.
export NODE=$(which node)
```

```bash
mkdir -p /tmp/fs-contention-files
dd if=/dev/urandom of=/tmp/fs-contention-files/small bs=128 count=1 status=none
dd if=/dev/urandom of=/tmp/fs-contention-files/med bs=65536 count=1 status=none
dd if=/dev/urandom of=/tmp/fs-contention-files/large bs=1048576 count=1 status=none
dd if=/dev/urandom of=/tmp/fs-contention-files/huge bs=1048576 count=8 status=none
```

```bash
# 64KB file is the contention sweet spot
$NODE run-workers.js /tmp/fs-contention-files/med 2000 8
$NODE run-procs.js /tmp/fs-contention-files/med 2000 8
```

Expected: workers ~70-90k ops/s, procs ~120-150k ops/s, ratio 0.55-0.65 (typical 0.60), procs at least 1.4× faster. If you don't see a gap of at least 1.4×, check that nothing else is competing for CPU.

```bash
$NODE run-workers-reuse.js /tmp/fs-contention-files/med 2000 8
$NODE run-procs-reuse.js /tmp/fs-contention-files/med 2000 8
```

Expected: both jump to 300-500k ops/s, ratio above 0.8.

```bash
NODE=$NODE bash run.sh
```

Walks through file sizes 128B / 64KB / 1MB / 8MB at k=1/4/8 with default, `MALLOC_ARENA_MAX=1`, and `MALLOC_ARENA_MAX=64`. Takes ~5 minutes.

```bash
MALLOC_MMAP_MAX_=0 MALLOC_TRIM_THRESHOLD_=-1 $NODE run-workers.js /tmp/fs-contention-files/med 2000 8
```

Expected: marginal, typically only ~5-10% improvement in repeated runs and often within run-to-run noise. (An early single run on this hardware showed +30% but did not replicate under independent re-verification.) Listed here for completeness, not as a recommended mitigation.

```bash
sudo bash run-thp.sh   # restores original THP setting on exit
```

Expected: all three modes within noise. Reported here only to rule it out.

```bash
LD_PRELOAD=/lib/x86_64-linux-gnu/libjemalloc.so.2 \
  $NODE run-workers.js /tmp/fs-contention-files/med 2000 8
```

Expected: small improvement at 64KB, regression at 8MB. Skip in production.
```bash
# Trace workers run
sudo bpftrace -o /tmp/bpf-workers.txt mmap-lock.bt \
  -c "$NODE run-workers.js /tmp/fs-contention-files/med 2000 8"
# Trace procs run
sudo bpftrace -o /tmp/bpf-procs.txt mmap-lock.bt \
  -c "$NODE run-procs.js /tmp/fs-contention-files/med 2000 8"
# Compare
echo "=== WORKERS ===" && cat /tmp/bpf-workers.txt
echo "=== PROCS ===" && cat /tmp/bpf-procs.txt
```

What to look for:
- `@cnt_write`: total `mmap_lock` write acquisitions. Workers ~10k; procs ~5k spread across 9 PIDs (`@cnt_per_pid_write[…]`).
- `@sum_write_ns`: total cumulative wait time. Expect workers at ~40× procs or higher (observed range across runs: 38-60×).
- `@wait_write_ns` histogram: workers will have a fat tail in 16K-128K ns; procs will be clustered at 1-2K ns.

If you see workers' write waits clustered in the 1-2K ns range, the trace didn't capture the right threads. Double-check that the comm filter in `mmap-lock.bt` matches your Node build's thread names (`/proc/PID/task/*/comm`).
```bash
CONCURRENCY=8 $NODE run-workers-async.js /tmp/fs-contention-files/med 2000 8
CONCURRENCY=8 $NODE run-procs-async.js /tmp/fs-contention-files/med 2000 8
# Trace it
sudo bpftrace -o /tmp/bpf-workers-async.txt mmap-lock.bt \
  -c "env CONCURRENCY=8 $NODE run-workers-async.js /tmp/fs-contention-files/med 2000 8"
cat /tmp/bpf-workers-async.txt
```

Expected: workers' throughput unchanged. `mmap_lock` waits reduced by ~3× (~89 ms vs 311 ms) but still 12-15× higher per acquisition than procs. Procs async gets slower than procs sync due to libuv threadpool CPU oversubscription.
The full 10-step guide above was independently re-executed by a fresh agent that did not see the original investigation notes — only the README. It re-ran every step against the same hardware and judged each against the README's predictions. Result summary:
| Step | Verdict | Notes |
|---|---|---|
| 3 — basic gap | ✅ matches | workers 77-88k / procs 138-141k, ratio 0.55-0.64 |
| 4 — buffer reuse fix | ✅ matches | workers-reuse 333-375k / procs-reuse 408-461k, ratio 0.81-0.92 |
| 5 — full matrix | ✅ matches | 64KB k=8 default: workers 77k / procs 122k, 1.58× gap |
| 6 — `MALLOC_MMAP_MAX_=0 …` | ❌ does not reproduce as claimed | only ~5-7% over baseline across 5 runs; the ≥15% claim was based on a single non-representative run |
| 7 — THP modes | ✅ matches | within 3% across always/madvise/never |
| 8 — jemalloc | ✅ matches | ~5-10% at 64KB, not dramatic |
| 9 — bpftrace smoking gun | ✅ matches decisively | workers @sum_write_ns = 298 ms vs procs 7.7 ms (38.7×); avg per acquisition 28,418 ns vs 1,611 ns (17.6×); tail mass above 16K ns: workers 4,583 events vs procs 4 events (>1000×) |
| 10 — async doesn't fix it | ✅ matches | workers-async avg wait 24,512 ns vs procs sync 1,611 ns = 15.2× — confirms async only reduces, doesn't eliminate, the contention |
Independent verdict on the central claim: the per-mm_struct mmap_lock contention thesis is confirmed by direct kernel measurement. The ~290 ms cumulative worker write-wait closely matches the wall-clock gap between workers and procs. Cause and effect line up.
Corrections applied based on validation:
- The headline "27× longer per acquisition" was a one-run outlier; corrected to ~17×, consistent with the body's measurement table and the validation re-run (17.6×).
- `MALLOC_MMAP_MAX_=0 MALLOC_TRIM_THRESHOLD_=-1` was downgraded in the mitigation ranking. It was promising on one early run but only ~5-7% on repeated trials.
- Headline ratio narrowed from "~0.54" to "0.55-0.65 (typical 0.60)" to reflect typical run-to-run variance rather than the best-case single number.
The mitigation ranking now correctly elevates buffer reuse (#1, the only thing that actually fixes it) and process sharding (#3, sidesteps the kernel lock entirely), with a larger Buffer.poolSize (#2) as a partial fix for mid-sized allocations; everything else is footnote-worthy at best on this kernel/glibc combination.
- All numbers are from a single i7-7700 (8 logical cores). On a higher-core-count machine the absolute contention may differ but the qualitative result should hold: `mmap_lock` is per-`mm_struct` everywhere.
- Tested with glibc 2.39. Other libcs (musl, jemalloc-replaced) have different arena strategies; results may differ. jemalloc was tested here and did not help.
- The Node binary used was built from main (`27.0.0-pre`). Same pattern should reproduce on Node 20.x / 22.x / 24.x, since the relevant code paths (`readFileSync`, libuv `uv_fs_*`, V8 ArrayBuffer allocator) have been stable.
- Kernel was Linux 6.8 (with maple-tree VMAs and per-VMA page-fault locks introduced in 6.4). The mmap/munmap write-side serialization on `mmap_lock` is unchanged on newer kernels at time of writing.
- We did not test with a `CONFIG_LOCK_STAT`-enabled kernel. That would give cleaner per-lock contention numbers without needing eBPF.
- We did not test on a system with multiple NUMA nodes. `mmap_lock` is per-`mm_struct`, so NUMA shouldn't change the per-process picture, but cross-socket cacheline bouncing could amplify it.
- Linux `mmap_lock` design: `kernel/Documentation/mm/process_addrs.rst`
- libuv sync fs path: `deps/uv/src/unix/fs.c`, `POST` macro at line 139
- Node `readFileUtf8`: `src/node_file.cc`, `ReadFileUtf8`
- glibc malloc arena behavior: `glibc/malloc/arena.c` and `MALLOC_TUNABLES(3)`
- bpftrace `mmap_lock` tracepoints: introduced kernel 5.8, `include/trace/events/mmap_lock.h`