Node.js Worker Threads vs Processes — Memory Allocation Contention Analysis

A reproducible investigation into why N child processes doing concurrent large-buffer allocations (fs.readFileSync, Buffer.allocUnsafe, addon mallocs, etc.) are measurably faster than N Worker threads doing the same work in a single Node.js process — and a direct kernel-side measurement of the cause.

fs.readFileSync is used as the canonical reproducer because it's the easiest to instrument, but the finding generalizes to any allocation-heavy workload, including native addons. See the Scope section below.

Root cause: per-mm_struct mmap_lock contention in the Linux kernel. Worker threads share one address space (one mm_struct, one mmap_lock); separate processes each have their own. Confirmed by bpftrace: workers wait ~17× longer per mmap_lock acquisition than equivalent processes (~29 μs vs ~1.7 μs on average, with a fat tail to milliseconds), and accumulate ~40× more total wait time for the same workload.

The contention triggers every time glibc grows/shrinks an arena — which happens often when many threads hammer malloc/free with Buffer.allocUnsafe(size) from readFileSync. The 128KB M_MMAP_THRESHOLD is not the only thing that triggers mmap; arena growth does too.

⚠ Scope: This Is Not An `fs` Problem

readFileSync is just the easiest way to reproduce this. The underlying mechanism — mmap_lock write contention on a shared mm_struct — fires for any allocation hot path inside a multi-Worker process. If your code (or a dependency) allocates and frees buffers above the per-thread-cache threshold under concurrency, you will see the same slowdown.

Concrete examples of code paths that hit the same kernel lock:

Buffer.allocUnsafe(size) at or above the pool cutoff (4KB with the default 8KB Buffer.poolSize), and Buffer.alloc(size), which never uses the pool. Anything that builds large buffers in a hot loop is affected: body parsers, stream chunkers, base64/hex codecs, framing protocols.
crypto — randomBytes, pbkdf2, scrypt, createHash/createHmac final outputs, AES/GCM encrypt/decrypt buffers, key derivation, certificate parsing.
zlib — deflate/inflate/brotliCompress allocate large output buffers per operation.
JSON.parse / JSON.stringify on large objects: V8 allocates the resulting strings and objects on its own heap, which grows by mapping new pages into the same shared address space.
HTTP / WebSocket frame assembly — body buffering, large header parsing, req.body materialization.
WASM linear memory growth — WebAssembly.Memory.grow() is literally mmap/mremap and takes mmap_lock for write.
V8 GC compaction / large-object-space allocation — V8 itself calls mmap for backing store and for trusted-space pages.

Native addons are equally vulnerable — often more so, because they bypass the V8 ArrayBufferAllocator's per-isolate accounting and call malloc/new/mmap directly:

Anything using N-API's napi_create_buffer, napi_create_arraybuffer, or napi_create_external_buffer for large buffers.
Common addons that allocate frequently: sharp (libvips image buffers), canvas (Cairo/pixman surfaces), node-sass/sass-embedded, bcrypt (work buffers), @node-rs/* crates, better-sqlite3 result rows, couchbase/mongodb native BSON paths, ML/ONNX runtimes, FFmpeg/GStreamer bindings.
Any addon that calls std::vector::resize, new T[n], malloc(n), or mmap() from a Worker — those calls hit the same mmap_lock from the same mm_struct as JS-side allocations, just through a different layer.
Threadpool-using addons (those that call napi_create_async_work / uv_queue_work) execute on libuv threadpool threads which share mm_struct with all Workers — so the contention is process-wide, not per-Worker.

This means the mitigation hierarchy is the same regardless of source:

Reuse buffers wherever the hot path allows (preallocate once, reset between uses). Applies to both JS Buffers and addon-side allocations — many addons expose APIs that accept a destination buffer for exactly this reason (e.g. crypto.randomFillSync(buf) vs crypto.randomBytes(n); sharp(...).raw().toBuffer({resolveWithObject: true}) vs .toBuffer()).
Process-shard hot workloads — node:cluster, or runtime-managed worker processes (Platformatic Watt, PM2, custom forks) instead of worker_threads for CPU+allocation-bound work.
Cap concurrency to physical cores — adding more Workers past your physical-core count only adds mmap_lock contenders without adding I/O parallelism. Workers don't scale with allocation hotness the way independent processes do.
Audit addons with bpftrace (the same mmap-lock.bt works against any process) to identify which dependency is the loudest mmap_lock writer. Counts per PID + comm reveal the culprit.

If you are building a Node.js addon or library that allocates buffers on a hot path: provide an API variant that accepts a pre-allocated buffer. This is a real performance lever for downstream multi-Worker consumers.
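A minimal sketch of the destination-buffer pattern, using the crypto.randomFillSync vs crypto.randomBytes pair mentioned in the list above. The names SCRATCH_SIZE, allocatingToken, and reusingToken are hypothetical and exist only for illustration; the 64KB size is an assumption chosen to match the benchmark's worst case.

```javascript
const crypto = require('node:crypto');

// Hypothetical size: large enough that a fresh allocation bypasses the Buffer pool.
const SCRATCH_SIZE = 64 * 1024;

// Allocated once per Worker (or process) and reused on the hot path.
const scratch = Buffer.allocUnsafe(SCRATCH_SIZE);

// Allocating variant: every call creates a new 64KB Buffer.
function allocatingToken() {
  return crypto.randomBytes(SCRATCH_SIZE);
}

// Reusing variant: fills the caller-supplied buffer in place
// and returns that same buffer object.
function reusingToken(dest = scratch) {
  return crypto.randomFillSync(dest);
}

const fresh = allocatingToken();
const reused = reusingToken();
console.log(fresh.length, reused === scratch);
```

The trade-off is ownership: the caller has to finish with the contents of scratch before the next call overwrites them, which is why a library would typically offer both variants rather than only the reusing one.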

TL;DR — Mitigation Ranking (Evidence-Based)

Mitigation	Effect	Why it works
1. Reuse a Buffer (`openSync` + `readSync` into a pre-allocated buffer)	Up to 5× throughput, gap nearly disappears	Removes per-iteration allocation → no `malloc` → no `mmap_lock`
2. Raise `Buffer.poolSize` per Worker (e.g. `256KB` for 64KB files)	+32% workers throughput, gap closes from 0.53 → 0.81	Routes `Buffer.allocUnsafe` allocations through the in-realm slab pool; reduces `mmap_lock` write count by 2.5× (10,500 → 4,137). Only works when `fileSize < poolSize/2` AND pool stays under glibc `M_MMAP_THRESHOLD` (~128KB-1MB sweet spot)
3. Shard work across child processes instead of Workers	Restores procs-fast behavior	Each process has its own `mm_struct` → no shared lock
~~4. `MALLOC_MMAP_MAX_=0 MALLOC_TRIM_THRESHOLD_=-1`~~	Marginal (~5-7% in independent re-runs, within noise; an early single run showed +30% but didn't replicate)	In theory stops glibc from `mmap`/`munmap`'ing freed chunks back to kernel — in practice arena growth still dominates
~~5. jemalloc (`LD_PRELOAD`)~~	Marginal at 64KB, worse at 8MB	Not worth it
~~6. THP tuning (`always`/`never`/`madvise`)~~	Within noise (~10% variance)	`mmap_lock` taken once per syscall regardless of page size
~~7. Switch sync → async `readFile`~~	Workers unchanged; procs get slower (CPU oversubscription)	Async doesn't avoid the kernel lock — it moves the syscall to libuv threadpool threads, which still share `mm_struct`

The only mitigation that genuinely eliminates the contention is buffer reuse. Everything else either dodges it (separate processes) or marginally reduces it.
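For mitigation 3, a rough sketch of sharding an allocation-heavy loop across child processes so that each shard gets its own address space. SHARDS, ITERATIONS, ALLOC_SIZE, and runShard are hypothetical names, and the inline child program stands in for whatever script a real application would fork.

```javascript
const { spawn } = require('node:child_process');

const SHARDS = 4;             // hypothetical shard count
const ITERATIONS = 200;       // hypothetical per-shard workload
const ALLOC_SIZE = 64 * 1024; // same size as the benchmark's worst case

// Inline child program so the sketch is self-contained.
// Each child allocates in its own mm_struct, so no mmap_lock is shared.
const childSource = `
  let bytes = 0;
  for (let i = 0; i < ${ITERATIONS}; i++) {
    bytes += Buffer.allocUnsafe(${ALLOC_SIZE}).length;
  }
  process.stdout.write(String(bytes));
`;

function runShard() {
  return new Promise((resolve, reject) => {
    const child = spawn(process.execPath, ['-e', childSource], {
      stdio: ['ignore', 'pipe', 'inherit'],
    });
    let out = '';
    child.stdout.on('data', (chunk) => { out += chunk; });
    child.on('error', reject);
    child.on('close', (code) => {
      if (code === 0) resolve(Number(out));
      else reject(new Error(`shard exited with code ${code}`));
    });
  });
}

// All shards run concurrently; results arrive once every child has exited.
const shardResults = Promise.all(Array.from({ length: SHARDS }, () => runShard()));
shardResults.then((totals) => console.log(totals));
```

The cost of this design is inter-process communication: results come back through pipes or IPC rather than shared memory, so it suits coarse-grained work units better than fine-grained ones.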

Setup

OS: Linux 6.8 (Ubuntu)
CPU: Intel i7-7700 (4C/8T, single NUMA node)
Node: built from the nodejs/node main branch (27.0.0-pre); same results expected on stable Node 20+
Allocator: glibc 2.39 (Ubuntu 24.04 default)
bpftrace: 0.20.2 with mmap_lock:* tracepoints available
Default kernel built without CONFIG_LOCK_STAT — so we use bpftrace rather than /proc/lock_stat

The benchmark targets are pre-generated random files at /tmp/fs-contention-files/{small,med,large,huge} (128B, 64KB, 1MB, 8MB). run.sh creates them.

The Investigation, In Order

Hypothesis 0: Theoretical analysis says workers should not contend

Reading the Node source confirms the sync readFileSync path bypasses libuv's threadpool entirely (uv_fs_open/read/close(nullptr, ...) runs uv__fs_work inline on the calling thread). Each Worker has its own V8 Isolate, libuv loop, ArrayBuffer allocator, and Permission object. No process-global Node-level locks are taken on the sync read path.

So Node itself is innocent. The contention must be in something workers implicitly share: the kernel, libc, or hardware.

Hypothesis 1: Reproduce the gap

run-workers.js and run-procs.js spawn K parallel JS contexts (Worker threads or child processes), each doing N readFileSync iterations on the same file. A SharedArrayBuffer barrier (for workers) ensures the timed loops start in sync.
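One way such a SharedArrayBuffer barrier can be built is sketched below. This is not necessarily how run-workers.js implements it: the two-slot layout, the inline worker source, and the names K, runWorker, and barrierRun are assumptions made for illustration.

```javascript
const { Worker } = require('node:worker_threads');

const K = 4; // hypothetical worker count
// Slot 0 counts arrivals; slot 1 is the "go" flag (0 = wait, 1 = run).
const sab = new SharedArrayBuffer(2 * Int32Array.BYTES_PER_ELEMENT);

const workerSource = `
  const { workerData, parentPort } = require('node:worker_threads');
  const barrier = new Int32Array(workerData.sab);
  const arrived = Atomics.add(barrier, 0, 1) + 1;
  if (arrived === workerData.k) {
    // Last arrival opens the gate and wakes every sleeper.
    Atomics.store(barrier, 1, 1);
    Atomics.notify(barrier, 1);
  } else {
    // Returns immediately with "not-equal" if the gate is already open.
    Atomics.wait(barrier, 1, 0);
  }
  const startedAt = process.hrtime.bigint();
  // The timed loop would go here.
  parentPort.postMessage(startedAt);
`;

function runWorker() {
  return new Promise((resolve, reject) => {
    const worker = new Worker(workerSource, {
      eval: true,
      workerData: { sab, k: K },
    });
    worker.once('message', resolve);
    worker.once('error', reject);
  });
}

const barrierRun = Promise.all(Array.from({ length: K }, () => runWorker()));
barrierRun.then((starts) => {
  const sorted = [...starts].sort((a, b) => (a < b ? -1 : a > b ? 1 : 0));
  console.log(`start skew: ${sorted[K - 1] - sorted[0]} ns`);
});
```

Having the last arrival open the gate keeps the main thread out of the critical path, so Worker startup cost is excluded from the timed region.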

Results at k=8, file=64KB, iter=2000 (the worst case we found):

workers k=8 iter=2000 throughput=79,027 ops/s
procs   k=8 iter=2000 throughput=146,526 ops/s    ← procs are 1.85× faster

Reproduced. Worker/procs ratio: 0.54.

Hypothesis 2: glibc malloc arena lock?

Test by toggling MALLOC_ARENA_MAX:

Setting	workers ops/s	procs ops/s	ratio
default	76,775	141,259	0.54
`MALLOC_ARENA_MAX=1`	96,403	138,477	0.70
`MALLOC_ARENA_MAX=64`	94,128	153,064	0.61

Changing the arena count in either direction lifts worker throughput by ~23-26%, but the worker/procs ratio only moves from 0.54 to 0.61-0.70 and does not close. Arena lock is a contributor but not the dominant cause.

Hypothesis 3: kernel `mmap_lock` contention via large `mmap` allocations

Test by lowering M_MMAP_THRESHOLD so every 64KB allocation goes through mmap:

MALLOC_MMAP_THRESHOLD_=4096:
  workers k=8: 30,769 ops/s     ← collapse
  procs   k=8: 94,045 ops/s
  ratio:       0.33

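When forced to mmap every allocation, workers slow down far more than processes: against the Hypothesis 1 baseline, workers lose ~61% of their throughput (79,027 → 30,769 ops/s) while procs lose ~36% (146,526 → 94,045 ops/s), and the ratio drops from 0.54 to 0.33. This is the smoking gun for mmap_lock contention.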

Hypothesis 4: jemalloc — does a better allocator fix it?

64KB k=8:
  glibc workers:    85,155     glibc procs:    141,957    (ratio 0.60)
  jemalloc workers: 90,619     jemalloc procs: 124,621    (ratio 0.73)

8MB k=8:
  glibc workers:    1,515      glibc procs:    1,640      (ratio 0.92)
  jemalloc workers: 1,144      jemalloc procs: 1,210      (ratio 0.95)   ← both ~25% slower than glibc

jemalloc closes the small-file gap mostly by making processes slower, and it actively regresses on large files (its large-chunk strategy doesn't fit this workload). Not a recommended fix.

Hypothesis 5: kernel tunables — THP, `vm.max_map_count`

vm.max_map_count is already at 1M (not the limit). Tested THP always/madvise/never:

THP=always:  workers 81,774  procs 139,966  ratio 0.58
THP=madvise: workers 90,206  procs 132,356  ratio 0.68
THP=never:   workers 92,394  procs 145,795  ratio 0.63

All within run-to-run noise. THP doesn't help because mmap_lock is taken once per syscall regardless of the page size that backs the mapping.

Hypothesis 5b: shrink the allocation frequency via `Buffer.poolSize`

Buffer.poolSize (default 8KB) is per-realm, not per-process — lib/buffer.js is loaded fresh in each Worker and the slab variables (allocPool, poolOffset, etc.) live in module scope. So raising it on a Worker is a zero-coordination, per-Worker change.

When Buffer.allocUnsafe(size) is called with size < (Buffer.poolSize >>> 1), it slices from a shared in-realm slab instead of allocating a fresh ArrayBuffer. The slab refills via one createUnsafeBuffer(poolSize) call per ~poolSize/avgAlloc operations — batching many small allocations into one large kernel-visible event.
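The slab behavior can be observed from JS by comparing a Buffer's length with the size of its backing ArrayBuffer. The sketch below assumes it runs at Worker (or process) startup, before the hot path allocates; isPoolSlice is a hypothetical helper, not a Node API.

```javascript
// Assumption: executed once at startup, before any hot-path allocation.
Buffer.poolSize = 256 * 1024;

// True when the Buffer is a slice of a larger backing ArrayBuffer (the slab).
const isPoolSlice = (buf) => buf.buffer.byteLength > buf.byteLength;

// 64KB < poolSize / 2 (128KB): sliced from the slab.
const first = Buffer.allocUnsafe(64 * 1024);
// The next allocation of the same size takes the following slice of that slab.
const second = Buffer.allocUnsafe(64 * 1024);
// 128KB is not below poolSize / 2: it gets its own ArrayBuffer.
const tooBig = Buffer.allocUnsafe(128 * 1024);

console.log(isPoolSlice(first));              // first is a slab slice
console.log(first.buffer === second.buffer);  // both share one slab
console.log(isPoolSlice(tooBig));             // tooBig bypasses the pool
```

Because slices share one ArrayBuffer, the whole slab stays alive as long as any slice is referenced, which is the RSS trade-off to weigh when raising the pool size.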

Sweep at 64KB k=8 (workers / procs throughput in ops/s, ratio):

`Buffer.poolSize`	workers	procs	ratio	notes
8KB (default)	79,708	151,100	0.53	baseline — 64KB allocations bypass pool
256KB	105,031	130,364	0.81	sweet spot — pool stays under glibc mmap threshold
1MB	90,101	120,722	0.75
4MB	76,891	114,283	0.67	pool refills now `mmap` (4MB > 128KB threshold)
16MB	93,264	107,290	0.87

bpftrace confirms the mechanism — workers with 256KB pool show:

mmap_lock write count: 10,500 → 4,137 (2.5× reduction)
Cumulative write wait: 360 ms → 139 ms (2.6× reduction)
Per-acquisition wait unchanged (~34 μs) — the few remaining acquisitions still contend just as hard; we just have fewer of them

This is a real fix that doesn't require restructuring the app: a single Buffer.poolSize = 256 * 1024 at Worker startup. Caveats: doesn't help for files > ~1MB (the allocation always exceeds poolSize/2 no matter what); too-large pools (≥ 4MB) become counterproductive because pool refills themselves mmap; the pool occupies RSS for the lifetime of any sliced Buffer that's still referenced.

Tested file sizes:

64KB: big win (above)
1MB: no significant change — 1MB allocations exceed M_MMAP_THRESHOLD regardless of pooling
8MB: regression with 32MB pool (workers 1,519 → 880 ops/s). Pool refills become a worse mmap_lock source than direct allocations.

Hypothesis 5c: is 8KB the right default for `Buffer.poolSize` in 2025?

The 8KB default was set in May 2015 (Trevor Norris, commit 63da0dfd3a44, "buffer: implement Uint8Array backed Buffer") and hasn't been touched since. In ten years, typical HTTP frame sizes (16KB-1MB for HTTP/2), JSON payload sizes, and machine RAM have all grown ~10×. The 8KB default predates the dominant modern allocation patterns.

There's also a more subtle issue: the pool check is size < (Buffer.poolSize >>> 1). With default poolSize=8KB, the threshold is 4KB — and the strict inequality means a 4KB allocation itself bypasses the pool. So the current default helps allocations from 1B to 3.99KB and abruptly stops helping at exactly 4KB — precisely where many real allocation sizes land (HTTP frames, page-aligned chunks, small file reads).
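The strict-inequality cutoff is easy to check directly. The sketch below derives the cutoff from whatever Buffer.poolSize is currently set to (4KB under the default 8KB pool); sharesSlab is a hypothetical helper that treats a Buffer as pooled when its backing ArrayBuffer is larger than the Buffer itself.

```javascript
// Cutoff used by the pool check: size < (Buffer.poolSize >>> 1).
const cutoff = Buffer.poolSize >>> 1;

// A pooled Buffer is a view onto a larger slab; an unpooled one owns
// an ArrayBuffer of exactly its own size.
const sharesSlab = (buf) => buf.buffer.byteLength > buf.byteLength;

const justUnder = Buffer.allocUnsafe(cutoff - 1); // one byte below the cutoff
const exactlyAt = Buffer.allocUnsafe(cutoff);     // exactly at the cutoff

console.log(`cutoff=${cutoff}`);
console.log(`just under: pooled=${sharesSlab(justUnder)}`);
console.log(`exactly at: pooled=${sharesSlab(exactlyAt)}`);
```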

Workers throughput sweep at k=8 across file sizes and pool sizes (ops/s):

File	8KB (default)	16KB	32KB	64KB	128KB	256KB	procs reference
512B	404k	460k	453k	431k	455k	446k	496k
2KB	360k	382k	410k	367k	417k	411k	472k
4KB	326k	332k	326k	360k	372k	382k	413k
8KB	202k	187k	208k	254k	232k	271k	302k
16KB	148k	147k	150k	181k	189k	243k	284k
64KB	86k	87k	80k	87k	88k	108k	142k
1MB	12k	13k	12k	12k	12k	13k	12k

Observations:

64KB pool is a Pareto-near improvement over 8KB. Wins at 8KB files (+26%) and 16KB files (+23%); ties everywhere else within noise. RSS cost: +56KB per realm.
256KB pool is more aggressive but still Pareto. Wins at every multi-KB file size — +17% at 4KB, +34% at 8KB, +64% at 16KB, +27% at 64KB. Never regresses. RSS cost: +248KB per realm (~2MB across 8 Workers).
1MB file row is unchanged across all pools. Large allocations bypass the pool regardless, so no risk.

A bump from 8KB → 64KB (or 128KB) would be a low-risk, near-Pareto improvement for the entire Node ecosystem at trivial RSS cost. Worth proposing upstream.

Hypothesis 6: avoid the allocation entirely

Use openSync + readSync into a pre-allocated buffer instead of readFileSync:

64KB k=8:
  workers-reuse: 393,321 ops/s    ← 5× faster than readFileSync workers
  procs-reuse:   484,380 ops/s    ← 3.3× faster than readFileSync procs
  ratio:         0.81             ← gap nearly closes

1MB k=8:
  workers-reuse: 39,133 ops/s
  procs-reuse:   35,376 ops/s     ← workers now FASTER
  ratio:         1.11

Same syscalls (open/read/close), no allocation. Confirms that the allocation is the cost, not the I/O.
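A minimal sketch of the reuse pattern follows. It is not the contents of task-readsync.js: readInto is a hypothetical helper, and the temporary fixture is an assumption so the example runs without the /tmp/fs-contention-files setup.

```javascript
const fs = require('node:fs');
const os = require('node:os');
const path = require('node:path');

// Hypothetical fixture; the benchmark reads /tmp/fs-contention-files/med instead.
const dir = fs.mkdtempSync(path.join(os.tmpdir(), 'reuse-demo-'));
const file = path.join(dir, 'med');
fs.writeFileSync(file, Buffer.alloc(64 * 1024, 0xab));

// Allocated once, sized for the largest file the loop will read.
const scratch = Buffer.allocUnsafe(64 * 1024);

// Same open/read/close syscalls as readFileSync, but no per-call allocation.
function readInto(filePath, dest) {
  const fd = fs.openSync(filePath, 'r');
  try {
    let total = 0;
    while (total < dest.length) {
      // position = null reads from the current file offset.
      const n = fs.readSync(fd, dest, total, dest.length - total, null);
      if (n === 0) break; // EOF
      total += n;
    }
    return total; // number of valid bytes now in dest
  } finally {
    fs.closeSync(fd);
  }
}

let bytesRead = 0;
for (let i = 0; i < 100; i++) bytesRead = readInto(file, scratch);
console.log(bytesRead);

fs.rmSync(dir, { recursive: true });
```

Callers that need to keep the data past the next iteration still have to copy it out, so the pattern pays off most when each read is consumed immediately (hashing, parsing, forwarding).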

Hypothesis 7: does async `readFile` fix it?

64KB k=8:
                              workers   procs    ratio
  sync:                       79,027    146,526  0.54
  async serial (conc=1):      35,449    41,392   0.86
  async parallel (conc=8):    80,216    60,855   1.32

Async does not help workers (80k vs 79k sync, effectively unchanged). The ratio "closes" only because async drags procs down: each process adds libuv's default 4 threadpool threads, so 8 procs run 8 main threads + 32 threadpool threads = ~40 threads on 8 logical cores, which oversubscribes the CPU.

Theoretically consistent: async only moves which thread issues the syscall — it doesn't change the per-mm_struct lock that the kernel takes.

Hypothesis 8: direct kernel measurement with `bpftrace`

mmap-lock.bt attaches to the mmap_lock:mmap_lock_start_locking and mmap_lock:mmap_lock_acquire_returned tracepoints to measure per-acquisition wait time (start → acquired). Filtered to Node-relevant comms (node-MainThread, WorkerThread, libuv-worker).

Sync, 64KB, k=8:

                          workers       procs (×9 PIDs)
write acquisitions        10,572        4,711
total write wait          311 ms        8 ms
avg write wait            29,389 ns     1,662 ns         ← 17× longer
max write wait            2-4 ms        1-2 ms (rare)

Worker write-wait histogram (showing the contention shape):

[1K, 2K)            1760  ← uncontended floor
[16K, 32K)          1782  ← clearly blocking
[32K, 64K)          2177  ← heavy contention
[64K, 128K)          386
[128K, 256K)         232
[256K, 512K)          41
[512K, 1M)            43
[1M, 2M)              14
[2M, 4M)               1   ← worker slept 2-4 ms waiting for the lock

Procs are tightly clustered at 1-2K ns (the uncontended atomic-acquire floor). Workers show a bimodal distribution with a fat tail — classic rwsem blocking behavior.
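As a sanity check, the headline multipliers can be recomputed from the sync table above. The inputs here are the table's rounded totals, so the averages come out slightly different from the table's own 29,389 ns and 1,662 ns, which were presumably computed from unrounded sums.

```javascript
// Rounded inputs copied from the sync 64KB k=8 table.
const workers = { acquisitions: 10572, totalWaitNs: 311e6 };
const procs = { acquisitions: 4711, totalWaitNs: 8e6 };

// Average wait per acquisition = cumulative wait / number of acquisitions.
const avgWaitNs = ({ acquisitions, totalWaitNs }) => totalWaitNs / acquisitions;

const perAcquisitionRatio = avgWaitNs(workers) / avgWaitNs(procs);
const cumulativeRatio = workers.totalWaitNs / procs.totalWaitNs;

console.log(`workers avg wait: ${Math.round(avgWaitNs(workers))} ns`);
console.log(`procs avg wait:   ${Math.round(avgWaitNs(procs))} ns`);
console.log(`per-acquisition ratio: ${perAcquisitionRatio.toFixed(1)}x`);
console.log(`cumulative ratio:      ${cumulativeRatio.toFixed(1)}x`);
```

With these inputs the per-acquisition ratio lands near 17× and the cumulative ratio near 39×, in line with the ~17× and ~40× figures quoted in the summary.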

Async, 64KB, k=8:

                          workers       procs (×9 PIDs)
write acquisitions        4,332         2,492
total write wait          89 ms         4 ms
avg write wait            20,642 ns     1,700 ns         ← still 12× longer

Async reduces contention by ~3× but does not eliminate it. Same kind of lock, fewer events because async staggers allocations across libuv threadpool threads.

Conclusion

The cost workers pay vs procs at 64KB sync (~311 ms of cumulative mmap_lock write waits vs ~8 ms in the Hypothesis 8 trace; other runs ranged from ~300 to ~360 ms for workers) is the same order of magnitude as the wall-clock difference. Theory and measurement agree.
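A back-of-envelope version of that comparison, using the Hypothesis 1 throughputs and the Hypothesis 8 worker wait total. Two assumptions are baked in and are not confirmed by the source: that the reported throughput equals K × iterations divided by wall-clock time, and that the cumulative wait is spread evenly over the K worker threads.

```javascript
// Inputs copied from Hypothesis 1 and the Hypothesis 8 sync table.
const K = 8;
const ITERATIONS = 2000;
const workersOpsPerSec = 79027;
const procsOpsPerSec = 146526;
const workersTotalWaitMs = 311;

// Assumption: throughput = (K * ITERATIONS) / wall-clock seconds.
const totalOps = K * ITERATIONS;
const wallMs = (opsPerSec) => (totalOps / opsPerSec) * 1000;

const wallClockGapMs = wallMs(workersOpsPerSec) - wallMs(procsOpsPerSec);

// Assumption: cumulative wait is shared evenly by the K worker threads.
const waitPerThreadMs = workersTotalWaitMs / K;

console.log(`wall-clock gap:  ${wallClockGapMs.toFixed(0)} ms`);
console.log(`wait per thread: ${waitPerThreadMs.toFixed(0)} ms`);
```

Under those assumptions the gap comes out around 93 ms and the per-thread wait around 39 ms: the same order of magnitude, with the remainder plausibly going to wake-up latency and allocator-level locking that the tracepoints do not cover.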

Why so many mmap_lock writes at 64KB? glibc's arena segment growth, madvise(DONTNEED) on free chunks, and contention-fallback mmap calls, none of which we control from JS. The trace shows 10,572 writes from one shared mm_struct for the workers case vs ~520 per process (4,711 across 9 PIDs) for the 9-process case.

Scripts

File	Purpose
`task.js`	Sync `readFileSync` loop. Worker or child-process entry point. SharedArrayBuffer barrier when used as Worker.
`task-readsync.js`	Sync read using `openSync`/`readSync` into a pre-allocated buffer. Tests the buffer-reuse mitigation.
`task-readfile.js`	Async `fs.promises.readFile` loop. `CONCURRENCY=N` controls in-flight reads per worker/proc.
`run-workers.js`	Spawn K Worker threads running `task.js`, barrier-synchronize, report per-worker timings and aggregate throughput.
`run-procs.js`	Spawn K child processes running `task.js`.
`run-workers-reuse.js`, `run-procs-reuse.js`	Variants using `task-readsync.js`.
`run-workers-async.js`, `run-procs-async.js`	Variants using `task-readfile.js`.
`run.sh`	End-to-end suite: small/med/large/huge files × k=1/4/8 × default/`MALLOC_ARENA_MAX=1`/`=64`.
`run-thp.sh`	Toggles `/sys/kernel/mm/transparent_hugepage/enabled` between `always`/`madvise`/`never` and runs the 64KB/1MB/8MB k=8 cases. Requires `sudo`; restores original setting on exit.
`mmap-lock.bt`	`bpftrace` script: measures `mmap_lock` start→acquire latency, separated by write (mmap/munmap/brk) and read (page-fault) acquisitions. Filters on Node thread comms.

All scripts are independent — pick whichever question you want to answer.

Step-by-Step Reproduction Guide

Prerequisites

# Linux with mmap_lock tracepoints (kernel ≥ 5.8)
uname -r

# bpftrace and sudo (for tracing and THP toggling)
sudo apt install bpftrace      # Debian/Ubuntu
bpftrace --version             # need ≥ 0.16

# Node.js 20+ (for fs.promises and stable Worker threads)
node --version

# Optional: jemalloc for the allocator-swap test
sudo apt install libjemalloc2

# 8 cores recommended (we use k=8 in the canonical tests)
nproc

1. Clone and pick a Node binary

git clone git@github.com:platformatic/node-worker-mmap-lock-contention.git
cd node-worker-mmap-lock-contention

# Set NODE to whichever node binary you want to test.
# Defaults below assume system node.
export NODE=$(which node)

2. Create the test fixtures

mkdir -p /tmp/fs-contention-files
dd if=/dev/urandom of=/tmp/fs-contention-files/small bs=128     count=1 status=none
dd if=/dev/urandom of=/tmp/fs-contention-files/med   bs=65536   count=1 status=none
dd if=/dev/urandom of=/tmp/fs-contention-files/large bs=1048576 count=1 status=none
dd if=/dev/urandom of=/tmp/fs-contention-files/huge  bs=1048576 count=8 status=none

3. Reproduce the workers-slower-than-procs gap

# 64KB file is the contention sweet spot
$NODE run-workers.js /tmp/fs-contention-files/med 2000 8
$NODE run-procs.js   /tmp/fs-contention-files/med 2000 8

Expected: workers ~70-90k ops/s, procs ~120-150k ops/s, ratio 0.55-0.65 (typical 0.60), procs at least 1.4× faster. If you don't see a gap of at least 1.4×, check that nothing else is competing for CPU.

4. Verify the buffer-reuse fix

$NODE run-workers-reuse.js /tmp/fs-contention-files/med 2000 8
$NODE run-procs-reuse.js   /tmp/fs-contention-files/med 2000 8

Expected: both jump to 300-500k ops/s, ratio above 0.8.

5. Run the full mitigation matrix

NODE=$NODE bash run.sh

Walks through file sizes 128B / 64KB / 1MB / 8MB at k=1/4/8 with default, MALLOC_ARENA_MAX=1, and MALLOC_ARENA_MAX=64. Takes ~5 minutes.

6. Try the kernel-level allocator workaround

MALLOC_MMAP_MAX_=0 MALLOC_TRIM_THRESHOLD_=-1 $NODE run-workers.js /tmp/fs-contention-files/med 2000 8

Expected: marginal — typically only ~5-10% improvement in repeated runs, often within run-to-run noise. (An early single run on this hardware showed +30% but did not replicate under independent re-verification.) Listed here for completeness, not as a recommended mitigation.

7. THP tuning (does NOT help)

sudo bash run-thp.sh   # restores original THP setting on exit

Expected: all three modes within noise. Reported here only to rule it out.

8. jemalloc (does NOT help)

LD_PRELOAD=/lib/x86_64-linux-gnu/libjemalloc.so.2 \
  $NODE run-workers.js /tmp/fs-contention-files/med 2000 8

Expected: small improvement at 64KB, regression at 8MB. Skip in production.

9. Direct kernel measurement with `bpftrace` (the smoking gun)

# Trace workers run
sudo bpftrace -o /tmp/bpf-workers.txt mmap-lock.bt \
  -c "$NODE run-workers.js /tmp/fs-contention-files/med 2000 8"

# Trace procs run
sudo bpftrace -o /tmp/bpf-procs.txt mmap-lock.bt \
  -c "$NODE run-procs.js /tmp/fs-contention-files/med 2000 8"

# Compare
echo "=== WORKERS ===" && cat /tmp/bpf-workers.txt
echo "=== PROCS ==="   && cat /tmp/bpf-procs.txt

What to look for:

@cnt_write: total mmap_lock write acquisitions. Workers ~10k; procs ~5k spread across 9 PIDs (@cnt_per_pid_write[…]).
@sum_write_ns: total cumulative wait time. Expect workers ~40× procs (observed range across runs: 38-60×).
@wait_write_ns histogram: workers will have a fat tail in 16K-128K ns; procs will be clustered at 1-2K ns.

If you see workers' write waits clustered in the 1-2K ns range, the trace didn't capture the right threads — double-check the comm filter in mmap-lock.bt matches your Node build's thread names (/proc/PID/task/*/comm).
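To see which thread names a given Node build actually uses, the comm files can be read from inside the process itself. threadComms below is a hypothetical helper; it returns an empty list on systems without /proc, and it only reports threads that exist at the moment it runs.

```javascript
const fs = require('node:fs');

// Returns the comm (thread name) of every thread in a process,
// or [] where /proc is unavailable (non-Linux systems).
function threadComms(pid = process.pid) {
  const taskDir = `/proc/${pid}/task`;
  if (!fs.existsSync(taskDir)) return [];
  const names = [];
  for (const tid of fs.readdirSync(taskDir)) {
    try {
      names.push(fs.readFileSync(`${taskDir}/${tid}/comm`, 'utf8').trim());
    } catch {
      // The thread exited between readdir and read; skip it.
    }
  }
  return names;
}

// Deduplicated list, suitable for comparing against the filter in mmap-lock.bt.
const uniqueComms = [...new Set(threadComms())];
console.log(uniqueComms);
```

Running it from inside a Worker, or passing the PID of a running benchmark, shows the Worker and threadpool thread names as well as the main thread's.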

10. Verify it's NOT fixed by switching to async

CONCURRENCY=8 $NODE run-workers-async.js /tmp/fs-contention-files/med 2000 8
CONCURRENCY=8 $NODE run-procs-async.js   /tmp/fs-contention-files/med 2000 8

# Trace it
sudo bpftrace -o /tmp/bpf-workers-async.txt mmap-lock.bt \
  -c "env CONCURRENCY=8 $NODE run-workers-async.js /tmp/fs-contention-files/med 2000 8"
cat /tmp/bpf-workers-async.txt

Expected: workers' throughput unchanged. mmap_lock waits reduced by ~3× (~89 ms vs 311 ms) but still 12-15× higher per acquisition than procs. Procs async gets slower than procs sync due to libuv threadpool CPU oversubscription.

Independent Validation

The full 10-step guide above was independently re-executed by a fresh agent that did not see the original investigation notes — only the README. It re-ran every step against the same hardware and judged each against the README's predictions. Result summary:

Step	Verdict	Notes
3 — basic gap	✅ matches	workers 77-88k / procs 138-141k, ratio 0.55-0.64
4 — buffer reuse fix	✅ matches	workers-reuse 333-375k / procs-reuse 408-461k, ratio 0.81-0.92
5 — full matrix	✅ matches	64KB k=8 default: workers 77k / procs 122k, 1.58× gap
6 — `MALLOC_MMAP_MAX_=0 …`	❌ does not reproduce as claimed	only ~5-7% over baseline across 5 runs; the ≥15% claim was based on a single non-representative run
7 — THP modes	✅ matches	within 3% across `always`/`madvise`/`never`
8 — jemalloc	✅ matches	~5-10% at 64KB, not dramatic
9 — bpftrace smoking gun	✅ matches decisively	workers `@sum_write_ns` = 298 ms vs procs 7.7 ms (38.7×); avg per acquisition 28,418 ns vs 1,611 ns (17.6×); tail mass above 16K ns: workers 4,583 events vs procs 4 events (>1000×)
10 — async doesn't fix it	✅ matches	workers-async avg wait 24,512 ns vs procs sync 1,611 ns = 15.2× — confirms async only reduces, doesn't eliminate, the contention

Independent verdict on the central claim: the per-mm_struct mmap_lock contention thesis is confirmed by direct kernel measurement. The ~290 ms cumulative worker write-wait closely matches the wall-clock gap between workers and procs. Cause and effect line up.

Corrections applied based on validation:

The headline "27× longer per acquisition" was a one-run outlier — corrected to ~17×, consistent with the body's measurement table and the validation re-run (17.6×).
MALLOC_MMAP_MAX_=0 MALLOC_TRIM_THRESHOLD_=-1 was downgraded in the mitigation ranking. It was promising on one early run but only ~5-7% on repeated trials.
Headline ratio narrowed from "~0.54" to "0.55-0.65 (typical 0.60)" to reflect typical run-to-run variance rather than the best-case single number.

The mitigation ranking now correctly elevates buffer reuse (#1, the only thing that actually fixes it), a larger Buffer.poolSize (#2, fewer lock acquisitions when allocations fit the pool), and process sharding (#3, sidesteps the shared kernel lock entirely); everything else is footnote-worthy at best on this kernel/glibc combination.

Caveats & Open Questions

All numbers are from a single i7-7700 (8 logical cores). On a higher-core-count machine the absolute contention may differ but the qualitative result should hold — mmap_lock is per-mm_struct everywhere.
Tested with glibc 2.39. Other allocators (musl's malloc, or jemalloc swapped in via LD_PRELOAD) have different arena strategies; results may differ. jemalloc was tested here and did not help.
The Node binary used was built from main (27.0.0-pre). Same pattern should reproduce on Node 20.x / 22.x / 24.x — the relevant code paths (readFileSync, libuv uv_fs_*, V8 ArrayBuffer allocator) have been stable.
Kernel was Linux 6.8, which already includes maple-tree VMAs (since 6.1) and per-VMA page-fault locks (since 6.4). The mmap/munmap write-side serialization on mmap_lock is unchanged on newer kernels at time of writing.
We did not test with CONFIG_LOCK_STAT-enabled kernel. That would give cleaner per-lock contention numbers without needing eBPF.
We did not test on a system with multiple NUMA nodes — mmap_lock is per-mm_struct so NUMA shouldn't change the per-process picture, but cross-socket cacheline bouncing could amplify it.

References

Linux mmap_lock design: kernel/Documentation/mm/process_addrs.rst
libuv sync fs path: deps/uv/src/unix/fs.c POST macro at line 139
Node readFileUtf8: src/node_file.cc ReadFileUtf8
glibc malloc arena behavior: glibc/malloc/arena.c and mallopt(3)
bpftrace mmap_lock tracepoints: introduced kernel 5.8, include/trace/events/mmap_lock.h
