platformatic/node-worker-mmap-lock-contention

Node.js Worker Threads vs Processes — Memory Allocation Contention Analysis

A reproducible investigation into why N child processes doing concurrent large-buffer allocations (fs.readFileSync, Buffer.allocUnsafe, addon mallocs, etc.) are measurably faster than N Worker threads doing the same work in a single Node.js process — and a direct kernel-side measurement of the cause.

fs.readFileSync is used as the canonical reproducer because it's the easiest to instrument, but the finding generalizes to any allocation-heavy workload, including native addons. See the Scope section below.

Root cause: per-mm_struct mmap_lock contention in the Linux kernel. Worker threads share one address space (one mm_struct, one mmap_lock); separate processes each have their own. Confirmed by bpftrace: workers wait ~17× longer per mmap_lock acquisition than equivalent processes (~29 μs vs ~1.7 μs on average, with a fat tail to milliseconds), and accumulate ~40× more total wait time for the same workload.

The contention triggers every time glibc grows/shrinks an arena — which happens often when many threads hammer malloc/free with Buffer.allocUnsafe(size) from readFileSync. The 128KB M_MMAP_THRESHOLD is not the only thing that triggers mmap; arena growth does too.


⚠ Scope: This Is Not An fs Problem

readFileSync is just the easiest way to reproduce this. The underlying mechanism — mmap_lock write contention on a shared mm_struct — fires for any allocation hot path inside a multi-Worker process. If your code (or a dependency) allocates and frees buffers above the per-thread-cache threshold under concurrency, you will see the same slowdown.

Concrete examples of code paths that hit the same kernel lock:

  • Buffer.allocUnsafe(size) / Buffer.alloc(size) for sizes that miss the internal Buffer pool: allocUnsafe bypasses it at size ≥ 4KB with the default 8KB Buffer.poolSize, and Buffer.alloc never uses it. Anything that builds large buffers in a hot loop (body parsers, stream chunkers, base64/hex codecs, framing protocols) is affected.
  • crypto: randomBytes, pbkdf2, scrypt, createHash/createHmac final outputs, AES/GCM encrypt/decrypt buffers, key derivation, certificate parsing.
  • zlib: deflate/inflate/brotliCompress allocate large output buffers per operation.
  • JSON.parse / JSON.stringify on large objects — V8 allocates contiguous strings/objects via the same ArrayBuffer allocator path.
  • HTTP / WebSocket frame assembly — body buffering, large header parsing, req.body materialization.
  • WASM linear memory growth: WebAssembly.Memory.grow() is literally mmap/mremap and takes mmap_lock for write.
  • V8 GC compaction / large-object-space allocation — V8 itself calls mmap for backing store and for trusted-space pages.

Native addons are equally vulnerable — often more so, because they bypass the V8 ArrayBufferAllocator's per-isolate accounting and call malloc/new/mmap directly:

  • Anything using N-API's napi_create_buffer, napi_create_arraybuffer, or napi_create_external_buffer for large buffers.
  • Common addons that allocate frequently: sharp (libvips image buffers), canvas (Cairo/pixman surfaces), node-sass/sass-embedded, bcrypt (work buffers), @node-rs/* crates, better-sqlite3 result rows, couchbase/mongodb native BSON paths, ML/ONNX runtimes, FFmpeg/GStreamer bindings.
  • Any addon that calls std::vector::resize, new T[n], malloc(n), or mmap() from a Worker — those calls hit the same mmap_lock from the same mm_struct as JS-side allocations, just through a different layer.
  • Threadpool-using addons (those that call napi_create_async_work / uv_queue_work) execute on libuv threadpool threads which share mm_struct with all Workers — so the contention is process-wide, not per-Worker.

This means the mitigation hierarchy is the same regardless of source:

  1. Reuse buffers wherever the hot path allows (preallocate once, reset between uses). Applies to both JS Buffers and addon-side allocations — many addons expose APIs that accept a destination buffer for exactly this reason (e.g. crypto.randomFillSync(buf) vs crypto.randomBytes(n); sharp(...).raw().toBuffer({resolveWithObject: true}) vs .toBuffer()).
  2. Process-shard hot workloads: use node:cluster or runtime-managed worker processes (Platformatic Watt, PM2, custom forks) instead of worker_threads for CPU+allocation-bound work.
  3. Cap concurrency to physical cores — adding more Workers past your physical-core count only adds mmap_lock contenders without adding I/O parallelism. Workers don't scale with allocation hotness the way independent processes do.
  4. Audit addons with bpftrace (the same mmap-lock.bt works against any process) to identify which dependency is the loudest mmap_lock writer. Counts per PID + comm reveal the culprit.

If you are building a Node.js addon or library that allocates buffers on a hot path: provide an API variant that accepts a pre-allocated buffer. This is a real performance lever for downstream multi-Worker consumers.
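
As a minimal sketch of that advice, the two helpers below contrast an allocating API with a destination-buffer variant. The helper names (`newSessionToken`, `fillSessionToken`) are hypothetical and chosen only for illustration; `crypto.randomBytes` and `crypto.randomFillSync` are the real Node.js calls mentioned above.

```javascript
const crypto = require('node:crypto');

// Allocating variant: every call creates a fresh Buffer.
function newSessionToken(bytes) {
  return crypto.randomBytes(bytes);
}

// Destination-buffer variant (hypothetical helper name): the caller owns
// `dest`, so a hot loop can reuse one Buffer instead of allocating per call.
function fillSessionToken(dest) {
  return crypto.randomFillSync(dest); // returns the same Buffer it was given
}

// Allocate once, outside the hot loop, then refill in place.
const scratch = Buffer.allocUnsafe(64 * 1024);
for (let i = 0; i < 3; i++) {
  fillSessionToken(scratch);
  // ...consume `scratch` here before the next iteration overwrites it...
}
```

The trade-off is the usual one for reused buffers: the caller must finish with the contents before the next fill, so this pattern fits synchronous hot loops best.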


TL;DR — Mitigation Ranking (Evidence-Based)

| Mitigation | Effect | Why it works |
| --- | --- | --- |
| 1. Reuse a Buffer (openSync + readSync into a pre-allocated buffer) | Up to 5× throughput, gap nearly disappears | Removes per-iteration allocation → no malloc → no mmap_lock |
| 2. Raise Buffer.poolSize per Worker (e.g. 256KB for 64KB files) | +32% workers throughput, gap closes from 0.53 → 0.81 | Routes Buffer.allocUnsafe allocations through the in-realm slab pool; reduces mmap_lock write count by 2.5× (10,500 → 4,137). Only works when fileSize < poolSize/2 AND pool stays under glibc M_MMAP_THRESHOLD (~128KB-1MB sweet spot) |
| 3. Shard work across child processes instead of Workers | Restores procs-fast behavior | Each process has its own mm_struct → no shared lock |
| 4. MALLOC_MMAP_MAX_=0 MALLOC_TRIM_THRESHOLD_=-1 | Marginal (~5-7% in independent re-runs, within noise; an early single run showed +30% but didn't replicate) | In theory stops glibc from mmap/munmap'ing freed chunks back to kernel; in practice arena growth still dominates |
| 5. jemalloc (LD_PRELOAD) | Marginal at 64KB, worse at 8MB | Not worth it |
| 6. THP tuning (always/never/madvise) | Within noise (~10% variance) | mmap_lock taken once per syscall regardless of page size |
| 7. Switch sync → async readFile | Workers unchanged; procs get slower (CPU oversubscription) | Async doesn't avoid the kernel lock; it moves the syscall to libuv threadpool threads, which still share mm_struct |

The only mitigation that genuinely eliminates the contention is buffer reuse. Everything else either dodges it (separate processes) or marginally reduces it.


Setup

  • OS: Linux 6.8 (Ubuntu)
  • CPU: Intel i7-7700 (4C/8T, single NUMA node)
  • Node: built from the nodejs/node main branch (27.0.0-pre); same results expected on stable Node 20+
  • Allocator: glibc 2.39 (Ubuntu 24.04 default)
  • bpftrace: 0.20.2 with mmap_lock:* tracepoints available
  • Default kernel built without CONFIG_LOCK_STAT — so we use bpftrace rather than /proc/lock_stat

The benchmark targets are pre-generated random files at /tmp/fs-contention-files/{small,med,large,huge} (128B, 64KB, 1MB, 8MB). run.sh creates them.


The Investigation, In Order

Hypothesis 0: Theoretical analysis says workers should not contend

Reading the Node source confirms the sync readFileSync path bypasses libuv's threadpool entirely (uv_fs_open/read/close(nullptr, ...) runs uv__fs_work inline on the calling thread). Each Worker has its own V8 Isolate, libuv loop, ArrayBuffer allocator, and Permission object. No process-global Node-level locks are taken on the sync read path.

So Node itself is innocent. The contention must be in something workers implicitly share: the kernel, libc, or hardware.

Hypothesis 1: Reproduce the gap

run-workers.js and run-procs.js spawn K parallel JS contexts (Worker threads or child processes), each doing N readFileSync iterations on the same file. A SharedArrayBuffer barrier (for workers) ensures the timed loops start in sync.
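
A start barrier of this kind can be sketched with Atomics on a SharedArrayBuffer. The function below is an illustrative assumption about how such a barrier might look, not the code in task.js; the name `arriveAndWait` and the single-slot counter layout are invented for the example.

```javascript
// Each participant calls arriveAndWait(); nobody proceeds until `parties`
// participants have arrived. `counter` is an Int32Array over a
// SharedArrayBuffer that every Worker receives (e.g. via workerData).
function arriveAndWait(counter, parties) {
  const arrived = Atomics.add(counter, 0, 1) + 1; // add() returns the old value
  if (arrived >= parties) {
    Atomics.notify(counter, 0); // last arrival wakes every waiter
    return;
  }
  let seen;
  while ((seen = Atomics.load(counter, 0)) < parties) {
    // Sleeps only if slot 0 still holds `seen`; otherwise returns at once
    // and the loop re-checks the count.
    Atomics.wait(counter, 0, seen);
  }
}

// Single-party demo: with parties = 1 the caller is also the last arrival.
const counter = new Int32Array(new SharedArrayBuffer(4));
arriveAndWait(counter, 1);
console.log('arrived:', Atomics.load(counter, 0));
```

In a real run each Worker would call `arriveAndWait` immediately before starting its timed loop, so thread startup cost stays outside the measurement.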

Results at k=8, file=64KB, iter=2000 (the worst case we found):

workers k=8 iter=2000 throughput=79,027 ops/s
procs   k=8 iter=2000 throughput=146,526 ops/s    ← procs are 1.85× faster

Reproduced. Worker/procs ratio: 0.54.

Hypothesis 2: glibc malloc arena lock?

Test by toggling MALLOC_ARENA_MAX:

| Setting | workers ops/s | procs ops/s | ratio |
| --- | --- | --- | --- |
| default | 76,775 | 141,259 | 0.54 |
| MALLOC_ARENA_MAX=1 | 96,403 | 138,477 | 0.70 |
| MALLOC_ARENA_MAX=64 | 94,128 | 153,064 | 0.61 |

Arena count modulates total throughput slightly but the worker/procs ratio doesn't close. Arena lock is a contributor but not the dominant cause.

Hypothesis 3: kernel mmap_lock contention via large mmap allocations

Test by lowering M_MMAP_THRESHOLD so every 64KB allocation goes through mmap:

MALLOC_MMAP_THRESHOLD_=4096:
  workers k=8: 30,769 ops/s     ← collapse
  procs   k=8: 94,045 ops/s
  ratio:       0.33

When forced to mmap every allocation, workers slow down dramatically more than processes (ratio drops from 0.54 to 0.33). This is the smoking gun for mmap_lock contention.
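
glibc reads the MALLOC_* variables while the allocator initializes early in process startup, so they have to be present in the environment of a newly spawned process; setting process.env inside an already running Node process does not affect that process. A hedged sketch of launching a run from JavaScript follows. The helper name `runNodeWithMallocEnv` is hypothetical, and the demo child only echoes the variable instead of running the benchmark.

```javascript
const { spawnSync } = require('node:child_process');

// Spawn a fresh node process with extra allocator variables in its
// environment and return whatever it printed to stdout.
function runNodeWithMallocEnv(nodeArgs, mallocEnv) {
  const result = spawnSync(process.execPath, nodeArgs, {
    env: { ...process.env, ...mallocEnv },
    encoding: 'utf8',
  });
  if (result.error) throw result.error;
  return result.stdout;
}

// Demo: the child prints the variable it inherited. For the real experiment
// the arguments would be the benchmark script and its parameters, e.g.
// ['run-workers.js', '/tmp/fs-contention-files/med', '2000', '8'].
const out = runNodeWithMallocEnv(
  ['-p', 'process.env.MALLOC_MMAP_THRESHOLD_'],
  { MALLOC_MMAP_THRESHOLD_: '4096' },
);
console.log(out.trim());
```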

Hypothesis 4: jemalloc — does a better allocator fix it?

64KB k=8:
  glibc workers:    85,155     glibc procs:    141,957    (ratio 0.60)
  jemalloc workers: 90,619     jemalloc procs: 124,621    (ratio 0.73)

8MB k=8:
  glibc workers:    1,515      glibc procs:    1,640      (ratio 0.92)
  jemalloc workers: 1,144      jemalloc procs: 1,210      (ratio 0.95)   ← both slower than glibc in absolute terms

jemalloc closes the small-file gap mostly by making processes slower, and it actively regresses on large files (its large-chunk strategy doesn't fit this workload). Not a recommended fix.

Hypothesis 5: kernel tunables — THP, vm.max_map_count

vm.max_map_count is already at 1M, so the mapping-count limit is not the bottleneck. Tested THP always/madvise/never:

THP=always:  workers 81,774  procs 139,966  ratio 0.58
THP=madvise: workers 90,206  procs 132,356  ratio 0.68
THP=never:   workers 92,394  procs 145,795  ratio 0.63

All within run-to-run noise. THP doesn't help because mmap_lock is taken once per syscall regardless of the page size that backs the mapping.

Hypothesis 5b: shrink the allocation frequency via Buffer.poolSize

Buffer.poolSize (default 8KB) is per-realm, not per-process — lib/buffer.js is loaded fresh in each Worker and the slab variables (allocPool, poolOffset, etc.) live in module scope. So raising it on a Worker is a zero-coordination, per-Worker change.

When Buffer.allocUnsafe(size) is called with size < (Buffer.poolSize >>> 1), it slices from a shared in-realm slab instead of allocating a fresh ArrayBuffer. The slab refills via one createUnsafeBuffer(poolSize) call per ~poolSize/avgAlloc operations — batching many small allocations into one large kernel-visible event.

Sweep at 64KB k=8 (workers / procs throughput in ops/s, ratio):

| Buffer.poolSize | workers | procs | ratio | notes |
| --- | --- | --- | --- | --- |
| 8KB (default) | 79,708 | 151,100 | 0.53 | baseline: 64KB allocations bypass pool |
| 256KB | 105,031 | 130,364 | 0.81 | sweet spot: pool stays under glibc mmap threshold |
| 1MB | 90,101 | 120,722 | 0.75 | |
| 4MB | 76,891 | 114,283 | 0.67 | pool refills now mmap (4MB > 128KB threshold) |
| 16MB | 93,264 | 107,290 | 0.87 | |

bpftrace confirms the mechanism — workers with 256KB pool show:

  • mmap_lock write count: 10,500 → 4,137 (2.5× reduction)
  • Cumulative write wait: 360 ms → 139 ms (2.6× reduction)
  • Per-acquisition wait unchanged (~34 μs) — the few remaining acquisitions still contend just as hard; we just have fewer of them

This is a real fix that doesn't require restructuring the app: a single Buffer.poolSize = 256 * 1024 at Worker startup. Caveats: doesn't help for files > ~1MB (the allocation always exceeds poolSize/2 no matter what); too-large pools (≥ 4MB) become counterproductive because pool refills themselves mmap; the pool occupies RSS for the lifetime of any sliced Buffer that's still referenced.
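
A minimal sketch of that one-line change, placed at the top of a Worker entry file. The check at the end relies on the assumption that a pooled Buffer is a view onto the larger slab, so its backing ArrayBuffer is bigger than the Buffer itself; treat it as a quick sanity probe rather than a documented guarantee.

```javascript
// Buffer.poolSize is per-realm, so this affects only the thread that runs it.
Buffer.poolSize = 256 * 1024;

// With a 256KB pool, a 64KB request is below poolSize / 2 and should be
// sliced from the slab instead of getting its own ArrayBuffer.
const chunk = Buffer.allocUnsafe(64 * 1024);

// Sanity probe (assumption noted above): a pooled Buffer shares a backing
// ArrayBuffer that is larger than the Buffer's own length.
const pooled = chunk.buffer.byteLength > chunk.length;
console.log({ poolSize: Buffer.poolSize, chunkLength: chunk.length, pooled });
```

Because pooled slices share one slab, keeping any single slice alive keeps the whole slab resident, which is the RSS caveat mentioned above.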

Tested file sizes:

  • 64KB: big win (above)
  • 1MB: no significant change — 1MB allocations exceed M_MMAP_THRESHOLD regardless of pooling
  • 8MB: regression with 32MB pool (workers 1,519 → 880 ops/s). Pool refills become a worse mmap_lock source than direct allocations.

Hypothesis 5c: is 8KB the right default for Buffer.poolSize in 2025?

The 8KB default was set in May 2015 (Trevor Norris, commit 63da0dfd3a44, "buffer: implement Uint8Array backed Buffer") and hasn't been touched since. In ten years, typical HTTP frame sizes (16KB-1MB for HTTP/2), JSON payload sizes, and machine RAM have all grown ~10×. The 8KB default predates the dominant modern allocation patterns.

There's also a more subtle issue: the pool check is size < (Buffer.poolSize >>> 1). With default poolSize=8KB, the threshold is 4KB — and the strict inequality means a 4KB allocation itself bypasses the pool. So the current default helps allocations from 1B to 3.99KB and abruptly stops helping at exactly 4KB — precisely where many real allocation sizes land (HTTP frames, page-aligned chunks, small file reads).
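
The boundary can be made concrete with a small predicate that mirrors the check quoted above. `isPoolEligible` is a hypothetical helper written for this illustration, not a Node.js API; the `size > 0` guard is an added assumption for zero-length requests.

```javascript
// Mirrors the documented check: size < (Buffer.poolSize >>> 1).
function isPoolEligible(size, poolSize = 8 * 1024) {
  return size > 0 && size < (poolSize >>> 1);
}

// Compare the default 8KB pool with a 256KB pool at a few request sizes.
for (const size of [1, 4095, 4096, 64 * 1024, 128 * 1024]) {
  console.log(
    String(size).padStart(7),
    'default pool:', isPoolEligible(size),
    '| 256KB pool:', isPoolEligible(size, 256 * 1024),
  );
}
```

With the default pool a 4,095-byte request is eligible and a 4,096-byte request is not; with a 256KB pool the cutoff moves to just under 128KB.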

Workers throughput sweep at k=8 across file sizes and pool sizes (ops/s):

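| File | 8KB (default) | 16KB | 32KB | 64KB | 128KB | 256KB | procs reference |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 512B | 404k | 460k | 453k | 431k | 455k | 446k | 496k |
| 2KB | 360k | 382k | 410k | 367k | 417k | 411k | 472k |
| 4KB | 326k | 332k | 326k | 360k | 372k | 382k | 413k |
| 8KB | 202k | 187k | 208k | 254k | 232k | 271k | 302k |
| 16KB | 148k | 147k | 150k | 181k | 189k | 243k | 284k |
| 64KB | 86k | 87k | 80k | 87k | 88k | 108k | 142k |
| 1MB | 12k | 13k | 12k | 12k | 12k | 13k | 12k |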

Observations:

  • 64KB pool is a Pareto-near improvement over 8KB. Wins at 8KB files (+26%) and 16KB files (+23%); ties everywhere else within noise. RSS cost: +56KB per realm.
  • 256KB pool is more aggressive but still Pareto. Wins at every multi-KB file size — +17% at 4KB, +34% at 8KB, +64% at 16KB, +27% at 64KB. Never regresses. RSS cost: +248KB per realm (~2MB across 8 Workers).
  • 1MB file row is unchanged across all pools. Large allocations bypass the pool regardless, so no risk.

A bump from 8KB → 64KB (or 128KB) would be a low-risk, near-Pareto improvement for the entire Node ecosystem at trivial RSS cost. Worth proposing upstream.

Hypothesis 6: avoid the allocation entirely

Use openSync + readSync into a pre-allocated buffer instead of readFileSync:

64KB k=8:
  workers-reuse: 393,321 ops/s    ← 5× faster than readFileSync workers
  procs-reuse:   484,380 ops/s    ← 3.3× faster than readFileSync procs
  ratio:         0.81             ← gap nearly closes

1MB k=8:
  workers-reuse: 39,133 ops/s
  procs-reuse:   35,376 ops/s     ← workers now FASTER
  ratio:         1.11

Same syscalls (open/read/close), no allocation. Confirms that the allocation is the cost, not the I/O.
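
A hedged sketch of the reuse pattern follows. It is not the repository's task-readsync.js; `readInto` is an illustrative helper, and the demo writes its own small temporary file so it runs anywhere. Files larger than the destination buffer are simply truncated to the buffer's length in this sketch.

```javascript
const fs = require('node:fs');
const os = require('node:os');
const path = require('node:path');

// Read the file at `filePath` into `dest` and return the number of bytes
// read. No Buffer is allocated here; the caller supplies `dest`.
function readInto(filePath, dest) {
  const fd = fs.openSync(filePath, 'r');
  try {
    let total = 0;
    while (total < dest.length) {
      const n = fs.readSync(fd, dest, total, dest.length - total, null);
      if (n === 0) break; // end of file
      total += n;
    }
    return total;
  } finally {
    fs.closeSync(fd);
  }
}

// Demo fixture: a 1KB temporary file filled with the byte 0x61.
const dir = fs.mkdtempSync(path.join(os.tmpdir(), 'reuse-demo-'));
const file = path.join(dir, 'sample');
fs.writeFileSync(file, Buffer.alloc(1024, 0x61));

// Allocate once, outside the hot loop, then reuse for every read.
const scratch = Buffer.allocUnsafe(64 * 1024);
let bytesRead = 0;
for (let i = 0; i < 3; i++) {
  bytesRead = readInto(file, scratch);
}
console.log('bytes read per iteration:', bytesRead);
```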

Hypothesis 7: does async readFile fix it?

64KB k=8:
                              workers   procs    ratio
  sync:                       79,027    146,526  0.54
  async serial (conc=1):      35,449    41,392   0.86
  async parallel (conc=8):    80,216    60,855   1.32

Async does not help workers (80k vs 79k sync, effectively the same). The ratio "closes" only because async drags procs down: each process's libuv threadpool adds 4 threads, so 8 procs run 8 main threads + 32 threadpool threads = ~40 threads on 8 logical cores → CPU oversubscription.

Theoretically consistent: async only moves which thread issues the syscall — it doesn't change the per-mm_struct lock that the kernel takes.

Hypothesis 8: direct kernel measurement with bpftrace

mmap-lock.bt attaches to the mmap_lock:mmap_lock_start_locking and mmap_lock:mmap_lock_acquire_returned tracepoints to measure per-acquisition wait time (start → acquired). Filtered to Node-relevant comms (node-MainThread, WorkerThread, libuv-worker).

Sync, 64KB, k=8:

                          workers       procs (×9 PIDs)
write acquisitions        10,572        4,711
total write wait          311 ms        8 ms
avg write wait            29,389 ns     1,662 ns         ← 17× longer
max write wait            2-4 ms        1-2 ms (rare)

Worker write-wait histogram (showing the contention shape):

[1K, 2K)            1760  ← uncontended floor
[16K, 32K)          1782  ← clearly blocking
[32K, 64K)          2177  ← heavy contention
[64K, 128K)          386
[128K, 256K)         232
[256K, 512K)          41
[512K, 1M)            43
[1M, 2M)              14
[2M, 4M)               1   ← worker slept 2-4 ms waiting for the lock

Procs are tightly clustered at 1-2K ns (the uncontended atomic-acquire floor). Workers show a bimodal distribution with a fat tail — classic rwsem blocking behavior.

Async, 64KB, k=8:

                          workers       procs (×9 PIDs)
write acquisitions        4,332         2,492
total write wait          89 ms         4 ms
avg write wait            20,642 ns     1,700 ns         ← still 12× longer

Async reduces contention by ~3× but does not eliminate it. Same kind of lock, fewer events because async staggers allocations across libuv threadpool threads.

Conclusion

The cost workers pay vs procs at 64KB sync (roughly 300-360 ms of cumulative mmap_lock write waits across runs vs ~8 ms) is the same order of magnitude as the wall-clock difference. Theory and measurement agree.
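
The order-of-magnitude comparison can be checked with back-of-the-envelope arithmetic. This sketch assumes the reported throughput is total operations (k × iter) divided by wall-clock time, and that lock waits are spread evenly across the 8 Worker threads; both are simplifying assumptions, not measured facts. The inputs are the Hypothesis 1 throughputs and the 311 ms total write wait from the Hypothesis 8 trace.

```javascript
const K = 8;       // parallel workers or processes
const ITER = 2000; // iterations each
const totalOps = K * ITER;

// Wall-clock time implied by an aggregate throughput figure.
const wallMs = (opsPerSec) => (totalOps / opsPerSec) * 1000;

const workersWallMs = wallMs(79027);
const procsWallMs = wallMs(146526);
const gapMs = workersWallMs - procsWallMs;

// Cumulative write wait from the trace, averaged per Worker thread.
const waitPerThreadMs = 311 / K;

console.log({
  workersWallMs: workersWallMs.toFixed(1),
  procsWallMs: procsWallMs.toFixed(1),
  gapMs: gapMs.toFixed(1),
  waitPerThreadMs: waitPerThreadMs.toFixed(1),
});
```

Under these assumptions the wall-clock gap is on the order of 90 ms and the per-thread lock wait is on the order of 40 ms, so the wait accounts for a large share of the gap without explaining every millisecond of it.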

Why so many mmap_lock writes at 64KB? glibc's arena segment growth, madvise(DONTNEED) on free chunks, and contention-fallback mmap calls, none of which we control from JS. The trace shows 10,572 writes from one shared mm_struct for the workers case vs ~520 per process on average (4,711 across 9 PIDs) for the process case.


Scripts

| File | Purpose |
| --- | --- |
| task.js | Sync readFileSync loop. Worker or child-process entry point. SharedArrayBuffer barrier when used as Worker. |
| task-readsync.js | Sync read using openSync/readSync into a pre-allocated buffer. Tests the buffer-reuse mitigation. |
| task-readfile.js | Async fs.promises.readFile loop. CONCURRENCY=N controls in-flight reads per worker/proc. |
| run-workers.js | Spawn K Worker threads running task.js, barrier-synchronize, report per-worker timings and aggregate throughput. |
| run-procs.js | Spawn K child processes running task.js. |
| run-workers-reuse.js, run-procs-reuse.js | Variants using task-readsync.js. |
| run-workers-async.js, run-procs-async.js | Variants using task-readfile.js. |
| run.sh | End-to-end suite: small/med/large/huge files × k=1/4/8 × default/MALLOC_ARENA_MAX=1/=64. |
| run-thp.sh | Toggles /sys/kernel/mm/transparent_hugepage/enabled between always/madvise/never and runs the 64KB/1MB/8MB k=8 cases. Requires sudo; restores original setting on exit. |
| mmap-lock.bt | bpftrace script: measures mmap_lock start→acquire latency, separated by write (mmap/munmap/brk) and read (page-fault) acquisitions. Filters on Node thread comms. |

All scripts are independent — pick whichever question you want to answer.


Step-by-Step Reproduction Guide

Prerequisites

# Linux with mmap_lock tracepoints (kernel ≥ 5.11; mmap_lock itself dates from 5.8)
uname -r

# bpftrace and sudo (for tracing and THP toggling)
sudo apt install bpftrace      # Debian/Ubuntu
bpftrace --version             # need ≥ 0.16

# Node.js 20+ (for fs.promises and stable Worker threads)
node --version

# Optional: jemalloc for the allocator-swap test
sudo apt install libjemalloc2

# 8 cores recommended (we use k=8 in the canonical tests)
nproc

1. Clone and pick a Node binary

git clone git@github.com:platformatic/node-worker-mmap-lock-contention.git
cd node-worker-mmap-lock-contention

# Set NODE to whichever node binary you want to test.
# Defaults below assume system node.
export NODE=$(which node)

2. Create the test fixtures

mkdir -p /tmp/fs-contention-files
dd if=/dev/urandom of=/tmp/fs-contention-files/small bs=128     count=1 status=none
dd if=/dev/urandom of=/tmp/fs-contention-files/med   bs=65536   count=1 status=none
dd if=/dev/urandom of=/tmp/fs-contention-files/large bs=1048576 count=1 status=none
dd if=/dev/urandom of=/tmp/fs-contention-files/huge  bs=1048576 count=8 status=none

3. Reproduce the workers-slower-than-procs gap

# 64KB file is the contention sweet spot
$NODE run-workers.js /tmp/fs-contention-files/med 2000 8
$NODE run-procs.js   /tmp/fs-contention-files/med 2000 8

Expected: workers ~70-90k ops/s, procs ~120-150k ops/s, ratio 0.55-0.65 (typical 0.60), procs at least 1.4× faster. If you don't see a gap of at least 1.4×, check that nothing else is competing for CPU.

4. Verify the buffer-reuse fix

$NODE run-workers-reuse.js /tmp/fs-contention-files/med 2000 8
$NODE run-procs-reuse.js   /tmp/fs-contention-files/med 2000 8

Expected: both jump to 300-500k ops/s, ratio above 0.8.

5. Run the full mitigation matrix

NODE=$NODE bash run.sh

Walks through file sizes 128B / 64KB / 1MB / 8MB at k=1/4/8 with default, MALLOC_ARENA_MAX=1, and MALLOC_ARENA_MAX=64. Takes ~5 minutes.

6. Try the glibc malloc tunables workaround

MALLOC_MMAP_MAX_=0 MALLOC_TRIM_THRESHOLD_=-1 $NODE run-workers.js /tmp/fs-contention-files/med 2000 8

Expected: marginal — typically only ~5-10% improvement in repeated runs, often within run-to-run noise. (An early single run on this hardware showed +30% but did not replicate under independent re-verification.) Listed here for completeness, not as a recommended mitigation.

7. THP tuning (does NOT help)

sudo bash run-thp.sh   # restores original THP setting on exit

Expected: all three modes within noise. Reported here only to rule it out.

8. jemalloc (does NOT help)

LD_PRELOAD=/lib/x86_64-linux-gnu/libjemalloc.so.2 \
  $NODE run-workers.js /tmp/fs-contention-files/med 2000 8

Expected: small improvement at 64KB, regression at 8MB. Skip in production.

9. Direct kernel measurement with bpftrace (the smoking gun)

# Trace workers run
sudo bpftrace -o /tmp/bpf-workers.txt mmap-lock.bt \
  -c "$NODE run-workers.js /tmp/fs-contention-files/med 2000 8"

# Trace procs run
sudo bpftrace -o /tmp/bpf-procs.txt mmap-lock.bt \
  -c "$NODE run-procs.js /tmp/fs-contention-files/med 2000 8"

# Compare
echo "=== WORKERS ===" && cat /tmp/bpf-workers.txt
echo "=== PROCS ==="   && cat /tmp/bpf-procs.txt

What to look for:

  • @cnt_write — total mmap_lock write acquisitions. Workers ~10k; procs ~5k spread across 9 PIDs (@cnt_per_pid_write[…]).
  • @sum_write_ns — total cumulative wait time. Expect workers ~30-50× procs (observed range across runs: 38-60×).
  • @wait_write_ns histogram — workers will have a fat tail in 16K-128K ns; procs will be clustered at 1-2K ns.

If you see workers' write waits clustered in the 1-2K ns range, the trace didn't capture the right threads — double-check the comm filter in mmap-lock.bt matches your Node build's thread names (/proc/PID/task/*/comm).

10. Verify it's NOT fixed by switching to async

CONCURRENCY=8 $NODE run-workers-async.js /tmp/fs-contention-files/med 2000 8
CONCURRENCY=8 $NODE run-procs-async.js   /tmp/fs-contention-files/med 2000 8

# Trace it
sudo bpftrace -o /tmp/bpf-workers-async.txt mmap-lock.bt \
  -c "env CONCURRENCY=8 $NODE run-workers-async.js /tmp/fs-contention-files/med 2000 8"
cat /tmp/bpf-workers-async.txt

Expected: workers' throughput unchanged. mmap_lock waits reduced by ~3× (~89 ms vs 311 ms) but still 12-15× higher per acquisition than procs. Procs async gets slower than procs sync due to libuv threadpool CPU oversubscription.


Independent Validation

The full 10-step guide above was independently re-executed by a fresh agent that did not see the original investigation notes — only the README. It re-ran every step against the same hardware and judged each against the README's predictions. Result summary:

| Step | Verdict | Notes |
| --- | --- | --- |
| 3: basic gap | ✅ matches | workers 77-88k / procs 138-141k, ratio 0.55-0.64 |
| 4: buffer reuse fix | ✅ matches | workers-reuse 333-375k / procs-reuse 408-461k, ratio 0.81-0.92 |
| 5: full matrix | ✅ matches | 64KB k=8 default: workers 77k / procs 122k, 1.58× gap |
| 6: MALLOC_MMAP_MAX_=0 … | does not reproduce as claimed | only ~5-7% over baseline across 5 runs; the ≥15% claim was based on a single non-representative run |
| 7: THP modes | ✅ matches | within 3% across always/madvise/never |
| 8: jemalloc | ✅ matches | ~5-10% at 64KB, not dramatic |
| 9: bpftrace smoking gun | ✅ matches decisively | workers @sum_write_ns = 298 ms vs procs 7.7 ms (38.7×); avg per acquisition 28,418 ns vs 1,611 ns (17.6×); tail mass above 16K ns: workers 4,583 events vs procs 4 events (>1000×) |
| 10: async doesn't fix it | ✅ matches | workers-async avg wait 24,512 ns vs procs sync 1,611 ns = 15.2×; confirms async only reduces, doesn't eliminate, the contention |

Independent verdict on the central claim: the per-mm_struct mmap_lock contention thesis is confirmed by direct kernel measurement. The ~290 ms cumulative worker write-wait closely matches the wall-clock gap between workers and procs. Cause and effect line up.

Corrections applied based on validation:

  • The headline "27× longer per acquisition" was a one-run outlier — corrected to ~17×, consistent with the body's measurement table and the validation re-run (17.6×).
  • MALLOC_MMAP_MAX_=0 MALLOC_TRIM_THRESHOLD_=-1 was downgraded in the mitigation ranking. It was promising on one early run but only ~5-7% on repeated trials.
  • Headline ratio narrowed from "~0.54" to "0.55-0.65 (typical 0.60)" to reflect typical run-to-run variance rather than the best-case single number.

The mitigation ranking now correctly puts buffer reuse first (#1, the only thing that actually fixes it), with process sharding (#3 in the TL;DR table) sidestepping the kernel lock entirely. Apart from the Buffer.poolSize bump (#2), everything else is footnote-worthy at best on this kernel/glibc combination.


Caveats & Open Questions

  • All numbers are from a single i7-7700 (8 logical cores). On a higher-core-count machine the absolute contention may differ but the qualitative result should hold — mmap_lock is per-mm_struct everywhere.
  • Tested with glibc 2.39. Other libcs (musl, jemalloc-replaced) have different arena strategies; results may differ. jemalloc was tested here and did not help.
  • The Node binary used was built from main (27.0.0-pre). Same pattern should reproduce on Node 20.x / 22.x / 24.x — the relevant code paths (readFileSync, libuv uv_fs_*, V8 ArrayBuffer allocator) have been stable.
  • Kernel was Linux 6.8, which already has maple-tree VMAs (since 6.1) and per-VMA page-fault locks (since 6.4). The mmap/munmap write-side serialization on mmap_lock is unchanged on newer kernels at time of writing.
  • We did not test with CONFIG_LOCK_STAT-enabled kernel. That would give cleaner per-lock contention numbers without needing eBPF.
  • We did not test on a system with multiple NUMA nodes — mmap_lock is per-mm_struct so NUMA shouldn't change the per-process picture, but cross-socket cacheline bouncing could amplify it.

References

  • Linux mmap_lock design: kernel/Documentation/mm/process_addrs.rst
  • libuv sync fs path: deps/uv/src/unix/fs.c POST macro at line 139
  • Node readFileUtf8: src/node_file.cc ReadFileUtf8
  • glibc malloc arena behavior: glibc/malloc/arena.c and mallopt(3), which documents the MALLOC_* environment variables
  • bpftrace mmap_lock tracepoints: introduced in kernel 5.11 (the mmap_lock API itself arrived in 5.8), include/trace/events/mmap_lock.h
