Skip to content

Revamp allocator #20513

Description

@overlookmotel

#18168 brought our arena allocator implementation "in house". We can now alter it to our needs.

This issue is more of a sketch than an concrete battle plan. I can see how most of the elements expressed below gel together, but I don't have a completely coherent whole yet.

Problems we have

2 Allocators

Oxlint has essentially 2 different allocator implementations.

  1. Standard arena allocator
  2. "Fixed size allocator" which allocates a single 4 GiB chunk, used for raw transfer.

It's a mess. We have to copy data into a fixed size allocator (inefficient) and there's some very shonky unsafe code - the whole thing is patched together with gaffer tape and string.

Windows OOM

Out of memory errors in Oxlint on Windows due to the large allocations. Also manifests on some (relatively uncommon) linux distros.

Problem stems from Window's strategy in processing allocations. It requires backing memory to be available, even if most of the allocation's pages are never touched. This is generally not a problem on Linux and Mac, as standard configurations allow overcommit - allocations only consume virtual memory, not physical memory, until data is written to them. The problematic linux distros reported appear to disable overcommit, and so behave more like Windows.

Disconnected arenas

Semantic / Scoping also store data in an arena, but a separate one from the arena storing the AST. They have different lifetimes, so data in one cannot reference data in the other. Semantic therefore has to copy string data for all symbols into it's own arena. The copying is expensive, and cache utilisation is poor.

Proposed solutions

A "proper" allocator

To solve the OOM issue we need to explicitly reserve virtual memory only with the OS, via mmap and VirtualAlloc (Windows). Our allocator manages that memory itself, converting pages to mapped physical memory as required. This involves interfacing directly with the OS.

This is tricky, but as well as solving our problem, it does also present opportunities for optimizations and interesting features.

Unified allocator

1 allocator implementation which unifies the 2 allocator variants we have in Oxlint.

This new allocator would have all the features of the current one, but all arenas would be raw transfer-enabled.

This would:

  1. Remove all the mess and unsafe code from Oxlint.
  2. Unlock Rolldown using raw transfer for transform plugins.
  3. Allow us to stabilise raw transfer in oxc-parser and make it the default (making Oxc by far the fastest parser available in JS).

1 allocator, multiple arenas

Allocator should contain 2 separate arenas - for AST and semantic data. Both would have the same lifetime, and so there'd be no need to copy data between them.

The semantic arena would be reused, rather than having to be dropped and re-created each time you re-run semantic analysis (Rolldown).

Separate string data arena

All types stored in AST allocator are 8-byte aligned, with 1 exception: strings.

Currently this consumes memory with useless padding. Allocating a string, followed by an AST node, followed by a string leads to spare padding bytes due to the alignment requirements of the AST nodes.

Additionally, when building strings, they tend to end up copied multiple times as they grow, leaving stale data scattered across the arena.

Instead we should have 2 arenas:

  1. AST data - everything aligned on 8, bumping downwards
  2. String data - unaligned, tightly packed, bumping upwards

Alternatively, we could have a single arena which is consumed from both ends. Strings are allocated at the start of arena, bumping upwards, other types from the end, bumping downwards. An allocator chunk is full when the two meet in the middle.

Tightly packing strings also has one other important advantage - all strings are guaranteed to have readable bytes after them. This enables reading out of bounds, which would make all common string ops (e.g. comparison, hashing) faster (oxc-project/backlog#195). It's also far more amenable to SIMD, as you can always load a SIMD vector's worth of bytes, with no need for scalar fallbacks.

Scratch arena

Allocator would also contain a 3rd arena - scratch space.

This is an arena specifically for temporary intermediate data which is needed during e.g. parsing, transforming, minifying etc, but is dropped at the end of processing, and doesn't get put in the final AST / Semantic data structures.

This acts as a lightweight heap which is wiped at the end of each compilation phase, and then reused by the next phase - with much of it still warm in CPU cache. We could potentially remove all heap allocations from the entire pipeline.

Per-thread storage

Each thread should have it's own allocator pool, and as much as possible only operate in allocators which are tied to the thread.

Our current AllocatorPool implementation shares allocators across all threads, so they're constantly being moved between CPU cores.

If each thread "stays in its own lane", each CPU core operates on the same chunks of memory over and over, keeping it warm in cache for that core.

On-disc caching

AST contains pointers which are only valid if the AST remains in the same location in memory. This makes on-disc caching expensive. Writing to disc and reloading from disc requires serializing/deserializing AST and loading the whole thing into memory before any operations on it can begin.

If we take full control of memory management, we can ensure that a particular AST is loaded into memory at a specific address. Saving to disc is as simple as dumping the whole arena to a file. Retrieving it from disc is even simpler - you just mmap the file to the right address. The AST gets loaded from disc lazily as you walk the AST, and any parts of the AST which aren't accessed remain on disc, untouched - never even loaded into memory.

I imagine this could be very useful and performant for Rolldown to implement persistent caching.

Implementation

The basic sketch is this:

  • On startup, reserve a large chunk of address space from system - 4 GiB x thread count.
  • Each thread gets it's own allocator pool, stored in thread local.
  • Each pool owns a 4 GiB address space, aligned on a 4 GiB boundary.
  • Address space is obtained from OS via mmap / VirtualAlloc - no OOM, even with overcommit disabled.
  • Pools create Allocators on demand, utilising sections of their reserved address space to back them.
  • Allocators can grow by requesting another chunk from the pool.
  • When chunks are created, only then is memory committed to back them (again, via mmap).
  • In extremely large projects, where more than 4 GiB per thread is required, a further large address space can be reserved, and added to the pools.

This scheme removes the need for raw transfer to have a 4 GiB allocation for each AST. The AST can be scattered across multiple chunks (as is the case with our current bumpalo-based allocator). But as long as all chunks come from within the same 4 GiB address space, the invariant that raw transfer requires - that all pointers in an AST have the same top 32 bits - is satisfied.

The most challenging part is that the pools need to maintain free lists to allow creating and destroying allocator chunks in non-serial order. However, this is far simpler than in a "real" allocator, as they only need to deal in large allocations, with a very limited range of size classes.

Metadata

Metadata

Assignees

Labels

Fields

Priority

None yet

Start date

None yet

Target date

None yet

Effort

High

Projects

No projects

Milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions