Fix a nasty bug in the bytecode VM initialisation #13549

stedolan · 2024-10-14T09:50:25Z

The patch

The bytecode VM uses a fixed 7-instruction sequence of bytecode to set up a call from C into OCaml. This bytecode sequence requires initialisation, to register the code fragment (and thread the bytecode, if not a debug build).

Currently, this code sequence is defined per-thread and initialised on first use. This patch changes it to be defined globally and initialised at startup. (The main thread's copy was already initialised at startup through caml_init_callbacks. The change here is to make that the only copy).

Using a code sequence per thread instead of one global one is somewhat less efficient but not by itself a bug. The problem is that the code fragments leaked, as they were not unregistered at thread termination. This causes #13512.

The bug

With the old code, the following sequence of unfortunate events can occur, if you are very, very unlucky (or you are @jmid with multicoretests, manufacturing bad luck on an industrial scale):

The program uses Domain.spawn f, which creates a new domain and uses caml_callback_res to invoke f.
When caml_callback_res is called, the callback initialisation logic runs, creating a 28-byte code fragment. This fragment covers the 7-instruction callback sequence stored in thread-local storage.
Later, the domain terminates. The C library / OS frees resources used by its threads, including the memory for thread-local storage. The 28-byte code fragment remains registered.
Later, another domain does a minor GC, and runs out of space on the major heap to promote live blocks to. It requests more memory from the OS, which happens to return a 32K pool that includes the 28 bytes that are still registered as a code fragment. A particularly unlucky value is promoted into those 28 bytes.
This unlucky value remains live, but happens not to be pointed to by any other heap value: all references to it are from local variables (so, on the bytecode stack).
At the start of the next major GC cycle, the runtime scans the stack. The bytecode stack contains an arbitrary mix of values and code pointers (return addresses & exception handlers), so the GC skips over any pointers it finds that lie inside code fragments. So, it does not see that the unlucky value is live.
The unlucky value gets collected, despite being in use, causing chaos when it's next used.

In other words: dangling code fragments that are not unregistered are not just a resource leak but a soundness issue. They leave blindspots in the address space that the GC cannot perceive and cause weird bugs later on.

The callback thunk code is now initialised once, not per-thread, avoiding leaking of code fragments.

ghost · 2024-10-14T10:01:50Z

This diff appears to solve indeed the bytecode issues (#13402 and probably others) for me.

However, are you sure it is safe to use a single copy for the whole program, as two elements of callback_code get modified everytime caml_callbackN_exn is invoked? In other words, are you sure there is no risk of two domains attempting to modify and use callback_code at the same time?

stedolan · 2024-10-14T10:06:01Z

Urgh, well spotted, I'd forgotten it actually mutates it after initialisation. I'll update this with a better fix.

Instead of mutating bytecode, this version uses the same mechanism in native code and bytecode: separate code exists for 1, 2 and 3 arguments, and more arguments are handled by iterating callback3.

stedolan · 2024-10-14T16:13:19Z

Just pushed a version that (hopefully) fixes the bug without introducing a new one.

It removes the optimised caml_callbackN for bytecode, and instead uses fixed code sequences for caml_callback1/2/3, and iterates caml_callback3 for more arguments. (This is the mechanism used by native code callbacks).

ghost

This is smart and looks correct to me. And it still appears to work.

kayceesrk · 2024-10-15T04:55:29Z

This looks correct to me. However, I'm not an expert on this part of the runtime. Perhaps a review from someone who understands this better would be useful.

@Octachron, we should consider this for 5.3.

gasche · 2024-10-15T05:20:57Z

My impression is that @xavierleroy would be the right person to look at this.

xavierleroy · 2024-10-15T07:51:02Z

Thanks for pinging me. From a quick look, I agree we have a problem here, but I'm a bit sad about mirroring the native-code behavior (using "slices" of at most 3 arguments for callbacks) in the bytecode behavior. My feeling is that it's time to change a bit the API of caml_interprete to make bytecode callbacks easier. Can I have a couple of weeks to look into this before this PR is merged?

xavierleroy · 2024-10-15T07:54:18Z

Also: just exposing init_callback and uninit_callback functions, and calling them when threads/domains start/stop would solve the issue with fewer changes to the existing code.

- Use the new `caml_bytecode_interpreter` API to jump straight to the function's code. - Avoid modifying bytecode in place. - Avoid per-thread bytecode. - Register bytecode early and only once. Fixes: ocaml#13512 Closes: ocaml#13549

xavierleroy · 2024-10-15T16:15:46Z

That was easier than I thought. See #13553 for my alternate proposal.

Fix a nasty bug in the bytecode VM initialisation

23af981

The callback thunk code is now initialised once, not per-thread, avoiding leaking of code fragments.

stedolan mentioned this pull request Oct 14, 2024

GC crashes under bytecode #13512

Closed

Changes

5f46089

Use immutable bytecode for caml_callback

a019bed

Instead of mutating bytecode, this version uses the same mechanism in native code and bytecode: separate code exists for 1, 2 and 3 arguments, and more arguments are handled by iterating callback3.

ghost approved these changes Oct 14, 2024

View reviewed changes

kayceesrk approved these changes Oct 15, 2024

View reviewed changes

Octachron added this to the 5.3 milestone Oct 15, 2024

xavierleroy mentioned this pull request Oct 15, 2024

Implement bytecode callbacks with static, never-modified bytecode #13553

Merged

dra27 linked an issue Oct 16, 2024 that may be closed by this pull request

GC crashes under bytecode #13512

Closed

xavierleroy closed this in a11272a Oct 23, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Fix a nasty bug in the bytecode VM initialisation #13549

Fix a nasty bug in the bytecode VM initialisation #13549

Uh oh!

stedolan commented Oct 14, 2024

Uh oh!

ghost commented Oct 14, 2024

Uh oh!

stedolan commented Oct 14, 2024

Uh oh!

stedolan commented Oct 14, 2024

Uh oh!

ghost left a comment

Uh oh!

kayceesrk commented Oct 15, 2024

Uh oh!

gasche commented Oct 15, 2024

Uh oh!

xavierleroy commented Oct 15, 2024

Uh oh!

xavierleroy commented Oct 15, 2024

Uh oh!

xavierleroy commented Oct 15, 2024

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

Fix a nasty bug in the bytecode VM initialisation #13549

Fix a nasty bug in the bytecode VM initialisation #13549

Uh oh!

Conversation

stedolan commented Oct 14, 2024

The patch

The bug

Uh oh!

ghost commented Oct 14, 2024

Uh oh!

stedolan commented Oct 14, 2024

Uh oh!

stedolan commented Oct 14, 2024

Uh oh!

ghost left a comment

Choose a reason for hiding this comment

Uh oh!

kayceesrk commented Oct 15, 2024

Uh oh!

gasche commented Oct 15, 2024

Uh oh!

xavierleroy commented Oct 15, 2024

Uh oh!

xavierleroy commented Oct 15, 2024

Uh oh!

xavierleroy commented Oct 15, 2024

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants