macOS malloc abort during wasmtime cleanup after Ctrl-C-in-host-import #595

@aallan

Summary

When a Vera program is interrupted (or possibly during normal execution — see updated analysis), the Python process aborts with a macOS malloc error inside wasmtime's host-function trampoline:

Python(NNN,0xN): malloc: *** error for object 0xN: pointer being freed was not allocated
Python(NNN,0xN): malloc: *** set a breakpoint in malloc_error_break to debug
Abort trap: 6

The user sees a macOS "Python quit unexpectedly" popup. The fix for the related Python KeyboardInterrupt traceback shipped in v0.0.137 (host_sleep catches KeyboardInterrupt and raises _VeraExit(130) for a clean exit). The malloc abort is a separate, lower-level issue that may persist even with the Python-traceback fix in place.

Updated diagnosis (with crash report stack trace)

The full macOS crash report points the abort at a very specific call site:

3   libsystem_malloc.dylib   malloc_vreport + 892
4   libsystem_malloc.dylib   malloc_report + 64
5   libsystem_malloc.dylib   ___BUG_IN_CLIENT_OF_LIBMALLOC_POINTER_BEING_FREED_WAS_NOT_ALLOCATED
6   _libwasmtime.dylib       wasmtime::runtime::func::HostFunc::array_call_trampoline + 456
7-27 ???                     (24 frames of unsymbolicated JIT-compiled WASM code, all at offset 0x103491e9c — deep recursion through the same function)
28  _libwasmtime.dylib       wasmtime::runtime::func::Func::call_unchecked_raw + 356
29  _libwasmtime.dylib       wasmtime::runtime::func::Func::call_impl_do_call + 808
30  _libwasmtime.dylib       wasmtime_func_call + 420
31  libffi.dylib             ffi_call_SYSV + 80
32  libffi.dylib             ffi_call_int + 1220
33  _ctypes.cpython-314      _ctypes_callproc + 788
34  _ctypes.cpython-314      PyCFuncPtr_call + 424
35-50 Python interpreter

This rules out my earlier "cleanup-path" hypothesis: the abort happens inside wasmtime::runtime::func::HostFunc::array_call_trampoline (at offset +456), which is wasmtime's trampoline that wraps host imports. The trampoline:

  1. Marshals call args from WASM ABI to Rust ABI
  2. Invokes the host function (our Python callback via ctypes)
  3. Marshals return values back / cleans up

The +456 offset places us AFTER the host callback returned (or threw), in the cleanup/return phase. Memory the trampoline allocated for the call is being freed, but the freed pointer wasn't malloc'd by the same allocator.

Combined with the 24 frames of unsymbolicated WASM code at the same offset (suggesting 24-deep recursion through run_loop), the crash signature is consistent with: the deep WASM recursion has corrupted some memory wasmtime depends on, and the corruption surfaces when the host trampoline tries to clean up after a host call.

Revised hypothesis (likely related to #593)

The previous hypothesis listed three possibilities. The stack trace narrows it:

  1. wasmtime-py callback teardown ordering — likely NOT the cleanup ordering itself; the abort is mid-trampoline, not at process exit.
  2. Outstanding shadow-stack root not cleared — possible but doesn't directly explain malloc/free mismatch.
  3. Native callback re-entrancy — possible but the trace shows a single call stack, not signal-handler reentry.
  4. NEW (most likely): heap corruption from an in-progress codegen bug. The same codegen path that produces the U+FFFD-string corruption documented in #593 is plausibly also corrupting wasmtime-internal heap structures (e.g. the WASM linear memory could be overflowing into wasmtime's own allocator state, or a misaligned write to linear memory could clobber a metadata header that wasmtime later tries to free).

This hypothesis is supported by:

Possibly Python-3.14-related?

The user's Python is 3.14.3 (the 3.14 series was first released in October 2025). Python 3.14 included significant ctypes refactoring, and wasmtime-py may not yet be hardened against the new ctypes ABI behaviour. Worth testing the same reproducer under Python 3.13 to see whether the abort still fires; if it does not, this is at least partially a wasmtime-py-vs-Python-3.14 ABI gap.

Reproducer

Run any Vera program that (a) recurses deeply with allocating arguments, (b) uses host imports (especially IO.sleep, IO.print), and (c) hits #593's heap-corruption trigger. The simplest:

vera run /Users/aa/Downloads/files/life_full_program.vera
# Wait through generations 0-50, then Ctrl-C OR let it run to completion

The malloc abort fires reliably once the Life program reaches the corruption window from #593.

DETERMINISTIC REPRODUCER (added 2026-05-07)

While testing the IO.sleep KeyboardInterrupt guard fix in PR #594, I temporarily reverted the guard and ran the e2e test to confirm it caught the regression. The test triggered an immediate SIGABRT matching this issue's stack trace exactly:

3   Python                    faulthandler_fatal_error + 380
4   libsystem_platform.dylib  _sigtramp + 56
5   libsystem_pthread.dylib   pthread_kill + 296
6   libsystem_c.dylib         abort + 124
7-9 libsystem_malloc.dylib    ___BUG_IN_CLIENT_OF_LIBMALLOC_POINTER_BEING_FREED_WAS_NOT_ALLOCATED
10  _libwasmtime.dylib        wasmtime::runtime::func::HostFunc::array_call_trampoline + 456
11-13 ???                     (JIT-compiled WASM, 3 frames)
14-20 wasmtime + ffi + ctypes

This is the SAME signature this issue documents — but reproducible via a 5-line setup, no Life program / 200 generations / manual Ctrl-C required.

Reproducer:

# Run with the production host_sleep guard REMOVED
# (the guard at vera/codegen/api.py around line 1191).
import time as _time
from unittest.mock import patch
from vera.codegen import compile as compile_program, execute
from vera.parser import parse_to_ast

source = '''
public fn main(@Unit -> @Unit)
  requires(true) ensures(true) effects(<IO>)
{
  IO.sleep(120)
}
'''
result = compile_program(parse_to_ast(source), source=source)

with patch.object(_time, "sleep", side_effect=KeyboardInterrupt):
    execute(result)   # <-- reliably aborts with the malloc trampoline crash

This narrows the hypothesis space dramatically:

  • It's NOT about deep recursion (the program above does ONE IO.sleep call).
  • It's NOT about heap corruption from any Vera codegen bug (the program is trivial; the corruption is in wasmtime / libmalloc).
  • It's NOT about scale (one host call, one Python exception).
  • It's NOT specific to actual SIGINT — any KeyboardInterrupt raised inside a host import triggers it.

The bug is a pure interaction between:

  1. wasmtime-py's HostFunc::array_call_trampoline (Rust)
  2. Python 3.14's ctypes/ffi callback ABI
  3. A KeyboardInterrupt (or any Python exception?) escaping the host callback unexpectedly
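
One language-level reason a KeyboardInterrupt might behave differently from an ordinary exception here: it derives from BaseException rather than Exception, so a wrapper that only catches Exception lets it through. The sketch below shows just that difference using a hypothetical call_guarded helper. Whether wasmtime-py's callback wrapper actually has this shape is an assumption and has not been verified in this issue.

```python
def call_guarded(callback):
    """Hypothetical host-call wrapper; NOT wasmtime-py's real code."""
    try:
        return ("ok", callback())
    except Exception as exc:
        # Matches ordinary errors only. KeyboardInterrupt and SystemExit
        # derive from BaseException, so they are not caught here.
        return ("trapped", type(exc).__name__)


def raise_value_error():
    raise ValueError("ordinary failure")


def raise_keyboard_interrupt():
    raise KeyboardInterrupt


# An ordinary exception is converted into a result by the wrapper.
print(call_guarded(raise_value_error))

# A KeyboardInterrupt passes straight through the same wrapper.
try:
    call_guarded(raise_keyboard_interrupt)
except KeyboardInterrupt:
    print("KeyboardInterrupt escaped the wrapper")
```

If this is what happens, it would suggest testing the reproducer with an Exception subclass as the side_effect to answer the "any Python exception?" question in item 3.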

Implications:

  • The previous suspicion that this shares a root cause with #593 is now less likely. #593 ("Conway's Life at 12x30 still corrupts strings from gen 1 onwards (additional trigger beyond #588)") surfaces only at scale and involves heap corruption that the errors="replace" fix can mask. This abort is fully synthetic and doesn't need any Life program at all.
  • Worth filing upstream against wasmtime-py as a minimal reproducer.
  • Worth testing the same reproducer under Python 3.13 to isolate any 3.14-specific ABI gap (Python 3.14's ctypes refactoring is the prime suspect).
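
For the Python 3.13 vs 3.14 comparison, it may help to run the reproducer in a child interpreter so the abort does not take down the test runner. The following is a minimal sketch, assuming a POSIX system (where subprocess reports death-by-signal as a negative return code). The run_isolated helper and its parameters are hypothetical names introduced for illustration; they are not part of Vera or wasmtime-py.

```python
import signal
import subprocess
import sys


def run_isolated(script, python=sys.executable, timeout=60):
    """Run `script` with the given interpreter and describe how it ended.

    On POSIX, a negative return code -N means the child was killed by
    signal N (for example SIGABRT from the malloc abort).
    """
    proc = subprocess.run(
        [python, "-c", script],
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    if proc.returncode < 0:
        return f"killed by {signal.Signals(-proc.returncode).name}"
    return f"exited with code {proc.returncode}"


if __name__ == "__main__":
    # Stand-in script that aborts on purpose; replace it with the
    # reproducer source when comparing interpreters.
    print(run_isolated("import os; os.abort()"))
```

Passing a different interpreter path as python (for example a 3.13 build with the same wasmtime-py version installed) would let the two runs be compared side by side.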

The PR #594 production guard (catching KeyboardInterrupt and converting to _VeraExit(130) before it can escape the host import) closes the user-visible half by ensuring the production path never reaches this trigger. The underlying wasmtime/Python bug remains.


Severity

Medium — but possibly an early surface of #593's underlying corruption. Severity escalates if the same heap corruption can be triggered by a Vera program without manual interruption.

Acceptance

Workaround

None known for the underlying abort. The PR #594 guard keeps the production path from reaching the trigger.

Related
