macOS malloc abort during wasmtime cleanup after Ctrl-C-in-host-import

## Summary

When a Vera program is interrupted (or possibly during normal execution; see the updated diagnosis below), the Python process aborts with a macOS malloc error inside wasmtime's host-function trampoline:

```
Python(NNN,0xN): malloc: *** error for object 0xN: pointer being freed was not allocated
Python(NNN,0xN): malloc: *** set a breakpoint in malloc_error_break to debug
Abort trap: 6
```

The user sees a macOS "Python quit unexpectedly" popup. The fix for the related Python `KeyboardInterrupt` traceback is shipped in v0.0.137 (host_sleep catches `KeyboardInterrupt` and raises `_VeraExit(130)` for clean exit). **The malloc abort is a separate, lower-level issue that may persist even after the Python-traceback fix lands.**

## Updated diagnosis (with crash report stack trace)

The full macOS crash report points the abort at a very specific call site:

```
3   libsystem_malloc.dylib   malloc_vreport + 892
4   libsystem_malloc.dylib   malloc_report + 64
5   libsystem_malloc.dylib   ___BUG_IN_CLIENT_OF_LIBMALLOC_POINTER_BEING_FREED_WAS_NOT_ALLOCATED
6   _libwasmtime.dylib       wasmtime::runtime::func::HostFunc::array_call_trampoline + 456
7-27 ???                     (24 frames of unsymbolicated JIT-compiled WASM code, all at offset 0x103491e9c — deep recursion through the same function)
28  _libwasmtime.dylib       wasmtime::runtime::func::Func::call_unchecked_raw + 356
29  _libwasmtime.dylib       wasmtime::runtime::func::Func::call_impl_do_call + 808
30  _libwasmtime.dylib       wasmtime_func_call + 420
31  libffi.dylib             ffi_call_SYSV + 80
32  libffi.dylib             ffi_call_int + 1220
33  _ctypes.cpython-314      _ctypes_callproc + 788
34  _ctypes.cpython-314      PyCFuncPtr_call + 424
35-50 Python interpreter
```

This **rules out my earlier "cleanup-path" hypothesis**: the abort happens *inside* `wasmtime::runtime::func::HostFunc::array_call_trampoline` (at offset +456), which is wasmtime's trampoline that wraps host imports. The trampoline:

1. Marshals call args from WASM ABI to Rust ABI
2. Invokes the host function (our Python callback via ctypes)
3. Marshals return values back / cleans up

The `+456` offset suggests we are past the point where the host callback returned (or raised), in the cleanup/return phase. Memory the trampoline allocated for the call is being freed, but the pointer handed to `free` was not allocated by that allocator.

Combined with the 24 frames of unsymbolicated WASM code at the same offset (suggesting 24-deep recursion through `run_loop`), the crash signature is consistent with: **the deep WASM recursion has corrupted some memory wasmtime depends on, and the corruption surfaces when the host trampoline tries to clean up after a host call**.

## Revised hypothesis (likely related to #593)

The previous hypothesis listed three possibilities. The stack trace narrows it:

1. ~~wasmtime-py callback teardown ordering~~ — likely NOT the cleanup ordering itself; the abort is mid-trampoline, not at process exit.
2. ~~Outstanding shadow-stack root not cleared~~ — possible but doesn't directly explain malloc/free mismatch.
3. ~~Native callback re-entrancy~~ — possible but the trace shows a single call stack, not signal-handler reentry.
4. **NEW (most likely): heap corruption from an in-progress codegen bug**. The same codegen path that produces the U+FFFD-string corruption documented in [#593](https://github.com/aallan/vera/issues/593) is plausibly also corrupting wasmtime-internal heap structures (e.g. the WASM linear memory could be overflowing into wasmtime's own allocator state, or a misaligned write to linear memory could clobber a metadata header that wasmtime later tries to free).

This hypothesis is supported by:
- Both #593 and this abort surface from the SAME Life program at 12×30 scale.
- Both appear from "generation 1+" timing (deep into the recursive run_loop).
- The 24-frame deep WASM stack at frames 7-27 corresponds to ~24 generations of run_loop recursion.

## Possibly Python-3.14-related?

The user's Python is **3.14.3**; the 3.14 series was first released in October 2025. Python 3.14 included significant ctypes refactoring, and wasmtime-py may not yet be hardened against the new ctypes ABI behaviour. Worth testing the same reproducer under Python 3.13 to see if the abort still fires; if it does not, this is at least partly a wasmtime-py vs Python 3.14 ABI gap.

## Reproducer

Run any Vera program that (a) recurses deeply with allocating arguments, (b) uses host imports (especially `IO.sleep`, `IO.print`), and (c) hits #593's heap-corruption trigger. The simplest:

```bash
vera run /Users/aa/Downloads/files/life_full_program.vera
# Wait through generations 0-50, then Ctrl-C OR let it run to completion
```

The malloc abort fires reliably once the Life program reaches the corruption window from #593.

## DETERMINISTIC REPRODUCER (added 2026-05-07)

While testing the `IO.sleep` `KeyboardInterrupt` guard fix in PR #594, I temporarily reverted the guard and ran the e2e test to confirm it caught the regression. The test triggered an immediate `SIGABRT` matching this issue's stack trace exactly:

```
3   Python                    faulthandler_fatal_error + 380
4   libsystem_platform.dylib  _sigtramp + 56
5   libsystem_pthread.dylib   pthread_kill + 296
6   libsystem_c.dylib         abort + 124
7-9 libsystem_malloc.dylib    ___BUG_IN_CLIENT_OF_LIBMALLOC_POINTER_BEING_FREED_WAS_NOT_ALLOCATED
10  _libwasmtime.dylib        wasmtime::runtime::func::HostFunc::array_call_trampoline + 456
11-13 ???                     (JIT-compiled WASM, 3 frames)
14-20 wasmtime + ffi + ctypes
```

This is the SAME signature this issue documents, but it is reproducible with a short script: no Life program, no 200 generations, and no manual Ctrl-C required.

**Reproducer:**

```python
# Run with the production host_sleep guard REMOVED
# (the guard at vera/codegen/api.py around line 1191).
import time as _time
from unittest.mock import patch
from vera.codegen import compile as compile_program, execute
from vera.parser import parse_to_ast

source = '''
public fn main(@Unit -> @Unit)
  requires(true) ensures(true) effects(<IO>)
{
  IO.sleep(120)
}
'''
result = compile_program(parse_to_ast(source), source=source)

with patch.object(_time, "sleep", side_effect=KeyboardInterrupt):
    execute(result)   # <-- reliably aborts with the malloc trampoline crash
```

This narrows the hypothesis space dramatically:

- It's **NOT** about deep recursion (the program above does ONE `IO.sleep` call).
- It's **NOT** about heap corruption from any Vera codegen bug (the program is trivial; the corruption is in wasmtime / libmalloc).
- It's **NOT** about scale (one host call, one Python exception).
- It's **NOT** specific to actual SIGINT — any `KeyboardInterrupt` raised inside a host import triggers it.

The bug is a pure interaction between:
1. wasmtime's `HostFunc::array_call_trampoline` (Rust), reached through wasmtime-py
2. Python 3.14's ctypes/ffi callback ABI
3. A `KeyboardInterrupt` (or any Python exception?) escaping the host callback unexpectedly
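
The open question in item 3 (is it only `KeyboardInterrupt`, or any exception?) is worth testing with one Python fact in mind: `KeyboardInterrupt` derives from `BaseException`, not `Exception`, so a wrapper that uses `except Exception` will not stop it. The sketch below uses plain `ctypes` only. The `guarded` wrapper and the `int -> int` callback signature are hypothetical illustrations, and nothing in this issue establishes that wasmtime-py's own callback wrapper behaves this way. It shows a `ValueError` being converted to an error code while a `KeyboardInterrupt` leaves the Python callable and reaches the ctypes callback layer, which reports it through `sys.unraisablehook` instead of propagating it.

```python
import ctypes
import sys

# Hypothetical callback signature (int -> int); not wasmtime-py's real one.
CALLBACK = ctypes.CFUNCTYPE(ctypes.c_int, ctypes.c_int)


def guarded(fn):
    """Hypothetical guard: map ordinary exceptions to an error code of -1."""
    def wrapper(x):
        try:
            return fn(x)
        except Exception:  # does NOT match KeyboardInterrupt or SystemExit
            return -1
    return wrapper


def raise_value_error(x):
    raise ValueError("ordinary failure")


def raise_interrupt(x):
    raise KeyboardInterrupt


def demo():
    """Return (guarded ValueError result, exception types that reached ctypes)."""
    escaped = []
    previous_hook = sys.unraisablehook
    # ctypes reports exceptions that escape a callback via sys.unraisablehook.
    sys.unraisablehook = lambda info: escaped.append(info.exc_type)
    try:
        ok_cb = CALLBACK(guarded(raise_value_error))
        bad_cb = CALLBACK(guarded(raise_interrupt))
        guarded_result = ok_cb(0)  # the guard converts ValueError to -1
        bad_cb(0)                  # return value is not meaningful here
    finally:
        sys.unraisablehook = previous_hook
    return guarded_result, escaped


if __name__ == "__main__":
    print(demo())
```

Re-running the deterministic reproducer with `side_effect=ValueError` and then `side_effect=SystemExit` would show whether the abort tracks the `Exception` / `BaseException` split or fires for every exception type.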

**Implications:**

- The previous suspicion that this shares a root cause with [#593](https://github.com/aallan/vera/issues/593) is now **less likely** — #593 surfaces only at scale and involves heap corruption that the `errors="replace"` fix can mask. This abort is fully synthetic and doesn't need any Life program at all.
- Worth filing upstream against wasmtime-py as a minimal reproducer.
- Worth testing the same reproducer under Python 3.13 to isolate any 3.14-specific ABI gap (Python 3.14's ctypes refactoring is the prime suspect).

The PR #594 production guard (catching `KeyboardInterrupt` and converting to `_VeraExit(130)` before it can escape the host import) closes the user-visible half by ensuring the production path never reaches this trigger. The underlying wasmtime/Python bug remains.

---

## Severity

Medium. The deterministic reproducer makes a shared root cause with #593 less likely, but severity escalates if the same abort can be triggered by a Vera program without manual interruption.

## Acceptance

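- The deterministic reproducer above (run with the `host_sleep` guard removed) runs to completion without a malloc abort.
- If the underlying cause turns out to be shared with #593 (now less likely, given the deterministic reproducer), closing #593 will close this. Otherwise, the next steps are to narrow down which wasmtime-internal memory is being freed incorrectly and to file the minimal reproducer upstream against wasmtime-py.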
- Verify reproducer under Python 3.13 to isolate any Python-3.14-specific ctypes ABI gap.
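
One way to automate the acceptance checks is to run the reproducer in a child process and inspect how it exited, so that an abort does not take the test runner down with it. The helper below is a sketch under stated assumptions: `died_from_sigabrt` is a hypothetical name, the interpreter path (for example `python3.13`) is supplied by the caller, and the signal check relies on the POSIX behaviour of `subprocess` reporting death-by-signal as a negative return code.

```python
import signal
import subprocess
import sys


def died_from_sigabrt(interpreter, script_source):
    """Run script_source under interpreter; True if the child was killed by SIGABRT."""
    proc = subprocess.run(
        [interpreter, "-c", script_source],
        capture_output=True,
        text=True,
    )
    # On POSIX, a child killed by signal N has returncode == -N.
    return proc.returncode == -signal.SIGABRT


if __name__ == "__main__":
    # Self-check: a child that calls os.abort() should be detected.
    print(died_from_sigabrt(sys.executable, "import os; os.abort()"))
```

Passing the deterministic reproducer's source as `script_source`, once with a 3.14 interpreter and once with a 3.13 interpreter, would give a direct comparison between the two versions.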

## Workaround

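None known for the underlying wasmtime/Python bug. The PR #594 guard keeps the production path from reaching the trigger.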

## Related

- [#593](https://github.com/aallan/vera/issues/593): full-Life corruption (shared root cause now considered less likely; see the deterministic reproducer)
- [#589](https://github.com/aallan/vera/issues/589) — UTF-8 traceback escape (same WasmTrapError contract class)
- v0.0.137 — `host_sleep` `KeyboardInterrupt` → `_VeraExit(130)` fix (Python-traceback half; lands separately in PR #594).


