GC-aware tail-call optimization for allocating functions

## Summary

v0.0.126 (#517) shipped WASM `return_call` emission for tail-position calls but disabled the optimization for allocating functions (functions whose body translation set `ctx.needs_alloc = True`).  This issue tracks the follow-up: extending TCO to allocating functions by emitting GC-shadow-stack cleanup before the `return_call`.

## Why allocating functions were excluded

WASM `return_call $foo` discards the current frame and jumps to `$foo`.  In Vera's GC model, that means the **GC epilogue never runs** — the saved `$gc_sp` is not restored, and the shadow-stack pointer slots pushed by the prologue stay live.  Each tail-call iteration thus leaks shadow-stack slots; eventually `$gc_sp` passes the worklist boundary and the next `$alloc` traps with out-of-bounds.  Strictly *worse* for some recursion shapes than the pre-fix "call stack exhausted" trap (silent shadow-stack corruption rather than a clean trap).

The post-process at the end of `_compile_fn` (`vera/codegen/functions.py`) reverts every `return_call` → `call` when `ctx.needs_alloc` is True, paying the WASM frame cost in exchange for correct shadow-stack management.

## What needs to happen

Emit a "GC tail-call epilogue" before each `return_call` in an allocating function:

```wat
;; Restore $gc_sp to entry value BEFORE evaluating call args.
;; Args may allocate; those allocations push from the entry-level
;; baseline, so the callee's prologue sees a clean shadow stack.
local.get $gc_sp_save
global.set $gc_sp

;; Now evaluate args — any allocation here pushes from clean state.
... arg expressions ...

return_call $foo
```

The semantics:
- At entry: `$gc_sp` is at value S0 (caller's pre-call value).
- Prologue: save S0 to local `$gc_sp_save`, push pointer params, `$gc_sp` is now S1.
- ... function body executes, may allocate further ...
- At return_call: restore `$gc_sp` to S0 BEFORE arg evaluation.
- Args evaluate from S0 baseline; any allocations push correctly.
- `return_call` jumps to callee; callee's prologue saves the new gc_sp baseline.

## Implementation sketch

In `_compile_fn` (`vera/codegen/functions.py`):
1. Track which body instructions are `return_call $X` emissions.
2. When `needs_alloc` is True, instead of reverting to plain `call`, prepend the GC-restore-prologue to each `return_call`.
3. Need to ensure the args don't reference locals that were freed by the gc_sp restore — they shouldn't, locals are independent of `$gc_sp`.

## Tests

Once shipped:
- A tail-recursive allocating function (e.g. building an ADT each iteration) should run for 1M+ iterations without trapping.
- Structural assertion: the `return_call` site is preceded by the GC-restore sequence.
- Behavioural: equivalent of `test_tail_recursive_iteration_succeeds_at_1m` but with allocation.

## Out of scope

- TCO for closures called via `call_indirect` (would need `return_call_indirect`); separate issue if there's user demand.

## Notes

Discovered as a deliberate scoping decision in v0.0.126 (#517).  The current behaviour is correct (allocating functions get plain `call`, paying the stack cost); this issue extends correctness by also flattening the stack.  Probably 30-60 minutes of focused work given the existing prologue/epilogue infrastructure.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

GC-aware tail-call optimization for allocating functions #549

Summary

Why allocating functions were excluded

What needs to happen

Implementation sketch

Tests

Out of scope

Notes

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

GC-aware tail-call optimization for allocating functions #549

Description

Summary

Why allocating functions were excluded

What needs to happen

Implementation sketch

Tests

Out of scope

Notes

Metadata

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Issue actions