Summary
v0.0.126 (#517) shipped WASM return_call emission for tail-position calls but disabled the optimization for allocating functions (functions whose body translation set ctx.needs_alloc = True). This issue tracks the follow-up: extending TCO to allocating functions by emitting GC-shadow-stack cleanup before the return_call.
Why allocating functions were excluded
WASM return_call $foo discards the current frame and jumps to $foo. In Vera's GC model, that means the GC epilogue never runs: the saved $gc_sp is not restored, and the shadow-stack pointer slots pushed by the prologue stay live. Each tail-call iteration thus leaks shadow-stack slots; eventually $gc_sp passes the worklist boundary and the next $alloc traps with out-of-bounds. For some recursion shapes this is strictly worse than the pre-fix "call stack exhausted" trap (silent shadow-stack corruption rather than a clean trap).
The post-process at the end of _compile_fn (vera/codegen/functions.py) reverts every return_call → call when ctx.needs_alloc is True, paying the WASM frame cost in exchange for correct shadow-stack management.
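A minimal sketch of what such a post-process could look like, assuming the function body is held as a flat list of WAT instruction strings. The name revert_tail_calls and the list-of-strings representation are hypothetical and not taken from vera/codegen/functions.py:

```python
def revert_tail_calls(instructions: list[str], needs_alloc: bool) -> list[str]:
    """Rewrite `return_call $f` to `call $f` when the function allocates.

    Hypothetical helper: assumes one WAT instruction per list entry.
    """
    if not needs_alloc:
        # Non-allocating functions keep their tail calls untouched.
        return list(instructions)

    out = []
    for instr in instructions:
        stripped = instr.lstrip()
        # The trailing space keeps `return_call_indirect` from matching.
        if stripped.startswith("return_call "):
            indent = instr[: len(instr) - len(stripped)]
            out.append(indent + "call " + stripped[len("return_call "):])
        else:
            out.append(instr)
    return out
```

With a plain call in tail position, control returns to the current function, so the normal GC epilogue still runs before the frame is popped.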
What needs to happen
Emit a "GC tail-call epilogue" before each return_call in an allocating function:
;; Restore $gc_sp to entry value BEFORE evaluating call args.
;; Args may allocate; those allocations push from the entry-level
;; baseline, so the callee's prologue sees a clean shadow stack.
local.get $gc_sp_save
global.set $gc_sp
;; Now evaluate args — any allocation here pushes from clean state.
... arg expressions ...
return_call $foo
The semantics:
- At entry: $gc_sp is at value S0 (caller's pre-call value).
- Prologue: save S0 to local $gc_sp_save, push pointer params, $gc_sp is now S1.
- ... function body executes, may allocate further ...
- At return_call: restore $gc_sp to S0 BEFORE arg evaluation.
- Args evaluate from S0 baseline; any allocations push correctly.
- return_call jumps to callee; callee's prologue saves the new gc_sp baseline.
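The S0/S1 bookkeeping above can be illustrated with a toy model. This is only a sketch under simplifying assumptions: $gc_sp is counted in abstract slots rather than bytes, each pointer param takes one slot, and body and argument allocations are ignored. The function simulate and its parameters are hypothetical:

```python
def simulate(iterations: int, pointer_params: int,
             restore_before_tail_call: bool) -> int:
    """Return the final $gc_sp (in slots) after a chain of tail calls."""
    gc_sp = 0  # S0 for the very first call
    for _ in range(iterations):
        gc_sp_save = gc_sp        # prologue: save the entry value S0
        gc_sp += pointer_params   # prologue: push pointer params, now S1
        if restore_before_tail_call:
            gc_sp = gc_sp_save    # restore to S0 before arg evaluation
        # return_call: the frame is discarded and the epilogue never runs,
        # so whatever gc_sp holds here is the callee's entry value.
    return gc_sp


leaked = simulate(1000, 2, restore_before_tail_call=False)
clean = simulate(1000, 2, restore_before_tail_call=True)
```

In this model the unrestored variant grows by pointer_params slots per iteration, while the restored variant ends where it started.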
Implementation sketch
In _compile_fn (vera/codegen/functions.py):
- Track which body instructions are return_call $X emissions.
- When needs_alloc is True, instead of reverting to plain call, prepend the GC-restore sequence to each return_call, placed before the call's argument evaluation.
- Ensure the args don't reference anything freed by the $gc_sp restore. They shouldn't: locals are independent of $gc_sp.
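One possible shape for this, sketched under assumptions. Because the restore has to land before the argument expressions, this sketch inserts it at the point where the tail call is emitted rather than in a post-process over the finished body; that is a swapped-in approach, not necessarily how _compile_fn would do it. The names emit_tail_call and GC_RESTORE are hypothetical:

```python
# The two-instruction restore sequence from the WAT listing in this issue.
GC_RESTORE = ["local.get $gc_sp_save", "global.set $gc_sp"]


def emit_tail_call(callee: str, arg_instrs: list[str],
                   needs_alloc: bool) -> list[str]:
    """Emit a tail call, with the GC restore ahead of the args if needed.

    Hypothetical helper: assumes arg_instrs is the already-translated
    argument code, one WAT instruction per entry.
    """
    out = []
    if needs_alloc:
        # Reset $gc_sp to the entry value before any argument allocates.
        out.extend(GC_RESTORE)
    out.extend(arg_instrs)
    out.append(f"return_call {callee}")
    return out
```

For example, emit_tail_call("$foo", ["local.get $n"], needs_alloc=True) yields the restore pair, then the argument, then return_call $foo.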
Tests
Once shipped:
- A tail-recursive allocating function (e.g. building an ADT each iteration) should run for 1M+ iterations without trapping.
- Structural assertion: the return_call site is preceded by the GC-restore sequence.
- Behavioural: equivalent of test_tail_recursive_iteration_succeeds_at_1m but with allocation.
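The structural assertion could be sketched as a linear scan over the emitted instructions. This assumes the body is available as a flat list of WAT instruction strings; the helper name is hypothetical, and a linear scan is only an approximation for bodies with branches:

```python
def tail_calls_have_gc_restore(instructions: list[str]) -> bool:
    """Check that each return_call has a GC restore since the previous one."""
    restore = ["local.get $gc_sp_save", "global.set $gc_sp"]
    seen_restore = False
    prev = None
    for instr in (i.strip() for i in instructions):
        # The restore counts only when both instructions are adjacent.
        if prev == restore[0] and instr == restore[1]:
            seen_restore = True
        if instr.startswith("return_call "):
            if not seen_restore:
                return False
            seen_restore = False  # each tail call needs its own restore
        prev = instr
    return True
```

A test could run this over the compiled text of an allocating tail-recursive function and assert that it returns True.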
Out of scope
- TCO for closures called via call_indirect (would need return_call_indirect); separate issue if there's user demand.
Notes
Discovered as a deliberate scoping decision in v0.0.126 (#517). The current behaviour is correct (allocating functions get plain call, paying the stack cost); this issue keeps that correctness while also flattening the stack. Probably 30-60 minutes of focused work given the existing prologue/epilogue infrastructure.