Skip to content

GC-aware tail-call optimization for allocating functions #549

@aallan

Description

@aallan

Summary

v0.0.126 (#517) shipped WASM return_call emission for tail-position calls but disabled the optimization for allocating functions (functions whose body translation set ctx.needs_alloc = True). This issue tracks the follow-up: extending TCO to allocating functions by emitting GC-shadow-stack cleanup before the return_call.

Why allocating functions were excluded

WASM return_call $foo discards the current frame and jumps to $foo. In Vera's GC model, that means the GC epilogue never runs — the saved $gc_sp is not restored, and the shadow-stack pointer slots pushed by the prologue stay live. Each tail-call iteration thus leaks shadow-stack slots; eventually $gc_sp passes the worklist boundary and the next $alloc traps with out-of-bounds. Strictly worse for some recursion shapes than the pre-fix "call stack exhausted" trap (silent shadow-stack corruption rather than a clean trap).

The post-process at the end of _compile_fn (vera/codegen/functions.py) reverts every return_callcall when ctx.needs_alloc is True, paying the WASM frame cost in exchange for correct shadow-stack management.

What needs to happen

Emit a "GC tail-call epilogue" before each return_call in an allocating function:

;; Restore $gc_sp to entry value BEFORE evaluating call args.
;; Args may allocate; those allocations push from the entry-level
;; baseline, so the callee's prologue sees a clean shadow stack.
local.get $gc_sp_save
global.set $gc_sp

;; Now evaluate args — any allocation here pushes from clean state.
... arg expressions ...

return_call $foo

The semantics:

  • At entry: $gc_sp is at value S0 (caller's pre-call value).
  • Prologue: save S0 to local $gc_sp_save, push pointer params, $gc_sp is now S1.
  • ... function body executes, may allocate further ...
  • At return_call: restore $gc_sp to S0 BEFORE arg evaluation.
  • Args evaluate from S0 baseline; any allocations push correctly.
  • return_call jumps to callee; callee's prologue saves the new gc_sp baseline.

Implementation sketch

In _compile_fn (vera/codegen/functions.py):

  1. Track which body instructions are return_call $X emissions.
  2. When needs_alloc is True, instead of reverting to plain call, prepend the GC-restore-prologue to each return_call.
  3. Need to ensure the args don't reference locals that were freed by the gc_sp restore — they shouldn't, locals are independent of $gc_sp.

Tests

Once shipped:

  • A tail-recursive allocating function (e.g. building an ADT each iteration) should run for 1M+ iterations without trapping.
  • Structural assertion: the return_call site is preceded by the GC-restore sequence.
  • Behavioural: equivalent of test_tail_recursive_iteration_succeeds_at_1m but with allocation.

Out of scope

  • TCO for closures called via call_indirect (would need return_call_indirect); separate issue if there's user demand.

Notes

Discovered as a deliberate scoping decision in v0.0.126 (#517). The current behaviour is correct (allocating functions get plain call, paying the stack cost); this issue extends correctness by also flattening the stack. Probably 30-60 minutes of focused work given the existing prologue/epilogue infrastructure.

Metadata

Metadata

Assignees

No one assigned

    Labels

    enhancementNew feature or request

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions