Address PR review findings on #671

aallan · Claude · aallan · commit ae573d36b94e · 2026-05-13T19:58:11.000+01:00
Comprehensive PR review (code-reviewer + pr-test-analyzer + comment-analyzer) surfaced 5 actionable items. All fixed here: **Inaccurate comments** (comment-analyzer) 1. `tests/test_stress.py` — `test_tco_with_allocation_1m_iterations` docstring claimed "~12 bytes per leaked frame ≈ 1300 iters". `gc_shadow_push` writes one i32 = 4 bytes, not 12 (the 12-byte figure came from a different fn elsewhere in the file with 2 pointer params + 1 tmp root). The `loop` fn here only pushes 1 pointer per iteration. Corrected to "4 bytes per leaked pointer root ≈ 4,096 iterations" and added the inline note that the @int params aren't pointer-typed. Also dropped the "~200ms on a 2026 MacBook Pro" claim (will rot). 2. `vera/codegen/functions.py` — comment cited "operators.py" unqualified, but `vera/codegen/` and `vera/wasm/` are peer directories — should be `vera/wasm/operators.py`. This matters because #672 explicitly audits the inline-indentation surface; a mis-pathed reference would cost search time. Added a #672 cross-reference too. 3. `vera/codegen/functions.py` dispatch comment said "Two sources of post-body work" but the actual code is a three-branch dispatch (revert / patch / untouched). Reframed as "Three outcomes (precedence: 1 > 2 > 3)" with the third explicitly named — the common non-allocating tail-recursion case from SKILL.md's "Iteration" section. Also trimmed the `alloc_local` implementation-detail digression that didn't add value. 4. `tests/test_codegen.py` — `test_allocating_function_uses_gc_ aware_tco_549` docstring said "sacrificing TCO for shadow- stack correctness". Inverted framing — pre-#549 the GC epilogue ran fine, what was sacrificed was WASM call-stack depth (1M iters trapped with `call stack exhausted` because the function paid a WASM frame per recursion). Reframed. **Test coverage gap** (pr-test-analyzer rating 8) 5. Added `test_allocating_function_gc_aware_tco_patches_both_ branches`. The patch loop iterates every body instruction and patches every `return_call` site individually; a buggy implementation that bails after the first match (e.g. an accidental `break`) would still pass the single-site test above. New test uses a `match` with two ADT arms each ending in `return_call $build`, asserts BOTH sites have the `local.get N; global.set $gc_sp` preamble, and that N matches the same `$gc_sp_save` local captured by the prologue. **Suggestion** (pr-test-analyzer rating 6) 6. `test_shadow_stack_overflow_traps` — added inline calibration for the 2000-iteration count: "16K shadow stack / 12 bytes per frame (2 Array<Bool> params + 1 tmp root) ≈ 1,365 frames; 2000 chosen with ~1.5× safety margin". Previously the threshold was undocumented; a future per-frame size change could silently weaken the test without anyone noticing. **Not addressed**: - pr-test-analyzer's mutual-recursion test suggestion (rating 8) — the dispatch is per-function and each function's correctness is independently pinned. Cross-function GC composition is a deeper concern that doesn't sit in this PR's scope and would benefit from a separate issue. - pr-test-analyzer's zero-`return_call`-site test (rating 6) — edge case is fine; allocating leaf fns just have an unused `$gc_sp_save` local, no harm. - api.py inner-helper workaround mention in `stack_exhausted` message — too tangential for an end-user-facing trap message. **code-reviewer verdict**: no issues found at confidence ≥ 80. The fix is structurally correct, the dispatch precedence is explicitly pinned by tests, and CodeRabbit's prior concerns (S101 noqa, indent alignment, boundary-safe regex) are all addressed. Validation: pytest 3863/3863 ✓, mypy clean ✓, ruff S rules clean ✓, doc-counts consistent (test counts updated to 3,893). Co-Authored-By: Claude <noreply@anthropic.invalid>
diff --git a/README.md b/README.md
@@ -181,7 +181,7 @@ cp /path/to/vera/SKILL.md ~/.claude/skills/vera-language/SKILL.md
 
 ## Project status
 
-Vera is in **active development** at v0.0.154 — 810+ commits, 154 releases, 3,892 tests, 96% code coverage, 86 conformance programs, 34 examples, and a 13-chapter specification. See **[HISTORY.md](HISTORY.md)** for how the compiler was built.
+Vera is in **active development** at v0.0.154 — 810+ commits, 154 releases, 3,893 tests, 96% code coverage, 86 conformance programs, 34 examples, and a 13-chapter specification. See **[HISTORY.md](HISTORY.md)** for how the compiler was built.
 
 The reference compiler — parser, AST, type checker, contract verifier (Z3), WASM code generator, module system, browser runtime, and runtime contract insertion — is working. The language specification is in draft across [13 chapters](spec/).
 
diff --git a/ROADMAP.md b/ROADMAP.md
@@ -4,7 +4,7 @@ Where the project is going.  See [HISTORY.md](HISTORY.md) for what's been built
 
 ## Where we are
 
-3,892 tests, 86 conformance programs, 34 examples, 13 spec chapters.
+3,893 tests, 86 conformance programs, 34 examples, 13 spec chapters.
 
 ## What's next
 
diff --git a/TESTING.md b/TESTING.md
@@ -6,7 +6,7 @@ This is the single source of truth for Vera's testing infrastructure, coverage d
 
 | Metric | Value |
 |--------|-------|
-| **Tests** | 3,892 across 32 files (~52,852 lines of test code; 3,862 passed + 16 stress, 14 skipped) |
+| **Tests** | 3,893 across 32 files (~52,968 lines of test code; 3,863 passed + 16 stress, 14 skipped) |
 | **Compiler code coverage** | 96% of 15,149 statements (CI minimum: 80%) |
 | **Conformance programs** | 86 programs across 9 spec chapters, validating every language feature |
 | **Example programs** | 34, all validated through `vera check` + `vera verify` |
@@ -58,15 +58,15 @@ python scripts/fix_allowlists.py --fix               # auto-fix stale allowlists
 | `test_ast.py` | 127 | 1,130 | AST transformation, node structure, serialisation, string escape sequences, ability declarations |
 | `test_checker.py` | 521 | 5,939 | Type synthesis, slot resolution, effects, effect subtyping, contracts, exhaustiveness, cross-module typing, visibility, error codes, string built-ins, generic rejection, IO operation types, Markdown types, Regex types, abilities, Map collection, Set collection, Decimal type, Json type, Html type, Http effect, Inference effect, removed legacy name regression |
 | `test_verifier.py` | 145 | 2,094 | Z3 verification, counterexamples, tier classification, call-site preconditions, branch-aware preconditions, pipe operator, cross-module contracts, match/ADT verification, decreases verification, mutual recursion, refined Bool/String/Float64 param sorts, **@Nat subtraction underflow obligation** (#520 — Path-A obligation discharge via requires/path-conditions/path-aware Z3 refutation, pure-literal exclusion, Int-Int and Nat-Int → Int exemptions) |
-| `test_codegen.py` | 1,104 | 18,510 | WASM compilation, arithmetic, Float64, Byte, arrays (incl. compound element types), ADTs, match (incl. nested patterns), generics, State\<T\>, Exn\<E\> handlers, control flow, strings, string escape sequences, IO (read\_line, read\_file, write\_file, args, exit, get\_env, sleep, time, stderr), bounds checking, quantifiers, assert/assume, refinement type aliases, pipe operator, string built-ins, built-in shadowing, parse\_nat Result, GC, Markdown host bindings, Regex host bindings, Map collection, Set collection, Decimal type, Json type, Html type, Http effect, Inference effect, Random effect, example round-trips, GC shadow stack overflow, **WASM tail-call optimization** (#517 — `return_call` emission for tail-position calls, 50K- and 1M-iteration stress, structural assertions on `return_call`/plain `call` boundary, **GC-aware TCO for allocating fns (#549 — `$gc_sp` restore before each `return_call`)**, postcondition-fallback regression (still reverts to plain `call`), analyzer unit tests covering Block-trailing / IfExpr-both-branches / MatchExpr-arm-bodies / let-value-NOT-marked / call-args-NOT-marked / ExprStmt-statement-NOT-marked / IfExpr-condition-NOT-marked / MatchExpr-scrutinee-NOT-marked) |
+| `test_codegen.py` | 1,105 | 18,624 | WASM compilation, arithmetic, Float64, Byte, arrays (incl. compound element types), ADTs, match (incl. nested patterns), generics, State\<T\>, Exn\<E\> handlers, control flow, strings, string escape sequences, IO (read\_line, read\_file, write\_file, args, exit, get\_env, sleep, time, stderr), bounds checking, quantifiers, assert/assume, refinement type aliases, pipe operator, string built-ins, built-in shadowing, parse\_nat Result, GC, Markdown host bindings, Regex host bindings, Map collection, Set collection, Decimal type, Json type, Html type, Http effect, Inference effect, Random effect, example round-trips, GC shadow stack overflow, **WASM tail-call optimization** (#517 — `return_call` emission for tail-position calls, 50K- and 1M-iteration stress, structural assertions on `return_call`/plain `call` boundary, **GC-aware TCO for allocating fns (#549 — `$gc_sp` restore before each `return_call`)**, postcondition-fallback regression (still reverts to plain `call`), analyzer unit tests covering Block-trailing / IfExpr-both-branches / MatchExpr-arm-bodies / let-value-NOT-marked / call-args-NOT-marked / ExprStmt-statement-NOT-marked / IfExpr-condition-NOT-marked / MatchExpr-scrutinee-NOT-marked) |
 | `test_codegen_contracts.py` | 32 | 576 | Runtime pre/postconditions, contract fail messages, old/new state postconditions |
 | `test_codegen_monomorphize.py` | 71 | 1,326 | Generic instantiation, type inference, monomorphization edge cases, ability constraint satisfaction (Eq/Ord/Hash/Show), operation rewriting (eq/compare), show/hash dispatch, ADT auto-derivation, array operations (slice/map/filter/fold) |
 | `test_codegen_closures.py` | 50 | 1,624 | Closure lifting, captured variables, higher-order functions, iterative-builder shadow-stack regressions (#570), closure return-value shadow-push balance for both i32-pair and i32-ADT branches across array_map and array_mapi, plus VERA_EAGER_GC injection self-test (#593), IndexExpr-of-FnCall element-type inference (#614), non-contiguous capture and walker-order miscompiles (#615) |
 | `test_codegen_modules.py` | 23 | 836 | Cross-module guard rail, cross-module codegen, name collision detection (E608/E609/E610) |
 | `test_codegen_coverage.py` | 5 | 250 | Defensive error paths: E600, E601, E605, E606, unknown module calls  |
 | `test_walker_defensive_branches_597.py` | 21 | 296 | Synthetic-AST tests for the 11 defensive `isinstance` branches added by #597 (`_scan_io_ops` / `_scan_expr_for_handlers` / `_infer_expr_wasm_type` / `_infer_vera_type`) plus the 5 pr-review fixes (#2/#3/#8 — ModuleCall/AnonFn/QualifiedCall return None; dead `is not None` guards on Block/HandleExpr removed) |
 | `test_check_walker_coverage_597.py` | 15 | 311 | Unit tests for `scripts/check_walker_coverage.py` parsing logic — Expr subclass extraction, isinstance flattening (incl. tuple form), checklist-block anchoring (incl. CR-3 regression test: `# Foo → bar` outside WALKER_COVERAGE block not counted), section-header tolerance, auto-discovery invariants, end-to-end main exit code |
-| `test_stress.py` | 16 | 551 | Scale-dependent regression tests (#596) — `@pytest.mark.stress`, skipped by default.  9 logical tests × eager-GC lane parametrisation = 16 test instances.  10K `array_map`, 5K nested-array `array_map`, 1K-deep tail recursion with allocating arg, 1M-deep tail recursion with allocating arg (#549 GC-aware TCO), 20×20 nested array-fold-of-array-fold, 100K `array_fold`, 10K String allocations, 1K `State<Int>` get/put cycles, 10K `IO.print` calls.  Pins #570 / #515 / #593 / #549 / #487 / #348 / #573 regression coverage |
+| `test_stress.py` | 16 | 553 | Scale-dependent regression tests (#596) — `@pytest.mark.stress`, skipped by default.  9 logical tests × eager-GC lane parametrisation = 16 test instances.  10K `array_map`, 5K nested-array `array_map`, 1K-deep tail recursion with allocating arg, 1M-deep tail recursion with allocating arg (#549 GC-aware TCO), 20×20 nested array-fold-of-array-fold, 100K `array_fold`, 10K String allocations, 1K `State<Int>` get/put cycles, 10K `IO.print` calls.  Pins #570 / #515 / #593 / #549 / #487 / #348 / #573 regression coverage |
 | `test_errors.py` | 52 | 525 | Error code registry, diagnostic formatting, serialisation, SourceLocation, error display sync (README/HTML/spec) |
 | `test_formatter.py` | 122 | 1,075 | Comment extraction, interior comment positioning, expression/declaration formatting, match arm block bodies, idempotency, parenthesization, spec rules, ability declarations |
 | `test_cli.py` | 217 | 3,021 | CLI commands (check, verify, compile, run, test, fmt, version, quiet), subprocess integration, JSON error paths, runtime traps, arg validation, multi-file resolution, IO exit codes, --explain-slots |
diff --git a/tests/test_codegen.py b/tests/test_codegen.py
@@ -2043,8 +2043,10 @@ def test_allocating_function_uses_gc_aware_tco_549(self) -> None:
         boundary.
 
         Pre-#549: the post-process reverted every `return_call` →
-        `call` when `ctx.needs_alloc` was True, sacrificing TCO
-        for shadow-stack correctness.
+        `call` when `ctx.needs_alloc` was True, sacrificing WASM
+        call-stack depth (tail recursion eventually trapped with
+        `call stack exhausted`) so the GC epilogue could run and
+        bound shadow-stack usage.
 
         Post-#549: the post-process instead PREPENDS a `$gc_sp`
         restore (`local.get $gc_sp_save; global.set $gc_sp`)
@@ -2161,6 +2163,110 @@ def test_allocating_function_uses_gc_aware_tco_549(self) -> None:
                 f"Body:\n{build_body}"
             )
 
+    def test_allocating_function_gc_aware_tco_patches_both_branches(
+        self,
+    ) -> None:
+        """#549: every tail-position `return_call` gets the preamble.
+
+        The single-branch test above pins that the patch fires at
+        a single tail-recursive call site.  This test pins that
+        the patch loop fires at MULTIPLE sites in the same
+        function — a buggy implementation that bails after the
+        first match (e.g. `break` inside the patch loop) or that
+        only handles top-level emissions but not if/else-nested
+        ones would still pass the single-branch test.
+
+        The function below uses a `match` with two ADT arms, each
+        ending in a tail-recursive `build` call.  The analyzer
+        marks both arms as tail position (see
+        `test_analyzer_marks_match_arm_bodies`), so the codegen
+        emits two `return_call $build` sites with DIFFERENT
+        leading-whitespace prefixes (one for each match arm).
+        Both must have the `local.get N; global.set $gc_sp`
+        preamble; the local index N must be the same one captured
+        by the GC prologue.
+        """
+        source = """\
+private data Choice { Left, Right }
+
+private fn build(@Int, @Choice -> @Array<Int>)
+  requires(@Int.0 >= 0) ensures(true) decreases(@Int.0) effects(pure)
+{
+  if @Int.0 == 0 then { [0] }
+  else {
+    match @Choice.0 {
+      Left -> build(@Int.0 - 1, Left),
+      Right -> build(@Int.0 - 1, Right)
+    }
+  }
+}
+
+public fn f(-> @Int)
+  requires(true) ensures(true) effects(pure)
+{
+  array_length(build(3, Left))
+}
+"""
+        result = _compile_ok(source)
+        build_match = re.search(r"\(func \$build\b", result.wat)
+        assert build_match is not None, (
+            f"Could not locate `(func $build` in WAT"
+        )
+        build_start = build_match.start()
+        next_fn = re.search(r"\(func \$", result.wat[build_start + 1:])
+        build_end = (
+            build_start + 1 + next_fn.start()
+            if next_fn is not None
+            else len(result.wat)
+        )
+        build_body = result.wat[build_start:build_end]
+        # Capture the exact $gc_sp_save local from the prologue.
+        lines = build_body.splitlines()
+        prologue_get_idx = next(
+            (i for i, ln in enumerate(lines)
+             if ln.strip() == "global.get $gc_sp"),
+            None,
+        )
+        assert prologue_get_idx is not None, (
+            f"no `global.get $gc_sp` prologue.  Body:\n{build_body}"
+        )
+        prologue_set = lines[prologue_get_idx + 1].strip()
+        assert prologue_set.startswith("local.set ")
+        gc_sp_save_local = prologue_set[len("local.set "):]
+        expected_preamble_get = f"local.get {gc_sp_save_local}"
+        # Find every return_call site and require the same
+        # preamble at each.  This catches a regression where the
+        # patch only fires on the first site (e.g. accidental
+        # `break`) or where only a subset of nested positions get
+        # the restore.
+        return_call_indices = [
+            i for i, line in enumerate(lines)
+            if re.search(r"return_call \$build\b", line)
+        ]
+        assert len(return_call_indices) >= 2, (
+            f"Expected at least 2 return_call sites (one per "
+            f"match arm), got {len(return_call_indices)}.  "
+            f"Body:\n{build_body}"
+        )
+        for idx in return_call_indices:
+            assert idx >= 2, (
+                f"return_call at line {idx} has no room for "
+                f"preamble.  Body:\n{build_body}"
+            )
+            prev1 = lines[idx - 1].strip()
+            prev2 = lines[idx - 2].strip()
+            assert prev1 == "global.set $gc_sp", (
+                f"site {idx} missing `global.set $gc_sp`; got "
+                f"{prev1!r}.  Body:\n{build_body}"
+            )
+            assert prev2 == expected_preamble_get, (
+                f"site {idx} preamble mismatch: expected "
+                f"{expected_preamble_get!r}, got {prev2!r}.  "
+                f"Both return_call sites must reload the SAME "
+                f"local that the prologue saved.  Body:\n"
+                f"{build_body}"
+            )
+
     def test_allocating_function_with_postcondition_still_reverts(
         self,
     ) -> None:
@@ -15208,6 +15314,14 @@ def test_shadow_stack_overflow_traps(self) -> None:
            trap class (e.g. heap exhaustion at a different scale).
         2. Behavioural — `_run_trap` confirms the program actually
            traps at the chosen 2000-iteration depth.
+
+        Iteration count calibration: the 16K shadow stack holds
+        ~4096 pointer slots (4 bytes each).  Each `overflow` frame
+        pushes 2 `@Array<Bool>` params (8 bytes) + 1 array_append
+        tmp root (4 bytes) = 12 bytes/frame, so the guard trips at
+        ~1,365 frames.  2000 chosen with ~1.5× safety margin so
+        the trap fires reliably even if per-frame size shrinks by
+        a slot in a future optimisation.
         """
         src = """
 private fn overflow(@Array<Bool>, @Array<Bool>, @Int -> @Array<Bool>)
diff --git a/tests/test_stress.py b/tests/test_stress.py
@@ -244,12 +244,15 @@ def test_tco_with_allocation_1m_iterations(
     The high-volume companion to
     `test_deep_tail_recursion_with_allocating_arg`.  A single
     million-iteration run gives the GC-aware TCO patch a sharp
-    pin: any shadow-stack leak per iteration would either trap on
-    the overflow guard (16K shadow stack / ~12 bytes per leaked
-    frame ≈ 1300 iterations) or — under eager-GC — accumulate
+    pin: any shadow-stack leak per iteration would trap on the
+    overflow guard (16K shadow stack / 4 bytes per leaked pointer
+    root ≈ 4,096 iterations) or — under eager-GC — accumulate
     enough roots to slow mark/sweep to a crawl.  Completing 1M
     iterations in seconds in both modes proves the shadow stack
-    is bounded across the entire run.
+    is bounded across the entire run.  (The per-iteration push
+    is a single i32 pointer for the `Array<Int>` literal; the
+    two `@Int` params are not pointer-typed and don't push to
+    the shadow stack.)
 
     The pre-#549 fallback (revert `return_call` → plain `call`
     for allocating fns) cannot reach this depth — 1M plain calls
@@ -260,8 +263,7 @@ def test_tco_with_allocation_1m_iterations(
     Iteration count is 1M in both default-GC and eager-GC modes
     because the per-iteration cost is dominated by WASM execution,
     not GC traversal (the array literal is small and short-lived,
-    so each eager-GC collection is O(1) live roots).  Wall-clock
-    on a 2026 MacBook Pro: ~200ms in both modes.
+    so each eager-GC collection is O(1) live roots).
     """
     src = """
 private fn loop(@Int, @Int -> @Int)
diff --git a/vera/codegen/functions.py b/vera/codegen/functions.py