Conway's Life at 12x30 still corrupts strings from gen 1 onwards (additional trigger beyond #588) #593

@aallan

Summary

After #588 fixed the captured-Array<T>-indexing-in-closure bug, the BUG_REPORT.md repro_min.vera and repro_nested.vera minimum reproducers run cleanly. However, the full Conway's Game of Life implementation (12×30 grid, 200 generations, recursive run_loop with IO.print and IO.sleep between frames) still produces silent string corruption from generation 1 onwards.

The bug report itself acknowledged this gap:

However, scaling this up to a real Conway's Game of Life implementation (12×30 grid, eight neighbour reads through a wrap-aware cell_at, array_mapi-of-array_mapi for the next-generation step, recursive run_loop with IO.print and IO.sleep between frames) still produces the trap or — more often — silent corruption where render_cell is bypassed and raw Bool bytes (/) leak into IO.print, which then chokes with UnicodeDecodeError. The minimum repros do not reproduce that variant. Either there is an additional trigger I haven't isolated or there are multiple closely-related codegen bugs.

With #588 fixed, the UnicodeDecodeError no longer appears: the v0.0.136 errors="replace" defensive layer (see #589) surfaces invalid bytes as U+FFFD characters in the output rather than crashing Python. The underlying memory-corruption trigger remains.
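For readers unfamiliar with the defensive layer, here is a minimal Python sketch of the difference between strict and errors="replace" decoding. The byte string is invented for illustration (a full block, one invalid byte, a full block) and was not captured from the Life program; the helper names are hypothetical.

```python
# Illustrative bytes only: U+2588 (full block), one invalid byte, U+2588.
# 0xFF never occurs in well-formed UTF-8, so it stands in for a corrupted byte.
raw = b"\xe2\x96\x88\xff\xe2\x96\x88"


def decode_strict(data: bytes) -> str:
    """Strict decoding: raises UnicodeDecodeError on the first invalid byte."""
    return data.decode("utf-8")


def decode_replace(data: bytes) -> str:
    """Lenient decoding: each invalid byte becomes U+FFFD instead of raising."""
    return data.decode("utf-8", errors="replace")


try:
    decode_strict(raw)
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

print(strict_failed)                                # True
print(decode_replace(raw) == "\u2588\ufffd\u2588")  # True
```

The lenient path hides the crash but not the damage: every U+FFFD in the output still marks a byte that should not have been there.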

Current symptom (post-#588 fix)

Running vera run life_full_program.vera with the BUG_REPORT.md attachment:

  • Generation 0 renders cleanly (initial grid is correct).
  • Generation 1+ shows widespread U+FFFD characters interleaved with valid block characters.
  • ANSI escape sequences (\u{1B}[2J\u{1B}[H) appear partially stripped — the escape character (0x1B) is replaced/lost while the rest ([2J[H) leaks into stdout as visible text.
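The two symptoms above can be counted mechanically in captured output. The following is a minimal sketch in Python, not part of the Vera toolchain; `scan_frame` and the regular expression are hypothetical helpers, and they assume the only escape sequences the program emits are the clear-screen and cursor-home pair quoted above.

```python
import re

REPLACEMENT = "\ufffd"

# "[2J" or "[H" that is NOT immediately preceded by ESC (0x1B):
# the visible remains of an escape sequence whose first byte was lost.
ORPHAN_CSI = re.compile(r"(?<!\x1b)\[(?:2J|H)")


def scan_frame(text: str) -> dict:
    """Count replacement characters and orphaned escape tails in one frame."""
    return {
        "replacement_chars": text.count(REPLACEMENT),
        "orphan_sequences": len(ORPHAN_CSI.findall(text)),
    }


clean = "\x1b[2J\x1b[H\u2588\u2588 \u2588"
damaged = "[2J[H\u2588\ufffd\u2588"

print(scan_frame(clean))
print(scan_frame(damaged))
```

A frame is clean only when both counts are zero; the second sample reports one replacement character and two orphaned tails.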

Bisection so far

I built a series of progressively-larger Life subsets and confirmed each works post-#588:

  1. step alone (one array_mapi-of-array_mapi next-grid step over a 3×3 grid, no rendering, no recursion) — works.
  2. render_grid (Array<Array> → String, no captures of grids) — works.
  3. step + render_grid + one print step at 3×3 — works.
  4. step + render_grid + one print step at 12×30 with full Life infrastructure (cell_at, count_neighbors, next_cell, make_initial) — works.
  5. Mini Life with recursive run_loop at 3×3 over 2 generations — works.
  6. Full Life at 12×30 with recursive run_loop over 200 generations — fails from generation 1 onwards.

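The bisection narrows the trigger to the combination of two factors: the 12×30 grid scale, and a recursive run_loop whose next_grid argument allocates across the recursive call boundary. Neither factor alone is sufficient: step 4 runs at 12×30 without recursion, and step 5 recurses at 3×3. Smaller scales (~5×5) or non-recursive sweeps work cleanly.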
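To make the failing shape concrete, here is a Python model of the program structure described in this issue. It is not the Vera source: the function names follow the ones mentioned above, but the glider seed in `make_initial`, the render characters, and the `emit` callback are assumptions made for illustration. It can also serve as a reference for what a clean frame should contain.

```python
Grid = list[list[bool]]

GLIDER = [(0, 1), (1, 2), (2, 0), (2, 1), (2, 2)]  # assumed seed, not from the issue


def make_initial(rows: int, cols: int) -> Grid:
    grid = [[False] * cols for _ in range(rows)]
    for r, c in GLIDER:
        grid[r][c] = True
    return grid


def cell_at(grid: Grid, r: int, c: int) -> bool:
    """Wrap-aware read: indices wrap around both edges (torus)."""
    return grid[r % len(grid)][c % len(grid[0])]


def count_neighbors(grid: Grid, r: int, c: int) -> int:
    """Eight neighbour reads per cell."""
    return sum(
        cell_at(grid, r + dr, c + dc)
        for dr in (-1, 0, 1)
        for dc in (-1, 0, 1)
        if (dr, dc) != (0, 0)
    )


def next_cell(alive: bool, neighbors: int) -> bool:
    return neighbors == 3 or (alive and neighbors == 2)


def next_grid(grid: Grid) -> Grid:
    """Nested indexed map; the inner closure reads the captured grid."""
    return [
        [next_cell(cell, count_neighbors(grid, r, c)) for c, cell in enumerate(row)]
        for r, row in enumerate(grid)
    ]


def render_grid(grid: Grid) -> str:
    return "\n".join("".join("\u2588" if c else " " for c in row) for row in grid)


def run_loop(grid: Grid, remaining: int, emit) -> None:
    """Emit one frame, then recurse with a freshly allocated grid as the argument."""
    emit("\x1b[2J\x1b[H" + render_grid(grid))
    if remaining > 0:
        run_loop(next_grid(grid), remaining - 1, emit)


if __name__ == "__main__":
    frames: list[str] = []
    run_loop(make_initial(12, 30), 200, frames.append)
    print(len(frames), "frames;", frames[-1].count("\u2588"), "live cells in the last one")
```

In this model every frame contains only the escape prefix, full blocks, spaces, and newlines, which is the property the failing program loses from generation 1.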

Plausible hypothesis

The corruption pattern (full-block characters intact + U+FFFD between them + missing escape bytes) suggests per-byte string corruption inside the rendered output buffer rather than wholesale heap reuse. Plausible mechanisms:

  1. GC reclamation of in-flight strings during string_join / string_concat while the destination buffer is still being filled. The rendered grid is built up via nested array_map over Strings; intermediate Array<String> values held only on the WASM operand stack across a string_join call could be swept if the operand stack isn't fully shadow-stack-rooted.

  2. Shadow-stack overflow at scale: 12×30 = 360 inner-closure invocations × 8 neighbour calls × 5 or so contract checks per generation. If shadow-stack pushes for in-flight String results aren't paired with pops, the stack could overflow and wrap, corrupting later rooting.

  3. Closure-env corruption across an allocating recursive call: run_loop(next_grid(@Array<Array<Bool>>.0), …) allocates a fresh grid as the first arg; if the closures inside next_grid capture pointers that get invalidated by the very allocation that's producing the new grid, captures could see freed memory.

Hypothesis 1 (GC during string_join) is the most likely candidate based on the corruption shape (per-byte rather than per-pointer).
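The scale argument in hypothesis 2 can be put in numbers. This is a back-of-the-envelope sketch in Python; the factor of 5 is the approximate figure quoted in hypothesis 2, and `per_generation` is a hypothetical helper, not a measurement of the generated code.

```python
NEIGHBOURS = 8
PER_READ_FACTOR = 5  # the "5 or so" from hypothesis 2; approximate


def per_generation(rows: int, cols: int) -> int:
    """Rough count of per-generation events under the factors in hypothesis 2."""
    return rows * cols * NEIGHBOURS * PER_READ_FACTOR


full = per_generation(12, 30)  # 360 cells * 8 * 5
mini = per_generation(3, 3)    # 9 cells * 8 * 5

print(full)          # 14400
print(mini)          # 360
print(full // mini)  # 40
print(full * 200)    # 2880000 over 200 generations
```

On these assumptions the full program does about 40 times more work per generation than the 3×3 cases that pass, so any per-event leak has far more room to accumulate before the first recursive call returns.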

Reproducer

life_full_program.vera from the original BUG_REPORT.md is attached. The bisection scripts lived in /tmp during the #588 investigation and were not preserved. To reproduce:

vera run life_full_program.vera   # interrupt after a few seconds
# Generation 0 renders cleanly; from Generation 1+, output is corrupt.

Acceptance

  • Full Life program runs to completion across 200 generations with a clean rendered grid at every step.
  • A new conformance test that reliably reproduces the residual trigger at the smallest possible scale.
  • Any GC-rooting / shadow-stack / closure-capture invariant that needed strengthening is documented in the codegen comments.
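One way the first two acceptance points might be checked from the host side is sketched below in Python. Only the `vera run life_full_program.vera` command comes from this issue; the `capture` and `first_corrupt_frame` helpers, the timeout, and the assumption that every frame starts with the clear-screen and cursor-home pair are hypothetical.

```python
import re
import subprocess

CLEAR = "\x1b[2J\x1b[H"

# Tolerant frame delimiter: still matches when ESC was dropped or replaced by U+FFFD,
# so a damaged boundary does not merge two frames into one.
DELIM = re.compile(r"([\x1b\ufffd]?\[2J[\x1b\ufffd]?\[H)")


def capture(cmd: list[str], timeout_s: float) -> str:
    """Run the program and return whatever stdout it produced before the timeout."""
    try:
        raw = subprocess.run(cmd, capture_output=True, timeout=timeout_s).stdout
    except subprocess.TimeoutExpired as exc:
        raw = exc.stdout or b""
    return raw.decode("utf-8", errors="replace")


def first_corrupt_frame(output: str):
    """Return the index of the first damaged frame, or None if all frames are clean."""
    parts = DELIM.split(output)  # [preamble, delim0, body0, delim1, body1, ...]
    for i in range(1, len(parts) - 1, 2):
        delim, body = parts[i], parts[i + 1]
        if delim != CLEAR or "\ufffd" in body:
            return i // 2
    return None


if __name__ == "__main__":
    out = capture(["vera", "run", "life_full_program.vera"], timeout_s=10.0)
    print("first corrupt frame:", first_corrupt_frame(out))
```

If one frame is printed per generation, the returned index is the generation at which corruption first appears; the issue as reported would give 1, and a fixed build would give None.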

Related

  • #588 — captured-Array<T>-indexing-in-closure (closed in v0.0.137).
  • #589 — host-runtime UTF-8 hygiene (closed in v0.0.136). Provides the errors="replace" defensive layer that prevents the residual corruption from escaping as a Python traceback.
  • #570 — iterative-builder shadow-stack overflow (closed in v0.0.133). Different shape but adjacent to hypothesis 2.
  • BUG_REPORT.md (attached to #588, "Indexing a captured Array<T> inside a closure body produces invalid WASM") explicitly acknowledged this as an unisolated additional trigger.
