Stress-test harness: surface scale-dependent codegen/runtime bugs before users do #596

## Summary

Build a stress-test harness (`tests/test_stress.py` or similar) that exercises Vera programs at scale to surface scale-dependent bugs before users do. The need became clear while diagnosing [#593](https://github.com/aallan/vera/issues/593) (Conway's Life corruption from gen 1+ at 12×30): the standard test suite has 3,747 small, focused tests but no programs that exercise the runtime at the scale where [#593](https://github.com/aallan/vera/issues/593) and [#515](https://github.com/aallan/vera/issues/515)-class GC bugs surface. Today we rely on user reports to find scale issues, which is the slowest way to find them.

## Why now

The bug-killing campaign closed twelve runtime/codegen bugs across v0.0.119–v0.0.137. Several of those were scale-dependent:

- [#515](https://github.com/aallan/vera/issues/515) (GC self-fault under sustained allocation) — surfaced by a 40×20 Game of Life running 200 generations
- [#570](https://github.com/aallan/vera/issues/570) (iterative-builder shadow-stack overflow at ~4000 elements) — surfaced by an `array_map` over a 5000-element array
- [#487](https://github.com/aallan/vera/issues/487) / [#348](https://github.com/aallan/vera/issues/348) (worklist + multi-page grow) — surfaced by allocation pressure
- [#593](https://github.com/aallan/vera/issues/593) (still open) — Life corruption from gen 1+ at 12×30; smaller variants pass

Every one of these required a real-world program to surface. None of them would have been caught by the existing test suite, which favours small, focused, fast-running programs.

## Proposed scope

A new test module `tests/test_stress.py` (or a `pytest` marker like `@pytest.mark.stress`) running synthetic programs at scale. Each test:

1. **Compiles + runs a Vera program designed to hit a specific scale axis**
2. **Asserts on observable correctness** (final result, no traps, no memory corruption)
3. **Has a budget** (wall-clock, memory) so the suite runs in bounded time
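
The three properties above can be sketched as a small runner. This is a minimal sketch under stated assumptions, not the project's harness: `RunResult`, `run_with_budget` and `assert_clean_run` are hypothetical names, and the command is a stand-in because the Vera compile-and-run invocation is not specified here. It enforces only the wall-clock half of the budget; a memory cap would need a platform-specific mechanism.

```python
import subprocess
import sys
import time
from dataclasses import dataclass


@dataclass(frozen=True)
class RunResult:
    """Outcome of one budgeted run (hypothetical helper type)."""
    returncode: int
    stdout: str
    stderr: str
    elapsed_s: float


def run_with_budget(cmd: list[str], timeout_s: float) -> RunResult:
    """Run cmd, turning a wall-clock overrun into a test failure."""
    start = time.monotonic()
    try:
        proc = subprocess.run(
            cmd, capture_output=True, text=True, timeout=timeout_s
        )
    except subprocess.TimeoutExpired as exc:
        raise AssertionError(
            f"exceeded {timeout_s}s wall-clock budget: {cmd!r}"
        ) from exc
    elapsed = time.monotonic() - start
    return RunResult(proc.returncode, proc.stdout, proc.stderr, elapsed)


def assert_clean_run(result: RunResult, expected_stdout: str) -> None:
    """Assert on observable correctness: clean exit and exact output."""
    assert result.returncode == 0, (
        f"non-zero exit {result.returncode}; stderr: {result.stderr.strip()}"
    )
    assert result.stdout == expected_stdout, (
        f"unexpected output: {result.stdout!r} != {expected_stdout!r}"
    )


# Stand-in command: a real stress test would invoke the Vera toolchain here.
result = run_with_budget(
    [sys.executable, "-c", "print(sum(range(10_000)))"], timeout_s=30.0
)
assert_clean_run(result, "49995000\n")
```

Raising `AssertionError` on timeout keeps a budget overrun visible as an ordinary test failure rather than a hung CI job.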

### Initial test programs

| Program | Scale axis | Targets |
|---|---|---|
| `array_map` over a 10,000-element `Array<Int>` | iteration count | shadow-stack overflow class (#570) |
| `array_map` over a 5,000-element `Array<Array<Bool>>` | nested allocation | per-iteration root accumulation |
| 1,000-deep tail recursion with allocating arg | call-stack + GC | TCO interaction (#549) |
| Conway's Life 20×20 × 100 generations (synthetic regression test, not the bug-report program) | mixed: deep recursion + array_mapi-of-array_mapi + render | #593, #595 territory |
| 100,000-iteration `array_fold` over Map mutations | Map host-store + GC | #573 / #575 / #576 reclamation pressure |
| 10,000 cross-module fn calls with String args | String allocation pressure | #573 wrap-table compaction |
| Long-running `State<T>` handler (1,000 ops) | effect handler scaffolding | handler / resume interaction |
| 10,000 `IO.print`s with `tee_stdout=True` | host-import call-rate | host_print perf and capture buffer growth |

Each test is self-contained with its own assertion. Failures should be diagnosable from the test name + assertion message alone — no need to read the test code.
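
One way to keep failures diagnosable from the log alone is to carry the table row into the assertion message. The sketch below is illustrative only: `StressCase`, `failure_message` and the test ids are hypothetical names, and the detail string is a made-up example, not observed output.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StressCase:
    """One row of the stress table (hypothetical structure)."""
    test_id: str     # becomes part of the test name
    scale_axis: str  # what is being scaled
    guards: str      # issue or bug class the test pins


def failure_message(case: StressCase, detail: str) -> str:
    """Build an assertion message that stands on its own in a CI log."""
    return (
        f"[{case.test_id}] axis={case.scale_axis}; "
        f"guards {case.guards}: {detail}"
    )


CASES = [
    StressCase("array_map_10k_int", "iteration count", "#570"),
    StressCase("array_fold_100k_map", "Map host-store + GC", "#573 / #575 / #576"),
]

# Illustrative detail string only.
print(failure_message(CASES[0], "result length mismatch"))
```

The same `test_id` could double as a parametrised test id, so the test name and the message agree.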

### Configuration

- **Default**: stress tests **don't** run on every PR (too slow). Marked with `@pytest.mark.stress`.
- **Pre-commit**: skipped (default).
- **CI**: runs nightly via a separate workflow file, OR on PRs that touch `vera/codegen/` / `vera/wasm/` (as a `paths:` filter in CI).
- **Local invocation**: `pytest tests/test_stress.py -v` or `pytest -m stress`.
- **Budget**: full suite under 5 minutes wall-clock on a normal CI runner.
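
The default-off behaviour could be wired up in a `conftest.py` along these lines. This is a sketch, assuming the marker is named `stress` and the module is `tests/test_stress.py` as proposed above; `should_run_stress` is a hypothetical helper, and its substring check on the `-m` expression is deliberately crude. The project may prefer a different mechanism.

```python
def should_run_stress(markexpr: str, cli_args: list[str]) -> bool:
    """Decide whether stress-marked tests run in this session.

    They run when the -m expression selects the marker, or when the
    stress module is named explicitly on the command line.
    """
    if "stress" in markexpr and "not stress" not in markexpr:
        return True
    return any("test_stress.py" in arg for arg in cli_args)


def pytest_configure(config):
    # Register the marker so pytest does not warn about an unknown mark.
    config.addinivalue_line(
        "markers", "stress: scale-dependent stress test (skipped by default)"
    )


def pytest_collection_modifyitems(config, items):
    # Imported lazily so the pure helper above has no third-party dependency.
    import pytest

    if should_run_stress(config.getoption("markexpr"), config.args):
        return
    skip = pytest.mark.skip(reason="stress test; run with -m stress")
    for item in items:
        if item.get_closest_marker("stress") is not None:
            item.add_marker(skip)
```

With this in place, a plain `pytest` run reports the stress tests as skipped, while both local invocations listed above still execute them.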

### Coverage hooks

Each test should exercise a **specific** code path that's known to be scale-dependent, and should be **documented** with the issue or bug class it guards against:

```python
import pytest


@pytest.mark.stress
def test_array_map_over_10k_int_array() -> None:
    """Pre-#570 this would shadow-stack-overflow at ~4000 elements.

    Pins the iterative-builder fix and acts as an early warning
    for any future regression in shadow-stack hygiene under map.
    """
```

## Acceptance

- [ ] `tests/test_stress.py` (or equivalent) lands with the eight initial test programs above.
- [ ] All tests pass on `main` post-merge (any that fail on current `main` should be filed as bugs first and fixed before adding the test, OR added with `@pytest.mark.xfail` and a tracking issue).
- [ ] CI integration: nightly workflow + path-filter on `vera/codegen/` / `vera/wasm/` PRs.
- [ ] Documented in `TESTING.md` under a new "Stress tests" subsection.
- [ ] One of the tests reliably reproduces #593 (or #593 is closed before this lands).

## Why this matters

The recent stabilisation discussion identified two unknowns: #593's root cause hasn't been isolated, and we have no harness that would catch the next #593-class bug before it reaches users. This issue closes the second gap. It also gives us a concrete reproducer when narrowing #593.

## Out of scope

- Performance benchmarking (different goal, different harness — VeraBench territory)
- Fuzz testing (different methodology; could be a follow-up using these programs as seed corpus)
- Browser-runtime stress tests (the current proposal is wasmtime-only; browser parity stress would be a separate harness)

## Related

- [#593](https://github.com/aallan/vera/issues/593) — Life corruption at 12×30 scale (would-have-been-caught)
- [#595](https://github.com/aallan/vera/issues/595) — malloc abort in wasmtime trampoline (would-have-been-caught)
- [#515](https://github.com/aallan/vera/issues/515) — GC self-fault under sustained allocation (was-caught-by-real-program-only)
- [#570](https://github.com/aallan/vera/issues/570) — iterative-builder shadow-stack overflow (was-caught-by-real-program-only)

