Summary
Build a stress-test harness (tests/test_stress.py or similar) that exercises Vera programs at scale to surface scale-dependent bugs before users do. Discovered the need while diagnosing #593 (Conway's Life corruption from gen 1+ at 12×30): the standard test suite has 3,747 small focused tests but no programs that exercise the runtime at the scale where #593 and #515-class GC bugs surface. We rely on user-reported bugs to find scale issues, which is the slowest way to find them.
Why now
The bug-killing campaign closed twelve runtime/codegen bugs from v0.0.119–v0.0.137. Several of those were scale-dependent:
- #515 (GC self-fault under sustained allocation): surfaced by a 40×20 Game of Life running 200 generations
- #570 (iterative-builder shadow-stack overflow at ~4000 elements): surfaced by an array_map over a 5000-element array
- #487 / #348 (worklist + multi-page grow): surfaced by allocation pressure
- #593 (still open): Life corruption from gen 1+ at 12×30; smaller variants pass
Every one of these required a real-world program to surface. None of them would have been caught by the existing test suite, which favours small, focused, fast-running programs.
Proposed scope
A new test module tests/test_stress.py (or a pytest marker like @pytest.mark.stress) running synthetic programs at scale. Each test:
- Compiles + runs a Vera program designed to hit a specific scale axis
- Asserts on observable correctness (final result, no traps, no memory corruption)
- Has a budget (wall-clock, memory) so the suite runs in bounded time
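A minimal sketch of the wall-clock half of the budget, assuming each test wraps its compile-and-run step in a context manager. The names `wall_clock_budget` and `BudgetExceeded` are hypothetical, and the memory budget is not covered here. The check runs after the body finishes, so it reports an overrun but does not interrupt a hung program.

```python
import time
from contextlib import contextmanager
from typing import Iterator


class BudgetExceeded(AssertionError):
    """Raised when a stress test overruns its wall-clock budget (hypothetical name)."""


@contextmanager
def wall_clock_budget(seconds: float, label: str) -> Iterator[None]:
    """Fail the enclosing test if the body takes longer than `seconds`.

    Elapsed time is checked after the body returns; a hung program
    would still need a subprocess-level timeout to be killed.
    """
    start = time.perf_counter()
    yield
    elapsed = time.perf_counter() - start
    if elapsed > seconds:
        raise BudgetExceeded(
            f"{label}: took {elapsed:.2f}s, budget is {seconds:.2f}s"
        )


# Usage: the sum() call is a stand-in for compiling and running a Vera program.
with wall_clock_budget(30.0, "array_map over 10k Int array"):
    result = sum(range(10_000))
assert result == 49_995_000
```

Subclassing AssertionError keeps an overrun looking like an ordinary test failure in pytest output, with the label and both durations in the message.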
Initial test programs
| Program | Scale axis | Targets |
| --- | --- | --- |
| array_map over a 10,000-element Array<Int> | iteration count | shadow-stack overflow class (#570) |
| array_map over a 5,000-element Array<Array<Bool>> | nested allocation | per-iteration root accumulation |
| 1,000-deep tail recursion with allocating arg | call-stack + GC | TCO interaction (#549) |
| Conway's Life 20×20 × 100 generations (synthetic regression test, not the bug-report program) | mixed: deep recursion + array_mapi-of-array_mapi + render | #593, #595 territory |
| 100,000-iteration array_fold over Map mutations | Map host-store + GC | #573 / #575 / #576 reclamation pressure |
| 10,000 cross-module fn calls with String args | String allocation pressure | #573 wrap-table compaction |
| Long-running State<T> handler (1,000 ops) | effect handler scaffolding | handler / resume interaction |
| 10,000 IO.prints with tee_stdout=True | host-import call-rate | host_print perf and capture buffer growth |
Each test is self-contained with its own assertion. Failures should be diagnosable from the test name + assertion message alone — no need to read the test code.
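For the Life row, one way to get an expected final state and a diagnosable message is a small host-side reference implementation plus a first-mismatch helper. This is a sketch under assumptions: the helper names are hypothetical, cells outside the grid are treated as dead (the Vera program may use different edge rules), and both grids are assumed to have the same shape.

```python
from typing import List, Optional, Tuple

Grid = List[List[bool]]


def life_step(grid: Grid) -> Grid:
    """One Conway's Life generation on a bounded grid (outside cells are dead)."""
    rows, cols = len(grid), len(grid[0])

    def live_neighbours(r: int, c: int) -> int:
        # Count live cells in the 3x3 window around (r, c), clipped to the grid.
        return sum(
            grid[rr][cc]
            for rr in range(max(r - 1, 0), min(r + 2, rows))
            for cc in range(max(c - 1, 0), min(c + 2, cols))
            if (rr, cc) != (r, c)
        )

    return [
        [
            live_neighbours(r, c) == 3 or (grid[r][c] and live_neighbours(r, c) == 2)
            for c in range(cols)
        ]
        for r in range(rows)
    ]


def life_run(grid: Grid, generations: int) -> Grid:
    """Apply life_step `generations` times and return the final grid."""
    for _ in range(generations):
        grid = life_step(grid)
    return grid


def first_mismatch(expected: Grid, actual: Grid) -> Optional[Tuple[int, int]]:
    """Return (row, col) of the first differing cell, or None if the grids agree."""
    for r, (exp_row, act_row) in enumerate(zip(expected, actual)):
        for c, (exp_cell, act_cell) in enumerate(zip(exp_row, act_row)):
            if exp_cell != act_cell:
                return (r, c)
    return None
```

A test could then assert `first_mismatch(expected, actual) is None` with a message such as `f"Life 20×20 gen 100: first wrong cell at {pos}"`, so a failure names the test, the generation, and the cell without anyone opening the test body.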
Configuration
- Default: stress tests don't run on every PR (too slow). Marked with @pytest.mark.stress.
- Pre-commit: skipped (default).
- CI: runs nightly via a separate workflow file, OR on PRs that touch vera/codegen/ / vera/wasm/ (as a paths: filter in CI).
- Local invocation: pytest tests/test_stress.py -v or pytest -m stress.
- Budget: full suite under 5 minutes wall-clock on a normal CI runner.
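The skip-by-default behaviour could be wired up in a conftest.py along these lines. This is a sketch, not a settled design: `stress_requested` is a hypothetical helper, and the rule "run stress tests only when the invocation mentions them" is an assumption chosen to match the two local invocations listed above. The hooks and the `markexpr` option are standard pytest.

```python
from typing import Sequence


def stress_requested(markexpr: str, args: Sequence[str]) -> bool:
    """True when the invocation explicitly asks for stress tests.

    Covers `pytest -m stress` and `pytest tests/test_stress.py`.
    `-m "not stress"` also matches, which is harmless because pytest
    deselects the stress tests itself in that case.
    """
    return "stress" in markexpr or any("test_stress" in arg for arg in args)


def pytest_configure(config) -> None:
    # Register the marker so `--strict-markers` does not reject it.
    config.addinivalue_line(
        "markers", "stress: scale-dependent tests, skipped unless requested"
    )


def pytest_collection_modifyitems(config, items) -> None:
    # Local import keeps stress_requested() importable without pytest.
    import pytest

    if stress_requested(config.getoption("markexpr", default=""), config.args):
        return
    skip = pytest.mark.skip(reason="stress test; run with `pytest -m stress`")
    for item in items:
        if item.get_closest_marker("stress") is not None:
            item.add_marker(skip)
```

With this in place a plain `pytest` run reports the stress tests as skipped rather than silently omitting them, which keeps them visible in PR output.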
Coverage hooks
Each test should exercise a SPECIFIC code path that's known to be scale-dependent. Tests should be DOCUMENTED with the issue / class they're guarding against:
```python
import pytest


@pytest.mark.stress
def test_array_map_over_10k_int_array() -> None:
    """Pre-#570 this would shadow-stack-overflow at ~4000 elements.
    Test pins the iterative-builder fix and acts as an early-warning
    for any future regression in shadow-stack hygiene under map.
    """
```
Why this matters
The recent stabilisation discussion identified two unknowns: #593's root cause hasn't been isolated, and we have no harness that would catch the next #593-class bug before it reaches users. This issue closes the second gap. It also gives us a concrete reproducer when narrowing #593.
Out of scope
- Performance benchmarking (different goal, different harness — VeraBench territory)
- Fuzz testing (different methodology; could be a follow-up using these programs as seed corpus)
- Browser-runtime stress tests (the current proposal is wasmtime-only; browser parity stress would be a separate harness)
Related
- #593 — Life corruption at 12×30 scale (would-have-been-caught)
- #595 — malloc abort in wasmtime trampoline (would-have-been-caught)
- #515 — GC self-fault under sustained allocation (was-caught-by-real-program-only)
- #570 — iterative-builder shadow-stack overflow (was-caught-by-real-program-only)
Acceptance
- tests/test_stress.py (or equivalent) lands with the eight initial test programs above.
- All tests pass on main post-merge (any that fail on current main should be filed as bugs first and fixed before adding the test, OR added with @pytest.mark.xfail and a tracking issue).
- CI workflow runs the stress suite nightly or on vera/codegen/ / vera/wasm/ PRs.
- Documented in TESTING.md under a new "Stress tests" subsection.