Stress-test harness: surface scale-dependent codegen/runtime bugs before users do #596

@aallan

Summary

Build a stress-test harness (tests/test_stress.py or similar) that exercises Vera programs at scale to surface scale-dependent bugs before users do. The need became clear while diagnosing #593 (Conway's Life corruption from gen 1+ at 12×30): the standard test suite has 3,747 small, focused tests but no programs that exercise the runtime at the scale where #593 and #515-class GC bugs surface. Today we rely on user reports to find scale issues, which is the slowest way to find them.

Why now

The bug-killing campaign closed twelve runtime/codegen bugs between v0.0.119 and v0.0.137. Several of those were scale-dependent:

  • #515 (GC self-fault under sustained allocation) — surfaced by a 40×20 Game of Life running 200 generations
  • #570 (iterative-builder shadow-stack overflow at ~4000 elements) — surfaced by an array_map over a 5000-element array
  • #487 / #348 (worklist + multi-page grow) — surfaced by allocation pressure
  • #593 (still open) — Life corruption from gen 1+ at 12×30; smaller variants pass

Every one of these required a real-world program to surface. None of them would have been caught by the existing test suite, which favours small, focused, fast-running programs.

Proposed scope

A new test module tests/test_stress.py (or a pytest marker like @pytest.mark.stress) running synthetic programs at scale. Each test:

  1. Compiles + runs a Vera program designed to hit a specific scale axis
  2. Asserts on observable correctness (final result, no traps, no memory corruption)
  3. Has a budget (wall-clock, memory) so the suite runs in bounded time
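The three steps above could be wrapped in one small helper. This is a minimal sketch, not the project's API: `run_with_budget`, `check`, and `StressResult` are hypothetical names, and the command that compiles and runs a Vera program is left to the caller because this issue does not specify it. Only the wall-clock budget is enforced here; a memory budget would need a separate mechanism.

```python
import subprocess
import sys
import time
from dataclasses import dataclass


@dataclass
class StressResult:
    stdout: str
    returncode: int
    elapsed_s: float


def run_with_budget(cmd: list[str], budget_s: float) -> StressResult:
    """Run cmd, failing the test if it exceeds the wall-clock budget."""
    start = time.monotonic()
    try:
        proc = subprocess.run(
            cmd, capture_output=True, text=True, timeout=budget_s
        )
    except subprocess.TimeoutExpired as exc:
        raise AssertionError(
            f"exceeded {budget_s}s wall-clock budget: {cmd}"
        ) from exc
    return StressResult(proc.stdout, proc.returncode, time.monotonic() - start)


def check(result: StressResult, expected_stdout: str) -> None:
    """Assert on observable correctness: clean exit and exact output."""
    assert result.returncode == 0, (
        f"non-zero exit {result.returncode} (possible trap)"
    )
    assert result.stdout == expected_stdout, (
        f"expected {expected_stdout!r}, got {result.stdout!r}"
    )


if __name__ == "__main__":
    # Stand-in command; a real test would invoke the Vera toolchain here.
    result = run_with_budget([sys.executable, "-c", "print(42)"], budget_s=30.0)
    check(result, "42\n")
```

Running the program in a subprocess is one possible design choice: it keeps a crash or abort in the runtime from taking the test process down with it.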

Initial test programs

| Program | Scale axis | Targets |
| --- | --- | --- |
| array_map over a 10,000-element Array<Int> | iteration count | shadow-stack overflow class (#570) |
| array_map over a 5,000-element Array<Array<Bool>> | nested allocation | per-iteration root accumulation |
| 1,000-deep tail recursion with allocating arg | call-stack + GC | TCO interaction (#549) |
| Conway's Life 20×20 × 100 generations (synthetic regression test, not the bug-report program) | mixed: deep recursion + array_mapi-of-array_mapi + render | #593, #595 territory |
| 100,000-iteration array_fold over Map mutations | Map host-store + GC | #573 / #575 / #576 reclamation pressure |
| 10,000 cross-module fn calls with String args | String allocation pressure | #573 wrap-table compaction |
| Long-running State<T> handler (1,000 ops) | effect handler scaffolding | handler / resume interaction |
| 10,000 IO.prints with tee_stdout=True | host-import call-rate | host_print perf and capture buffer growth |

Each test is self-contained with its own assertion. Failures should be diagnosable from the test name + assertion message alone — no need to read the test code.
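One way to keep failures diagnosable from the message alone is to build every assertion message from the same fields: program, scale, guarded issues, expected and actual values. The helper below is a hypothetical sketch; its name and message layout are assumptions, not an existing utility in the repository.

```python
def stress_failure_message(
    program: str,
    scale: str,
    guards: list[int],
    expected: object,
    actual: object,
) -> str:
    """Format an assertion message that names the program, scale and issues."""
    issues = ", ".join(f"#{n}" for n in guards)
    return (
        f"{program} at scale {scale} (guards {issues}): "
        f"expected {expected!r}, got {actual!r}"
    )


# Illustrative values only; a real test would pass the observed result.
message = stress_failure_message(
    "array_map over Array<Int>", "10,000 elements", [570], 10000, 3999
)
```

A test would then write `assert actual == expected, stress_failure_message(...)`, so the CI log carries the scale and the issue class without anyone opening the test file.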

Configuration

  • Default: stress tests don't run on every PR (too slow). Marked with @pytest.mark.stress.
  • Pre-commit: skipped (default).
  • CI: runs nightly via a separate workflow file, or on PRs that touch vera/codegen/ or vera/wasm/ (via a paths: filter in CI).
  • Local invocation: pytest tests/test_stress.py -v or pytest -m stress.
  • Budget: full suite under 5 minutes wall-clock on a normal CI runner.
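The opt-in behaviour could be implemented in conftest.py along these lines. This is a sketch under assumptions: `stress_requested` is a hypothetical helper, and treating any command-line argument containing "test_stress" as an explicit request is one guess at how the two local invocations above might both be honoured. The pytest hooks and calls used (`pytest_configure`, `pytest_collection_modifyitems`, `config.addinivalue_line`, `config.getoption`, `item.add_marker`) are standard pytest API.

```python
STRESS_MARKER = "stress"


def stress_requested(markexpr: str, args: list[str]) -> bool:
    """True if the user asked for stress tests via -m or an explicit path."""
    if STRESS_MARKER in markexpr:
        # Also true for -m "not stress"; pytest then deselects them itself.
        return True
    return any("test_stress" in str(arg) for arg in args)


def pytest_configure(config) -> None:
    # Register the marker so --strict-markers does not reject it.
    config.addinivalue_line(
        "markers", "stress: slow scale tests, opt-in via -m stress"
    )


def pytest_collection_modifyitems(config, items) -> None:
    import pytest  # imported here so the helper above stays importable alone

    markexpr = config.getoption("markexpr", default="")
    if stress_requested(markexpr, config.args):
        return
    skip = pytest.mark.skip(reason="stress test; run with -m stress")
    for item in items:
        if STRESS_MARKER in item.keywords:
            item.add_marker(skip)
```

With this in place a plain `pytest` run reports the stress tests as skipped, while `pytest -m stress` and `pytest tests/test_stress.py -v` run them.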

Coverage hooks

Each test should exercise a SPECIFIC code path that's known to be scale-dependent. Tests should be DOCUMENTED with the issue / class they're guarding against:

import pytest


@pytest.mark.stress
def test_array_map_over_10k_int_array() -> None:
    """Pre-#570 this would shadow-stack-overflow at ~4000 elements.
    Test pins the iterative-builder fix and acts as an early-warning
    for any future regression in shadow-stack hygiene under map.
    """

Acceptance

Why this matters

The recent stabilisation discussion identified two unknowns: #593's root cause hasn't been isolated, and we have no harness that would catch the next #593-class bug before it reaches users. This issue closes the second gap. It also gives us a concrete reproducer when narrowing #593.

Out of scope

  • Performance benchmarking (different goal, different harness — VeraBench territory)
  • Fuzz testing (different methodology; could be a follow-up using these programs as seed corpus)
  • Browser-runtime stress tests (the current proposal is wasmtime-only; browser parity stress would be a separate harness)

Related

  • #593 — Life corruption at 12×30 scale (would-have-been-caught)
  • #595 — malloc abort in wasmtime trampoline (would-have-been-caught)
  • #515 — GC self-fault under sustained allocation (was-caught-by-real-program-only)
  • #570 — iterative-builder shadow-stack overflow (was-caught-by-real-program-only)

Labels

enhancement