Skip to content

Fix #596: stress-test harness for scale-dependent regression coverage#669

Merged
aallan merged 7 commits into
mainfrom
fix-596-stress-tests
May 13, 2026
Merged

Fix #596: stress-test harness for scale-dependent regression coverage#669
aallan merged 7 commits into
mainfrom
fix-596-stress-tests

Conversation

@aallan

@aallan aallan commented May 13, 2026

Copy link
Copy Markdown
Owner

Summary

Pre-#596 the project relied on user-reported real-world programs to surface scale-dependent codegen/runtime bugs (#570 / #515 / #593 / #595 / #549 / #487 / #348 / #573 all required a real program to find). This PR adds a dedicated stress-test harness so the next bug of this class is caught before users see it.

  • tests/test_stress.py with 8 scale-dependent regression tests covering: 10K array_map, 5K nested-array array_map, 1K-deep tail recursion with allocating arg, 20×20 nested array-fold-of-array-fold (Conway's Life at 12x30 still corrupts strings from gen 1 onwards (additional trigger beyond #588) #593/macOS malloc abort during wasmtime cleanup after Ctrl-C-in-host-import #595 territory), 100K array_fold, 10K String allocations through interpolation, 1K State<Int> get/put cycles in single handler, 10K IO.print with stdout-capture buffer.
  • stress pytest marker in pyproject.toml with default addopts = "-m 'not stress'" so stress tests skip from the per-PR pytest run.
  • .github/workflows/nightly-stress.yml with three triggers: nightly cron (primary safety net), path-filtered PRs (fail-fast on PRs touching vera/codegen/**/vera/wasm/**/test file), workflow_dispatch (manual).
  • "Stress tests" subsection in TESTING.md documenting the 8 programs, triggers, and assertion-shape convention.
  • scripts/check_doc_counts.py fix — added -o addopts="" to the pytest collection call so the doc-counts hook still sees stress tests in its per-file inventory (without this, the default -m 'not stress' filter deselects them and the script reports a missing row).
  • ROADMAP cleanup — removed Stress-test harness: surface scale-dependent codegen/runtime bugs before users do #596 from Order 1; tiers renumbered.

Why this shape

Each test asserts on a SPECIFIC observable (closed-form sum, exact line count, etc.) rather than just "completed without crashing", so a future regression that silently short-circuits or skips iterations would still fail. Iteration counts are tuned to the smallest scale where each bug class historically manifested with ~2-3x safety margin — the goal is reliable detection of the bug class, not benchmarking. Full suite completes in 0.25s in-process today; the 5-minute CI budget leaves plenty of headroom for the suite to grow.

Test plan

  • pytest tests/test_stress.py -m stress -v → 8 passed in 0.25s
  • pytest tests/ -q → 3,849 passed, 14 skipped, 8 deselected (default addopts correctly filter stress)
  • mypy vera/ → clean
  • python scripts/check_walker_coverage.py → 9 walkers, all 29 Expr subclasses covered
  • python scripts/check_doc_counts.py → consistent (3,871 tests, 32 files, 27 hooks)
  • Doc sweep across SKILL / CLAUDE / AGENTS / README / vera/README / CONTRIBUTING / spec / docs / examples → zero stale "stress tests missing" claims

Closes #596

Summary by CodeRabbit

  • New Features

    • Added an opt-in stress test suite with eight scale-focused scenarios (arrays, deep recursion, allocations, state ops, heavy I/O); skipped by default and runnable via a stress marker.
  • CI

    • Added a nightly/path-filtered/manual stress workflow that runs stress tests and files/comments a tracking issue on scheduled failures.
  • Documentation

    • Expanded testing docs, changelog, roadmap and README with guidance, runtimes and test details.
  • Chores

    • Version bumped to 0.0.152.

Review Change Stack

Pre-#596 the project relied on user-reported real-world programs
to surface scale-dependent codegen/runtime bugs — #570 (iterative-
builder shadow-stack overflow at ~4000 elements), #515 (GC self-
fault under sustained allocation), #593 (Conway's Life corruption
at 12×30+), and others.  The standard test suite has thousands of
small focused tests but none that exercise the runtime at scales
where these bugs first manifested.  This PR closes that gap with
a dedicated stress-test harness.

What ships:

1. **`tests/test_stress.py`** with 8 scale-dependent regression
   tests, each hitting a specific scale axis:

   - 10K `array_map` over `Array<Int>` — shadow-stack overflow
     (#570)
   - 5K nested-array `array_map` over `Array<Array<Bool>>` —
     per-iteration root accumulation (#570/#515)
   - 1K-deep tail recursion with allocating String arg — TCO/GC
     interaction (#549)
   - 20×20 nested array-fold-of-array-fold — #593/#595 regression
     coverage (the structural shape that hit those bugs, not the
     bug-report program itself)
   - 100K `array_fold` over Int array — allocation pressure
     (#487/#348)
   - 10K String allocations through interpolation — wrap-table
     compaction (#573)
   - 1K `State<Int>` get/put cycles in single handler — handler/
     resume plumbing under load
   - 10K `IO.print` calls with stdout-capture — host-import call
     rate, capture buffer growth

   Each test asserts on a SPECIFIC observable (closed-form sum,
   exact line count, etc.), not just "completed without
   crashing", so a future regression that silently short-circuits
   or skips iterations would still fail.

2. **`stress` pytest marker** registered in `pyproject.toml` with
   default `addopts = "-m 'not stress'"` — stress tests are
   skipped from the per-PR pytest run.  Local invocation:
   `pytest -m stress` or `pytest tests/test_stress.py -m stress
   -v`.

3. **`.github/workflows/nightly-stress.yml`** with three triggers:

   - Nightly cron at 06:00 UTC (primary safety net)
   - Path-filtered PRs touching `vera/codegen/**` /
     `vera/wasm/**` / `tests/test_stress.py` / the workflow
     itself (fail-fast on PRs likely to break stress invariants)
   - Manual `workflow_dispatch` (Actions tab)

4. **"Stress tests" subsection in `TESTING.md`** documenting the
   8 test programs with their scale axes, the default-skip
   behaviour, the three CI triggers, and the assertion-shape
   convention.

5. **`scripts/check_doc_counts.py` fix** — added `-o addopts=""`
   to the collection-only call so the script sees every test
   file including `test_stress.py` (without this override, the
   default `-m 'not stress'` filter would deselect stress tests
   and the per-file counter would report `test_stress.py` as a
   missing row).

6. **ROADMAP cleanup** — removed #596 from Order 1 in the
   Stabilisation tier (now closed by this PR).  Tier renumbered
   1-5 (was 1-6).  Agent-integration tier renumbered 6-8 (was
   7-9).

Release prep:

- Version bumped to v0.0.152 across `__init__.py`,
  `pyproject.toml`, `uv.lock`, `docs/index.html`, `README.md`.
- CHANGELOG.md entry under `[0.0.152]` with Added + Documentation
  sections.
- HISTORY.md row added (one-line template-compliant).
- TESTING.md overview totals updated (3,863 → 3,871 tests,
  31 → 32 files).
- ROADMAP.md test count updated to 3,871.

Validation:

- `pytest tests/test_stress.py -m stress -v` → 8 passed in 0.25s
- `pytest tests/ -q` → 3,849 passed, 14 skipped, 8 deselected
  (stress filtered out by default addopts)
- `mypy vera/` → clean
- `python scripts/check_walker_coverage.py` → 9 walkers, all 29
  Expr subclasses covered
- `python scripts/check_doc_counts.py` → consistent (3,871
  tests, 32 files, 27 hooks, 7 CI jobs)
- Doc sweep across SKILL / CLAUDE / AGENTS / README / vera/README
  / CONTRIBUTING / spec / docs / examples → zero stale "stress
  tests missing" claims

Closes #596

Co-Authored-By: Claude <noreply@anthropic.invalid>
@coderabbitai

coderabbitai Bot commented May 13, 2026

Copy link
Copy Markdown

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

Adds a pytest stress marker and an eight-program stress regression suite with a shared _run runner, nightly/path-filtered GitHub Actions workflow, pytest collection tweak for docs, version bump to v0.0.152, and corresponding docs/release/roadmap updates.

Changes

Stress-test harness (v0.0.152)

Layer / File(s) Summary
Pytest configuration and version bump
pyproject.toml, vera/__init__.py
stress pytest marker defined with description; addopts configured to exclude stress tests from normal runs. Project version bumped to 0.0.152.
Test helper and stress regression suite
tests/test_stress.py
Added _run helper to compile/execute embedded Vera sources (optional scoped eager-GC) and eight stress tests: 10K array_map (ints), 5K nested array_map, 1000-deep tail recursion, nested-grid count_alive-style fold (20×20), 100K array_fold, 10K string allocations, 1K State<T> ops, and 10K IO.print calls; tests assert deterministic observables and several are parametrised over default vs eager GC.
CI workflow and test collection adjustment
.github/workflows/nightly-stress.yml, scripts/check_doc_counts.py
Added nightly-stress.yml (daily 06:00 UTC cron, PR path filters for vera/codegen/**, vera/wasm/**, vera/checker/**, tests/test_stress.py, and the workflow file, plus manual dispatch). Workflow installs deps and runs pytest -v -m stress for the stress suite; on scheduled failures it ensures a stress-regression label and comments or creates a tracking issue. check_doc_counts.py clears pytest addopts during collection so docs-counting includes stress tests.
Documentation, release notes, and roadmap updates
CHANGELOG.md, HISTORY.md, README.md, ROADMAP.md, TESTING.md
Added 0.0.152 release notes documenting the stress harness, updated history and README version/metrics, removed #596 from ROADMAP Stabilisation tier and renumbered items, and added a Stress Tests subsection and test table row in TESTING.md.

Sequence Diagram(s)

sequenceDiagram
  participant Test
  participant _run
  participant Parser
  participant Compiler
  participant Runtime
  Test->>_run: provide Vera source
  _run->>Parser: write & parse .vera file
  Parser->>Compiler: transform AST
  Compiler->>Compiler: assert no error diagnostics
  Compiler->>Runtime: execute compiled program
  Runtime->>_run: return exec_result.value
  _run->>Test: return value
  _run->>_run: cleanup temp file
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

  • aallan/vera#488: Adds related scale regression tests (array_map at large sizes) that exercise similar GC/allocation behaviour changed in that PR.

Suggested labels

compiler, tests, ci, docs

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title 'Fix #596: stress-test harness for scale-dependent regression coverage' directly and concisely summarizes the main change: implementing a stress-test harness to cover scale-dependent bugs, closing issue #596.
Linked Issues check ✅ Passed The PR fully implements all coding objectives from #596: eight scale-dependent regression tests in tests/test_stress.py, pytest stress marker with default skip via pyproject.toml, nightly-stress.yml workflow with path/schedule/manual triggers, TESTING.md documentation, and scripts/check_doc_counts.py override for collection.
Out of Scope Changes check ✅ Passed All changes are in-scope: test harness implementation, configuration, CI workflow, documentation updates, and version bumps are all required to deliver #596. ROADMAP.md updates appropriately reflect completion of #596 and reorder remaining tasks.
Docstring Coverage ✅ Passed Docstring coverage is 90.00% which is sufficient. The required threshold is 80.00%.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix-596-stress-tests

Comment @coderabbitai help to get the list of available commands and usage tips.

@codecov

codecov Bot commented May 13, 2026

Copy link
Copy Markdown

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 91.08%. Comparing base (f42691b) to head (8f44cb2).

Additional details and impacted files
@@           Coverage Diff           @@
##             main     #669   +/-   ##
=======================================
  Coverage   91.08%   91.08%           
=======================================
  Files          60       60           
  Lines       23283    23283           
  Branches      259      259           
=======================================
  Hits        21207    21207           
  Misses       2069     2069           
  Partials        7        7           
Flag Coverage Δ
javascript 57.36% <ø> (ø)
python 94.86% <100.00%> (ø)

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

User feedback on #669: the initial Stress Tests subsection had
the summary table + behaviour blurb but not per-test detail.
Expanded the section to cover, for each of the 8 tests:

- The exact program shape (what it allocates, what loop it
  drives, what it folds/maps/iterates over)
- The expected return value and the assertion's *closed-form*
  derivation (e.g. why `test_10k_string_allocations` should
  return 38,890 — broken down by digit-count contribution)
- The specific bug class it guards against, with issue
  references
- Why the chosen iteration count is meaningful relative to
  that bug class's historical failure threshold

Also added an "Adding a stress test" subsection codifying the
conventions:

1. Target a specific scale axis + name the bug class
2. Pick the smallest scale with ~2-3x safety margin (don't
   maximise)
3. Assert on a SPECIFIC observable with closed-form expected
   value (avoid "no exception raised")
4. Use the `_run` helper for tempfile lifecycle
5. `pytestmark = pytest.mark.stress` at module level for
   correct gating

Plus reformatted the local-invocation block as a fenced bash
example so the three useful invocations (full stress, file-
scoped, single-test) are copy-pasteable.

Validation
- `python scripts/check_doc_counts.py` -> consistent (no
  count-affecting numbers changed)

Refs #596 #669

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@tests/test_stress.py`:
- Around line 165-175: The test test_conways_life_20x20_100_generations
currently builds an all-empty grid and only calls count_alive once, so it does
not run 100 generations as claimed; update the test to either (A) implement the
full generation loop by repeatedly applying the function that advances the board
state (locate the code that constructs the 20x20 grid and the step/advance
function used elsewhere in tests) for 100 iterations and then assert the
expected final live-cell count using count_alive, or (B) if you intended a
single-step regression, rename the test to reflect the actual behavior and
adjust the docstring accordingly; ensure you use the fixed initial pattern (not
an all-empty grid) if asserting a known stabilized population.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: a935bc06-d420-4455-b203-a2f2067eb5bf

📥 Commits

Reviewing files that changed from the base of the PR and between f42691b and 0e74bd9.

⛔ Files ignored due to path filters (5)
  • docs/index.html is excluded by !docs/**
  • docs/index.md is excluded by !docs/**
  • docs/llms-full.txt is excluded by !docs/**
  • docs/llms.txt is excluded by !docs/**
  • uv.lock is excluded by !**/*.lock, !uv.lock
📒 Files selected for processing (10)
  • .github/workflows/nightly-stress.yml
  • CHANGELOG.md
  • HISTORY.md
  • README.md
  • ROADMAP.md
  • TESTING.md
  • pyproject.toml
  • scripts/check_doc_counts.py
  • tests/test_stress.py
  • vera/__init__.py

Comment thread tests/test_stress.py Outdated
The #596 issue body explicitly called for a subset of stress
tests to run under `VERA_EAGER_GC=1` to catch GC-rooting
regressions on the very first iteration rather than only at
scale.  I missed that requirement in the initial PR and the
user surfaced it.

Implementation:

- `EAGER_GC_PARAMS = pytest.mark.parametrize(...)` decorator
  with `eager_gc: [False, True]` and IDs `["default_gc",
  "eager_gc"]` exposed at module level for the subset of
  tests targeting GC-rooting bug classes.
- `_run` helper extended with `eager_gc: bool` +
  `monkeypatch: pytest.MonkeyPatch | None` keyword args.  When
  `eager_gc=True`, calls `monkeypatch.setenv("VERA_EAGER_GC",
  "1")` before the `compile()` call.  `VERA_EAGER_GC` is read
  at compile time by `vera/codegen/assembly.py::_emit_alloc`,
  which then emits a `call $gc_collect` as the first
  instruction of the runtime's `$alloc` function — forcing a
  full GC pass on every allocation.
- 6 of the 8 logical tests carry `@EAGER_GC_PARAMS` and the
  paired signature `(eager_gc: bool, monkeypatch: pytest
  .MonkeyPatch)`:
    - `test_array_map_over_10k_int_array` (#570 target)
    - `test_array_map_over_5k_nested_bool_array` (#570/#515
      target)
    - `test_deep_tail_recursion_with_allocating_arg` (#549
      TCO/GC target)
    - `test_conways_life_20x20_100_generations` (#593/#595
      target)
    - `test_10k_string_allocations` (#573 wrap-table target)
    - `test_state_handler_1k_ops` (captured-frame target)
- 2 tests stay un-parametrised — `test_array_fold_100k_
  iterations` (allocation-pressure target, NOT GC-rooting;
  100K forced-GC would inflate suite time) and
  `test_10k_io_print_calls` (host-import target; `host_print`
  bridge doesn't allocate Vera-heap data).
- Total test count: 8 logical × eager lane parametrisation on
  6 of them = 14 test instances.  Wall-clock 0.66s in-process
  for the full suite.

Documentation updates:

- TESTING.md: new "Eager-GC lane" subsection explaining the
  mechanism, the subset rationale, and the parametrize-with-
  monkeypatch implementation.  Per-test descriptions in the
  inventory marked with `**[eager-GC]**` for the parametrised
  6 and explicit "Not in eager-GC lane" notes for the 2
  excluded.  "Adding a stress test" checklist gained a new
  step 6 codifying the opt-in convention.
- CHANGELOG.md v0.0.152: extended with a new bullet describing
  the eager-GC lane mechanism and the subset rationale.
- Test counts auto-updated: 3,871 -> 3,877 (+6 parametrised
  instances), test_stress.py 8 -> 14 tests, 375 -> 469 lines.

Validation:

- `pytest tests/test_stress.py -m stress -v` -> 14 passed in
  0.64s (was 8 passed in 0.25s)
- `python scripts/check_doc_counts.py` -> consistent
- `python scripts/check_walker_coverage.py` -> clean
- `python scripts/check_version_sync.py` -> consistent

Refs #596 #669

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

♻️ Duplicate comments (1)
tests/test_stress.py (1)

232-291: ⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

The Conway stress case still does not execute 100 generations.

Line 233 names a 20x20_100_generations workload, but Lines 276-285 only construct an all-false grid and run a single count_alive pass. This misses the intended regression surface (state evolution across generations) and weakens coverage for the #593/#595 class.

Please either implement the 100-step evolution loop with a concrete final-state assertion, or rename/re-scope this test to match what it actually validates.

As per coding guidelines tests/**/*.py: “These are pytest test files. Review for correctness and coverage gaps.”

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@TESTING.md`:
- Line 389: Update the measured runtime phrase in TESTING.md where it currently
reads "0.25s in-process on a developer laptop today" to reflect the new
parametrised baseline (≈0.64s) or replace it with a versioned/date-stamped
figure; ensure the sentence in the Budget paragraph (the text mentioning
iteration counts and the 5-minute target) now states the updated runtime and/or
includes a version/date tag so the measured value stays accurate going forward.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 8b53febf-685d-40df-9b93-73b812f769dc

📥 Commits

Reviewing files that changed from the base of the PR and between 11d3ee9 and 9c3cc9e.

📒 Files selected for processing (5)
  • CHANGELOG.md
  • README.md
  • ROADMAP.md
  • TESTING.md
  • tests/test_stress.py

Comment thread TESTING.md Outdated
aallan added 2 commits May 13, 2026 14:18
The original #596 spec covered CI integration but didn't cover
reporting — a cron-failure would be visible only to whoever
opened the Actions tab and noticed.  This is fragile for an
early-warning system.

Adds an `Open or update tracking issue on cron failure` step to
`.github/workflows/nightly-stress.yml` that runs only on
`schedule`-triggered failures (skipped on `pull_request` and
`workflow_dispatch` — those have their own reporting surfaces):

1. Ensures the `stress-regression` label exists (idempotent,
   creates on first failure with description and red colour).
2. Searches for an open issue carrying the label.
3. If one exists: appends a comment with the new commit SHA +
   run URL, so the original tracking issue accumulates the
   failure history.
4. If none: files a fresh issue titled "Nightly stress
   regression on main (tracking)" with the SHA, run URL, and
   pointer to `tests/test_stress.py` for the failing test's
   docstring (each test names the bug class it guards
   against — narrows the regression family for the bisector).

Open tracking issues stay open across days until manually
closed.  Auto-closing on success would be wrong while a
maintainer is mid-debug.

Job-level permission `issues: write` added (scoped to the job,
not workflow-wide, so the test-runner step keeps the tighter
`contents: read` default).  Implementation uses
`actions/github-script@v7` which only sees GitHub-controlled
values (`github.sha`, `github.server_url`, `github.repository`,
`github.run_id`) — no user-supplied input in the script body.

Documentation:

- TESTING.md "Stress Tests" section's CI-integration bullets
  updated to describe the auto-filing behaviour + a new
  "Failure reporting (cron only)" subsection describing the
  dedup logic, the issue lifecycle, and the implementation
  details.
- CHANGELOG.md v0.0.152: added a bullet describing the
  failure-reporting wire-up.

Validation:

- YAML parses (`python -c 'import yaml; yaml.safe_load(...)'`)
- 5 steps in the job (was 4); failure-reporting step gated on
  `failure() && github.event_name == 'schedule'`
- Stress suite still 14/14 pass

Refs #596 #669
Three changes batched into one commit (all #669 review-driven):

**1. Widen the workflow's path-filter to include `vera/checker/**`.**

The user's earlier review-question on PR #669 noted a blind
spot in the path-filter: PRs touching `vera/checker/**` could
produce subtly different AST shapes that downstream codegen
mishandles, but the filter only fired on `vera/codegen/**` and
`vera/wasm/**`.  Such a regression would surface only at the
next nightly cron (up to ~24h delay).  Added
`vera/checker/**` to the `pull_request` paths and updated the
workflow's leading comment to explain the rationale.

**2. CR-A: rename `test_conways_life_20x20_100_generations`.**

CodeRabbit flagged that the test name and docstring promised
"100 generations" but the test only built an empty grid and
called `count_alive` once.  Verified — the program loads no
Life-step function and does no iteration.

Renamed to
`test_conways_life_grid_alloc_and_count_alive_20x20` (option B
from the CR suggestion: rename to reflect actual behaviour
rather than implement a full 100-gen Life loop).  Docstring
rewritten to explicitly call out:

- What the test does: build a 20×20 grid + single `count_alive`
  pass
- What it does NOT do: run 100 generations (the original name
  implied that; this rename corrects it)
- Why the structural-shape coverage is still valuable for
  #593 / #595 — those bugs hit the same nested-array-alloc
  + nested-fold + captured-binding-refs code paths

TESTING.md per-test description updated to match the new name
and to add a "**structural-shape** test, not a Life
simulation" clarifier.  Section header comment also updated.

**3. CR-B: refresh the stale "0.25s in-process" figure in
TESTING.md.**

The 0.25s number was measured before the eager-GC lane
parametrisation doubled the relevant subset.  Current wall-
clock is 0.66s (14 instances).  Updated the Budget paragraph
to:

- Use the new measured value (0.66s) with a date stamp
  (2026-05-13) so the figure stays auditable
- Add an explicit "if this drifts more than ~2x, treat as a
  signal" guard rail so future drift gets flagged

Doc counts auto-updated:
- TESTING.md `test_stress.py` lines: 469 → 479 (docstring
  expansion for the renamed test)

Validation:

- `pytest tests/test_stress.py -m stress -v` → 14 passed in
  0.66s (renamed test still passes under both eager/default
  modes)
- `python scripts/check_doc_counts.py` → consistent
- YAML parses (`yaml.safe_load` on the workflow)

Refs #596 #669

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In @.github/workflows/nightly-stress.yml:
- Around line 131-140: The current dedup logic uses
github.rest.issues.listForRepo and takes existing[0] which may be a PR; update
the selection after the listForRepo call to filter out any items with a non-null
pull_request field and to match the intended issue title (use LABEL and the
tracking issue title constant) before choosing issueNumber; i.e., call
github.rest.issues.listForRepo as before, then set a filtered array =
existing.filter(i => !i.pull_request && i.title === EXPECTED_TITLE) and use
filtered[0].number (or create the issue if none) so the workflow comments only
on the real tracking issue, not a PR.

In `@TESTING.md`:
- Around line 354-355: Summary: The stress-test entry
test_conways_life_20x20_100_generations incorrectly references issue `#595` for
the malloc-during-wasmtime-cleanup bug; update the issue ID/description to match
the actual bug referenced elsewhere. Fix: open TESTING.md, locate the
"test_conways_life_20x20_100_generations" entry and replace the incorrect "#595"
(and/or its description) with the correct issue identifier and short description
that corresponds to the malloc/wasmtime cleanup bug (or vice versa if `#595`
actually refers to IO.sleep/KeyboardInterrupt per line 88); ensure the updated
reference matches the mapping used elsewhere in the document so traceability is
consistent. Verify the change by searching for both "#595" and the
malloc/wasmtime cleanup text to confirm one-to-one mapping.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 09ef58ed-a3f7-4e1c-a670-21180bb2f813

📥 Commits

Reviewing files that changed from the base of the PR and between 9c3cc9e and 2f7faa7.

📒 Files selected for processing (3)
  • .github/workflows/nightly-stress.yml
  • CHANGELOG.md
  • TESTING.md

Comment thread .github/workflows/nightly-stress.yml Outdated
Comment thread TESTING.md Outdated

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

♻️ Duplicate comments (2)
.github/workflows/nightly-stress.yml (1)

136-146: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Dedup logic can target the wrong thread.

listForRepo may return pull requests as well as issues, so existing[0] is not guaranteed to be the tracking issue. Filter for non-PR items and match the expected title before choosing issue_number.

Suggested patch
             const { data: existing } = await github.rest.issues.listForRepo({
               owner: context.repo.owner,
               repo: context.repo.repo,
               labels: LABEL,
               state: "open",
-              per_page: 5,
+              per_page: 20,
             });
+            const trackingIssues = existing.filter(
+              (item) => !item.pull_request && item.title === TITLE
+            );
 
-            if (existing.length > 0) {
-              const issueNumber = existing[0].number;
+            if (trackingIssues.length > 0) {
+              const issueNumber = trackingIssues[0].number;
               await github.rest.issues.createComment({
                 owner: context.repo.owner,
                 repo: context.repo.repo,
                 issue_number: issueNumber,
                 body: commentBody,
               });
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.github/workflows/nightly-stress.yml around lines 136 - 146, The current
dedup logic assumes existing[0] is the tracking issue but
github.rest.issues.listForRepo can return PRs; update the selection after
calling github.rest.issues.listForRepo to filter out pull requests (items
without a pull_request property) and to match the tracking issue title (e.g.,
compare item.title to the expected ISSUE_TITLE or other known heading) before
picking the issue number; then use that matched item's .number when calling
github.rest.issues.createComment so you always comment on the correct tracking
issue (keep LABEL and the listForRepo call but replace existing[0] with the
first filtered+matched item).
tests/test_stress.py (1)

236-239: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Issue reference for #595 looks inconsistent.

This block links #595 to the wasmtime-cleanup/malloc bug class, but TESTING.md maps #595 to the IO.sleep/KeyboardInterrupt path. Please align the issue ID/description pair so traceability is unambiguous across docs and tests.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/test_stress.py` around lines 236 - 239, The test docstring in
tests/test_stress.py references issue `#595` as the wasmtime-cleanup/malloc abort
bug but TESTING.md maps `#595` to IO.sleep/KeyboardInterrupt; make these
consistent by either (A) updating the docstring in tests/test_stress.py to
reference the correct issue number that tracks the wasmtime malloc/cleanup bug
(keep `#593` as-is), or (B) if the test actually covers the
IO.sleep/KeyboardInterrupt path, change the docstring description to match
TESTING.md’s mapping; ensure you update the issue ID/description pair in the
triple-quoted header so TESTING.md and tests/test_stress.py reference the same
issue for `#595`.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@TESTING.md`:
- Around line 386-387: Update the "Path-filtered PRs" list in TESTING.md to
include the new path pattern for the checker: add `vera/checker/**` alongside
the existing `vera/codegen/**`, `vera/wasm/**`, `tests/test_stress.py` and the
workflow file so the documentation matches the workflow trigger; edit the
numbered item describing "Path-filtered PRs touching `vera/codegen/**`,
`vera/wasm/**`, `tests/test_stress.py`, or the workflow file" to also mention
`vera/checker/**` (and keep wording consistent with the existing list).

---

Duplicate comments:
In @.github/workflows/nightly-stress.yml:
- Around line 136-146: The current dedup logic assumes existing[0] is the
tracking issue but github.rest.issues.listForRepo can return PRs; update the
selection after calling github.rest.issues.listForRepo to filter out pull
requests (items without a pull_request property) and to match the tracking issue
title (e.g., compare item.title to the expected ISSUE_TITLE or other known
heading) before picking the issue number; then use that matched item's .number
when calling github.rest.issues.createComment so you always comment on the
correct tracking issue (keep LABEL and the listForRepo call but replace
existing[0] with the first filtered+matched item).

In `@tests/test_stress.py`:
- Around line 236-239: The test docstring in tests/test_stress.py references
issue `#595` as the wasmtime-cleanup/malloc abort bug but TESTING.md maps `#595` to
IO.sleep/KeyboardInterrupt; make these consistent by either (A) updating the
docstring in tests/test_stress.py to reference the correct issue number that
tracks the wasmtime malloc/cleanup bug (keep `#593` as-is), or (B) if the test
actually covers the IO.sleep/KeyboardInterrupt path, change the docstring
description to match TESTING.md’s mapping; ensure you update the issue
ID/description pair in the triple-quoted header so TESTING.md and
tests/test_stress.py reference the same issue for `#595`.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 26695255-c0da-4012-88cc-09a5e48d50b7

📥 Commits

Reviewing files that changed from the base of the PR and between 2f7faa7 and 072fde9.

📒 Files selected for processing (3)
  • .github/workflows/nightly-stress.yml
  • TESTING.md
  • tests/test_stress.py

Comment thread TESTING.md Outdated
aallan added 2 commits May 13, 2026 14:42
**CR-C** (`.github/workflows/nightly-stress.yml:131-140`) —
`listForRepo` dedup query returned both issues and PRs.

`github.rest.issues.listForRepo` with a label filter returns
any item carrying the label, including PRs (the GitHub API
models PRs as issues with a `pull_request` field).  If a PR
were ever labelled `stress-regression`, the workflow's dedup
branch would post the failure comment on the PR instead of
filing a fresh tracking issue.

Real risk is low — the `stress-regression` label is only
applied by this workflow today — but the fix is cheap and
defensive: filter the result list to exclude items with a
`pull_request` field set.  Renamed the `existing` const to
`candidates` (the raw API result) and built `existing` as the
filtered subset, then the rest of the dedup logic continues
unchanged.

Skipped CR's additional suggestion of filtering by exact title
match (`i.title === EXPECTED_TITLE`).  Rationale: the label
is already the dedup primary key, and a strict title filter
would break dedup if a maintainer renames the tracking issue
(e.g. to "Nightly stress regression — fixed by PR #N").
Defaulting to "label is enough" is more forgiving for the
expected human workflow.

**CR-D** (`TESTING.md:354-355` and parallel sites) —
misattributed #595 citation on the Conway's Life stress test.

CR's investigation surfaced an inconsistency: line 88 maps
#595 to IO.sleep/KeyboardInterrupt handling, but my line 354
described the Conway's Life stress test as covering "#593 and
#595 (malloc abort during wasmtime cleanup)".  Verified
against the CHANGELOG record (entries at lines 248 and 283
trace #595 to a cleanup-path bug fixed by an upstream wasmtime-
py change, not a Conway's Life data-integrity bug): #595's
regression coverage lives in `TestHostSleepKeyboardInterrupt`
in `tests/test_runtime_traps.py`, NOT in this stress test.

The Conway's Life stress test:
- Uses `effects(pure)` (no Ctrl-C, no IO.sleep)
- Completes normally (no wasmtime-cleanup signal escape)
- Therefore exercises NONE of #595's code paths

The misattribution came from the original #596 issue body
which paired the two issues as "scale-dependent runtime bugs";
when I lifted the pairing into the test description I didn't
re-verify whether the structural shapes actually matched.

Fix: drop the #595 reference from this test's description in
4 places — `tests/test_stress.py` docstring (2 sites: header
paragraph + body comment), `TESTING.md:354` (per-test
description), `TESTING.md:69` (test-file-row's `Pins ...`
list), `CHANGELOG.md` (the v0.0.152 Added paragraph).  Added
an explanatory "Note on #595" block in the test docstring +
parallel parenthetical in TESTING.md so a future reader
understands why the citation was dropped (and doesn't re-add
it).  CHANGELOG historical entries (lines 248, 283) that
reference #595 in its OWN bug-history context are unchanged
— those are accurate descriptions of #595 in its proper
domain.

Doc counts: test_stress.py 479 → 487 lines (docstring
expansion for the #595-misattribution note).

Validation:
- YAML parses (`yaml.safe_load` on the workflow)
- `pytest tests/test_stress.py -m stress -q` → 14 passed in
  0.65s
- `python scripts/check_doc_counts.py` → consistent
- No residual #595 references in PR-touched files outside the
  intentional historical "Note on #595" explanatory blocks

Refs #596 #669
CR-E (`TESTING.md:386`) — the "Path-filtered PRs" list under
**CI integration** lagged the workflow's actual `paths:`
block.  The workflow includes `vera/checker/**` (added in
`072fde9`) but the docs still listed only `vera/codegen/**` /
`vera/wasm/**` / `tests/test_stress.py` / "the workflow file
itself".  Updated the docs to match the workflow + carried
over the same rationale comment ("included because the AST
shape it produces flows into codegen").

Skipping two duplicate CR comments that re-flagged
already-fixed issues:

- `tests/test_stress.py:236-239` (#595 misattribution) —
  already fixed in `8a2e237`; the docstring now contains an
  explicit "Note on #595" block explaining the misattribution
  history.
- `.github/workflows/nightly-stress.yml:136-146` (`listForRepo`
  PR-filter) — already fixed in `8a2e237`; the script filters
  `candidates` to `existing = candidates.filter(item =>
  !item.pull_request)` before the dedup branch.

Validation
- `python scripts/check_doc_counts.py` -> consistent

Refs #596 #669
@aallan aallan merged commit 8c79f22 into main May 13, 2026
24 checks passed
@aallan aallan deleted the fix-596-stress-tests branch May 13, 2026 13:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Stress-test harness: surface scale-dependent codegen/runtime bugs before users do

1 participant