Fix #596: stress-test harness for scale-dependent regression coverage by aallan · Pull Request #669 · aallan/vera

aallan · 2026-05-13T12:56:41Z

Summary

Pre-#596 the project relied on user-reported real-world programs to surface scale-dependent codegen/runtime bugs (#570 / #515 / #593 / #595 / #549 / #487 / #348 / #573 all required a real program to find). This PR adds a dedicated stress-test harness so the next bug of this class is caught before users see it.

tests/test_stress.py with 8 scale-dependent regression tests covering: 10K array_map, 5K nested-array array_map, 1K-deep tail recursion with allocating arg, 20×20 nested array-fold-of-array-fold (Conway's Life at 12x30 still corrupts strings from gen 1 onwards (additional trigger beyond #588) #593/macOS malloc abort during wasmtime cleanup after Ctrl-C-in-host-import #595 territory), 100K array_fold, 10K String allocations through interpolation, 1K State<Int> get/put cycles in single handler, 10K IO.print with stdout-capture buffer.
stress pytest marker in pyproject.toml with default addopts = "-m 'not stress'" so stress tests skip from the per-PR pytest run.
.github/workflows/nightly-stress.yml with three triggers: nightly cron (primary safety net), path-filtered PRs (fail-fast on PRs touching vera/codegen/**/vera/wasm/**/test file), workflow_dispatch (manual).
"Stress tests" subsection in TESTING.md documenting the 8 programs, triggers, and assertion-shape convention.
scripts/check_doc_counts.py fix — added -o addopts="" to the pytest collection call so the doc-counts hook still sees stress tests in its per-file inventory (without this, the default -m 'not stress' filter deselects them and the script reports a missing row).
ROADMAP cleanup — removed Stress-test harness: surface scale-dependent codegen/runtime bugs before users do #596 from Order 1; tiers renumbered.

Why this shape

Each test asserts on a SPECIFIC observable (closed-form sum, exact line count, etc.) rather than just "completed without crashing", so a future regression that silently short-circuits or skips iterations would still fail. Iteration counts are tuned to the smallest scale where each bug class historically manifested with ~2-3x safety margin — the goal is reliable detection of the bug class, not benchmarking. Full suite completes in 0.25s in-process today; the 5-minute CI budget leaves plenty of headroom for the suite to grow.

Test plan

pytest tests/test_stress.py -m stress -v → 8 passed in 0.25s
pytest tests/ -q → 3,849 passed, 14 skipped, 8 deselected (default addopts correctly filter stress)
mypy vera/ → clean
python scripts/check_walker_coverage.py → 9 walkers, all 29 Expr subclasses covered
python scripts/check_doc_counts.py → consistent (3,871 tests, 32 files, 27 hooks)
Doc sweep across SKILL / CLAUDE / AGENTS / README / vera/README / CONTRIBUTING / spec / docs / examples → zero stale "stress tests missing" claims

Closes #596

Summary by CodeRabbit

New Features
- Added an opt-in stress test suite with eight scale-focused scenarios (arrays, deep recursion, allocations, state ops, heavy I/O); skipped by default and runnable via a stress marker.
CI
- Added a nightly/path-filtered/manual stress workflow that runs stress tests and files/comments a tracking issue on scheduled failures.
Documentation
- Expanded testing docs, changelog, roadmap and README with guidance, runtimes and test details.
Chores
- Version bumped to 0.0.152.

Pre-#596 the project relied on user-reported real-world programs to surface scale-dependent codegen/runtime bugs — #570 (iterative- builder shadow-stack overflow at ~4000 elements), #515 (GC self- fault under sustained allocation), #593 (Conway's Life corruption at 12×30+), and others. The standard test suite has thousands of small focused tests but none that exercise the runtime at scales where these bugs first manifested. This PR closes that gap with a dedicated stress-test harness. What ships: 1. **`tests/test_stress.py`** with 8 scale-dependent regression tests, each hitting a specific scale axis: - 10K `array_map` over `Array<Int>` — shadow-stack overflow (#570) - 5K nested-array `array_map` over `Array<Array<Bool>>` — per-iteration root accumulation (#570/#515) - 1K-deep tail recursion with allocating String arg — TCO/GC interaction (#549) - 20×20 nested array-fold-of-array-fold — #593/#595 regression coverage (the structural shape that hit those bugs, not the bug-report program itself) - 100K `array_fold` over Int array — allocation pressure (#487/#348) - 10K String allocations through interpolation — wrap-table compaction (#573) - 1K `State<Int>` get/put cycles in single handler — handler/ resume plumbing under load - 10K `IO.print` calls with stdout-capture — host-import call rate, capture buffer growth Each test asserts on a SPECIFIC observable (closed-form sum, exact line count, etc.), not just "completed without crashing", so a future regression that silently short-circuits or skips iterations would still fail. 2. **`stress` pytest marker** registered in `pyproject.toml` with default `addopts = "-m 'not stress'"` — stress tests are skipped from the per-PR pytest run. Local invocation: `pytest -m stress` or `pytest tests/test_stress.py -m stress -v`. 3. **`.github/workflows/nightly-stress.yml`** with three triggers: - Nightly cron at 06:00 UTC (primary safety net) - Path-filtered PRs touching `vera/codegen/**` / `vera/wasm/**` / `tests/test_stress.py` / the workflow itself (fail-fast on PRs likely to break stress invariants) - Manual `workflow_dispatch` (Actions tab) 4. **"Stress tests" subsection in `TESTING.md`** documenting the 8 test programs with their scale axes, the default-skip behaviour, the three CI triggers, and the assertion-shape convention. 5. **`scripts/check_doc_counts.py` fix** — added `-o addopts=""` to the collection-only call so the script sees every test file including `test_stress.py` (without this override, the default `-m 'not stress'` filter would deselect stress tests and the per-file counter would report `test_stress.py` as a missing row). 6. **ROADMAP cleanup** — removed #596 from Order 1 in the Stabilisation tier (now closed by this PR). Tier renumbered 1-5 (was 1-6). Agent-integration tier renumbered 6-8 (was 7-9). Release prep: - Version bumped to v0.0.152 across `__init__.py`, `pyproject.toml`, `uv.lock`, `docs/index.html`, `README.md`. - CHANGELOG.md entry under `[0.0.152]` with Added + Documentation sections. - HISTORY.md row added (one-line template-compliant). - TESTING.md overview totals updated (3,863 → 3,871 tests, 31 → 32 files). - ROADMAP.md test count updated to 3,871. Validation: - `pytest tests/test_stress.py -m stress -v` → 8 passed in 0.25s - `pytest tests/ -q` → 3,849 passed, 14 skipped, 8 deselected (stress filtered out by default addopts) - `mypy vera/` → clean - `python scripts/check_walker_coverage.py` → 9 walkers, all 29 Expr subclasses covered - `python scripts/check_doc_counts.py` → consistent (3,871 tests, 32 files, 27 hooks, 7 CI jobs) - Doc sweep across SKILL / CLAUDE / AGENTS / README / vera/README / CONTRIBUTING / spec / docs / examples → zero stale "stress tests missing" claims Closes #596 Co-Authored-By: Claude <noreply@anthropic.invalid>

coderabbitai · 2026-05-13T12:56:54Z

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

@coderabbitai resume to resume automatic reviews.
@coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

▶️ Resume reviews
🔍 Trigger review

📝 Walkthrough

Walkthrough

Adds a pytest stress marker and an eight-program stress regression suite with a shared _run runner, nightly/path-filtered GitHub Actions workflow, pytest collection tweak for docs, version bump to v0.0.152, and corresponding docs/release/roadmap updates.

Changes

Stress-test harness (v0.0.152)

Layer / File(s)	Summary
Pytest configuration and version bump `pyproject.toml`, `vera/__init__.py`	`stress` pytest marker defined with description; `addopts` configured to exclude `stress` tests from normal runs. Project version bumped to `0.0.152`.
Test helper and stress regression suite `tests/test_stress.py`	Added `_run` helper to compile/execute embedded Vera sources (optional scoped eager-GC) and eight stress tests: 10K `array_map` (ints), 5K nested `array_map`, 1000-deep tail recursion, nested-grid `count_alive`-style fold (20×20), 100K `array_fold`, 10K string allocations, 1K `State<T>` ops, and 10K `IO.print` calls; tests assert deterministic observables and several are parametrised over default vs eager GC.
CI workflow and test collection adjustment `.github/workflows/nightly-stress.yml`, `scripts/check_doc_counts.py`	Added `nightly-stress.yml` (daily 06:00 UTC cron, PR path filters for `vera/codegen/`, `vera/wasm/`, `vera/checker/**`, `tests/test_stress.py`, and the workflow file, plus manual dispatch). Workflow installs deps and runs `pytest -v -m stress` for the stress suite; on scheduled failures it ensures a `stress-regression` label and comments or creates a tracking issue. `check_doc_counts.py` clears pytest `addopts` during collection so docs-counting includes stress tests.
Documentation, release notes, and roadmap updates `CHANGELOG.md`, `HISTORY.md`, `README.md`, `ROADMAP.md`, `TESTING.md`	Added `0.0.152` release notes documenting the stress harness, updated history and README version/metrics, removed `#596` from ROADMAP Stabilisation tier and renumbered items, and added a Stress Tests subsection and test table row in `TESTING.md`.

Sequence Diagram(s)

sequenceDiagram
  participant Test
  participant _run
  participant Parser
  participant Compiler
  participant Runtime
  Test->>_run: provide Vera source
  _run->>Parser: write & parse .vera file
  Parser->>Compiler: transform AST
  Compiler->>Compiler: assert no error diagnostics
  Compiler->>Runtime: execute compiled program
  Runtime->>_run: return exec_result.value
  _run->>Test: return value
  _run->>_run: cleanup temp file

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

aallan/vera#488: Adds related scale regression tests (array_map at large sizes) that exercise similar GC/allocation behaviour changed in that PR.

Suggested labels

compiler, tests, ci, docs

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title 'Fix `#596`: stress-test harness for scale-dependent regression coverage' directly and concisely summarizes the main change: implementing a stress-test harness to cover scale-dependent bugs, closing issue `#596`.
Linked Issues check	✅ Passed	The PR fully implements all coding objectives from `#596`: eight scale-dependent regression tests in tests/test_stress.py, pytest stress marker with default skip via pyproject.toml, nightly-stress.yml workflow with path/schedule/manual triggers, TESTING.md documentation, and scripts/check_doc_counts.py override for collection.
Out of Scope Changes check	✅ Passed	All changes are in-scope: test harness implementation, configuration, CI workflow, documentation updates, and version bumps are all required to deliver `#596`. ROADMAP.md updates appropriately reflect completion of `#596` and reorder remaining tasks.
Docstring Coverage	✅ Passed	Docstring coverage is 90.00% which is sufficient. The required threshold is 80.00%.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch fix-596-stress-tests

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

codecov · 2026-05-13T12:57:59Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 91.08%. Comparing base (f42691b) to head (8f44cb2).

Additional details and impacted files

@@           Coverage Diff           @@
##             main     #669   +/-   ##
=======================================
  Coverage   91.08%   91.08%           
=======================================
  Files          60       60           
  Lines       23283    23283           
  Branches      259      259           
=======================================
  Hits        21207    21207           
  Misses       2069     2069           
  Partials        7        7

Flag	Coverage Δ
javascript	`57.36% <ø> (ø)`
python	`94.86% <100.00%> (ø)`

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

User feedback on #669: the initial Stress Tests subsection had the summary table + behaviour blurb but not per-test detail. Expanded the section to cover, for each of the 8 tests: - The exact program shape (what it allocates, what loop it drives, what it folds/maps/iterates over) - The expected return value and the assertion's *closed-form* derivation (e.g. why `test_10k_string_allocations` should return 38,890 — broken down by digit-count contribution) - The specific bug class it guards against, with issue references - Why the chosen iteration count is meaningful relative to that bug class's historical failure threshold Also added an "Adding a stress test" subsection codifying the conventions: 1. Target a specific scale axis + name the bug class 2. Pick the smallest scale with ~2-3x safety margin (don't maximise) 3. Assert on a SPECIFIC observable with closed-form expected value (avoid "no exception raised") 4. Use the `_run` helper for tempfile lifecycle 5. `pytestmark = pytest.mark.stress` at module level for correct gating Plus reformatted the local-invocation block as a fenced bash example so the three useful invocations (full stress, file- scoped, single-test) are copy-pasteable. Validation - `python scripts/check_doc_counts.py` -> consistent (no count-affecting numbers changed) Refs #596 #669

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@tests/test_stress.py`:
- Around line 165-175: The test test_conways_life_20x20_100_generations
currently builds an all-empty grid and only calls count_alive once, so it does
not run 100 generations as claimed; update the test to either (A) implement the
full generation loop by repeatedly applying the function that advances the board
state (locate the code that constructs the 20x20 grid and the step/advance
function used elsewhere in tests) for 100 iterations and then assert the
expected final live-cell count using count_alive, or (B) if you intended a
single-step regression, rename the test to reflect the actual behavior and
adjust the docstring accordingly; ensure you use the fixed initial pattern (not
an all-empty grid) if asserting a known stabilized population.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: a935bc06-d420-4455-b203-a2f2067eb5bf

📥 Commits

Reviewing files that changed from the base of the PR and between f42691b and 0e74bd9.

⛔ Files ignored due to path filters (5)

docs/index.html is excluded by !docs/**
docs/index.md is excluded by !docs/**
docs/llms-full.txt is excluded by !docs/**
docs/llms.txt is excluded by !docs/**
uv.lock is excluded by !**/*.lock, !uv.lock

📒 Files selected for processing (10)

.github/workflows/nightly-stress.yml
CHANGELOG.md
HISTORY.md
README.md
ROADMAP.md
TESTING.md
pyproject.toml
scripts/check_doc_counts.py
tests/test_stress.py
vera/__init__.py

The #596 issue body explicitly called for a subset of stress tests to run under `VERA_EAGER_GC=1` to catch GC-rooting regressions on the very first iteration rather than only at scale. I missed that requirement in the initial PR and the user surfaced it. Implementation: - `EAGER_GC_PARAMS = pytest.mark.parametrize(...)` decorator with `eager_gc: [False, True]` and IDs `["default_gc", "eager_gc"]` exposed at module level for the subset of tests targeting GC-rooting bug classes. - `_run` helper extended with `eager_gc: bool` + `monkeypatch: pytest.MonkeyPatch | None` keyword args. When `eager_gc=True`, calls `monkeypatch.setenv("VERA_EAGER_GC", "1")` before the `compile()` call. `VERA_EAGER_GC` is read at compile time by `vera/codegen/assembly.py::_emit_alloc`, which then emits a `call $gc_collect` as the first instruction of the runtime's `$alloc` function — forcing a full GC pass on every allocation. - 6 of the 8 logical tests carry `@EAGER_GC_PARAMS` and the paired signature `(eager_gc: bool, monkeypatch: pytest .MonkeyPatch)`: - `test_array_map_over_10k_int_array` (#570 target) - `test_array_map_over_5k_nested_bool_array` (#570/#515 target) - `test_deep_tail_recursion_with_allocating_arg` (#549 TCO/GC target) - `test_conways_life_20x20_100_generations` (#593/#595 target) - `test_10k_string_allocations` (#573 wrap-table target) - `test_state_handler_1k_ops` (captured-frame target) - 2 tests stay un-parametrised — `test_array_fold_100k_ iterations` (allocation-pressure target, NOT GC-rooting; 100K forced-GC would inflate suite time) and `test_10k_io_print_calls` (host-import target; `host_print` bridge doesn't allocate Vera-heap data). - Total test count: 8 logical × eager lane parametrisation on 6 of them = 14 test instances. Wall-clock 0.66s in-process for the full suite. Documentation updates: - TESTING.md: new "Eager-GC lane" subsection explaining the mechanism, the subset rationale, and the parametrize-with- monkeypatch implementation. Per-test descriptions in the inventory marked with `**[eager-GC]**` for the parametrised 6 and explicit "Not in eager-GC lane" notes for the 2 excluded. "Adding a stress test" checklist gained a new step 6 codifying the opt-in convention. - CHANGELOG.md v0.0.152: extended with a new bullet describing the eager-GC lane mechanism and the subset rationale. - Test counts auto-updated: 3,871 -> 3,877 (+6 parametrised instances), test_stress.py 8 -> 14 tests, 375 -> 469 lines. Validation: - `pytest tests/test_stress.py -m stress -v` -> 14 passed in 0.64s (was 8 passed in 0.25s) - `python scripts/check_doc_counts.py` -> consistent - `python scripts/check_walker_coverage.py` -> clean - `python scripts/check_version_sync.py` -> consistent Refs #596 #669

coderabbitai

Actionable comments posted: 1

♻️ Duplicate comments (1)

tests/test_stress.py (1)

232-291: ⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

The Conway stress case still does not execute 100 generations.

Line 233 names a 20x20_100_generations workload, but Lines 276-285 only construct an all-false grid and run a single count_alive pass. This misses the intended regression surface (state evolution across generations) and weakens coverage for the #593/#595 class.

Please either implement the 100-step evolution loop with a concrete final-state assertion, or rename/re-scope this test to match what it actually validates.

As per coding guidelines tests/**/*.py: “These are pytest test files. Review for correctness and coverage gaps.”

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@TESTING.md`:
- Line 389: Update the measured runtime phrase in TESTING.md where it currently
reads "0.25s in-process on a developer laptop today" to reflect the new
parametrised baseline (≈0.64s) or replace it with a versioned/date-stamped
figure; ensure the sentence in the Budget paragraph (the text mentioning
iteration counts and the 5-minute target) now states the updated runtime and/or
includes a version/date tag so the measured value stays accurate going forward.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 8b53febf-685d-40df-9b93-73b812f769dc

📥 Commits

Reviewing files that changed from the base of the PR and between 11d3ee9 and 9c3cc9e.

📒 Files selected for processing (5)

CHANGELOG.md
README.md
ROADMAP.md
TESTING.md
tests/test_stress.py

The original #596 spec covered CI integration but didn't cover reporting — a cron-failure would be visible only to whoever opened the Actions tab and noticed. This is fragile for an early-warning system. Adds an `Open or update tracking issue on cron failure` step to `.github/workflows/nightly-stress.yml` that runs only on `schedule`-triggered failures (skipped on `pull_request` and `workflow_dispatch` — those have their own reporting surfaces): 1. Ensures the `stress-regression` label exists (idempotent, creates on first failure with description and red colour). 2. Searches for an open issue carrying the label. 3. If one exists: appends a comment with the new commit SHA + run URL, so the original tracking issue accumulates the failure history. 4. If none: files a fresh issue titled "Nightly stress regression on main (tracking)" with the SHA, run URL, and pointer to `tests/test_stress.py` for the failing test's docstring (each test names the bug class it guards against — narrows the regression family for the bisector). Open tracking issues stay open across days until manually closed. Auto-closing on success would be wrong while a maintainer is mid-debug. Job-level permission `issues: write` added (scoped to the job, not workflow-wide, so the test-runner step keeps the tighter `contents: read` default). Implementation uses `actions/github-script@v7` which only sees GitHub-controlled values (`github.sha`, `github.server_url`, `github.repository`, `github.run_id`) — no user-supplied input in the script body. Documentation: - TESTING.md "Stress Tests" section's CI-integration bullets updated to describe the auto-filing behaviour + a new "Failure reporting (cron only)" subsection describing the dedup logic, the issue lifecycle, and the implementation details. - CHANGELOG.md v0.0.152: added a bullet describing the failure-reporting wire-up. Validation: - YAML parses (`python -c 'import yaml; yaml.safe_load(...)'`) - 5 steps in the job (was 4); failure-reporting step gated on `failure() && github.event_name == 'schedule'` - Stress suite still 14/14 pass Refs #596 #669

Three changes batched into one commit (all #669 review-driven): **1. Widen the workflow's path-filter to include `vera/checker/**`.** The user's earlier review-question on PR #669 noted a blind spot in the path-filter: PRs touching `vera/checker/**` could produce subtly different AST shapes that downstream codegen mishandles, but the filter only fired on `vera/codegen/**` and `vera/wasm/**`. Such a regression would surface only at the next nightly cron (up to ~24h delay). Added `vera/checker/**` to the `pull_request` paths and updated the workflow's leading comment to explain the rationale. **2. CR-A: rename `test_conways_life_20x20_100_generations`.** CodeRabbit flagged that the test name and docstring promised "100 generations" but the test only built an empty grid and called `count_alive` once. Verified — the program loads no Life-step function and does no iteration. Renamed to `test_conways_life_grid_alloc_and_count_alive_20x20` (option B from the CR suggestion: rename to reflect actual behaviour rather than implement a full 100-gen Life loop). Docstring rewritten to explicitly call out: - What the test does: build a 20×20 grid + single `count_alive` pass - What it does NOT do: run 100 generations (the original name implied that; this rename corrects it) - Why the structural-shape coverage is still valuable for #593 / #595 — those bugs hit the same nested-array-alloc + nested-fold + captured-binding-refs code paths TESTING.md per-test description updated to match the new name and to add a "**structural-shape** test, not a Life simulation" clarifier. Section header comment also updated. **3. CR-B: refresh the stale "0.25s in-process" figure in TESTING.md.** The 0.25s number was measured before the eager-GC lane parametrisation doubled the relevant subset. Current wall- clock is 0.66s (14 instances). Updated the Budget paragraph to: - Use the new measured value (0.66s) with a date stamp (2026-05-13) so the figure stays auditable - Add an explicit "if this drifts more than ~2x, treat as a signal" guard rail so future drift gets flagged Doc counts auto-updated: - TESTING.md `test_stress.py` lines: 469 → 479 (docstring expansion for the renamed test) Validation: - `pytest tests/test_stress.py -m stress -v` → 14 passed in 0.66s (renamed test still passes under both eager/default modes) - `python scripts/check_doc_counts.py` → consistent - YAML parses (`yaml.safe_load` on the workflow) Refs #596 #669

coderabbitai

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In @.github/workflows/nightly-stress.yml:
- Around line 131-140: The current dedup logic uses
github.rest.issues.listForRepo and takes existing[0] which may be a PR; update
the selection after the listForRepo call to filter out any items with a non-null
pull_request field and to match the intended issue title (use LABEL and the
tracking issue title constant) before choosing issueNumber; i.e., call
github.rest.issues.listForRepo as before, then set a filtered array =
existing.filter(i => !i.pull_request && i.title === EXPECTED_TITLE) and use
filtered[0].number (or create the issue if none) so the workflow comments only
on the real tracking issue, not a PR.

In `@TESTING.md`:
- Around line 354-355: Summary: The stress-test entry
test_conways_life_20x20_100_generations incorrectly references issue `#595` for
the malloc-during-wasmtime-cleanup bug; update the issue ID/description to match
the actual bug referenced elsewhere. Fix: open TESTING.md, locate the
"test_conways_life_20x20_100_generations" entry and replace the incorrect "#595"
(and/or its description) with the correct issue identifier and short description
that corresponds to the malloc/wasmtime cleanup bug (or vice versa if `#595`
actually refers to IO.sleep/KeyboardInterrupt per line 88); ensure the updated
reference matches the mapping used elsewhere in the document so traceability is
consistent. Verify the change by searching for both "#595" and the
malloc/wasmtime cleanup text to confirm one-to-one mapping.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 09ef58ed-a3f7-4e1c-a670-21180bb2f813

📥 Commits

Reviewing files that changed from the base of the PR and between 9c3cc9e and 2f7faa7.

📒 Files selected for processing (3)

.github/workflows/nightly-stress.yml
CHANGELOG.md
TESTING.md

coderabbitai

Actionable comments posted: 1

♻️ Duplicate comments (2)

.github/workflows/nightly-stress.yml (1)

136-146: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Dedup logic can target the wrong thread.

listForRepo may return pull requests as well as issues, so existing[0] is not guaranteed to be the tracking issue. Filter for non-PR items and match the expected title before choosing issue_number.

Suggested patch

             const { data: existing } = await github.rest.issues.listForRepo({
               owner: context.repo.owner,
               repo: context.repo.repo,
               labels: LABEL,
               state: "open",
-              per_page: 5,
+              per_page: 20,
             });
+            const trackingIssues = existing.filter(
+              (item) => !item.pull_request && item.title === TITLE
+            );
 
-            if (existing.length > 0) {
-              const issueNumber = existing[0].number;
+            if (trackingIssues.length > 0) {
+              const issueNumber = trackingIssues[0].number;
               await github.rest.issues.createComment({
                 owner: context.repo.owner,
                 repo: context.repo.repo,
                 issue_number: issueNumber,
                 body: commentBody,
               });

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.github/workflows/nightly-stress.yml around lines 136 - 146, The current
dedup logic assumes existing[0] is the tracking issue but
github.rest.issues.listForRepo can return PRs; update the selection after
calling github.rest.issues.listForRepo to filter out pull requests (items
without a pull_request property) and to match the tracking issue title (e.g.,
compare item.title to the expected ISSUE_TITLE or other known heading) before
picking the issue number; then use that matched item's .number when calling
github.rest.issues.createComment so you always comment on the correct tracking
issue (keep LABEL and the listForRepo call but replace existing[0] with the
first filtered+matched item).

tests/test_stress.py (1)

236-239: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Issue reference for #595 looks inconsistent.

This block links #595 to the wasmtime-cleanup/malloc bug class, but TESTING.md maps #595 to the IO.sleep/KeyboardInterrupt path. Please align the issue ID/description pair so traceability is unambiguous across docs and tests.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/test_stress.py` around lines 236 - 239, The test docstring in
tests/test_stress.py references issue `#595` as the wasmtime-cleanup/malloc abort
bug but TESTING.md maps `#595` to IO.sleep/KeyboardInterrupt; make these
consistent by either (A) updating the docstring in tests/test_stress.py to
reference the correct issue number that tracks the wasmtime malloc/cleanup bug
(keep `#593` as-is), or (B) if the test actually covers the
IO.sleep/KeyboardInterrupt path, change the docstring description to match
TESTING.md’s mapping; ensure you update the issue ID/description pair in the
triple-quoted header so TESTING.md and tests/test_stress.py reference the same
issue for `#595`.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@TESTING.md`:
- Around line 386-387: Update the "Path-filtered PRs" list in TESTING.md to
include the new path pattern for the checker: add `vera/checker/**` alongside
the existing `vera/codegen/**`, `vera/wasm/**`, `tests/test_stress.py` and the
workflow file so the documentation matches the workflow trigger; edit the
numbered item describing "Path-filtered PRs touching `vera/codegen/**`,
`vera/wasm/**`, `tests/test_stress.py`, or the workflow file" to also mention
`vera/checker/**` (and keep wording consistent with the existing list).

---

Duplicate comments:
In @.github/workflows/nightly-stress.yml:
- Around line 136-146: The current dedup logic assumes existing[0] is the
tracking issue but github.rest.issues.listForRepo can return PRs; update the
selection after calling github.rest.issues.listForRepo to filter out pull
requests (items without a pull_request property) and to match the tracking issue
title (e.g., compare item.title to the expected ISSUE_TITLE or other known
heading) before picking the issue number; then use that matched item's .number
when calling github.rest.issues.createComment so you always comment on the
correct tracking issue (keep LABEL and the listForRepo call but replace
existing[0] with the first filtered+matched item).

In `@tests/test_stress.py`:
- Around line 236-239: The test docstring in tests/test_stress.py references
issue `#595` as the wasmtime-cleanup/malloc abort bug but TESTING.md maps `#595` to
IO.sleep/KeyboardInterrupt; make these consistent by either (A) updating the
docstring in tests/test_stress.py to reference the correct issue number that
tracks the wasmtime malloc/cleanup bug (keep `#593` as-is), or (B) if the test
actually covers the IO.sleep/KeyboardInterrupt path, change the docstring
description to match TESTING.md’s mapping; ensure you update the issue
ID/description pair in the triple-quoted header so TESTING.md and
tests/test_stress.py reference the same issue for `#595`.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 26695255-c0da-4012-88cc-09a5e48d50b7

📥 Commits

Reviewing files that changed from the base of the PR and between 2f7faa7 and 072fde9.

📒 Files selected for processing (3)

.github/workflows/nightly-stress.yml
TESTING.md
tests/test_stress.py

**CR-C** (`.github/workflows/nightly-stress.yml:131-140`) — `listForRepo` dedup query returned both issues and PRs. `github.rest.issues.listForRepo` with a label filter returns any item carrying the label, including PRs (the GitHub API models PRs as issues with a `pull_request` field). If a PR were ever labelled `stress-regression`, the workflow's dedup branch would post the failure comment on the PR instead of filing a fresh tracking issue. Real risk is low — the `stress-regression` label is only applied by this workflow today — but the fix is cheap and defensive: filter the result list to exclude items with a `pull_request` field set. Renamed the `existing` const to `candidates` (the raw API result) and built `existing` as the filtered subset, then the rest of the dedup logic continues unchanged. Skipped CR's additional suggestion of filtering by exact title match (`i.title === EXPECTED_TITLE`). Rationale: the label is already the dedup primary key, and a strict title filter would break dedup if a maintainer renames the tracking issue (e.g. to "Nightly stress regression — fixed by PR #N"). Defaulting to "label is enough" is more forgiving for the expected human workflow. **CR-D** (`TESTING.md:354-355` and parallel sites) — misattributed #595 citation on the Conway's Life stress test. CR's investigation surfaced an inconsistency: line 88 maps #595 to IO.sleep/KeyboardInterrupt handling, but my line 354 described the Conway's Life stress test as covering "#593 and #595 (malloc abort during wasmtime cleanup)". Verified against the CHANGELOG record (entries at lines 248 and 283 trace #595 to a cleanup-path bug fixed by an upstream wasmtime- py change, not a Conway's Life data-integrity bug): #595's regression coverage lives in `TestHostSleepKeyboardInterrupt` in `tests/test_runtime_traps.py`, NOT in this stress test. The Conway's Life stress test: - Uses `effects(pure)` (no Ctrl-C, no IO.sleep) - Completes normally (no wasmtime-cleanup signal escape) - Therefore exercises NONE of #595's code paths The misattribution came from the original #596 issue body which paired the two issues as "scale-dependent runtime bugs"; when I lifted the pairing into the test description I didn't re-verify whether the structural shapes actually matched. Fix: drop the #595 reference from this test's description in 4 places — `tests/test_stress.py` docstring (2 sites: header paragraph + body comment), `TESTING.md:354` (per-test description), `TESTING.md:69` (test-file-row's `Pins ...` list), `CHANGELOG.md` (the v0.0.152 Added paragraph). Added an explanatory "Note on #595" block in the test docstring + parallel parenthetical in TESTING.md so a future reader understands why the citation was dropped (and doesn't re-add it). CHANGELOG historical entries (lines 248, 283) that reference #595 in its OWN bug-history context are unchanged — those are accurate descriptions of #595 in its proper domain. Doc counts: test_stress.py 479 → 487 lines (docstring expansion for the #595-misattribution note). Validation: - YAML parses (`yaml.safe_load` on the workflow) - `pytest tests/test_stress.py -m stress -q` → 14 passed in 0.65s - `python scripts/check_doc_counts.py` → consistent - No residual #595 references in PR-touched files outside the intentional historical "Note on #595" explanatory blocks Refs #596 #669

CR-E (`TESTING.md:386`) — the "Path-filtered PRs" list under **CI integration** lagged the workflow's actual `paths:` block. The workflow includes `vera/checker/**` (added in `072fde9`) but the docs still listed only `vera/codegen/**` / `vera/wasm/**` / `tests/test_stress.py` / "the workflow file itself". Updated the docs to match the workflow + carried over the same rationale comment ("included because the AST shape it produces flows into codegen"). Skipping two duplicate CR comments that re-flagged already-fixed issues: - `tests/test_stress.py:236-239` (#595 misattribution) — already fixed in `8a2e237`; the docstring now contains an explicit "Note on #595" block explaining the misattribution history. - `.github/workflows/nightly-stress.yml:136-146` (`listForRepo` PR-filter) — already fixed in `8a2e237`; the script filters `candidates` to `existing = candidates.filter(item => !item.pull_request)` before the dedup branch. Validation - `python scripts/check_doc_counts.py` -> consistent Refs #596 #669

coderabbitai Bot reviewed May 13, 2026

View reviewed changes

Comment thread tests/test_stress.py Outdated

coderabbitai Bot reviewed May 13, 2026

View reviewed changes

Comment thread TESTING.md Outdated

aallan added 2 commits May 13, 2026 14:18

coderabbitai Bot reviewed May 13, 2026

View reviewed changes

Comment thread .github/workflows/nightly-stress.yml Outdated

Comment thread TESTING.md Outdated

coderabbitai Bot reviewed May 13, 2026

View reviewed changes

Comment thread TESTING.md Outdated

aallan added 2 commits May 13, 2026 14:42

aallan merged commit 8c79f22 into main May 13, 2026
24 checks passed

aallan deleted the fix-596-stress-tests branch May 13, 2026 13:55

coderabbitai Bot mentioned this pull request May 13, 2026

Fix #549: GC-aware tail-call optimization for allocating functions #671

Merged

10 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix #596: stress-test harness for scale-dependent regression coverage#669

Fix #596: stress-test harness for scale-dependent regression coverage#669
aallan merged 7 commits into
mainfrom
fix-596-stress-tests

aallan commented May 13, 2026 •

edited by coderabbitai Bot

Loading

Uh oh!

coderabbitai Bot commented May 13, 2026 •

edited

Loading

Reviews paused

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Possibly related PRs

Suggested labels

Uh oh!

codecov Bot commented May 13, 2026 •

edited

Loading

Uh oh!

coderabbitai Bot left a comment

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

aallan commented May 13, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Why this shape

Test plan

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented May 13, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Reviews paused

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Possibly related PRs

Suggested labels

Uh oh!

codecov Bot commented May 13, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

aallan commented May 13, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented May 13, 2026 •

edited

Loading

codecov Bot commented May 13, 2026 •

edited

Loading