Fix #596: stress-test harness for scale-dependent regression coverage#669
Conversation
Pre-#596 the project relied on user-reported real-world programs to surface scale-dependent codegen/runtime bugs — #570 (iterative- builder shadow-stack overflow at ~4000 elements), #515 (GC self- fault under sustained allocation), #593 (Conway's Life corruption at 12×30+), and others. The standard test suite has thousands of small focused tests but none that exercise the runtime at scales where these bugs first manifested. This PR closes that gap with a dedicated stress-test harness. What ships: 1. **`tests/test_stress.py`** with 8 scale-dependent regression tests, each hitting a specific scale axis: - 10K `array_map` over `Array<Int>` — shadow-stack overflow (#570) - 5K nested-array `array_map` over `Array<Array<Bool>>` — per-iteration root accumulation (#570/#515) - 1K-deep tail recursion with allocating String arg — TCO/GC interaction (#549) - 20×20 nested array-fold-of-array-fold — #593/#595 regression coverage (the structural shape that hit those bugs, not the bug-report program itself) - 100K `array_fold` over Int array — allocation pressure (#487/#348) - 10K String allocations through interpolation — wrap-table compaction (#573) - 1K `State<Int>` get/put cycles in single handler — handler/ resume plumbing under load - 10K `IO.print` calls with stdout-capture — host-import call rate, capture buffer growth Each test asserts on a SPECIFIC observable (closed-form sum, exact line count, etc.), not just "completed without crashing", so a future regression that silently short-circuits or skips iterations would still fail. 2. **`stress` pytest marker** registered in `pyproject.toml` with default `addopts = "-m 'not stress'"` — stress tests are skipped from the per-PR pytest run. Local invocation: `pytest -m stress` or `pytest tests/test_stress.py -m stress -v`. 3. **`.github/workflows/nightly-stress.yml`** with three triggers: - Nightly cron at 06:00 UTC (primary safety net) - Path-filtered PRs touching `vera/codegen/**` / `vera/wasm/**` / `tests/test_stress.py` / the workflow itself (fail-fast on PRs likely to break stress invariants) - Manual `workflow_dispatch` (Actions tab) 4. **"Stress tests" subsection in `TESTING.md`** documenting the 8 test programs with their scale axes, the default-skip behaviour, the three CI triggers, and the assertion-shape convention. 5. **`scripts/check_doc_counts.py` fix** — added `-o addopts=""` to the collection-only call so the script sees every test file including `test_stress.py` (without this override, the default `-m 'not stress'` filter would deselect stress tests and the per-file counter would report `test_stress.py` as a missing row). 6. **ROADMAP cleanup** — removed #596 from Order 1 in the Stabilisation tier (now closed by this PR). Tier renumbered 1-5 (was 1-6). Agent-integration tier renumbered 6-8 (was 7-9). Release prep: - Version bumped to v0.0.152 across `__init__.py`, `pyproject.toml`, `uv.lock`, `docs/index.html`, `README.md`. - CHANGELOG.md entry under `[0.0.152]` with Added + Documentation sections. - HISTORY.md row added (one-line template-compliant). - TESTING.md overview totals updated (3,863 → 3,871 tests, 31 → 32 files). - ROADMAP.md test count updated to 3,871. Validation: - `pytest tests/test_stress.py -m stress -v` → 8 passed in 0.25s - `pytest tests/ -q` → 3,849 passed, 14 skipped, 8 deselected (stress filtered out by default addopts) - `mypy vera/` → clean - `python scripts/check_walker_coverage.py` → 9 walkers, all 29 Expr subclasses covered - `python scripts/check_doc_counts.py` → consistent (3,871 tests, 32 files, 27 hooks, 7 CI jobs) - Doc sweep across SKILL / CLAUDE / AGENTS / README / vera/README / CONTRIBUTING / spec / docs / examples → zero stale "stress tests missing" claims Closes #596 Co-Authored-By: Claude <noreply@anthropic.invalid>
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughAdds a pytest ChangesStress-test harness (v0.0.152)
Sequence Diagram(s)sequenceDiagram
participant Test
participant _run
participant Parser
participant Compiler
participant Runtime
Test->>_run: provide Vera source
_run->>Parser: write & parse .vera file
Parser->>Compiler: transform AST
Compiler->>Compiler: assert no error diagnostics
Compiler->>Runtime: execute compiled program
Runtime->>_run: return exec_result.value
_run->>Test: return value
_run->>_run: cleanup temp file
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Possibly related PRs
Suggested labels
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #669 +/- ##
=======================================
Coverage 91.08% 91.08%
=======================================
Files 60 60
Lines 23283 23283
Branches 259 259
=======================================
Hits 21207 21207
Misses 2069 2069
Partials 7 7
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
User feedback on #669: the initial Stress Tests subsection had the summary table + behaviour blurb but not per-test detail. Expanded the section to cover, for each of the 8 tests: - The exact program shape (what it allocates, what loop it drives, what it folds/maps/iterates over) - The expected return value and the assertion's *closed-form* derivation (e.g. why `test_10k_string_allocations` should return 38,890 — broken down by digit-count contribution) - The specific bug class it guards against, with issue references - Why the chosen iteration count is meaningful relative to that bug class's historical failure threshold Also added an "Adding a stress test" subsection codifying the conventions: 1. Target a specific scale axis + name the bug class 2. Pick the smallest scale with ~2-3x safety margin (don't maximise) 3. Assert on a SPECIFIC observable with closed-form expected value (avoid "no exception raised") 4. Use the `_run` helper for tempfile lifecycle 5. `pytestmark = pytest.mark.stress` at module level for correct gating Plus reformatted the local-invocation block as a fenced bash example so the three useful invocations (full stress, file- scoped, single-test) are copy-pasteable. Validation - `python scripts/check_doc_counts.py` -> consistent (no count-affecting numbers changed) Refs #596 #669
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@tests/test_stress.py`:
- Around line 165-175: The test test_conways_life_20x20_100_generations
currently builds an all-empty grid and only calls count_alive once, so it does
not run 100 generations as claimed; update the test to either (A) implement the
full generation loop by repeatedly applying the function that advances the board
state (locate the code that constructs the 20x20 grid and the step/advance
function used elsewhere in tests) for 100 iterations and then assert the
expected final live-cell count using count_alive, or (B) if you intended a
single-step regression, rename the test to reflect the actual behavior and
adjust the docstring accordingly; ensure you use the fixed initial pattern (not
an all-empty grid) if asserting a known stabilized population.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: a935bc06-d420-4455-b203-a2f2067eb5bf
⛔ Files ignored due to path filters (5)
docs/index.htmlis excluded by!docs/**docs/index.mdis excluded by!docs/**docs/llms-full.txtis excluded by!docs/**docs/llms.txtis excluded by!docs/**uv.lockis excluded by!**/*.lock,!uv.lock
📒 Files selected for processing (10)
.github/workflows/nightly-stress.ymlCHANGELOG.mdHISTORY.mdREADME.mdROADMAP.mdTESTING.mdpyproject.tomlscripts/check_doc_counts.pytests/test_stress.pyvera/__init__.py
The #596 issue body explicitly called for a subset of stress tests to run under `VERA_EAGER_GC=1` to catch GC-rooting regressions on the very first iteration rather than only at scale. I missed that requirement in the initial PR and the user surfaced it. Implementation: - `EAGER_GC_PARAMS = pytest.mark.parametrize(...)` decorator with `eager_gc: [False, True]` and IDs `["default_gc", "eager_gc"]` exposed at module level for the subset of tests targeting GC-rooting bug classes. - `_run` helper extended with `eager_gc: bool` + `monkeypatch: pytest.MonkeyPatch | None` keyword args. When `eager_gc=True`, calls `monkeypatch.setenv("VERA_EAGER_GC", "1")` before the `compile()` call. `VERA_EAGER_GC` is read at compile time by `vera/codegen/assembly.py::_emit_alloc`, which then emits a `call $gc_collect` as the first instruction of the runtime's `$alloc` function — forcing a full GC pass on every allocation. - 6 of the 8 logical tests carry `@EAGER_GC_PARAMS` and the paired signature `(eager_gc: bool, monkeypatch: pytest .MonkeyPatch)`: - `test_array_map_over_10k_int_array` (#570 target) - `test_array_map_over_5k_nested_bool_array` (#570/#515 target) - `test_deep_tail_recursion_with_allocating_arg` (#549 TCO/GC target) - `test_conways_life_20x20_100_generations` (#593/#595 target) - `test_10k_string_allocations` (#573 wrap-table target) - `test_state_handler_1k_ops` (captured-frame target) - 2 tests stay un-parametrised — `test_array_fold_100k_ iterations` (allocation-pressure target, NOT GC-rooting; 100K forced-GC would inflate suite time) and `test_10k_io_print_calls` (host-import target; `host_print` bridge doesn't allocate Vera-heap data). - Total test count: 8 logical × eager lane parametrisation on 6 of them = 14 test instances. Wall-clock 0.66s in-process for the full suite. Documentation updates: - TESTING.md: new "Eager-GC lane" subsection explaining the mechanism, the subset rationale, and the parametrize-with- monkeypatch implementation. Per-test descriptions in the inventory marked with `**[eager-GC]**` for the parametrised 6 and explicit "Not in eager-GC lane" notes for the 2 excluded. "Adding a stress test" checklist gained a new step 6 codifying the opt-in convention. - CHANGELOG.md v0.0.152: extended with a new bullet describing the eager-GC lane mechanism and the subset rationale. - Test counts auto-updated: 3,871 -> 3,877 (+6 parametrised instances), test_stress.py 8 -> 14 tests, 375 -> 469 lines. Validation: - `pytest tests/test_stress.py -m stress -v` -> 14 passed in 0.64s (was 8 passed in 0.25s) - `python scripts/check_doc_counts.py` -> consistent - `python scripts/check_walker_coverage.py` -> clean - `python scripts/check_version_sync.py` -> consistent Refs #596 #669
There was a problem hiding this comment.
Actionable comments posted: 1
♻️ Duplicate comments (1)
tests/test_stress.py (1)
232-291:⚠️ Potential issue | 🟠 Major | 🏗️ Heavy liftThe Conway stress case still does not execute 100 generations.
Line 233 names a
20x20_100_generationsworkload, but Lines 276-285 only construct an all-false grid and run a singlecount_alivepass. This misses the intended regression surface (state evolution across generations) and weakens coverage for the#593/#595 class.Please either implement the 100-step evolution loop with a concrete final-state assertion, or rename/re-scope this test to match what it actually validates.
As per coding guidelines
tests/**/*.py: “These are pytest test files. Review for correctness and coverage gaps.”
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@TESTING.md`:
- Line 389: Update the measured runtime phrase in TESTING.md where it currently
reads "0.25s in-process on a developer laptop today" to reflect the new
parametrised baseline (≈0.64s) or replace it with a versioned/date-stamped
figure; ensure the sentence in the Budget paragraph (the text mentioning
iteration counts and the 5-minute target) now states the updated runtime and/or
includes a version/date tag so the measured value stays accurate going forward.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 8b53febf-685d-40df-9b93-73b812f769dc
📒 Files selected for processing (5)
CHANGELOG.mdREADME.mdROADMAP.mdTESTING.mdtests/test_stress.py
The original #596 spec covered CI integration but didn't cover reporting — a cron-failure would be visible only to whoever opened the Actions tab and noticed. This is fragile for an early-warning system. Adds an `Open or update tracking issue on cron failure` step to `.github/workflows/nightly-stress.yml` that runs only on `schedule`-triggered failures (skipped on `pull_request` and `workflow_dispatch` — those have their own reporting surfaces): 1. Ensures the `stress-regression` label exists (idempotent, creates on first failure with description and red colour). 2. Searches for an open issue carrying the label. 3. If one exists: appends a comment with the new commit SHA + run URL, so the original tracking issue accumulates the failure history. 4. If none: files a fresh issue titled "Nightly stress regression on main (tracking)" with the SHA, run URL, and pointer to `tests/test_stress.py` for the failing test's docstring (each test names the bug class it guards against — narrows the regression family for the bisector). Open tracking issues stay open across days until manually closed. Auto-closing on success would be wrong while a maintainer is mid-debug. Job-level permission `issues: write` added (scoped to the job, not workflow-wide, so the test-runner step keeps the tighter `contents: read` default). Implementation uses `actions/github-script@v7` which only sees GitHub-controlled values (`github.sha`, `github.server_url`, `github.repository`, `github.run_id`) — no user-supplied input in the script body. Documentation: - TESTING.md "Stress Tests" section's CI-integration bullets updated to describe the auto-filing behaviour + a new "Failure reporting (cron only)" subsection describing the dedup logic, the issue lifecycle, and the implementation details. - CHANGELOG.md v0.0.152: added a bullet describing the failure-reporting wire-up. Validation: - YAML parses (`python -c 'import yaml; yaml.safe_load(...)'`) - 5 steps in the job (was 4); failure-reporting step gated on `failure() && github.event_name == 'schedule'` - Stress suite still 14/14 pass Refs #596 #669
Three changes batched into one commit (all #669 review-driven): **1. Widen the workflow's path-filter to include `vera/checker/**`.** The user's earlier review-question on PR #669 noted a blind spot in the path-filter: PRs touching `vera/checker/**` could produce subtly different AST shapes that downstream codegen mishandles, but the filter only fired on `vera/codegen/**` and `vera/wasm/**`. Such a regression would surface only at the next nightly cron (up to ~24h delay). Added `vera/checker/**` to the `pull_request` paths and updated the workflow's leading comment to explain the rationale. **2. CR-A: rename `test_conways_life_20x20_100_generations`.** CodeRabbit flagged that the test name and docstring promised "100 generations" but the test only built an empty grid and called `count_alive` once. Verified — the program loads no Life-step function and does no iteration. Renamed to `test_conways_life_grid_alloc_and_count_alive_20x20` (option B from the CR suggestion: rename to reflect actual behaviour rather than implement a full 100-gen Life loop). Docstring rewritten to explicitly call out: - What the test does: build a 20×20 grid + single `count_alive` pass - What it does NOT do: run 100 generations (the original name implied that; this rename corrects it) - Why the structural-shape coverage is still valuable for #593 / #595 — those bugs hit the same nested-array-alloc + nested-fold + captured-binding-refs code paths TESTING.md per-test description updated to match the new name and to add a "**structural-shape** test, not a Life simulation" clarifier. Section header comment also updated. **3. CR-B: refresh the stale "0.25s in-process" figure in TESTING.md.** The 0.25s number was measured before the eager-GC lane parametrisation doubled the relevant subset. Current wall- clock is 0.66s (14 instances). Updated the Budget paragraph to: - Use the new measured value (0.66s) with a date stamp (2026-05-13) so the figure stays auditable - Add an explicit "if this drifts more than ~2x, treat as a signal" guard rail so future drift gets flagged Doc counts auto-updated: - TESTING.md `test_stress.py` lines: 469 → 479 (docstring expansion for the renamed test) Validation: - `pytest tests/test_stress.py -m stress -v` → 14 passed in 0.66s (renamed test still passes under both eager/default modes) - `python scripts/check_doc_counts.py` → consistent - YAML parses (`yaml.safe_load` on the workflow) Refs #596 #669
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In @.github/workflows/nightly-stress.yml:
- Around line 131-140: The current dedup logic uses
github.rest.issues.listForRepo and takes existing[0] which may be a PR; update
the selection after the listForRepo call to filter out any items with a non-null
pull_request field and to match the intended issue title (use LABEL and the
tracking issue title constant) before choosing issueNumber; i.e., call
github.rest.issues.listForRepo as before, then set a filtered array =
existing.filter(i => !i.pull_request && i.title === EXPECTED_TITLE) and use
filtered[0].number (or create the issue if none) so the workflow comments only
on the real tracking issue, not a PR.
In `@TESTING.md`:
- Around line 354-355: Summary: The stress-test entry
test_conways_life_20x20_100_generations incorrectly references issue `#595` for
the malloc-during-wasmtime-cleanup bug; update the issue ID/description to match
the actual bug referenced elsewhere. Fix: open TESTING.md, locate the
"test_conways_life_20x20_100_generations" entry and replace the incorrect "#595"
(and/or its description) with the correct issue identifier and short description
that corresponds to the malloc/wasmtime cleanup bug (or vice versa if `#595`
actually refers to IO.sleep/KeyboardInterrupt per line 88); ensure the updated
reference matches the mapping used elsewhere in the document so traceability is
consistent. Verify the change by searching for both "#595" and the
malloc/wasmtime cleanup text to confirm one-to-one mapping.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 09ef58ed-a3f7-4e1c-a670-21180bb2f813
📒 Files selected for processing (3)
.github/workflows/nightly-stress.ymlCHANGELOG.mdTESTING.md
There was a problem hiding this comment.
Actionable comments posted: 1
♻️ Duplicate comments (2)
.github/workflows/nightly-stress.yml (1)
136-146:⚠️ Potential issue | 🟠 Major | ⚡ Quick winDedup logic can target the wrong thread.
listForRepomay return pull requests as well as issues, soexisting[0]is not guaranteed to be the tracking issue. Filter for non-PR items and match the expected title before choosingissue_number.Suggested patch
const { data: existing } = await github.rest.issues.listForRepo({ owner: context.repo.owner, repo: context.repo.repo, labels: LABEL, state: "open", - per_page: 5, + per_page: 20, }); + const trackingIssues = existing.filter( + (item) => !item.pull_request && item.title === TITLE + ); - if (existing.length > 0) { - const issueNumber = existing[0].number; + if (trackingIssues.length > 0) { + const issueNumber = trackingIssues[0].number; await github.rest.issues.createComment({ owner: context.repo.owner, repo: context.repo.repo, issue_number: issueNumber, body: commentBody, });🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In @.github/workflows/nightly-stress.yml around lines 136 - 146, The current dedup logic assumes existing[0] is the tracking issue but github.rest.issues.listForRepo can return PRs; update the selection after calling github.rest.issues.listForRepo to filter out pull requests (items without a pull_request property) and to match the tracking issue title (e.g., compare item.title to the expected ISSUE_TITLE or other known heading) before picking the issue number; then use that matched item's .number when calling github.rest.issues.createComment so you always comment on the correct tracking issue (keep LABEL and the listForRepo call but replace existing[0] with the first filtered+matched item).tests/test_stress.py (1)
236-239:⚠️ Potential issue | 🟡 Minor | ⚡ Quick winIssue reference for
#595looks inconsistent.This block links
#595to the wasmtime-cleanup/malloc bug class, butTESTING.mdmaps#595to theIO.sleep/KeyboardInterruptpath. Please align the issue ID/description pair so traceability is unambiguous across docs and tests.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/test_stress.py` around lines 236 - 239, The test docstring in tests/test_stress.py references issue `#595` as the wasmtime-cleanup/malloc abort bug but TESTING.md maps `#595` to IO.sleep/KeyboardInterrupt; make these consistent by either (A) updating the docstring in tests/test_stress.py to reference the correct issue number that tracks the wasmtime malloc/cleanup bug (keep `#593` as-is), or (B) if the test actually covers the IO.sleep/KeyboardInterrupt path, change the docstring description to match TESTING.md’s mapping; ensure you update the issue ID/description pair in the triple-quoted header so TESTING.md and tests/test_stress.py reference the same issue for `#595`.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@TESTING.md`:
- Around line 386-387: Update the "Path-filtered PRs" list in TESTING.md to
include the new path pattern for the checker: add `vera/checker/**` alongside
the existing `vera/codegen/**`, `vera/wasm/**`, `tests/test_stress.py` and the
workflow file so the documentation matches the workflow trigger; edit the
numbered item describing "Path-filtered PRs touching `vera/codegen/**`,
`vera/wasm/**`, `tests/test_stress.py`, or the workflow file" to also mention
`vera/checker/**` (and keep wording consistent with the existing list).
---
Duplicate comments:
In @.github/workflows/nightly-stress.yml:
- Around line 136-146: The current dedup logic assumes existing[0] is the
tracking issue but github.rest.issues.listForRepo can return PRs; update the
selection after calling github.rest.issues.listForRepo to filter out pull
requests (items without a pull_request property) and to match the tracking issue
title (e.g., compare item.title to the expected ISSUE_TITLE or other known
heading) before picking the issue number; then use that matched item's .number
when calling github.rest.issues.createComment so you always comment on the
correct tracking issue (keep LABEL and the listForRepo call but replace
existing[0] with the first filtered+matched item).
In `@tests/test_stress.py`:
- Around line 236-239: The test docstring in tests/test_stress.py references
issue `#595` as the wasmtime-cleanup/malloc abort bug but TESTING.md maps `#595` to
IO.sleep/KeyboardInterrupt; make these consistent by either (A) updating the
docstring in tests/test_stress.py to reference the correct issue number that
tracks the wasmtime malloc/cleanup bug (keep `#593` as-is), or (B) if the test
actually covers the IO.sleep/KeyboardInterrupt path, change the docstring
description to match TESTING.md’s mapping; ensure you update the issue
ID/description pair in the triple-quoted header so TESTING.md and
tests/test_stress.py reference the same issue for `#595`.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 26695255-c0da-4012-88cc-09a5e48d50b7
📒 Files selected for processing (3)
.github/workflows/nightly-stress.ymlTESTING.mdtests/test_stress.py
**CR-C** (`.github/workflows/nightly-stress.yml:131-140`) — `listForRepo` dedup query returned both issues and PRs. `github.rest.issues.listForRepo` with a label filter returns any item carrying the label, including PRs (the GitHub API models PRs as issues with a `pull_request` field). If a PR were ever labelled `stress-regression`, the workflow's dedup branch would post the failure comment on the PR instead of filing a fresh tracking issue. Real risk is low — the `stress-regression` label is only applied by this workflow today — but the fix is cheap and defensive: filter the result list to exclude items with a `pull_request` field set. Renamed the `existing` const to `candidates` (the raw API result) and built `existing` as the filtered subset, then the rest of the dedup logic continues unchanged. Skipped CR's additional suggestion of filtering by exact title match (`i.title === EXPECTED_TITLE`). Rationale: the label is already the dedup primary key, and a strict title filter would break dedup if a maintainer renames the tracking issue (e.g. to "Nightly stress regression — fixed by PR #N"). Defaulting to "label is enough" is more forgiving for the expected human workflow. **CR-D** (`TESTING.md:354-355` and parallel sites) — misattributed #595 citation on the Conway's Life stress test. CR's investigation surfaced an inconsistency: line 88 maps #595 to IO.sleep/KeyboardInterrupt handling, but my line 354 described the Conway's Life stress test as covering "#593 and #595 (malloc abort during wasmtime cleanup)". Verified against the CHANGELOG record (entries at lines 248 and 283 trace #595 to a cleanup-path bug fixed by an upstream wasmtime- py change, not a Conway's Life data-integrity bug): #595's regression coverage lives in `TestHostSleepKeyboardInterrupt` in `tests/test_runtime_traps.py`, NOT in this stress test. The Conway's Life stress test: - Uses `effects(pure)` (no Ctrl-C, no IO.sleep) - Completes normally (no wasmtime-cleanup signal escape) - Therefore exercises NONE of #595's code paths The misattribution came from the original #596 issue body which paired the two issues as "scale-dependent runtime bugs"; when I lifted the pairing into the test description I didn't re-verify whether the structural shapes actually matched. Fix: drop the #595 reference from this test's description in 4 places — `tests/test_stress.py` docstring (2 sites: header paragraph + body comment), `TESTING.md:354` (per-test description), `TESTING.md:69` (test-file-row's `Pins ...` list), `CHANGELOG.md` (the v0.0.152 Added paragraph). Added an explanatory "Note on #595" block in the test docstring + parallel parenthetical in TESTING.md so a future reader understands why the citation was dropped (and doesn't re-add it). CHANGELOG historical entries (lines 248, 283) that reference #595 in its OWN bug-history context are unchanged — those are accurate descriptions of #595 in its proper domain. Doc counts: test_stress.py 479 → 487 lines (docstring expansion for the #595-misattribution note). Validation: - YAML parses (`yaml.safe_load` on the workflow) - `pytest tests/test_stress.py -m stress -q` → 14 passed in 0.65s - `python scripts/check_doc_counts.py` → consistent - No residual #595 references in PR-touched files outside the intentional historical "Note on #595" explanatory blocks Refs #596 #669
CR-E (`TESTING.md:386`) — the "Path-filtered PRs" list under
**CI integration** lagged the workflow's actual `paths:`
block. The workflow includes `vera/checker/**` (added in
`072fde9`) but the docs still listed only `vera/codegen/**` /
`vera/wasm/**` / `tests/test_stress.py` / "the workflow file
itself". Updated the docs to match the workflow + carried
over the same rationale comment ("included because the AST
shape it produces flows into codegen").
Skipping two duplicate CR comments that re-flagged
already-fixed issues:
- `tests/test_stress.py:236-239` (#595 misattribution) —
already fixed in `8a2e237`; the docstring now contains an
explicit "Note on #595" block explaining the misattribution
history.
- `.github/workflows/nightly-stress.yml:136-146` (`listForRepo`
PR-filter) — already fixed in `8a2e237`; the script filters
`candidates` to `existing = candidates.filter(item =>
!item.pull_request)` before the dedup branch.
Validation
- `python scripts/check_doc_counts.py` -> consistent
Refs #596 #669
Summary
Pre-#596 the project relied on user-reported real-world programs to surface scale-dependent codegen/runtime bugs (#570 / #515 / #593 / #595 / #549 / #487 / #348 / #573 all required a real program to find). This PR adds a dedicated stress-test harness so the next bug of this class is caught before users see it.
tests/test_stress.pywith 8 scale-dependent regression tests covering: 10Karray_map, 5K nested-arrayarray_map, 1K-deep tail recursion with allocating arg, 20×20 nested array-fold-of-array-fold (Conway's Life at 12x30 still corrupts strings from gen 1 onwards (additional trigger beyond #588) #593/macOS malloc abort during wasmtime cleanup after Ctrl-C-in-host-import #595 territory), 100Karray_fold, 10K String allocations through interpolation, 1KState<Int>get/put cycles in single handler, 10KIO.printwith stdout-capture buffer.stresspytest marker inpyproject.tomlwith defaultaddopts = "-m 'not stress'"so stress tests skip from the per-PR pytest run..github/workflows/nightly-stress.ymlwith three triggers: nightly cron (primary safety net), path-filtered PRs (fail-fast on PRs touchingvera/codegen/**/vera/wasm/**/test file),workflow_dispatch(manual).scripts/check_doc_counts.pyfix — added-o addopts=""to the pytest collection call so the doc-counts hook still sees stress tests in its per-file inventory (without this, the default-m 'not stress'filter deselects them and the script reports a missing row).Why this shape
Each test asserts on a SPECIFIC observable (closed-form sum, exact line count, etc.) rather than just "completed without crashing", so a future regression that silently short-circuits or skips iterations would still fail. Iteration counts are tuned to the smallest scale where each bug class historically manifested with ~2-3x safety margin — the goal is reliable detection of the bug class, not benchmarking. Full suite completes in 0.25s in-process today; the 5-minute CI budget leaves plenty of headroom for the suite to grow.
Test plan
pytest tests/test_stress.py -m stress -v→ 8 passed in 0.25spytest tests/ -q→ 3,849 passed, 14 skipped, 8 deselected (default addopts correctly filter stress)mypy vera/→ cleanpython scripts/check_walker_coverage.py→ 9 walkers, all 29 Expr subclasses coveredpython scripts/check_doc_counts.py→ consistent (3,871 tests, 32 files, 27 hooks)Closes #596
Summary by CodeRabbit
New Features
CI
Documentation
Chores