Skip to content

test(ci): harden two flaky tests (cron yaml-warning, browser supervisor teardown)#33675

Merged
teknium1 merged 1 commit into
mainfrom
fix/ci-flake-hardening
May 28, 2026
Merged

test(ci): harden two flaky tests (cron yaml-warning, browser supervisor teardown)#33675
teknium1 merged 1 commit into
mainfrom
fix/ci-flake-hardening

Conversation

@teknium1

Copy link
Copy Markdown
Contributor

Summary

Hardens two flaky tests that caused transient CI failures on PR #33661's initial run (both passed on rerun, but the underlying flakiness exists on main).

Changes

tests/cron/test_scheduler.py::TestRunJobConfigLogging — add mocks for resolve_runtime_provider() and discover_mcp_tools().

The two yaml-warning tests intend to verify only the warning log line, but _run_job_impl continues into provider resolution and MCP discovery AFTER the warning fires. Both code paths can spawn subprocesses (MCP stdio servers) or hit the network (provider auth), and pushed the test over the 30s pytest-timeout budget on GHA under load. Local: 0.57s. CI failure: >30s. Now mocked at the right boundary so the test does only what it claims to test.

tests/tools/test_browser_supervisor.py — harden Chrome teardown against the stdlib subprocess._wait() race (bpo-38630).

The POSIX _wait() impl has a longstanding race: when SIGCHLD arrives during proc.wait(), _try_wait(WNOHANG) can return a foreign pid and the assert pid == self.pid or pid == 0 fires. CI hit this on slice 1 during fixture teardown, killing the whole job with 0 actual test failures. Fixture now:

  • catches AssertionError / TimeoutExpired on proc.wait()
  • force-kills on failure
  • always reaps with a second wait() so no zombie escapes

Same hardening applied to the early-skip branch (lines 91-94) which had the same pattern.

Validation

  • py_compile clean
  • pytest tests/cron/test_scheduler.py::TestRunJobConfigLogging --durations=0 → 0.67s, 2 passed
  • pytest tests/tools/test_browser_supervisor.py --collect-only → 22 tests collected, no import errors (real Chrome teardown runs only in CI / on machines with google-chrome installed)

Context

Follow-up to PR #33661. Both flakes pre-existed; they surfaced together on that run's CI. Filed as a separate PR per workflow ("one change per branch push").

Two unrelated transient failures on PR #33661's initial CI run, both
pre-existing on main and recovered on rerun. Hardening:

1. tests/cron/test_scheduler.py::TestRunJobConfigLogging — added mocks for
   resolve_runtime_provider() and discover_mcp_tools(). The yaml-warning
   tests intend to exercise only the warning-log path, but
   _run_job_impl continues into provider resolution and MCP discovery
   after the warning. Both can spawn subprocesses / hit the network and
   pushed the test over its 30s budget under GHA load.

2. tests/tools/test_browser_supervisor.py — wrapped Chrome teardown
   against the stdlib subprocess._wait() race (bpo-38630). When SIGCHLD
   arrives during proc.wait(), _try_wait(WNOHANG) can return a foreign
   pid and the 'assert pid == self.pid or pid == 0' fires. Fixture now
   catches AssertionError/TimeoutExpired, force-kills, and always reaps
   so no zombie escapes. Same hardening applied to the early-skip branch.
@github-actions

Copy link
Copy Markdown
Contributor

🔎 Lint report: fix/ci-flake-hardening vs origin/main

ruff

Total: 0 on HEAD, 0 on base (➖ 0)

🆕 New issues: none

✅ Fixed issues: none

Unchanged: 0 pre-existing issues carried over.

ty (type checker)

Total: 9520 on HEAD, 9520 on base (➖ 0)

🆕 New issues: none

✅ Fixed issues: none

Unchanged: 5016 pre-existing issues carried over.

Diagnostics are surfaced as warnings — this check never fails the build.

@alt-glitch alt-glitch added type/test Test coverage or test infrastructure comp/cron Cron scheduler and job management tool/browser Browser automation (CDP, Playwright) P3 Low — cosmetic, nice to have labels May 28, 2026
@teknium1 teknium1 merged commit 4e702fe into main May 28, 2026
26 of 28 checks passed
@teknium1 teknium1 deleted the fix/ci-flake-hardening branch May 28, 2026 06:15
mathias3 pushed a commit to mathias3/hermes-agent that referenced this pull request May 28, 2026
Two unrelated transient failures on PR NousResearch#33661's initial CI run, both
pre-existing on main and recovered on rerun. Hardening:

1. tests/cron/test_scheduler.py::TestRunJobConfigLogging — added mocks for
   resolve_runtime_provider() and discover_mcp_tools(). The yaml-warning
   tests intend to exercise only the warning-log path, but
   _run_job_impl continues into provider resolution and MCP discovery
   after the warning. Both can spawn subprocesses / hit the network and
   pushed the test over its 30s budget under GHA load.

2. tests/tools/test_browser_supervisor.py — wrapped Chrome teardown
   against the stdlib subprocess._wait() race (bpo-38630). When SIGCHLD
   arrives during proc.wait(), _try_wait(WNOHANG) can return a foreign
   pid and the 'assert pid == self.pid or pid == 0' fires. Fixture now
   catches AssertionError/TimeoutExpired, force-kills, and always reaps
   so no zombie escapes. Same hardening applied to the early-skip branch.
Bryce-huang pushed a commit to wbkunlun/hermes-agent that referenced this pull request May 29, 2026
Two unrelated transient failures on PR NousResearch#33661's initial CI run, both
pre-existing on main and recovered on rerun. Hardening:

1. tests/cron/test_scheduler.py::TestRunJobConfigLogging — added mocks for
   resolve_runtime_provider() and discover_mcp_tools(). The yaml-warning
   tests intend to exercise only the warning-log path, but
   _run_job_impl continues into provider resolution and MCP discovery
   after the warning. Both can spawn subprocesses / hit the network and
   pushed the test over its 30s budget under GHA load.

2. tests/tools/test_browser_supervisor.py — wrapped Chrome teardown
   against the stdlib subprocess._wait() race (bpo-38630). When SIGCHLD
   arrives during proc.wait(), _try_wait(WNOHANG) can return a foreign
   pid and the 'assert pid == self.pid or pid == 0' fires. Fixture now
   catches AssertionError/TimeoutExpired, force-kills, and always reaps
   so no zombie escapes. Same hardening applied to the early-skip branch.
#AI commit#
zwolniony pushed a commit to zwolniony/hermes-agent that referenced this pull request May 29, 2026
Two unrelated transient failures on PR NousResearch#33661's initial CI run, both
pre-existing on main and recovered on rerun. Hardening:

1. tests/cron/test_scheduler.py::TestRunJobConfigLogging — added mocks for
   resolve_runtime_provider() and discover_mcp_tools(). The yaml-warning
   tests intend to exercise only the warning-log path, but
   _run_job_impl continues into provider resolution and MCP discovery
   after the warning. Both can spawn subprocesses / hit the network and
   pushed the test over its 30s budget under GHA load.

2. tests/tools/test_browser_supervisor.py — wrapped Chrome teardown
   against the stdlib subprocess._wait() race (bpo-38630). When SIGCHLD
   arrives during proc.wait(), _try_wait(WNOHANG) can return a foreign
   pid and the 'assert pid == self.pid or pid == 0' fires. Fixture now
   catches AssertionError/TimeoutExpired, force-kills, and always reaps
   so no zombie escapes. Same hardening applied to the early-skip branch.
mosaiq-systems pushed a commit to mosaiq-systems/hermes-agent that referenced this pull request May 29, 2026
Two unrelated transient failures on PR NousResearch#33661's initial CI run, both
pre-existing on main and recovered on rerun. Hardening:

1. tests/cron/test_scheduler.py::TestRunJobConfigLogging — added mocks for
   resolve_runtime_provider() and discover_mcp_tools(). The yaml-warning
   tests intend to exercise only the warning-log path, but
   _run_job_impl continues into provider resolution and MCP discovery
   after the warning. Both can spawn subprocesses / hit the network and
   pushed the test over its 30s budget under GHA load.

2. tests/tools/test_browser_supervisor.py — wrapped Chrome teardown
   against the stdlib subprocess._wait() race (bpo-38630). When SIGCHLD
   arrives during proc.wait(), _try_wait(WNOHANG) can return a foreign
   pid and the 'assert pid == self.pid or pid == 0' fires. Fixture now
   catches AssertionError/TimeoutExpired, force-kills, and always reaps
   so no zombie escapes. Same hardening applied to the early-skip branch.
KKT-OPT pushed a commit to KKT-OPT/hermes-agent that referenced this pull request May 31, 2026
Two unrelated transient failures on PR NousResearch#33661's initial CI run, both
pre-existing on main and recovered on rerun. Hardening:

1. tests/cron/test_scheduler.py::TestRunJobConfigLogging — added mocks for
   resolve_runtime_provider() and discover_mcp_tools(). The yaml-warning
   tests intend to exercise only the warning-log path, but
   _run_job_impl continues into provider resolution and MCP discovery
   after the warning. Both can spawn subprocesses / hit the network and
   pushed the test over its 30s budget under GHA load.

2. tests/tools/test_browser_supervisor.py — wrapped Chrome teardown
   against the stdlib subprocess._wait() race (bpo-38630). When SIGCHLD
   arrives during proc.wait(), _try_wait(WNOHANG) can return a foreign
   pid and the 'assert pid == self.pid or pid == 0' fires. Fixture now
   catches AssertionError/TimeoutExpired, force-kills, and always reaps
   so no zombie escapes. Same hardening applied to the early-skip branch.
gweeteve pushed a commit to gweeteve/hermes-agent that referenced this pull request Jun 2, 2026
Two unrelated transient failures on PR NousResearch#33661's initial CI run, both
pre-existing on main and recovered on rerun. Hardening:

1. tests/cron/test_scheduler.py::TestRunJobConfigLogging — added mocks for
   resolve_runtime_provider() and discover_mcp_tools(). The yaml-warning
   tests intend to exercise only the warning-log path, but
   _run_job_impl continues into provider resolution and MCP discovery
   after the warning. Both can spawn subprocesses / hit the network and
   pushed the test over its 30s budget under GHA load.

2. tests/tools/test_browser_supervisor.py — wrapped Chrome teardown
   against the stdlib subprocess._wait() race (bpo-38630). When SIGCHLD
   arrives during proc.wait(), _try_wait(WNOHANG) can return a foreign
   pid and the 'assert pid == self.pid or pid == 0' fires. Fixture now
   catches AssertionError/TimeoutExpired, force-kills, and always reaps
   so no zombie escapes. Same hardening applied to the early-skip branch.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

comp/cron Cron scheduler and job management P3 Low — cosmetic, nice to have tool/browser Browser automation (CDP, Playwright) type/test Test coverage or test infrastructure

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants