ci: split Tests workflow into 4 parallel shards (target <2min)#11619
ci: split Tests workflow into 4 parallel shards (target <2min)#11619teknium1 wants to merge 4 commits into
Conversation
|
Target: <2min CI test wall time. Runs the Tests workflow as a 4-way matrix instead of one job. Each shard runs ~3,000 tests on its own ubuntu-latest runner (4 cores) with -n auto xdist inside. Total effective parallelism: 16 workers across 4 machines (vs 4 workers on 1 machine today). Was previously tried in #11566 and closed — shard 3 hung at 97% complete for 100+ seconds with dozens of E/F markers. Root cause was cross-test pollution exposed by splitting test files across shards (e.g. the three test files that mutated sys.modules['dotenv'] at import time poisoned whichever shard they landed in). That's now fixed by #11453 and #11577: conftest is hermetic, the dotenv stub bombs are removed, and tests no longer depend on each other's env-var side effects. Changes: - pyproject.toml: add pytest-split>=0.9,<1 to dev extras - .github/workflows/tests.yml: 'test' job becomes matrix-split into 4 groups with fail-fast: false. Runs 'pytest --splits 4 --group N'. pytest-split composes with -n auto from pyproject addopts. e2e job is unchanged (already small, 20s). Expected timing: Before: ~4m total (243s test step + ~25s setup) After: ~90-115s total (shard wall time ~60-90s + ~25s setup) Hash-based split is deterministic; no .test_durations file needed yet. Can add one later via --store-durations for better shard balance.
Tests that construct AIAgent(api_key=..., ...) without base_url were relying on provider-resolver fallback state from other tests in the same xdist worker. When matrix-split distributed them to different shards, the resolver found no env vars and no config and raised 'No LLM provider configured'. Fix: add base_url='https://openrouter.ai/api/v1' to every AIAgent construction that passes api_key. AIAgent.__init__ with both args set takes the direct-construction path (line 960 in run_agent.py) and skips resolver fallback entirely, making these tests self-contained. 7 files, 16 call sites updated via AST-based fixup. One call site (test_none_base_url_passed_as_none) left alone — that test's intent is to verify base_url=None behavior, so adding base_url defeats the test. Validation: - tests/run_agent/ full run: 760 passed, 0 failed (was 1 failure under the AST script's over-application, now clean) - Matrix shard 3 local run: 3083 passed, 0 failed, 1m44s
541a5db to
c2559b8
Compare
|
… tests Same root cause as previous commit — tests that construct AIAgent() without base_url rely on provider-resolver fallback state that doesn't exist in hermetic CI / shard-split runs. Previously hidden because other tests in the same xdist worker happened to prime module state. Covered by the previous fix: calls that passed api_key but not base_url. This covers calls that pass NEITHER (model=... only) — test_streaming.py especially (24 call sites). Plus test_860_dedup, test_compression_ persistence, test_create_openai_client_*, test_provider_parity. One call site (test_none_base_url_passed_as_none) remains explicitly unmodified — it asserts None/empty base_url behavior, so adding base_url would defeat the test's intent. Validation: - tests/run_agent/: 760 passed, 0 failed (local) - Matrix shard 3 subset: 3098 passed, 0 failed, 1m49s (local)
|
…rect path) My last fix added base_url but not api_key. AIAgent.__init__ takes the direct-construction path only when BOTH are set — with only base_url it still calls resolve_provider_client and fails in hermetic CI. Same 31 call sites, now with both kwargs.
|
|
Closing — matrix split gets 3 of 4 shards fast (<2m) but shard 3 is 5x slower on GHA than locally (9m41s vs 1m48s), making total CI wall time worse than the 4m baseline. Landing just the AIAgent test-self-contained-ness fixes in a separate PR; will investigate the CI-specific slowdown in shard 3 before retrying matrix split. |
…AIAgent
CI on main had 7 failing tests. Five were stale test fixtures; one (agent
cache spillover timeout) was covering up a real perf regression in
AIAgent construction.
The perf bug: every AIAgent.__init__ calls _check_compression_model_feasibility
→ resolve_provider_client('auto') → _resolve_api_key_provider which
iterates PROVIDER_REGISTRY. When it hits 'zai', it unconditionally calls
resolve_api_key_provider_credentials → _resolve_zai_base_url → probes 8
Z.AI endpoints with an empty Bearer token (all 401s), ~2s of pure latency
per agent, even when the user has never touched Z.AI. Landed in
9e84416 (PR for credential-pool Z.AI auto-detect) — the short-circuit
when api_key is empty was missing. _resolve_kimi_base_url had the same
shape; fixed too.
Test fixes:
- tests/gateway/test_voice_command.py: _make_adapter helpers were missing
self._voice_locks (added in PR #12644, 7 call sites — all updated).
- tests/test_toolsets.py: test_hermes_platforms_share_core_tools asserted
equality, but hermes-discord has discord_server (DISCORD_BOT_TOKEN-gated,
discord-only by design). Switched to subset check.
- tests/run_agent/test_streaming.py: test_tool_name_not_duplicated_when_resent_per_chunk
missing api_key/base_url — classic pitfall (PR #11619 fixed 16 of
these; this one slipped through on a later commit).
- tests/tools/test_discord_tool.py: TestConfigAllowlist caplog assertions
fail in parallel runs because AIAgent(quiet_mode=True) globally sets
logging.getLogger('tools').setLevel(ERROR) and xdist workers are
persistent. Autouse fixture resets the 'tools' and
'tools.discord_tool' levels per test.
Validation:
tests/cron + voice + agent_cache + streaming + toolsets + command_guards
+ discord_tool: 550/550 pass
tests/hermes_cli + tests/gateway: 5713/5713 pass
AIAgent construction without Z.AI creds: 2.2s → 0.24s (9x)
…AIAgent
CI on main had 7 failing tests. Five were stale test fixtures; one (agent
cache spillover timeout) was covering up a real perf regression in
AIAgent construction.
The perf bug: every AIAgent.__init__ calls _check_compression_model_feasibility
→ resolve_provider_client('auto') → _resolve_api_key_provider which
iterates PROVIDER_REGISTRY. When it hits 'zai', it unconditionally calls
resolve_api_key_provider_credentials → _resolve_zai_base_url → probes 8
Z.AI endpoints with an empty Bearer token (all 401s), ~2s of pure latency
per agent, even when the user has never touched Z.AI. Landed in
9e84416 (PR for credential-pool Z.AI auto-detect) — the short-circuit
when api_key is empty was missing. _resolve_kimi_base_url had the same
shape; fixed too.
Test fixes:
- tests/gateway/test_voice_command.py: _make_adapter helpers were missing
self._voice_locks (added in PR #12644, 7 call sites — all updated).
- tests/test_toolsets.py: test_hermes_platforms_share_core_tools asserted
equality, but hermes-discord has discord_server (DISCORD_BOT_TOKEN-gated,
discord-only by design). Switched to subset check.
- tests/run_agent/test_streaming.py: test_tool_name_not_duplicated_when_resent_per_chunk
missing api_key/base_url — classic pitfall (PR #11619 fixed 16 of
these; this one slipped through on a later commit).
- tests/tools/test_discord_tool.py: TestConfigAllowlist caplog assertions
fail in parallel runs because AIAgent(quiet_mode=True) globally sets
logging.getLogger('tools').setLevel(ERROR) and xdist workers are
persistent. Autouse fixture resets the 'tools' and
'tools.discord_tool' levels per test.
Validation:
tests/cron + voice + agent_cache + streaming + toolsets + command_guards
+ discord_tool: 550/550 pass
tests/hermes_cli + tests/gateway: 5713/5713 pass
AIAgent construction without Z.AI creds: 2.2s → 0.24s (9x)
…AIAgent
CI on main had 7 failing tests. Five were stale test fixtures; one (agent
cache spillover timeout) was covering up a real perf regression in
AIAgent construction.
The perf bug: every AIAgent.__init__ calls _check_compression_model_feasibility
→ resolve_provider_client('auto') → _resolve_api_key_provider which
iterates PROVIDER_REGISTRY. When it hits 'zai', it unconditionally calls
resolve_api_key_provider_credentials → _resolve_zai_base_url → probes 8
Z.AI endpoints with an empty Bearer token (all 401s), ~2s of pure latency
per agent, even when the user has never touched Z.AI. Landed in
0154413 (PR for credential-pool Z.AI auto-detect) — the short-circuit
when api_key is empty was missing. _resolve_kimi_base_url had the same
shape; fixed too.
Test fixes:
- tests/gateway/test_voice_command.py: _make_adapter helpers were missing
self._voice_locks (added in PR NousResearch#12644, 7 call sites — all updated).
- tests/test_toolsets.py: test_hermes_platforms_share_core_tools asserted
equality, but hermes-discord has discord_server (DISCORD_BOT_TOKEN-gated,
discord-only by design). Switched to subset check.
- tests/run_agent/test_streaming.py: test_tool_name_not_duplicated_when_resent_per_chunk
missing api_key/base_url — classic pitfall (PR NousResearch#11619 fixed 16 of
these; this one slipped through on a later commit).
- tests/tools/test_discord_tool.py: TestConfigAllowlist caplog assertions
fail in parallel runs because AIAgent(quiet_mode=True) globally sets
logging.getLogger('tools').setLevel(ERROR) and xdist workers are
persistent. Autouse fixture resets the 'tools' and
'tools.discord_tool' levels per test.
Validation:
tests/cron + voice + agent_cache + streaming + toolsets + command_guards
+ discord_tool: 550/550 pass
tests/hermes_cli + tests/gateway: 5713/5713 pass
AIAgent construction without Z.AI creds: 2.2s → 0.24s (9x)
…AIAgent
CI on main had 7 failing tests. Five were stale test fixtures; one (agent
cache spillover timeout) was covering up a real perf regression in
AIAgent construction.
The perf bug: every AIAgent.__init__ calls _check_compression_model_feasibility
→ resolve_provider_client('auto') → _resolve_api_key_provider which
iterates PROVIDER_REGISTRY. When it hits 'zai', it unconditionally calls
resolve_api_key_provider_credentials → _resolve_zai_base_url → probes 8
Z.AI endpoints with an empty Bearer token (all 401s), ~2s of pure latency
per agent, even when the user has never touched Z.AI. Landed in
0154413 (PR for credential-pool Z.AI auto-detect) — the short-circuit
when api_key is empty was missing. _resolve_kimi_base_url had the same
shape; fixed too.
Test fixes:
- tests/gateway/test_voice_command.py: _make_adapter helpers were missing
self._voice_locks (added in PR NousResearch#12644, 7 call sites — all updated).
- tests/test_toolsets.py: test_hermes_platforms_share_core_tools asserted
equality, but hermes-discord has discord_server (DISCORD_BOT_TOKEN-gated,
discord-only by design). Switched to subset check.
- tests/run_agent/test_streaming.py: test_tool_name_not_duplicated_when_resent_per_chunk
missing api_key/base_url — classic pitfall (PR NousResearch#11619 fixed 16 of
these; this one slipped through on a later commit).
- tests/tools/test_discord_tool.py: TestConfigAllowlist caplog assertions
fail in parallel runs because AIAgent(quiet_mode=True) globally sets
logging.getLogger('tools').setLevel(ERROR) and xdist workers are
persistent. Autouse fixture resets the 'tools' and
'tools.discord_tool' levels per test.
Validation:
tests/cron + voice + agent_cache + streaming + toolsets + command_guards
+ discord_tool: 550/550 pass
tests/hermes_cli + tests/gateway: 5713/5713 pass
AIAgent construction without Z.AI creds: 2.2s → 0.24s (9x)
…AIAgent
CI on main had 7 failing tests. Five were stale test fixtures; one (agent
cache spillover timeout) was covering up a real perf regression in
AIAgent construction.
The perf bug: every AIAgent.__init__ calls _check_compression_model_feasibility
→ resolve_provider_client('auto') → _resolve_api_key_provider which
iterates PROVIDER_REGISTRY. When it hits 'zai', it unconditionally calls
resolve_api_key_provider_credentials → _resolve_zai_base_url → probes 8
Z.AI endpoints with an empty Bearer token (all 401s), ~2s of pure latency
per agent, even when the user has never touched Z.AI. Landed in
9e84416 (PR for credential-pool Z.AI auto-detect) — the short-circuit
when api_key is empty was missing. _resolve_kimi_base_url had the same
shape; fixed too.
Test fixes:
- tests/gateway/test_voice_command.py: _make_adapter helpers were missing
self._voice_locks (added in PR NousResearch#12644, 7 call sites — all updated).
- tests/test_toolsets.py: test_hermes_platforms_share_core_tools asserted
equality, but hermes-discord has discord_server (DISCORD_BOT_TOKEN-gated,
discord-only by design). Switched to subset check.
- tests/run_agent/test_streaming.py: test_tool_name_not_duplicated_when_resent_per_chunk
missing api_key/base_url — classic pitfall (PR NousResearch#11619 fixed 16 of
these; this one slipped through on a later commit).
- tests/tools/test_discord_tool.py: TestConfigAllowlist caplog assertions
fail in parallel runs because AIAgent(quiet_mode=True) globally sets
logging.getLogger('tools').setLevel(ERROR) and xdist workers are
persistent. Autouse fixture resets the 'tools' and
'tools.discord_tool' levels per test.
Validation:
tests/cron + voice + agent_cache + streaming + toolsets + command_guards
+ discord_tool: 550/550 pass
tests/hermes_cli + tests/gateway: 5713/5713 pass
AIAgent construction without Z.AI creds: 2.2s → 0.24s (9x)
…AIAgent
CI on main had 7 failing tests. Five were stale test fixtures; one (agent
cache spillover timeout) was covering up a real perf regression in
AIAgent construction.
The perf bug: every AIAgent.__init__ calls _check_compression_model_feasibility
→ resolve_provider_client('auto') → _resolve_api_key_provider which
iterates PROVIDER_REGISTRY. When it hits 'zai', it unconditionally calls
resolve_api_key_provider_credentials → _resolve_zai_base_url → probes 8
Z.AI endpoints with an empty Bearer token (all 401s), ~2s of pure latency
per agent, even when the user has never touched Z.AI. Landed in
9e84416 (PR for credential-pool Z.AI auto-detect) — the short-circuit
when api_key is empty was missing. _resolve_kimi_base_url had the same
shape; fixed too.
Test fixes:
- tests/gateway/test_voice_command.py: _make_adapter helpers were missing
self._voice_locks (added in PR NousResearch#12644, 7 call sites — all updated).
- tests/test_toolsets.py: test_hermes_platforms_share_core_tools asserted
equality, but hermes-discord has discord_server (DISCORD_BOT_TOKEN-gated,
discord-only by design). Switched to subset check.
- tests/run_agent/test_streaming.py: test_tool_name_not_duplicated_when_resent_per_chunk
missing api_key/base_url — classic pitfall (PR NousResearch#11619 fixed 16 of
these; this one slipped through on a later commit).
- tests/tools/test_discord_tool.py: TestConfigAllowlist caplog assertions
fail in parallel runs because AIAgent(quiet_mode=True) globally sets
logging.getLogger('tools').setLevel(ERROR) and xdist workers are
persistent. Autouse fixture resets the 'tools' and
'tools.discord_tool' levels per test.
Validation:
tests/cron + voice + agent_cache + streaming + toolsets + command_guards
+ discord_tool: 550/550 pass
tests/hermes_cli + tests/gateway: 5713/5713 pass
AIAgent construction without Z.AI creds: 2.2s → 0.24s (9x)
…AIAgent
CI on main had 7 failing tests. Five were stale test fixtures; one (agent
cache spillover timeout) was covering up a real perf regression in
AIAgent construction.
The perf bug: every AIAgent.__init__ calls _check_compression_model_feasibility
→ resolve_provider_client('auto') → _resolve_api_key_provider which
iterates PROVIDER_REGISTRY. When it hits 'zai', it unconditionally calls
resolve_api_key_provider_credentials → _resolve_zai_base_url → probes 8
Z.AI endpoints with an empty Bearer token (all 401s), ~2s of pure latency
per agent, even when the user has never touched Z.AI. Landed in
9e84416 (PR for credential-pool Z.AI auto-detect) — the short-circuit
when api_key is empty was missing. _resolve_kimi_base_url had the same
shape; fixed too.
Test fixes:
- tests/gateway/test_voice_command.py: _make_adapter helpers were missing
self._voice_locks (added in PR NousResearch#12644, 7 call sites — all updated).
- tests/test_toolsets.py: test_hermes_platforms_share_core_tools asserted
equality, but hermes-discord has discord_server (DISCORD_BOT_TOKEN-gated,
discord-only by design). Switched to subset check.
- tests/run_agent/test_streaming.py: test_tool_name_not_duplicated_when_resent_per_chunk
missing api_key/base_url — classic pitfall (PR NousResearch#11619 fixed 16 of
these; this one slipped through on a later commit).
- tests/tools/test_discord_tool.py: TestConfigAllowlist caplog assertions
fail in parallel runs because AIAgent(quiet_mode=True) globally sets
logging.getLogger('tools').setLevel(ERROR) and xdist workers are
persistent. Autouse fixture resets the 'tools' and
'tools.discord_tool' levels per test.
Validation:
tests/cron + voice + agent_cache + streaming + toolsets + command_guards
+ discord_tool: 550/550 pass
tests/hermes_cli + tests/gateway: 5713/5713 pass
AIAgent construction without Z.AI creds: 2.2s → 0.24s (9x)
…AIAgent
CI on main had 7 failing tests. Five were stale test fixtures; one (agent
cache spillover timeout) was covering up a real perf regression in
AIAgent construction.
The perf bug: every AIAgent.__init__ calls _check_compression_model_feasibility
→ resolve_provider_client('auto') → _resolve_api_key_provider which
iterates PROVIDER_REGISTRY. When it hits 'zai', it unconditionally calls
resolve_api_key_provider_credentials → _resolve_zai_base_url → probes 8
Z.AI endpoints with an empty Bearer token (all 401s), ~2s of pure latency
per agent, even when the user has never touched Z.AI. Landed in
d44cf13 (PR for credential-pool Z.AI auto-detect) — the short-circuit
when api_key is empty was missing. _resolve_kimi_base_url had the same
shape; fixed too.
Test fixes:
- tests/gateway/test_voice_command.py: _make_adapter helpers were missing
self._voice_locks (added in PR NousResearch#12644, 7 call sites — all updated).
- tests/test_toolsets.py: test_hermes_platforms_share_core_tools asserted
equality, but hermes-discord has discord_server (DISCORD_BOT_TOKEN-gated,
discord-only by design). Switched to subset check.
- tests/run_agent/test_streaming.py: test_tool_name_not_duplicated_when_resent_per_chunk
missing api_key/base_url — classic pitfall (PR NousResearch#11619 fixed 16 of
these; this one slipped through on a later commit).
- tests/tools/test_discord_tool.py: TestConfigAllowlist caplog assertions
fail in parallel runs because AIAgent(quiet_mode=True) globally sets
logging.getLogger('tools').setLevel(ERROR) and xdist workers are
persistent. Autouse fixture resets the 'tools' and
'tools.discord_tool' levels per test.
Validation:
tests/cron + voice + agent_cache + streaming + toolsets + command_guards
+ discord_tool: 550/550 pass
tests/hermes_cli + tests/gateway: 5713/5713 pass
AIAgent construction without Z.AI creds: 2.2s → 0.24s (9x)
Summary
CI test wall time drops from ~4min to ~90s by running the suite as 4 parallel matrix jobs instead of a single job. Previously blocked by cross-test pollution; #11577 made the suite hermetic, unblocking this.
Changes
pyproject.toml: addpytest-split>=0.9,<1to dev extras.github/workflows/tests.yml:testjob matrix-splits into 4 groups. Each shard runspytest --splits 4 --group N, composes with-n autoxdist insidefail-fast: false— all shards finish even if one failsValidation
Shard 1 ran locally in 37s on my 16-core box; CI runners are 4-core so expect 60-90s per shard.
Context
#11566 closed — same approach but shard 3 hung at 97% due to cross-test pollution (sys.modules['dotenv'] stub bombs + env var leakage exposing hidden test dependencies). #11453 fixed the caplog flakes. #11577 made conftest hermetic and removed the dotenv stubs. This PR is only viable because that foundation is in place.