
test: remove 169 change-detector tests (batch 1 of suite reduction) #11472

Merged
teknium1 merged 1 commit into main from fix/test-reduction-batch-1 on Apr 17, 2026
Conversation

@teknium1

Contributor

Summary

First batch of the test-suite reduction discussed with Teknium. Deletes 169 tests across 21 files that fell into confident change-detector patterns — tests that verify 'nothing was changed' rather than 'something works'.

Deletion categories

  1. Source-grep tests (gateway/test_feishu.py, test_email.py) — tests that call inspect.getsource() on production modules and grep for string literals. Break on any refactor/rename even when behavior is correct.

  2. Platform enum tautologies (every gateway/test_X.py) — Platform.X.value == 'x' duplicated across ~9 adapter test files.

  3. Registry-presence checks (toolset/PLATFORM_HINTS/setup wizard) — tests that only verify a key exists in a dict. Data-layout tests, not behavior.

  4. Argparse wiring tests (test_argparse_flag_propagation, test_subparser_routing_fallback): parser.parse_args([...]), then assert args.field. Tests Python's argparse, not our code.

  5. Pure dispatch tests (test_plugins_cmd.TestPluginsCommandDispatch) — patch handler, call dispatcher with matching action, assert mock called. Tests the if/elif chain.

  6. Kwarg-to-mock verification (test_auxiliary_client ~45 tests, test_web_tools_config, test_gemini_cloudcode, test_retaindb_plugin) — mock external API client, call our function, assert exact kwargs. Break on refactor.

  7. Schedule-internal "function-was-called" tests (acp/test_server scheduling tests) — patch own helper method, assert it was called.
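
To make the distinction concrete, here is a minimal sketch contrasting two of the deleted patterns (categories 1 and 2) with a behavioral test. `Platform` and `normalize_chat_id` are hypothetical stand-ins invented for this illustration, not the project's actual code.

```python
import enum
import inspect


class Platform(enum.Enum):
    # Hypothetical stand-in for the project's Platform enum.
    FEISHU = "feishu"


def normalize_chat_id(raw: str) -> str:
    # Hypothetical production helper, invented for this illustration.
    return raw.strip().lower()


def test_source_mentions_strip():
    # Source-grep pattern (category 1): asserts on the implementation text,
    # so an equivalent rewrite that avoids ".strip()" fails the test.
    assert ".strip()" in inspect.getsource(normalize_chat_id)


def test_platform_value():
    # Enum tautology (category 2): restates the enum definition.
    assert Platform.FEISHU.value == "feishu"


def test_normalize_chat_id_behavior():
    # Behavioral test: pins the observable result, not how it is produced.
    assert normalize_chat_id("  Room-42 ") == "room-42"
```

Only the last test would survive a rewrite of `normalize_chat_id` that keeps its output the same, which is the property the "keep" list is selecting for.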

What was kept

  • Error paths (pytest.raises)
  • Security tests (path traversal, SSRF, redaction, injection scanning)
  • Message alternation invariants
  • Provider API format conversion tests (Anthropic adapter, Bedrock adapter, Codex Responses): entirely untouched
  • Streaming logic tests: entirely untouched
  • Memory provider contract tests: entirely untouched
  • Credential pool tests: entirely untouched
  • Real config load/merge tests (profile-awareness, root-level legacy fallback)

Metrics

| | Before | After | Delta |
|---|---|---|---|
| Tests collected | 12,522 | 12,353 | -169 |
| Empty classes | - | 38 removed | |
| Test LOC | 194,963 | 193,018 | -1,945 |
| CI test runtime (main, last run) | 3m57s | TBD | |

Methodology

  • Three parallel subagents audited tests/gateway/, tests/hermes_cli/+tools/+cli/, tests/agent/+run_agent/+acp/+plugins/+cron/+skills/ for change-detector patterns.
  • Strict "when in doubt, keep" rule: each deletion has a specific one-line reason in the manifest.
  • AST-based deletion script removed named test functions plus any class that became empty of tests (38 classes).
  • tests/run_agent/test_run_agent.py was OFF LIMITS throughout, so core agent loop coverage is preserved.

Test plan

  • All 21 modified files compile (py_compile).
  • Running the 21 affected files gives 988/991 passing locally; the 3 failures are pre-existing cross-test pollution in TestSignalPhoneRedaction (caplog/logger state), the same class of flake fixed in PR #11453 (fix(tests): attach caplog to specific logger in 3 order-dependent tests) and unrelated to these deletions.
  • CI is the source of truth: will monitor the Tests job on this PR.
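
For reference, the compile check in the first bullet can be reproduced with a few lines of standard-library code. This is a sketch only: the helper name `files_that_fail_to_compile` is made up here, and the list of 21 paths would need to be supplied by the caller.

```python
import py_compile


def files_that_fail_to_compile(paths):
    """Return (path, message) pairs for files that do not byte-compile."""
    failures = []
    for path in paths:
        try:
            # doraise=True turns syntax errors into PyCompileError
            # instead of printing them to stderr.
            py_compile.compile(str(path), doraise=True)
        except py_compile.PyCompileError as exc:
            failures.append((str(path), exc.msg))
    return failures
```

An empty return value means every file parsed and compiled. It says nothing about whether the remaining tests pass, which is why the test run and CI are listed separately.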

Next

This is batch 1 of several. After review, I'll propose batch 2 targeting broader patterns (collapse per-platform test duplication, trim large mock-heavy files like test_voice_command.py where appropriate).

First pass of test-suite reduction to address flaky CI and bloat.

Removed tests that fall into these change-detector patterns:

1. Source-grep tests (tests/gateway/test_feishu.py, test_email.py): tests
   that call inspect.getsource() on production modules and grep for string
   literals. Break on any refactor/rename even when behavior is correct.

2. Platform enum tautologies (every gateway/test_X.py): assertions like
   `Platform.X.value == 'x'` duplicated across ~9 adapter test files.

3. Toolset/PLATFORM_HINTS/setup-wizard registry-presence checks: tests that
   only verify a key exists in a dict. Data-layout tests, not behavior.

4. Argparse wiring tests (test_argparse_flag_propagation,
   test_subparser_routing_fallback): tests that do parser.parse_args([...])
   then assert args.field. Tests Python's argparse, not our code.

5. Pure dispatch tests (test_plugins_cmd.TestPluginsCommandDispatch): patch
   cmd_X, call plugins_command with matching action, assert mock called.
   Tests the if/elif chain, not behavior.

6. Kwarg-to-mock verification (test_auxiliary_client ~45 tests,
   test_web_tools_config, test_gemini_cloudcode, test_retaindb_plugin): tests
   that mock the external API client, call our function, and assert exact
   kwargs. Break on refactor even when behavior is preserved.

7. Schedule-internal "function-was-called" tests (acp/test_server scheduling
   tests): tests that patch own helper method, then assert it was called.

Kept behavioral tests throughout: error paths (pytest.raises), security
tests (path traversal, SSRF, redaction), message alternation invariants,
provider API format conversion, streaming logic, memory contract, real
config load/merge tests.

Net reduction: 169 tests removed. 38 empty classes cleaned up.

Collected before: 12,522 tests
Collected after:  12,353 tests
@teknium1 teknium1 force-pushed the fix/test-reduction-batch-1 branch from e3583ab to 69440dd April 17, 2026 07:54
@teknium1 teknium1 merged commit 2367c6f into main Apr 17, 2026
5 checks passed
@teknium1 teknium1 deleted the fix/test-reduction-batch-1 branch April 17, 2026 08:05
Scorpion1221 pushed a commit to Scorpion1221/hermes-agent that referenced this pull request Apr 24, 2026
* merge-upstream-2026-04-17: (243 commits)
  fix(feishu): reduce CardKit streaming frequency and add backoff on errors
  fix(feishu): refine CardKit streaming card polish
  fix(feishu): prevent double finalize and add loading spinner icon
  fix(feishu): fix /stop regression and streaming card finalization
  feat(feishu): add CardKit streaming card output
  feat(feishu): split inbound policy and comment flow
  fix(feishu): fetch merge-forward submessages eagerly
  feat(feishu): add drive comment routing
  feat(feishu): add sender cache and rollout sync
  fix(feishu): hide opaque sender ids in merge forwards
  fix(feishu): prefer embedded sender names
  feat(feishu): add inbound bridge and media index
  refactor(feishu): extract inbound parse module
  feat(feishu): preserve merge-forward media context
  feat(feishu): hydrate quoted merge forwards
  fix(gateway): persist canonical quoted context
  feat(feishu): add inbound quoted context pipeline
  fix(feishu): render outbound messages as Card 2.0 for full markdown support
  test: remove 169 change-detector tests across 21 files (NousResearch#11472)
  fix(insights): hide cache read/write and cost metrics from display (NousResearch#11477)
  ...
ulasbilgen pushed a commit to ulasbilgen/hermes-adhd-agent that referenced this pull request May 1, 2026
…11472)

aj-nt pushed a commit to aj-nt/hermes-agent that referenced this pull request May 1, 2026
…11472)

02356abc pushed a commit to 02356abc/hermes-agent that referenced this pull request May 14, 2026
…11472)

gweeteve pushed a commit to gweeteve/hermes-agent that referenced this pull request Jun 2, 2026
…11472)

Egavasyug pushed a commit to Egavasyug/hermes-agent that referenced this pull request Jun 10, 2026
…11472)
