perf(compression): defer feasibility check to first compression attempt — cut 170-290ms off every chat invocation by teknium1 · Pull Request #28957 · NousResearch/hermes-agent

teknium1 · 2026-05-19T21:56:28Z

Summary

AIAgent.__init__ was eagerly calling _check_compression_model_feasibility() which probes the auxiliary provider chain and runs get_model_context_length() (potentially network-bound) to decide whether the configured auxiliary model can fit a full compression-threshold window. That cost ~440ms cold on every agent construction.

Most chat -q invocations finish in 1-5 seconds and never accumulate enough context to trip the compression threshold, so the feasibility check is pure overhead. The result is also only consumed when compression actually fires (the function adjusts the live threshold downward if the aux model can't fit; absent that mutation, the gate in conversation_loop.py:442 would never fire anyway).

This is the wall-clock perf win on every agent invocation that the previous per-turn optimizations (#28866) didn't actually deliver — those cut function-call count and CPU but didn't move the perceived speed.

Changes

agent/agent_init.py: replace eager agent._check_compression_model_feasibility() with agent._compression_feasibility_checked = False sentinel.

agent/conversation_compression.py: compress_context() checks the sentinel on entry and runs the feasibility probe just-in-time on the first compression pass. Result is cached for the agent's lifetime.

tests/run_agent/test_compression_feasibility.py: one test asserted get_model_context_length was called during AIAgent(). Updated to drive the now-lazy check explicitly via agent._check_compression_model_feasibility(). All other 15 tests pass unchanged.
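The deferral described above can be sketched as follows. This is a minimal illustration, not the actual hermes-agent code: the `Agent` class, the `probe` callable, the `threshold_tokens` attribute, and the choice to flip the sentinel inside `compress_context()` are all assumptions made for the example. Only the names `_compression_feasibility_checked`, `_check_compression_model_feasibility`, and `compress_context` come from the PR.

```python
class Agent:
    """Toy stand-in for AIAgent; only the lazy-check wiring is modelled."""

    def __init__(self, probe):
        # `probe` is a hypothetical callable standing in for the expensive
        # auxiliary-provider resolution and context-length lookup.
        self._probe = probe
        # Sentinel only: nothing expensive runs during construction.
        self._compression_feasibility_checked = False
        self.threshold_tokens = 100_000  # illustrative value, not from the PR

    def _check_compression_model_feasibility(self):
        aux_context = self._probe()
        if aux_context < self.threshold_tokens:
            # The aux model cannot fit a full window, so lower the live threshold.
            self.threshold_tokens = aux_context

    def compress_context(self, messages):
        # Run the probe just-in-time on the first compression pass only.
        if not self._compression_feasibility_checked:
            self._check_compression_model_feasibility()
            self._compression_feasibility_checked = True
        # Placeholder for the real compression pass: keep the last two messages.
        return messages[-2:]


calls = []

def counting_probe():
    calls.append(1)
    return 32_000

agent = Agent(counting_probe)
calls_after_init = len(calls)           # construction does not probe
agent.compress_context(["a", "b", "c"])
agent.compress_context(["a", "b", "c"])
calls_after_two_passes = len(calls)     # probe ran once and was cached
```

In this sketch the sentinel is set only after the check returns, so a probe that raises would be retried on the next pass. Whether the real implementation behaves that way is not stated in the PR.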

Validation

End-to-end timing, chat -q 'hi' --provider openrouter -m google/gemini-3-flash-preview --max-turns 2, 3 runs each, isolated HERMES_HOME with real creds:

| | BEFORE | AFTER | delta |
|---|---|---|---|
| median wall | 2.03s | 1.86s | -8% (-169ms) |
| min wall | 1.92s | 1.63s | -15% (-293ms) |

The min-time delta is the cleaner deterministic signal — 293ms removed from every cold agent invocation that doesn't trip compression. The median variance is from LLM latency.
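As a quick check of the arithmetic, the deltas can be recomputed from the rounded wall times in the table. The `delta` helper is hypothetical and exists only for this example. Because the inputs are rounded to 10ms, the millisecond results come out as -170 and -290 rather than the reported -169 and -293, which were presumably computed from unrounded measurements.

```python
def delta(before_s, after_s):
    """Return (percent change, millisecond change) between two wall times."""
    diff = after_s - before_s
    return round(100 * diff / before_s), round(diff * 1000)


median_delta = delta(2.03, 1.86)  # rounded inputs from the table above
min_delta = delta(1.92, 1.63)
print(median_delta, min_delta)
```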

Microbench of the deferred check in isolation (real HERMES_HOME, real provider state):

cold call: 440ms
warm call (cached aux client + cached context length): 200ms

So we save ~440ms first-time per agent. Subsequent agents in the same process (gateway, batch) save ~200ms each since the underlying caches survive.

UX trade-off

Users with broken auxiliary-provider config no longer see the warning at session start — they see it when compression first fires (which is exactly when it matters). For users with working config (the vast majority), the warning never fires anyway, so the deferral is invisible.

Documented this trade-off in the code comment.

Tests

tests/run_agent/test_compression_feasibility.py — 16/16 pass (1 updated)
tests/run_agent/ (full module) — 1412 passed, 3 skipped, 1 pre-existing flake (test_marker_message_inserted_when_missing in tool_call_args_sanitizer — confirmed fails on bare origin/main too, unrelated)
Live tmux session: 2-turn conversation + tool call, zero errors in agent.log, all responses correct

Where the win comes from in the phase breakdown

For a chat -q 'hi' (3.28s wall on BEFORE branch):

+0.000s  process start
+0.068s  plugin discovery complete       [68ms]
+0.368s  .env loaded                     [300ms more = python imports]
+0.817s  agent_init OpenAI client        [449ms = AIAgent ctor]
+0.839s  vision auto-detect              [22ms]
+1.332s  aux auto-detect (compression)   [493ms  <-- THIS IS WHAT WE DEFER]
+1.532s  conversation turn begins        [200ms more]
+2.962s  API call complete (1.4s net)
+2.970s  Turn ended
+3.280s  process exits                   [310ms tail]

The +0.839s → +1.332s gap (marked by the `Auxiliary auto-detect: using main provider openrouter (google/gemini-3-flash-preview)` log line) is the feasibility check + aux client resolution + context-length lookup. Deferring it removes that 493ms from cold startup of every short session.
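The bracketed per-phase numbers in the breakdown are differences between consecutive cumulative timestamps. A small sketch that recomputes them from the timeline above (the `phase_gaps` helper is hypothetical and exists only for this example):

```python
# Cumulative timestamps (seconds) copied from the phase breakdown.
timeline = [
    (0.000, "process start"),
    (0.068, "plugin discovery complete"),
    (0.368, ".env loaded"),
    (0.817, "agent_init OpenAI client"),
    (0.839, "vision auto-detect"),
    (1.332, "aux auto-detect (compression)"),
    (1.532, "conversation turn begins"),
    (2.962, "API call complete"),
    (2.970, "Turn ended"),
    (3.280, "process exits"),
]


def phase_gaps(events):
    """Return (label, milliseconds since the previous event) pairs."""
    return [
        (label, round((t - prev_t) * 1000))
        for (prev_t, _), (t, label) in zip(events, events[1:])
    ]


gaps = dict(phase_gaps(timeline))
for label, ms in gaps.items():
    print(f"{label:32s} {ms:5d}ms")
```

Excluding the API call itself, the aux auto-detect step is the largest single gap, which is why it was the target of the deferral.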

`AIAgent.__init__` was eagerly calling `_check_compression_model_feasibility()`, which probes the auxiliary provider chain and runs `get_model_context_length()` (potentially network-bound) to decide whether the configured auxiliary model can fit a full compression-threshold window. That cost ~440ms cold on every agent construction.

Most `chat -q` invocations finish in 1-5 seconds and never accumulate enough context to trip the compression threshold, so the feasibility check is pure overhead. The result is also only consumed when compression actually fires (the function adjusts the live threshold downward if the aux model can't fit; absent that mutation, the gate in `conversation_loop.py:442` would never fire anyway).

Defer to first `compress_context()` call via `agent._compression_feasibility_checked` sentinel. Runs at most once per agent lifetime, just before the first compression pass. The warning storage (`_compression_warning`) and gateway replay machinery is unchanged: it still emits to status_callback on the first turn that actually needs compression.

E2E timing (chat -q 'hi', 3 runs each):
- median wall: 2.03s before, 1.86s after, -8% (-169ms)
- min wall: 1.92s before, 1.63s after, -15% (-293ms)

Real cold-start observation (synthetic 31-turn agent loop): identical behavior since feasibility check fires once on first compression and caches. No semantic difference for sessions that DO compress.

UX trade-off: users with broken auxiliary-provider config no longer see the warning at session start. They see it when compression first fires, which is exactly when it matters. For users with working config (the vast majority), the warning never fires anyway, so the deferral is invisible.

Tests:
- tests/run_agent/test_compression_feasibility.py: 16/16 pass (the one test that asserted call-at-init was updated to drive the lazy check explicitly via agent._check_compression_model_feasibility())
- Live tmux session: 2-turn conversation + tool call completes clean, zero errors in agent.log

github-actions · 2026-05-19T21:57:06Z

🔎 Lint report: `perf/lazy-compression-feasibility` vs `origin/main`

ruff

Total: 0 on HEAD, 0 on base (➖ 0)

🆕 New issues: none

✅ Fixed issues: none

Unchanged: 0 pre-existing issues carried over.

ty (type checker)

Total: 8985 on HEAD, 8985 on base (➖ 0)

🆕 New issues: none

✅ Fixed issues: none

Unchanged: 4736 pre-existing issues carried over.

Diagnostics are surfaced as warnings — this check never fails the build.

alt-glitch added the labels type/perf (Performance improvement or optimization), P2 (Medium: degraded but workaround exists), and comp/agent (Core agent loop, run_agent.py, prompt builder) on May 19, 2026

teknium1 merged commit 6cb9917 into main May 20, 2026
20 of 21 checks passed

teknium1 deleted the perf/lazy-compression-feasibility branch May 20, 2026 00:27

teknium1 mentioned this pull request May 20, 2026

perf(terminal): adaptive subprocess poll — cut ~195ms off every tool call, 1+ second per turn #29006

Merged

Haderach-Ram mentioned this pull request May 20, 2026

Ecosystem Digest — 2026-05-20 Haderach-Ram/openclaw-radar#13

Open

This was referenced May 25, 2026

fix(test): re-add explicit feasibility check call in lazy-check integration test #32001

Open

fix(agent): propagate custom provider context_length to compression feasibility check #13540

Open

BrewTestBot mentioned this pull request May 28, 2026

hermes-agent 2026.5.28 Homebrew/homebrew-core#285115

Merged
