Skip to content

perf(compression): defer feasibility check to first compression attempt — cut 170-290ms off every chat invocation#28957

Merged
teknium1 merged 1 commit into
mainfrom
perf/lazy-compression-feasibility
May 20, 2026
Merged

perf(compression): defer feasibility check to first compression attempt — cut 170-290ms off every chat invocation#28957
teknium1 merged 1 commit into
mainfrom
perf/lazy-compression-feasibility

Conversation

@teknium1

Copy link
Copy Markdown
Contributor

Summary

AIAgent.__init__ was eagerly calling _check_compression_model_feasibility() which probes the auxiliary provider chain and runs get_model_context_length() (potentially network-bound) to decide whether the configured auxiliary model can fit a full compression-threshold window. That cost ~440ms cold on every agent construction.

Most chat -q invocations finish in 1-5 seconds and never accumulate enough context to trip the compression threshold, so the feasibility check is pure overhead. The result is also only consumed when compression actually fires (the function adjusts the live threshold downward if the aux model can't fit; absent that mutation, the gate in conversation_loop.py:442 would never fire anyway).

This is the wall-clock perf win on every agent invocation that the previous per-turn optimizations (#28866) didn't actually deliver — those cut function-call count and CPU but didn't move the perceived speed.

Changes

agent/agent_init.py: replace eager agent._check_compression_model_feasibility() with agent._compression_feasibility_checked = False sentinel.

agent/conversation_compression.py: compress_context() checks the sentinel on entry and runs the feasibility probe just-in-time on the first compression pass. Result is cached for the agent's lifetime.

tests/run_agent/test_compression_feasibility.py: one test asserted get_model_context_length was called during AIAgent(). Updated to drive the now-lazy check explicitly via agent._check_compression_model_feasibility(). All other 15 tests pass unchanged.

Validation

End-to-end timing, chat -q 'hi' --provider openrouter -m google/gemini-3-flash-preview --max-turns 2, 3 runs each, isolated HERMES_HOME with real creds:

BEFORE AFTER delta
median wall 2.03s 1.86s -8% (-169ms)
min wall 1.92s 1.63s -15% (-293ms)

The min-time delta is the cleaner deterministic signal — 293ms removed from every cold agent invocation that doesn't trip compression. The median variance is from LLM latency.

Microbench of the deferred check in isolation (real HERMES_HOME, real provider state):

  • cold call: 440ms
  • warm call (cached aux client + cached context length): 200ms

So we save ~440ms first-time per agent. Subsequent agents in the same process (gateway, batch) save ~200ms each since the underlying caches survive.

UX trade-off

Users with broken auxiliary-provider config no longer see the warning at session start — they see it when compression first fires (which is exactly when it matters). For users with working config (the vast majority), the warning never fires anyway, so the deferral is invisible.

Documented this trade-off in the code comment.

Tests

  • tests/run_agent/test_compression_feasibility.py — 16/16 pass (1 updated)
  • tests/run_agent/ (full module) — 1412 passed, 3 skipped, 1 pre-existing flake (test_marker_message_inserted_when_missing in tool_call_args_sanitizer — confirmed fails on bare origin/main too, unrelated)
  • Live tmux session: 2-turn conversation + tool call, zero errors in agent.log, all responses correct

Where the win comes from in the phase breakdown

For a chat -q 'hi' (3.28s wall on BEFORE branch):

+0.000s  process start
+0.068s  plugin discovery complete       [68ms]
+0.368s  .env loaded                     [300ms more = python imports]
+0.817s  agent_init OpenAI client        [449ms = AIAgent ctor]
+0.839s  vision auto-detect              [22ms]
+1.332s  aux auto-detect (compression)   [493ms  <-- THIS IS WHAT WE DEFER]
+1.532s  conversation turn begins        [200ms more]
+2.962s  API call complete (1.4s net)
+2.970s  Turn ended
+3.280s  process exits                   [310ms tail]

The +0.839s → +1.332s gap (the Auxiliary auto-detect: using main provider openrouter (google/gemini-3-flash-preview) log line) is the feasibility check + aux client resolution + context-length lookup. Deferring it removes that 493ms from cold startup of every short session.

`AIAgent.__init__` was eagerly calling
`_check_compression_model_feasibility()` which probes the auxiliary
provider chain and runs `get_model_context_length()` (potentially
network-bound) to decide whether the configured auxiliary model can
fit a full compression-threshold window. That cost ~440ms cold on
every agent construction.

Most `chat -q` invocations finish in 1-5 seconds and never accumulate
enough context to trip the compression threshold, so the feasibility
check is pure overhead. The result is also only consumed when
compression actually fires (the function adjusts the live threshold
downward if the aux model can't fit; absent that mutation, the gate
in `conversation_loop.py:442` would never fire anyway).

Defer to first `compress_context()` call via
`agent._compression_feasibility_checked` sentinel. Runs at most once
per agent lifetime, just before the first compression pass. The
warning storage (`_compression_warning`) and gateway replay
machinery is unchanged — it still emits to status_callback on the
first turn that actually needs compression.

E2E timing (chat -q 'hi', 3 runs each):
                BEFORE   AFTER    delta
  median wall   2.03s    1.86s    -8% (-169ms)
  min wall      1.92s    1.63s    -15% (-293ms)

Real cold-start observation (synthetic 31-turn agent loop): identical
behavior since feasibility check fires once on first compression and
caches. No semantic difference for sessions that DO compress.

UX trade-off: users with broken auxiliary-provider config no longer
see the warning at session start. They see it when compression first
fires — which is exactly when it matters. For users with working
config (the vast majority), the warning never fires anyway, so the
deferral is invisible.

Tests:
- tests/run_agent/test_compression_feasibility.py — 16/16 pass
  (the one test that asserted call-at-init was updated to drive the
  lazy check explicitly via agent._check_compression_model_feasibility())
- Live tmux session: 2-turn conversation + tool call completes clean,
  zero errors in agent.log
@github-actions

Copy link
Copy Markdown
Contributor

🔎 Lint report: perf/lazy-compression-feasibility vs origin/main

ruff

Total: 0 on HEAD, 0 on base (➖ 0)

🆕 New issues: none

✅ Fixed issues: none

Unchanged: 0 pre-existing issues carried over.

ty (type checker)

Total: 8985 on HEAD, 8985 on base (➖ 0)

🆕 New issues: none

✅ Fixed issues: none

Unchanged: 4736 pre-existing issues carried over.

Diagnostics are surfaced as warnings — this check never fails the build.

@alt-glitch alt-glitch added type/perf Performance improvement or optimization P2 Medium — degraded but workaround exists comp/agent Core agent loop, run_agent.py, prompt builder labels May 19, 2026
@teknium1 teknium1 merged commit 6cb9917 into main May 20, 2026
20 of 21 checks passed
@teknium1 teknium1 deleted the perf/lazy-compression-feasibility branch May 20, 2026 00:27
Lillard01 pushed a commit to Lillard01/hermes-agent that referenced this pull request May 21, 2026
…pt (NousResearch#28957)

`AIAgent.__init__` was eagerly calling
`_check_compression_model_feasibility()` which probes the auxiliary
provider chain and runs `get_model_context_length()` (potentially
network-bound) to decide whether the configured auxiliary model can
fit a full compression-threshold window. That cost ~440ms cold on
every agent construction.

Most `chat -q` invocations finish in 1-5 seconds and never accumulate
enough context to trip the compression threshold, so the feasibility
check is pure overhead. The result is also only consumed when
compression actually fires (the function adjusts the live threshold
downward if the aux model can't fit; absent that mutation, the gate
in `conversation_loop.py:442` would never fire anyway).

Defer to first `compress_context()` call via
`agent._compression_feasibility_checked` sentinel. Runs at most once
per agent lifetime, just before the first compression pass. The
warning storage (`_compression_warning`) and gateway replay
machinery is unchanged — it still emits to status_callback on the
first turn that actually needs compression.

E2E timing (chat -q 'hi', 3 runs each):
                BEFORE   AFTER    delta
  median wall   2.03s    1.86s    -8% (-169ms)
  min wall      1.92s    1.63s    -15% (-293ms)

Real cold-start observation (synthetic 31-turn agent loop): identical
behavior since feasibility check fires once on first compression and
caches. No semantic difference for sessions that DO compress.

UX trade-off: users with broken auxiliary-provider config no longer
see the warning at session start. They see it when compression first
fires — which is exactly when it matters. For users with working
config (the vast majority), the warning never fires anyway, so the
deferral is invisible.

Tests:
- tests/run_agent/test_compression_feasibility.py — 16/16 pass
  (the one test that asserted call-at-init was updated to drive the
  lazy check explicitly via agent._check_compression_model_feasibility())
- Live tmux session: 2-turn conversation + tool call completes clean,
  zero errors in agent.log
clckmedia pushed a commit to clckmedia/hermes-agent that referenced this pull request May 21, 2026
…pt (NousResearch#28957)

`AIAgent.__init__` was eagerly calling
`_check_compression_model_feasibility()` which probes the auxiliary
provider chain and runs `get_model_context_length()` (potentially
network-bound) to decide whether the configured auxiliary model can
fit a full compression-threshold window. That cost ~440ms cold on
every agent construction.

Most `chat -q` invocations finish in 1-5 seconds and never accumulate
enough context to trip the compression threshold, so the feasibility
check is pure overhead. The result is also only consumed when
compression actually fires (the function adjusts the live threshold
downward if the aux model can't fit; absent that mutation, the gate
in `conversation_loop.py:442` would never fire anyway).

Defer to first `compress_context()` call via
`agent._compression_feasibility_checked` sentinel. Runs at most once
per agent lifetime, just before the first compression pass. The
warning storage (`_compression_warning`) and gateway replay
machinery is unchanged — it still emits to status_callback on the
first turn that actually needs compression.

E2E timing (chat -q 'hi', 3 runs each):
                BEFORE   AFTER    delta
  median wall   2.03s    1.86s    -8% (-169ms)
  min wall      1.92s    1.63s    -15% (-293ms)

Real cold-start observation (synthetic 31-turn agent loop): identical
behavior since feasibility check fires once on first compression and
caches. No semantic difference for sessions that DO compress.

UX trade-off: users with broken auxiliary-provider config no longer
see the warning at session start. They see it when compression first
fires — which is exactly when it matters. For users with working
config (the vast majority), the warning never fires anyway, so the
deferral is invisible.

Tests:
- tests/run_agent/test_compression_feasibility.py — 16/16 pass
  (the one test that asserted call-at-init was updated to drive the
  lazy check explicitly via agent._check_compression_model_feasibility())
- Live tmux session: 2-turn conversation + tool call completes clean,
  zero errors in agent.log
Gpapas pushed a commit to Gpapas/hermes-agent that referenced this pull request May 23, 2026
…pt (NousResearch#28957)

`AIAgent.__init__` was eagerly calling
`_check_compression_model_feasibility()` which probes the auxiliary
provider chain and runs `get_model_context_length()` (potentially
network-bound) to decide whether the configured auxiliary model can
fit a full compression-threshold window. That cost ~440ms cold on
every agent construction.

Most `chat -q` invocations finish in 1-5 seconds and never accumulate
enough context to trip the compression threshold, so the feasibility
check is pure overhead. The result is also only consumed when
compression actually fires (the function adjusts the live threshold
downward if the aux model can't fit; absent that mutation, the gate
in `conversation_loop.py:442` would never fire anyway).

Defer to first `compress_context()` call via
`agent._compression_feasibility_checked` sentinel. Runs at most once
per agent lifetime, just before the first compression pass. The
warning storage (`_compression_warning`) and gateway replay
machinery is unchanged — it still emits to status_callback on the
first turn that actually needs compression.

E2E timing (chat -q 'hi', 3 runs each):
                BEFORE   AFTER    delta
  median wall   2.03s    1.86s    -8% (-169ms)
  min wall      1.92s    1.63s    -15% (-293ms)

Real cold-start observation (synthetic 31-turn agent loop): identical
behavior since feasibility check fires once on first compression and
caches. No semantic difference for sessions that DO compress.

UX trade-off: users with broken auxiliary-provider config no longer
see the warning at session start. They see it when compression first
fires — which is exactly when it matters. For users with working
config (the vast majority), the warning never fires anyway, so the
deferral is invisible.

Tests:
- tests/run_agent/test_compression_feasibility.py — 16/16 pass
  (the one test that asserted call-at-init was updated to drive the
  lazy check explicitly via agent._check_compression_model_feasibility())
- Live tmux session: 2-turn conversation + tool call completes clean,
  zero errors in agent.log
Mucky010 pushed a commit to Mucky010/hermes-agent that referenced this pull request May 24, 2026
…pt (NousResearch#28957)

`AIAgent.__init__` was eagerly calling
`_check_compression_model_feasibility()` which probes the auxiliary
provider chain and runs `get_model_context_length()` (potentially
network-bound) to decide whether the configured auxiliary model can
fit a full compression-threshold window. That cost ~440ms cold on
every agent construction.

Most `chat -q` invocations finish in 1-5 seconds and never accumulate
enough context to trip the compression threshold, so the feasibility
check is pure overhead. The result is also only consumed when
compression actually fires (the function adjusts the live threshold
downward if the aux model can't fit; absent that mutation, the gate
in `conversation_loop.py:442` would never fire anyway).

Defer to first `compress_context()` call via
`agent._compression_feasibility_checked` sentinel. Runs at most once
per agent lifetime, just before the first compression pass. The
warning storage (`_compression_warning`) and gateway replay
machinery is unchanged — it still emits to status_callback on the
first turn that actually needs compression.

E2E timing (chat -q 'hi', 3 runs each):
                BEFORE   AFTER    delta
  median wall   2.03s    1.86s    -8% (-169ms)
  min wall      1.92s    1.63s    -15% (-293ms)

Real cold-start observation (synthetic 31-turn agent loop): identical
behavior since feasibility check fires once on first compression and
caches. No semantic difference for sessions that DO compress.

UX trade-off: users with broken auxiliary-provider config no longer
see the warning at session start. They see it when compression first
fires — which is exactly when it matters. For users with working
config (the vast majority), the warning never fires anyway, so the
deferral is invisible.

Tests:
- tests/run_agent/test_compression_feasibility.py — 16/16 pass
  (the one test that asserted call-at-init was updated to drive the
  lazy check explicitly via agent._check_compression_model_feasibility())
- Live tmux session: 2-turn conversation + tool call completes clean,
  zero errors in agent.log
Bryce-huang pushed a commit to wbkunlun/hermes-agent that referenced this pull request May 29, 2026
…pt (NousResearch#28957)

`AIAgent.__init__` was eagerly calling
`_check_compression_model_feasibility()` which probes the auxiliary
provider chain and runs `get_model_context_length()` (potentially
network-bound) to decide whether the configured auxiliary model can
fit a full compression-threshold window. That cost ~440ms cold on
every agent construction.

Most `chat -q` invocations finish in 1-5 seconds and never accumulate
enough context to trip the compression threshold, so the feasibility
check is pure overhead. The result is also only consumed when
compression actually fires (the function adjusts the live threshold
downward if the aux model can't fit; absent that mutation, the gate
in `conversation_loop.py:442` would never fire anyway).

Defer to first `compress_context()` call via
`agent._compression_feasibility_checked` sentinel. Runs at most once
per agent lifetime, just before the first compression pass. The
warning storage (`_compression_warning`) and gateway replay
machinery is unchanged — it still emits to status_callback on the
first turn that actually needs compression.

E2E timing (chat -q 'hi', 3 runs each):
                BEFORE   AFTER    delta
  median wall   2.03s    1.86s    -8% (-169ms)
  min wall      1.92s    1.63s    -15% (-293ms)

Real cold-start observation (synthetic 31-turn agent loop): identical
behavior since feasibility check fires once on first compression and
caches. No semantic difference for sessions that DO compress.

UX trade-off: users with broken auxiliary-provider config no longer
see the warning at session start. They see it when compression first
fires — which is exactly when it matters. For users with working
config (the vast majority), the warning never fires anyway, so the
deferral is invisible.

Tests:
- tests/run_agent/test_compression_feasibility.py — 16/16 pass
  (the one test that asserted call-at-init was updated to drive the
  lazy check explicitly via agent._check_compression_model_feasibility())
- Live tmux session: 2-turn conversation + tool call completes clean,
  zero errors in agent.log
#AI commit#
gweeteve pushed a commit to gweeteve/hermes-agent that referenced this pull request Jun 2, 2026
…pt (NousResearch#28957)

`AIAgent.__init__` was eagerly calling
`_check_compression_model_feasibility()` which probes the auxiliary
provider chain and runs `get_model_context_length()` (potentially
network-bound) to decide whether the configured auxiliary model can
fit a full compression-threshold window. That cost ~440ms cold on
every agent construction.

Most `chat -q` invocations finish in 1-5 seconds and never accumulate
enough context to trip the compression threshold, so the feasibility
check is pure overhead. The result is also only consumed when
compression actually fires (the function adjusts the live threshold
downward if the aux model can't fit; absent that mutation, the gate
in `conversation_loop.py:442` would never fire anyway).

Defer to first `compress_context()` call via
`agent._compression_feasibility_checked` sentinel. Runs at most once
per agent lifetime, just before the first compression pass. The
warning storage (`_compression_warning`) and gateway replay
machinery is unchanged — it still emits to status_callback on the
first turn that actually needs compression.

E2E timing (chat -q 'hi', 3 runs each):
                BEFORE   AFTER    delta
  median wall   2.03s    1.86s    -8% (-169ms)
  min wall      1.92s    1.63s    -15% (-293ms)

Real cold-start observation (synthetic 31-turn agent loop): identical
behavior since feasibility check fires once on first compression and
caches. No semantic difference for sessions that DO compress.

UX trade-off: users with broken auxiliary-provider config no longer
see the warning at session start. They see it when compression first
fires — which is exactly when it matters. For users with working
config (the vast majority), the warning never fires anyway, so the
deferral is invisible.

Tests:
- tests/run_agent/test_compression_feasibility.py — 16/16 pass
  (the one test that asserted call-at-init was updated to drive the
  lazy check explicitly via agent._check_compression_model_feasibility())
- Live tmux session: 2-turn conversation + tool call completes clean,
  zero errors in agent.log
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

comp/agent Core agent loop, run_agent.py, prompt builder P2 Medium — degraded but workaround exists type/perf Performance improvement or optimization

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants