fix: forward auth + use LM Studio native endpoint for local ctx probes by teknium1 · Pull Request #13293 · NousResearch/hermes-agent

teknium1 · 2026-04-21T03:50:07Z

Local model users get the actual runtime context length of their loaded LM Studio model (from loaded_instances[].config.context_length) instead of the model's theoretical max or a stale 128k fallback — fixes a common source of compression failures and context-overrun crashes on local setups.

Salvage of #3185 (@tannerfokkens-maker) onto current main. The original branch was 2472 commits behind and had merge conflicts; his commit is preserved via cherry-pick with authorship intact.

Changes

agent/model_metadata.py: new _auth_headers() helper; thread api_key through detect_local_server_type, _query_local_context_length, and query_ollama_num_ctx (the last one added here for symmetry with the Ollama auth-proxy case). New LM-Studio-native fast path in fetch_endpoint_model_metadata — gated on is_local_endpoint() AND detect_local_server_type() == 'lm-studio', so ollama/vLLM/llama.cpp fall through unchanged.
gateway/run.py: the @ context-reference ctx-length lookup now forwards the runtime base_url + api_key (previously only self._base_url, no api_key).
run_agent.py: the startup Ollama num_ctx detection now forwards self.api_key (small follow-up for parity with the LM Studio fix).
scripts/release.py: AUTHOR_MAP entry for @tannerfokkens-maker's local-hostname commit email.
tests/agent/test_model_metadata_local_ctx.py: new tests for auth forwarding + LM Studio native endpoint.

Validation

	Before	After
Targeted tests (test_model_metadata_local_ctx + test_ollama_num_ctx)	—	30/30 pass
Metadata/ctx-length area (`-k 'model_metadata or context_length or ollama'`)	—	108/108 pass
E2E: stub LM Studio w/ loaded_instance ctx=24576, max=131072	returns 131072	returns 24576
E2E: vLLM stub (non-LM-Studio)	works	works (no regression)
E2E: no api_key configured	no auth header	no auth header (baseline preserved)

Closes #3185.

Pass the user's configured api_key through local-server detection and context-length probes (detect_local_server_type, _query_local_context_length, query_ollama_num_ctx) and use LM Studio's native /api/v1/models endpoint in fetch_endpoint_model_metadata when a loaded instance is present — so the probed context length is the actual runtime value the user loaded the model at, not just the model's theoretical max. Helps local-LLM users whose auto-detected context length was wrong, causing compression failures and context-overrun crashes.

@tannerfokkens-maker

Follow-up for salvaged PR #3185: - run_agent.py: pass self.api_key to query_ollama_num_ctx() so Ollama behind an auth proxy (same issue class as the LM Studio fix) can be probed successfully. - scripts/release.py AUTHOR_MAP: map @tannerfokkens-maker's local-hostname commit email.

Tanner Fokkens and others added 2 commits April 20, 2026 20:49

teknium1 merged commit e00d963 into main Apr 21, 2026
6 of 7 checks passed

teknium1 deleted the hermes/hermes-7394be4f branch April 21, 2026 03:51

teknium1 mentioned this pull request Apr 21, 2026

fix: forward auth when probing local model metadata #3185

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: forward auth + use LM Studio native endpoint for local ctx probes#13293

fix: forward auth + use LM Studio native endpoint for local ctx probes#13293
teknium1 merged 2 commits into
mainfrom
hermes/hermes-7394be4f

teknium1 commented Apr 21, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

teknium1 commented Apr 21, 2026

Changes

Validation

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant