fix: forward auth + use LM Studio native endpoint for local ctx probes#13293
Merged
Conversation
Pass the user's configured api_key through local-server detection and context-length probes (detect_local_server_type, _query_local_context_length, query_ollama_num_ctx) and use LM Studio's native /api/v1/models endpoint in fetch_endpoint_model_metadata when a loaded instance is present — so the probed context length is the actual runtime value the user loaded the model at, not just the model's theoretical max. Helps local-LLM users whose auto-detected context length was wrong, causing compression failures and context-overrun crashes.
Follow-up for salvaged PR #3185: - run_agent.py: pass self.api_key to query_ollama_num_ctx() so Ollama behind an auth proxy (same issue class as the LM Studio fix) can be probed successfully. - scripts/release.py AUTHOR_MAP: map @tannerfokkens-maker's local-hostname commit email.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Local model users get the actual runtime context length of their loaded LM Studio model (from
loaded_instances[].config.context_length) instead of the model's theoretical max or a stale 128k fallback — fixes a common source of compression failures and context-overrun crashes on local setups.Salvage of #3185 (@tannerfokkens-maker) onto current main. The original branch was 2472 commits behind and had merge conflicts; his commit is preserved via cherry-pick with authorship intact.
Changes
agent/model_metadata.py: new_auth_headers()helper; threadapi_keythroughdetect_local_server_type,_query_local_context_length, andquery_ollama_num_ctx(the last one added here for symmetry with the Ollama auth-proxy case). New LM-Studio-native fast path infetch_endpoint_model_metadata— gated onis_local_endpoint() AND detect_local_server_type() == 'lm-studio', so ollama/vLLM/llama.cpp fall through unchanged.gateway/run.py: the@context-reference ctx-length lookup now forwards the runtime base_url + api_key (previously onlyself._base_url, no api_key).run_agent.py: the startup Ollama num_ctx detection now forwardsself.api_key(small follow-up for parity with the LM Studio fix).scripts/release.py: AUTHOR_MAP entry for @tannerfokkens-maker's local-hostname commit email.tests/agent/test_model_metadata_local_ctx.py: new tests for auth forwarding + LM Studio native endpoint.Validation
-k 'model_metadata or context_length or ollama')Closes #3185.