[Feature]: Manual context length entry on initial setup for custom endpoints #2007

@simpolism

Description

Problem or Use Case

I just walked through the setup against a local Qwen3.5-9B (Q4_K_M GGUF) served by llama.cpp with a context size of ~150k tokens (which leaves me a little wiggle room on my 4070 Ti Super).

The /v1/models endpoint doesn't expose a context size, so I believe I was fuzzy-matched to 32k, and I had to edit context_length_cache.yaml by hand to set the value I had actually configured.

Proposed Solution

It would be nice to enter the context length directly during initial setup for a custom endpoint (perhaps with local and remote endpoints as separate options, to allow a more nuanced flow). Even the probe for unknown models would have forced me into a bucket that doesn't match the local configuration I'm using.
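
To illustrate what the setup step might accept, here is a minimal sketch of parsing a manually entered context length. The function name `parse_context_length` is hypothetical and not taken from the codebase, and treating `k` as 1,000 (rather than 1,024) is an assumption the real implementation may want to decide differently.

```python
import re

# Hypothetical helper, not part of the existing codebase.
# Assumption: "k" means 1,000 tokens and "m" means 1,000,000 tokens.
_SUFFIX_MULTIPLIERS = {"": 1, "k": 1_000, "m": 1_000_000}


def parse_context_length(text: str) -> int:
    """Parse input such as '150000', '150,000', '150k', or '32K' into tokens."""
    # Drop thousands separators so '150,000' and '150_000' are accepted.
    cleaned = text.replace(",", "").replace("_", "")
    match = re.fullmatch(r"\s*(\d+(?:\.\d+)?)\s*([kKmM]?)\s*", cleaned)
    if match is None:
        raise ValueError(f"not a valid context length: {text!r}")
    number, suffix = match.groups()
    # round() avoids truncation from float error on inputs like '32.768k'.
    value = round(float(number) * _SUFFIX_MULTIPLIERS[suffix.lower()])
    if value <= 0:
        raise ValueError("context length must be a positive number of tokens")
    return value
```

The setup flow could call this on whatever the user types and re-prompt on `ValueError`, leaving the existing probe as the default when the field is left empty.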

Alternatives Considered

The alternatives are to do nothing, or to patch llama.cpp so that llama-server exposes the value we want, but the current behavior did make setup less ergonomic for me.

An environment variable would also work, but then the question is how it should interact with the existing cached context size. It would most likely override the cached value, which makes it the more general solution, but then how does it get persisted to the gateway? Overall, that is less ergonomic than a simple setup option.
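
For discussion, a minimal sketch of the precedence question, assuming the environment variable overrides the cache. The variable name `CONTEXT_LENGTH_OVERRIDE`, the function `resolve_context_length`, and the flat model-name-to-tokens mapping are all hypothetical; I have not checked them against the real schema of context_length_cache.yaml. The 32k fallback just mirrors the bucket I ended up in.

```python
from typing import Mapping, Tuple

ENV_VAR = "CONTEXT_LENGTH_OVERRIDE"  # hypothetical variable name
FALLBACK_TOKENS = 32_000  # assumption: mirrors the 32k bucket from this issue


def resolve_context_length(
    model: str,
    cache: Mapping[str, int],
    env: Mapping[str, str],
    fallback: int = FALLBACK_TOKENS,
) -> Tuple[int, str]:
    """Return (context_length, source) using env > cache > fallback."""
    raw = env.get(ENV_VAR, "").strip()
    if raw:
        # An explicit override wins and is deliberately not written to the cache.
        return int(raw), "env"
    if model in cache:
        return cache[model], "cache"
    return fallback, "fallback"
```

Returning the source alongside the value lets the caller decide whether anything should be persisted, which is the part that remains unresolved for the gateway. In practice `env` would be `os.environ` and `cache` the parsed YAML file.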

Feature Type

Configuration option

Scope

Small (single file, < 50 lines)

Contribution

  • I'd like to implement this myself and submit a PR
