The code appears to assume a 128k context window by default:
return metadata[model].get("context_length", 128000)
This leads to complete failures, such as:
⚠️ Error code: 400 - {'error': {'code': 400, 'message': 'request (67374 tokens) exceeds the available context size (65536 tokens), try increasing it', 'type': 'exceed_context_size_error', 'n_prompt_tokens': 67374, 'n_ctx': 65536}}
This cannot be recovered (and the code should not attempt to recover it), even by reducing the compaction threshold and restarting the gateway, because the context already exceeds what the model can process and compact.
128k is a typical model maximum (from a year or so ago), so it shouldn't be assumed as a default length. In absolute terms, a safer default would be something like 4096. More reasonably, README.md should specify a minimum supported context length, based on the size of the prompts sent to the model from the internal prompt templates. That value should then also be used as the default, so that the minimum supported and the minimum assumed are the same, exactly as documented.
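A minimal sketch of that fallback, assuming the metadata is a dict of per-model dicts as the snippet above suggests. The constant name `MIN_SUPPORTED_CONTEXT_LENGTH` and its value are hypothetical; the real number would have to be measured from the internal prompt templates and written into README.md.

```python
# Hypothetical value: the real minimum would be derived from the size of the
# internal prompt templates and documented in README.md.
MIN_SUPPORTED_CONTEXT_LENGTH = 16_384


def get_context_length(metadata: dict, model: str) -> int:
    """Return the model's known context length, else the documented minimum."""
    # .get() on the outer dict avoids a KeyError for models with no entry.
    entry = metadata.get(model, {})
    return entry.get("context_length", MIN_SUPPORTED_CONTEXT_LENGTH)
```

With this, an unrecognised model is treated as having the smallest context the project claims to support, rather than the largest one it happens to know about.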
Fundamentally, the problem is that the codebase tries to guess this using heuristics and lookup tables for a few well-known models, instead of requiring the information when the model is not recognised.
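Roughly what I have in mind for requiring the information, as a sketch only. The function name, the `configured` parameter, and the exception class are all hypothetical and not taken from the codebase; the assumption is that the user can supply the value through some config setting.

```python
from typing import Optional


class UnknownContextLengthError(ValueError):
    """Raised when the context length is neither configured nor known."""


def resolve_context_length(
    metadata: dict, model: str, configured: Optional[int] = None
) -> int:
    """Resolve the context length without guessing.

    An explicit user setting wins, because only the user knows how the
    serving backend was actually launched. The lookup table is used next.
    If neither is available, fail at startup instead of assuming a size.
    """
    if configured is not None:
        if configured <= 0:
            raise ValueError("configured context length must be positive")
        return configured

    known = metadata.get(model, {}).get("context_length")
    if known is not None:
        return known

    raise UnknownContextLengthError(
        f"Context length for model {model!r} is unknown; set it explicitly."
    )
```

Failing early here means the problem shows up as a clear configuration error, rather than as a 400 from the backend once the conversation has already outgrown the real context size.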