Similar to how vLLM has the 'max_model_len' argument ('--max-model-len' on the CLI) when starting the server, can we have that here too?

This would help when trying to host models on smaller GPUs. For example, with vLLM, Mistral 7B with a 32k context doesn't fit on a single 24GB GPU, whereas with an 8k context it does.
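For reference, here is a minimal sketch of the vLLM side of this, using the offline `LLM` API (the serving CLI takes the same value via '--max-model-len'). The Mistral checkpoint name is just an illustrative choice:

```python
# Sketch of the existing vLLM behavior this request mirrors.
# The checkpoint below is an example; any model repo id works.
from vllm import LLM

# Capping the context at 8k shrinks the KV cache allocation,
# letting Mistral 7B fit on a single 24GB GPU, whereas the
# model's full 32k window does not.
llm = LLM(
    model="mistralai/Mistral-7B-Instruct-v0.2",
    max_model_len=8192,  # same effect as --max-model-len 8192 when serving
)

print(llm.generate("Hello, my name is")[0].outputs[0].text)
```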