Checklist
Motivation
It would be great to support this new model: https://huggingface.co/CohereForAI/c4ai-command-a-03-2025
What's special about this model is its unusual architecture, where some layers require sliding windows and some don't:
The model features three layers with sliding window attention (window size 4096) and RoPE for efficient local context modeling and relative positional encoding. A fourth layer uses global attention without positional embeddings, enabling unrestricted token interactions across the entire sequence.
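The interleaved pattern above can be sketched as follows. This is a hypothetical illustration, not code from any of the libraries mentioned; the constant names mirror the Hugging Face Cohere2 config fields (`sliding_window`, `sliding_window_pattern`), where a pattern of 4 is assumed to mean every fourth layer uses global attention.

```python
# Hypothetical sketch of the interleaved attention pattern: three sliding-window
# layers followed by one global-attention layer, repeating.
SLIDING_WINDOW = 4096        # window size for the local-attention layers
SLIDING_WINDOW_PATTERN = 4   # 3 local layers, then 1 global layer (assumption)

def layer_uses_sliding_window(layer_idx: int) -> bool:
    # Layers 0, 1, 2 are local (sliding window + RoPE); layer 3 is global
    # (no positional embeddings); then the pattern repeats.
    return (layer_idx + 1) % SLIDING_WINDOW_PATTERN != 0

pattern = ["local" if layer_uses_sliding_window(i) else "global" for i in range(8)]
print(pattern)
# ['local', 'local', 'local', 'global', 'local', 'local', 'local', 'global']
```

Any KV-cache design for this model therefore has to handle two kinds of layers with different memory behavior in the same forward pass.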
I've found a Cohere2ForCausalLM in this project already (sglang/python/sglang/srt/models/commandr.py, line 413 at commit 90532b7), but it appears to be a stub that is not implemented yet:

class Cohere2ForCausalLM(CohereForCausalLM):
I previously attempted to implement this model in TensorRT-LLM (NVIDIA/TensorRT-LLM#2912) but ultimately failed: they do not support sliding-window layers without forcing a cyclic KV cache, which breaks prefix caching, and the code that would need changing to fix it is missing. Extremely frustrating. Will there be better luck in this library?
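To illustrate why a cyclic KV cache conflicts with prefix caching, here is a toy sketch (not TensorRT-LLM code, purely an illustration with a made-up tiny window): once the ring buffer wraps, the earliest prefix entries are overwritten in place, so a previously cached prefix can no longer be found and reused byte-for-byte.

```python
# Toy ring-buffer KV cache: each position's KV entry is stored at
# position % WINDOW, overwriting older entries once the buffer wraps.
WINDOW = 4  # tiny sliding window purely for demonstration

def fill_cyclic_cache(tokens):
    cache = [None] * WINDOW
    for pos, tok in enumerate(tokens):
        cache[pos % WINDOW] = tok  # wrap-around overwrite
    return cache

prefix = [1, 2, 3]          # a prefix we would like to cache and reuse
full = prefix + [4, 5, 6]   # the same prefix followed by more tokens

print(fill_cyclic_cache(prefix))  # [1, 2, 3, None]  -- prefix intact
print(fill_cyclic_cache(full))    # [5, 6, 3, 4]     -- tokens 1 and 2 overwritten
```

After wrapping, the cache for the longer request no longer contains the prefix's entries in recoverable form, so a prefix-cache lookup against it cannot succeed; that is the structural conflict described above.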
Related resources
Transformers impl here: https://github.com/huggingface/transformers/blob/main/src/transformers/models/cohere2/modular_cohere2.py
vLLM impl here (note for some reason they merged the models and added the sliding window support for CohereForCausalLM): https://github.com/vllm-project/vllm/blob/61f412187d972a006aef1653bfe348aeaefb6a0b/vllm/model_executor/models/commandr.py#L336