I am using the LiteLLM Proxy (https://github.com/BerriAI/litellm) to create contextual embeddings. However, the cost is quite high. My understanding is that prompt caching is enabled by default in Azure OpenAI, with no additional code required to enable it.
We have multiple Azure OpenAI instances, and the LiteLLM Proxy acts as a load balancer, distributing chat and embedding requests across these instances. To benefit from prompt caching, repeated requests with the same prefix need to be routed to the same Azure OpenAI instance.
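For reference, a minimal sketch of the kind of proxy config I am using (the model names, endpoints, and key variables below are placeholders, not my exact values):

```yaml
model_list:
  # Two Azure OpenAI instances registered under the same logical model name;
  # the proxy load-balances requests between them.
  - model_name: gpt-4o
    litellm_params:
      model: azure/gpt-4o
      api_base: https://instance-1.openai.azure.com/
      api_key: os.environ/AZURE_API_KEY_1
  - model_name: gpt-4o
    litellm_params:
      model: azure/gpt-4o
      api_base: https://instance-2.openai.azure.com/
      api_key: os.environ/AZURE_API_KEY_2

router_settings:
  routing_strategy: simple-shuffle  # requests may land on either instance,
                                    # so cached prefixes are not reliably reused
```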
How can I configure the LiteLLM Proxy to ensure that requests are consistently sent to the same Azure OpenAI instance to leverage prompt caching effectively?
Alternatively, how can I modify my code so that we reduce the cost?