
I am using the LiteLLM Proxy (https://github.com/BerriAI/litellm) to create contextual embeddings. However, the cost is quite high. I understand that prompt caching is enabled by default in Azure OpenAI and requires no additional code to enable.

We have multiple Azure OpenAI instances, and the LiteLLM Proxy acts as a load balancer, distributing chat and embedding requests among these instances. To take advantage of prompt caching, repeated requests must be sent to the same Azure OpenAI instance.
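For context, a setup like the one described above is typically expressed in the LiteLLM Proxy's `config.yaml`, where several Azure deployments share one `model_name` and the proxy load-balances across them. The instance URLs, deployment names, and environment variable names below are illustrative, not taken from the actual setup:

```yaml
model_list:
  - model_name: gpt-4o                      # alias clients call
    litellm_params:
      model: azure/gpt-4o-deployment        # hypothetical deployment name
      api_base: https://instance-1.openai.azure.com/
      api_key: os.environ/AZURE_API_KEY_1
  - model_name: gpt-4o                      # same alias -> load-balanced
    litellm_params:
      model: azure/gpt-4o-deployment
      api_base: https://instance-2.openai.azure.com/
      api_key: os.environ/AZURE_API_KEY_2
```

Because both entries share the alias `gpt-4o`, the proxy may send two identical prompts to different instances, so neither instance's prompt cache gets a second hit.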

How can I configure the LiteLLM Proxy to ensure that requests are consistently sent to the same Azure OpenAI instance to leverage prompt caching effectively?

Alternatively, how can I modify the code to reduce the cost?

1 Answer


Since you are using LiteLLM as an LLM wrapper, you will need to implement your own cache-detection strategy on top of the prompt caching that Azure OpenAI performs in the backend. You will need to track at least the following information for each request in a vector store (e.g., FAISS, Hnswlib, PGVector, Chroma, Cosmos DB):

  • Instance ID or endpoint URL
  • Model deployment name
  • Prompt text
  • Response text
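The record tracked per request could be sketched as a simple Python data class; the field values and the plain list standing in for a real vector store (FAISS, Chroma, etc.) are illustrative only:

```python
from dataclasses import dataclass, field

@dataclass
class CacheRecord:
    # Which Azure OpenAI instance served the original request
    instance_id: str
    # Model deployment name on that instance
    deployment_name: str
    prompt_text: str
    response_text: str
    # Embedding of prompt_text; this is what the vector index searches over
    embedding: list[float] = field(default_factory=list)

# A trivial in-memory "store" standing in for a real vector database
store: list[CacheRecord] = []

store.append(CacheRecord(
    instance_id="https://instance-1.openai.azure.com/",  # hypothetical endpoint
    deployment_name="gpt-4o",
    prompt_text="Summarize this document ...",
    response_text="The document describes ...",
    embedding=[0.1, 0.9, 0.0],  # would come from an embedding model
))
```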

When a user sends a query, the application first searches the vector store for a sufficiently similar prompt text. On a cache hit, you can either:

  • Route the request to the Azure OpenAI instance that served the original prompt, so its prompt cache is reused, or
  • Return the cached "Response text" directly without calling Azure OpenAI at all
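The lookup step above can be sketched in plain Python. The cosine-similarity helper, the record shape, and the 0.9 threshold are all assumptions for illustration; in practice the embeddings would come from an embedding model and the search from the vector store's own API:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def lookup(query_embedding: list[float], records: list[dict], threshold: float = 0.9):
    """Return the closest cached record if it clears the similarity threshold."""
    best, best_sim = None, 0.0
    for rec in records:
        sim = cosine_similarity(query_embedding, rec["embedding"])
        if sim > best_sim:
            best, best_sim = rec, sim
    # On a hit, the caller can return best["response_text"] directly,
    # or re-route to best["instance_id"] to reuse that instance's prompt cache.
    return best if best is not None and best_sim >= threshold else None

records = [
    {"instance_id": "https://instance-1.openai.azure.com/",
     "response_text": "cached answer", "embedding": [1.0, 0.0, 0.0]},
    {"instance_id": "https://instance-2.openai.azure.com/",
     "response_text": "other answer", "embedding": [0.0, 1.0, 0.0]},
]

hit = lookup([0.99, 0.05, 0.0], records)   # very close to the first record
miss = lookup([0.5, 0.5, 0.5], records)    # not similar enough to any record
```

A `miss` means the application should call an Azure OpenAI instance as usual and then insert the new prompt, response, and instance ID into the store.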

This pattern is known as semantic caching; there is an article you may want to take a look at: https://techcommunity.microsoft.com/blog/azurearchitectureblog/optimize-azure-openai-applications-with-semantic-caching/4106867
