
I am using the LiteLLM Proxy (https://github.com/BerriAI/litellm) to create contextual embeddings. However, the cost is quite high. I understand that prompt caching is enabled by default in Azure OpenAI and requires no additional code to enable.

We have multiple Azure OpenAI instances, and the LiteLLM Proxy acts as a load balancer, distributing chat and embedding requests among these instances. To take advantage of prompt caching, repeated requests must be sent to the same Azure OpenAI instance.
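For context, a setup like the one described above is typically expressed in the LiteLLM Proxy's `config.yaml`, where several Azure deployments share one `model_name` and the proxy load-balances across them. The instance URLs, deployment names, and environment variable names below are illustrative, not taken from the actual setup:

```yaml
model_list:
  - model_name: gpt-4o                      # alias clients call
    litellm_params:
      model: azure/gpt-4o-deployment        # hypothetical deployment name
      api_base: https://instance-1.openai.azure.com/
      api_key: os.environ/AZURE_API_KEY_1
  - model_name: gpt-4o                      # same alias -> load-balanced
    litellm_params:
      model: azure/gpt-4o-deployment
      api_base: https://instance-2.openai.azure.com/
      api_key: os.environ/AZURE_API_KEY_2
```

Because both entries share the alias `gpt-4o`, the proxy may send two identical prompts to different instances, so neither instance's prompt cache gets a second hit.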

How can I configure the LiteLLM Proxy to ensure that requests are consistently sent to the same Azure OpenAI instance to leverage prompt caching effectively?

Alternatively, how can I modify the code to reduce the cost?

1 Answer


Since you are using LiteLLM as an LLM wrapper, you will need to implement your own cache-detection strategy on top of the prompt caching that Azure OpenAI performs in the backend. You will need to track at least the following information for each request in a vector store (e.g., FAISS, Hnswlib, PGVector, Chroma, Cosmos DB):

  • Instance ID or endpoint URL
  • Model deployment name
  • Prompt text
  • Response text
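The record tracked per request could be sketched as a simple Python data class; the field values and the plain list standing in for a real vector store (FAISS, Chroma, etc.) are illustrative only:

```python
from dataclasses import dataclass, field

@dataclass
class CacheRecord:
    # Which Azure OpenAI instance served the original request
    instance_id: str
    # Model deployment name on that instance
    deployment_name: str
    prompt_text: str
    response_text: str
    # Embedding of prompt_text; this is what the vector index searches over
    embedding: list[float] = field(default_factory=list)

# A trivial in-memory "store" standing in for a real vector database
store: list[CacheRecord] = []

store.append(CacheRecord(
    instance_id="https://instance-1.openai.azure.com/",  # hypothetical endpoint
    deployment_name="gpt-4o",
    prompt_text="Summarize this document ...",
    response_text="The document describes ...",
    embedding=[0.1, 0.9, 0.0],  # would come from an embedding model
))
```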

When a user sends a query, the application first searches the vector store for a sufficiently similar prompt text. On a cache hit, you can either:

  • Route the request to the Azure OpenAI instance that served the original prompt, so its prompt cache is reused, or
  • Return the cached "Response text" directly without calling Azure OpenAI at all
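The lookup step above can be sketched in plain Python. The cosine-similarity helper, the record shape, and the 0.9 threshold are all assumptions for illustration; in practice the embeddings would come from an embedding model and the search from the vector store's own API:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def lookup(query_embedding: list[float], records: list[dict], threshold: float = 0.9):
    """Return the closest cached record if it clears the similarity threshold."""
    best, best_sim = None, 0.0
    for rec in records:
        sim = cosine_similarity(query_embedding, rec["embedding"])
        if sim > best_sim:
            best, best_sim = rec, sim
    # On a hit, the caller can return best["response_text"] directly,
    # or re-route to best["instance_id"] to reuse that instance's prompt cache.
    return best if best is not None and best_sim >= threshold else None

records = [
    {"instance_id": "https://instance-1.openai.azure.com/",
     "response_text": "cached answer", "embedding": [1.0, 0.0, 0.0]},
    {"instance_id": "https://instance-2.openai.azure.com/",
     "response_text": "other answer", "embedding": [0.0, 1.0, 0.0]},
]

hit = lookup([0.99, 0.05, 0.0], records)   # very close to the first record
miss = lookup([0.5, 0.5, 0.5], records)    # not similar enough to any record
```

A `miss` means the application should call an Azure OpenAI instance as usual and then insert the new prompt, response, and instance ID into the store.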

This pattern is known as semantic caching; there is an article you may want to take a look at: https://techcommunity.microsoft.com/blog/azurearchitectureblog/optimize-azure-openai-applications-with-semantic-caching/4106867
