What happened?
When a streaming request exceeds the context window of a Bedrock model invoked through the Azure OpenAI endpoint on the proxy, the error response differs from what Azure OpenAI returns in ways that break some clients (see the log output below for the differences):
- The status code is 200 (Azure OpenAI model invocations return 400)
- The response is returned as text/event-stream (Azure OpenAI returns a single JSON response as application/json)
- The response body format is also slightly different: the error JSON is prefixed with "data: " as a server-sent event
This breaks interactions with the Azure OpenAI client SDK: it keeps waiting for subsequent streaming chunks instead of raising an error.
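Until the proxy behaviour changes, a client reading the stream directly could watch for an error payload itself. The following is only a minimal sketch of that idea, assuming the error arrives as a "data: " event containing a top-level "error" object, as in the log output in this report. The names iter_sse_chunks and StreamErrorPayload are hypothetical and are not part of LiteLLM or the Azure OpenAI SDK.

```python
import json


class StreamErrorPayload(Exception):
    """Hypothetical exception for an error delivered inside an SSE stream."""


def iter_sse_chunks(lines):
    """Yield parsed JSON chunks from SSE lines, raising on an error payload.

    `lines` is any iterable of decoded text lines from the response body.
    """
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and non-data fields
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            return  # conventional end-of-stream marker
        chunk = json.loads(payload)
        # Assumption: an error is a JSON object with a top-level "error" key.
        if isinstance(chunk, dict) and "error" in chunk:
            err = chunk["error"]
            raise StreamErrorPayload(f"{err.get('code')}: {err.get('message')}")
        yield chunk
```

With httpx, the lines could come from response.iter_lines() inside an httpx.stream(...) context. This only works around the symptom on the client side; the mismatch in status code and content type remains.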
This was tested on the following setup with the repro script below:
- LiteLLM Proxy v1.52.0
- Talking to Claude Sonnet 3.5 (v1/v2) via Bedrock
- The "good" example below was talking to Azure OpenAI for gpt-4o
import os
import httpx


def test_model(model):
    print(f"\nTesting {model}")
    print("=" * 40)

    messages = [
        {"role": "user", "content": "This is a long test message. " * 10000},
        {"role": "assistant", "content": "This is a long response message. " * 10000},
        {"role": "user", "content": "This is another long test message. " * 10000},
    ]

    response = httpx.post(
        f"{os.getenv('LITELLM_BASE_URL')}/openai/deployments/{model}/chat/completions?api-version=2024-09-01-preview",
        headers={
            "Authorization": f"Bearer {os.getenv('LITELLM_API_KEY')}",
            "Content-Type": "application/json"
        },
        json={
            "model": model,
            "messages": messages,
            "stream": True
        },
        timeout=None
    )

    print(f"Status: {response.status_code}")
    print(f"Content-Type: {response.headers.get('content-type')}")
    print(f"Content-Length: {response.headers.get('content-length')}")
    print(f"Transfer-Encoding: {response.headers.get('transfer-encoding')}")
    print(f"Body: {response.text}")


if not os.getenv('LITELLM_BASE_URL') or not os.getenv('LITELLM_API_KEY'):
    raise ValueError("Please set LITELLM_BASE_URL and LITELLM_API_KEY environment variables")

test_model("claude-3-5-sonnet-20240620-v1:0")
test_model("gpt-4o")
Relevant log output
Testing claude-3-5-sonnet-20240620-v1:0
========================================
Status: 200
Content-Type: text/event-stream; charset=utf-8
Content-Length: None
Transfer-Encoding: chunked
Body: data: {"error": {"message": "litellm.BadRequestError: litellm.ContextWindowExceededError: BedrockException: Context Window Error - Bad response code, expected 200: {'status_code': 400, 'headers': {':exception-type': 'validationException', ':content-type': 'application/json', ':message-type': 'exception'}, 'body': b'{\"message\":\"The model returned the following errors: Input is too long for requested model.\"}'}", "type": null, "param": null, "code": "400"}}
Testing gpt-4o
========================================
Status: 400
Content-Type: application/json
Content-Length: 658
Transfer-Encoding: None
Body: {"error":{"message":"litellm.BadRequestError: litellm.ContextWindowExceededError: AzureException ContextWindowExceededError - Error code: 400 - {'error': {'message': \"This model's maximum context length is 128000 tokens. However, your messages resulted in 210018 tokens. Please reduce the length of the messages.\", 'type': 'invalid_request_error', 'param': 'messages', 'code': 'context_length_exceeded'}}\nmodel=gpt-4o. context_window_fallbacks=None. fallbacks=None.\n\nSet 'context_window_fallback' - https://docs.litellm.ai/docs/routing#fallbacks\nReceived Model Group=gpt-4o\nAvailable Model Group Fallbacks=None","type":null,"param":null,"code":"400"}}
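The differences between the two responses above can be expressed as a small check. This is a rough sketch only, assuming the gpt-4o response shape (status 400, application/json, a single JSON document with a top-level "error" object) is the expected one. The function name azure_error_mismatches is hypothetical.

```python
import json


def azure_error_mismatches(status_code, content_type, body):
    """Return a list of ways an error response differs from the expected shape."""
    problems = []
    if status_code != 400:
        problems.append(f"status is {status_code}, expected 400")
    # Ignore parameters such as "; charset=utf-8" when comparing media types.
    media_type = (content_type or "").split(";")[0].strip().lower()
    if media_type != "application/json":
        problems.append(f"content type is {media_type!r}, expected 'application/json'")
    try:
        parsed = json.loads(body)
    except ValueError:
        problems.append("body is not a single JSON document")
    else:
        if not (isinstance(parsed, dict) and "error" in parsed):
            problems.append("body has no top-level 'error' object")
    return problems
```

Called with the values printed by the repro script (response.status_code, the content-type header, response.text), it should return an empty list for the gpt-4o case and a non-empty list for the Bedrock case.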
Twitter / LinkedIn details
No response