
Add support for OpenAI API parallel sampling #640

Merged
merrymercy merged 4 commits into `sgl-project:main` from `yichuan-w:openai_parallel_sampling` on Jul 20, 2024


@yichuan-w commented on Jul 17, 2024

Add support for OpenAI API parallel sampling:

  1. Support `request.n > 1` in the OpenAI API. The server first sends one prefill request to increase the cache hit rate, then asynchronously sends n decoding requests in parallel.
  2. Not supported: m prompts passed as a list in a single OpenAI API request.
  3. Not supported: `request.n > 1` together with streaming.
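Item 1 can be sketched with `asyncio` as follows. This is a minimal, self-contained illustration, not SGLang's actual `tokenizer_manager.py` logic: `FakeEngine`, its `generate` coroutine, and treating `max_new_tokens=0` as a prefill-only request are all hypothetical. The toy engine only records which prompts are already cached, so the effect of the warm-up request shows up in `cache_hits`.

```python
import asyncio
import random


class FakeEngine:
    """Hypothetical stand-in for an inference backend with a prefix cache."""

    def __init__(self) -> None:
        self.cached_prefixes: set[str] = set()
        self.cache_hits = 0

    async def generate(self, prompt: str, max_new_tokens: int, seed: int = 0) -> str:
        # A request whose prompt is already cached counts as a cache hit;
        # otherwise this request "computes" the prefix and caches it.
        if prompt in self.cached_prefixes:
            self.cache_hits += 1
        else:
            self.cached_prefixes.add(prompt)
        await asyncio.sleep(0)  # yield control, as a real GPU/network call would
        rng = random.Random(seed)
        return " ".join(f"tok{rng.randint(0, 9)}" for _ in range(max_new_tokens))


async def parallel_sample(engine: FakeEngine, prompt: str, n: int, max_tokens: int) -> list[str]:
    if n > 1:
        # Step 1: one prefill-only request (hypothetical max_new_tokens=0) warms the cache.
        await engine.generate(prompt, max_new_tokens=0)
    # Step 2: send n decoding requests concurrently; each reuses the cached prefix.
    tasks = [engine.generate(prompt, max_new_tokens=max_tokens, seed=i) for i in range(n)]
    return list(await asyncio.gather(*tasks))


if __name__ == "__main__":
    engine = FakeEngine()
    samples = asyncio.run(parallel_sample(engine, "Once upon a time", n=3, max_tokens=4))
    print(len(samples), engine.cache_hits)  # 3 samples; all 3 decodes hit the cache
```

The idea, as described in item 1, is to pay for one prefill up front so the n concurrent decodes can reuse the cached prompt. The toy engine does not model timing, so it only illustrates the order in which requests are sent.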

Example code:

```python
import openai

client = openai.Client(base_url="http://127.0.0.1:30000/v1", api_key="EMPTY")

# Text completion
response = client.completions.create(
    model="default",
    prompt="I am a robot and I want to study like humans. Now let's tell a story. Once upon a time, there was a little",
    n=3,
    temperature=0.8,
    max_tokens=32,
)
print(response)

# Chat completion
response = client.chat.completions.create(
    model="default",
    messages=[
        {"role": "system", "content": "You are a helpful AI assistant"},
        {"role": "user", "content": "List 3 countries and their capitals."},
    ],
    temperature=0.8,
    max_tokens=64,
    logprobs=True,
    n=3,
)
print(response)
```

Running `python sglang/examples/usage/openai_parallel_sample.py` prints:

Completion(id='6782cf18dd97421bbb72462eb404d93a', choices=[CompletionChoice(finish_reason='FINISH_LENGTH: 32', index=0, logprobs=None, text=' robot named Bob. Bob was designed to learn and grow, but he was stuck in a rut. He kept repeating the same tasks over and over again,'), CompletionChoice(finish_reason='FINISH_LENGTH: 32', index=1, logprobs=None, text=' robot named Robby who lived in a big factory with lots of other robots. Robby was very curious and wanted to learn new things, but the other'), CompletionChoice(finish_reason='FINISH_LENGTH: 32', index=2, logprobs=None, text=' robot named Robby. Robby was different from the other robots, for he had a big dream: to learn how to think and learn like a human')], created=1721238118, model='default', object='text_completion', system_fingerprint=None, usage=CompletionUsage(completion_tokens=96, prompt_tokens=30, total_tokens=126))
ChatCompletion(id='4e589636d6bb4cae9ce307f4e0be4203', choices=[Choice(finish_reason='FINISH_MATCHED_TOKEN: 2', index=0, logprobs=None, message=ChatCompletionMessage(content='  Of course! Here are three countries and their capitals:\n\n1. Country: France\nCapital: Paris\n2. Country: Japan\nCapital: Tokyo\n3. Country: Brazil\nCapital: Brasília', role='assistant', function_call=None, tool_calls=None)), Choice(finish_reason='FINISH_LENGTH: 64', index=1, logprobs=None, message=ChatCompletionMessage(content='  Of course! Here are three countries and their capitals:\n\n1. Country: France\nCapital: Paris\n2. Country: Japan\nCapital: Tokyo\n3. Country: Brazil\nCapital: Brasília\n\nI hope this helps! Let me know if you need anything else.', role='assistant', function_call=None, tool_calls=None)), Choice(finish_reason='FINISH_MATCHED_TOKEN: 2', index=2, logprobs=None, message=ChatCompletionMessage(content="  Of course, I'd be happy to help! Here are three countries and their capitals:\n\n1. Italy - Rome\n2. Japan - Tokyo\n3. Mexico - Mexico City", role='assistant', function_call=None, tool_calls=None))], created=1721238120, model='default', object='chat.completion', service_tier=None, system_fingerprint=None, usage=CompletionUsage(completion_tokens=155, prompt_tokens=38, total_tokens=193))

cc @Ying1123 @merrymercy @hnyls2002, and thanks to @hnyls2002 for the guidance.

@Ying1123 mentioned this pull request on Jul 17, 2024
@yichuan-w (author) commented:

It now also supports m prompts passed as a list in the OpenAI API, and each of the m prompts can use parallel sampling.
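With m prompts and `n` samples each, one request expands to m × n generations, which come back as a single flat `choices` list. The sketch below shows one way to assign choice indices, assuming prompt-major ordering (all samples of prompt 0, then all samples of prompt 1, and so on), so that `index = prompt_idx * n + sample_idx`. The `SubRequest` type and `expand_requests` function are hypothetical names used only to illustrate the indexing; they are not the PR's code.

```python
from typing import NamedTuple


class SubRequest(NamedTuple):
    choice_index: int  # position in the flat `choices` list of the response
    prompt_index: int  # which of the m prompts this generation belongs to
    sample_index: int  # which of the n samples for that prompt


def expand_requests(prompts: list[str], n: int) -> list[SubRequest]:
    """Expand m prompts x n samples into m * n sub-requests in prompt-major order."""
    if n < 1:
        raise ValueError("n must be >= 1")
    subs = []
    for p in range(len(prompts)):
        for s in range(n):
            subs.append(SubRequest(choice_index=p * n + s, prompt_index=p, sample_index=s))
    return subs


prompts = ["The name of the famous soccer player is ", "The capital of US is"]
for sub in expand_requests(prompts, n=3):
    # Indices 0-2 belong to prompt 0 and indices 3-5 to prompt 1.
    print(sub.choice_index, sub.prompt_index, sub.sample_index)
```

For the two-prompt, `n=3` request in the code below, this gives six sub-requests, which matches the six choices (indices 0 to 5) in the corresponding output.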

The following code exercises each case:

```python
import openai

client = openai.Client(base_url="http://127.0.0.1:30000/v1", api_key="EMPTY")

# Text completion: single prompt, n=1
response = client.completions.create(
    model="default",
    prompt="I am a robot and I want to study like humans. Now let's tell a story. Once upon a time, there was a little",
    n=1,
    temperature=0.8,
    max_tokens=32,
)
print(response)

# Text completion: single prompt, n=3 parallel samples
response = client.completions.create(
    model="default",
    prompt="I am a robot and I want to study like humans. Now let's tell a story. Once upon a time, there was a little",
    n=3,
    temperature=0.8,
    max_tokens=32,
)
print(response)

# Text completion: list of 2 prompts, n=1
response = client.completions.create(
    model="default",
    prompt=["The name of the famous soccer player is ", "The capital of US is"],
    n=1,
    temperature=0.8,
    max_tokens=32,
)
print(response)

# Text completion: list of 2 prompts, n=3 (2 x 3 = 6 choices)
response = client.completions.create(
    model="default",
    prompt=["The name of the famous soccer player is ", "The capital of US is"],
    n=3,
    temperature=0.8,
    max_tokens=32,
)
print(response)

# Chat completion: n=4
response = client.chat.completions.create(
    model="default",
    messages=[
        {"role": "system", "content": "You are a helpful AI assistant"},
        {"role": "user", "content": "List 3 countries and their capitals."},
    ],
    temperature=0.8,
    max_tokens=64,
    logprobs=True,
    n=4,
)
print(response)
```

Output:

Completion(id='0e8758bc5e964ffc893c78ec4a805484', choices=[CompletionChoice(finish_reason='FINISH_LENGTH: 32', index=0, logprobs=None, text=' robot named Bob. Bob was different from the other robots because he was curious about the world beyond his factory floor. He wanted to learn how to think like')], created=1721380457, model='default', object='text_completion', system_fingerprint=None, usage=CompletionUsage(completion_tokens=32, prompt_tokens=30, total_tokens=62))
Completion(id='1ca8cb8cd4b74613858935311ecdbefc', choices=[CompletionChoice(finish_reason='FINISH_LENGTH: 32', index=0, logprobs=None, text=' robot named Zeta. Zeta lived in a big factory where she helped make other robots. But Zeta had big dreams. She wanted to learn'), CompletionChoice(finish_reason='FINISH_LENGTH: 32', index=1, logprobs=None, text=' robot named Zeta. Zeta loved to learn and play with his robot friends, but he wanted to learn more. He wanted to learn like humans! So'), CompletionChoice(finish_reason='FINISH_LENGTH: 32', index=2, logprobs=None, text=' robot named Robby. Robby lived in a big factory where he did lots of work, but he was not happy. He wanted to learn more and be')], created=1721380458, model='default', object='text_completion', system_fingerprint=None, usage=CompletionUsage(completion_tokens=96, prompt_tokens=30, total_tokens=126))
Completion(id='b6f62139cb8148d6ac1868a932923790', choices=[CompletionChoice(finish_reason='FINISH_LENGTH: 32', index=0, logprobs=None, text=' Medi éllo Gór Mahrez. It is a good name for a soccer player because it is unique and easy to pronounce. It is also'), CompletionChoice(finish_reason='FINISH_LENGTH: 32', index=1, logprobs=None, text=' Washington DC, which stands for District of Columbia. It is located on the East Coast of the United States and is home to many national landmarks and institutions,')], created=1721380459, model='default', object='text_completion', system_fingerprint=None, usage=CompletionUsage(completion_tokens=64, prompt_tokens=11, total_tokens=75))
Completion(id='a4b18b086b1b4c9f8a6a59317444a53c', choices=[CompletionChoice(finish_reason='FINISH_LENGTH: 32', index=0, logprobs=None, text=' \n\nAnswer:\nThe famous soccer player is Lionel Messi.\nThe capital of the United States of America is Washington, D.C'), CompletionChoice(finish_reason='FINISH_LENGTH: 32', index=1, logprobs=None, text=' \nDirections: Read the text and answer the questions that follow.\n\nThe name of the famous soccer player is David Beckham. He'), CompletionChoice(finish_reason='FINISH_MATCHED_TOKEN: 2', index=2, logprobs=None, text='\n\nAnswer:\nThe famous soccer player is Lionel Messi, and the capital of the United States is Washington D.C.'), CompletionChoice(finish_reason='FINISH_LENGTH: 32', index=3, logprobs=None, text=' __________ \nAnswer:\nThe famous soccer player is Lionel Messi.\nThe capital of the United States is Washington D.C.'), CompletionChoice(finish_reason='FINISH_LENGTH: 32', index=4, logprobs=None, text='  Washington DC\nSoccer players from around the world travel to Brazil to compete in the FIFA World Cup, the most prestigious international soccer tournament'), CompletionChoice(finish_reason='FINISH_MATCHED_TOKEN: 2', index=5, logprobs=None, text=' \n\nAnswer:\nThe famous soccer player is Lionel Messi.\nThe capital of the United States is Washington D.C.')], created=1721380460, model='default', object='text_completion', system_fingerprint=None, usage=CompletionUsage(completion_tokens=189, prompt_tokens=17, total_tokens=206))
ChatCompletion(id='06ca8f81157c4f1887be08129621f84e', choices=[Choice(finish_reason='FINISH_MATCHED_TOKEN: 2', index=0, logprobs=None, message=ChatCompletionMessage(content='  Of course! Here are three countries and their capitals:\n\n1. Country: Brazil\nCapital: Brasília\n2. Country: China\nCapital: Beijing\n3. Country: Germany\nCapital: Berlin', role='assistant', function_call=None, tool_calls=None)), Choice(finish_reason='FINISH_MATCHED_TOKEN: 2', index=1, logprobs=None, message=ChatCompletionMessage(content='  Of course! Here are three countries and their capitals:\n\n1. Country: France\nCapital: Paris\n2. Country: Japan\nCapital: Tokyo\n3. Country: Brazil\nCapital: Brasília', role='assistant', function_call=None, tool_calls=None)), Choice(finish_reason='FINISH_MATCHED_TOKEN: 2', index=2, logprobs=None, message=ChatCompletionMessage(content='  Of course! Here are three countries and their capitals:\n\n1. Country: Japan - Capital: Tokyo\n2. Country: France - Capital: Paris\n3. Country: Brazil - Capital: Brasília', role='assistant', function_call=None, tool_calls=None)), Choice(finish_reason='FINISH_LENGTH: 64', index=3, logprobs=None, message=ChatCompletionMessage(content='  Of course! Here are three countries and their capitals:\n\n1. Country: France\nCapital: Paris\n2. Country: China\nCapital: Beijing\n3. Country: Brazil\nCapital: Brasília\n\nI hope that helps! Let me know if you need more', role='assistant', function_call=None, tool_calls=None))], created=1721380462, model='default', object='chat.completion', service_tier=None, system_fingerprint=None, usage=CompletionUsage(completion_tokens=210, prompt_tokens=38, total_tokens=248))

I will add support for the batch and model APIs in a separate PR.

Review thread on `python/sglang/srt/managers/tokenizer_manager.py` (outdated)
@yichuan-w (author) commented:

I fixed these lines of code.
I also rewrote some of the code so that the order of the returned outputs matches gpt-3.5-turbo-instruct.
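On the client side, the flat `choices` list from a multi-prompt request can be split back per prompt. The helper below is a hypothetical sketch that assumes prompt-major ordering, i.e. the choice with `index` i belongs to prompt `i // n`; the PR does not spell out the exact ordering, so check it against your server before relying on it. `types.SimpleNamespace` objects stand in for the `CompletionChoice` objects (which carry `index` and `text`) so the example runs without a server.

```python
from types import SimpleNamespace


def group_choices(choices, num_prompts: int, n: int) -> list[list[str]]:
    """Group completion choices by prompt, assuming index = prompt_idx * n + sample_idx."""
    if len(choices) != num_prompts * n:
        raise ValueError(f"expected {num_prompts * n} choices, got {len(choices)}")
    groups: list[list[str]] = [[] for _ in range(num_prompts)]
    # Sort by `index` first so the grouping does not depend on list order.
    for choice in sorted(choices, key=lambda c: c.index):
        groups[choice.index // n].append(choice.text)
    return groups


# Stand-ins for `response.choices` from client.completions.create(prompt=[...], n=2).
fake_choices = [
    SimpleNamespace(index=2, text="B0"),
    SimpleNamespace(index=0, text="A0"),
    SimpleNamespace(index=3, text="B1"),
    SimpleNamespace(index=1, text="A1"),
]
print(group_choices(fake_choices, num_prompts=2, n=2))  # [['A0', 'A1'], ['B0', 'B1']]
```

With a live server, the call would be `group_choices(response.choices, num_prompts=len(prompts), n=3)`.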

@merrymercy merged commit 49c5e0e into sgl-project:main on Jul 20, 2024