[IR][Pass] Refactor the fusion implementation #164
Merged
yaoyaoding merged 11 commits into hidet-org:main on Apr 7, 2023
Conversation
vadiklyutiy pushed a commit that referenced this pull request on Jul 22, 2024
Introduces a `SyncLLM` and `AsyncLLM` interface to interact with the LLM, closes #164.

### SyncLLM.generate

Takes in 1 or a list of n prompts, and 0, 1, or a list of n sampling parameters.

- If no sampling parameter is provided, greedy sampling is used.
- If 1 prompt and 1 sampling parameter is provided, the return is a single `SequenceOutput`.
- If a list of n prompts and 1 sampling parameter is provided, the sampling parameter is applied to all prompts and the return is a list of `SequenceOutput`.
- If a list of n prompts and a list of n sampling parameters is provided, the sampling parameters are applied respectively to each prompt.
- Any other configuration is invalid.

### AsyncLLM.generate

Takes in 1 prompt and 0 or 1 sampling parameters. The same default as the synchronous version applies if no sampling parameters are provided. _Without blocking_, returns an async iterator over `SequenceOutput`, which is updated with every token generated.

### Usage

Here's an example script to demonstrate the API.
```py
import asyncio
import random

from hidet.apps.llm import create_llm
from hidet.apps.llm.sampler import SamplingParams


async def _demo_async():
    llm = create_llm("meta-llama/Llama-2-7b-chat-hf", is_async=True)
    prompts = [
        "Hello, how are you?",
        "How do you feel about the current political climate?",
        "What is your favorite food?",
        "What is your favorite color?",
        "What is your favorite movie?",
        "What is your favorite book?",
        "What is your favorite song?",
        "What is your favorite animal?",
        "What is your favorite hobby?",
        "When is your birthday?",
    ]
    coros = []
    for prompt in prompts:
        async def f(prompt):
            await asyncio.sleep(random.randint(1, 60))
            print("Incoming request: ", prompt)
            params = SamplingParams(temperature=0.0, max_tokens=random.randint(10, 100))
            stream = llm.generate(prompt, sampling_params=params)
            final = None
            async for output in stream:
                # print(output.tokens)
                final = output
            print("=====")
            print("Completed request: ", prompt)
            print("Output: ", final.output_text)
            print("=====")
        coros.append(f(prompt))
    await asyncio.gather(*coros)


def demo_async():
    asyncio.run(_demo_async())


def demo_sync():
    llm = create_llm("meta-llama/Llama-2-7b-chat-hf", is_async=False)
    prompts = [
        "Hello, how are you?",
        "How do you feel about the current political climate?",
        "What is your favorite food?",
        "What is your favorite color?",
        "What is your favorite movie?",
        "What is your favorite book?",
        "What is your favorite song?",
        "What is your favorite animal?",
        "What is your favorite hobby?",
        "When is your birthday?",
    ]
    for output in llm.generate(prompts):
        print("=====")
        print("Completed request: ", output.prompt)
        print("Output: ", output.output_text)
        print("=====")


if __name__ == "__main__":
    demo_sync()
    # demo_async()
```

---------

Co-authored-by: Yaoyao Ding <dingyaoyao.cs@gmail.com>
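The prompt/parameter pairing rules above can be sketched as a small normalization helper. This is a hypothetical illustration, not hidet's actual implementation; the function name `normalize_args` and the `"greedy"` placeholder for the default sampling parameters are assumptions.

```python
# Hypothetical sketch of the SyncLLM.generate dispatch rules; names are
# illustrative only and do not come from the hidet codebase.
from typing import Any, List, Tuple, Union


def normalize_args(
    prompts: Union[str, List[str]],
    params: Union[None, Any, List[Any]],
) -> Tuple[bool, List[Tuple[str, Any]]]:
    """Pair prompts with sampling parameters: 1-to-1, n-to-1, or n-to-n.

    Returns (is_single, [(prompt, param), ...]); is_single tells the caller
    to unwrap the result list into a single output.
    """
    is_single = isinstance(prompts, str)
    prompt_list = [prompts] if is_single else list(prompts)
    if params is None:
        # No sampling parameter: fall back to greedy sampling for all prompts.
        param_list = ["greedy"] * len(prompt_list)
    elif not isinstance(params, list):
        # One sampling parameter: apply it to every prompt.
        param_list = [params] * len(prompt_list)
    else:
        # A list of parameters must pair 1:1 with a list of prompts.
        if is_single or len(params) != len(prompt_list):
            raise ValueError("invalid prompt/sampling-parameter combination")
        param_list = list(params)
    return is_single, list(zip(prompt_list, param_list))
```

The same helper makes the "any other configuration is invalid" rule explicit: a single prompt paired with a list of parameters, or mismatched list lengths, raise an error.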
vadiklyutiy pushed a commit that referenced this pull request on Jul 23, 2024 (same commit message as above)
vadiklyutiy pushed a commit that referenced this pull request on Dec 26, 2024 (same commit message as above)
This PR refactors the implementation of post-scheduling fusion.
Previously, we used a `TaskGraph` to store the sub-graph inside each `Task`.
After this refactor, we remove the `TaskGraph` attribute of `Task` and instead introduce a new kind of operator to represent the fused sub-graph. This allows us to support more kinds of fusion and makes the IR cleaner.
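The representational change can be pictured with a minimal sketch. These are hypothetical classes for illustration only, not hidet's actual `Task` or operator definitions; the `anchor`/`fused_tasks` field names are assumptions.

```python
# Illustrative sketch only: hypothetical classes showing the idea of moving
# the fused sub-graph out of Task and into a dedicated operator kind.
from dataclasses import dataclass, field
from typing import List


@dataclass
class Task:
    # After the refactor, a plain task no longer carries a nested
    # TaskGraph attribute describing a fused sub-graph.
    name: str


@dataclass
class FusedOperator:
    # A dedicated operator now represents the fused sub-graph, keeping
    # ordinary tasks free of fusion bookkeeping.
    anchor: Task                                            # task whose schedule is reused
    fused_tasks: List[Task] = field(default_factory=list)   # surrounding fused ops


fused = FusedOperator(anchor=Task("matmul"), fused_tasks=[Task("relu")])
```

Keeping the sub-graph in one operator kind, rather than on every `Task`, is what lets the pass support more fusion patterns without complicating the core IR.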
Some other updates: