
[IR][Pass] Refactor the fusion implementation#164

Merged
yaoyaoding merged 11 commits into hidet-org:main from yaoyaoding:refactor-fusion
Apr 7, 2023

Conversation


@yaoyaoding yaoyaoding commented Apr 7, 2023

This PR refactors the implementation of post-scheduling fusion.

Previously, we used a TaskGraph to store the sub-graph inside each Task.

After this refactor, we removed the TaskGraph attribute of Task and introduced a new kind of operator to represent the fused sub-graph. This allows us to support more kinds of fusion and makes the IR cleaner.

Some other updates:

  1. Each lib.so now contains a single operator, and the packed function is always named 'launch' (previously it was named after the task). This avoids passing the function name around and makes the code cleaner.
  2. The implement_cpu and implement_cuda methods of class Task can now return a list of IRModule, representing tunable schedules.
  3. The generation of the 'launch' function has moved into the pass list, so we no longer need to call add_packed_func to add it manually.
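To illustrate point 2, a task can now return several candidate schedules and let a tuner benchmark them and keep the fastest. The sketch below is a hypothetical stand-in, not hidet's actual `Task`/`IRModule` API; the class names, the `run` latency callback, and `pick_best` are all illustrative assumptions.

```python
# Hypothetical sketch of tunable schedules: implement_cuda returns a list of
# candidate modules, and a tuner keeps the fastest one. These classes are
# illustrative stand-ins, not hidet's actual Task/IRModule API.
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class IRModule:
    name: str
    run: Callable[[], float]  # stand-in: returns a measured latency in ms


class Task:
    def implement_cuda(self) -> List[IRModule]:
        # Return multiple tunable schedules instead of a single module.
        return [
            IRModule('tile_64x64', run=lambda: 1.8),
            IRModule('tile_128x64', run=lambda: 1.2),
            IRModule('tile_128x128', run=lambda: 1.5),
        ]


def pick_best(candidates: List[IRModule]) -> IRModule:
    # Benchmark every candidate schedule and keep the fastest.
    return min(candidates, key=lambda m: m.run())


best = pick_best(Task().implement_cuda())
print(best.name)  # the candidate with the lowest measured latency
```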

@yaoyaoding yaoyaoding merged commit 3cc75b6 into hidet-org:main Apr 7, 2023
@yaoyaoding yaoyaoding deleted the refactor-fusion branch April 7, 2023 18:22
vadiklyutiy pushed a commit that referenced this pull request Jul 22, 2024
Introduces a `SyncLLM` and `AsyncLLM` interface to interact with the
LLM, closes #164.

### SyncLLM.generate

Takes in 1 or a list of n prompts, and 0, 1, or a list of n sampling
parameters.
- If no sampling parameter is provided, greedy sampling is used.
- If 1 prompt and 1 sampling parameter is provided, the return is a
single `SequenceOutput`.
- If a list of n prompts and 1 sampling parameter is provided, the
sampling parameter is applied to all prompts and the return is a list of
`SequenceOutput`.
- If a list of n prompts and a list of n sampling parameters is
provided, the sampling parameters are applied respectively to each
prompt.
- Any other configuration is invalid.
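The dispatch rules above can be sketched as a small normalization helper. This is a hedged illustration of the rules only; `normalize`, `GREEDY`, and the dict-based sampling parameters are hypothetical names, not the actual hidet API.

```python
# Illustrative sketch of SyncLLM.generate's prompt/params dispatch rules.
# 'normalize' and 'GREEDY' are hypothetical names for this example only.
from typing import List, Optional, Tuple, Union

GREEDY = {"temperature": 0.0}  # stand-in for the default greedy sampling params


def normalize(
    prompts: Union[str, List[str]],
    params: Union[None, dict, List[dict]] = None,
) -> Tuple[List[str], List[dict], bool]:
    """Return (prompt list, per-prompt params, was_single_prompt)."""
    single = isinstance(prompts, str)
    prompt_list = [prompts] if single else prompts
    if params is None:
        # No sampling parameter: greedy sampling for every prompt.
        param_list = [GREEDY] * len(prompt_list)
    elif isinstance(params, dict):
        # One sampling parameter: applied to all prompts.
        param_list = [params] * len(prompt_list)
    elif len(params) == len(prompt_list):
        # n parameters for n prompts: paired element-wise.
        param_list = params
    else:
        # Any other configuration is invalid.
        raise ValueError("invalid prompt/params configuration")
    return prompt_list, param_list, single


ps, pr, single = normalize("Hello")           # 1 prompt, greedy default
print(len(ps), single)                        # -> 1 True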

### AsyncLLM.generate

Takes in 1 prompt and 0 or 1 sampling parameters. The same default from
the synchronous version applies if no sampling parameters are provided.
_Without blocking_, returns a async iterator over `SequenceOutput`,
which is updated with every token generated.

### Usage

Here's an example script to demonstrate the API.

```py
import asyncio
import random

from hidet.apps.llm import create_llm
from hidet.apps.llm.sampler import SamplingParams


async def _demo_async():
    llm = create_llm("meta-llama/Llama-2-7b-chat-hf", is_async=True)
    prompts = [
        "Hello, how are you?",
        "How do you feel about the current political climate?",
        "What is your favorite food?",
        "What is your favorite color?",
        "What is your favorite movie?",
        "What is your favorite book?",
        "What is your favorite song?",
        "What is your favorite animal?",
        "What is your favorite hobby?",
        "When is your birthday?",
    ]

    coros = []
    for prompt in prompts:
        async def f(prompt):
            await asyncio.sleep(random.randint(1, 60))
            print("Incoming request: ", prompt)
            params = SamplingParams(temperature=0.0, max_tokens=random.randint(10, 100))
            stream = llm.generate(prompt, sampling_params=params)
            final = None
            async for output in stream:
                # print(output.tokens)
                final = output
            print("=====")
            print("Completed request: ", prompt)
            print("Output: ", final.output_text)
            print("=====")
        coros.append(f(prompt))

    await asyncio.gather(*coros)


def demo_async():
    asyncio.run(_demo_async())


def demo_sync():
    llm = create_llm("meta-llama/Llama-2-7b-chat-hf", is_async=False)
    prompts = [
        "Hello, how are you?",
        "How do you feel about the current political climate?",
        "What is your favorite food?",
        "What is your favorite color?",
        "What is your favorite movie?",
        "What is your favorite book?",
        "What is your favorite song?",
        "What is your favorite animal?",
        "What is your favorite hobby?",
        "When is your birthday?",
    ]
    for output in llm.generate(prompts):
        print("=====")
        print("Completed request: ", output.prompt)
        print("Output: ", output.output_text)
        print("=====")


if __name__ == "__main__":
    demo_sync()
    # demo_async()
```

---------

Co-authored-by: Yaoyao Ding <dingyaoyao.cs@gmail.com>
vadiklyutiy pushed a commit that referenced this pull request Jul 23, 2024
Introduces a `SyncLLM` and `AsyncLLM` interface to interact with the
LLM, closes #164.

### SyncLLM.generate

Takes in 1 or a list of n prompts, and 0, 1, or a list of n sampling
parameters.
- If no sampling parameter is provided, greedy sampling is used.
- If 1 prompt and 1 sampling parameter is provided, the return is a
single `SequenceOutput`.
- If a list of n prompts and 1 sampling parameter is provided, the
sampling parameter is applied to all prompts and the return is a list of
`SequenceOutput`.
- If a list of n prompts and a list of n sampling parameters is
provided, the sampling parameters are applied respectively to each
prompt.
- Any other configuration is invalid.

### AsyncLLM.generate

Takes in 1 prompt and 0 or 1 sampling parameters. The same default from
the synchronous version applies if no sampling parameters are provided.
_Without blocking_, returns a async iterator over `SequenceOutput`,
which is updated with every token generated.

### Usage

Here's an example script to demonstrate the API.

```py
import asyncio
import random

from hidet.apps.llm import create_llm
from hidet.apps.llm.sampler import SamplingParams


async def _demo_async():
    llm = create_llm("meta-llama/Llama-2-7b-chat-hf", is_async=True)
    prompts = [
        "Hello, how are you?",
        "How do you feel about the current political climate?",
        "What is your favorite food?",
        "What is your favorite color?",
        "What is your favorite movie?",
        "What is your favorite book?",
        "What is your favorite song?",
        "What is your favorite animal?",
        "What is your favorite hobby?",
        "When is your birthday?",
    ]

    coros = []
    for prompt in prompts:
        async def f(prompt):
            await asyncio.sleep(random.randint(1, 60))
            print("Incoming request: ", prompt)
            params = SamplingParams(temperature=0.0, max_tokens=random.randint(10, 100))
            stream = llm.generate(prompt, sampling_params=params)
            final = None
            async for output in stream:
                # print(output.tokens)
                final = output
            print("=====")
            print("Completed request: ", prompt)
            print("Output: ", final.output_text)
            print("=====")
        coros.append(f(prompt))

    await asyncio.gather(*coros)


def demo_async():
    asyncio.run(_demo_async())


def demo_sync():
    llm = create_llm("meta-llama/Llama-2-7b-chat-hf", is_async=False)
    prompts = [
        "Hello, how are you?",
        "How do you feel about the current political climate?",
        "What is your favorite food?",
        "What is your favorite color?",
        "What is your favorite movie?",
        "What is your favorite book?",
        "What is your favorite song?",
        "What is your favorite animal?",
        "What is your favorite hobby?",
        "When is your birthday?",
    ]
    for output in llm.generate(prompts):
        print("=====")
        print("Completed request: ", output.prompt)
        print("Output: ", output.output_text)
        print("=====")


if __name__ == "__main__":
    demo_sync()
    # demo_async()
```

---------

Co-authored-by: Yaoyao Ding <dingyaoyao.cs@gmail.com>
vadiklyutiy pushed a commit that referenced this pull request Dec 26, 2024
Introduces a `SyncLLM` and `AsyncLLM` interface to interact with the
LLM, closes #164.

### SyncLLM.generate

Takes in 1 or a list of n prompts, and 0, 1, or a list of n sampling
parameters.
- If no sampling parameter is provided, greedy sampling is used.
- If 1 prompt and 1 sampling parameter is provided, the return is a
single `SequenceOutput`.
- If a list of n prompts and 1 sampling parameter is provided, the
sampling parameter is applied to all prompts and the return is a list of
`SequenceOutput`.
- If a list of n prompts and a list of n sampling parameters is
provided, the sampling parameters are applied respectively to each
prompt.
- Any other configuration is invalid.

### AsyncLLM.generate

Takes in 1 prompt and 0 or 1 sampling parameters. The same default from
the synchronous version applies if no sampling parameters are provided.
_Without blocking_, returns a async iterator over `SequenceOutput`,
which is updated with every token generated.

### Usage

Here's an example script to demonstrate the API.

```py
import asyncio
import random

from hidet.apps.llm import create_llm
from hidet.apps.llm.sampler import SamplingParams


async def _demo_async():
    llm = create_llm("meta-llama/Llama-2-7b-chat-hf", is_async=True)
    prompts = [
        "Hello, how are you?",
        "How do you feel about the current political climate?",
        "What is your favorite food?",
        "What is your favorite color?",
        "What is your favorite movie?",
        "What is your favorite book?",
        "What is your favorite song?",
        "What is your favorite animal?",
        "What is your favorite hobby?",
        "When is your birthday?",
    ]

    coros = []
    for prompt in prompts:
        async def f(prompt):
            await asyncio.sleep(random.randint(1, 60))
            print("Incoming request: ", prompt)
            params = SamplingParams(temperature=0.0, max_tokens=random.randint(10, 100))
            stream = llm.generate(prompt, sampling_params=params)
            final = None
            async for output in stream:
                # print(output.tokens)
                final = output
            print("=====")
            print("Completed request: ", prompt)
            print("Output: ", final.output_text)
            print("=====")
        coros.append(f(prompt))

    await asyncio.gather(*coros)


def demo_async():
    asyncio.run(_demo_async())


def demo_sync():
    llm = create_llm("meta-llama/Llama-2-7b-chat-hf", is_async=False)
    prompts = [
        "Hello, how are you?",
        "How do you feel about the current political climate?",
        "What is your favorite food?",
        "What is your favorite color?",
        "What is your favorite movie?",
        "What is your favorite book?",
        "What is your favorite song?",
        "What is your favorite animal?",
        "What is your favorite hobby?",
        "When is your birthday?",
    ]
    for output in llm.generate(prompts):
        print("=====")
        print("Completed request: ", output.prompt)
        print("Output: ", output.output_text)
        print("=====")


if __name__ == "__main__":
    demo_sync()
    # demo_async()
```

---------

Co-authored-by: Yaoyao Ding <dingyaoyao.cs@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant