
[BUG] Make InferenceModule enable_cuda_graph more flexible. #2717

Description

@tchaton

Describe the bug

The following snippet breaks with enable_cuda_graph=True because the entire StableDiffusionPipeline cannot be captured under a single CUDA graph: its call runs the scheduler loop and other host-side Python logic that graph capture cannot record.

import os

import diffusers
import torch

import deepspeed

hf_auth_key = os.getenv("HF_AUTH_KEY")
if not hf_auth_key:
    raise ValueError("HF_AUTH_KEY is not set")

pipe = diffusers.StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    use_auth_token=hf_auth_key,
    torch_dtype=torch.float16,
    revision="fp16",
)

print(pipe)
pipe = deepspeed.init_inference(pipe.to("cuda"), dtype=torch.float16, enable_cuda_graph=True)

However, introducing a global CUDA graph alongside local ones through the policies (DSUNet, etc.) does work, and gains an extra 100-150 ms, as in the subclass below.

import time

import torch
from deepspeed.inference.engine import InferenceEngine as _InferenceEngine


class InferenceEngine(_InferenceEngine):

    def __init__(self, *args, enable_cuda_graph_global: bool = False, **kwargs):
        super().__init__(*args, **kwargs)
        self.enable_cuda_graph_global = enable_cuda_graph_global

    def forward(self, *inputs, **kwargs):
        """Execute forward propagation.

        Arguments:
            *inputs: Variable length input list
            **kwargs: variable length keyword arguments
        """
        start = None
        if self.model_profile_enabled and self.enable_cuda_graph_global:
            torch.cuda.synchronize()
            start = time.time()

        if self.enable_cuda_graph_global:
            # Capture the whole module into a CUDA graph once, then replay it
            # on every subsequent call.
            if self.cuda_graph_created:
                outputs = self._graph_replay(*inputs, **kwargs)
            else:
                self._create_cuda_graph(*inputs, **kwargs)
                outputs = self._graph_replay(*inputs, **kwargs)
        else:
            # Fall back to eager execution; local graphs can still be applied
            # inside the module through the injection policies.
            outputs = self.module(*inputs, **kwargs)

        if self.model_profile_enabled and self.enable_cuda_graph_global:
            torch.cuda.synchronize()
            duration = time.time() - start
            self._model_times.append(duration)

        return outputs
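
Here _create_cuda_graph and _graph_replay are inherited from DeepSpeed's InferenceEngine. For readers unfamiliar with them, below is a minimal sketch of what such helpers can look like with torch.cuda.CUDAGraph. This is a hypothetical CUDAGraphHelpers mixin, not DeepSpeed's actual implementation; it assumes static shapes and tensor-only inputs.

import torch


class CUDAGraphHelpers:
    """Hypothetical sketch of graph capture/replay; DeepSpeed's
    InferenceEngine ships its own versions of these methods."""

    def _create_cuda_graph(self, *inputs, **kwargs):
        # Warm up on a side stream so capture starts from a clean state.
        s = torch.cuda.Stream()
        s.wait_stream(torch.cuda.current_stream())
        with torch.cuda.stream(s):
            for _ in range(3):
                self.module(*inputs, **kwargs)
        torch.cuda.current_stream().wait_stream(s)

        # Capture one forward pass into a CUDA graph over static buffers.
        self.static_inputs = inputs
        self.static_kwargs = kwargs
        self._cuda_graph = torch.cuda.CUDAGraph()
        with torch.cuda.graph(self._cuda_graph):
            self.static_output = self.module(*self.static_inputs, **self.static_kwargs)
        self.cuda_graph_created = True

    def _graph_replay(self, *inputs, **kwargs):
        # Copy fresh inputs into the captured buffers, then replay the graph.
        for static, new in zip(self.static_inputs, inputs):
            if torch.is_tensor(static):
                static.copy_(new)
        for key, value in kwargs.items():
            if torch.is_tensor(self.static_kwargs.get(key)):
                self.static_kwargs[key].copy_(value)
        self._cuda_graph.replay()
        return self.static_output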

You can find the full code here: https://github.com/Lightning-AI/stablediffusion/blob/lit/ldm/deepspeed_replace.py#L34
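
The "local" graphs mentioned above apply the same capture/replay pattern to a single submodule. A hypothetical sketch, reusing the mixin from above; it assumes the wrapped module is called with fixed shapes, and note that in diffusers the UNet returns a dataclass, so non-tensor outputs would need extra handling in a real wrapper:

class CUDAGraphWrapper(CUDAGraphHelpers, torch.nn.Module):
    """Hypothetical local graph: capture one submodule, run the rest eagerly."""

    def __init__(self, module):
        super().__init__()
        self.module = module
        self.cuda_graph_created = False

    def forward(self, *inputs, **kwargs):
        if not self.cuda_graph_created:
            self._create_cuda_graph(*inputs, **kwargs)
        return self._graph_replay(*inputs, **kwargs)


# Hypothetical usage: graph only the UNet, keep the scheduler loop eager.
pipe.unet = CUDAGraphWrapper(pipe.unet)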

