Note
Also reported at ROCm/rocm-libraries#1860
Problem Description
Using a gfx1200 GPU, the first image generation in Stable Diffusion runs quickly (around 2.5 it/s), but at the end, during the VAE decode stage, it crashes the GPU driver and occasionally prints OOM errors in the console.
I’m using average/default generation parameters - basically every UI’s base values (1024×1024 resolution, 20 steps, Euler a or DPM++ 2M Karras, etc.).
I’ve tried the well-known UIs:
ComfyUI, SD.Next, Stable Diffusion WebUI reForge, etc., and they all behave the same.
Subsequent generations usually work, but if I change the resolution to anything else, the problem repeats.
For example, the krita-ai-diffusion plugin for Krita triggers the issue every single time, because it frequently changes the resolution and other parameters between generations. This doesn’t seem like acceptable behavior.
I’ve tried every flag and workaround I could think of, for example in Comfy:
--use-pytorch-cross-attention, --disable-smart-memory, --reserve-vram 8, --fp16-vae, --bf16-vae, the tiled VAE node, etc., but nothing helps.
Other users and I found that disabling MIOpen entirely, by hard-coding torch.backends.cudnn.enabled = False in the launch script, generally prevents the driver crashes and OOM errors, but it’s just a workaround, not a real solution.
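For clarity, a minimal sketch of that workaround (placed near the top of the UI's launch script, before any model is loaded; on ROCm builds of PyTorch, the torch.backends.cudnn interface is what controls MIOpen):

```python
# Workaround sketch: disable the cuDNN/MIOpen backend so PyTorch falls back
# to its non-MIOpen convolution implementations. This avoids the driver
# crash at VAE decode but costs some performance.
import torch

torch.backends.cudnn.enabled = False
```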
Operating System
Windows 11
CPU
Intel Core i5
GPU
AMD Radeon RX 9060 XT 16 GB
ROCm Version
7.0.0rc20250917
Steps to Reproduce
Install the latest wheels:
python -m pip install --index-url https://rocm.nightlies.amd.com/v2/gfx120X-all/ torch torchvision torchaudio
Then open any SD UI, and generate an image.
(I'm using the TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL=1 environment variable every time to enable AOTriton on the gfx1200.)
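For reference, a sketch of the launch sequence (shown in POSIX shell syntax for illustration; on Windows cmd the equivalent is `set VAR=value`; `main.py` is assumed to be ComfyUI's entry point, adjust for other UIs):

```shell
# Enable the experimental AOTriton path for gfx1200, then launch the UI.
export TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL=1
python main.py
```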
Additional Information
I’ve seen other AMD users mention this VAE issue in several other places online.