Checklist
Describe the bug

While running the Llama 3.1 Instruct eval script from the Reproduction section against a local sglang 0.2.15 server, the decode loop hits a device-side assertion in PyTorch's MultinomialKernel.cu and the server process is then killed:
[11:19:46 TP0] Decode batch. #running-req: 36, #token: 14473, token usage: 0.03, gen throughput (token/s): 2283.73, #queue-req: 0
../aten/src/ATen/native/cuda/MultinomialKernel.cu:112: binarySearchForMultinomial: block: [0,31,0], thread: [0,0,0] Assertion `cumdist[size - 1] > static_cast<scalar_t>(0)` failed.
../aten/src/ATen/native/cuda/MultinomialKernel.cu:112: binarySearchForMultinomial: block: [0,31,0], thread: [1,0,0] Assertion `cumdist[size - 1] > static_cast<scalar_t>(0)` failed.
[11:19:46 TP0] Exception in ModelTpServer:
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/sglang/srt/managers/tp_worker.py", line 244, in exposed_step
self.forward_step()
File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/sglang/srt/managers/tp_worker.py", line 273, in forward_step
self.forward_decode_batch(self.running_batch)
File "/usr/local/lib/python3.10/dist-packages/sglang/srt/managers/tp_worker.py", line 685, in forward_decode_batch
sample_output, logits_output = self.model_runner.forward(
File "/usr/local/lib/python3.10/dist-packages/sglang/srt/model_executor/model_runner.py", line 582, in forward
return self.forward_decode(batch)
File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/sglang/srt/model_executor/model_runner.py", line 528, in forward_decode
return self.cuda_graph_runner.replay(batch)
File "/usr/local/lib/python3.10/dist-packages/sglang/srt/model_executor/cuda_graph_runner.py", line 315, in replay
torch.cuda.synchronize()
File "/usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py", line 892, in synchronize
return torch._C._cuda_synchronize()
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
[11:19:46 TP0] Exception in ControllerSingle:
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/sglang/srt/managers/controller_single.py", line 165, in start_controller_process
controller.loop_for_forward()
File "/usr/local/lib/python3.10/dist-packages/sglang/srt/managers/controller_single.py", line 102, in loop_for_forward
out_pyobjs = self.tp_server.exposed_step(recv_reqs)
File "/usr/local/lib/python3.10/dist-packages/sglang/srt/managers/tp_worker.py", line 244, in exposed_step
self.forward_step()
File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/sglang/srt/managers/tp_worker.py", line 273, in forward_step
self.forward_decode_batch(self.running_batch)
File "/usr/local/lib/python3.10/dist-packages/sglang/srt/managers/tp_worker.py", line 685, in forward_decode_batch
sample_output, logits_output = self.model_runner.forward(
File "/usr/local/lib/python3.10/dist-packages/sglang/srt/model_executor/model_runner.py", line 582, in forward
return self.forward_decode(batch)
File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/sglang/srt/model_executor/model_runner.py", line 528, in forward_decode
return self.cuda_graph_runner.replay(batch)
File "/usr/local/lib/python3.10/dist-packages/sglang/srt/model_executor/cuda_graph_runner.py", line 315, in replay
torch.cuda.synchronize()
File "/usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py", line 892, in synchronize
return torch._C._cuda_synchronize()
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
Killed
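For context, the assertion that fires here lives in PyTorch's multinomial sampling kernel and checks that each probability row handed to torch.multinomial has positive total mass. A minimal sketch of the condition the kernel rejects (an illustration under that assumption, not a confirmed root cause inside sglang) is:

import torch

# An all-zero (or NaN-contaminated) probability row leaves cumdist[size - 1] <= 0,
# which is exactly what the MultinomialKernel.cu assert guards against.
probs = torch.zeros(1, 8, device="cuda")
torch.multinomial(probs, num_samples=1, replacement=True)
torch.cuda.synchronize()  # the device-side assert surfaces at the next sync, as in the traceback above

Re-running with CUDA_LAUNCH_BLOCKING=1, as the error message suggests, should attribute the assert to the sampling call itself rather than to the later torch.cuda.synchronize() in cuda_graph_runner.replay.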
Reproduction
# sglang 0.2.15
pip install --upgrade pip
pip install "sglang[all]"
# Install FlashInfer CUDA kernels
pip install flashinfer -i https://flashinfer.ai/whl/cu121/torch2.4/
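The eval script below talks to an OpenAI-compatible server at http://127.0.0.1:30000/v1/, so launch sglang first. A typical launch for the 8B run (default flags, shown here as an assumption; adjust --model-path / parallelism as needed):

# Assumed server launch; exact flags were not recorded in this issue
python3 -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-8B-Instruct --port 30000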
import argparse
import asyncio
import os
import pickle
import re
from collections import defaultdict
import openai
import transformers
from datasets import load_dataset
from openai import AsyncOpenAI
from tenacity import (
retry,
retry_if_exception_type,
stop_after_attempt,
wait_exponential,
)
from tqdm import tqdm
# Mapping backends to their clients and models
backend_to_models = {
"sglang": {
"8b": "meta-llama/Meta-Llama-3.1-8B-Instruct",
"70b": "meta-llama/Meta-Llama-3.1-70B-Instruct",
"405b": "meta-llama/Meta-Llama-3.1-405B-Instruct",
},
}
# Define the retry strategy
retry_strategy = retry(
stop=stop_after_attempt(5), # Stop after 5 attempts
wait=wait_exponential(multiplier=1, min=4, max=10), # Exponential backoff
retry=retry_if_exception_type(Exception), # Retry on any exception
)
# Define the fetch_responses function with retry strategy
@retry_strategy
async def fetch_responses(
client, prompt, semaphore, index, backend, model_size, output_dir, max_tokens
):
output_file = os.path.join(output_dir, f"response_{index}.pkl")
if os.path.exists(output_file):
print(f"File {output_file} already exists, skipping.")
return
async with semaphore:
response = await client.completions.create(
model=backend_to_models[backend][model_size],
prompt=prompt,
temperature=0.0,
max_tokens=max_tokens,
)
        if isinstance(response, openai.BadRequestError):
            with open(output_file, "wb") as f:
                pickle.dump("bad_response", f)
            return  # a rejected request is recorded as "bad_response"; skip the assert below
        assert isinstance(response, openai.types.completion.Completion)
# Save response to a file
with open(output_file, "wb") as f:
pickle.dump(response, f)
TASK_TO_MAX_TOKENS = {
"evals__mmlu__details": 1,
"evals__mmlu__0_shot__cot__details": 1024,
    # Official Meta eval uses 1024 here, but a small percentage (.05) of questions are answered correctly only after relaxing the limit
"evals__mmlu_pro__details": 2048,
"evals__gsm8k__details": 1024,
}
def get_client(backend):
return {
"sglang": AsyncOpenAI(base_url="http://127.0.0.1:30000/v1/"),
}[backend]
async def run_benchmark(args):
ds = load_dataset(
"meta-llama/Meta-Llama-3.1-405B-Instruct-evals",
f"Meta-Llama-3.1-405B-Instruct-{args.task}",
)
    semaphore = asyncio.Semaphore(args.concurrency)  # Limit in-flight requests to --concurrency
if args.num_examples is None:
args.num_examples = len(ds["latest"]["input_final_prompts"])
prompts = ds["latest"]["input_final_prompts"][: args.num_examples]
# Create the output directory if it does not exist
os.makedirs(args.output_dir, exist_ok=True)
tasks = []
# Create the tasks with tqdm progress bar
max_tokens = TASK_TO_MAX_TOKENS[args.task]
client = get_client(args.backend)
for idx, prompt in enumerate(tqdm(prompts, desc="Creating tasks")):
tasks.append(
asyncio.create_task(
fetch_responses(
client,
f"<|begin_of_|text|>{prompt[0]}",
semaphore,
idx,
args.backend,
args.model_size,
args.output_dir,
max_tokens=max_tokens,
)
)
)
# Run the tasks with tqdm progress bar
for future in tqdm(
asyncio.as_completed(tasks), total=len(tasks), desc="Processing tasks"
):
await future
def get_mmlu_answer(response):
if response is not None:
return response.choices[0].text.lstrip().rstrip().upper().replace(".", "")
return None
def get_mmlu_cot_answer(response):
pattern = r"The best answer is (.+)\.?"
match = re.search(pattern, response.choices[0].text)
if match:
return match.group(1).replace(".", "").replace("*", "")
pattern = r"the best answer is (.+)\.?"
match = re.search(pattern, response.choices[0].text)
if match:
return match.group(1).replace(".", "")
pattern = r"The correct answer is (.+)\.?"
match = re.search(pattern, response.choices[0].text)
if match:
return match.group(1).replace(".", "")
pattern = r"the correct answer is (.+)\.?"
match = re.search(pattern, response.choices[0].text)
if match:
return match.group(1).replace(".", "")
def get_answer_gsm8k(response):
pattern = r"The final answer is (.+)\.?"
match = re.search(pattern, response.choices[0].text)
if match:
s = match.group(1)
for ok_symbol in ["%", "$"]:
s = s.replace(ok_symbol, "")
return s
TASK_TO_ANSWER_EXTRACTOR = {
"evals__mmlu__details": get_mmlu_answer,
"evals__mmlu__0_shot__cot__details": get_mmlu_cot_answer,
"evals__gsm8k__details": get_answer_gsm8k,
"evals__mmlu_pro__details": get_mmlu_cot_answer,
}
def get_dataset_from_task(task, response_path):
ds_405b = load_dataset(
f"meta-llama/Meta-Llama-3.1-405B-Instruct-evals",
f"Meta-Llama-3.1-405B-Instruct-{task}",
)
ds_405b_hash_order = [x[0] for x in ds_405b["latest"]["input_final_prompts_hash"]]
if "70b" in str(response_path) or "8b" in str(response_path):
if "70" in str(response_path):
ref_model_ds = load_dataset(
f"meta-llama/Meta-Llama-3.1-70B-Instruct-evals",
f"Meta-Llama-3.1-70B-Instruct-{task}",
)
else:
ref_model_ds = load_dataset(
f"meta-llama/Meta-Llama-3.1-8B-Instruct-evals",
f"Meta-Llama-3.1-8B-Instruct-{task}",
)
hash_to_row = {}
for row in ref_model_ds["latest"]:
hash_to_row[row["input_final_prompts_hash"][0]] = row
reordered_rows = []
for prompt_hash in ds_405b_hash_order:
reordered_rows.append(hash_to_row[prompt_hash])
ref_model_ds["latest"] = reordered_rows
return ref_model_ds
return ds_405b
def analyze_answers(task, response_path):
ds = get_dataset_from_task(task, response_path)
responses = []
total = len(ds["latest"])
for i in range(0, total):
response = pickle.load(
open(os.path.join(response_path, f"response_{i}.pkl"), "rb")
)
responses.append(response)
from dataclasses import dataclass
@dataclass
class Stats:
correct: int = 0
total: int = 0
meta_correct: int = 0
average: float = None
subtask_name_to_stats = defaultdict(lambda: Stats())
for response, ds_row in zip(responses, ds["latest"]):
model_answer = TASK_TO_ANSWER_EXTRACTOR[task](response)
subtask = ds_row["subtask_name"]
is_eval_correct = model_answer in ds_row["input_correct_responses"]
if is_eval_correct:
subtask_name_to_stats[subtask].correct += 1
if ds_row["is_correct"]:
subtask_name_to_stats[subtask].meta_correct += 1
subtask_name_to_stats[subtask].total += 1
micro_stats = Stats()
for subtask, stats in subtask_name_to_stats.items():
stats.average = stats.correct / stats.total
stats.meta_average = stats.meta_correct / stats.total
micro_stats.correct += stats.correct
micro_stats.total += stats.total
micro_stats.meta_correct += stats.meta_correct
micro_stats.average = micro_stats.correct / micro_stats.total
micro_stats.meta_average = micro_stats.meta_correct / micro_stats.total
import numpy as np
print("Macro average", np.mean([x.average for x in subtask_name_to_stats.values()]))
print(
"Meta Macro average",
np.mean([x.meta_average for x in subtask_name_to_stats.values()]),
)
print("Micro average", micro_stats.average)
print("Meta Micro average", micro_stats.meta_average)
# Entry point for the script
if __name__ == "__main__":
parser = argparse.ArgumentParser(
description="Script to run model with specified parameters."
)
parser.add_argument(
"--model-size",
type=str,
required=True,
help="Size of the model (e.g., 8b or 70b)",
)
parser.add_argument(
"--backend", type=str, required=True, help="Backend name (e.g., sglang)"
)
parser.add_argument("--task", type=str, required=True)
parser.add_argument(
"--num-examples", type=int, default=None, help="Number of examples to process"
)
parser.add_argument("--concurrency", type=int, default=128)
parser.add_argument(
"--output-dir", type=str, required=True, help="Directory to save responses"
)
os.environ['OPENAI_API_KEY'] = 'EMPTY'
args = parser.parse_args()
asyncio.run(run_benchmark(args))
analyze_answers(args.task, args.output_dir)
python3 eval.py --model-size 8b --backend sglang --task evals__gsm8k__details --output-dir tmp/8b
Environment
Python: 3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0]
CUDA available: True
GPU 0: NVIDIA A100 80GB PCIe
GPU 0 Compute Capability: 8.0
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 12.1, V12.1.105
CUDA Driver Version: 550.90.07
PyTorch: 2.4.0+cu121
sglang: 0.2.15
flashinfer: 0.1.6+cu121torch2.4
triton: 3.0.0
transformers: 4.44.2
requests: 2.32.3
tqdm: 4.66.5
numpy: 1.26.3
aiohttp: 3.10.5
fastapi: 0.112.2
hf_transfer: 0.1.8
huggingface_hub: 0.24.6
interegular: 0.3.3
packaging: 23.2
PIL: 10.2.0
psutil: 5.9.8
pydantic: 2.8.2
uvicorn: 0.30.6
uvloop: 0.20.0
zmq: 24.0.1
vllm: 0.5.5
multipart: 0.0.9
openai: 1.43.0
anthropic: 0.34.1
NVIDIA Topology:
GPU0 NIC0 CPU Affinity NUMA Affinity GPU NUMA ID
GPU0 X NODE 0-31,64-95 0 N/A
NIC0 NODE X
Legend:
X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks
NIC Legend:
NIC0: mlx5_bond_0
ulimit soft: 1048576