Checklist
Describe the bug

While running the Llama 3.1 Instruct eval script from the Reproduction section against a local sglang 0.2.15 server, the decode loop hits a device-side assertion in PyTorch's MultinomialKernel.cu and the server process is then killed:
[11:19:46 TP0] Decode batch. #running-req: 36, #token: 14473, token usage: 0.03, gen throughput (token/s): 2283.73, #queue-req: 0
../aten/src/ATen/native/cuda/MultinomialKernel.cu:112: binarySearchForMultinomial: block: [0,31,0], thread: [0,0,0] Assertion `cumdist[size - 1] > static_cast<scalar_t>(0)` failed.
../aten/src/ATen/native/cuda/MultinomialKernel.cu:112: binarySearchForMultinomial: block: [0,31,0], thread: [1,0,0] Assertion `cumdist[size - 1] > static_cast<scalar_t>(0)` failed.
[11:19:46 TP0] Exception in ModelTpServer:
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/sglang/srt/managers/tp_worker.py", line 244, in exposed_step
self.forward_step()
File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/sglang/srt/managers/tp_worker.py", line 273, in forward_step
self.forward_decode_batch(self.running_batch)
File "/usr/local/lib/python3.10/dist-packages/sglang/srt/managers/tp_worker.py", line 685, in forward_decode_batch
sample_output, logits_output = self.model_runner.forward(
File "/usr/local/lib/python3.10/dist-packages/sglang/srt/model_executor/model_runner.py", line 582, in forward
return self.forward_decode(batch)
File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/sglang/srt/model_executor/model_runner.py", line 528, in forward_decode
return self.cuda_graph_runner.replay(batch)
File "/usr/local/lib/python3.10/dist-packages/sglang/srt/model_executor/cuda_graph_runner.py", line 315, in replay
torch.cuda.synchronize()
File "/usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py", line 892, in synchronize
return torch._C._cuda_synchronize()
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
[11:19:46 TP0] Exception in ControllerSingle:
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/sglang/srt/managers/controller_single.py", line 165, in start_controller_process
controller.loop_for_forward()
File "/usr/local/lib/python3.10/dist-packages/sglang/srt/managers/controller_single.py", line 102, in loop_for_forward
out_pyobjs = self.tp_server.exposed_step(recv_reqs)
File "/usr/local/lib/python3.10/dist-packages/sglang/srt/managers/tp_worker.py", line 244, in exposed_step
self.forward_step()
File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/sglang/srt/managers/tp_worker.py", line 273, in forward_step
self.forward_decode_batch(self.running_batch)
File "/usr/local/lib/python3.10/dist-packages/sglang/srt/managers/tp_worker.py", line 685, in forward_decode_batch
sample_output, logits_output = self.model_runner.forward(
File "/usr/local/lib/python3.10/dist-packages/sglang/srt/model_executor/model_runner.py", line 582, in forward
return self.forward_decode(batch)
File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/sglang/srt/model_executor/model_runner.py", line 528, in forward_decode
return self.cuda_graph_runner.replay(batch)
File "/usr/local/lib/python3.10/dist-packages/sglang/srt/model_executor/cuda_graph_runner.py", line 315, in replay
torch.cuda.synchronize()
File "/usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py", line 892, in synchronize
return torch._C._cuda_synchronize()
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
Killed
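For context, the assertion that fires here lives in PyTorch's multinomial sampling kernel and checks that each probability row handed to torch.multinomial has positive total mass. A minimal sketch of the condition the kernel rejects (an illustration under that assumption, not a confirmed root cause inside sglang) is:

import torch

# An all-zero (or NaN-contaminated) probability row leaves cumdist[size - 1] <= 0,
# which is exactly what the MultinomialKernel.cu assert guards against.
probs = torch.zeros(1, 8, device="cuda")
torch.multinomial(probs, num_samples=1, replacement=True)
torch.cuda.synchronize()  # the device-side assert surfaces at the next sync, as in the traceback above

Re-running with CUDA_LAUNCH_BLOCKING=1, as the error message suggests, should attribute the assert to the sampling call itself rather than to the later torch.cuda.synchronize() in cuda_graph_runner.replay.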
Reproduction
# sglang 0.2.15
pip install --upgrade pip
pip install "sglang[all]"
# Install FlashInfer CUDA kernels
pip install flashinfer -i https://flashinfer.ai/whl/cu121/torch2.4/
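The eval script below talks to an OpenAI-compatible server at http://127.0.0.1:30000/v1/, so launch sglang first. A typical launch for the 8B run (default flags, shown here as an assumption; adjust --model-path / parallelism as needed):

# Assumed server launch; exact flags were not recorded in this issue
python3 -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-8B-Instruct --port 30000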
import argparse
import asyncio
import os
import pickle
import re
from collections import defaultdict
import openai
import transformers
from datasets import load_dataset
from openai import AsyncOpenAI
from tenacity import (
retry,
retry_if_exception_type,
stop_after_attempt,
wait_exponential,
)
from tqdm import tqdm
# Mapping backends to their clients and models
backend_to_models = {
"sglang": {
"8b": "meta-llama/Meta-Llama-3.1-8B-Instruct",
"70b": "meta-llama/Meta-Llama-3.1-70B-Instruct",
"405b": "meta-llama/Meta-Llama-3.1-405B-Instruct",
},
}
# Define the retry strategy
retry_strategy = retry(
stop=stop_after_attempt(5), # Stop after 5 attempts
wait=wait_exponential(multiplier=1, min=4, max=10), # Exponential backoff
retry=retry_if_exception_type(Exception), # Retry on any exception
)
# Define the fetch_responses function with retry strategy
@retry_strategy
async def fetch_responses(
client, prompt, semaphore, index, backend, model_size, output_dir, max_tokens
):
output_file = os.path.join(output_dir, f"response_{index}.pkl")
if os.path.exists(output_file):
print(f"File {output_file} already exists, skipping.")
return
async with semaphore:
response = await client.completions.create(
model=backend_to_models[backend][model_size],
prompt=prompt,
temperature=0.0,
max_tokens=max_tokens,
)
        if isinstance(response, openai.BadRequestError):
            with open(output_file, "wb") as f:
                pickle.dump("bad_response", f)
            return  # a rejected request is recorded as "bad_response"; skip the assert below
        assert isinstance(response, openai.types.completion.Completion)
# Save response to a file
with open(output_file, "wb") as f:
pickle.dump(response, f)
TASK_TO_MAX_TOKENS = {
"evals__mmlu__details": 1,
"evals__mmlu__0_shot__cot__details": 1024,
    # Official Meta eval uses 1024 here, but a small percentage (.05) of questions are answered correctly only after relaxing the limit
"evals__mmlu_pro__details": 2048,
"evals__gsm8k__details": 1024,
}
def get_client(backend):
return {
"sglang": AsyncOpenAI(base_url="http://127.0.0.1:30000/v1/"),
}[backend]
async def run_benchmark(args):
ds = load_dataset(
"meta-llama/Meta-Llama-3.1-405B-Instruct-evals",
f"Meta-Llama-3.1-405B-Instruct-{args.task}",
)
    semaphore = asyncio.Semaphore(args.concurrency)  # Limit in-flight requests to --concurrency
if args.num_examples is None:
args.num_examples = len(ds["latest"]["input_final_prompts"])
prompts = ds["latest"]["input_final_prompts"][: args.num_examples]
# Create the output directory if it does not exist
os.makedirs(args.output_dir, exist_ok=True)
tasks = []
# Create the tasks with tqdm progress bar
max_tokens = TASK_TO_MAX_TOKENS[args.task]
client = get_client(args.backend)
for idx, prompt in enumerate(tqdm(prompts, desc="Creating tasks")):
tasks.append(
asyncio.create_task(
fetch_responses(
client,
f"<|begin_of_|text|>{prompt[0]}",
semaphore,
idx,
args.backend,
args.model_size,
args.output_dir,
max_tokens=max_tokens,
)
)
)
# Run the tasks with tqdm progress bar
for future in tqdm(
asyncio.as_completed(tasks), total=len(tasks), desc="Processing tasks"
):
await future
def get_mmlu_answer(response):
if response is not None:
return response.choices[0].text.lstrip().rstrip().upper().replace(".", "")
return None
def get_mmlu_cot_answer(response):
pattern = r"The best answer is (.+)\.?"
match = re.search(pattern, response.choices[0].text)
if match:
return match.group(1).replace(".", "").replace("*", "")
pattern = r"the best answer is (.+)\.?"
match = re.search(pattern, response.choices[0].text)
if match:
return match.group(1).replace(".", "")
pattern = r"The correct answer is (.+)\.?"
match = re.search(pattern, response.choices[0].text)
if match:
return match.group(1).replace(".", "")
pattern = r"the correct answer is (.+)\.?"
match = re.search(pattern, response.choices[0].text)
if match:
return match.group(1).replace(".", "")
def get_answer_gsm8k(response):
pattern = r"The final answer is (.+)\.?"
match = re.search(pattern, response.choices[0].text)
if match:
s = match.group(1)
for ok_symbol in ["%", "$"]:
s = s.replace(ok_symbol, "")
return s
TASK_TO_ANSWER_EXTRACTOR = {
"evals__mmlu__details": get_mmlu_answer,
"evals__mmlu__0_shot__cot__details": get_mmlu_cot_answer,
"evals__gsm8k__details": get_answer_gsm8k,
"evals__mmlu_pro__details": get_mmlu_cot_answer,
}
def get_dataset_from_task(task, response_path):
ds_405b = load_dataset(
f"meta-llama/Meta-Llama-3.1-405B-Instruct-evals",
f"Meta-Llama-3.1-405B-Instruct-{task}",
)
ds_405b_hash_order = [x[0] for x in ds_405b["latest"]["input_final_prompts_hash"]]
if "70b" in str(response_path) or "8b" in str(response_path):
if "70" in str(response_path):
ref_model_ds = load_dataset(
f"meta-llama/Meta-Llama-3.1-70B-Instruct-evals",
f"Meta-Llama-3.1-70B-Instruct-{task}",
)
else:
ref_model_ds = load_dataset(
f"meta-llama/Meta-Llama-3.1-8B-Instruct-evals",
f"Meta-Llama-3.1-8B-Instruct-{task}",
)
hash_to_row = {}
for row in ref_model_ds["latest"]:
hash_to_row[row["input_final_prompts_hash"][0]] = row
reordered_rows = []
for prompt_hash in ds_405b_hash_order:
reordered_rows.append(hash_to_row[prompt_hash])
ref_model_ds["latest"] = reordered_rows
return ref_model_ds
return ds_405b
def analyze_answers(task, response_path):
ds = get_dataset_from_task(task, response_path)
responses = []
total = len(ds["latest"])
for i in range(0, total):
response = pickle.load(
open(os.path.join(response_path, f"response_{i}.pkl"), "rb")
)
responses.append(response)
from dataclasses import dataclass
@dataclass
class Stats:
correct: int = 0
total: int = 0
meta_correct: int = 0
average: float = None
subtask_name_to_stats = defaultdict(lambda: Stats())
for response, ds_row in zip(responses, ds["latest"]):
model_answer = TASK_TO_ANSWER_EXTRACTOR[task](response)
subtask = ds_row["subtask_name"]
is_eval_correct = model_answer in ds_row["input_correct_responses"]
if is_eval_correct:
subtask_name_to_stats[subtask].correct += 1
if ds_row["is_correct"]:
subtask_name_to_stats[subtask].meta_correct += 1
subtask_name_to_stats[subtask].total += 1
micro_stats = Stats()
for subtask, stats in subtask_name_to_stats.items():
stats.average = stats.correct / stats.total
stats.meta_average = stats.meta_correct / stats.total
micro_stats.correct += stats.correct
micro_stats.total += stats.total
micro_stats.meta_correct += stats.meta_correct
micro_stats.average = micro_stats.correct / micro_stats.total
micro_stats.meta_average = micro_stats.meta_correct / micro_stats.total
import numpy as np
print("Macro average", np.mean([x.average for x in subtask_name_to_stats.values()]))
print(
"Meta Macro average",
np.mean([x.meta_average for x in subtask_name_to_stats.values()]),
)
print("Micro average", micro_stats.average)
print("Meta Micro average", micro_stats.meta_average)
# Entry point for the script
if __name__ == "__main__":
parser = argparse.ArgumentParser(
description="Script to run model with specified parameters."
)
parser.add_argument(
"--model-size",
type=str,
required=True,
help="Size of the model (e.g., 8b or 70b)",
)
parser.add_argument(
"--backend", type=str, required=True, help="Backend name (e.g., sglang)"
)
parser.add_argument("--task", type=str, required=True)
parser.add_argument(
"--num-examples", type=int, default=None, help="Number of examples to process"
)
parser.add_argument("--concurrency", type=int, default=128)
parser.add_argument(
"--output-dir", type=str, required=True, help="Directory to save responses"
)
os.environ['OPENAI_API_KEY'] = 'EMPTY'
args = parser.parse_args()
asyncio.run(run_benchmark(args))
analyze_answers(args.task, args.output_dir)
python3 eval.py --model-size 8b --backend sglang --task evals__gsm8k__details --output-dir tmp/8b
Environment
Python: 3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0]
CUDA available: True
GPU 0: NVIDIA A100 80GB PCIe
GPU 0 Compute Capability: 8.0
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 12.1, V12.1.105
CUDA Driver Version: 550.90.07
PyTorch: 2.4.0+cu121
sglang: 0.2.15
flashinfer: 0.1.6+cu121torch2.4
triton: 3.0.0
transformers: 4.44.2
requests: 2.32.3
tqdm: 4.66.5
numpy: 1.26.3
aiohttp: 3.10.5
fastapi: 0.112.2
hf_transfer: 0.1.8
huggingface_hub: 0.24.6
interegular: 0.3.3
packaging: 23.2
PIL: 10.2.0
psutil: 5.9.8
pydantic: 2.8.2
uvicorn: 0.30.6
uvloop: 0.20.0
zmq: 24.0.1
vllm: 0.5.5
multipart: 0.0.9
openai: 1.43.0
anthropic: 0.34.1
NVIDIA Topology:
GPU0 NIC0 CPU Affinity NUMA Affinity GPU NUMA ID
GPU0 X NODE 0-31,64-95 0 N/A
NIC0 NODE X
Legend:
X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks
NIC Legend:
NIC0: mlx5_bond_0
ulimit soft: 1048576