Add llama implementation with no tensor parallel linears by jerryzh168 · Pull Request #1561 · sgl-project/sglang

jerryzh168 · 2024-10-03T20:59:58Z

Summary:
This PR tries to demonstrate that Llama built from normal (non-tensor-parallel) linear layers works together with a quantized model and tensor parallelism.

- Verified correctness against the original Llama 3 model
- Added support for `--json-model-override-args` in the `bench_latency` script

Next: add PyTorch-native tensor parallelism test code for int8 weight-only quantization in torchao. Diff from the current Llama model definition: https://gist.github.com/jerryzh168/692ff83735d4ca298c1aad2424b2c225
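For readers unfamiliar with the distinction in the summary, here is a minimal pure-Python sketch (no torch, not the code in this PR) of how a column-parallel linear relates to a normal linear: the weight's output rows are split across ranks, and concatenating the per-rank results reproduces the unsharded output. The helper names `linear` and `column_parallel_linear` and the toy weights are hypothetical, chosen only for illustration.

```python
# Illustrative only: plain Python lists stand in for tensors, and a loop over
# "ranks" stands in for separate devices plus an all-gather.

def linear(x, weight):
    """Compute y = weight @ x for one input vector; weight is [out][in]."""
    return [sum(xi * wi for xi, wi in zip(x, row)) for row in weight]


def column_parallel_linear(x, weight, world_size):
    """Shard output features across `world_size` hypothetical ranks, then gather."""
    out_features = len(weight)
    assert out_features % world_size == 0, "output dim must divide evenly"
    shard = out_features // world_size
    gathered = []
    for rank in range(world_size):
        local_weight = weight[rank * shard:(rank + 1) * shard]
        gathered.extend(linear(x, local_weight))  # each rank computes its slice
    return gathered


x = [1.0, 2.0, 3.0]
weight = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [1.0, 1.0, 1.0],
]
full = linear(x, weight)
sharded = column_parallel_linear(x, weight, world_size=2)
```

Both `full` and `sharded` hold the same values, which is the property that lets a model definition written with normal linears be sharded afterwards.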

Test Plan:

Use `--json-model-override-args` to override the model architecture name:

python3 -m sglang.bench_latency --correct --model meta-llama/Meta-Llama-3-8B --json-model-override-args '{"architectures": ["TorchNativeLlamaForCausalLM"]}'
Init nccl begin.
Load weight begin. avail mem=94.48 GB
INFO 10-04 15:00:53 weight_utils.py:236] Using model weights format ['*.safetensors']
Loading safetensors checkpoint shards:   0% Completed | 0/4 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  25% Completed | 1/4 [00:00<00:01,  2.46it/s]
Loading safetensors checkpoint shards:  50% Completed | 2/4 [00:00<00:00,  2.26it/s]
Loading safetensors checkpoint shards:  75% Completed | 3/4 [00:01<00:00,  2.25it/s]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:01<00:00,  3.11it/s]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:01<00:00,  2.75it/s]

Load weight end. type=TorchNativeLlamaForCausalLM, dtype=torch.bfloat16, avail mem=79.41 GB

Performance check:

python3 -m sglang.bench_latency --model jerryzh168/llama3-8B --batch-size 1 --input 128 --output 8
python3 -m sglang.bench_latency --model jerryzh168/llama3-8B --batch-size 1 --input 128 --output 8 --torchao-config int4wo-128

max_total_num_tokens=631444
Warmup ...
Prefill. latency: 0.09536 s, throughput:   1342.32 token/s
Decode.  latency: 0.00538 s, throughput:    185.80 token/s
Decode.  latency: 0.00476 s, throughput:    209.91 token/s
Decode.  latency: 0.00466 s, throughput:    214.38 token/s
Decode.  median latency: 0.00476 s, median throughput:    209.91 token/s
Total. latency:  0.110 s, throughput:   1198.18 token/s
Benchmark ...
Prefill. latency: 0.06534 s, throughput:   1958.93 token/s
Decode.  latency: 0.00502 s, throughput:    199.16 token/s
Decode.  latency: 0.00476 s, throughput:    210.03 token/s
Decode.  latency: 0.00469 s, throughput:    213.19 token/s
Decode.  latency: 0.00466 s, throughput:    214.77 token/s
Decode.  latency: 0.00466 s, throughput:    214.74 token/s
Decode.  median latency: 0.00469 s, median throughput:    213.19 token/s
Total. latency:  0.098 s, throughput:   1381.24 token/s
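As a sanity check on how the numbers in the benchmark log relate to each other, the sketch below recomputes throughput from the printed latencies, assuming throughput is simply tokens processed divided by latency. That formula is an assumption inferred from the log, not taken from the `bench_latency` source, and the variable names are illustrative.

```python
from statistics import median

# Values copied from the "Benchmark ..." section of the log above.
batch_size = 1
input_len = 128
prefill_latency = 0.06534  # seconds
decode_latencies = [0.00502, 0.00476, 0.00469, 0.00466, 0.00466]  # seconds

# Assumed relationship: tokens processed in the step divided by its latency.
prefill_throughput = batch_size * input_len / prefill_latency

# Each decode step produces one token per sequence in the batch.
median_decode_latency = median(decode_latencies)
median_decode_throughput = batch_size / median_decode_latency

print(f"prefill: {prefill_throughput:.2f} token/s")
print(f"decode median: {median_decode_latency:.5f} s, "
      f"{median_decode_throughput:.2f} token/s")
```

The recomputed values agree with the logged ones up to the rounding of the printed latencies.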

Accuracy check:


# python3 scripts/playground/reference_hf.py --model meta-llama/Meta-Llama-3-8B
========== Prompt 0 ==========
prefill logits (final) tensor([ 5.0195,  3.0801,  0.7422,  ..., -7.4805, -7.4805, -7.4805],
       device='cuda:0')
<|begin_of_text|>The capital of France is Paris. It is located in the north of the country. The city is situated
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.

========== Prompt 1 ==========
prefill logits (final) tensor([ 5.2109,  4.2344,  1.8408,  ..., -7.5195, -7.5195, -7.5195],
       device='cuda:0')
<|begin_of_text|>The capital of the United Kindom is London. It is the largest city in the UK and the largest city in the
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.

========== Prompt 2 ==========
prefill logits (final) tensor([ 9.5391,  3.1914,  0.8188,  ..., -7.0469, -7.0469, -7.0469],
       device='cuda:0')
<|begin_of_text|>Today is a sunny day and I like to go out and enjoy the sun. I am going to the beach with my

# python3 scripts/playground/reference_hf.py --model jerryzh168/llama3-8B
========== Prompt 0 ==========
prefill logits (final) tensor([ 5.0195,  3.0801,  0.7422,  ..., -7.4805, -7.4805, -7.4805],
       device='cuda:0')
<|begin_of_text|>The capital of France is Paris. It is located in the north of the country. The city is situated
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.

========== Prompt 1 ==========
prefill logits (final) tensor([ 5.2109,  4.2344,  1.8408,  ..., -7.5195, -7.5195, -7.5195],
       device='cuda:0')
<|begin_of_text|>The capital of the United Kindom is London. It is the largest city in the UK and the largest city in the
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.

========== Prompt 2 ==========
prefill logits (final) tensor([ 9.5391,  3.1914,  0.8188,  ..., -7.0469, -7.0469, -7.0469],
       device='cuda:0')
<|begin_of_text|>Today is a sunny day and I like to go out and enjoy the sun. I am going to the beach with my


# python3 -m sglang.bench_latency --correct --model meta-llama/Meta-Llama-3-8B
max_total_num_tokens=557684

input_ids=[[128000, 791, 6864, 315, 9822, 374], [128000, 791, 6864, 315, 279, 3723, 17262, 316, 374], [128000, 15724, 374, 264, 40798, 1938, 323, 358, 1093]]

prefill logits (first half): tensor([[ 1.9609,  2.1094, -1.2500,  ..., -5.5000, -5.5000, -5.5000],
        [ 1.9609,  2.1094, -1.2500,  ..., -5.5000, -5.5000, -5.5000],
        [ 2.2969,  2.9531,  2.1406,  ..., -8.3750, -8.3750, -8.3750]],
       device='cuda:0')

prefill logits (final): tensor([[ 5.0312,  3.1094,  0.7500,  ..., -7.4375, -7.4375, -7.4375],
        [ 5.2188,  4.2188,  1.8359,  ..., -7.5312, -7.5312, -7.5312],
        [ 9.5000,  3.1406,  0.7891,  ..., -7.0938, -7.0938, -7.0938]],
       device='cuda:0')

========== Prompt 0 ==========
<|begin_of_text|>The capital of France is Paris. It is located in the north of the country. It is the largest

========== Prompt 1 ==========
<|begin_of_text|>The capital of the United Kindom is London. It is the largest city in the UK and the largest city in the

========== Prompt 2 ==========
<|begin_of_text|>Today is a sunny day and I like to go out for a walk. I am going to the park. I am


# python3 -m sglang.bench_latency --correct --model jerryzh168/llama3-8B
Load weight end. type=TorchNativeLlamaForCausalLM, dtype=torch.bfloat16, avail mem=79.41 GB
Memory pool end. avail mem=11.16 GB
Capture cuda graph begin. This can take up to several minutes.
max_total_num_tokens=557684

input_ids=[[128000, 791, 6864, 315, 9822, 374], [128000, 791, 6864, 315, 279, 3723, 17262, 316, 374], [128000, 15724, 374, 264, 40798, 1938, 323, 358, 1093]]

prefill logits (first half): tensor([[ 1.9609,  2.1094, -1.2500,  ..., -5.5000, -5.5000, -5.5000],
        [ 1.9609,  2.1094, -1.2500,  ..., -5.5000, -5.5000, -5.5000],
        [ 2.2969,  2.9531,  2.1406,  ..., -8.3750, -8.3750, -8.3750]],
       device='cuda:0')

prefill logits (final): tensor([[ 5.0312,  3.1094,  0.7500,  ..., -7.4375, -7.4375, -7.4375],
        [ 5.2188,  4.2188,  1.8359,  ..., -7.5312, -7.5312, -7.5312],
        [ 9.5000,  3.1406,  0.7891,  ..., -7.0938, -7.0938, -7.0938]],
       device='cuda:0')

========== Prompt 0 ==========
<|begin_of_text|>The capital of France is Paris. It is located in the north of the country. Paris is the largest

========== Prompt 1 ==========
<|begin_of_text|>The capital of the United Kindom is London. It is the largest city in the UK and the largest city in the

========== Prompt 2 ==========
<|begin_of_text|>Today is a sunny day and I like to go out for a walk. I am going to the park. I am
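One way to read the accuracy logs is to compare the visible entries of the final prefill logits from the Hugging Face reference against the SGLang output. The sketch below does that in plain Python on the values printed above; the `0.1` tolerance is an arbitrary assumption for illustration, not a threshold used by the project, and only the four entries visible in each truncated tensor are compared.

```python
# Visible entries (first three and last) of "prefill logits (final)" per prompt,
# copied from the reference_hf.py and bench_latency --correct logs above.
hf_reference = [
    [5.0195, 3.0801, 0.7422, -7.4805],
    [5.2109, 4.2344, 1.8408, -7.5195],
    [9.5391, 3.1914, 0.8188, -7.0469],
]
sglang_output = [
    [5.0312, 3.1094, 0.7500, -7.4375],
    [5.2188, 4.2188, 1.8359, -7.5312],
    [9.5000, 3.1406, 0.7891, -7.0938],
]


def max_abs_diff(a, b):
    """Largest element-wise absolute difference between two equal-length lists."""
    return max(abs(x - y) for x, y in zip(a, b))


TOLERANCE = 0.1  # assumed threshold, for illustration only

per_prompt_diff = [max_abs_diff(h, s) for h, s in zip(hf_reference, sglang_output)]
all_close = all(diff < TOLERANCE for diff in per_prompt_diff)

for i, diff in enumerate(per_prompt_diff):
    print(f"prompt {i}: max abs diff = {diff:.4f}")
```

The printed SGLang logits are identical for `meta-llama/Meta-Llama-3-8B` and `jerryzh168/llama3-8B`, so the same comparison applies to both runs.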


merrymercy

Is `TorchNativeLlamaForCausalLM` a better name?

Did you test the correctness?

sglang/docs/en/model_support.md

Lines 13 to 14 in 04b262c

    - Get the reference output by `python3 scripts/playground/reference_hf.py --model [new model]`
    - Get the SGLang output by `python3 -m sglang.bench_latency --correct --model [new model]`

Maybe we can add some arguments that allow using this model implementation without using a new checkpoint. We have some arguments like

sglang/python/sglang/srt/server_args.py

Lines 429 to 435 in 04b262c

    # Model override args
    parser.add_argument(
        "--json-model-override-args",
        type=str,
        help="A dictionary in JSON string format used to override default model configurations.",
        default=ServerArgs.json_model_override_args,
    )

to override the model configs. I am not sure whether it works.
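The mechanism discussed in this comment can be sketched roughly as follows: parse the flag as a JSON string and merge the resulting dictionary over the model config. This is a minimal standalone illustration, not SGLang's implementation; the helpers `parse_override_args` and `apply_overrides`, the `"{}"` default, and the toy `base_config` are all assumptions.

```python
import argparse
import json


def parse_override_args(argv):
    """Parse --json-model-override-args and return it as a dict."""
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--json-model-override-args",
        type=str,
        default="{}",  # assumed default: no overrides
        help="A dictionary in JSON string format used to override default model configurations.",
    )
    args = parser.parse_args(argv)
    return json.loads(args.json_model_override_args)


def apply_overrides(config, overrides):
    """Return a copy of `config` with top-level keys replaced by `overrides`."""
    merged = dict(config)
    merged.update(overrides)
    return merged


# Toy stand-in for a checkpoint's config; real configs have many more fields.
base_config = {"architectures": ["LlamaForCausalLM"], "hidden_size": 4096}

overrides = parse_override_args(
    ["--json-model-override-args", '{"architectures": ["TorchNativeLlamaForCausalLM"]}']
)
config = apply_overrides(base_config, overrides)
```

Overriding only `architectures` selects a different model implementation while leaving the checkpoint and the rest of the config untouched.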


merrymercy · 2024-10-05T18:22:37Z

@jerryzh168 Thanks! It is merged.


jerryzh168 force-pushed the raw-llama-tp branch from 966a3d6 to a328ee5 Compare October 3, 2024 23:25

merrymercy reviewed Oct 4, 2024


jerryzh168 added 4 commits October 4, 2024 15:06

Add llama implementation with no tensor parallel linears

c9b829c

Summary: Trying to demo llama with normal lineaer + quantized model + tensor parallelism works Test Plan: TODO Reviewers: Subscribers: Tasks: Tags:

typo

f3ff7e8

format

444ad55

address comments

296eabd

jerryzh168 force-pushed the raw-llama-tp branch from a328ee5 to 296eabd Compare October 4, 2024 22:07

format

1448854

jerryzh168 requested a review from merrymercy October 5, 2024 00:00

merrymercy merged commit 9b0926c into sgl-project:main Oct 5, 2024

merrymercy mentioned this pull request Oct 19, 2024

Development Roadmap (2024 Q4) #1487

Closed

37 tasks

jerryzh168 mentioned this pull request Jan 3, 2025

[Bug] How to load weight with torchao #2721

Closed

5 tasks

zhaochenyang20 mentioned this pull request Mar 3, 2025

Development Roadmap (2025 H1) #4035

Closed

22 tasks

timethink pushed a commit to timethink/sglang that referenced this pull request Mar 9, 2025

Add llama implementation with no tensor parallel linears (sgl-project…

bc22333

…#1561)

Add llama implementation with no tensor parallel linears #1561
merrymercy merged 5 commits into sgl-project:main from jerryzh168:raw-llama-tp


