Support Alibaba-NLP/gte-Qwen2-7B-instruct embedding Model #1186
Ying1123 merged 2 commits into sgl-project:main from zhaochenyang20:support_qwn2
Conversation
@Ying1123 I added gte in the generation model test. Note that I changed the prefill tolerance accordingly and added a ROUGE-L metric instead of asserting that the output_strs match exactly.
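For reference, a minimal sketch of what a ROUGE-L-based check could look like; the `rouge_score` dependency and the 0.9 threshold are illustrative assumptions, not necessarily what the test uses:

```python
# Illustrative sketch: compare generated strings against references with
# ROUGE-L instead of requiring exact string equality.
from rouge_score import rouge_scorer


def assert_rougel_close(output_strs, reference_strs, threshold=0.9):
    scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
    for out, ref in zip(output_strs, reference_strs):
        score = scorer.score(ref, out)["rougeL"].fmeasure
        # The threshold here is an assumed placeholder value.
        assert score >= threshold, f"ROUGE-L {score:.3f} below threshold for: {out!r}"
```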
```python
import multiprocessing as mp

try:
    mp.set_start_method("spawn")
```
Why would this be needed?
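For context (not from this thread): the `spawn` start method is commonly required when worker processes use CUDA, because a forked child inherits a CUDA context it cannot reuse. A typical guarded call looks like the sketch below; the `RuntimeError` handling is an assumption about avoiding a failure when the start method has already been set:

```python
import multiprocessing as mp

# "spawn" starts children with a fresh interpreter, avoiding issues with a
# forked CUDA context. set_start_method raises RuntimeError if the start
# method was already set, so guard it when this may run more than once.
try:
    mp.set_start_method("spawn")
except RuntimeError:
    pass
```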
@zhaochenyang20

```python
prompt = "hello world"
response = client.embeddings.create(
```

transformer:

```python
max_length = 8192
batch_dict = tokenizer(prompt, max_length=max_length, padding=True, truncation=True, return_tensors='pt')
embeddings = F.normalize(embeddings, p=2, dim=1)
```
@llmforever Hello. Sorry, I hadn't noticed this before. Do you still need to fix this? Actually, we have a unit test for this in test/srt/models/test_embedding_models.py. Also, I don't understand what you meant by "perform not so well". Could you provide your running scripts and your serving command for SGLang? And does e5-mistral also have this problem, or only gte?
Yeah, the embedding could be different due to a lot of reasons. @llmforever You can check this unit test: https://github.com/sgl-project/sglang/blob/main/test/srt/models/test_embedding_models.py. We set a tolerance value for the embedding difference there.

Also, please try the e5-mistral model and give us the embedding difference: https://huggingface.co/intfloat/e5-mistral-7b-instruct

@Ying1123 Do you think the difference provided is tolerable?
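For illustration, a tolerance-based comparison of two embedding vectors could look like the sketch below; the tolerance values are placeholders, not the ones used in the actual unit test:

```python
import torch


def compare_embeddings(emb_a, emb_b, cos_sim_tol=0.999, max_abs_tol=1e-2):
    """Rough sketch: check that two embedding vectors are close enough.

    emb_a / emb_b are 1-D float lists or tensors (e.g. HF transformers vs.
    the SGLang endpoint). Tolerance values here are illustrative only.
    """
    a = torch.tensor(emb_a, dtype=torch.float32)
    b = torch.tensor(emb_b, dtype=torch.float32)
    cos_sim = torch.nn.functional.cosine_similarity(a, b, dim=0).item()
    max_abs_diff = (a - b).abs().max().item()
    print(f"cosine similarity: {cos_sim:.6f}, max abs diff: {max_abs_diff:.6f}")
    assert cos_sim >= cos_sim_tol and max_abs_diff <= max_abs_tol
```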
I tested about 10 cases; accuracy drops from 80% to less than 10%, so I think the difference is not tolerable. But the result with the e5-mistral-7b-instruct model is the same. Can you please help me look at this? Here is the code I use to generate the embeddings.

For transformers:

```python
import torch
from torch import Tensor

input_texts = ['hello']
max_length = 8192
batch_dict = tokenizer(input_texts, max_length=max_length, padding=True, truncation=True, return_tensors='pt')
embeddings = F.normalize(embeddings, p=2, dim=1)
```

For SGLang:

```python
input_texts = ['hello']
queries = client.embeddings.create(
```
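For completeness, the steps this snippet elides between tokenization and normalization usually follow the model card's last-token pooling recipe. A minimal sketch based on the gte-Qwen2-7B-instruct model card (not the commenter's actual code):

```python
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

# Sketch based on the gte-Qwen2-7B-instruct model card, not the commenter's exact code.
tokenizer = AutoTokenizer.from_pretrained("Alibaba-NLP/gte-Qwen2-7B-instruct", trust_remote_code=True)
model = AutoModel.from_pretrained("Alibaba-NLP/gte-Qwen2-7B-instruct", trust_remote_code=True)


def last_token_pool(last_hidden_states, attention_mask):
    # With right padding, take the hidden state of the last non-padding token.
    left_padding = (attention_mask[:, -1].sum() == attention_mask.shape[0])
    if left_padding:
        return last_hidden_states[:, -1]
    sequence_lengths = attention_mask.sum(dim=1) - 1
    batch_idx = torch.arange(last_hidden_states.shape[0], device=last_hidden_states.device)
    return last_hidden_states[batch_idx, sequence_lengths]


input_texts = ["hello"]
batch_dict = tokenizer(input_texts, max_length=8192, padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    outputs = model(**batch_dict)
embeddings = last_token_pool(outputs.last_hidden_state, batch_dict["attention_mask"])
embeddings = F.normalize(embeddings, p=2, dim=1)
```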
@Ying1123 I think he is reporting an intolerable difference, hmm? I will check it these days.
Support Alibaba-NLP/gte-Qwen2-7B-instruct embedding Model (sgl-project#1186)
Co-authored-by: Ying Sheng <sqy1415@gmail.com>

Motivation
Currently, SGLang only supports the e5-mistral embedding model. I added the Alibaba-NLP/gte-Qwen2-7B-instruct model in this PR.
Also, SGLang previously determined whether a model is an embedding model through its `hf_config.architectures`. But the gte model has the same architecture as a CausalLM model, so I added a new parameter in `server_args` and changed the forward function of `Qwen2ForCausalLM`.
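As a usage sketch, assuming the new parameter is exposed on the launch command as `--is-embedding` and that the server exposes its OpenAI-compatible `/v1` endpoint on port 30000 (adjust to your setup):

```python
# Launch the server first, e.g. (flag name assumed):
#   python -m sglang.launch_server --model-path Alibaba-NLP/gte-Qwen2-7B-instruct --is-embedding --port 30000
from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:30000/v1", api_key="EMPTY")
response = client.embeddings.create(
    model="Alibaba-NLP/gte-Qwen2-7B-instruct",
    input="hello world",
)
embedding = response.data[0].embedding  # list of floats
print(len(embedding))
```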
Modifications

- Changed the forward function of `Qwen2ForCausalLM`.
- Added `is_embedding` in `server_args`.

Checklist