
Commit fb4bf25

docs: Update Multimodal Example README (#1275)
This change corrects the README.md file in the examples/multimodal folder:

- Correct "vllm worker" to "decode worker"
- Correct assertion that data is moved via NATS when embeddings are moved via RDMA

Additionally, this change updates the textual graphs with Mermaid graphs for improved presentation on github.com.
1 parent f67dc38 commit fb4bf25

1 file changed: examples/multimodal/README.md

Lines changed: 75 additions & 69 deletions
````diff
@@ -24,26 +24,29 @@ The examples are based on the [llava-1.5-7b-hf](https://huggingface.co/llava-hf/
 
 ### Components
 
-- workers: For aggregated serving, we have two workers, [encode_worker](components/encode_worker.py) for encoding and [vllm_worker](components/worker.py) for prefilling and decoding.
-- processor: Tokenizes the prompt and passes it to the vllm worker.
-- frontend: Http endpoint to handle incoming requests.
+- workers: For aggregated serving, we have two workers, [encode_worker](components/encode_worker.py) for encoding and [decode_worker](components/decode_worker.py) for prefilling and decoding.
+- processor: Tokenizes the prompt and passes it to the decode worker.
+- frontend: HTTP endpoint to handle incoming requests.
 
 ### Deployment
 
-In this deployment, we have two workers, [encode_worker](components/encode_worker.py) and [vllm_worker](components/worker.py).
-The encode worker is responsible for encoding the image and passing the embeddings to the vllm worker via NATS.
-The vllm worker then prefills and decodes the prompt, just like the [LLM aggregated serving](../llm/README.md) example.
+In this deployment, we have two workers, [encode_worker](components/encode_worker.py) and [decode_worker](components/decode_worker.py).
+The encode worker is responsible for encoding the image and passing the embeddings to the decode worker via a combination of NATS and RDMA.
+The work complete event is sent via NATS, while the embeddings tensor is transferred via RDMA through the NIXL interface.
+The decode worker then prefills and decodes the prompt, just like the [LLM aggregated serving](../llm/README.md) example.
 By separating the encode from the prefill and decode stages, we can have a more flexible deployment and scale the
 encode worker independently from the prefill and decode workers if needed.
 
 This figure shows the flow of the deployment:
-```
-
-+------+     +-----------+     +-------------+   image url       +---------------+
-| HTTP |---->| processor |---->| vllm worker |------------------>| encode worker |
-|      |<----|           |<----|             |<------------------|               |
-+------+     +-----------+     +-------------+  image embeddings +---------------+
-
-```
+```mermaid
+flowchart LR
+  HTTP --> processor
+  processor --> HTTP
+  processor --> decode_worker
+  decode_worker --> processor
+  decode_worker --image_url--> encode_worker
+  encode_worker --embeddings--> decode_worker
+```
 
 ```bash
````
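The control-path/data-path split described in the updated deployment text (a small "work complete" event over NATS, the large embeddings tensor out-of-band over RDMA) can be sketched as follows. This is an illustrative sketch only, not the Dynamo or NIXL API: the `register_for_rdma` helper, the event schema, and all field names are hypothetical stand-ins, and in the real deployment the tensor transfer goes through the NIXL interface.

```python
import json

def register_for_rdma(embeddings):
    """Hypothetical stand-in for NIXL registration: pretend we pinned the
    tensor and got back a descriptor the peer can use to pull it via RDMA."""
    # Assume float32 elements (4 bytes each) for the size calculation.
    return {"addr": id(embeddings), "nbytes": len(embeddings) * 4}

def make_work_complete_event(request_id, embeddings):
    """Build the small NATS message; the tensor itself never rides on NATS,
    only a descriptor telling the peer where to pull it from."""
    desc = register_for_rdma(embeddings)
    return json.dumps({
        "request_id": request_id,
        "status": "encode_complete",
        "rdma_descriptor": desc,
    }).encode()

if __name__ == "__main__":
    fake_embeddings = [0.0] * 1024  # stand-in for the image embeddings
    event = make_work_complete_event("req-1", fake_embeddings)
    print(len(event), "bytes on NATS;", len(fake_embeddings) * 4, "bytes via RDMA")
```

The point of the design is visible in the sizes: the NATS payload stays tiny and constant while the RDMA transfer scales with the embeddings tensor.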
````diff
@@ -58,61 +61,64 @@ In another terminal:
 curl http://localhost:8000/v1/chat/completions \
   -H "Content-Type: application/json" \
   -d '{
     "model": "llava-hf/llava-1.5-7b-hf",
     "messages": [
       {
         "role": "user",
         "content": [
           {
             "type": "text",
             "text": "What is in this image?"
           },
           {
             "type": "image_url",
             "image_url": {
               "url": "http://images.cocodataset.org/test2017/000000155781.jpg"
             }
           }
         ]
       }
     ],
     "max_tokens": 300,
     "stream": false
   }'
 ```
 
 You should see a response similar to this:
-```
+```json
 {"id": "c37b946e-9e58-4d54-88c8-2dbd92c47b0c", "object": "chat.completion", "created": 1747725277, "model": "llava-hf/llava-1.5-7b-hf", "choices": [{"index": 0, "message": {"role": "assistant", "content": " In the image, there is a city bus parked on a street, with a street sign nearby on the right side. The bus appears to be stopped out of service. The setting is in a foggy city, giving it a slightly moody atmosphere."}, "finish_reason": "stop"}]}
 ```
 
 ## Multimodal Disaggregated serving
 
 ### Components
 
-- workers: For disaggregated serving, we have three workers, [encode_worker](components/encode_worker.py) for encoding, [vllm_worker](components/worker.py) for decoding, and [prefill_worker](components/prefill_worker.py) for prefilling.
-- processor: Tokenizes the prompt and passes it to the vllm worker.
-- frontend: Http endpoint to handle incoming requests.
+- workers: For disaggregated serving, we have three workers, [encode_worker](components/encode_worker.py) for encoding, [decode_worker](components/decode_worker.py) for decoding, and [prefill_worker](components/prefill_worker.py) for prefilling.
+- processor: Tokenizes the prompt and passes it to the decode worker.
+- frontend: HTTP endpoint to handle incoming requests.
 
 ### Deployment
 
-In this deployment, we have three workers, [encode_worker](components/encode_worker.py), [vllm_worker](components/worker.py), and [prefill_worker](components/prefill_worker.py).
+In this deployment, we have three workers, [encode_worker](components/encode_worker.py), [decode_worker](components/decode_worker.py), and [prefill_worker](components/prefill_worker.py).
 For the Llava model, embeddings are only required during the prefill stage. As such, the encode worker is connected directly to the prefill worker.
-The encode worker handles image encoding and transmits the resulting embeddings to the prefill worker via NATS.
-The prefill worker performs the prefilling step and forwards the KV cache to the vllm worker for decoding.
-For more details on the roles of the prefill and vllm workers, refer to the [LLM disaggregated serving](../llm/README.md) example.
+The encode worker is responsible for encoding the image and passing the embeddings to the prefill worker via a combination of NATS and RDMA.
+The work complete event is sent via NATS, while the embeddings tensor is transferred via RDMA through the NIXL interface.
+The prefill worker performs the prefilling step and forwards the KV cache to the decode worker for decoding.
+For more details on the roles of the prefill and decode workers, refer to the [LLM disaggregated serving](../llm/README.md) example.
 
 This figure shows the flow of the deployment:
-```
-
-+------+     +-----------+     +------------------+     +----------------+   image url       +---------------+
-| HTTP |---->| processor |---->| vllm worker      |---->| prefill worker |------------------>| encode worker |
-|      |<----|           |<----| (decode worker)  |<----|                |<------------------|               |
-+------+     +-----------+     +------------------+     +----------------+  image embeddings +---------------+
-
-```
+```mermaid
+flowchart LR
+  HTTP --> processor
+  processor --> HTTP
+  processor --> decode_worker
+  decode_worker --> processor
+  decode_worker --> prefill_worker
+  prefill_worker --> decode_worker
+  prefill_worker --image_url--> encode_worker
+  encode_worker --embeddings--> prefill_worker
+```
 
 ```bash
 cd $DYNAMO_HOME/examples/multimodal
 dynamo serve graphs.disagg:Frontend -f configs/disagg.yaml
````
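The curl payload shown in the hunk above can also be assembled programmatically. A minimal Python sketch, where the `build_chat_request` helper is illustrative (not part of the example code) and the endpoint URL and model name are taken from the README text:

```python
import json

def build_chat_request(prompt: str, image_url: str,
                       model: str = "llava-hf/llava-1.5-7b-hf") -> dict:
    """Build an OpenAI-style multimodal chat request: one text part plus
    one image_url part inside a single user message."""
    return {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {"type": "image_url", "image_url": {"url": image_url}},
                ],
            }
        ],
        "max_tokens": 300,
        "stream": False,
    }

body = build_chat_request(
    "What is in this image?",
    "http://images.cocodataset.org/test2017/000000155781.jpg",
)
# To actually send it, the deployment above must be running, e.g.:
# import requests
# r = requests.post("http://localhost:8000/v1/chat/completions",
#                   headers={"Content-Type": "application/json"},
#                   data=json.dumps(body))
print(json.dumps(body)[:40])
```

This mirrors the curl invocation byte-for-byte on the wire; only the transport differs.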
````diff
@@ -125,30 +131,30 @@ In another terminal:
 curl http://localhost:8000/v1/chat/completions \
   -H "Content-Type: application/json" \
   -d '{
     "model": "llava-hf/llava-1.5-7b-hf",
     "messages": [
       {
         "role": "user",
         "content": [
           {
             "type": "text",
             "text": "What is in this image?"
           },
           {
             "type": "image_url",
             "image_url": {
               "url": "http://images.cocodataset.org/test2017/000000155781.jpg"
             }
           }
         ]
       }
     ],
     "max_tokens": 300,
     "stream": false
   }'
 ```
 
 You should see a response similar to this:
-```
+```json
 {"id": "c1774d61-3299-4aa3-bea1-a0af6c055ba8", "object": "chat.completion", "created": 1747725645, "model": "llava-hf/llava-1.5-7b-hf", "choices": [{"index": 0, "message": {"role": "assistant", "content": " This image shows a passenger bus traveling down the road near power lines and trees. The bus displays a sign that says \"OUT OF SERVICE\" on its front."}, "finish_reason": "stop"}]}
 ```
````
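For scripting against the endpoint, the chat.completion JSON shown in the hunk above can be unpacked as follows. The `extract_reply` helper is hypothetical, and the sample payload is abridged from the response in the README:

```python
import json

def extract_reply(response_json: str) -> str:
    """Pull the assistant's text out of an OpenAI-style chat.completion
    response; finish_reason "stop" means generation completed normally."""
    resp = json.loads(response_json)
    choice = resp["choices"][0]
    return choice["message"]["content"].strip()

# Abridged version of the sample response shown above.
sample = json.dumps({
    "id": "c1774d61-3299-4aa3-bea1-a0af6c055ba8",
    "object": "chat.completion",
    "created": 1747725645,
    "model": "llava-hf/llava-1.5-7b-hf",
    "choices": [{
        "index": 0,
        "message": {"role": "assistant",
                    "content": " This image shows a passenger bus traveling down the road."},
        "finish_reason": "stop",
    }],
})
print(extract_reply(sample))
```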
