This change corrects the README.md file in the examples/multimodal folder:
- Correct "vllm worker" to "decode worker".
- Correct the assertion that embeddings are moved via NATS; the embeddings tensor is transferred via RDMA, and only the work complete event is sent via NATS.
Additionally, this change replaces the textual graphs with Mermaid graphs for improved presentation on github.com.
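For illustration, a Mermaid graph of the aggregated flow might look like the following sketch. It is drawn from the README's own description of the components, not copied from the PR, so node names and edge labels are assumptions:

```mermaid
flowchart LR
  frontend[frontend] --> processor[processor]
  processor --> decode[decode_worker]
  encode[encode_worker] -- "embeddings (RDMA/NIXL)" --> decode
  encode -. "work complete (NATS)" .-> decode
```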
@@ -24,26 +24,29 @@ The examples are based on the [llava-1.5-7b-hf](https://huggingface.co/llava-hf/
 ### Components
 
-- workers: For aggregated serving, we have two workers, [encode_worker](components/encode_worker.py) for encoding and [vllm_worker](components/worker.py) for prefilling and decoding.
-- processor: Tokenizes the prompt and passes it to the vllm worker.
-- frontend: Http endpoint to handle incoming requests.
+- workers: For aggregated serving, we have two workers, [encode_worker](components/encode_worker.py) for encoding and [decode_worker](components/decode_worker.py) for prefilling and decoding.
+- processor: Tokenizes the prompt and passes it to the decode worker.
+- frontend: HTTP endpoint to handle incoming requests.
 
 ### Deployment
 
-In this deployment, we have two workers, [encode_worker](components/encode_worker.py) and [vllm_worker](components/worker.py).
-The encode worker is responsible for encoding the image and passing the embeddings to the vllm worker via NATS.
-The vllm worker then prefills and decodes the prompt, just like the [LLM aggregated serving](../llm/README.md) example.
+In this deployment, we have two workers, [encode_worker](components/encode_worker.py) and [decode_worker](components/decode_worker.py).
+The encode worker is responsible for encoding the image and passing the embeddings to the decode worker via a combination of NATS and RDMA.
+The work complete event is sent via NATS, while the embeddings tensor is transferred via RDMA through the NIXL interface.
+The decode worker then prefills and decodes the prompt, just like the [LLM aggregated serving](../llm/README.md) example.
 
 By separating the encode from the prefill and decode stages, we can have a more flexible deployment and scale the
 encode worker independently from the prefill and decode workers if needed.
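The NATS-plus-RDMA split described above can be sketched in Python. This is an illustrative assumption, not the actual dynamo/NIXL API: the event schema, the subject name, and the `nixl://` descriptor format are all hypothetical. The idea it shows is that only a small metadata event travels over NATS (the `publish` call is the real nats-py API), while the large embeddings tensor is referenced by a descriptor and pulled out-of-band over RDMA.

```python
import json


def make_work_complete_event(request_id, nixl_descriptor, shape):
    """Build the small control-plane message sent over NATS.

    The event carries only metadata; the embeddings tensor itself is
    transferred via RDMA and referenced by a (hypothetical) NIXL
    memory descriptor.
    """
    return json.dumps({
        "event": "encode_complete",
        "request_id": request_id,
        # Where the decode worker can pull the tensor from via RDMA.
        "nixl_descriptor": nixl_descriptor,
        "embedding_shape": list(shape),
    }).encode()


async def notify_decode_worker(nc, event):
    # `nc` is a connected nats-py client; the decode worker would
    # subscribe to this (hypothetical) subject, then fetch the tensor
    # over RDMA using the descriptor in the event.
    await nc.publish("multimodal.encode.complete", event)


# Example event for a Llava-sized embeddings tensor (shape is illustrative).
event = make_work_complete_event("req-1", "nixl://gpu0/buf42", (1, 576, 4096))
```

The design point is that NATS messages stay tiny and latency-insensitive, while the bandwidth-heavy tensor bypasses the message broker entirely.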
 {"id": "c37b946e-9e58-4d54-88c8-2dbd92c47b0c", "object": "chat.completion", "created": 1747725277, "model": "llava-hf/llava-1.5-7b-hf", "choices": [{"index": 0, "message": {"role": "assistant", "content": " In the image, there is a city bus parked on a street, with a street sign nearby on the right side. The bus appears to be stopped out of service. The setting is in a foggy city, giving it a slightly moody atmosphere."}, "finish_reason": "stop"}]}
 ```
 
 ## Multimodal Disaggregated serving
 
 ### Components
 
-- workers: For disaggregated serving, we have three workers, [encode_worker](components/encode_worker.py) for encoding, [vllm_worker](components/worker.py) for decoding, and [prefill_worker](components/prefill_worker.py) for prefilling.
-- processor: Tokenizes the prompt and passes it to the vllm worker.
-- frontend: Http endpoint to handle incoming requests.
+- workers: For disaggregated serving, we have three workers, [encode_worker](components/encode_worker.py) for encoding, [decode_worker](components/decode_worker.py) for decoding, and [prefill_worker](components/prefill_worker.py) for prefilling.
+- processor: Tokenizes the prompt and passes it to the decode worker.
+- frontend: HTTP endpoint to handle incoming requests.
 
 ### Deployment
 
-In this deployment, we have three workers, [encode_worker](components/encode_worker.py), [vllm_worker](components/worker.py), and [prefill_worker](components/prefill_worker.py).
+In this deployment, we have three workers, [encode_worker](components/encode_worker.py), [decode_worker](components/decode_worker.py), and [prefill_worker](components/prefill_worker.py).
 
 For the Llava model, embeddings are only required during the prefill stage. As such, the encode worker is connected directly to the prefill worker.
-The encode worker handles image encoding and transmits the resulting embeddings to the prefill worker via NATS.
-The prefill worker performs the prefilling step and forwards the KV cache to the vllm worker for decoding.
-For more details on the roles of the prefill and vllm workers, refer to the [LLM disaggregated serving](../llm/README.md) example.
+The encode worker is responsible for encoding the image and passing the embeddings to the prefill worker via a combination of NATS and RDMA.
+The work complete event is sent via NATS, while the embeddings tensor is transferred via RDMA through the NIXL interface.
+The prefill worker performs the prefilling step and forwards the KV cache to the decode worker for decoding.
+For more details on the roles of the prefill and decode workers, refer to the [LLM disaggregated serving](../llm/README.md) example.
 {"id": "c1774d61-3299-4aa3-bea1-a0af6c055ba8", "object": "chat.completion", "created": 1747725645, "model": "llava-hf/llava-1.5-7b-hf", "choices": [{"index": 0, "message": {"role": "assistant", "content": " This image shows a passenger bus traveling down the road near power lines and trees. The bus displays a sign that says \"OUT OF SERVICE\" on its front."}, "finish_reason": "stop"}]}
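The responses shown in the diff are standard OpenAI-style `chat.completion` objects, so any client can pull out the generated text with ordinary JSON handling. A minimal sketch, using a shortened stand-in payload rather than a real server response:

```python
import json

# Minimal stand-in for the chat.completion payloads shown above
# (the "content" here is shortened; real responses are longer).
raw = json.dumps({
    "object": "chat.completion",
    "model": "llava-hf/llava-1.5-7b-hf",
    "choices": [
        {
            "index": 0,
            "message": {
                "role": "assistant",
                "content": "A bus displaying an OUT OF SERVICE sign.",
            },
            "finish_reason": "stop",
        }
    ],
})


def extract_answer(body):
    """Return the assistant text from an OpenAI-style chat.completion payload."""
    return json.loads(body)["choices"][0]["message"]["content"]


print(extract_answer(raw))  # -> A bus displaying an OUT OF SERVICE sign.
```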