{"id":37736,"date":"2024-10-21T06:00:00","date_gmt":"2024-10-21T00:30:00","guid":{"rendered":"https:\/\/debuggercafe.com\/?p=37736"},"modified":"2024-10-21T06:02:15","modified_gmt":"2024-10-21T00:32:15","slug":"serving-llms-using-litserve","status":"publish","type":"post","link":"https:\/\/debuggercafe.com\/serving-llms-using-litserve\/","title":{"rendered":"Serving LLMs using LitServe"},"content":{"rendered":"\n<p>At one point or the other, we all have run LLMs locally, maybe through Hugging Face Transformers, Ollama, or any of the online tools and software available. However, for production-ready environments, we will need to serve LLMs using an API URL. It can be a locally hosted LLM accessible to the organization only or an exposed API for the customers to use. The serving part can be tricky. But with the myriad of libraries coming out, the process is slowly becoming simpler, even for deep learning engineers. One such library is <strong>LitServe<\/strong>. In this article, we will explore the <strong>serving of LLMs using LitServe<\/strong>.<\/p>\n\n\n\n<div class=\"wp-block-buttons is-horizontal is-content-justification-center is-layout-flex wp-container-core-buttons-is-layout-499968f5 wp-block-buttons-is-layout-flex\">\n<div class=\"wp-block-button is-style-outline is-style-outline--1\"><a class=\"wp-block-button__link has-black-color has-luminous-vivid-orange-background-color has-text-color has-background wp-element-button\" href=\"#download-code\"><strong>Jump to Download Code<\/strong><\/a><\/div>\n<\/div>\n\n\n\n<p><\/p>\n\n\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-full\"><a href=\"https:\/\/debuggercafe.com\/wp-content\/uploads\/2024\/09\/litserve-studio-inference.gif\" target=\"_blank\" rel=\"noreferrer noopener\"><img loading=\"lazy\" decoding=\"async\" width=\"800\" height=\"585\" src=\"https:\/\/debuggercafe.com\/wp-content\/uploads\/2024\/09\/litserve-studio-inference.gif\" alt=\"LitServe Lightning Studio inference - serving LLMs using LitServe.\" class=\"wp-image-37845\"\/><\/a><figcaption class=\"wp-element-caption\">Figure 1. LitServe Lightning Studio inference &#8211; serving LLMs using LitServe.<\/figcaption><\/figure>\n<\/div>\n\n\n<p>Primarily, our goal is to explore different techniques for serving LLMs using LitServe, both locally and through an exposed API. After going through the article, we should be well-equipped to take our LLM projects one step further to real-world deployment. <\/p>\n\n\n\n<h2 class=\"wp-block-heading\"><em>What are we going to cover in this article?<\/em><\/h2>\n\n\n\n<ul class=\"wp-block-list\">\n<li><em>What is LitServe?<\/em><\/li>\n\n\n\n<li><em>How to set up LitServe locally?<\/em><\/li>\n\n\n\n<li><em>What are the different techniques of serving models through LitServe?<\/em><\/li>\n\n\n\n<li><em>How to use Lightning Studio to serve LLMs through LitServe and LitGPT?<\/em><\/li>\n\n\n\n<li><em>How to use LLMs from Hugging Face Transformers with LitServe?<\/em><\/li>\n\n\n\n<li><em>What are the steps to use LitServe locally for serving LLMs?<\/em><\/li>\n<\/ul>\n\n\n\n<p><\/p>\n\n\n\n<p><strong><em>Note: <\/em><\/strong><em>We do not discuss a complete production-ready deployment pipeline here. 
This is more of a &#8220;<strong>getting started with LLM serving and deployment<\/strong>&#8221; for deep learning engineers and practitioners who mostly work with model creation, data pipelines, and <strong><a href=\"https:\/\/debuggercafe.com\/hugging-face-autotrain\/\" target=\"_blank\" rel=\"noreferrer noopener\">LLM training<\/a><\/strong>.<\/em><\/p>\n\n\n\n<h2 class=\"wp-block-heading\">What is LitServe?<\/h2>\n\n\n\n<p><a href=\"https:\/\/github.com\/Lightning-AI\/litserve\/\" target=\"_blank\" rel=\"noreferrer noopener\"><strong>LitServe<\/strong><\/a> is an AI model serving engine by Lightning AI, the company behind PyTorch Lightning. LitServe can serve not only LLMs but any AI model that we want. It is flexible, fast, and built on top of FastAPI, and for LLMs, it has integration with LitGPT.<\/p>\n\n\n\n<p>Along with that, it has a host of features like GPU autoscaling, integration with Lightning Studio, batching, streaming, and many more. Take a look at the <strong><a href=\"https:\/\/github.com\/Lightning-AI\/litserve\/?tab=readme-ov-file#features\" target=\"_blank\" rel=\"noreferrer noopener\">entire list of features<\/a><\/strong> to know more. Because of optimized worker and GPU handling, it is faster than some of the most popular serving frameworks, like FastAPI and TorchServe.<\/p>\n\n\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-full\"><a href=\"https:\/\/debuggercafe.com\/wp-content\/uploads\/2024\/09\/litserve-vs-other-serving-apis.png\" target=\"_blank\" rel=\"noreferrer noopener\"><img loading=\"lazy\" decoding=\"async\" width=\"3122\" height=\"782\" src=\"https:\/\/debuggercafe.com\/wp-content\/uploads\/2024\/09\/litserve-vs-other-serving-apis.png\" alt=\"LitServe compared to other serving APIs for serving LLMs.\" class=\"wp-image-37847\" srcset=\"https:\/\/debuggercafe.com\/wp-content\/uploads\/2024\/09\/litserve-vs-other-serving-apis.png 3122w, https:\/\/debuggercafe.com\/wp-content\/uploads\/2024\/09\/litserve-vs-other-serving-apis-300x75.png 300w, https:\/\/debuggercafe.com\/wp-content\/uploads\/2024\/09\/litserve-vs-other-serving-apis-768x192.png 768w, https:\/\/debuggercafe.com\/wp-content\/uploads\/2024\/09\/litserve-vs-other-serving-apis-1536x385.png 1536w, https:\/\/debuggercafe.com\/wp-content\/uploads\/2024\/09\/litserve-vs-other-serving-apis-2048x513.png 2048w\" sizes=\"auto, (max-width: 3122px) 100vw, 3122px\" \/><\/a><figcaption class=\"wp-element-caption\">Figure 2. LitServe compared to other serving APIs for serving LLMs.<\/figcaption><\/figure>\n<\/div>\n\n\n<h2 class=\"wp-block-heading\">How to Set Up LitServe Locally?<\/h2>\n\n\n\n<p>Installing LitServe locally is straightforward with PyPI. We are going to use the following versions of LitServe and LitGPT in this article. 
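Be sure to install these in a new environment. As a quick sketch, a fresh environment can be created with <code data-enlighter-language=\"generic\" class=\"EnlighterJSRAW\">venv<\/code> (the environment name below is arbitrary; conda works just as well).<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\"># Create and activate a fresh environment (the name is arbitrary).\npython -m venv litserve_env\nsource litserve_env\/bin\/activate<\/pre>\n\n\n\n<p>With the environment active, install both libraries.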
<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">pip install litgpt==0.4.11\npip install litserve==0.2.2<\/pre>\n\n\n\n<p>Installing the above two will also install PyTorch with CUDA dependencies and other necessary libraries.<\/p>\n\n\n\n<p><strong><a href=\"https:\/\/github.com\/Lightning-AI\/litgpt\/\" target=\"_blank\" rel=\"noreferrer noopener\">LitGPT<\/a><\/strong> is a library for running, <strong><a href=\"https:\/\/debuggercafe.com\/fine-tuning-phi-1-5-using-qlora\/\" target=\"_blank\" rel=\"noreferrer noopener\">fine-tuning, and evaluating LLMs<\/a><\/strong> with just one line of command. You can find an extensive set of <strong><a href=\"https:\/\/github.com\/Lightning-AI\/litgpt\/tree\/main\/tutorials\" target=\"_blank\" rel=\"noreferrer noopener\">tutorials here<\/a><\/strong>.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Project Directory Structure<\/h2>\n\n\n\n<p>As we will cover serving LLMs both locally and on Lightning Studio, it is important to go through the directory structure once.<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">\u251c\u2500\u2500 checkpoints\n\u2502\u00a0\u00a0 \u2514\u2500\u2500 google\n\u2502\u00a0\u00a0     \u2514\u2500\u2500 gemma-2-2b-it\n\u251c\u2500\u2500 api_call.py\n\u251c\u2500\u2500 client.py\n\u251c\u2500\u2500 commands.txt\n\u251c\u2500\u2500 README.md\n\u251c\u2500\u2500 requirements.txt\n\u251c\u2500\u2500 server_hf.py\n\u2514\u2500\u2500 server.py<\/pre>\n\n\n\n<ul class=\"wp-block-list\">\n<li>The <code data-enlighter-language=\"generic\" class=\"EnlighterJSRAW\">checkpoints<\/code> directory contains all the pretrained models that LitGPT downloads and converts.<\/li>\n\n\n\n<li>The <code data-enlighter-language=\"generic\" class=\"EnlighterJSRAW\">commands.txt<\/code> file contains some useful serving commands and the <code data-enlighter-language=\"generic\" class=\"EnlighterJSRAW\">README<\/code> file contains some essential links to docs and tutorials.<\/li>\n\n\n\n<li>There are four Python files that we will cover in the following sections.<\/li>\n\n\n\n<li>Finally, we have a requirements file to install the dependencies.<\/li>\n<\/ul>\n\n\n\n<p><\/p>\n\n\n\n<p>To install the rest of the dependencies, you can execute the following command.<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">pip install -r requirements.txt<\/pre>\n\n\n\n<p class=\"has-background\" style=\"background-color:#ffb76a\"><strong><em>All the above Python files, command files, README, and requirements file are available to download via the download section below.<\/em><\/strong><\/p>\n\n\n\n<h3 class=\"wp-block-heading has-text-align-center\" id=\"download-code\">Download Code<\/h3>\n\n\n\n<div class=\"wp-block-button is-style-outline center\"><a data-sumome-listbuilder-id=\"be6393be-21b8-4015-a3b5-c8ae21bce912\" class=\"wp-block-button__link has-black-color has-luminous-vivid-orange-background-color has-text-color 
has-background\"><b>Download the Source Code for this Tutorial<\/b><\/a><\/div>\n\n\n\n<h2 class=\"wp-block-heading\">Serving LLMs Through LitServe on Lightning Studio<\/h2>\n\n\n\n<p>In the first part, we are going to cover the serving of the Gemma-2 2B Instruction Tuned model through LitServe on Lightning Studio.<\/p>\n\n\n\n<p>We are using Gemma-2 2B as we can easily run it on both CPU and GPU.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Setting Up a New Studio<\/h3>\n\n\n\n<p>The first step is to <strong><a href=\"https:\/\/lightning.ai\/studios\" target=\"_blank\" rel=\"noreferrer noopener\">sign up for Lightning Studio<\/a><\/strong> and create a new Studio. <\/p>\n\n\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-full\"><a href=\"https:\/\/debuggercafe.com\/wp-content\/uploads\/2024\/09\/lightning-studio.png\" target=\"_blank\" rel=\"noreferrer noopener\"><img loading=\"lazy\" decoding=\"async\" width=\"1382\" height=\"755\" src=\"https:\/\/debuggercafe.com\/wp-content\/uploads\/2024\/09\/lightning-studio.png\" alt=\"Lightning Studio home screen.\" class=\"wp-image-37849\" srcset=\"https:\/\/debuggercafe.com\/wp-content\/uploads\/2024\/09\/lightning-studio.png 1382w, https:\/\/debuggercafe.com\/wp-content\/uploads\/2024\/09\/lightning-studio-300x164.png 300w, https:\/\/debuggercafe.com\/wp-content\/uploads\/2024\/09\/lightning-studio-768x420.png 768w\" sizes=\"auto, (max-width: 1382px) 100vw, 1382px\" \/><\/a><figcaption class=\"wp-element-caption\">Figure 3. Lightning Studio home screen.<\/figcaption><\/figure>\n<\/div>\n\n\n<p>You should also receive some free credits, which make it easier to play around with GPU-powered Studios and get to know the environment.<\/p>\n\n\n\n<p>Now, go under the Studio tab and create a new <strong>Teamspace<\/strong>. Under the newly created teamspace, we can start a new studio.<\/p>\n\n\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-full\"><a href=\"https:\/\/debuggercafe.com\/wp-content\/uploads\/2024\/09\/new-studio-lightning-studio.png\" target=\"_blank\" rel=\"noreferrer noopener\"><img loading=\"lazy\" decoding=\"async\" width=\"759\" height=\"414\" src=\"https:\/\/debuggercafe.com\/wp-content\/uploads\/2024\/09\/new-studio-lightning-studio.png\" alt=\"Starting a new studio under Teamspaces in Lightning Studio.\" class=\"wp-image-37851\" srcset=\"https:\/\/debuggercafe.com\/wp-content\/uploads\/2024\/09\/new-studio-lightning-studio.png 759w, https:\/\/debuggercafe.com\/wp-content\/uploads\/2024\/09\/new-studio-lightning-studio-300x164.png 300w\" sizes=\"auto, (max-width: 759px) 100vw, 759px\" \/><\/a><figcaption class=\"wp-element-caption\">Figure 4. Starting a new studio under Teamspaces in Lightning Studio.<\/figcaption><\/figure>\n<\/div>\n\n\n<p>In case you want to replicate the studio that I created, check the published <strong><a href=\"https:\/\/lightning.ai\/sovitrath5\/studios\/gemma2-serving\" target=\"_blank\" rel=\"noreferrer noopener\">studio here<\/a><\/strong>; you can make a clone of it.<\/p>\n\n\n\n<p>The new studio should open in CPU mode by default. You can continue as it is or switch to a GPU. 
Navigating around the studio should be simple enough as it is a VS Code environment.<\/p>\n\n\n\n<p>After the environment starts, be sure to install <code data-enlighter-language=\"generic\" class=\"EnlighterJSRAW\">litgpt<\/code> and <code data-enlighter-language=\"generic\" class=\"EnlighterJSRAW\">litserve<\/code> using the terminal.<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">pip install litgpt\npip install litserve<\/pre>\n\n\n\n<h3 class=\"wp-block-heading\">Server and Client Scripts for LLM Serving<\/h3>\n\n\n\n<p>Here, we need to focus on the <code data-enlighter-language=\"generic\" class=\"EnlighterJSRAW\">server.py<\/code> and <code data-enlighter-language=\"generic\" class=\"EnlighterJSRAW\">client.py<\/code> scripts. The following are the contents of both.<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"server.py\" data-enlighter-group=\"server_1\">import litgpt\nimport litserve as ls\n\nclass LlmAPI(ls.LitAPI):\n    def setup(self, device):\n        self.llm = litgpt.LLM.load('google\/gemma-2-2b-it')\n\n    def decode_request(self, request):\n        return request['prompt']\n\n    def predict(self, prompt):\n        return self.llm.generate(\n            prompt, max_new_tokens=1024\n        )\n\n    def encode_response(self, output):\n        return {'output': output}\n\nif __name__ == '__main__':\n    api = LlmAPI()\n    server = ls.LitServer(\n        api, \n        accelerator='auto'\n    )\n    server.run(port=8000)<\/pre>\n\n\n\n<p>The above is the server script that loads the Gemma-2 2B Instruction Tuned model. Let&#8217;s go through some of the important parts:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>We need just two imports, <code data-enlighter-language=\"generic\" class=\"EnlighterJSRAW\">litgpt<\/code> and <code data-enlighter-language=\"generic\" class=\"EnlighterJSRAW\">litserve<\/code>. We can access the internal API modules through <code data-enlighter-language=\"generic\" class=\"EnlighterJSRAW\">litserve<\/code>.<\/li>\n\n\n\n<li>We have a class called <code data-enlighter-language=\"generic\" class=\"EnlighterJSRAW\">LlmAPI<\/code>. This has four methods: <code data-enlighter-language=\"generic\" class=\"EnlighterJSRAW\">setup<\/code>, <code data-enlighter-language=\"generic\" class=\"EnlighterJSRAW\">decode_request<\/code>, <code data-enlighter-language=\"generic\" class=\"EnlighterJSRAW\">predict<\/code>, and <code data-enlighter-language=\"generic\" class=\"EnlighterJSRAW\">encode_response<\/code>.<\/li>\n\n\n\n<li>The <code data-enlighter-language=\"generic\" class=\"EnlighterJSRAW\">setup<\/code> method runs once when the server starts. It will download the model and load it into memory. Notice that we are using LitGPT to load the Gemma-2 model.<\/li>\n\n\n\n<li>The <code data-enlighter-language=\"generic\" class=\"EnlighterJSRAW\">decode_request<\/code> method decodes the user prompt that the client sends.
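 For example, a request body of <code data-enlighter-language=\"generic\" class=\"EnlighterJSRAW\">{'prompt': 'Hello'}<\/code> is reduced to just the string <code data-enlighter-language=\"generic\" class=\"EnlighterJSRAW\">'Hello'<\/code>.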
<\/li>\n\n\n\n<li>The <code data-enlighter-language=\"generic\" class=\"EnlighterJSRAW\">predict<\/code> method carries out the forward pass after the decoding step is complete.<\/li>\n\n\n\n<li>Finally, the <code data-enlighter-language=\"generic\" class=\"EnlighterJSRAW\">encode_response<\/code> method sends the output back to the client as a response from the LLM.<\/li>\n<\/ul>\n\n\n\n<p><\/p>\n\n\n\n<p>The main block initializes the server on port 8000. By default, it starts on the localhost, http:\/\/127.0.0.1.<\/p>\n\n\n\n<p>Following is the client script.<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"client.py\" data-enlighter-group=\"client_1\">import requests\nimport json\nimport argparse\n\nSERVER_URL = 'http:\/\/127.0.0.1:8000'\n\ndef main():\n    # Set up command line argument parsing\n    parser = argparse.ArgumentParser(description='Send a prompt to the LitServe server.')\n    parser.add_argument(\n        '--prompt', \n        type=str, \n        help='The prompt text to generate text from.',\n        default='Hello. How are you?'\n    )\n    \n    # Parse command line arguments\n    args = parser.parse_args()\n    \n    # Use the provided prompt from the command line\n    prompt_text = args.prompt\n    \n    # Define the server's URL and the endpoint\n    predict_endpoint = '\/predict'\n    \n    # Prepare the request data as a dictionary\n    request_data = {\n        'prompt': prompt_text\n    }\n    \n    # Send a POST request to the server and receive the response\n    response = requests.post(f'{SERVER_URL}{predict_endpoint}', json=request_data)\n    \n    # Check if the request was successful\n    if response.status_code == 200:\n        # Parse the JSON response\n        response_data = json.loads(response.text)\n        \n        # Print the output from the response\n        print('Response:', response_data['output'])\n    else:\n        print(f'Error: Received response code {response.status_code}')\n\nif __name__ == '__main__':\n    main()<\/pre>\n\n\n\n<p>This script is quite straightforward. We have a <code data-enlighter-language=\"generic\" class=\"EnlighterJSRAW\">SERVER_URL<\/code>, which is the localhost address where our server is running. Then we have a <code data-enlighter-language=\"generic\" class=\"EnlighterJSRAW\">main<\/code> function which has a single command line argument to pass the user prompt.<\/p>\n\n\n\n<p>Whatever prompt the user sends will hit the server&#8217;s <code data-enlighter-language=\"generic\" class=\"EnlighterJSRAW\">predict<\/code> endpoint, and the request will be sent as JSON data.<\/p>\n\n\n\n<p>The response from the LLM will also be JSON data that we decode as a dictionary where the <code data-enlighter-language=\"generic\" class=\"EnlighterJSRAW\">output<\/code> key holds the LLM response. We print that on the terminal.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Executing the Server and Client Script<\/h3>\n\n\n\n<p>In the VS Code Studio environment, we first need to execute the server script, then the client script. Before executing the server script, ensure that you log in to Hugging Face via the CLI, as Gemma-2 is a gated model.
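 Also make sure that you have accepted the Gemma terms of use on the Hugging Face model page; otherwise, the model download will fail.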
Use the following command and paste your Hugging Face token when prompted.<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">huggingface-cli login<\/pre>\n\n\n\n<p>Next, execute the server script.<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">python server.py<\/pre>\n\n\n\n<p>The first time you execute the script, it will download the model from Hugging Face, store it in the <code data-enlighter-language=\"generic\" class=\"EnlighterJSRAW\">checkpoints<\/code> directory, and also convert it into the <code data-enlighter-language=\"generic\" class=\"EnlighterJSRAW\">litgpt<\/code> format.<\/p>\n\n\n\n<p>Next, open a new terminal and execute the client script while passing down a prompt.<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">python client.py --prompt \"Hello! How are you?\"<\/pre>\n\n\n\n<p>If you are running it on a GPU, you should get the response right away. On a CPU, it might take a while.<\/p>\n\n\n\n<p>Here is a video showing the entire process on Lightning Studio.<\/p>\n\n\n\n<figure class=\"wp-block-video aligncenter\"><video height=\"1244\" style=\"aspect-ratio: 1700 \/ 1244;\" width=\"1700\" autoplay controls loop muted src=\"https:\/\/debuggercafe.com\/wp-content\/uploads\/2024\/09\/litserve-studio-inference.mp4\"><\/video><figcaption class=\"wp-element-caption\">Video 1. Running Server and Client scripts on Lightning Studio for serving LLMs using LitServe.<\/figcaption><\/figure>\n\n\n\n<h3 class=\"wp-block-heading\">Exposing an API Through Lightning Studio<\/h3>\n\n\n\n<p>Above, the client query was sent from the same environment where the server was running. But what if we want to send a request from our own system, or someone else wants to send one from theirs? For this, Lightning Studio provides a one-click solution to expose APIs, the <strong>API Builder<\/strong>. We just need to add that as a plugin in the Studio environment, and we will get an exposed URL that we can use.<\/p>\n\n\n\n<p>After getting the URL, open the <code data-enlighter-language=\"generic\" class=\"EnlighterJSRAW\">api_call.py<\/code> script on your <strong>local system<\/strong> and paste the URL as shown below.<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"api_call.py\" data-enlighter-group=\"api_call_1\">import requests, json\nimport argparse\n\nparser = argparse.ArgumentParser()\nparser.add_argument(\n    '--prompt'\n)\n\nargs = parser.parse_args()\n\nresponse = requests.post(\n    # SERVER_URL. Replace with your URL\/predict\n    'SERVER_URL\/predict',\n    json={'prompt': args.prompt},\n    stream=True\n)\n\n# With streaming.\nfor line in response.iter_lines(decode_unicode=True):\n    if line:\n        print(json.loads(line)['output'], end='')<\/pre>\n\n\n\n<p>That&#8217;s it.
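 As a quick sanity check, the same endpoint can also be hit with <code data-enlighter-language=\"generic\" class=\"EnlighterJSRAW\">curl<\/code> &#8211; a minimal sketch, assuming SERVER_URL is replaced with your exposed URL:<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\"># POST a JSON prompt to the \/predict endpoint.\ncurl -X POST \"SERVER_URL\/predict\" \\\n  -H \"Content-Type: application\/json\" \\\n  -d '{\"prompt\": \"Hello! How are you?\"}'<\/pre>\n\n\n\n<p>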
Now, we can execute the <code data-enlighter-language=\"generic\" class=\"EnlighterJSRAW\">api_call.py<\/code> script with a prompt from our local terminal and get an answer from the LLM.<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">python api_call.py --prompt \"Hello\"<\/pre>\n\n\n\n<p>Take a look at the following video to see how the entire process works.<\/p>\n\n\n\n<figure class=\"wp-block-video aligncenter\"><video height=\"1244\" style=\"aspect-ratio: 1700 \/ 1244;\" width=\"1700\" autoplay controls loop muted src=\"https:\/\/debuggercafe.com\/wp-content\/uploads\/2024\/09\/litserve-studio-api-bulider.mp4\"><\/video><figcaption class=\"wp-element-caption\">Video 2. Creating an API using Lightning Studio API Builder and exposing the API.<\/figcaption><\/figure>\n\n\n\n<p><strong><em>When building the API, make sure that the port (8000 in this case) matches the one where you create the API. In our case, the server was running on port 8000. If you run the server on a different port, then the API Builder port has to change as well<\/em>.<\/strong><\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Running Hugging Face Transformers Models and Serving Through LitServe<\/h3>\n\n\n\n<p>In the previous section, we ran the models on Lightning Studio using LitGPT. However, there may be some custom models available (even our own fine-tuned ones) on the Hugging Face Model Hub that we want to serve. Fortunately, LitServe makes it really easy to customize the server class and serve any model from Hugging Face. <\/p>\n\n\n\n<p>Upload the <code data-enlighter-language=\"generic\" class=\"EnlighterJSRAW\">server_hf.py<\/code> script, which contains the code to run the Gemma-2 2B Instruction Tuned model from Hugging Face, to the Studio environment.<\/p>\n\n\n\n<p>In case <code data-enlighter-language=\"generic\" class=\"EnlighterJSRAW\">transformers<\/code> is not installed, you can do so by executing <code data-enlighter-language=\"generic\" class=\"EnlighterJSRAW\">pip install transformers<\/code>.<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"server_hf.py\" data-enlighter-group=\"server_hf_1\">\"\"\"\nLitAPI serving for Hugging Face LLMs.\n\nUSAGE:\n$ python server_hf.py\n\"\"\"\n\nimport litserve as ls\nimport torch\n\nfrom transformers import AutoTokenizer, AutoModelForCausalLM\n\nclass LlmAPI(ls.LitAPI):\n    def setup(self, device):\n        self.device = device\n        self.tokenizer = AutoTokenizer.from_pretrained('google\/gemma-2-2b-it')\n        self.model = AutoModelForCausalLM.from_pretrained(\n            'google\/gemma-2-2b-it',\n            torch_dtype=torch.float16\n        ).to(device).eval()\n\n    def decode_request(self, request):\n        text = request['prompt']\n        input_ids = self.tokenizer(text, return_tensors='pt').to(self.device)\n        return input_ids\n\n    def predict(self, input_ids):\n        with torch.no_grad():\n            outputs = self.model.generate(\n                **input_ids,\n                max_new_tokens=1024\n            )\n        return outputs\n\n    def encode_response(self, outputs):
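        # De-tokenize the generated ids back into a string. skip_special_tokens\n        # drops special markers such as the beginning and end of sequence tokens.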
        return {'output': self.tokenizer.decode(\n            outputs[0], skip_special_tokens=True\n        )}\n\nif __name__ == '__main__':\n    api = LlmAPI()\n    server = ls.LitServer(\n        api, \n        accelerator='auto'\n    )\n    server.run(port=8000)<\/pre>\n\n\n\n<p>Here we modify the <code data-enlighter-language=\"generic\" class=\"EnlighterJSRAW\">setup<\/code>, <code data-enlighter-language=\"generic\" class=\"EnlighterJSRAW\">decode_request<\/code>, <code data-enlighter-language=\"generic\" class=\"EnlighterJSRAW\">predict<\/code>, and <code data-enlighter-language=\"generic\" class=\"EnlighterJSRAW\">encode_response<\/code> methods as per the requirements of the model that we are using.<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>The <code data-enlighter-language=\"generic\" class=\"EnlighterJSRAW\">setup<\/code> method now loads the Gemma-2 tokenizer and the model from Hugging Face in float16.<\/li>\n\n\n\n<li>The <code data-enlighter-language=\"generic\" class=\"EnlighterJSRAW\">decode_request<\/code> method tokenizes the prompt that we pass through the client script.<\/li>\n\n\n\n<li>The forward pass through the model happens in the <code data-enlighter-language=\"generic\" class=\"EnlighterJSRAW\">predict<\/code> method. <strong><em>Notice that it uses the <code data-enlighter-language=\"generic\" class=\"EnlighterJSRAW\">input_ids<\/code> from the <code data-enlighter-language=\"generic\" class=\"EnlighterJSRAW\">decode_request<\/code> method.<\/em><\/strong><\/li>\n\n\n\n<li>And the <code data-enlighter-language=\"generic\" class=\"EnlighterJSRAW\">encode_response<\/code> method de-tokenizes the response from the model and returns it back to the client.<\/li>\n<\/ul>\n\n\n\n<p><\/p>\n\n\n\n<p>Running this is the same as running the previous server script. However, as it is running on the same port, ensure that you quit the previous script before executing this one. Execute the server in one terminal.<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">python server_hf.py<\/pre>\n\n\n\n<p>And the client in another terminal.<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">python client.py --prompt \"Tell me about CNNs.\"<\/pre>\n\n\n\n<p>You should see a response on the terminal after a few seconds.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Streaming Response<\/h3>\n\n\n\n<p>At the moment, we get the response from the model on the client terminal in one shot. However, the <code data-enlighter-language=\"generic\" class=\"EnlighterJSRAW\">api_call.py<\/code> script supports streaming of text. This allows us to stream output tokens as they come, just as in real chat applications.<\/p>\n\n\n\n<p>To use the streaming capabilities of LitGPT and LitServe, we need to use the <code data-enlighter-language=\"generic\" class=\"EnlighterJSRAW\">litgpt serve<\/code> CLI command.
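<\/p>\n\n\n\n<p>For reference, streaming can also be implemented in a custom <code data-enlighter-language=\"generic\" class=\"EnlighterJSRAW\">LitAPI<\/code> class by turning <code data-enlighter-language=\"generic\" class=\"EnlighterJSRAW\">predict<\/code> and <code data-enlighter-language=\"generic\" class=\"EnlighterJSRAW\">encode_response<\/code> into generators and passing <code data-enlighter-language=\"generic\" class=\"EnlighterJSRAW\">stream=True<\/code> to <code data-enlighter-language=\"generic\" class=\"EnlighterJSRAW\">LitServer<\/code>. The following is only a rough sketch; it assumes that LitGPT&#8217;s <code data-enlighter-language=\"generic\" class=\"EnlighterJSRAW\">generate<\/code> method accepts a <code data-enlighter-language=\"generic\" class=\"EnlighterJSRAW\">stream=True<\/code> flag that yields tokens &#8211; check the LitGPT docs for the exact signature.<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">import litgpt\nimport litserve as ls\n\nclass StreamingLlmAPI(ls.LitAPI):\n    def setup(self, device):\n        self.llm = litgpt.LLM.load('google\/gemma-2-2b-it')\n\n    def decode_request(self, request):\n        return request['prompt']\n\n    def predict(self, prompt):\n        # Yield tokens one by one instead of returning the full text.\n        # Assumes generate() supports stream=True (check the LitGPT docs).\n        yield from self.llm.generate(\n            prompt, max_new_tokens=1024, stream=True\n        )\n\n    def encode_response(self, output_stream):\n        # Wrap each streamed token in the same JSON structure as before.\n        for token in output_stream:\n            yield {'output': token}\n\nif __name__ == '__main__':\n    api = StreamingLlmAPI()\n    # stream=True tells LitServer to send the response incrementally.\n    server = ls.LitServer(api, accelerator='auto', stream=True)\n    server.run(port=8000)<\/pre>\n\n\n\n<p>Here, however, we will stick to the CLI route.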
<\/p>\n\n\n\n<p>Quit the current server script that is running, and execute the following in the Lightning Studio terminal.<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">litgpt serve google\/gemma-2-2b-it --port 8000 --stream True --max_new_tokens 1024 --devices 0<\/pre>\n\n\n\n<p>We execute the same model, but this time using the <code data-enlighter-language=\"generic\" class=\"EnlighterJSRAW\">litgpt serve<\/code> CLI command. <\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>It first accepts the model name<\/li>\n\n\n\n<li>Then the port number<\/li>\n\n\n\n<li>Next, whether we want to stream the output <\/li>\n\n\n\n<li>And finally, the maximum number of output tokens<\/li>\n<\/ul>\n\n\n\n<p><\/p>\n\n\n\n<p>In the above command, we use <code data-enlighter-language=\"generic\" class=\"EnlighterJSRAW\">--devices<\/code> 0, which indicates that the model should be loaded onto the CPU as there are no GPU devices. If you have enabled a GPU in your Lightning Studio environment, you can skip the final argument.<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">litgpt serve google\/gemma-2-2b-it --port 8000 --stream True --max_new_tokens 1024<\/pre>\n\n\n\n<p>Similarly, in your local terminal, execute the <code data-enlighter-language=\"generic\" class=\"EnlighterJSRAW\">api_call.py<\/code> script. As the API Builder is already running and we have set the exposed URL path with the endpoint, we can directly execute it with a prompt of our choice.<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">python api_call.py --prompt \"Tell me about NLP\"<\/pre>\n\n\n\n<h2 class=\"wp-block-heading\">Exposing API on Local System<\/h2>\n\n\n\n<p>All the steps that we carried out above will also work on the local system when executed in two different terminals. In fact, when we created the API on Lightning Studio, we executed the client call from the local terminal.<\/p>\n\n\n\n<p><strong><em>But what if we want to serve a model using LitServe on our local system and expose the API?<\/em><\/strong><\/p>\n\n\n\n<p>In general, this is a challenging task.<\/p>\n\n\n\n<p>However, <strong><a href=\"https:\/\/ngrok.com\/\" target=\"_blank\" rel=\"noreferrer noopener\">ngrok<\/a><\/strong> makes it really straightforward. In simple words, it creates a publicly exposed URL that tunnels to the localhost port where the application is running.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Setting Up ngrok<\/h3>\n\n\n\n<p>Here, we will install ngrok on Ubuntu. First, you need to create an account. Next, to install, execute the following command on the terminal. 
You can find the <strong><a href=\"https:\/\/dashboard.ngrok.com\/get-started\/setup\/linux\" target=\"_blank\" rel=\"noreferrer noopener\">setup &amp; installation steps here<\/a><\/strong>.<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">curl -sSL https:\/\/ngrok-agent.s3.amazonaws.com\/ngrok.asc \\\n\t| sudo tee \/etc\/apt\/trusted.gpg.d\/ngrok.asc >\/dev\/null \\\n\t&amp;&amp; echo \"deb https:\/\/ngrok-agent.s3.amazonaws.com buster main\" \\\n\t| sudo tee \/etc\/apt\/sources.list.d\/ngrok.list \\\n\t&amp;&amp; sudo apt update \\\n\t&amp;&amp; sudo apt install ngrok<\/pre>\n\n\n\n<p>Every ngrok account comes with an authentication token that we need to set up as well. In the above link, you can see the next command to set the authentication token on your system.<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">ngrok config add-authtoken YOUR_AUTH_TOKEN<\/pre>\n\n\n\n<p>That&#8217;s it, we have everything set.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Running the Server Script Locally<\/h3>\n\n\n\n<p>Now, let&#8217;s run the server script locally.<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">python server.py<\/pre>\n\n\n\n<p>Now, in another terminal, create an exposed URL of the localhost running on port 8000 through tunneling.<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">ngrok http http:\/\/localhost:8000<\/pre>\n\n\n\n<p>The terminal screen should change to be similar to the following.<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">ngrok                                                            (Ctrl+C to quit)\n\nSign up to try new private endpoints https:\/\/ngrok.com\/new-features-update?ref=private\n\nSession Status                online\nAccount                       sovitrath5@gmail.com (Plan: Free)\nVersion                       3.15.0\nRegion                        India (in)\nLatency                       82ms\nWeb Interface                 http:\/\/127.0.0.1:4040\nForwarding                    https:\/\/4f5c-2401-4900-4bbc-4040-5553-445c-7733-ac0b.ngrok-free.app -> http:\/\/localhost:8000\n\nConnections                   ttl     opn     rt1     rt5     p50     p90\n                              0       0       0.00    0.00    0.00    0.00<\/pre>\n\n\n\n<p>Above, the HTTPS URL after <code data-enlighter-language=\"generic\" class=\"EnlighterJSRAW\">Forwarding<\/code> is the exposed URL that we can use instead of the Lightning Studio API Builder URL.<\/p>\n\n\n\n<p>Update the same in the <code data-enlighter-language=\"generic\" class=\"EnlighterJSRAW\">api_call.py<\/code> script. Replace the <strong>NGROK_URL<\/strong> with the URL that you get on your terminal.<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"api_call.py\" data-enlighter-group=\"api_call_2\">import requests, json\nimport argparse\n\nparser = argparse.ArgumentParser()\nparser.add_argument(\n    '--prompt'\n)\n\nargs = parser.parse_args()\n\nresponse = requests.post(\n    # SERVER_URL\n    'NGROK_URL\/predict',\n    json={'prompt': args.prompt},\n    stream=True\n)\n\n# With streaming.\nfor line in response.iter_lines(decode_unicode=True):\n    if line:\n        print(json.loads(line)['output'], end='')<\/pre>\n\n\n\n<p>Next, open another terminal, and execute the API call script with a prompt of your choice.<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">python api_call.py --prompt \"Hello\"<\/pre>\n\n\n\n<p>The following video shows the entire process.<\/p>\n\n\n\n<figure class=\"wp-block-video aligncenter\"><video height=\"1244\" style=\"aspect-ratio: 1694 \/ 1244;\" width=\"1694\" autoplay controls loop muted src=\"https:\/\/debuggercafe.com\/wp-content\/uploads\/2024\/09\/litserve-local-exposed-api.mp4\"><\/video><figcaption class=\"wp-element-caption\">Video 3. 
Exposing an API with LitServe and ngrok for LLM inference.<\/figcaption><\/figure>\n\n\n\n<p>You can similarly execute the <code data-enlighter-language=\"generic\" class=\"EnlighterJSRAW\">litgpt serve<\/code> CLI command with streaming and execute the <code data-enlighter-language=\"generic\" class=\"EnlighterJSRAW\">api_call.py<\/code> script to get streaming output. <\/p>\n\n\n\n<p>With the URL provided by ngrok, we can make the API call from any network, not just the local system. However, a free ngrok account is not production ready: there are API call limitations, and the link will also expire after a few hours. But it is possible to take it from here and make it production ready, which at the moment is out of the scope of this article.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Summary and Conclusion<\/h2>\n\n\n\n<p>We covered a lot in this article, starting from setting up LitGPT and LitServe locally, running &amp; serving models on Lightning Studio, to exposing URLs from the local system, along with a discussion of the limitations.<\/p>\n\n\n\n<p>Right now, we are well-equipped to build interesting applications. The best part is that we can also serve any model that we want through LitServe, and not just LLMs. <em>Why don&#8217;t you try playing around with diffusion-based image generation models and try the same process with the API Builder?<\/em> Let others know in the comments what you are building. I hope that this article was worth your time.<\/p>\n\n\n\n<p>If you have any doubts, thoughts, or suggestions, please leave them in the comment section. I will surely address them.<\/p>\n\n\n\n<p>You can contact me using the <strong><a aria-label=\"Contact (opens in a new tab)\" href=\"https:\/\/debuggercafe.com\/contact-us\/\" target=\"_blank\" rel=\"noreferrer noopener\">Contact<\/a><\/strong> section. You can also find me on <strong><a aria-label=\"LinkedIn (opens in a new tab)\" href=\"https:\/\/www.linkedin.com\/in\/sovit-rath\/\" target=\"_blank\" rel=\"noreferrer noopener\">LinkedIn<\/a><\/strong>, and <strong><a href=\"https:\/\/x.com\/SovitRath5\" target=\"_blank\" rel=\"noreferrer noopener\">X<\/a><\/strong>.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>In this article, we use LitGPT, LitAPI, and LitServe for serving LLMs using Lightning Studio and also on the local system. 
<\/p>\n","protected":false},"author":1,"featured_media":37869,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[1010,819,409],"tags":[1019,1013,1016,1014,1011,1017,1018,1015,1012],"class_list":["post-37736","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-lightning-ai","category-llms","category-nlp","tag-lightning-api-builder","tag-lightning-studio","tag-litapi","tag-litgpt","tag-litserve","tag-litserve-api","tag-litserve-api-builder","tag-litserve-llms","tag-serving-llms"],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v26.9 - https:\/\/yoast.com\/product\/yoast-seo-wordpress\/ -->\n<title>Serving LLMs using LitServe<\/title>\n<meta name=\"description\" content=\"Serving LLMs using LitServe, LitGPT, &amp; LitAPI and exposing the API URL for remote executing of the Gemma2-2b instruction tuned LLM.\" \/>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/debuggercafe.com\/serving-llms-using-litserve\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"Serving LLMs using LitServe\" \/>\n<meta property=\"og:description\" content=\"Serving LLMs using LitServe, LitGPT, &amp; LitAPI and exposing the API URL for remote executing of the Gemma2-2b instruction tuned LLM.\" \/>\n<meta property=\"og:url\" content=\"https:\/\/debuggercafe.com\/serving-llms-using-litserve\/\" \/>\n<meta property=\"og:site_name\" content=\"DebuggerCafe\" \/>\n<meta property=\"article:publisher\" content=\"https:\/\/www.facebook.com\/profile.php?id=100013731104496\" \/>\n<meta property=\"article:published_time\" content=\"2024-10-21T00:30:00+00:00\" \/>\n<meta property=\"article:modified_time\" content=\"2024-10-21T00:32:15+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/debuggercafe.com\/wp-content\/uploads\/2024\/09\/Serving-LLMs-using-LitServe-e1725412076490.png\" \/>\n\t<meta property=\"og:image:width\" content=\"1000\" \/>\n\t<meta property=\"og:image:height\" content=\"563\" \/>\n\t<meta property=\"og:image:type\" content=\"image\/png\" \/>\n<meta name=\"author\" content=\"Sovit Ranjan Rath\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:creator\" content=\"@SovitRath5\" \/>\n<meta name=\"twitter:site\" content=\"@SovitRath5\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"Sovit Ranjan Rath\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. 
reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"12 minutes\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\/\/schema.org\",\"@graph\":[{\"@type\":\"Article\",\"@id\":\"https:\/\/debuggercafe.com\/serving-llms-using-litserve\/#article\",\"isPartOf\":{\"@id\":\"https:\/\/debuggercafe.com\/serving-llms-using-litserve\/\"},\"author\":{\"name\":\"Sovit Ranjan Rath\",\"@id\":\"https:\/\/debuggercafe.com\/#\/schema\/person\/27719b14d930bd4a88ade40d18b0a752\"},\"headline\":\"Serving LLMs using LitServe\",\"datePublished\":\"2024-10-21T00:30:00+00:00\",\"dateModified\":\"2024-10-21T00:32:15+00:00\",\"mainEntityOfPage\":{\"@id\":\"https:\/\/debuggercafe.com\/serving-llms-using-litserve\/\"},\"wordCount\":2298,\"commentCount\":1,\"image\":{\"@id\":\"https:\/\/debuggercafe.com\/serving-llms-using-litserve\/#primaryimage\"},\"thumbnailUrl\":\"https:\/\/debuggercafe.com\/wp-content\/uploads\/2024\/09\/Serving-LLMs-using-LitServe-e1725412076490.png\",\"keywords\":[\"Lightning API Builder\",\"Lightning Studio\",\"LitAPI\",\"LitGPT\",\"LitServe\",\"LitServe API\",\"LitServe API Builder\",\"LitServe LLMs\",\"Serving LLMs\"],\"articleSection\":[\"Lightning AI\",\"LLMs\",\"NLP\"],\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"CommentAction\",\"name\":\"Comment\",\"target\":[\"https:\/\/debuggercafe.com\/serving-llms-using-litserve\/#respond\"]}]},{\"@type\":\"WebPage\",\"@id\":\"https:\/\/debuggercafe.com\/serving-llms-using-litserve\/\",\"url\":\"https:\/\/debuggercafe.com\/serving-llms-using-litserve\/\",\"name\":\"Serving LLMs using LitServe\",\"isPartOf\":{\"@id\":\"https:\/\/debuggercafe.com\/#website\"},\"primaryImageOfPage\":{\"@id\":\"https:\/\/debuggercafe.com\/serving-llms-using-litserve\/#primaryimage\"},\"image\":{\"@id\":\"https:\/\/debuggercafe.com\/serving-llms-using-litserve\/#primaryimage\"},\"thumbnailUrl\":\"https:\/\/debuggercafe.com\/wp-content\/uploads\/2024\/09\/Serving-LLMs-using-LitServe-e1725412076490.png\",\"datePublished\":\"2024-10-21T00:30:00+00:00\",\"dateModified\":\"2024-10-21T00:32:15+00:00\",\"author\":{\"@id\":\"https:\/\/debuggercafe.com\/#\/schema\/person\/27719b14d930bd4a88ade40d18b0a752\"},\"description\":\"Serving LLMs using LitServe, LitGPT, & LitAPI and exposing the API URL for remote executing of the Gemma2-2b instruction tuned LLM.\",\"breadcrumb\":{\"@id\":\"https:\/\/debuggercafe.com\/serving-llms-using-litserve\/#breadcrumb\"},\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\/\/debuggercafe.com\/serving-llms-using-litserve\/\"]}]},{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/debuggercafe.com\/serving-llms-using-litserve\/#primaryimage\",\"url\":\"https:\/\/debuggercafe.com\/wp-content\/uploads\/2024\/09\/Serving-LLMs-using-LitServe-e1725412076490.png\",\"contentUrl\":\"https:\/\/debuggercafe.com\/wp-content\/uploads\/2024\/09\/Serving-LLMs-using-LitServe-e1725412076490.png\",\"width\":1000,\"height\":563,\"caption\":\"Serving LLMs using LitServe\"},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\/\/debuggercafe.com\/serving-llms-using-litserve\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\/\/debuggercafe.com\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"Serving LLMs using LitServe\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\/\/debuggercafe.com\/#website\",\"url\":\"https:\/\/debuggercafe.com\/\",\"name\":\"DebuggerCafe\",\"description\":\"Machine 
Learning and Deep Learning\",\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\/\/debuggercafe.com\/?s={search_term_string}\"},\"query-input\":{\"@type\":\"PropertyValueSpecification\",\"valueRequired\":true,\"valueName\":\"search_term_string\"}}],\"inLanguage\":\"en-US\"},{\"@type\":\"Person\",\"@id\":\"https:\/\/debuggercafe.com\/#\/schema\/person\/27719b14d930bd4a88ade40d18b0a752\",\"name\":\"Sovit Ranjan Rath\",\"image\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/debuggercafe.com\/#\/schema\/person\/image\/\",\"url\":\"https:\/\/secure.gravatar.com\/avatar\/f71ca13ec56d630e7d8045e8b846396068791aa204936c3d74d721c6dd2b4d3c?s=96&d=mm&r=g\",\"contentUrl\":\"https:\/\/secure.gravatar.com\/avatar\/f71ca13ec56d630e7d8045e8b846396068791aa204936c3d74d721c6dd2b4d3c?s=96&d=mm&r=g\",\"caption\":\"Sovit Ranjan Rath\"}}]}<\/script>\n<!-- \/ Yoast SEO plugin. -->","yoast_head_json":{"title":"Serving LLMs using LitServe","description":"Serving LLMs using LitServe, LitGPT, & LitAPI and exposing the API URL for remote executing of the Gemma2-2b instruction tuned LLM.","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/debuggercafe.com\/serving-llms-using-litserve\/","og_locale":"en_US","og_type":"article","og_title":"Serving LLMs using LitServe","og_description":"Serving LLMs using LitServe, LitGPT, & LitAPI and exposing the API URL for remote executing of the Gemma2-2b instruction tuned LLM.","og_url":"https:\/\/debuggercafe.com\/serving-llms-using-litserve\/","og_site_name":"DebuggerCafe","article_publisher":"https:\/\/www.facebook.com\/profile.php?id=100013731104496","article_published_time":"2024-10-21T00:30:00+00:00","article_modified_time":"2024-10-21T00:32:15+00:00","og_image":[{"width":1000,"height":563,"url":"https:\/\/debuggercafe.com\/wp-content\/uploads\/2024\/09\/Serving-LLMs-using-LitServe-e1725412076490.png","type":"image\/png"}],"author":"Sovit Ranjan Rath","twitter_card":"summary_large_image","twitter_creator":"@SovitRath5","twitter_site":"@SovitRath5","twitter_misc":{"Written by":"Sovit Ranjan Rath","Est. 
reading time":"12 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/debuggercafe.com\/serving-llms-using-litserve\/#article","isPartOf":{"@id":"https:\/\/debuggercafe.com\/serving-llms-using-litserve\/"},"author":{"name":"Sovit Ranjan Rath","@id":"https:\/\/debuggercafe.com\/#\/schema\/person\/27719b14d930bd4a88ade40d18b0a752"},"headline":"Serving LLMs using LitServe","datePublished":"2024-10-21T00:30:00+00:00","dateModified":"2024-10-21T00:32:15+00:00","mainEntityOfPage":{"@id":"https:\/\/debuggercafe.com\/serving-llms-using-litserve\/"},"wordCount":2298,"commentCount":1,"image":{"@id":"https:\/\/debuggercafe.com\/serving-llms-using-litserve\/#primaryimage"},"thumbnailUrl":"https:\/\/debuggercafe.com\/wp-content\/uploads\/2024\/09\/Serving-LLMs-using-LitServe-e1725412076490.png","keywords":["Lightning API Builder","Lightning Studio","LitAPI","LitGPT","LitServe","LitServe API","LitServe API Builder","LitServe LLMs","Serving LLMs"],"articleSection":["Lightning AI","LLMs","NLP"],"inLanguage":"en-US","potentialAction":[{"@type":"CommentAction","name":"Comment","target":["https:\/\/debuggercafe.com\/serving-llms-using-litserve\/#respond"]}]},{"@type":"WebPage","@id":"https:\/\/debuggercafe.com\/serving-llms-using-litserve\/","url":"https:\/\/debuggercafe.com\/serving-llms-using-litserve\/","name":"Serving LLMs using LitServe","isPartOf":{"@id":"https:\/\/debuggercafe.com\/#website"},"primaryImageOfPage":{"@id":"https:\/\/debuggercafe.com\/serving-llms-using-litserve\/#primaryimage"},"image":{"@id":"https:\/\/debuggercafe.com\/serving-llms-using-litserve\/#primaryimage"},"thumbnailUrl":"https:\/\/debuggercafe.com\/wp-content\/uploads\/2024\/09\/Serving-LLMs-using-LitServe-e1725412076490.png","datePublished":"2024-10-21T00:30:00+00:00","dateModified":"2024-10-21T00:32:15+00:00","author":{"@id":"https:\/\/debuggercafe.com\/#\/schema\/person\/27719b14d930bd4a88ade40d18b0a752"},"description":"Serving LLMs using LitServe, LitGPT, & LitAPI and exposing the API URL for remote executing of the Gemma2-2b instruction tuned LLM.","breadcrumb":{"@id":"https:\/\/debuggercafe.com\/serving-llms-using-litserve\/#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/debuggercafe.com\/serving-llms-using-litserve\/"]}]},{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/debuggercafe.com\/serving-llms-using-litserve\/#primaryimage","url":"https:\/\/debuggercafe.com\/wp-content\/uploads\/2024\/09\/Serving-LLMs-using-LitServe-e1725412076490.png","contentUrl":"https:\/\/debuggercafe.com\/wp-content\/uploads\/2024\/09\/Serving-LLMs-using-LitServe-e1725412076490.png","width":1000,"height":563,"caption":"Serving LLMs using LitServe"},{"@type":"BreadcrumbList","@id":"https:\/\/debuggercafe.com\/serving-llms-using-litserve\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/debuggercafe.com\/"},{"@type":"ListItem","position":2,"name":"Serving LLMs using LitServe"}]},{"@type":"WebSite","@id":"https:\/\/debuggercafe.com\/#website","url":"https:\/\/debuggercafe.com\/","name":"DebuggerCafe","description":"Machine Learning and Deep 
Learning","potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/debuggercafe.com\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"en-US"},{"@type":"Person","@id":"https:\/\/debuggercafe.com\/#\/schema\/person\/27719b14d930bd4a88ade40d18b0a752","name":"Sovit Ranjan Rath","image":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/debuggercafe.com\/#\/schema\/person\/image\/","url":"https:\/\/secure.gravatar.com\/avatar\/f71ca13ec56d630e7d8045e8b846396068791aa204936c3d74d721c6dd2b4d3c?s=96&d=mm&r=g","contentUrl":"https:\/\/secure.gravatar.com\/avatar\/f71ca13ec56d630e7d8045e8b846396068791aa204936c3d74d721c6dd2b4d3c?s=96&d=mm&r=g","caption":"Sovit Ranjan Rath"}}]}},"_links":{"self":[{"href":"https:\/\/debuggercafe.com\/wp-json\/wp\/v2\/posts\/37736","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/debuggercafe.com\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/debuggercafe.com\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/debuggercafe.com\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/debuggercafe.com\/wp-json\/wp\/v2\/comments?post=37736"}],"version-history":[{"count":145,"href":"https:\/\/debuggercafe.com\/wp-json\/wp\/v2\/posts\/37736\/revisions"}],"predecessor-version":[{"id":37890,"href":"https:\/\/debuggercafe.com\/wp-json\/wp\/v2\/posts\/37736\/revisions\/37890"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/debuggercafe.com\/wp-json\/wp\/v2\/media\/37869"}],"wp:attachment":[{"href":"https:\/\/debuggercafe.com\/wp-json\/wp\/v2\/media?parent=37736"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/debuggercafe.com\/wp-json\/wp\/v2\/categories?post=37736"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/debuggercafe.com\/wp-json\/wp\/v2\/tags?post=37736"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}