Million Miles Technologies

How to Build a Chat Interface using Gradio & Vultr Cloud GPU — SitePoint


This article was created in partnership with Vultr. Thank you for supporting the partners who make SitePoint possible.

Gradio is a Python library that simplifies the process of deploying and sharing machine learning models by providing a user-friendly interface that requires minimal code. You can use it to create customizable interfaces and share them conveniently using a public link for other users.

In this guide, you’ll be creating a web interface where you can interact with the Mistral 7B large language model through the input field and see model outputs displayed in real time on the interface.

Prerequisites

Before you begin:

Create a Gradio Chat Interface

On the deployed instance, you need to install some packages for creating a Gradio application. However, you don’t need to install packages like the NVIDIA CUDA Toolkit, cuDNN, and PyTorch, as they come pre-installed on the Vultr GPU Stack instances.

  1. Upgrade the Jinja package:
    $ pip install --upgrade jinja2
    
  2. Install the required dependencies:
    $ pip install transformers gradio
    
  3. Create a new file named chatbot.py using nano:
    $ sudo nano chatbot.py
    

    Follow the next steps for populating this file.

  4. Import the required modules:
    import gradio as gr
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer, StoppingCriteria, StoppingCriteriaList, TextIteratorStreamer
    from threading import Thread
    

    The above code snippet imports all the required modules in the namespace for inferring the Mistral 7B large language model and launching a Gradio chat interface.

  5. Initialize the model and tokenizer:
    
    
    model_repo = "mistralai/Mistral-7B-v0.1"
    
    model = AutoModelForCausalLM.from_pretrained(model_repo, torch_dtype=torch.float16)
    tokenizer = AutoTokenizer.from_pretrained(model_repo)
    
    model = model.to('cuda:0')
    

    The above code snippet initializes model, tokenizer and enable CUDA processing.

  6. Define the stopping criteria:
    class StopOnTokens(StoppingCriteria):
        def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor, **kwargs) -> bool:
            stop_ids = [29, 0]
            for stop_id in stop_ids:
                if input_ids[0][-1] == stop_id:
                    return True
            return False
    

    The above code snippets inherits a new class named StopOnTokens from the StoppingCriteria class.

  7. Define the predict() function:
    def predict(message, history):
        stop = StopOnTokens()
        history_transformer_format = history + [[message, ""]]
        messages = "".join(["".join(["\n:" + item[0], "\n:" + item[1]]) for item in history_transformer_format])
    

    The above code snippet defines variables for StopOnToken() object and storing the conversation history. It formats the history by pairing each of the message with its response and providing tags to determine whether it is from a human or a bot.

    The code snippet in the next step is to be pasted inside the predict() function as well.

  8. Initialize a text interator streamer:
    model_inputs = tokenizer([messages], return_tensors="pt").to("cuda")
        streamer = TextIteratorStreamer(tokenizer, timeout=10., skip_prompt=True, skip_special_tokens=True)
        generate_kwargs = dict(
            model_inputs,
            streamer=streamer,
            max_new_tokens=200,
            do_sample=True,
            top_p=0.95,
            top_k=1000,
            temperature=0.4,
            num_beams=1,
            stopping_criteria=StoppingCriteriaList([stop])
        )
        t = Thread(target=model.generate, kwargs=generate_kwargs)
        t.start()
        partial_message  = ""
        for new_token in streamer:
            if new_token != '<':
                partial_message += new_token
                yield partial_message
    

    The streamer requests for new tokens from the model and receives them one by one ensuring a continuous flow of text output.

    You can adjust the model parameters such as max_new_tokens, top_p, top_k, and temperature to manipulate the model response. To know more about these parameters you can refer to How to Use TII Falcon Large Language Model on Vultr Cloud GPU.

  9. Launch Gradio chat interface at the end of file:
    gr.ChatInterface(predict).launch(server_name='0.0.0.0')
    
  10. Exit the text editor using CTRL + X to save the file and hit Y to allow file overwrites.
  11. Allow incoming connections on port 7860:
    $ sudo ufw allow 7860
    

    Gradio uses the port 7860 by default.

  12. Reload the firewall:
    $ sudo ufw reload
    
  13. Execute the application:
    $ python3 chatbot.py
    

    Executing the application for the first time can take additional time for downloading the checkpoints for the Mistral 7B large language model and loading it on to the GPU. This procedure may take anywhere from 5 mins to 10 mins depending on your hardware, internet connectivity and so on.

    Once it executes, you can access the Gradio chat interface via your web browser by navigating to:

    http://SERVER_IP_ADRESS:7860/
    

    The expected output is shown below.

    Gradio chat

Do More With Gradio

Conclusion

In this guide, you used Gradio to build a chat interface and infer the Mistral 7B model by Mistral AI using Vultr GPU Stack.

This is a sponsored article by Vultr. Vultr is the world’s largest privately-held cloud computing platform. A favorite with developers, Vultr has served over 1.5 million customers across 185 countries with flexible, scalable, global Cloud Compute, Cloud GPU, Bare Metal, and Cloud Storage solutions. Learn more about Vultr.

Related blogs