GDCorner

#3 - Speak To Me! (TTS & STT) - No BS Intro To Developing with LLMs

2024-07-11T04:20:13+00:00

Welcome back to the No BS series on LLMs. Today we’re going to take a fun detour from LLMs and work with some audio processing. This is an important topic since so much of our language is communicated verbally. Being able to process recordings such as meetings and videos, as well as generate audible responses such as answers, summaries, or even your own personal morning news podcast, will be very useful.

This isn’t going to be a super deep dive into DSP and audio, but rather just introduce you to some of the tools and concepts. If you’d like a deep dive into signal analysis and processing, then check out The Scientist and Engineer’s Guide to Digital Signal Processing.

Intro To Working With Audio

Working with audio can seem daunting if you haven’t done it before, and it is naturally a very deep topic, but for our purposes it’s pretty quick to get up to working speed with it.

Image By Д.Ильин: vectorization - File:Signal Sampling.png by Email4mobile (talk), CC0, https://commons.wikimedia.org/w/index.php?curid=98587159

To represent a continuous signal like an audio waveform digitally we need to break it down into discrete (or individual) values. Audio is represented in digital systems as a series of intensity or strength values. An individual measured value is called a sample, or a frame in the case of the audio library we’ll be using. These samples are taken at regular intervals, and how often they are taken is usually expressed in hertz (abbreviated as Hz), which here means samples per second. This is called the sample rate or sampling rate. The resolution or size of each sample is called the bit depth; 8, 16 and 24 bits are common.
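To make these terms concrete, here is a small sketch using only the Python standard library. It samples one second of a 440 Hz tone (an arbitrary frequency chosen purely for illustration) at 16-bit depth, and shows how the sample rate and bit depth together determine how much data you end up with.

```python
import math
import struct

SAMPLE_RATE = 16000      # samples per second (Hz)
BIT_DEPTH = 16           # bits per sample
DURATION = 1.0           # seconds
TONE_HZ = 440.0          # illustrative tone frequency, not from the article

num_samples = int(SAMPLE_RATE * DURATION)
max_amplitude = 2 ** (BIT_DEPTH - 1) - 1   # largest positive signed 16-bit value

# Measure the waveform at regular intervals of 1 / SAMPLE_RATE seconds
samples = [
    int(max_amplitude * math.sin(2 * math.pi * TONE_HZ * n / SAMPLE_RATE))
    for n in range(num_samples)
]

# Pack as little-endian signed 16-bit integers (the native layout on most desktop machines)
raw_bytes = struct.pack(f"<{num_samples}h", *samples)

print("samples:", num_samples)
print("bytes:  ", len(raw_bytes))   # samples * (BIT_DEPTH / 8)
```

Doubling either the sample rate or the bit depth doubles the size of the buffer for the same duration.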

Mono audio is a single series of values representing 1 waveform. Imagine a single microphone, or a single speaker. Stereo is 2 of these series of values, one each for the left and right speakers. Having multiple microphones can have advantages, especially in preprocessing audio. For example, with some clever DSP techniques you can use multiple microphones to find the direction of sound, reduce or eliminate noise, and so on. All of those topics are complex and out of scope for this article though; see the book I recommended earlier for a deeper look into Digital Signal Processing.

Audio can be compressed and decompressed using a codec. You are likely familiar with popular ones like AAC and MP3.

For our purposes we’re going to stick with uncompressed mono audio, meaning there is only 1 channel or series of values, and we don’t need to deal with a codec library at the same time.
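If you do end up with stereo data, raw PCM samples are normally interleaved: left, right, left, right, and so on. As a rough sketch (the sample values below are made up, and averaging the channels is just one simple way to downmix), this is how you could split the channels and fold them down to mono with the standard library `array` module.

```python
from array import array

# Interleaved 16-bit stereo samples: L0, R0, L1, R1, ... (made-up values)
stereo = array("h", [100, 300, -200, -400, 1000, 3000])

left = stereo[0::2]    # every even index belongs to the left channel
right = stereo[1::2]   # every odd index belongs to the right channel

# Average each left/right pair to produce one mono sample
mono = array("h", ((l + r) // 2 for l, r in zip(left, right)))

print(list(mono))
```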

Recording Audio

To ensure our app works on most operating systems we’ll use PyAudio, which provides a nice simple interface to record and play audio with.

We won’t dive into device selection in this series. I’m going to assume you only have 1 input and 1 output audio device, or that you’ve appropriately set the default audio devices in your system settings.

Once again, on Ubuntu we need to install a little more. If you need a refresher on the setup so far, check out article 1. We need the development package for Python 3.11, as well as PortAudio, which is what PyAudio uses behind the scenes. PyAudio is what we’ll be using from Python.

# Ensure our virtual environment is active
source .venv/bin/activate

sudo apt-get install python3.11-dev portaudio19-dev
pip install pyaudio

Now to record audio in python we first create a PyAudio object which initialises it for us.

import pyaudio

# Instantiate PyAudio
pyaudio_instance = pyaudio.PyAudio()

Now we need to open a stream for recording from the microphone

stream = pyaudio_instance.open(format=SAMPLE_BIT_DEPTH,
                channels=NUM_CHANNELS,
                rate=SAMPLE_RATE,
                frames_per_buffer=CHUNK_SIZE,
                input=True)

Let’s walk through that quickly; we’ve covered most of these concepts in the intro to audio section. format allows us to specify the datatype of the samples, in this case how many bits per sample. channels is where we set how many waveforms we want, such as mono or stereo. rate is the sampling rate in Hz. With input we specify that this is an input, or recording, stream, meaning we want to receive audio from the microphone.

The final interesting value is frames_per_buffer, which is how many samples we want to receive at a time. We won’t be receiving individual samples, but rather chunks or segments of audio. Smaller chunks can reduce processing latency; however, if you choose a value that’s too small or don’t keep up with the throughput of the system, it can lead to skipped audio, or, if you are streaming straight back out, it may sound choppy or weird. Using a value that’s too high may lead to long latency before you can react to audio events, for example if you are trying to stream audio straight to another destination. You need to choose a balance for your particular application.
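To get a feel for the trade-off, it helps to work out how much audio one chunk actually holds. A quick sketch, assuming the 16 kHz sample rate used in this article (the helper name is just for illustration):

```python
SAMPLE_RATE = 16000  # Hz

def chunk_duration_ms(frames_per_buffer, sample_rate=SAMPLE_RATE):
    """Length of audio, in milliseconds, contained in one chunk."""
    return 1000.0 * frames_per_buffer / sample_rate

for frames_per_buffer in (256, 1024, 4096):
    print(frames_per_buffer, "frames =", chunk_duration_ms(frames_per_buffer), "ms")
```

Each call to read a chunk has to wait for that much audio to arrive, so this is roughly the minimum delay before your code sees any given sound.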

The values I use for this demo are below. I’ve specifically chosen these values for working with Whisper which will do the speech to text conversion for us, but we’ll cover that in a few sections time.

CHUNK_SIZE = 1024  # Record in chunks of 1024 samples
SAMPLE_BIT_DEPTH = pyaudio.paInt16  # 16 bits per sample
NUM_CHANNELS = 1
SAMPLE_RATE = 16000  # Record at 16khz which Whisper is designed for
                     # https://github.com/openai/whisper/discussions/870
RECORDING_LENGTH = 10

Now the stream is open we need to receive this data.

chunks = []  # Initialize list to store chunks of samples as we received them

TOTAL_SAMPLES = SAMPLE_RATE * RECORDING_LENGTH

for i in range(int(TOTAL_SAMPLES / CHUNK_SIZE) + 1):
    data = stream.read(CHUNK_SIZE)
    chunks.append(data)
    
# Join the chunks together to get the full recording in bytes
frames = b''.join(chunks)
    
# Trim the recorded frames to the desired recording length
# The frames were received as bytes, so we need to account for 2 bytes per sample
frames = frames[:TOTAL_SAMPLES * 2]

In this example we are recording for 10 seconds (RECORDING_LENGTH), which means the total number of samples is samples per second * number of seconds. We then divide that by our chunk size to get the range for the loop. Basically, we read enough chunks of audio to cover our desired recording length.

Because we record in chunks, we often won’t be able to get the exact number of samples desired, so we read slightly more samples than required and then trim the result down. Alternatively, you could read fewer and pad with zeroes. You can see that we store each chunk of audio as a new entry in chunks. That means we need to reassemble the full audio stream afterwards, which is where we create the frames variable and do the bytes join to combine those chunks into a new bytes object.
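Here is the same arithmetic worked through without a microphone. This sketch swaps the real stream for chunks of silence (an assumption made purely so it runs anywhere), but the chunk count and trimming logic mirror the recording loop.

```python
SAMPLE_RATE = 16000
RECORDING_LENGTH = 10
CHUNK_SIZE = 1024
BYTES_PER_SAMPLE = 2  # 16 bit samples

total_samples = SAMPLE_RATE * RECORDING_LENGTH
num_chunks = int(total_samples / CHUNK_SIZE) + 1   # same formula as the recording loop

# Stand-in for stream.read(): each chunk is CHUNK_SIZE samples of silence
fake_chunks = [bytes(CHUNK_SIZE * BYTES_PER_SAMPLE) for _ in range(num_chunks)]

frames = b"".join(fake_chunks)
recorded_samples = len(frames) // BYTES_PER_SAMPLE

# Trim the overshoot, remembering that each sample is 2 bytes
frames = frames[:total_samples * BYTES_PER_SAMPLE]

print("chunks read:     ", num_chunks)
print("samples recorded:", recorded_samples)
print("samples kept:    ", len(frames) // BYTES_PER_SAMPLE)
```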

Finally, we close up the stream and terminate our PyAudio instance.

# Stop and close the stream 
stream.stop_stream()
stream.close()
# Release PyAudio
pyaudio_instance.terminate()

Wave Files

Now we have the audio recorded to an in-memory buffer, what do we do with it? Well let’s start by writing it to file. Wave files are a common uncompressed file format. Python has an inbuilt library for working with them. This will be useful for testing so we don’t have to keep speaking and waiting for the recording time every time we want to test a new iteration of our program.

To write to a wave file we simply import wave, and then can write a new wave file with a few lines of code.

One thing to be careful of is that we’ve dealt with sample resolution/depth in terms of bits so far, but waves store this as a number of bytes, so we just need to be careful to appropriately convert between the two.

# Python inbuilt library for dealing with wave files
import wave
# Open our wave file as write binary
wf = wave.open(filename, 'wb')
# Set the number of channels to match our recording
wf.setnchannels(NUM_CHANNELS)
# Set the bit depth of our samples. Waves store the sample
# size as bytes, not bits, so we need to do a conversion
wf.setsampwidth(pyaudio_instance.get_sample_size(SAMPLE_BIT_DEPTH))
# Set the sampling rate
wf.setframerate(SAMPLE_RATE)
# Then write out the samples we recorded earlier
wf.writeframes(frames)
# Close the file
wf.close()
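If you want to convince yourself of the bits versus bytes relationship without touching the disk or the microphone, here is a small sketch that writes a wave file into an in-memory buffer and reads the header back. The one second of silence is a stand-in for real recorded frames.

```python
import io
import wave

NUM_CHANNELS = 1
SAMPLE_RATE = 16000
SAMPLE_WIDTH_BYTES = 2   # 16 bits / 8 bits per byte

frames = bytes(SAMPLE_RATE * SAMPLE_WIDTH_BYTES)  # one second of silence

# Write the wave data into memory instead of a file on disk
buffer = io.BytesIO()
with wave.open(buffer, "wb") as wf:
    wf.setnchannels(NUM_CHANNELS)
    wf.setsampwidth(SAMPLE_WIDTH_BYTES)
    wf.setframerate(SAMPLE_RATE)
    wf.writeframes(frames)

# Rewind and read the header back
buffer.seek(0)
with wave.open(buffer, "rb") as wf:
    bit_depth = wf.getsampwidth() * 8              # bytes back to bits
    duration = wf.getnframes() / wf.getframerate()  # seconds

print(bit_depth, "bits,", duration, "seconds")
```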

Reading a wave file is equally simple. Once again, being careful about bits vs bytes when dealing with sample width.

with wave.open(filename, 'rb') as wf:
    # Careful, waves store the bitdepth as bytes
    SAMPLE_BYTE_DEPTH = wf.getsampwidth()
    # Bit depth can be obtained by converting like this
    SAMPLE_BIT_DEPTH = pyaudio_instance.get_format_from_width(wf.getsampwidth())
    NUM_CHANNELS = wf.getnchannels()
    # note again here that the wave lib calls samples frames
    # so the sample rate is called the frame rate.
    SAMPLE_RATE = wf.getframerate()
    # Read samples
    chunks = wf.readframes(CHUNK_SIZE)
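Note that a single readframes call only returns one chunk. To pull in the whole file you keep reading until an empty bytes object comes back. A minimal sketch of that loop, using a hypothetical read_all_frames helper and a small in-memory wave file so it runs without any recording on disk:

```python
import io
import wave

CHUNK_SIZE = 1024

def read_all_frames(wave_reader, chunk_size=CHUNK_SIZE):
    """Read every frame from an open wave reader, one chunk at a time."""
    chunks = []
    while True:
        data = wave_reader.readframes(chunk_size)
        if not data:       # empty bytes means we reached the end of the file
            break
        chunks.append(data)
    return b"".join(chunks)

# Build a small in-memory wave file: 2500 frames of 16 bit mono silence
buffer = io.BytesIO()
with wave.open(buffer, "wb") as wf:
    wf.setnchannels(1)
    wf.setsampwidth(2)
    wf.setframerate(16000)
    wf.writeframes(bytes(2500 * 2))

buffer.seek(0)
with wave.open(buffer, "rb") as wf:
    frames = read_all_frames(wf)

print(len(frames), "bytes read")
```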

Playing Audio

Playing audio is pretty similar in reverse.

Once again we define a chunk size that we will use to send the audio in segments, as well as which file we’ll be reading

CHUNK_SIZE = 1024
filename = "recording.wav"

Make sure we have a PyAudio instance

# Instantiate PyAudio
pyaudio_instance = pyaudio.PyAudio()

Then we open the file for reading as binary.

with wave.open(filename, 'rb') as wf:

We initialize a stream with the settings from the wave file. The key things are the sample width, the number of channels, and the sample rate.

# Waves store bit depth as number of bytes, so we convert this to PyAudio format
bit_depth = pyaudio_instance.get_format_from_width(wf.getsampwidth())

# Open stream
stream = pyaudio_instance.open(format=bit_depth,
                                channels=wf.getnchannels(),
                                rate=wf.getframerate(),
                                output=True)

Now that the file is open, we can start reading chunks and sending them to the audio stream.

# Play samples from the wave file in the same chunksize
while len(data := wf.readframes(CHUNK_SIZE)):
    stream.write(data)

And finally, we close the audio stream and terminate the PyAudio instance.

# Close stream
stream.close()

# Release PyAudio system resources
pyaudio_instance.terminate()

You can see the full examples of recording and playing audio on GitHub.

Speech To Text with Whisper

For this component we’re going to use Whisper from OpenAI. This is a really great AI model that comes in a variety of sizes. The different sizes have different capabilities at various languages, and they also have different memory and processing requirements. There’s no right answer for which model you choose, it all depends on your needs.

There are a few improved versions of Whisper available, like faster-whisper and whisperx. However, we’ll be sticking with raw Whisper for this series since, once again, we want to see a few more of the details of the technology and get below some of the abstractions.

Let’s install whisper

# Ensure our virtual environment is active
source .venv/bin/activate

pip install openai-whisper

Recording the Audio for Whisper

Whisper has a few requirements. The first one is that audio sent to Whisper must have a sampling rate of 16,000hz. The settings for sampling frequency we used in the previous example on recording are fine for our requirements. Just double check you didn’t change any of those values. Run the previous example and record a question to a recording.wav file which is what we’ll test our next example on.

If you are transcribing an audio file or stream that doesn’t have a 16 kHz sampling rate, you need to resample the audio. There are a number of libraries and tools to do this, but it’s out of scope for this tutorial. Rest assured it’s possible, and ffmpeg or pytorch are good starting points.

Sampling Theorem and Frequencies

The sampling theorem basically states that to accurately represent a signal, you need to sample it at a rate of at least twice the maximum frequency you want to capture or reconstruct. In our case, Whisper wants a sampling rate of 16,000 Hz, which means it can represent frequencies up to 8 kHz. This is in the HD Voice class of telephony bands, and is perfectly acceptable quality for voice, which typically has its important frequencies in the 500 Hz to 8 kHz range.
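As a quick illustration of that limit, the sketch below computes the highest representable frequency for a sample rate, and where a pure tone above that limit would show up if it were sampled without any filtering (this is known as aliasing). Both function names are just for this example.

```python
def nyquist_frequency(sample_rate):
    """Highest frequency that a given sample rate can represent."""
    return sample_rate / 2

def aliased_frequency(tone_hz, sample_rate):
    """Frequency a pure tone appears at after sampling with no anti-aliasing filter."""
    return abs(tone_hz - sample_rate * round(tone_hz / sample_rate))

print(nyquist_frequency(16000))         # the limit at Whisper's sample rate
print(aliased_frequency(3000, 16000))   # below the limit: unchanged
print(aliased_frequency(10000, 16000))  # above the limit: folds back down
```

A 10 kHz tone sampled at 16 kHz is indistinguishable from a 6 kHz tone, which is why audio is normally low-pass filtered before it is sampled or resampled.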

Running Whisper on a Wave File

Whisper is pretty simple to use if just transcribing a file. In just a few lines of code we can get some text out of it.

import whisper

# Load the model
model = whisper.load_model("tiny")
# Transcribe the audio file
result = model.transcribe("recording.wav")

# Print the transcribed text
print(result["text"])

That was pretty simple! As you can see Whisper automatically does the sample normalization for us if reading from a wave file.

For our purposes, we only care about the “text” entry in the dict that is returned, but Whisper does some great stuff such as returning time stamps as well. I encourage you to explore the returned dict and think of all the excellent data annotation possibilities from such a feature.
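As a starting point for that exploration, here is a rough sketch that prints timestamped lines. It assumes the result contains a "segments" list whose entries carry "start", "end" and "text" keys. The sample_result dict below is hand-written so the example runs without loading a model; in practice you would pass in the dict returned by model.transcribe.

```python
def format_segments(result):
    """Turn a Whisper-style result dict into timestamped lines."""
    lines = []
    for segment in result.get("segments", []):
        start, end = segment["start"], segment["end"]
        lines.append(f"[{start:6.2f}s -> {end:6.2f}s] {segment['text'].strip()}")
    return lines

# Hand-written stand-in for the dict returned by model.transcribe()
sample_result = {
    "text": " Hello there. How are you?",
    "segments": [
        {"start": 0.0, "end": 1.2, "text": " Hello there."},
        {"start": 1.2, "end": 2.5, "text": " How are you?"},
    ],
}

lines = format_segments(sample_result)
for line in lines:
    print(line)
```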

You can see the full example here.

Running Whisper On An Audio Buffer

For a chat bot, we obviously don’t want to have to write everything to a wave file all the time. To solve this we’re going to work with an audio buffer in memory, but to do this we need to do some housekeeping ourselves on the audio stream.

So, using the same process to record audio before, we’ll create a simple record_voice function

def record_voice(record_time=10.0):
    print('Please Speak Now...')

    stream = pyaudio_instance.open(format=SAMPLE_BIT_DEPTH,
                    channels=NUM_CHANNELS,
                    rate=SAMPLE_RATE,
                    frames_per_buffer=CHUNK_SIZE,
                    input=True)

    chunks = []  # Initialize list to store chunks of samples as we received them

    TOTAL_SAMPLES = int(SAMPLE_RATE * record_time)

    for i in range(int(TOTAL_SAMPLES / CHUNK_SIZE) + 1):
        data = stream.read(CHUNK_SIZE)
        chunks.append(data)

    # Join the chunks together to get the full recording in bytes
    frames = b''.join(chunks)

    # Trim extra samples we didn't want to record due to chunk size
    frames = frames[:TOTAL_SAMPLES * 2]

    # Stop and close the stream 
    stream.stop_stream()
    stream.close()

    print("Recording complete")
    return frames

This function simply records audio as before and returns the full set of samples. Now we need to normalize these samples from int16s, which have a range of -32768 to +32767, and convert them to floats with a range of -1.0 to +1.0. This can be done quite easily with numpy by simply dividing by 32768.

import numpy as np

def convert_audio_for_whisper(samples):
    # samples is a bytes object of 16 bit integers
    # Whisper expects a float32 ndarray

    # Make a numpy array from the samples buffer
    new_samples_int = np.frombuffer(samples, dtype=np.int16)
    # Convert to float and normalize into the range of -1.0 to 1.0
    new_samples = new_samples_int.astype(np.float32) / 32768.0

    return new_samples
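To see what that division does at the edges of the range, here is the same arithmetic on four made-up samples. This sketch deliberately uses the standard library struct module instead of numpy, just to show the numbers involved.

```python
import struct

INT16_SCALE = 32768.0

# Four made-up samples: most negative, zero, half scale, most positive
raw = struct.pack("<4h", -32768, 0, 16384, 32767)

# Unpack the bytes back into integers (2 bytes per sample), then scale to floats
ints = struct.unpack(f"<{len(raw) // 2}h", raw)
floats = [value / INT16_SCALE for value in ints]

for i, f in zip(ints, floats):
    print(i, "->", f)
```

The most negative sample maps to exactly -1.0, while the most positive one lands just short of +1.0, because the int16 range is not symmetric.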

So now we have an array of samples as floats that are in the expected range of values. Now we can use whisper.transcribe just like before.

samples = record_voice(RECORDING_LENGTH)
whisper_samples = convert_audio_for_whisper(samples)

# Load the whisper model
model = whisper.load_model("tiny")

# Transcribe the audio samples
transcription_result = model.transcribe(whisper_samples)

# Print the transcription
print(transcription_result["text"])

You can see the full code of this example here.

That’s all we need from Whisper for a simple chat bot. Whisper is very powerful, and there are a number of models and model sizes that better match different use-cases.

If you want to look further into Speech-To-Text (STT), I recommend checking out these projects:

Faster-Whisper - A re-implementation of Whisper that’s faster and uses less memory
WhisperX - A Library built on top of Faster-Whisper that provides extra functionality like Voice Activity Detection and Speaker Diarization. These are useful for only recording when the user has a question, and for separating individual speakers.
Silero VAD - Another library for Voice Activity Detection
openWakeWord - A library for identifying activation or wake words to automatically begin recording for transcription

Text To Speech with Piper

As with all tasks for building a chat bot, the choices of Text-To-Speech (TTS) engines and models here are plentiful. We’re going to stick with a local model, and I’ve chosen Piper since it’s designed for running on everything right down to a Raspberry Pi. This means it’s lightweight, and runs basically anywhere. I found that the voices aren’t particularly expressive or emotive; however, the clarity of the voices is surprisingly great and natural given the runtime constraints.

Once again we need to make sure we are using Python 3.11 for now. Refer to Article 1 for setting up the Python environment if you need a refresher.

To install piper run the following:

# Ensure our virtual environment is active
source .venv/bin/activate

# Install piper-tts
pip install piper-tts

Download a voice

Piper has a large number of pre-trained voices available. You can browse through samples of the voices available here https://rhasspy.github.io/piper-samples/

Once you’ve chosen a voice you like, you can go to the voices page in the repository and find the download links. You need to ensure you download both the .onnx file and the .onnx.json files.

https://github.com/rhasspy/piper/blob/master/VOICES.md

I’ll be using en_US-hfc_female-medium so you need to download both the model and the config json files for this voice, or adjust the model string in the examples.

Speak To Me

First we load the model by importing PiperVoice and then calling .load with our model name.

from piper import PiperVoice

#Make sure the json file is next to this model
piper_model = "en_US-hfc_female-medium.onnx"
# Load the voice model
voice = PiperVoice.load(piper_model, config_path=f"{piper_model}.json")

Our sample text to generate will be an extract of “A Tale Of Two Cities” by Charles Dickens. This is a reasonably long sample so we can see how quickly this generates.

text = "It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, it was the epoch of belief, it was the epoch of incredulity, it was the season of Light, it was the season of Darkness, it was the spring of hope, it was the winter of despair, we had everything before us, we had nothing before us, we were all going direct to Heaven, we were all going direct the other way—in short, the period was so far like the present period, that some of its noisiest authorities insisted on its being received, for good or for evil, in the superlative degree of comparison only."

Generating audio is quite straightforward. We provide some parameters for the voice synthesizer and then simply pass in the text we wish to generate.

synthesize_args = {
        "sentence_silence": 0.0,
    }

print("Generating Message")
# Synthesize the text to a raw stream of audio bytes
message_audio_stream = voice.synthesize_stream_raw(text, **synthesize_args)

# message audio is bytes
message_audio = bytearray()

# Grab every byte of audio generated and add it to our message_audio array
for audiobytes in message_audio_stream:
    message_audio += audiobytes

# Finally convert to a bytes object
message_bytes = bytes(message_audio)

Now we have a bytes object with all of our audio, it’s just like playing a wave file. The model we downloaded outputs at a predetermined sample rate, based on the training of the model. So we need to ensure we read this from the settings of the model.

SAMPLE_BIT_DEPTH = pyaudio.paInt16  # 16 bits per sample
NUM_CHANNELS = 1 # mono
pyaudio_instance = pyaudio.PyAudio()

# Open a stream to output the audio. Notice we get the sample rate from the settings of the loaded model.
output_stream = pyaudio_instance.open(format=SAMPLE_BIT_DEPTH,
                                            channels=NUM_CHANNELS,
                                            rate=voice.config.sample_rate,
                                            output=True)
print("Playing Message")
output_stream.write(message_bytes)

So, it’s quite simple to get started generating and playing audio from text. You can see the example for this here.
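On the point about reading the sample rate from the model: the raw bytes carry no header, so the playback rate is the only thing that tells the audio device how fast to play them. A small sketch of what goes wrong if the rates don't match, assuming a voice that outputs at 22,050 Hz (check voice.config.sample_rate for your own model) and using silence as a stand-in for the generated audio:

```python
def pcm_duration_seconds(audio_bytes, sample_rate, bytes_per_sample=2, channels=1):
    """Length of raw PCM audio in seconds when played at sample_rate."""
    num_samples = len(audio_bytes) // (bytes_per_sample * channels)
    return num_samples / sample_rate

# Stand-in for message_bytes: two seconds of 16 bit mono audio at an assumed 22050 Hz
assumed_model_rate = 22050
fake_message_bytes = bytes(2 * assumed_model_rate * 2)

correct = pcm_duration_seconds(fake_message_bytes, assumed_model_rate)
wrong = pcm_duration_seconds(fake_message_bytes, 16000)

print("played at the model's rate:", correct, "seconds")
print("played at 16000 Hz:        ", wrong, "seconds")
```

Played at the wrong rate, the same bytes take longer and the voice sounds slow and low pitched.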

Now this is great and all, but this still requires that we generate all the audio at once before playing it. On a slower system, with a long piece of text, or with a more complex TTS engine, this could have a very significant impact on the responsiveness of our chat bot. It’d be better if we streamed the playback.

Let’s refactor this a little, and make it play the audio as it generates.

We’re going to change the for loop where it retrieves the audio from the TTS engine and immediately pump it to the output stream.

# Synthesize the audio to a raw stream
message_audio_stream = voice.synthesize_stream_raw(text, **synthesize_args)

# Larger chunk sizes tend to help stuttering here
CHUNK_SIZE = 4096

# message audio is in bytes
message_audio = bytearray()

for audiobytes in message_audio_stream:
    # We keep accruing audio until we have enough for a chunk
    message_audio += audiobytes
    while len(message_audio) > CHUNK_SIZE:
        # Once we have enough for a chunk (potentially multiple chunks)
        # we extract it from the buffer
        latest_chunk = bytes(message_audio[:CHUNK_SIZE])
        message_audio = message_audio[CHUNK_SIZE:]
        # Output the latest chunk to the audio stream
        output_stream.write(latest_chunk)

# Write whatever is left in message audio that wasn't large enough for a final chunk
output_stream.write(bytes(message_audio))

This now starts outputting audio as soon as we have enough data from the TTS engine to output a large enough chunk. You can see the full example of this here on GitHub.
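The buffering logic can also be pulled out into a small generator, which makes it easy to test without a TTS engine or a sound card. This is only a sketch: the rechunk name is hypothetical, the fake stream stands in for the output of voice.synthesize_stream_raw, and it emits a chunk as soon as the buffer is at least one chunk long.

```python
def rechunk(byte_stream, chunk_size):
    """Regroup an iterable of arbitrary-sized byte pieces into fixed-size chunks."""
    buffer = bytearray()
    for piece in byte_stream:
        buffer += piece
        while len(buffer) >= chunk_size:
            yield bytes(buffer[:chunk_size])
            del buffer[:chunk_size]   # drop the bytes we just emitted
    if buffer:
        yield bytes(buffer)           # final, possibly shorter, chunk

# Stand-in for the TTS stream: three pieces of 3000 bytes each
fake_tts_stream = [bytes(3000), bytes(3000), bytes(3000)]
chunks = list(rechunk(fake_tts_stream, 4096))

print([len(chunk) for chunk in chunks])
```

With something like this in place, the playback loop reduces to writing each chunk from the generator to the output stream.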

Putting it all together

Now we’ve learned how to work with audio, and in doing so we’ve built up a few different pieces of the puzzle of a voice chat bot. So far we’ve built these pieces:

The text chat bot using the OpenAI API from the previous article
A way to record audio from the microphone
A way to transcribe audio using Whisper
A way to generate speech from text

Let’s take the chat bot example from the end of the previous article, and add in the new pieces we built up today.

First, we need to import the new libraries.

import numpy as np
import whisper
import pyaudio
from piper import PiperVoice

Then at the beginning of the program we need to load the whisper model, and load the piper voice model, and initialize the PyAudio instance.

# Open the whisper tiny model
model = whisper.load_model("tiny")

#Load the piper model
piper_model = "en_US-hfc_female-medium.onnx"
# We can't use CUDA on RPi, so we leave it off here. Turn it on if you'd like
voice = PiperVoice.load(piper_model, config_path=f"{piper_model}.json")

# Open a PyAudio instance
pyaudio_instance = pyaudio.PyAudio()

We’ll add in the functions we made for Recording Voice, Converting Audio For Whisper, as well as our default settings for audio

## Voice Record Settings
CHUNK_SIZE = 1024  # Record in chunks of 1024 samples
SAMPLE_BIT_DEPTH = pyaudio.paInt16  # 16 bits per sample
NUM_CHANNELS = 1
SAMPLE_RATE = 16000  # Record at 16khz which Whisper is designed for
RECORDING_LENGTH = 10.0 # Recording time in seconds

def record_voice(record_time=10.0):
    print('Please Speak Now...')

    stream = pyaudio_instance.open(format=SAMPLE_BIT_DEPTH,
                    channels=NUM_CHANNELS,
                    rate=SAMPLE_RATE,
                    frames_per_buffer=CHUNK_SIZE,
                    input=True)

    chunks = []  # Initialize list to store chunks of samples as we received them

    TOTAL_SAMPLES = int(SAMPLE_RATE * record_time)

    for i in range(int(TOTAL_SAMPLES / CHUNK_SIZE) + 1):
        data = stream.read(CHUNK_SIZE)
        chunks.append(data)

    # Join the chunks together to get the full recording in bytes
    frames = b''.join(chunks)

    # Trim extra samples we didn't want to record due to chunk size
    frames = frames[:TOTAL_SAMPLES * 2]

    # Stop and close the stream 
    stream.stop_stream()
    stream.close()

    print("Recording complete")
    return frames

def convert_audio_for_whisper(samples):
    # samples is a bytes object of 16 bit integers
    # Whisper expects a float32 ndarray

    # Make a numpy array from the samples buffer
    new_samples_int = np.frombuffer(samples, dtype=np.int16)
    # Convert to float and normalize into the range of -1.0 to 1.0
    new_samples = new_samples_int.astype(np.float32) / 32768.0

    return new_samples

We’ll make a nice wrapper function for these to make it simple to get a new question from the user. This will help keep our chat loop nice and clear.

def get_user_message_from_voice():
    samples = record_voice(RECORDING_LENGTH)
    whisper_samples = convert_audio_for_whisper(samples)
    transcription = model.transcribe(whisper_samples)
    return transcription['text']

We also need to add in our Speak Answer function using the streaming version of the TTS generation

def speak_answer(text):
    print("Generating Message Audio")

    # Open stream
    output_stream = pyaudio_instance.open(format=SAMPLE_BIT_DEPTH,
                                            channels=NUM_CHANNELS,
                                            rate=voice.config.sample_rate,
                                            output=True)

    synthesize_args = {
            "sentence_silence": 0.0,
        }

    # Synthesize the audio to a raw stream
    message_audio_stream = voice.synthesize_stream_raw(text, **synthesize_args)

    # Larger chunk sizes tend to help stuttering here
    CHUNK_SIZE = 4096

    # message audio is bytes
    message_audio = bytearray()
    message_chunks = []

    for audiobytes in message_audio_stream:
        message_audio += audiobytes
        while len(message_audio) > CHUNK_SIZE:
            latest_chunk = bytes(message_audio[:CHUNK_SIZE])
            message_chunks.append(latest_chunk)
            output_stream.write(latest_chunk)
            message_audio = message_audio[CHUNK_SIZE:]

    # Write whatever is left in message audio that wasn't large enough for a final chunk
    output_stream.write(bytes(message_audio))

    # Close stream
    output_stream.close()

And now we add those into our chat loop. We will no longer accept text answers, but we’ll use the same input function to make it so the user needs to press enter to begin recording their next question.

def main():
    # Chat loop
    while True:
        # Get a new user input
        text_input = input("Press enter to ask a question, or type 'exit' to quit: ")

        # Check if the user has tried to quit
        if text_input.lower().strip() == "exit":
            break
        
        user_message = get_user_message_from_voice()
        print("The user asked: ", user_message)
        
        # Add the user message to the message history
        add_user_message(user_message)

        # Generate a response from the history of the conversation
        llm_output = client.chat.completions.create(
            model="Meta-Llama-3-8B-Instruct-q5_k_m",
            messages=messages_history,
            stream=True
        )

        full_response = ""
        for chunk in llm_output:
            latest_chunk_str = chunk.choices[0].delta.content
            if latest_chunk_str is None:
                continue
            full_response += latest_chunk_str
            print(latest_chunk_str, end='', flush=True)

        # Force a new line to be printed after the response is completed
        print()
        speak_answer(full_response)

        add_assistant_message(full_response)

        # Print the full conversation
        print_message_history()

    # Release PyAudio
    pyaudio_instance.terminate()

    print("Chatbot finished - Goodbye!")

That’s it! We can now talk to our AI chat bot and have it speak back to us! You can find this full example on GitHub here.

Going Further

As I stated in the first article, I don’t believe chat bots are the killer app (or at the very least the sole application) for LLMs but they’ve provided a great basis to explore a lot of the tech and surrounding technologies like audio processing. I think with what we’ve explored in this series there’s some great foundational knowledge to build on and start integrating other technologies and workflows. I’m leaving some ideas and projects for you to keep exploring below.

If you’d like to go further down the audio processing and DSP route, I highly recommend the book linked in the beginning sections, The Scientist and Engineer’s Guide to Digital Signal Processing as a starting point which you can then use to explore all kinds of crazy expansions on audio processing from noise cancellation to speaker direction tracking.

Project Ideas

Expand this chat bot with voice activity detection and wake words.
Improve the responsiveness of this chat bot by trying to split the streaming LLM generation at sentences and run the TTS asynchronously to avoid the delay of waiting for the entire LLM output.
Make an automatic news reader that gathers the latest news, summarizes and produces a morning podcast for yourself.
Make a tool that summarizes YouTube videos.
Make a tool to generate after meeting email reports with action points and tasks for people based on the transcribed audio.
Make a moderation bot to identify toxic, annoying or unwanted behavior in your chats and communities
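For the responsiveness idea above, the first building block is spotting complete sentences while the LLM is still streaming. Here is a minimal sketch of one way to do that. The sentences_from_stream helper and the fake stream are made up for illustration, and the regular expression is deliberately naive (it will split on abbreviations like "e.g. "), so treat it as a starting point only.

```python
import re

# Split after ., ! or ? when followed by whitespace
SENTENCE_END = re.compile(r"(?<=[.!?])\s+")

def sentences_from_stream(text_pieces):
    """Yield complete sentences as soon as they appear in a stream of text pieces."""
    pending = ""
    for piece in text_pieces:
        pending += piece
        parts = SENTENCE_END.split(pending)
        # Everything except the last part is a finished sentence
        for sentence in parts[:-1]:
            if sentence:
                yield sentence
        pending = parts[-1]
    # Whatever is left when the stream ends is the final sentence
    if pending.strip():
        yield pending.strip()

# Stand-in for the text deltas coming back from the streaming LLM response
fake_llm_stream = ["Hel", "lo there. ", "How are", " you? I am", " fine."]
sentences = list(sentences_from_stream(fake_llm_stream))

print(sentences)
```

Each sentence could then be handed to the TTS engine while the LLM carries on generating the rest of the answer.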

#2 - Diving Deeper! - No BS Intro To Developing with LLMs

2024-06-12T04:20:13+00:00

Welcome back to the No BS exploration of LLMs! I’m excited to dive in once again. This article is a bit shorter as we have got over the initial learning hump of the models, runtimes and lots of the jargon involved in getting up to speed.

In this article we’ll explore connecting to an LLM host via a network API, then we will begin using streaming for our responses so we can start showing the user information sooner. After that we’ll explore usage statistics and generation settings so we can tune the LLM responses a little bit. We’ll also look at Chat Templates and how chat histories are formatted for the LLM to understand them better. Finally we’ll cover system prompts and how we can get the LLM to do the tasks we want it to.

The first thing we’ll look at today is moving to separate out our LLM instance from our app. The advantage of this will be four-fold:

The server can serve more than just one app, and many users, leading to improved hardware utilization if we have more than one consumer of LLMs.
We can host the LLM on a much more powerful system than where we are running the app. For instance hosting the LLM on a server, while our app runs on a phone, Raspberry Pi, or laptop.
Faster iteration time of our application since we are not constantly loading and initializing the LLM every time we restart our chatbot.
When we want to swap out model, server app, or even scale up to using bigger models like GPT4, it’s a much smaller change to our application.

Picking up where we left off

Just as a refresher, let’s get back to our project directory and activate the virtual environment

# Go back to our chatbot directory where we left off in Article 1
cd chatbot

# Make sure we are in our virtual environment
source .venv/bin/activate

OpenAI API

OpenAI made their API specification open source, and as a result their API has fairly widespread support, which makes it easy to switch hosts or providers. There are other APIs available and they vary by provider and runtime engine, but for our purposes OpenAI is a great starting point.

Starting Llama.cpp in Server mode

To run llama.cpp in Server mode, it’s as simple as the following command. Make sure to run this in a new terminal so we can still continue on with the rest of the examples.

# Start the server with our model
./llama.cpp/llama-server -c 0 -m Meta-Llama-3-8B-Instruct-q5_k_m.gguf

I specify -c 0 here for an important reason. Llama.cpp defaults to using a 512 token context length. Specifying 0 tells llama.cpp to use the context length specified in the model. If you go back to the source HuggingFace model, you can open config.json and look for the max_position_embeddings field. This value was encoded into the GGUF when it was converted. We’ll look at context size more when we get to the settings in a few sections’ time, but for now just know that it’s basically the length of the chat history we can pass into the LLM to be considered when generating an answer.
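If you have the source model’s config.json handy, a few lines of Python will pull that value out. This is only a sketch: the helper name is made up for this example, and the sample JSON is a trimmed-down stand-in with an illustrative value, so check your own model’s file for the real one.

```python
import json


def context_length_from_config(config_text):
    """Return max_position_embeddings from the text of a config.json file."""
    config = json.loads(config_text)
    return config["max_position_embeddings"]


# A trimmed-down stand-in for a real config.json (illustrative value)
sample_config = '{"max_position_embeddings": 8192}'
print(context_length_from_config(sample_config))
```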

Using the OpenAI API

Now let’s modify our example chat loop to use the OpenAI API. Fire up a new terminal and activate the virtual environment.

cd chatbot

# Make sure we are in our virtual environment
source .venv/bin/activate

Install the OpenAI package with pip.

pip install openai

We need to make a few changes to our existing chat bot example. Let’s copy the existing example into a new file where we can make our changes.

cp chatbot1.py chatbot-openai.py

First we need to import some new classes from the openai package.

# Remove the old imports
# from llama_cpp import Llama

# Import the OpenAI client and the response type we'll use for type hints
from openai import OpenAI
from openai.types.chat import ChatCompletion

Next, rather than loading the model, we need to create a client object through which we can make requests. Here we pass in the URL of where we are hosting the model, which for us is just localhost. This is also where you’d pass in an API key for a secured service. We don’t need a valid key for Llama.cpp, but we do need to specify one to keep the library happy.

# Replace the old code
# Load the model
# llm = Llama(
#    model_path="Meta-Llama-3-8B-Instruct-q5_k_m.gguf",
#    chat_format="llama-3",
# )

# Connect to the LLM server
client = OpenAI(base_url="http://localhost:8080/",
                api_key="local-no-key-required")

Now, we change how we request a completion. Here we specify the model we’d like to use, in this case the one we loaded with the server. This is very useful for mixing models when you are using a service that supports multiple models.

# Remove the old completion request
# llm_output = llm.create_chat_completion(
#     messages=messages_history,
#     max_tokens=128,  # Generate up to 128 tokens
# )

# Ask the LLM to generate a response from the history of the conversation
llm_output = client.chat.completions.create(
    model="Meta-Llama-3-8B-Instruct-q5_k_m",
    messages=messages_history
)

Now that the example is fetching responses via the OpenAI API, we need to make a few changes to how messages are handled, as they now come as a new object type.

The first change is to adjust how we get the message from the response. Remove the get_message_from_response function and create a new one which gets the information we need from the new ChatCompletion object. More details on what’s in the ChatCompletion object are available on the OpenAI documentation page.

def get_message_from_openai_response(response: ChatCompletion):
    """Convert the first choice of a ChatCompletion into a chat history entry."""
    chat_message = response.choices[0].message
    return {
        "role": chat_message.role,
        "content": chat_message.content
    }

And finally change how we append the response to the chat history using this new function.

# Add the message to our message history
messages_history.append(get_message_from_openai_response(llm_output))

And that’s it. Our chatbot now uses the OpenAI API. By changing the client object URL we can connect to a variety of other runtimes or services, and change models with relative ease. This is also great for us to iterate with, as it means we won’t be loading the model constantly.
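As a minimal sketch of how small that change can be, the connection settings could be read from environment variables so the same script can point at a different host without code edits. The variable names LLM_BASE_URL and LLM_API_KEY and the helper name are made up for this example; the defaults are the values used above.

```python
import os


def client_settings(env=None):
    """Return keyword arguments for OpenAI(...), read from the environment.

    LLM_BASE_URL and LLM_API_KEY are hypothetical variable names chosen for
    this sketch; the defaults match the local llama.cpp server used above.
    """
    env = os.environ if env is None else env
    return {
        "base_url": env.get("LLM_BASE_URL", "http://localhost:8080/"),
        "api_key": env.get("LLM_API_KEY", "local-no-key-required"),
    }


# Example: point at a server on another machine (made-up address)
settings = client_settings({"LLM_BASE_URL": "http://192.168.1.50:8080/"})
print(settings["base_url"])
```

In the chatbot this would be used as client = OpenAI(**client_settings()).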

You can see the full code for this example now here.

If you’d like to see more details on the Chat Completion API, OpenAI have really good documentation on it here.

Streaming Generation

So far our application has been generating the full response before displaying it to the user. This leads to a long wait before the user sees anything, which is not the greatest experience. Let’s change this now to use streaming generation. This will generate the response in fragments, or chunks, and allow us to present part of the response to the user much more quickly.

Once again, let’s copy our current chatbot to a new file where we can explore some more.

cp chatbot-openai.py chatbot-streaming.py

First, let’s tell the API to generate streaming chunks. Change the chat completions request to specify streaming.

# Generate a response from the history of the conversation
llm_output = client.chat.completions.create(
    model="Meta-Llama-3-8B-Instruct-q5_k_m",
    messages=messages_history,
    stream=True
)

Now, how we handle this message is going to change quite a bit. We will no longer get a single response object with our usage data, and so on. Instead we will get a series of chunks via an Event Stream. Once again, you can see more details about what’s in this response object on the OpenAI documentation page for the Chat Completion Chunk Object.

for chunk in llm_output:
    print(chunk.choices[0].delta.content)

Running this now with python chatbot-streaming.py, we can see that we appear to be getting a single token at a time, each printed on its own line. This isn’t great, but since we’re going to turn this into a voice chatbot in the next article, we won’t spend a bunch of time doing this neatly; quick and dirty will do. We also aren’t adding this message to the history of our chat, so let’s fix that now.

We’ll define a new function to add an assistant message to the history. We could refactor add_user_message, but for now I’d like to keep them separate to ensure we don’t make typos with participant names and hardcoded strings elsewhere in the program, so we’ll just make a new function.

def add_assistant_message(message):
    """Add an assistant message to the message history."""
    messages_history.append(
        {
            "role": "assistant",
            "content": message
        }
    )

Now, when we are processing the streamed message chunks we’ll add them to a full_response string, and after there are no more chunks to receive, we’ll add this full_response to the message history. This also changes how we print the latest chunk. We don’t want a line feed at the end, and we don’t want the output to buffer until we get a new line feed character, so we’ll specify the end string as a blank string and flush the stream on print.

full_response = ""
for chunk in llm_output:
    latest_chunk_str = chunk.choices[0].delta.content
    if latest_chunk_str is None:
        continue
    full_response += latest_chunk_str
    print(latest_chunk_str, end='', flush=True)

# Force a new line to be printed after the response is completed
print()

add_assistant_message(full_response)

Running this example we can see we get our first response chunk back within seconds, rather than waiting for the whole response to be generated. This means we can start presenting this to the user straight away, making the experience feel much more responsive.
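If you want to reuse that loop, or try it out without a running server, it could be wrapped in a small helper along these lines. This is a sketch rather than part of the example project: collect_stream and fake_chunk are hypothetical names, and the fake chunks only imitate the choices[0].delta.content shape that the loop above relies on.

```python
from types import SimpleNamespace


def collect_stream(chunks, write=print):
    """Print streamed text as it arrives and return the full response."""
    parts = []
    for chunk in chunks:
        if not chunk.choices:
            # Some chunks may carry no choices at all, so skip them
            continue
        text = chunk.choices[0].delta.content
        if text is None:
            continue
        parts.append(text)
        write(text, end="", flush=True)
    write()  # finish with a new line
    return "".join(parts)


def fake_chunk(text):
    """Build a stand-in object shaped like a streamed chunk (testing only)."""
    delta = SimpleNamespace(content=text)
    return SimpleNamespace(choices=[SimpleNamespace(delta=delta)])


demo = collect_stream([fake_chunk("Hello"), fake_chunk(None), fake_chunk(", Mars!")])
```

With a real request, the call would be add_assistant_message(collect_stream(llm_output)).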

Once again, you can see the full example after all these changes on Github here.

Going Further With Streaming

OpenAI have written a fair bit about how to use their API and streaming, and it’s worth reading their API docs, which have some other examples.

Usage Statistics

Monitoring usage stats is important for keeping track of your costs, the most intensive parts of your system, and your context window usage.

Monitoring costs may include tracking your highest-usage customers, or spotting a particular workflow that’s running too often or generating unexpectedly high load.

Monitoring context usage is going to help you stay aware of one of the ways you can outgrow your chosen model. When your system begins to approach or exceed the context window you may end up chopping off parts of the context or losing reliability in your answers. We’ll cover the context window a bit more in a few moments.

We can get these stats out of the OpenAI API pretty easily. First we update the completion API call with a new parameter, stream_options, where we specify that usage should be included.

llm_output = client.chat.completions.create(
    model="Meta-Llama-3-8B-Instruct-q5_k_m",
    messages=messages_history,
    stream=True,
    stream_options={"include_usage": True}
)

Now that we have the server returning usage stats, let’s make sure we handle them. The way Llama.cpp returns them is a bit different to the OpenAI documentation. The documentation says the usage will be returned in a separate ChatCompletionChunk after the end of the response, whereas the Llama.cpp server returns it as part of the stop reason message. No matter, we’ll make our code compatible with both.

As part of this we’ll also capture the reason the response was finished. This could be that the LLM signaled the answer was complete with a stop token, or it could be that the token generation limit was reached.

full_response = ""
request_usage = None
finish_reason = None
for chunk in llm_output:
    if chunk.usage:
        # If there is a usage object, lets store it
        request_usage = chunk.usage
    if len(chunk.choices) < 1:
        # The API docs say that choices could be empty,
        # so skip the rest of this loop iteration
        continue
    if chunk.choices[0].finish_reason:
        # If there is a finish_reason, let's store it too.
        finish_reason = chunk.choices[0].finish_reason
    latest_chunk_str = chunk.choices[0].delta.content
    if latest_chunk_str is None:
        continue
    full_response += latest_chunk_str
    print(latest_chunk_str, end='', flush=True)

print()

print("Stop reason: ", finish_reason)
print("Usage: ", request_usage)

Great, let’s run it and ask a short question.

(Type exit to quit) User: What is the temperature on mars? Short answer only
The average temperature on Mars is around -67°C (-89°F).
Stop reason:  stop
Usage:  CompletionUsage(completion_tokens=15, prompt_tokens=43, total_tokens=58)

===============================
system : You are a helpful teacher who is teaching students about astronomy and the Solar System.
user : What is the temperature on mars? Short answer only
assistant : The average temperature on Mars is around -67°C (-89°F).
(Type exit to quit) User: exit
Chatbot finished - Goodbye!

Excellent, we can see here that the response stopped because the language model emitted a stop token, which wrapped up the response. We can also see that we caught the CompletionUsage object. This gives us 3 values:

completion_tokens, or the number of tokens in the response.
prompt_tokens, the number of tokens in the prompt submitted to the LLM, so the chat history and so on.
total_tokens, the sum of both of the previous values.

So these are all very helpful. For small hobby tests and projects you can probably ignore them, but as you move towards production I’d look to capture these in your analytics system in some way so you can start getting visibility on the system usage.
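As a rough sketch of what capturing these could look like, the snippet below keeps running totals across requests and reports how much of the context window the latest request used. The class and function names are made up for this example, the usage object is a stand-in carrying the numbers from the run above, and the 8192 context length is an assumption you should replace with your own model’s value.

```python
from dataclasses import dataclass
from types import SimpleNamespace


@dataclass
class UsageTotals:
    """Running totals of token usage across requests."""
    prompt_tokens: int = 0
    completion_tokens: int = 0
    requests: int = 0

    def add(self, usage):
        """Add one usage object (anything with the two token attributes)."""
        self.prompt_tokens += usage.prompt_tokens
        self.completion_tokens += usage.completion_tokens
        self.requests += 1

    @property
    def total_tokens(self):
        return self.prompt_tokens + self.completion_tokens


def context_used(usage, context_length):
    """Fraction of the context window consumed by the latest request."""
    return usage.total_tokens / context_length


# Stand-in for the CompletionUsage object printed in the example run
usage = SimpleNamespace(completion_tokens=15, prompt_tokens=43, total_tokens=58)

totals = UsageTotals()
totals.add(usage)
print("Total tokens so far:", totals.total_tokens)
# 8192 is an assumed context length for this sketch
print(f"Context used: {context_used(usage, 8192):.1%}")
```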

You can see the full example of usage statistics here.

Generation settings

Now that we’re generating tokens in a more responsive fashion and can iterate more quickly, let’s play with some of the generation settings.

Context Window Length

Context window length is the amount of information, measured in tokens, that we can feed into the LLM and generate at any one time. This means the query, any additional information, and the generated results all need to fit within the context length.

A larger context window will require more memory, but will also allow extra information to be fed into the LLM which is especially useful if you will be feeding in documents or enriching queries with RAG (Retrieval-Augmented Generation).

A model is trained with a maximum context length. Setting the context window higher than that will likely lead to poor results.

On the other hand, if your prompt is too long or the response generates too many tokens, you’ll either get an error, or your prompt will be sliced to include only what will fit into the context window. The telltale sign of this is that the LLM appears to forget earlier responses or lose track of what was being discussed.
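One way to stay in control of what gets dropped is to trim the history yourself before sending it. The sketch below is one possible approach, not something the server does for you: it keeps the system prompt, drops the oldest messages first, and uses a very rough guess of about 4 characters per token. That estimate is an assumption, so swap in a real token count if you have one.

```python
def estimate_tokens(text):
    """Very rough token estimate: about 4 characters per token (an assumption)."""
    return max(1, len(text) // 4)


def trim_history(messages, max_tokens, count_tokens=estimate_tokens):
    """Drop the oldest non-system messages until the history fits the budget."""
    # System messages are kept and moved to the front of the history
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]

    def total(msgs):
        return sum(count_tokens(m["content"]) for m in msgs)

    # Always keep the most recent message, even if it alone is over budget
    while len(rest) > 1 and total(system + rest) > max_tokens:
        rest.pop(0)
    return system + rest


history = [
    {"role": "system", "content": "You are a helpful teacher."},
    {"role": "user", "content": "a" * 400},
    {"role": "assistant", "content": "b" * 400},
    {"role": "user", "content": "What about Mars?"},
]
trimmed = trim_history(history, max_tokens=120)
print([m["role"] for m in trimmed])
```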

Flash Attention is a setting related to Context Window Length. It’s supported by llama.cpp on GPU backends, but it is currently slower on CPU. It is a faster and more memory efficient attention mechanism within the LLMs themselves. The practical impact of this is that it can be a bit of a speed boost, but more importantly, a dramatic reduction in memory for large context windows. It’s worth reading up on this and seeing if it’s worthwhile enabling for your purposes.

Temperature

Temperature affects how the next predicted token is chosen from the predicted probability list. A value of 0.0 will always choose the most likely token, while a value closer to 2.0 will choose a more random token resulting in more creative outputs. Be warned though that setting the Temperature too high can produce inconsistent results and encourage hallucinations.

You would adjust temperature higher when you want your responses to be a bit more varied or creative. You would adjust it lower when you want the results to be more consistent and predictable.
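To get a feel for what temperature does to the probability list, here is a small sketch of the usual softmax-with-temperature calculation. The scores are made up, and real inference engines typically combine this with other sampling filters, so treat it as an illustration of the idea rather than exactly what llama.cpp does.

```python
import math


def apply_temperature(logits, temperature):
    """Turn raw scores into probabilities, scaled by temperature."""
    if temperature <= 0:
        # Treat 0 as greedy: all probability on the highest-scoring token
        best = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.1]  # made-up scores for three candidate tokens
for t in (0.0, 0.5, 1.0, 2.0):
    probs = apply_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

As the temperature rises the probabilities flatten out, so less likely tokens get picked more often.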

Seed

This is the seed for the random number generator used when choosing a token with some random sampling due to temperature settings. A consistent starting seed will produce the same output from the same input and same parameters.

You would specify a seed when you want the same answer for the same input. With no seed set, the same chat history could generate many different responses.
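The same idea can be seen with Python’s own random number generator. This sketch has nothing LLM-specific in it: the tokens and their probabilities are made up, and it simply shows that two runs with the same seed pick the same sequence.

```python
import random


def sample_tokens(tokens, weights, count, seed=None):
    """Draw `count` tokens at random, weighted by probability."""
    rng = random.Random(seed)  # a private generator, seeded if a seed is given
    return rng.choices(tokens, weights=weights, k=count)


tokens = ["cold", "freezing", "chilly", "mild"]
weights = [0.5, 0.3, 0.15, 0.05]  # made-up probabilities

first = sample_tokens(tokens, weights, 5, seed=42)
second = sample_tokens(tokens, weights, 5, seed=42)
print(first == second)
```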

More parameters

There are a LOT of parameters that can affect generation, but these are some of the ones that you will likely tweak initially. For more information on parameters, the ollama docs have an interesting list.

Chat Templates

Chat- and Instruct-tuned models, the kind we are using in this series, expect that the chat history is formatted in a particular way. Each model is a bit different. If you don’t format the chat history appropriately then you can expect to get strange results from the LLM.

The model publisher will usually give information on its expected template or format.

The Llama 3 instruct template looks like this.

<|begin_of_text|><|start_header_id|>system<|end_header_id|>

<|eot_id|><|start_header_id|>user<|end_header_id|>

<|eot_id|><|start_header_id|>assistant<|end_header_id|>

<|eot_id|>

Let’s break this down. The prompt is expected to begin with <|begin_of_text|>, and the rest is a repeating pattern, where {role} is system, user or assistant and {message} is the text of that message:

<|start_header_id|>{role}<|end_header_id|>

{message}<|eot_id|>

The server will try to do this formatting for you, but there are ways to do it manually and bypass the server’s formatting. Why would you do this? Well, apart from curiosity about how all this works, there are a few practical reasons. For one, you may be working with a newly released model that loads correctly since it is a new instruct-tuned model from a supported base, but it uses a new or slightly changed prompting style. Another reason might be that you are using an API endpoint that doesn’t automatically do the formatting for you, so you are required to do it yourself.

Naturally, a templating engine like Jinja would be a great tool to use here, but to keep this example simple we’re just going to write it in pure Python. We won’t be adjusting our main chatbot to use this; it’s just a side quest to explore the templates concept. So let’s take a copy of the chatbot and begin to make some changes.

cp chatbot-streaming.py chatbot-templates.py

First we’ll define a function that can format our chat history:

def format_chat_llama3(message_history):
    START_BLOCK = """<|begin_of_text|>"""
    MESSAGE_BLOCK = """<|start_header_id|>{user_id}<|end_header_id|>

{message_text}<|eot_id|>"""
    ASSISTANT_START = """<|start_header_id|>{user_id}<|end_header_id|>

"""

    chat_str = START_BLOCK
    for message in message_history:
        chat_str += MESSAGE_BLOCK.format(
            user_id=message["role"],
            message_text=message["content"]
        )
    
    # Add the assistant start block ready for the assistant to begin replying:
    chat_str += ASSISTANT_START.format(user_id="assistant")
    return chat_str

Here you can see we have the initial start string in START_BLOCK and then the repeating message string in MESSAGE_BLOCK. Our chat history already uses the same names that Llama3 expects, but some LLMs have been tuned to expect different names for the roles, so this is also a natural point to substitute the names of roles to the names expected by the LLM.
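As a small sketch of that substitution step, the helper below renames roles before the history is formatted. The function name and the human/bot mapping are hypothetical, purely to show the shape of it; check your model’s documentation for the names it actually expects.

```python
def rename_roles(message_history, role_names):
    """Return a copy of the history with roles renamed for a given model."""
    return [
        {"role": role_names.get(m["role"], m["role"]), "content": m["content"]}
        for m in message_history
    ]


# Hypothetical mapping for a model tuned with different role names
EXAMPLE_ROLE_NAMES = {"user": "human", "assistant": "bot"}

history = [
    {"role": "system", "content": "You are a helpful teacher."},
    {"role": "user", "content": "How long is a day on Mars?"},
    {"role": "assistant", "content": "About 24 hours and 37 minutes."},
]
renamed = rename_roles(history, EXAMPLE_ROLE_NAMES)
print([m["role"] for m in renamed])
```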

One interesting thing to notice here is that we want to start the assistant’s reply with the correct metadata tokens/formatting. Remember in the first article that we discussed that LLMs are simply predicting the next token in a stream of text. By beginning the assistant’s reply block we ensure that the LLM replies properly as the assistant, and doesn’t attempt to fill in its own metadata tokens, further expand the user’s question, or behave in other strange ways.

Just to highlight one way this can show up: if you don’t add the assistant start block to the prompt sent to Llama3, it will begin its response with assistant followed by 2 blank lines. That means wasted compute and token usage, plus string manipulation overhead to remove it, which you just don’t want.

Now, in our main function we can generate a fully formatted version of our chat history with a quick one liner, and we’ll print that for our own benefit.

formatted_chat = format_chat_llama3(messages_history)

print("Formatted chat:")
print(formatted_chat)

Example Formatted Chat

Here is an example chat after being formatted with the template.

<|begin_of_text|><|start_header_id|>system<|end_header_id|>

You are a helpful teacher who is teaching students about astronomy and the Solar System.<|eot_id|><|start_header_id|>user<|end_header_id|>

What is the average temperature on mars? Give me a short answer please.<|eot_id|><|start_header_id|>assistant<|end_header_id|>

The average temperature on Mars is around -67°C (-89°F), but it can range from -125°C to 20°C (-200°F to 70°F) depending on the time of day and season!<|eot_id|><|start_header_id|>user<|end_header_id|>

Another short answer please: How long is a day on Mars?<|eot_id|><|start_header_id|>assistant<|end_header_id|>

A day on Mars, also known as a "sol", is approximately 24 hours and 37 minutes long!<|eot_id|><|start_header_id|>user<|end_header_id|>

Next question: What is the atmosphere made of?<|eot_id|><|start_header_id|>assistant<|end_header_id|>

Generating with the Formatted Chat

Now we need to change how we generate the text. Since we don’t want the server to format our chat, we won’t use the Chat Completions API, we’ll just use the Completions API. We won’t use streaming for this example.

# Generate a response from the history of the conversation
llm_output = client.completions.create(
    model="Meta-Llama-3-8B-Instruct-q5_k_m",
    prompt=formatted_chat
)

full_response = llm_output.content

add_assistant_message(full_response)

You can see the full example for doing our own formatting here.

System Prompts

System prompts are useful for providing context, guidelines, rules and extra prompts on the expected responses to the LLM. You can even give the LLM a style of speaking or a character to portray!

Not all LLMs have been trained to accept a system prompt. The model won’t break if you send it one, but you may notice it isn’t particularly influential on the results you get from the LLM. In this case it’s recommended to just prepend the system prompt to the first user input.
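A minimal sketch of that fallback could look like the following. The function name is made up for this example, and it assumes the same list-of-dicts chat history we have been using throughout.

```python
def fold_system_prompt(messages):
    """Merge any system messages into the first user message."""
    system_text = "\n\n".join(
        m["content"] for m in messages if m["role"] == "system"
    )
    # Copy the remaining messages so the original history is left untouched
    result = [dict(m) for m in messages if m["role"] != "system"]
    if not system_text:
        return result
    for m in result:
        if m["role"] == "user":
            m["content"] = system_text + "\n\n" + m["content"]
            break
    else:
        # No user message yet: send the instructions as a user message
        result.insert(0, {"role": "user", "content": system_text})
    return result


history = [
    {"role": "system", "content": "Answer briefly."},
    {"role": "user", "content": "How long is a day on Mars?"},
]
folded = fold_system_prompt(history)
print(folded[0]["content"])
```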

A good prompt can include:

A brief note about what kind of AI assistant it is, or its role
What the goal of the AI should be
Instructions on the expected output, and perhaps output formatting
Some examples of the output or expected writing style

I want to shout out a really interesting project, Fabric. This project has collected, collated, and written some REALLY awesome prompts for a large variety of tasks. They even have a prompt to improve prompts! I definitely recommend looking through the prompts to get some ideas on some well written, well thought out, and well tested prompts. Huge Kudos to Daniel and the contributors to the project on the fantastic work.

As an example of a system prompt, here is a basic prompt I use to help me review my blog posts.

Please review my blog article. It is trying to be informative, engaging without being too formal. It should be concise, to the point, and not meander around with flowery language.

The article should be clear. The explanations should make sense, and provide enough context to the reader. If the explanations are unclear or inaccurate, please point them out.

The article should have a nice flow. The article should build on top of previously stated information. It should avoid jumping around and confusing the reader. If the flow is confusing, please let me know.

The article should be correct. If there are any inaccuracies, please point them out.

The article should be grammatically correct. If there are any grammar or spelling issues, please let me know.

If there are any incomplete sentences, or if i haven’t completed a section by accident, please let me know.

Prompting Strategies

There are a number of tricks to getting the most out of your LLMs. These fall under “prompting strategies” and can include a number of techniques. I hate the term, but this is often referred to as “Prompt Engineering”. Some interesting ones are:

Chain-of-Thought - where you ask the LLM to output some initial steps and intermediate information, in a way allowing it time to “think” about the answer before immediately generating tokens of the final answer.
Few-Shot Prompting - Provide some examples about the expected output so the LLM can mimic the results.
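As a sketch of Few-Shot Prompting with the chat history format used in this series, one common approach is to present the examples as earlier user and assistant turns. The helper name and the example reviews below are made up for illustration.

```python
def build_few_shot_messages(system_prompt, examples, question):
    """Build a chat history where each example is a user/assistant pair."""
    messages = [{"role": "system", "content": system_prompt}]
    for example_input, example_output in examples:
        messages.append({"role": "user", "content": example_input})
        messages.append({"role": "assistant", "content": example_output})
    # The real question goes last, ready for the LLM to answer
    messages.append({"role": "user", "content": question})
    return messages


examples = [
    ("The battery died after a day.", "negative"),
    ("Setup took two minutes and it just works.", "positive"),
]
messages = build_few_shot_messages(
    "Classify the sentiment of each review as positive or negative.",
    examples,
    "The screen is gorgeous.",
)
print(len(messages))
```

The resulting list can be passed as the messages argument of a chat completion request.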

There are a number of guides available for finding new strategies to improve LLM performance. I find this one quite useful: PromptingGuide.AI

Agents

You’ll run into the terms “agents”, “agentic systems” or “agentic workflow” often. At the most basic level these are simply different System Prompts organised into a workflow. For instance, you may have a system prompt that is targeted at summarizing documents, perhaps called a Summarizing Agent. The original document and the summary then get passed into another LLM step with a new system prompt focused on checking the accuracy and quality of the summarization; let’s call that a QA or QC Agent.

These agent focused prompts can get pretty powerful by having feedback loops, keeping separate chat histories that are relevant only to their individual tasks and having the LLMs determine which system prompt is needed next out of a list of available system prompts. You can even mix and match models and system prompts for the best pairing on a given task.

While there are a number of agent frameworks available, they are not essential for using agents. The power of these frameworks comes more from the utilities they provide around agents, like easily fetching and parsing documents and web searches, and some convenience methods for connecting these together. It’s perfectly possible to set up a quick agent workflow in just a few tens of lines of code by working with LLM APIs directly.
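Here is a rough sketch of the summarize-then-check workflow described above. Everything in it is illustrative: the prompts, the function names and the PASS/FAIL convention are made up, and fake_generate is a stand-in so the example runs without a server. In practice generate would wrap a chat completion request and return the reply text.

```python
SUMMARIZER_PROMPT = "Summarize the document in two sentences."
QA_PROMPT = (
    "You check summaries for accuracy. Reply PASS if the summary is "
    "faithful to the document, otherwise reply FAIL with a reason."
)


def run_agent(generate, system_prompt, user_content):
    """Run one LLM step with its own system prompt and a fresh chat history."""
    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_content},
    ]
    return generate(messages)


def summarize_with_review(generate, document, max_attempts=3):
    """Summarize, then ask a second agent to check the result; retry on FAIL."""
    summary, approved = "", False
    for _ in range(max_attempts):
        summary = run_agent(generate, SUMMARIZER_PROMPT, document)
        review_input = f"Document:\n{document}\n\nSummary:\n{summary}"
        verdict = run_agent(generate, QA_PROMPT, review_input)
        approved = verdict.strip().upper().startswith("PASS")
        if approved:
            break
    return summary, approved


def fake_generate(messages):
    """Stand-in for a real LLM call, so the sketch runs without a server."""
    if messages[0]["content"] == QA_PROMPT:
        return "PASS"
    return "Mars is cold. A day there lasts about 24 hours and 37 minutes."


summary, approved = summarize_with_review(fake_generate, "A long document about Mars.")
print(approved, summary)
```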

Wrap Up

That’s all for today. That was much shorter than the first article! Let’s recap what we’ve covered today:

We’ve explored using web APIs to interact with remote servers hosting LLMs. This allows us to:

Iterate more quickly
Host the LLM on a more powerful system
Switch between hosts/providers more easily

Next we explored Streaming generation, which can be used to reduce latency in requests for the user.

Following that we dug into usage statistics which we can use to understand which parts of our system, or which users or agents are utilising the system the most. We can also monitor usage costs and compare usage costs across other potential providers by comparing historical usage metrics with published API pricing.

Then we looked at some of the generation settings available which can be used to tweak the LLM to provide more creative, or more deterministic outputs.

We also had a look at Chat Templates and how we can manually apply chat templates. This gives us a greater understanding of how models are fine-tuned for different prompting styles, and allows us flexibility in how we create generations from LLMs, especially if the server does not support a given chat template.

Finally we discussed System prompts, their purpose and what can be put into a good system prompt. I again encourage you to read some of the prompts included in Fabric for some great examples. We also briefly spoke about agents, and how system prompts can be used to define the different roles of an LLM in a larger system.

Next Time…

In the next article we’ll deviate from LLMs and look at finishing our chatbot with a voice system, providing Text To Speech and Speech To Text functionality.

Exercises for the reader

Write a new system prompt for a workflow problem you think LLMs could solve
Host the LLM on a second PC or laptop
Connect to an Ollama server instead of llama.cpp
Get an OpenAI key and connect to OpenAI instead of llama.cpp.
Experiment with making detailed system prompts for a problem you have, or a task you have for LLMs.

#1 - Getting Started - No BS Intro To Developing with LLMs

2024-06-12T02:29:13+00:00

While experimenting with LLMs over the weekends I grew very frustrated trying to decipher terms and find good information on all the new advances. It was difficult to find a good A to Z beginner guide. Finding information like precise definitions was tough because particular morsels of information were hidden on Reddit, different repos, issue trackers and so on. Some definitions were never clear because they were just assumed knowledge. To make matters worse, search engines and SEO these days constantly surfaced blog posts that were just ads and sales funnels for hosted services, thin GPT wrappers, or paid Medium articles.

While getting myself up to speed I took extensive notes that answer the questions I had, and decided to turn them into a series of articles in the hopes that it helps anyone else with similar frustrations. This is the guide I wish I had!

I’m going to walk you through building a basic voice assistant. Along the way I’ll introduce the terminology, build up a glossary of terms, introduce concepts, and show clear concise examples at each step. I’m not going to sell you anything, we’re going to use free and open source projects and use some of the emerging standard pieces of tech with no BS.

While I don’t find chatbots particularly exciting or the “killer app” of LLMs, they provide an excellent self-contained project where we can explore the tech and introduce all the concepts and terminology in a straightforward way. After this series you’ll be well prepared to unlock the worlds of potential LLMs have for content moderation, sentiment analysis, data cleanup, summarization and other Natural Language Processing (NLP) heavy tasks that were previously incredibly difficult or impossible.

I’ll lay out some of the different tech choices that can be made and explain the reasoning behind choosing a particular one for this introduction. We’re going to self host everything for the purposes of this series. Working locally will mean there is no fear of unexpected costs and data leaks. I personally find these fears are devastating for my own experimentation. I believe working locally reduces many of the barriers to early experimentation.

Let’s get started!

What we’ll cover

We begin with 3 articles in this series, but I have a few more planned to keep diving deeper into the tech and concepts.

In this first article we’ll cover a lot of the initial jargon and terms you’ll find. We’ll then download a model, and get a basic chat loop going in code.

In article 2 we’ll look at using an LLM via network APIs so we can connect to other LLM services, as well as explore a lot more about working with LLMs, generation settings, chat templates and so on.

Article 3 will guide us through adding some basic voice interactivity since processing speech from live microphones or recordings will be very useful to many LLM applications.

If you’re an absolute beginner and are absolutely bamboozled with all the jargon and terms, stick with article 1 which will get you up to speed in no time.

LLMs and Transformers

This article assumes you know what LLMs are, at least from a user’s perspective, and will dive into the practical details of using them, exploring terms as they come up only to the depth required to choose models and develop applications that use LLMs. We specifically won’t be going into the internals of LLMs and Transformer architectures. Having said that, there are terms we’ll briefly cover as you’ll see them come up a lot, but don’t worry, they sound a lot more complicated than they are.

Tokens - These are the pieces of words, sentences and grammar that LLMs work with. They don’t directly map to any human concept, but you can almost imagine these like syllables and punctuation. They can represent anything from a single character to a whole word, but generally they represent a few letters or characters, and words often break up into multiple tokens. Each model family defines its own token set, and some models that are trained off a base model define additional tokens. A larger token vocabulary generally means that fewer tokens are needed to represent a given set of text.
Embeddings - Embeddings are large 2D matrices where each token is represented by a large vector (1D matrix) that captures its meaning and concepts. This is the final representation of a sequence that the LLM works with. We don’t need to touch on this for a while, and it will likely only become relevant for us when we get to data storage and lookup, known as Retrieval Augmented Generation (or RAG). I’ve linked some resources in the next section if you’d like to know more.
Autoregressive - This means that the model predicts the next token, adds it to the input, and then the process begins again to predict the following token.
Causal LM / Causal Language Model - This simply means that the language model predicts the next token or word forward based on previous tokens only. It can’t look ahead at future tokens, it will only continue the sequence of what was provided so far. This is the most common model type you’ll use currently.
Context Length - This is how much text, or history that can be considered at a time by the LLM. It’s expressed in number of tokens. A larger context window means more tokens, and therefore more text can be processed at a time.
Transformers - These are the core of LLMs. It’s a deep learning architecture developed by Google in 2017, and has been one of the key technological advances in the development of LLMs.
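To make the Autoregressive and Causal LM entries a little more concrete, here is a toy sketch of the generation loop. The lookup table is a made-up stand-in for a real model and <eot> is an invented stop token, but the shape of the loop is the point: predict a token from what came before, append it, and go again.

```python
def generate(next_token, prompt_tokens, max_new_tokens, stop_token="<eot>"):
    """Autoregressive loop: predict a token, append it, repeat."""
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        token = next_token(tokens)  # sees only the tokens so far (causal)
        if token == stop_token:
            break
        tokens.append(token)
    return tokens


# A toy "model": a lookup table from the last token to the next one
TOY_TABLE = {"Mars": "is", "is": "very", "very": "cold", "cold": "<eot>"}


def toy_next_token(tokens):
    """Pick the next token based only on the most recent token."""
    return TOY_TABLE.get(tokens[-1], "<eot>")


print(generate(toy_next_token, ["Mars"], max_new_tokens=10))
```

Lowering max_new_tokens cuts the output short, which is the same thing that happens when a real model hits its token generation limit.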

Resources for Learning ML and Transformers Internals

If you want a deeper dive into the internals of LLMs and Transformers I’d recommend the 3Blue1Brown video series on Neural Networks, specifically videos 5 and 6 focusing on embeddings and transformers, which help build a solid understanding of all the concepts involved. After that consider doing the Hugging Face Natural Language Processing course.

If you want to take a closer look specifically at embeddings, I found this article was really informative - A Beginner’s Guide to Tokens, Vectors and Embeddings in NLP by Sascha Metzger.

Another suggestion if you want to dive deeper is to get familiar with linear algebra. I come from a Game Development background, and a lot of the math for handling geometry and transformations such as rotation, translation, projecting to screen space and so on is done with vectors and matrices. You learn to visualise these operations and build up a pretty good intuition for working with them in 3 dimensions. I’ve found that a lot of the ML theory has been a very natural progression for me with this background. While you can’t visualise multi-thousand dimension vectors, the math follows pretty simply. If you want to get familiar with some of that math in a fairly intuitive way, my personal favourite book on the matter is freely available: 3D Math Primer for Graphics and Game Development. The first few chapters are of particular interest.

Setting up our project

This guide assumes you are working in Ubuntu 24.04. You should be able to follow along just fine in Windows with WSL, and with a few minor modifications on macOS too. Make sure you have around 10 GB or more of RAM and around 50-60 GB of disk space free. We’re going to have a few copies of the model on disk during the conversion process so we need some extra space.

The main things we want are:

Git - The version control tool
Git-LFS - Large File Storage extensions for git. This allows us to work with files that are many gigabytes within the familiar git workflow.
Build-Essential - This is a metapackage in Ubuntu that has a bunch of C and C++ tools which will be required to compile some tools later.

# Install some build tools
sudo apt-get install -y git git-lfs build-essential

One complication is that llama.cpp doesn’t support Python 3.12 yet due to some dependencies. So along with some tools we’ll also need to install Python 3.11 which isn’t available in the official Ubuntu 24.04 repositories. Deadsnakes is a very popular repository to get alternative builds of Python. This is the method we’ll use until Python 3.12 is fully supported.

Note that I’ve heard anaconda, or simply conda, has Python 3.11 included. If you are familiar with conda then feel free to use it, but I’ll be sticking with the system Python for this series.

# Install Python 3.11
sudo apt install -y software-properties-common
sudo add-apt-repository ppa:deadsnakes/ppa
sudo apt update
sudo apt install -y python3.11 python3.11-venv

Finally we’ll set up our project folder

# We'll work in a directory for this project
mkdir chatbot
cd chatbot

# We'll be working with python a fair bit, so we'll setup a virtual environment to work in
python3.11 -m venv .venv
# Activate the virtual environment
source .venv/bin/activate

Speed vs Performance

A quick note: during this series I’ll regularly refer to the speed and performance of the model. I’ll try to maintain the convention that

Speed is the inference, or generation speed of the model
Performance is the quality of the generated results.

Where to Begin

First up, we’re just going to run a model locally and that introduces the first 2 questions: How? And what model?

Model Runtimes / Engines

There are so many ways to run models now. But what’s best? Which should you choose? There’s no right or wrong answer. Here are a few of the more popular solutions available and why we’ll settle on a particular method for this series.

Just quickly before diving in, let’s take a moment to go over some terms that will repeatedly come up:

Jupyter notebooks are a bit like a special case of python scripts that allow intermingling of code and notes, as well as a semi-interactive python interpreter allowing you to adjust and rerun pieces of code in-line. They are excellent tools for learning, note taking, interactive experimentation and data science.
PyTorch is a python library dedicated to machine learning / ML. It has a whole host of tools within it. Of particular interest are its fast implementations of neural networks and tensors, and algorithms for training them. I’m not diving into the particulars of PyTorch in this series, but the name will pop up regularly, so it’s worthwhile understanding what it is.
HuggingFace Transformers is another python library specifically for working with transformer networks, and downloading existing models hosted on HuggingFace. It builds on top of PyTorch, TensorFlow and JAX which are more generic ML libraries.

Reference Implementation / PyTorch

Each model generally comes with its own reference implementation which is a python script or scripts, or a Jupyter notebook, and runs using PyTorch or the HuggingFace Transformers Library, or a similar Machine Learning framework. This is how the model was developed and gives a good baseline for the performance of a model. For our purposes, this isn’t particularly interesting, but if you want to dive deep on a particular model, or family of models, you might want to look into this.

Ollama

Ollama is a popular and quick way to run models. It’s easy to install on Mac, Windows and Linux. With a single command you can download and run a model. There’s a nice web interface to browse the available models. The models are pre-quantized so they can run on much lower end hardware than the original source model.

There is a large ecosystem of integrations and tools that work with Ollama, so it’s an excellent way to prototype and quickly interact with models, and even to target when developing applications since its server mode is quite popular. We want to dive a fraction deeper so we can hit some of the interesting details ourselves. Under the hood this uses Llama.cpp.

This is available under the MIT license which is very permissive.

Text Generation WebUI

Text Generation WebUI, sometimes called Oobabooga after the developer’s GitHub account, is another popular option. This is more of an application itself. It does have an OpenAI compatible API, but it’s probably not something I’d build on top of. Instead, this is what I use as a bit of a playground and quick chat interface with existing models. This automatically installs Llama.cpp, but also contains other methods of running and connecting to LLMs.

This is under the AGPL 3.0 license which has some considerations when using it commercially.

Llamafile

Llamafile is a very interesting project from Mozilla that embeds a cross platform runtime in with the model itself. This allows you to download a single file, double click it, and start interacting with the model immediately. The runtime is a fork of Llama.cpp. The developers of Llamafile have been very active in adding speed improvements as well as submitting these upstream to Llama.cpp.

This is under the Apache License 2.0 which is quite permissive.

Llama.cpp

Llama.cpp is a very fast implementation of inference for LLMs. It started off as a CPU implementation for Meta’s original LLaMA model but has expanded to support a wide range of models, with a large set of backends including GPU support via CUDA, Metal and Vulkan. It can also run the models in a hybrid state where part of the model runs on CPU and part runs on the GPU. It even runs on Raspberry Pis and Android devices! As you can imagine, this is a very powerful and useful piece of tech. Basically, you can run this anywhere on whatever you have.

This is available under the MIT license which is very permissive.

Llama-cpp-python

Llama-cpp-python is a python wrapper around llama.cpp. This is the first thing we’ll use before moving directly to Llama.cpp.

Others

There are SO MANY ways to run LLMs these days, these are just a few of the popular ones you’ve likely heard about. There’s more including Koboldcpp, NVIDIA ChatRTX, LM Studio, as well as full frameworks for end-to-end processing and agents like LangChain, LlamaIndex, Crew.ai, Autogen, the list goes on.

Final Choice

Once again, our goal here is working with LLMs in a more bare-bones practical sense so we can learn some of the behind-the-scenes details these existing frameworks and tools abstract away. You can see that llama.cpp plays a central role in most of these options. You can also run llama.cpp just about anywhere on whatever hardware. As a result, we’re going to cut out most of the intervening layers and just work directly with llama.cpp and llama-cpp-python so we can get a little closer to the nuts and bolts, while allowing us to experiment on whatever hardware we have laying about. In the second article we’ll also move towards a way of running the models that should allow you to change how you are running the model quite transparently to the application you are building.

Choosing A Model

Quantized, 7b, 13b, Q5_? WTF?

Ok, there’s lots of Jargon you’ll run into when choosing a model, let’s break down some of them.

Parameters - This is the number of weights in a model. Usually expressed in billions or trillions, hence the 7B, 13B, 1.7T. The number of parameters directly affects the memory required to run the model, as well as the processing power required to generate text quickly. This can be used to also quickly estimate the minimum amount of RAM required to run the models at the original quality and how big the download will be. Without going in-depth on model configuration and architecture, a lot of models use 16bit floating point numbers for the bulk of their weights, so 7,000,000,000 parameters * 2 bytes = 14,000,000,000 bytes = 14 gigabytes of memory
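To make that arithmetic easy to repeat for other model sizes, here is a minimal sketch of the same estimate. The function name is made up for illustration, and it only counts the weights themselves, so treat the result as a lower bound rather than the real memory footprint once the context and runtime overhead are added.

```python
def estimate_weight_memory_gb(parameters, bits_per_weight=16):
    """Rough memory needed just to hold the weights, in gigabytes (10^9 bytes)."""
    bytes_needed = parameters * bits_per_weight / 8
    return bytes_needed / 1_000_000_000

# The same back-of-the-envelope estimate for a few common model sizes
for name, params in [("7B", 7_000_000_000), ("13B", 13_000_000_000), ("70B", 70_000_000_000)]:
    print(f"{name}: ~{estimate_weight_memory_gb(params):.0f} GB at 16 bits per weight")
```

Passing a smaller bits_per_weight gives a feel for what quantization buys you, for example 8 bits halves the estimate for the same parameter count.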

Quantized / Quantization / Q4 / Q8 - This is a form of lossy compression. Essentially it tries to balance keeping as much original data and performance as possible while discarding information that doesn’t impact the results from the LLM by a significant amount. I’ll be covering this in a bit more depth after we’ve downloaded a model. We’re going to download a model at its full precision first, and learn how to quantize it ourselves.

Base / Chat / Instruct Models - These are different levels, stages, or purposes of training. A base model has had the vast majority of training performed. This is the expensive part of training and building a model. The Chat and Instruct variants have had a further level of training, also called fine-tuning, performed on them to target them for a specific purpose or use. Instruct is also known as “instruction following”. This is useful for models that you might be asking to summarize documents, or perform specific tasks. Chat variants are more targeted at chat-bots and assistants and are good at the question/answer prompt formats.

Fine-Tuning / Fine-Tuned models - Fine tuning is a form of training performed on pre-trained models. A Base model can be fine tuned into a Chat or Instruct model for instance. Fine tuning can change the style of a model’s output, or help it perform better on specific tasks or formatting. There is some debate around how effective fine-tuning is for adding new knowledge to an LLM.

LoRA / QLoRA / PEFT - These are techniques used in fine-tuning to drastically reduce memory and computation requirements. They have no bearing on a model after it has been trained. Very occasionally you may run into a model which is only available as a “LoRA”. What this means is that it’s the result of fine-tuning, but the results haven’t been “applied” or merged into the model. You can think of it kind of like a diff or a set of delta changes. In this instance you would be required to load the original model and then apply the LoRA, or you can simply merge the LoRA into the original model and save it as a new file which you can load as you would any other model.
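To make the “diff” idea a little more concrete, here is a toy sketch of merging a LoRA into a single weight matrix. It assumes the usual LoRA formulation, where the adapter is two small matrices whose product is added to the original weights. The names merge_lora, lora_a and lora_b are made up for illustration, and real tools do this across every adapted layer with proper tensor libraries rather than Python lists.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def merge_lora(weights, lora_b, lora_a, scale=1.0):
    """Return weights + scale * (lora_b @ lora_a) without modifying the input."""
    delta = matmul(lora_b, lora_a)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(weights, delta)]

# A 3x3 "layer" from the original model
base = [[1.0, 0.0, 0.0],
        [0.0, 1.0, 0.0],
        [0.0, 0.0, 1.0]]

# A rank-1 adapter: a 3x1 matrix and a 1x3 matrix (6 numbers instead of 9)
lora_b = [[0.5], [0.0], [-0.5]]
lora_a = [[1.0, 2.0, 0.0]]

merged = merge_lora(base, lora_b, lora_a)
for row in merged:
    print(row)
```

The saving looks trivial at this size, but for a layer with thousands of rows and columns the two thin matrices are a tiny fraction of the full weight matrix, which is why a LoRA download is so much smaller than the model it modifies.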

Mixture-Of-Experts / MoE - This is a model architecture where only parts of the model are used/activated at a given time. You still need to be able to load the full model into memory for fast inference/generation speeds, because the model will decide which expert is needed for the next bit of the sentence being generated. Since only a portion is being used at any given time this dramatically speeds up the inference / generation speed of the model. Mixtral for instance only uses 2 experts at a time, and is made up of eight 7 billion parameter expert models, so your generation speeds are closer to those of a 13b parameter model. However just to be clear, these experts aren’t selectable. You can’t just say “Well, I don’t need medical information, so I’ll turn off that expert”. That’s not how it works, and I’m not sure it’s entirely clear which portion of the model contains which bit of knowledge or training. This is purely a computation optimization, not knowledge segmentation.
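Here is a toy sketch of the routing idea, loosely modelled on a top-2 scheme like the one Mixtral is described as using. Everything in it is an illustrative assumption: the experts are just functions that scale a number, and the router scores are hard-coded, whereas in a real model both are learned networks and the routing happens for every token inside the model’s layers.

```python
import math

def softmax(values):
    """Turn raw scores into weights that sum to 1."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def route_top_k(router_scores, k=2):
    """Pick the k highest-scoring experts and normalise their weights."""
    ranked = sorted(range(len(router_scores)), key=lambda i: router_scores[i], reverse=True)
    chosen = ranked[:k]
    weights = softmax([router_scores[i] for i in chosen])
    return list(zip(chosen, weights))

# Eight toy "experts": each one just scales its input by a different factor
experts = [lambda x, factor=f: x * factor for f in range(1, 9)]

def moe_layer(x, router_scores):
    """Run only the selected experts and blend their outputs."""
    selected = route_top_k(router_scores)
    blended = sum(weight * experts[i](x) for i, weight in selected)
    return blended, [i for i, _ in selected]

scores = [0.1, 2.0, -1.0, 0.3, 1.5, 0.0, -0.5, 1.0]
output, used = moe_layer(10.0, scores)
print("Experts used:", used)
print("Output:", output)
```

All eight experts have to exist in memory because any of them might be picked next, but only two are actually evaluated per call, which is where the speed-up comes from.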

Merge / Merged Model - This is a process where 2 models can have their weights merged and the result can outperform both original models. It’s an interesting concept that can produce interesting results.

Model Formats

Models can be packaged, distributed and run in several ways. The runtime method chosen often dictates what model format you need. Thankfully there’s only a few we need to worry about, but here’s some terms you’ll run into a fair amount.

PyTorch - This isn’t an LLM format exactly, rather just a generic PyTorch model, and it can be saved in several ways.

.pt / .pth / .bin - This is a generic serialization format for PyTorch models. This is serialized using Pythons Pickle library which can also contain and execute python code during the unpacking process which naturally has some security concerns, especially if you start getting real experimental and downloading many arbitrary models to test with.

SafeTensors - This is a new method of serializing a model. While a PyTorch model saved via the pickle method can contain arbitrary code, SafeTensors cannot, so it’s a more secure format for sharing models.
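To see why the pickle-based formats raise security concerns, here is a harmless demonstration using only Python’s standard pickle module. The class name is made up, and the “payload” only calls print, but the same mechanism lets whoever created the file choose any importable function to run when it is loaded.

```python
import pickle

class NotReallyAModel:
    """Stands in for a malicious object hidden inside a pickled model file."""

    def __reduce__(self):
        # pickle stores "call this function with these arguments" and replays
        # it on load. A harmless built-in is used here; an attacker could
        # name something far less friendly.
        return (print, ("this ran during unpickling",))

payload = pickle.dumps(NotReallyAModel())

# Simply loading the bytes triggers the call chosen by the file's author
result = pickle.loads(payload)
print(type(result))
```

Nothing in the loading code asked for print to be called, and what comes back isn’t even the original object. SafeTensors avoids this class of problem by storing only tensor data and metadata.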

HuggingFace / hf / Transformers - This is generally just a PyTorch or SafeTensors model with a specific layout of files to define the model, the model architecture, the tokenizer etc. This is compatible with the HuggingFace transformers library. This is the kind of model we’ll download.

GGUF (GPT-Generated Unified Format) - This format is developed alongside the Llama.cpp project, and is therefore the one we’ll be using for inference / generation. All necessary data is packed into a single file.

GGML - This is the predecessor to GGUF and is obsolete now. You should convert to or stick with GGUF.

JAX - JAX is a high performance machine learning library from Google, who release some of their models with a variant that supports JAX.

ONNX Runtime - ONNX is an AI framework from Microsoft. Microsoft typically release their models with an ONNX variant.

The vast majority of models have a HuggingFace Transformers variant released, so this is a good format for us to focus on since it’s quite widely supported / available.

Generally a HuggingFace model is made up of many files. Here are some of the files you can expect to see in a HuggingFace model.

config.json will describe some model configuration
generation_config.json defines some runtime parameters like context length and temperature
*.safetensors or *.bin stores the actual model weights and the different matrices. There might be multiple of these files, especially for very large models.
*index.json defines which part of the model is stored in which file. May not exist if the model you downloaded isn’t split into files.
tokenizer.json, special_tokens_map.json - These define the tokens for the model. You can open this in a text editor and see all the token IDs and their associated string.
tokenizer_config.json - Configuration of the tokenizer.

The models I will recommend and link are all in the HuggingFace format, and we’ll convert to GGUF after we’ve downloaded a model. Thankfully Llama.cpp includes tools and scripts to convert the model we download from a HuggingFace format model to a GGUF model.

Downloading a Model

Ok, so now some of the terminology is out of the way, it’s time to pick a model. I’ve linked some of my go-to models below. These are generally very good models to start with. There are a bunch of competing fine-tunes and merged models that stem from these models. The fine tunes are constantly topping the leader-boards, but my preference is to stick with these models for the most part. That’s the beauty though, download whatever models look interesting to you and give them a shot!

A quick note: Some research has shown that it’s possible to embed sleeper trigger prompts into a model with fine tuning. Personally I stick with models from the larger organizations or the more popular models you can find recommended on the LocalLlama subreddit. This is no guarantee of safety from these risks, but if you are building something professional or trying to make a safer application it’s probably best to consider this potential attack vector when selecting a model.

If you want other recommendations on models make sure to check out the LocalLlama subreddit, the HuggingFace Open LLM leaderboard and the LMSys ChatArena Leaderboard.

Most models can just be cloned with Git LFS, or you can individually download all the important files. For this example I’m going to use Llama-3-8b-Instruct. You will need to create a HuggingFace account and accept the license agreement to get access to this model.

# You will be asked for your huggingface username and an access token (used in place of a password) to clone this.
git clone https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct

Another option to get models is to use the HuggingFace Hub CLI tools. I won’t be covering this though I wanted to draw your attention to it as an alternative. It can be worthwhile as it can require less space on your disk when you download it.

My Recommended Starting Models

This field moves fast, models are released regularly, but at the time of writing (June 12th, 2024) these are my go-to models in no particular order.

Mistral 7B Instruct 0.3 - This is an excellent model, at only 7b parameters it doesn’t require much memory and generates text very quickly. It has a large context window of 32,000 tokens.
Llama 3 - Instruct - This model family was released early 2024 by Meta with a fairly permissive license to use it for most people and companies. It comes in an 8b and 70b parameter variant. The 8B parameter variant performs quite well and can serve as a good starting point. The main limiting factor for me with these models is the 8192 token context window. This does have an unusually large token vocabulary though, so those 8192 tokens go pretty far. Meta have said there are more models coming in this family including ones with larger context windows.
Phi-3 Instruct - This is a model family from Microsoft. It comes in 3.8B, 7B and 14B variants, with both 4k and 128k context windows. As of writing the 7B (small) model is not yet supported by llama.cpp. These models perform surprisingly well and are seriously impressive. With the 128k context windows, I find I’m using the 14B model a lot for document reviews.
Mixtral 8x7B Instruct 0.1 - This is a MoE model from Mistral. It’s comprised of 8 x 7b parameter experts. This requires a fair amount of RAM, especially un-quantized, but inference speed is about the same as a 13b parameter model when generating. It also has a 32k context window. Truth be told, I don’t use this model much anymore. The latest 7b and 8b models come close enough in quality that it’s hard to justify the memory requirements of this one. I find I jump to the larger 70b class of models instead of this if I want better performance.

Installing Llama.cpp

The model download could take a while, so in the meantime let’s move on to installing llama.cpp. We’re going to build this from the latest code on GitHub since it’s pretty easy, but there are binary downloads available if you prefer.

These instructions are for Ubuntu (and Ubuntu on WSL), but it’s a very similar process for Windows and Mac. You’ll need to ensure you have development tools installed. We did this before in the project setup for Ubuntu. More information on the other platforms can be obtained at the official llama.cpp build instructions.

# Go into our project directory
cd chatbot

# Clone llama.cpp
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp

# Build llama.cpp
# -j $(nproc) means using all CPU cores available to build
make -j $(nproc)

# Install the requirements for some of the Llama.cpp tools
pip install -r requirements.txt

# Return to our project directory
cd ..

Converting and Quantizing a model

Llama.cpp uses a custom model format called GGUF, or GPT-Generated Unified Format. You may occasionally run into the older format, GGML, but GGUF is its successor and the one we’ll be focusing on. With that in mind, we need to convert from our downloaded format to this format first. After that we’ll run through a process called Quantization, which makes the model smaller and makes it more practical to run larger models on more modest systems.

Convert from HuggingFace Transformers model to GGUF

First we need to get the model into llama.cpp’s preferred model format, GGUF. We can do this with the various convert scripts included with llama.cpp. Just point it to the directory of the model we downloaded and specify the output file.

There’s a few variations of the convert script for various starting formats. For our purposes we want to use convert_hf_to_gguf.py which converts from Hugging Face models to GGUF. You can optionally specify an output type for the weights, but for now it’s best to leave this at auto.

python llama.cpp/convert_hf_to_gguf.py Meta-Llama-3-8B-Instruct/ --outfile Meta-Llama-3-8B-Instruct.gguf --outtype auto

In the case of Mistral 0.3 and some others you may see an error regarding duplicate names. In this case the repository has 2 copies of the model, usually a mix of consolidated.safetensors and model-0000*-of-00003.safetensors. Simply delete the consolidated model with rm consolidated.safetensors and it will work as expected.

If you have downloaded a newly released model, it can take some time for llama.cpp to have support added or fully ironed out. Keep an eye on the GitHub issues and pull requests pages for more information on the state of support for a given model.

What does quantizing do?

As mentioned earlier, this is a method of compressing a model so it uses less RAM and memory bandwidth by discarding some precision. We can think of it as a form of lossy compression. It is an optional step, but can reduce memory requirements and increase speed of a model, at the cost of varying impacts on the performance of the LLM.

I’m not going to cover the in-depth representation of IEEE floating point numbers, or some of the intelligent methods that have been developed to go from Half Precision 16 bit Floats to 4 bit integers, but I’d like you to have a reasonable level of intuition for what is happening. On a binary level what you are doing is reducing the precision of the numbers that can be stored. With each bit that is discarded you lose half the precision you can represent. This typically reduces the maximum number you can store, but by scaling the result you get the larger number range with reduced precision.

0 1 2 3 4 5 6 7
*-*-*-*-*-*-*-* = 8 bits = 256 possible combinations
*-*-*-*-*-*-*   = 7 bits = 128
*-*-*-*-*-*     = 6 bits = 64
...
*               = 1 bit  = 2

If you imagine an integer number on a line, let’s say 11. It can be represented accurately at full precision. But once we remove 1 bit, it no longer maps to a number we can precisely represent, and this only gets worse as we continue removing bits.

The original number then has to be rounded to the nearest number that can be represented in the new format. In the example above, after we’ve removed 4 bits of precision, we can’t store 11. We can only store 0 or 16, so we store 16 as it’s closer.
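The rounding described above can be sketched in a few lines of Python. This is a deliberately naive illustration using plain unsigned integers, not how llama.cpp actually quantizes weights, and the function name drop_bits is made up. It stores a smaller “level” plus an implied scale, which is the scaling trick mentioned earlier.

```python
def drop_bits(value, bits_removed, total_bits=8):
    """Round an unsigned integer to the nearest value that is still
    representable after discarding `bits_removed` bits of precision.

    Returns (stored_level, restored_value).
    """
    step = 1 << bits_removed                             # gap between representable values
    max_level = (1 << (total_bits - bits_removed)) - 1   # largest level that fits in the remaining bits
    level = min((value + step // 2) // step, max_level)  # round to nearest, then clamp
    return level, level * step

for bits_removed in range(5):
    level, restored = drop_bits(11, bits_removed)
    remaining = 8 - bits_removed
    print(f"{remaining} bits: stored level {level:3d} -> restored {restored:3d} (error {abs(restored - 11)})")
```

Running this for 11 gives restored values of 11, 12, 12, 8 and 16 as the bits are removed one at a time, so the error grows from 0 up to 5, matching the 4-bit case described above.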

Behind the scenes on the newer quantization methods there are tricks to preserve some of this precision, minimizing loss, but ultimately precision still has to be discarded as part of the quantization process.

For some applications, this kind of quantization would be catastrophic, but as it turns out LLMs are remarkably resilient to this kind of compression. That’s not to say there are no impacts, or that you should only use quantized models, but they definitely have a useful purpose and are great for running larger models on more constrained or consumer level hardware. Some models can go from 16 bits per weight down to 4 bits per weight without suffering catastrophic impacts. Some research is even looking at training LLMs at 1-2 bits per weight with impressive results!

Quantization Settings

The details of the quantization settings are naturally quite deep, but a good write up on some of the various methods supported currently is available here on Reddit.

We’ll focus on the K-Quants method which uses a mixed precision, encoding some parts of the model at lower precision, and some at higher precision since there are different parts of the model that are more important than others for accuracy. This mix of precision allows the model to maintain a higher level of performance while still achieving a better overall size reduction.

At a top level we can break down some of the more interesting predefined quantization levels like so:

Q<Target Bits Per Float>_K_<Size>

Where:

<Target Bits Per Float> defines the main size of the majority of the weights. Remember, lower numbers will result in smaller models, but will have greater impacts on the performance.
<Size> allows mixed precision on the float size for segments of the model to preserve performance while still minimising memory. For example this may allow a Q5 model to have some weights, some more important areas of the model, at 6 bits to improve performance. This is typically S for small, M for medium or L for large.
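As a small illustration of that naming pattern, here is a sketch that splits a K-Quant name into its parts. The helper name and the returned dictionary are made up for this example, and it only understands names following the Q<bits>_K_<size> pattern described above (plus the variant with no size suffix), not every quantization type llama.cpp offers.

```python
import re

# Matches names such as Q5_K_M, Q4_K_S, or Q6_K (no size suffix)
QUANT_NAME = re.compile(r"^Q(?P<bits>\d+)_K(?:_(?P<size>[SML]))?$")
SIZES = {"S": "small", "M": "medium", "L": "large"}

def parse_k_quant(name):
    """Split a K-Quant name into its target bits and size mix."""
    match = QUANT_NAME.match(name.upper())
    if match is None:
        raise ValueError(f"{name!r} does not look like a K-Quant name")
    size = match.group("size")
    return {"target_bits": int(match.group("bits")), "size": SIZES.get(size)}

print(parse_k_quant("Q5_K_M"))
print(parse_k_quant("q4_k_s"))
```

Keep in mind the target bits are only the headline figure. Because of the mixed precision, the real average bits per weight of a quantized file will differ a little from that number.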

If you browse a bunch of models from TheBloke, he has done benchmarking on the various quant levels on all the models processed, such as the Llama2 quantized version here, and it seems one of the better tradeoffs for size and performance is Q5_K_M, which is what we’ll be using for our example. Feel free to experiment for your use case though. Your target system may be faster or slower depending on the quantization method chosen, and your application may or may not tolerate the performance/quality loss.

For a comprehensive list of the quantization methods available to you, run the command below:

./llama.cpp/llama-quantize --help

Quantize

Remember that this is an optional step, but if you want to quantize your model it’s a single line. This is a relatively quick process. On my machine it takes 2-3 minutes of CPU time for a 7b model.

llama.cpp/llama-quantize Meta-Llama-3-8B-Instruct.gguf Meta-Llama-3-8B-Instruct-q5_k_m.gguf Q5_K_M

Convert to GGUF and Quantize in a single step

You can do both of these steps at once with a single command if you are happy to just use Q8, although doing it separately is nice because you have more options to choose quantization level.

python llama.cpp/convert_hf_to_gguf.py Meta-Llama-3-8B-Instruct/ \
  --outfile Meta-Llama-3-8B-Instruct-q8.gguf \
  --outtype q8_0

Let’s try the model

So now we have a model converted and quantized, we can use llama.cpp to start generating text. The llama-cli program from llama.cpp will just start generating random text. We can run this and get an idea of performance and make sure the model is working. We’ll also specify the -n parameter to limit the length of the output.

llama.cpp/llama-cli -m Meta-Llama-3-8B-Instruct-q5_k_m.gguf -n 100

Now we can use the llama-simple program to give it a specific input with the -p parameter:

llama.cpp/llama-simple -m Meta-Llama-3-8B-Instruct-q5_k_m.gguf -n 20 -p "What is the capital of Australia?"

I encourage you to have a look through the other programs llama.cpp comes with, but for now we know it’s working and we can move on to building our first app.

Let’s build the first basic chat-bot

Ok, so we have a model, everything is working, we’re familiar with lots of the jargon. Now we get to have some fun and experiment with the LLM.

Installing Llama-cpp-python

Ok, I lied, we need one more thing. We need to install llama-cpp-python for our first experiments in python.

pip install llama-cpp-python

Exploring with Python

Start a new file called chatbot1.py and we’ll start pasting some of these snippets in there.

Load the model

# Import the Llama class from llama-cpp-python
from llama_cpp import Llama
# We'll use pprint to more clearly look at the output
from pprint import pprint

# Create an instance of Llama to load the model
# model_path - The model we want to load
llm = Llama(
    model_path="Meta-Llama-3-8B-Instruct-q5_k_m.gguf",
)

print("Model loaded")

First output

You can generate text by asking the LLM for a “completion”, either by calling the llm variable as a function or by using llm.create_completion(). This just does straight generation and next token prediction, but works reasonably well in a Q&A format for our first example.

output = llm(
    "Q: Name the planets in the solar system? A: ",  # Prompt
    max_tokens=128,  # Generate up to 128 tokens
    echo=True  # Echo the prompt back in the output
) 

print("Response generated:")
pprint(output)

You can now run the program with python chatbot1.py

In the response we can see that llama.cpp generates quite a verbose output during generation, but when we finally get our response we see a dict/json response like this. Please note your output may be different due to the model used, as well as some LLM parameters we’ll cover in the next article.

{'choices': [{'finish_reason': 'length',
              'index': 0,
              'logprobs': None,
              'text': 'Q: Name the planets in the solar system? A: \n'
                      'The solar system comprises of eight major planets and '
                      'they are named as Mercury, Venus, Earth, Mars, Jupiter, '
                      'Saturn, Uranus and Neptune. Besides these planets, '
                      'there are also minor planets called asteroids, dwarf '
                      'planets like Pluto and the celestial body known as the '
                      'Sun at the center of this solar system which is '
                      'considered to be a star itself.\n'
                      '\n'
                      'Q: How many planets are in our solar system? A: \n'
                      'Our solar system consists of eight major planets along '
                      'with the Sun, Earth, Mercury, Venus, Mars, Jupiter, '
                      'Sat'}],
 'created': 1712550178,
 'id': 'cmpl-8dbbd002-902f-4339-a6a6-b07dd86bb08e',
 'model': 'neuralchat7b-q5_k_m.gguf',
 'object': 'text_completion',
 'usage': {'completion_tokens': 128, 'prompt_tokens': 14, 'total_tokens': 142}}

You can see in the choices array entry 0, that there is a text field that contains both our initial prompt, and the completed output from the LLM. We can also see a few more references to tokens, what are they?
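Since the response is just a dictionary, pulling out the interesting parts is ordinary Python. The sketch below uses a hand-written, shortened stand-in for the response rather than a real model call, so it runs without a model loaded. The field names are copied from the output above, and reading a finish_reason of length as “ran into max_tokens” is my assumption based on the output being cut off mid-word.

```python
prompt = "Q: Name the planets in the solar system? A: "

# A shortened stand-in for the dictionary returned by the completion call
output = {
    "choices": [{
        "finish_reason": "length",
        "index": 0,
        "logprobs": None,
        "text": prompt + "\nThe solar system comprises of eight major planets",
    }],
    "usage": {"completion_tokens": 128, "prompt_tokens": 14, "total_tokens": 142},
}

choice = output["choices"][0]
text = choice["text"]

# With echo=True the prompt comes back too, so strip it to keep only the new text
answer = text[len(prompt):] if text.startswith(prompt) else text
answer = answer.strip()

was_cut_off = choice["finish_reason"] == "length"
usage = output["usage"]

print(answer)
print("Cut off by max_tokens:", was_cut_off)
print("Tokens used:", usage["prompt_tokens"], "+", usage["completion_tokens"], "=", usage["total_tokens"])
```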

Now looking at this response I see it finished the answer and then started a new question. It’s common for either the answer to get cut off, or for the LLM to start generating additional questions. We can fix that with 2 changes: increasing max_tokens, where None will generate the maximum amount (defined by the context length), and setting a stop sequence of characters, in this case a new question prompt. We can also fix that by following the model’s preferred formatting, also called its template, which we’ll cover later.

output = llm(
    "Q: Name the planets in the solar system? A: ",
    max_tokens=None,  # This will keep generating up to the full context window
    stop=["Q:"], # Stop generating at new questions
    echo=True
)

It would be wise to keep a reasonable limit on max_tokens though to reduce wasted compute, long generation times and reduce costs if you move to a hosted LLM service.
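If it helps to picture what a stop sequence does, here is a rough sketch of the effect on the generated text. This is not how llama-cpp-python implements it (the library stops generating rather than trimming afterwards, which is what saves the compute), and the function name apply_stop is made up for illustration.

```python
def apply_stop(text, stop_sequences):
    """Cut text at the earliest occurrence of any stop sequence."""
    cut = len(text)
    for stop in stop_sequences:
        position = text.find(stop)
        if position != -1:
            cut = min(cut, position)
    return text[:cut]

# Generated text (without the prompt) that runs on into a second question
generated = (
    "\nThe solar system has eight planets: Mercury, Venus, Earth, Mars, "
    "Jupiter, Saturn, Uranus and Neptune.\n\n"
    "Q: How many planets are in our solar system? A: "
)

print(apply_stop(generated, ["Q:"]).strip())
```

Everything from the first "Q:" onwards is dropped, leaving just the answer to the question we actually asked.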

We can also confirm that the model is currently just completing text by giving it a non-Q&A prompt.

output = llm(
    "The cat sat on the ",  # Prompt
    max_tokens=128,  # Generate up to 128 tokens
    echo=True  # Echo the prompt back in the output
)

Response generated:
{'choices': [{'finish_reason': 'length',
              'index': 0,
              'logprobs': None,
              'text': 'The cat sat on the 13th floor, watching the world go by '
                      'through the window. She was a sleek black feline with '
                      'piercing green eyes and a mischievous grin. She had '
                      'been living in this apartment for as long as she could '
                      'remember, and she knew every nook and cranny of it.\n'
                      '\n'
                      'As she sat there, she noticed something strange. The '
                      'lights in the hallway were flickering, and the shadows '
                      'on the wall seemed to be moving. At first, she thought '
                      'it was just her imagination, but as she watched, the '
                      'shadows began to take shape.\n'
                      '\n'
                      'It looked like a figure, tall and gaunt, with eyes that '
                      'glowed like'}],
 'created': 1717307379,
 'id': 'cmpl-79694c1d-6d24-4c69-ba41-a0f71a21498f',
 'model': 'Meta-Llama-3-8B-Instruct-q5_k_m.gguf',
 'object': 'text_completion',
 'usage': {'completion_tokens': 128, 'prompt_tokens': 7, 'total_tokens': 135}}

Well, that was an unexpected response! I was definitely expecting more along the lines of “The cat sat on the mat…”. Goes to show you how cool some of these local and small models can be. As you can see the LLM just started generating a story based on the initial few words, and didn’t behave as a Q&A assistant.

Introducing Chat Completions

To move towards a more nicely formatted and behaving Q&A assistant, we’ll use Chat Completions.

To use this we need to have a more structured set of data. First, we’ll tell the Llama class how to format the chat prompts for us. I’ll use llama-3 here, but there are quite a number of formats and the right one to use will depend on your model. You can usually find this information on the model’s Hugging Face repo. You can also leave out this parameter and one will be guessed from the metadata. If you put an invalid one in this field, it’ll generate an exception and give you a list of the currently supported chat templates. We’ll be exploring the chat templates in more depth in the next article.

llm = Llama(
    model_path="Meta-Llama-3-8B-Instruct-q5_k_m.gguf",
    chat_format="llama-3",
)

Next, we’ll define an array of dictionaries which can contain our chat history. This is what will be formatted by the chat template in the llama class.

messages_history=[
        {
            "role": "system",
            "content": "You are a helpful teacher who is teaching students about astronomy and the Solar System."
        },
        {
            "role": "user",
            "content": "Name the planets in the solar system?"
        }
    ]

Immediately you’ll notice a few new things here: roles, and a system prompt. We’ll look at these again as we also look at the output. So let’s finish our example by changing the call to use create_chat_completion.

output = llm.create_chat_completion(
    messages=messages_history,
    max_tokens=128,  # Generate up to 128 tokens
)

The response we get is this:

{'choices': [{'finish_reason': 'stop',
              'index': 0,
              'logprobs': None,
              'message': {'content': 'In our Solar System, there are eight '
                                     'major planets arranged in order from the '
                                     'sun: Mercury, Venus, Earth, Mars, '
                                     'Jupiter, Saturn, Uranus, and Neptune. '
                                     'These planets can be further classified '
                                     'into two categories - the inner or '
                                     'terrestrial planets (Mercury, Venus, '
                                     'Earth, and Mars) and the outer gas '
                                     'giants (Jupiter, Saturn, Uranus, and '
                                     'Neptune).',
                          'role': 'assistant'}}],
 'created': 1712568572,
 'id': 'chatcmpl-83c89fbd-425c-48a1-87b7-d593462fef0c',
 'model': 'neuralchat7b-q5_k_m.gguf',
 'object': 'chat.completion',
 'usage': {'completion_tokens': 100, 'prompt_tokens': 69, 'total_tokens': 169}}

We can see that this still generates a choices array, but instead of text we have a message field which contains content and role, just like the ones we specified in the messages_history array.

So that means we have 3 roles:

System - How we initially instruct the LLM chatbot to behave
User - The inputs from the user
Assistant - The LLM responding to the user inputs.

Just a quick note on the system prompt. While many, perhaps most, models support a system prompt, there are some notable exceptions that weren’t trained with a system prompt, such as Google’s Gemma model. In these cases it won’t error, but the model probably won’t respect the system prompt as well as you’d hoped, and you are better off adding the system prompt into the first user prompt. We’ll cover this more in depth in the next article where we explore chat templates in more detail.
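As a rough illustration of that workaround, here is a minimal sketch that folds any system messages into the first user message before the history is passed to the model. The helper name fold_system_prompt is hypothetical and not part of llama-cpp-python; it only rearranges the same list of role/content dictionaries used above.

```python
def fold_system_prompt(messages):
    """Return a copy of messages with system prompts merged into the first user message.

    Hypothetical helper for models that were not trained with a system role.
    """
    system_parts = [m["content"] for m in messages if m["role"] == "system"]
    # Copy the remaining messages so the caller's history is left untouched
    others = [dict(m) for m in messages if m["role"] != "system"]
    if not system_parts:
        return others

    system_text = "\n\n".join(system_parts)
    for message in others:
        if message["role"] == "user":
            # Prepend the instructions to the first user message
            message["content"] = system_text + "\n\n" + message["content"]
            return others

    # No user message yet: send the instructions as a user message instead
    return [{"role": "user", "content": system_text}] + others
```

You would then pass the folded list as the messages argument instead of the original history. Whether this helps will depend on the model, so treat it as something to experiment with.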

Full Chatbot Example

Now that we have a basic understanding of how these work, we can start putting them together in a chatbot loop.

from llama_cpp import Llama

# Load the model
llm = Llama(
    model_path="Meta-Llama-3-8B-Instruct-q5_k_m.gguf",
    chat_format="llama-3",
)

# Begin the message history array, we'll start with only the system prompt, as we'll now prompt the user for input
messages_history=[
        {
            "role": "system",
            "content": "You are a helpful teacher who is teaching students about astronomy and the Solar System."
        },
    ]

# A little util function to pull out the interesting response from the LLM's chat completion response
# This just pulls out the dict with the content and role fields.
def get_message_from_response(response):
    """Get the message from the response of the LLM."""
    return response["choices"][0]["message"]
    
def add_user_message(message):
    """Add a user message to the message history."""
    messages_history.append(
        {
            "role": "user",
            "content": message
        }
    )
    
# A function to print the entire message history, which will allow us to ignore some of the llama.cpp output
def print_message_history():
    """Print the full message history."""
    print("===============================")
    for message in messages_history:
        print(message["role"], ":", message["content"])

# Chat loop
while True:
    # Get a new user input
    user_message = input("(Type exit to quit) User: ")

    # Check if the user has tried to quit
    if user_message.strip() == "exit":
        break

    # Add the user message to the message history
    add_user_message(user_message)

    # Ask the LLM to generate a response from the history of the conversation
    llm_output = llm.create_chat_completion(
        messages=messages_history,
        max_tokens=128,  # Generate up to 128 tokens
    )

    # Add the message to our message history
    messages_history.append(get_message_from_response(llm_output))

    # Print the full conversation
    print_message_history()

    
print("Chatbot finished - Goodbye!")

And that’s it. We now have a working chat-bot! You can see the full source code on GitHub here.

Coming Up Next

In the next article I’m going to dive into:

OpenAI compatible APIs and client/server architecture for LLMs
Streaming generation
System prompts
Chat templates
Generation settings like context and temperature

In the meantime I’m going to leave a list of ideas to try and experiment with.

Exercises for the Reader

Recompile Llama.cpp with CUDA, Metal, or a BLAS backend.
Try different models
Try different quantization methods
Try changing the system prompt. Ask it to respond in only capital letters.
Change the system prompt to read a character sheet from a character.txt file

Turning a Mini Exercise Bike into a Virtual Bike!

2024-05-24T00:16:13+00:00

This project turns a cheap exercise mini bike into a virtual bike. I made this project with my brother for a family member who has to do a few months of physical therapy. We wanted a way to motivate them to continue, and try to make it a fun game-like experience. Your pedal speed directly correlates to the play rate of the video. If you stop pedalling the “bike” will coast and slow down to a stop. If you pedal faster your video will play in fast-forward.

It consists of 2 parts. The first is a replacement for the trip computer directly on the minibike which allows us to track the pedalling motion and broadcast this information to the network. The second part is a video player on a computer that can receive these messages and play the video at the matching play rate.

Where possible we tried to use parts we already had laying around and avoid new purchases.

Exploring the bike

Originally I had the idea of attaching 1 or more magnets and using a hall effect sensor to monitor the rotation speed of the bike that way, but when I pulled off the trip computer to look how it works I noticed on the internal wheel there was already a magnet attached. Looking at the trip computer there was a tube that sits next to the magnet. Pulling this off and checking it on a multimeter in continuity test mode showed that it’s just a magnetic switch. This meant that I could use the existing mechanism, and we could just rebuild the housing and put in the custom trip computer.

The small silver circle you can see is a magnet on the internal wheel.

Designing the case

My brother was in charge of the case. The first order of business was to match the socket the original computer sits in. This meant that mounting the trip computer is very easy and requires no modification to the main bike casing.

The original goal was to be as small as possible, but this ended up being quite frustrating to iterate on. The final design ends up quite bulky, but fits neatly on the bike and doesn’t get in the way of the rider. To help with durability as we iterated the build he used heat-press inserts, which keep the screws from damaging the plastic on repeated assembly.

The final design is easy to print, easy to work in and easy to assemble.

Designing the Trip Computer

One goal was that we could build this with parts we had laying around. The TTGO T-Display is cheap and has an integrated screen. It has way more than enough grunt to handle a simple trip computer display.

The magnetic switch from the original trip computer could be used directly with the microcontroller pins, just setting one to INPUT_PULLUP so it could use the internal pullup resistor.

For power, I’m still beginning with the world of microcontrollers and electronics and I don’t trust myself with lithium batteries, so instead I chose to use a 9v battery and a buck converter. The buck converter used can take a wide range of inputs and has an adjustable output voltage. Your input voltage must be a bit higher than your output voltage, so being that we needed 5v out, 9v in seemed appropriate and meant that it was only one battery at a time.

When adjusting the output voltage, 5.0v seemed to work but the WiFi was unstable. Bumping the output voltage to 5.1v solved the problem.

The prototyping PCB and the headers aren’t strictly necessary but it gave good mounting options in the case, and made it really easy to replace the T-Display when I accidentally fried one. Note: DO NOT CONNECT BATTERY AND USB AT THE SAME TIME!

Designing the software

I wanted to use this project to experiment with LVGL which is a really interesting GUI library for embedded systems.

I wanted the setup process to be dead simple for the rider. By just relying on UDP broadcasts it meant that the laptop and trip computer just needed to be connected to the same WiFi network, and there was no pairing required. The trip computer just broadcasts a packet containing a few characters to confirm its packet type, a packet ID which is incremented on every broadcasted packet, and the number of cycles counted so far. The video player then only processes packets that have a higher ID than the previously received one, so there are no concerns with out-of-order packet delivery.
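To make the packet idea concrete, here is a minimal Python sketch of packing and filtering such a packet. The exact layout is an assumption for illustration only: a 4 byte marker (the hypothetical MAGIC value), a 32-bit packet ID and a 32-bit cycle count. The real firmware and player in the repository may well use a different format.

```python
import struct

MAGIC = b"BIKE"          # hypothetical packet-type marker, not taken from the real firmware
PACKET_FORMAT = "!4sII"  # marker, packet ID, cycle count (network byte order)


def encode_packet(packet_id, cycles):
    """Pack a packet the way the trip computer might broadcast it."""
    return struct.pack(PACKET_FORMAT, MAGIC, packet_id, cycles)


def decode_packet(data):
    """Return (packet_id, cycles), or None if this isn't one of our packets."""
    if len(data) != struct.calcsize(PACKET_FORMAT):
        return None
    magic, packet_id, cycles = struct.unpack(PACKET_FORMAT, data)
    if magic != MAGIC:
        return None
    return packet_id, cycles


class PacketFilter:
    """Only accept packets with a higher ID than the last accepted one."""

    def __init__(self):
        self.last_id = -1

    def accept(self, data):
        """Return the cycle count for a fresh packet, or None to ignore it."""
        decoded = decode_packet(data)
        if decoded is None:
            return None
        packet_id, cycles = decoded
        if packet_id <= self.last_id:
            return None  # duplicate or out-of-order packet
        self.last_id = packet_id
        return cycles
```

Because the cycle count is cumulative, a dropped packet doesn't lose any information in this sketch: the next accepted packet carries the up-to-date total.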

There are no IPs hardcoded into the microcontroller, it just assumes that you are on a standard /24 network with the broadcast address being x.x.x.255. It takes the current IP obtained via DHCP and replaces the final octet with 255. This means that on basically any home network it will broadcast the packets to all PCs. PCs that aren’t listening for the packets will just discard them.
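The octet swap is simple enough to sketch in a few lines of Python. This is an illustration only, not the firmware code (which runs on the microcontroller). The second function is an optional alternative using Python's ipaddress module for networks that aren't /24; both function names are made up for this example.

```python
import ipaddress


def broadcast_for_slash24(ip):
    """Replace the final octet with 255, assuming a standard /24 network."""
    octets = ip.split(".")
    if len(octets) != 4:
        raise ValueError("expected a dotted IPv4 address like 192.168.1.42")
    octets[3] = "255"
    return ".".join(octets)


def broadcast_from_prefix(ip, prefix_len=24):
    """More general version: derive the broadcast address from a prefix length."""
    network = ipaddress.ip_network(f"{ip}/{prefix_len}", strict=False)
    return str(network.broadcast_address)


print(broadcast_for_slash24("192.168.1.42"))  # 192.168.1.255
print(broadcast_from_prefix("10.0.5.42", 16))  # 10.0.255.255
```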

For the video player on the laptop I just used the old faithfuls of Python, PyGame and OpenCV. PyGame is used for the presentation of the video, and OpenCV is used for its FFmpeg integration, allowing decoding of all sorts of videos. I did hardcode some limits into the player: it expects 720p videos, but supports any FPS (tested with 30 and 60fps), and it doesn’t seem to be too picky about codecs. I tested with h264 and h265 videos.

For every pedal received it bumps the target playrate up. The target playrate is decayed over time towards zero. The actual playrate is then slowly adjusted over time towards the target playrate which gives a much smoother, more naturally changing video playback speed. This means adjustments are not jerky, and it gives that “coasting to slow down” effect.
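Here is a minimal sketch of that smoothing scheme. The class name and all of the constants (bump size, decay rate, smoothing rate, maximum rate) are made-up values for illustration, not the ones used in the actual player, and the exponential decay is just one reasonable way to implement "decays over time".

```python
import math


class PlayRate:
    """Tracks a target play rate and a smoothed actual play rate."""

    def __init__(self, bump=0.25, decay_per_sec=0.5, smoothing_per_sec=2.0, max_rate=3.0):
        self.bump = bump                            # added to the target per pedal stroke
        self.decay_per_sec = decay_per_sec          # how quickly the target falls to zero
        self.smoothing_per_sec = smoothing_per_sec  # how quickly actual chases target
        self.max_rate = max_rate                    # cap so the video can't run away
        self.target = 0.0
        self.actual = 0.0

    def on_pedal(self):
        """Call once for every pedal stroke received from the trip computer."""
        self.target = min(self.target + self.bump, self.max_rate)

    def update(self, dt):
        """Advance by dt seconds and return the play rate to use this frame."""
        # Decay the target towards zero (the "coasting" behaviour)
        self.target *= math.exp(-self.decay_per_sec * dt)
        # Move the actual rate part of the way towards the target
        blend = 1.0 - math.exp(-self.smoothing_per_sec * dt)
        self.actual += (self.target - self.actual) * blend
        return self.actual
```

Using exponentials of dt rather than fixed per-frame factors keeps the behaviour roughly the same whether the player runs at 30 or 60 frames per second.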

The circles on the right side of the screen show the play rate. The green circle is the target of 1x speed playback, and the red circle is the current play rate. These were initially added as a debug view however they ended up being a nice visual feedback for the rider, so they stayed!

I didn’t want the rider to mess about with python, or have complicated running instructions. To achieve this the whole thing is turned into a simple to run .exe with PyInstaller.

I’m happy with the results of the software, it has enough features implemented to be a good experience, and there are no super rough edges, or complicated setup required to do a bike ride. Having said that, I didn’t want the software to become a multi-weekend time-sink, so the code quality reflects the “maker” nature of the project.

Finding videos

There are a huge number of creators on YouTube who do walking tours of picturesque or historic places, some favourites are Follow Matty and Dave’s Walks. I downloaded some videos of areas of sentimental meaning to my family member using YT-DLP and converted them with Handbrake to 720p, h265, CRF29, leaving the frame rate the same as the original, using constant framerate, not variable framerate. This meant that the videos are reasonably good quality while not taking up a huge amount of space. Most of the quality loss is not important since the screen is a bit of a distance away and while pedalling you are rocking a little bit.

Power Reduction

A few weeks after giving the gift I noticed the battery went dead a lot sooner than I’d have liked. I thought of a few improvements to reduce power consumption including looking into WiFi power states, MCU power states, and reducing the backlight brightness. In the end I settled on just reducing the clock speed of the microcontroller from the default 240 MHz to 60 MHz, which brought the power consumption from around 83mA down to 73mA. That is pretty decent for such a small change. The code for the trip computer really doesn’t do much so I would have liked to go down much lower, but I found the screen would not reliably turn on or initialise at lower clock speeds.

On the T-Display there doesn’t appear to be direct display brightness control, but it does seem you can control the light through one of the GPIO pins. I’d be interested to see if you can control the brightness via PWM this way, but didn’t have a chance to test this.

Builds 2 and 3

After seeing the finished project my brother and a friend wanted to build the project for themselves, so we found the bike online and ordered a few more. We were able to build the project again twice within a few hours, so it’s not an overly complicated project to replicate.

Build Instructions

Code: https://github.com/boristsr/VirtualBikeRides

Prints: https://www.printables.com/model/875892-mini-exercise-bike-esp32-virtual-convertion

Parts (All prices in Australian Dollars - AUD):

Item Name	Model Number	Qty	Price
DC-DC Supply Module LM2596 Buck Converter	LM2596 DC-DC	1	$5.77
LILYGO® T-Display ESP32 Dev Board, 1.14 Inch LCD, WiFi/Bluetooth, Flash 4MB	T-Display v1.1	1	$18.00
ElectroCookie Mini PCB Prototype Board Solderable Breadboard		1	$2.17
2.1 JST Connector		1
Male 2.54mm Bent Pin Header Right Angle Single Row 90 Degrees Needle Connector		1
9V Battery Connector		1
9V Battery		1	$2.49
Round Rocker Switch with LED Indication Red 20A 12V	QY802-101	1	$6.25
Verpeak Mini Pedal Bike with LCD Display		1	$67.95
Wire, M3 screws, filament, heat press inserts, odds and ends


Total			$102.62 AUD
			~$70 USD

Assembly Instructions

Print the case from Printables
Optionally press in Heat Press Inserts
Assemble the electronics as shown in diagram below
1. Adjust the LM2596 voltage regulator to 5.1v before connecting to the T-Display
2. Put kapton or electrical tape over the regulator adjustment afterwards to ensure it doesn’t get changed during assembly
3. Make sure the magnetic switch is connected to GPIO 21 (labelled 21 on my T-Display board)
4. Optionally add a drop of hot glue on all soldered wire joints to reduce strain on these joints if you are iterating the design a bunch.
5. I only had 10 pin female headers so I offset them diagonally to better support the T-Display board
6. We have assembled 3 of these now, some of the original trip computers had a JST connector to disconnect the magnetic switch, some we had to cut the wires and add our own JST connector.
Flash the firmware to the T-Display as the USB port will be inaccessible once in the case
- Note: DO NOT CONNECT BATTERY AND USB AT THE SAME TIME!
You can test the firmware by running a magnet past the switch/probe before assembly
Install into case as shown in picture
Put kapton or electrical tape over anything you fear might short, such as over the switch where the battery will sit below.
Make sure you mount the magnetic switch as low as possible in the case so it is as close as possible to the magnet on the internal resistance wheel of the bike. You may need to test this a few times to get it right. Spin the pedals and see if the counter goes up on the screen.

Connections:

Battery +9v to switch
Switch to LM2596 IN+
LM2596 OUT+ to TDisplay +5v (Pin 1)
TDisplay GND to LM2596 OUT-
LM2596 IN- to Battery Ground
TDisplay GPIO 21 to magnetic switch
TDisplay GND to magnetic switch

Software

For deployment instructions on the software, I’ve included them in the repository linked at the start of the build instructions.

Adapting to Other Bikes

If you are doing this on another bike, I’d look into replacing the magnetic switch with a hall effect sensor, and attaching a magnet or a few magnets to something that rotates with the pedals so you can do very similar tracking. Having more than 1 magnet per rotation will require additional code modifications but should allow a more fine grained estimation of the RPM being pedalled.
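As a rough sketch of what that code change could look like, here is one way to estimate RPM from pulse timestamps when there is more than one magnet per rotation. It is written in Python for illustration (the trip computer firmware itself is not Python), the function name is hypothetical, and it assumes the magnets are evenly spaced.

```python
def estimate_rpm(pulse_times, magnets_per_rev=1):
    """Estimate pedalling RPM from sensor pulse timestamps (in seconds).

    Assumes magnets are evenly spaced, so each pulse interval is
    1 / magnets_per_rev of a full rotation.
    """
    if magnets_per_rev < 1:
        raise ValueError("magnets_per_rev must be at least 1")
    if len(pulse_times) < 2:
        return 0.0  # not enough pulses to measure an interval

    elapsed = pulse_times[-1] - pulse_times[0]
    if elapsed <= 0:
        return 0.0

    intervals = len(pulse_times) - 1
    revolutions = intervals / magnets_per_rev
    return revolutions / elapsed * 60.0
```

With 4 magnets you get 4 intervals per rotation, so the estimate can update 4 times as often for the same pedalling speed.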

Future work, improvements and expansions

There are plenty of ways to improve this project and take it forward. Maybe in the future I’ll implement some of them. Sharing these ideas in case anyone would like to improve their own build.

Reduce power consumption
- WiFi power state
- MCU sleep states
- Display brightness
Rechargeable battery
Battery charge percent monitor
- Add a simple voltage divider and an ADC to monitor battery remaining
Screwless battery compartment hatch
Smaller housing
- Pack components tighter and reduce overall size
Better WiFi configuration
- The credentials are hardcoded in the trip computer code. I’d like to utilise some of those projects where the MCU becomes a hotspot that allows you to configure WiFi from your phone if it can’t find the existing configured network.
Player improvements
- UI
- One button adjustment to target pedalling speed
- Support other resolutions than 720p
- Sound playback
- Either pitch shifting sound, or looping sound in 1 minute blocks
- “Buffering” mode
- The rider is required to pedal enough to maintain a buffer, if the “buffer” runs out the video pauses until the rider has pedalled enough to restore a healthy buffer. Just like the old days of the internet!
- Interval training mode
- Alternate the target play rate through a guided series of high and low intensity periods.
Unreal Experience
- I’d love to make a compatible Unreal Engine experience where the pedal speed just controls the speed moving along a spline through a great looking environment. I chose videos for this project mainly for the areas which have some sentimental meaning for the rider. An unreal experience wouldn’t be too hard.

How Does Color Affect 3D Printer Filament Strength In The Sun?

2024-02-20T03:16:13+00:00

I wanted to deep dive into choosing the correct 3D print filament for my telescope project. In today’s post I show how print color affects the strength of 3D prints in sunlight, and how PETG isn’t always a safe bet.

One October day, Spring in Australia, I was preparing the aluminium base plate of my new telescope project. I took out all the tools including the drill, the rotary tool, and the black plastic PLA template out to the garden. I turned around for what felt like just a minute to unfold the work table and when I turned back around I was shocked to see the template had warped and bubbled, and it stretched like rubber when I tried to pick it up.

This was BAD. It wasn’t a particularly hot day, and the sunlight wasn’t especially intense. This meant I couldn’t leave my telescope out overnight in case it was in the sun for just a fraction too long in the morning.

I had done some research, and apparently PLA is not recommended for being in the sun. PETG is apparently the go-to option for outdoor prints. I thought this was fine since the telescope wasn’t intended to be left out in the open all day, and I had plenty of PLA, it was cheap, and printed very easily and reliably. Though, I did want to leave it out overnight and pack it away in the morning without fear of a little sunlight. I figured it’d be fine for just a few morning hours here and there. After this experience though, I wanted to dive a bit deeper, and find a suitable material replacement.

Plastics and Glass Points

Each plastic has a different temperature for its “Glass Point”. This is the temperature when the plastic begins to lose its strength and becomes malleable. The plastic becomes easy to permanently deform, bend and warp.

For PETG this is supposedly 80ish degrees Celsius. PLA is much lower at 60-65 degrees Celsius.

Reflectivity, Absorption and Colour

I had originally chosen black filament for the telescope with the idea it would reflect less and possibly leak less light so the telescope platform itself would have less chance of impacting the images.

The colour of an object is a huge factor in how much it’ll heat up in the sun. Basically, the darker an object the less energy and light it reflects, and the more it absorbs. It’s the opposite for lighter colours, by reflecting more energy it is absorbing less. A great deal of the heat energy that we receive from the sun is in the IR and visible light spectrum.

Source: https://en.wikipedia.org/wiki/Sunlight

This means that the colour of the filament should greatly affect its performance in the sunlight, so I decided to get a little more scientific in choosing my filaments.

The Plan

I had kept all my test prints, iterations and templates for the base of my telescope so I had plenty of PLA prints to test with, which would reduce plastic waste from this experiment. But I did need to print some PETG samples to test with. All I had was some pink PETG, so this was the first test. Unfortunately I didn’t get any temperature readings on this, but it was left in the sun, upright, for 2 days. This is the worst case for the filament. It didn’t deform in any obvious ways. There was a small crack in it, but I suspect this was probably more wind due to how it was held upright.

This was a promising result, but it is a fairly light colour, so the result can’t entirely be attributed to the plastic type.

Next I painted the 2 pink PETG samples, and 2 black PLA previous iterations. For each material, one sample was painted black and the other white. The coat was thick enough to hide the filament colour, but not so thick that it might add much strength. They were painted on an overcast afternoon/evening so they didn’t get any direct sun exposure before testing.

Painted PETG samples

Painted PLA samples

Checking these out on my security camera at night showed that they were also appropriately dark and bright under the near-IR spectrum. I am going to assume this holds further into the IR spectrum too.

Now the test begins. I set up the painted PETG samples on a chair in the sun. It was a ~32C degree day with intermittent cloud cover. The prints were placed to overhang the edge to produce some load on the plastic.

From left to right: Black painted PETG, White painted PETG, unpainted black PLA

On a separate chair I placed the white and black painted PLA samples over an old 2x4. A similar level of overhang to produce similar loading.

The results didn’t take long. In minutes it was apparent that even black PETG wasn’t up to the task.

The white PETG sample here looks bent, but that was the crack I mentioned earlier. It did not deform at all under the sun.

The unpainted black PLA performed even worse. Not just sagging, but getting that same “bubbly” sort of shape as the first accident.

The painted PLA here performed about the same. Not just sagging, but getting a twisty/curvy look which I’m guessing is from that bubbling deformation mixed with the paint creating an extra tension.

When in direct sunlight for a few minutes you could feel the black prints all felt a bit soft, rubbery. They would stretch with your hand.

In both cases, however, the white painted samples performed great. Even after several hours they felt strong, didn’t feel at all weaker. They never sagged.

Over the course of the next few hours the PETG continued to sag.

I took an infrared thermometer out to see just how much the colour was affecting the temperature of the prints.

Unfortunately the LCD was hard to show in photos in the sunlight.

All the black prints were in the 70-75C (158-167f) degree range when in sunlight. This depended on the cloud cover, how long direct sunlight had been on them, and where I measured, but it was pretty consistent under full sun.

The white prints, however, were ALWAYS under 50C (122f). There was 1 reading at 49C (120.2f), but for the most part they were in the 40-45C (104-113f) range. This is significantly lower than the glass point of even the PLA filament. That’s a difference of 30C (54f) degrees! That’s huge! Just from a colour change.

To further highlight the difference it can be useful to not just directly compare the absolute temperatures, but also the temperature over ambient. In this case the white prints were about 10C degrees above ambient temperature, while the black prints were 40C degrees over ambient! This suggests the black samples were absorbing roughly 4 times the energy!
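The arithmetic behind that comparison can be sketched in a few lines of Python. The sample temperatures below are round numbers picked from the ranges measured above, and the final ratio leans on a simplifying assumption that is not from my measurements: at a steady temperature, the heat a print loses to its surroundings is roughly proportional to how far it sits above ambient (Newton's law of cooling), so the rise over ambient is a rough stand-in for absorbed power.

```python
def c_to_f(celsius):
    """Convert an absolute temperature from Celsius to Fahrenheit."""
    return celsius * 9 / 5 + 32


def delta_c_to_f(delta_celsius):
    """Convert a temperature difference (no +32 offset for differences)."""
    return delta_celsius * 9 / 5


AMBIENT_C = 32  # approximate air temperature on the test day
WHITE_C = 42    # representative white sample reading (assumed round number)
BLACK_C = 72    # representative black sample reading (assumed round number)

white_rise = WHITE_C - AMBIENT_C  # degrees over ambient
black_rise = BLACK_C - AMBIENT_C

# Rough stand-in for relative absorbed power under the assumption above
ratio = black_rise / white_rise

print("White rise over ambient:", white_rise, "C")
print("Black rise over ambient:", black_rise, "C")
print("Black vs white:", ratio, "x")
print("Gap between samples:", delta_c_to_f(BLACK_C - WHITE_C), "F")
```

Note that a temperature difference converts differently to an absolute temperature: the 32 degree offset only applies to absolute readings.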

More info on heat transfer and temperature change

Long Term UV Damage

This test obviously only tested the strength of the filament due to temperature changes from sunlight, in the long term you’ll also have UV radiation aging and degrading the plastic, as well as fading colours. This is a separate problem, and I suspect both filaments would be damaged by this, although PETG is reportedly significantly more UV resistant. A coating of paint, especially a paint designed for exterior use, would help significantly for protection against UV in any case.

More info on UV resistance

Conclusion

While the advice online is to use PETG for outside prints, it’s quite clear that just blindly or randomly choosing PETG isn’t a guaranteed safe bet. Poor choice of colors can lead to PETG prints sagging and weakening in the sun, as the colour of an object has a huge impact on its absorption of radiation in the IR and visible spectrums.

Choosing white or lighter colours of filament can significantly improve the strength and durability of your prints in sunlight. Here the difference was 10C degrees over ambient vs 40C degrees over ambient. The darker colours will absorb so much energy in hotter weather with intense sun that even recommended filaments like PETG will lose their strength. On the other hand using lighter colours means even PLA can perform pretty well. Although it appears PETG may provide a little more breathing room due to the higher glass point temperature and better long term results due to UV resistance.

Rush Hour Has Received an Epic MegaGrant!

2023-02-16T03:16:13+00:00

Rush Hour has received an Epic MegaGrant! Rush Hour aims to level up real-time workflows for vehicle animation and build on what Unreal Engine already offers. By utilising vehicle AI, similar to what you would find in games, to act as a stunt driver, it reduces the work animators need to put in to get the physics looking correct and instead focus on directing the scene, all from within the same editor where you are building your environment.

Thanks to the generous support of Epic Games, Rush Hour is able to take several major steps forward. This will support the addition of several new features and improve existing aspects.

Sounds

Sounds are crucial to the sense of speed, energy, and momentum in vehicle animations. By adding vehicle sounds, animators will receive an immediate boost to the sense of speed and momentum in their animations. Sounds will also help demo early scenes to other team members and convey that intensity.

Sounds like tyres over the ground, screeches, engine revving, and other effects will dramatically enhance the sensation of power and speed in any animation.

Improved Materials

The MegaGrant will help fund additional artist time on the materials used on the vehicles. This will include an improvement pass on all materials and a focus on improving the car paint material and imperfections on the metals and other surfaces. The improved materials can also serve as a foundation for your own vehicle models.

Improved Vehicle Models

Along with the materials, the included vehicle models will be improved. The support from the MegaGrant will allow more time to be spent improving the vehicle models, including enhanced interiors. This will allow the included models to shine in your scenes and serve to fill out your animation and world.

Moving Forward

One of the driving goals behind Rush Hour has been that someone can create a realistic vehicle animation within 5 minutes of installing the plugin. Adding sounds and improving many existing assets will contribute significantly to this goal.

Once again, thanks to Epic and the MegaGrants team, the support from the MegaGrants program will help to level up Rush Hour and help us all make great animations!

Rush Hour - Year in Review

2022-12-24T03:16:13+00:00

2022 has been a fantastic year. Rush Hour saw its initial release, and great progress has been made on the next version. Before I take a holiday break, I wanted to give a status update on Rush Hour and the plans for the near future.

I’ve recently received some fantastic news which I can’t wait to share with you all. The future of Rush Hour in 2023 is looking very bright!

About Rush Hour

Rush Hour is a new way to animate vehicles directly in Unreal Engine. Furthering the idea of using real-time game technology to produce movies & animation, Rush Hour utilizes AI drivers to produce physically-simulated & realistic driving animations. There is no need to go back and forth between your DCC and Unreal, as it can be entirely produced in-engine.

For more information on Rush Hour, see the full product page.

Next Year

Early next year, there are a few things I want to tackle.

First, I want to publish the first major update for Rush Hour, Version 1.1. I’ve included more details about this further in the post.

Second, I would like to reshoot the training videos. These will include all the new features and pack them full of tips and tricks. I would also like to make a few videos on topics not directly Rush Hour related, such as how to direct a car chase scene to maximize the sense of speed and movement. Every time I’ve made a new video for social media, or the trailers, I’ve done a better job animating the cameras and improving the sense of speed and motion. I want to share some tips to help others improve their own videos.

Third, I want to spend a week or 2 focusing on making a short vehicle animation video. To date, all the existing examples and promotional videos have been done in less than a day. I want to elevate everything from the camera movements, the materials for the vehicles, and the integration of VFX like FluidNinja, as well as other particle systems and VFX. I want to see how great a video made with Rush Hour and Unreal Engine 5.1 can look with more time spent on it.

Version 1.1 Progress

I had hoped to submit the 1.1 update to the marketplace before the holidays, but I don’t want to release something that I’m not entirely happy with. Most of the features discussed below are in, but there are some rough edges I want to attack, and I want to spend some time fine-tuning the driving profiles. I expect to upload 1.1 to the store in the week of January 9 - January 13.

As always, the roadmap is a great place to get a peek at the future plans for Rush Hour.

Improved Speed

Thanks to improved handling of path banking (left/right roll) and pitch (uphill/downhill slopes), vehicles can now attack corners and mountainous terrain much more aggressively.

In one test track, the lap time has gone from over 6 minutes to 4 minutes and 35 seconds!

Improved Jumps

The above improvements that enabled better and more aggressive speed control have made it easier to animate jumps. Before, you needed to place control points on flat surfaces at key places, but now the path behaves much closer to how you intuitively expect.

Improved Path Visualizer

You can now see path action points on the path, so you don’t need to click through each spline point to find where that damned stop action is.

Runtime Path Visualizer

You can now see the path at runtime, including action waypoint markers to get a better idea of how vehicles are responding to your path.

Experimental BP Runtime Path Creation Support

Some initial work has been done to support creating paths at runtime via blueprints. I expect some edge cases and limits to this, but this should increase flexibility for using this plugin in more kinds of projects.

Newly Tuned Profiles

To take advantage of the new driving improvements, I need to retune all the driver profiles to ensure they look as good as possible. While doing so, I’ll also address feedback about the “Frantic” profile to make it less twitchy.

More Example Maps Included

To go with these new features, there are new example maps included:

Banked Race Track Loop Demonstration
Banked Cornering Test map
Jump Demonstration
Runtime Path Demonstration

Future Updates

Further down the road, there are a number of updates planned. Some of these already in development include:

A Blender add-on to make it super easy to import your own vehicles into Unreal Engine.
Real-time Tyre Deformation to make vehicles appear more grounded

Holiday Break

I’ve been pushing hard on Rush Hour for most of this year. Over the next 2 weeks, I’ll spend some time with friends and family, recharge and be ready for a solid start to next year. I’ll be taking a break and resuming work on January 3.

Due to the holidays, support will be delayed for the next 2 weeks. Please ask if you require assistance, but be aware that responses will be delayed and sporadic until Monday, January 9.

Wrap Up

Thank you all for the support this year! I can’t wait to share what’s in store for the future of Rush Hour!

Best Wishes for 2023 and Happy Holidays!

Follow the Trail Development Retro Review.

2022-03-23T03:16:13+00:00

On Halloween 2021, I decided to try doing a mini game-jam style project. I wanted to make something simple, yet atmospheric. Possibly a simple wave based shooter. There was no solid plan, just an idea to make “something”. In this retrospective review, I share some of the goals, strengths and weaknesses of the project, and any lessons learned.

Project Page

Download here

After every project (even small personal ones) I like to do a small retrospective review, and solidify for myself any lessons that can be learned from the project. This can identify any strengths and weaknesses in my skill set or decision making. This can then be used to guide me in what I want to learn or improve on next. It also serves as a bit of a sense of closure on a project. After all, no project is ever “done”.

I’ve decided to share some of my retro reviews. These are written in bullet point form to minimise time spent writing them, and maximise the value in them for me. Hopefully others can learn something from them, even if it just serves as an example on how someone else does a retro review.

The Goal

A project to brush up on areas of Unreal I’m unfamiliar with or haven’t touched in a while.
Make some sort of shooter, probably wave based
Make it atmospheric and unsettling in that cheesy-popcorn-horror Halloween kind of way.
Have some creative fun!

Notes

Having no plan was neither a positive nor a negative in this instance
- It was a time limited exercise and the point was to see what I could do quickly, not how quick I could do X.
- As there was no set end goal, jumping around tasks allowed me to let ideas develop in my head while I worked on something else.
- This is only true because of the Game Jam nature of the project.

Good

This wouldn’t have been possible without the incredible assets available on the Unreal Marketplace.
Ultra Dynamic Sky, while leveraging underlying systems of Unreal, MADE this experience. Its weather system and incredibly easy-to-use interface made it easy to immediately create a deeply unsettling atmosphere
Using assets but attempting to recreate some of the underlying pieces like anim blueprints, sound cues, etc was a good way to maximise learning.
Codecks made it really easy to just keep jotting down tasks and marking them off as done
Changing from a wave based shooter to a more linear experience allowed for a more unsettling experience, rather than just shooting monster after monster.
Great to refamiliarise myself with the audio system
Great to get a bit more familiar with animation blueprints, blending and skeletal meshes
Great to refamiliarise with the behaviour tree and blackboard system
Darkness hides all issues!
Happy bugs.
- I forgot to uncheck “auto activate” on the scream sound cue, so the scream triggers at the beginning of the level, as well as half way through. I left this in as it didn’t detract from the experience, and was kind of a freebie effect.
- The impulse from the projectiles made the monsters shudder. Ended up being a janky hit effect which kinda worked. So I left it in.

Bad

Not having a clearer knowledge of the assets available made it tricky to find what I wanted or needed.
One of the later discovered assets made me rethink what kind of experience this would be, which wasted time. I decided against using that asset and stuck with the original simplistic game ideas
There is no depth to the combat or gameplay loop. Deeper gun mechanics and more refined monster AI would greatly improve the experience
Should have had a better plan for routing events between blueprints properly. At the moment they happen all over the shop, breaking encapsulation principles
Stopped following naming conventions and clean asset practices towards the end
Realised afterwards the HUD only changes opacity on damage, death and win screen, and doesn’t disable. This probably means the blur effect is happening multiple times even when not visible. Should make sure those get disabled.

To improve and explore next time

Having an initial asset review as part of an ideation and planning stage would help formulate a clearer idea of the game earlier.
Utilise behaviour tree services more, rather than attempting some code in the monster blueprint
Would like to continue developing these behaviour trees, and other blueprints and start building my own library so I can slap together projects even quicker.
Monsters don’t follow the player. They move to the place they last checked for the player. Once they have completed the move, they will then look for the player again. If I was more familiar with the navigation system in Unreal I’d have them always moving to the player.
Darkness and brightness levels ended up causing some concerns I didn’t expect. This project was just for my own entertainment, but when I ended up sharing on social media the brightness level proved problematic. I tuned it to look nice on my monitor, but on phones and OLED displays, it shows up too dark, and on other screens it’s very over-bright. For an atmosphere that so heavily depends on the darkness, this could probably have been handled a little better. Maybe a little brighter. In future, definitely have some form of brightness control for users, and when recording a video for social media make sure it’s on the brighter side.
If this wasn’t a “do anything as quick as possible” experiment, I’d have liked to make my own level, using an auto-material and procedural foliage volumes, and create a longer path
Make notes throughout of what has been completed in the last 1-3 hours. It’ll make BTS videos easier, and serve as a good reminder of what’s been done and how quickly.

Bugs remaining

Monsters don’t make a sound when they attack
Monsters don’t deal damage until their attack animation finishes.
Monsters don’t continually follow the player, rather they move to the last selected point, and then look for where the player is again, selecting that point, and looping. Easy to move away from the monsters.
The scream can be retriggered, which breaks the illusion if people go hunting for the scream
Realised afterwards the HUD only changes opacity on damage, death and win screen, and doesn’t disable. This probably means the blur effect is happening multiple times even when not visible which is quite expensive. Should make sure those get disabled.

Why Doesn’t Camouflage Work in Games? - Ask A Game Dev

2021-05-07T03:16:13+00:00

Have you looked at camouflage in games recently? It looks amazing. Yet, it doesn’t seem to stop you being seen in a split second. Is that a limitation of the technology or is it by design? Let’s have a look at some of the techniques and try recreating some of the ways game developers make camouflage look really good while still allowing easy visibility of characters.

First let’s break down what real camouflage is and how it works. Camouflage is used to hide what you don’t want seen. In World War Two, German forces painted their bunkers and fortifications in Normandy with skies and fences. Modern soldiers wear all kinds of patterns on their clothes and even paint their rifles.

How does real camouflage work? One of the key aspects of camouflage is to break up any identifiable pattern, outline or silhouette of what you’re trying to hide. Camouflage will use three primary techniques for this.

The first is matching color to the environment. Armies use various sets of outfits and shades of greens and browns to match forest, bush and desert environments for instance.

The second is using patterns to trick the eye into not recognizing any outline or shape. This is the typical blob, leaf and digital patterns that you see on camouflage.

And the third technique is changing the silhouette with nets and other coverings. This is why you see snipers covering themselves in netting and shrubs.

So how do games make camouflage less effective? Game designers, artists and developers have many tricks up their sleeves to keep camouflage looking great while making it bad at hiding players. Let’s have a look at some of these.

There are some clearer examples of how this was done. CS:GO uses high detail players with relatively low detailed backgrounds, which provides a clear contrast. Look at the environment around the players in these clips. While there’s a lot of detail in these environments, great care has been taken to ensure that there isn’t a lot of high contrast or high-frequency detail on the walls and textures that’ll break up any character silhouettes.

Paying careful attention to character silhouettes is just as important in semi-realistic games as it is in fantasy games. For instance, an easy win in World War Two games is that the helmets were distinctive designs and can provide clues to the players when observing just the character silhouettes.

Beyond these art decisions and directions, there are a number of lighting, texturing and shading techniques that can be used.

Let’s look at a scene with relatively effective camouflage and implement some of these techniques to reveal the characters. I’ve grabbed two assets off the Unreal Marketplace for this. The first is G Soldier With Gas Mask by TalkingDrums and the second is Forest - Environment Set by NatureManufacture.

In this first example, I’ve placed three soldiers in the scene. I’ve changed the tactical gear that the soldiers are wearing to use a camouflage pattern. Have a go at spotting all three.

Here’s the scene again, pointing them out. While it’s not impossible to spot the characters, they’re mostly in the open and they still don’t jump out to the player. If you were running through this environment, concentrating on the next objective or looting, it’d be easy to miss these threats. Especially if they tried hiding better.

Now that we have a scene where we can see effective camouflage in action, we can start experimenting with techniques we often see in games. First, let’s see the obvious. If you change the tactical gear on soldiers to be a solid color, it won’t seem out of place. We aren’t surprised to see black tactical gear in use, even if it’s not strictly realistic in all situations. This provides more recognizable features and strong bands of contrast to look out for. As I mentioned earlier, soldiers will even paint their rifles to hide obvious shapes and bands of contrast from sticking out. This technique is relatively effective while not immediately destroying any realistic art style of a game.

Now let’s move on to the next technique: lighting. Look at this clip from Battlefield V. Let’s freeze frame here. Look at the specular highlights on the rifle and the lighting on that grain silo. As you can see, the main light in the scene is coming from the left. Now let’s look at these soldiers. They’re being lit from the right. There’s a strong edge light effect on them, which makes it really easy to see their profiles.

You can achieve this in many ways in most engines, but let’s replicate it using one method in Unreal.

If we find our soldier in the scene and scroll down to lighting channels, we can see that by default all objects in Unreal are affected by lights in channel zero. We’re going to select these soldiers and set them to be affected by lights in channel one, as well as channel zero.

Now let’s place a second directional light. Let’s move it around and angle it so it opposes the main scene light. On this light, let’s turn off channel zero and turn on channel one. We’ll also turn off cast shadows, which will save performance, but it will also increase the strength of the effect. As you can see, both legs show up with an edge highlight rather than one leg being shadowed by another.

These characters now have a strong edge light, which really separates them from the background. This is probably the most effective change we’ve made to player visibility.

If we really wanted to push the visibility of players, we can add a more pronounced silhouette. We’ll do this by adding a Fresnel effect to the shader. A Fresnel effect will be black when the surface is facing the camera directly, but as the surface angles away, the Fresnel term increases in value. On this sphere, you can see it’s black in the center and fades to white at the edges. With this, we can do a few things. We’ll darken the base color, and we’ll also use it to increase the roughness of the players, so they get a more pronounced dark edge. This is quite an extreme example, but you can see the edges a little bit more clearly on the soldiers now.
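To make the idea concrete, here is a minimal sketch of a Fresnel-style term written in Python rather than as a material graph. It is a simplified form, (1 - N·V) raised to an exponent, and is not necessarily what any particular engine node computes. The function names, the `exponent` and `strength` parameters, and their default values are assumptions chosen for illustration.

```python
import math


def normalize(v):
    """Return v scaled to unit length."""
    length = math.sqrt(sum(c * c for c in v))
    return tuple(c / length for c in v)


def fresnel(normal, view_dir, exponent=5.0):
    """Simplified Fresnel term: 0 when the surface faces the camera, 1 at grazing angles.

    `view_dir` points from the surface towards the camera.
    `exponent` (assumed default) controls how tightly the effect hugs the edges.
    """
    n = normalize(normal)
    v = normalize(view_dir)
    n_dot_v = sum(a * b for a, b in zip(n, v))
    n_dot_v = max(0.0, min(1.0, n_dot_v))  # clamp so back-facing normals don't go negative
    return (1.0 - n_dot_v) ** exponent


def apply_edge_darkening(base_color, roughness, fresnel_term, strength=0.8):
    """Darken the base color and push roughness towards 1 as the Fresnel term rises.

    `strength` is a hypothetical tuning knob, not a value from the video.
    """
    darkened = tuple(c * (1.0 - strength * fresnel_term) for c in base_color)
    rougher = roughness + (1.0 - roughness) * fresnel_term
    return darkened, rougher
```

Facing the camera directly the term is 0, so the color and roughness are left alone. At a grazing angle the term reaches 1, which is where the darkened, rougher edge comes from.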

For this video, I’ve used quite strong versions of these effects to demonstrate them. In reality, you’d mix some combination of these: you’d balance the brightness of the light and the strength of the Fresnel effect, and adjust the colors on the tactical gear to a point that fits the needs and art style of your game. Many of these effects can be mixed and subtle enough that for many players it’ll be almost subconscious and won’t kill the realism of the characters or the environment. Clearly, it’s not all physically accurate, but the effects aren’t changing the art style and making it cartoony. Many players probably can’t even tell you what effects are being used, but they’ll still benefit from the improved visibility.

In many older games, some of this was almost free as it was harder to match players to their environment due to the level of detail difference as well as the different lighting techniques that were used. As rendering performance has increased and environment detail has gone up, it’s even more important to spend time making sure that the characters look great while remaining visible to players. These are just some of the techniques that can be used to separate a character from the environment.

Leaning in Games With Your Webcam

2020-07-11T00:17:13+00:00

I’ve always found using Q and E to lean in FPS games awkward. I just don’t have the coordination to maintain strafing and normal movement with WASD while also using those keys. I wanted to see how difficult it would be to use my webcam to detect head tilt and translate that into in-game actions.

I’ve hastily slapped together a proof of concept, imaginatively called FaceLean, and it works surprisingly well. It’s definitely a bit of fun in single player games. See the video below.

Latency isn’t perfect and it’s only sampling at 30 Hz, depending on your webcam. I’ve added some hard-coded delays to stop it constantly firing events when you are hovering on the threshold angle. I haven’t been brave enough to test it online, as I don’t want to upset anti-cheat software and risk my Steam account or a hardware ban.
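As a rough illustration of that kind of delay, here is a minimal sketch of a cooldown around the threshold check. It is not the code in the FaceLean repo. The class name, the 15 degree threshold, the 0.3 second cooldown and the "positive angle means lean right" convention are all assumptions made for the example.

```python
class LeanDetector:
    """Turns a stream of tilt angles into lean events, with a cooldown between changes."""

    def __init__(self, threshold_deg=15.0, cooldown_s=0.3):
        self.threshold_deg = threshold_deg  # assumed tilt needed before a lean triggers
        self.cooldown_s = cooldown_s        # assumed minimum time between state changes
        self.state = "center"
        self.last_change = float("-inf")    # so the very first change is never blocked

    def update(self, angle_deg, now):
        """Return "left", "right" or "center" when the state changes, otherwise None.

        `now` is a timestamp in seconds, e.g. from time.monotonic().
        """
        if angle_deg > self.threshold_deg:
            wanted = "right"
        elif angle_deg < -self.threshold_deg:
            wanted = "left"
        else:
            wanted = "center"

        # Only accept a change once the cooldown has elapsed, so jitter around
        # the threshold cannot fire an event on every frame.
        if wanted != self.state and now - self.last_change >= self.cooldown_s:
            self.state = wanted
            self.last_change = now
            return wanted
        return None
```

Each returned value would map to a single simulated key press or release. Frames that return None send nothing, which keeps the number of injected events low.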

It’s definitely only POC quality. It’s slapped together using a portion of this example from TowardsDataScience to learn how to use the dlib facial landmark model. With this it’s easy to extract the 2 eye locations. With those locations I can calculate a vector between the eyes, then take its dot product with a horizontal vector to measure the head tilt.
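The geometry can be sketched in a few lines of plain Python. This is an illustration of the approach rather than the code in the repo. The function name is hypothetical, and it assumes the eyes are given as (x, y) pixel coordinates with y increasing downwards, where `left_eye` is whichever eye appears further left in the image.

```python
import math


def head_tilt_degrees(left_eye, right_eye):
    """Signed angle, in degrees, between the eye-to-eye vector and the horizontal.

    0 means the eyes are level. The sign follows the vertical offset of the
    right-hand eye in image coordinates (positive when it sits lower in the frame).
    """
    dx = right_eye[0] - left_eye[0]
    dy = right_eye[1] - left_eye[1]
    length = math.hypot(dx, dy)
    if length == 0:
        return 0.0  # both points coincide, so there is no direction to measure

    # Dot product of the normalized eye vector with the horizontal vector (1, 0)
    # is just dx / length, which is the cosine of the angle between them.
    cos_angle = max(-1.0, min(1.0, dx / length))
    angle = math.degrees(math.acos(cos_angle))

    # acos only gives the unsigned angle, so take the direction from dy.
    return math.copysign(angle, dy)
```

The dot product alone only says how far the head is tilted, not which way, which is why the sign is recovered separately from the vertical offset.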

Future work / Known Issues

There are 3 main issues at the moment.

Undesired input when not in a game

Ideally, this would detect when a game is loaded and only send input events then. This would stop stray characters being put into text documents and chats. This could also reduce background CPU usage.

Stuck leaning

In my experience with ARMA 3, if you start or stop leaning while performing an in-game action that prevents leaning, the game gets out of sync. Leaning again resolves this. I’m not sure how to cleanly solve this apart from repeatedly firing events. This isn’t a great solution as I fear it may contribute to triggering anti-cheat software. Ideally, integration with games would allow events to be triggered directly, rather than through simulated key presses.

Latency

Without dedicated hardware designed for high-fps and low-latency video streaming I doubt there is much that can be done about this. It’ll never be at the same level as a purpose built low-latency solution.

Where can I get it

DANGER DO NOT USE THIS IN ONLINE/MULTIPLAYER GAMES! I have no idea if simulated key presses will trigger anti-cheat detection, best to avoid it altogether and only use this in single player games.

Repo

https://github.com/boristsr/FaceLean

Main tech used

Requirements & installation

Ensure you have the following software installed

CMake (latest)
Visual Studio Community or Professional (I used 2019)
Python 3.7 or greater

Download or clone the repo.

Run the following command to install other requirements

pip install -r requirements.txt

Download and extract shape_predictor_5_face_landmarks.dat from https://github.com/davisking/dlib-models/blob/master/shape_predictor_5_face_landmarks.dat.bz2 into the project directory

Then run main.py

python main.py