Vector Databases - pgvector and Langchain

This post will dive into the concept of a vector database, and will look at how Langchain can take a piece of text, split it into chunks, embed those chunks as vectors, and store the vectors in an underlying database.

We will use pgvector to store our embeddings, and will see how to query these embeddings and find the closest existing vectors to a given point.

The associated video for this post can be found below:


Objectives

In this post, we will:

  • Set up PostgreSQL with the pgvector extension in a Docker container, and create a database
  • Use Langchain to add embeddings to the database, created with OpenAI's text-embedding-ada-002 embedding model
  • Query the database from Langchain to find the most similar embeddings to a given query
  • Query the database with SQL and explore pgvector features
  • Explore the concept of a vector database, and why it may be helpful in developing applications using LLMs

Vectorizing Text Chunks with Langchain

Let's start by taking a piece of text, and splitting it into chunks with Langchain, and then embedding the chunks with OpenAI's embedding models.

To get started, get an API Key from OpenAI, and store in a .env file with the following format.

OPENAI_API_KEY=....

We will then install the three libraries that we'll work with in this post, using the following command:

pip install langchain openai python-dotenv

Note: if using Jupyter Notebooks, you can add an exclamation mark to the above, and run in a cell.

!pip install langchain openai python-dotenv

We'll use langchain to work with text, embeddings and the integration with pgvector. The openai library is used to call the OpenAI API, in our case to get the embeddings from chunks of text. And finally, we can use python-dotenv to read the API Key from the .env file.

Let's start writing some code. We'll bring some imports in, and will call the load_dotenv() function from python-dotenv. We'll explain these imports soon.

from langchain.document_loaders import TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import OpenAIEmbeddings
from dotenv import load_dotenv

load_dotenv()

Let's now get some text to use in this tutorial. We'll use the State of the Union text that's referenced on Langchain's documentation. You can find it here.

Add this text file to your local directory, where the code is located.

We can then read this in using Langchain's TextLoader, as below:

loader = TextLoader('state_of_the_union.txt', encoding='utf-8')
documents = loader.load()

print(documents)  # prints the document objects
print(len(documents))  # 1 - we've only read one file/document into the loader

Once the document is loaded, we are going to use the Langchain RecursiveCharacterTextSplitter object to split this text into chunks.

Rather than embedding the entire document as a single vector, we split it into chunks that have more specificity than the entire document taken as a whole, and embed each chunk individually.

Let's write code to chunk the text:

text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=20)
texts = text_splitter.split_documents(documents)

print(texts)
print(len(texts))

This outputs the original document as a set of texts, after splitting into chunks of up to 1000 characters, as per the parameters to the RecursiveCharacterTextSplitter.

There's also a small overlap between the chunks, to allow a small amount of context to be shared from one chunk to the next - you might want to increase this overlap!
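
To build intuition for these parameters, here's a deliberately simplified fixed-window splitter. This is not how RecursiveCharacterTextSplitter works internally - it recursively splits on separators such as paragraphs and sentences first - but it illustrates how chunk_size and chunk_overlap interact:

```python
def naive_split(text, chunk_size, chunk_overlap):
    """Simplified fixed-window splitter: each chunk starts
    chunk_size - chunk_overlap characters after the previous one."""
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

chunks = naive_split("a" * 50 + "b" * 50, chunk_size=40, chunk_overlap=10)
print(len(chunks))                        # 4
print(chunks[0][-10:] == chunks[1][:10])  # True - the overlap is shared
```

Because each chunk repeats the last chunk_overlap characters of the previous one, a sentence that straddles a chunk boundary is less likely to lose its surrounding context entirely.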

You can look at the content of the first chunk with the following code:

print(texts[0])

Now, let's convert our chunks to embeddings (vectors). We can use the OpenAI integrations in Langchain to do this.

Langchain comes with an OpenAIEmbeddings object that is used to retrieve embeddings from OpenAI for pieces of text. This object will call the embedding API endpoint with the provided text, which will return the vector embedding.

The following code demonstrates this:

embeddings = OpenAIEmbeddings()

vector = embeddings.embed_query('Testing the embedding model')

print(len(vector))  # 1536 dimensions

The OpenAIEmbeddings object has an embed_query method, which we use to pass in a text query.

The query is embedded to a 1536-dimensional vector, with each dimension in the vector encoding a specific "concept" about the passed-in chunk of text.

These vectors can be compared to other vectors using distance metrics - a feature that, as we'll see later, vector databases such as pgvector and Chroma support directly.
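
As a toy illustration, here's the cosine distance computed by hand on tiny vectors. pgvector's <=> operator performs the same calculation, just inside the database and across all 1536 dimensions:

```python
import math

def cosine_distance(a, b):
    """Cosine distance = 1 - cosine similarity."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return 1 - dot / (norm_a * norm_b)

print(cosine_distance([1, 0], [1, 0]))   # 0.0 - same direction
print(cosine_distance([1, 0], [0, 1]))   # 1.0 - orthogonal
print(cosine_distance([1, 0], [-1, 0]))  # 2.0 - opposite direction
```

A smaller distance means the two vectors point in more similar directions - which, for embeddings, means the underlying chunks of text are more semantically similar.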

Let’s now create vectors for the first 5 chunks in the state of the union text.

Note: When we split the text with the RecursiveCharacterTextSplitter, we get Langchain Document objects - these have a page_content property that stores the actual text for the chunk.

We’ll reference that property in the following list comprehension, passed as an argument to the OpenAIEmbeddings object's embed_documents() function:

doc_vectors = embeddings.embed_documents([t.page_content for t in texts[:5]])

print(len(doc_vectors))  # 5 vectors in the output
print(doc_vectors[0])    # this will output the first chunk's 1536-dimensional vector

So here, we have the vectors for the first 5 chunks.

Next, we want to get vectors for all chunks, and store them in the pgvector database.

pgvector for Storing Embeddings

Firstly, we need to install PostgreSQL, and enable the pgvector extension. To do this, we'll use Docker, and will pull the ankane/pgvector image, which comes with the extension preinstalled.

You can pull the image with the command: docker pull ankane/pgvector

Once pulled, you can start the container with the following command:

docker run --name pgvector-demo -e POSTGRES_PASSWORD=mysecretpassword -p 5432:5432 -d ankane/pgvector

In this command, we use the ankane/pgvector image we just pulled to run a container, and we give the container a name, set the POSTGRES_PASSWORD environment variable, and map port 5432 between the container and host.

Verify that this is running with: docker ps

You can now install a GUI tool such as pgAdmin to inspect the database that is running in the container, or else use psql on the command-line. When connecting, you can specify the host as localhost, and the password as whatever you used in the above command - mysecretpassword, in our case.

We will now create a database, connect to it, and then enable the pgvector extension on that database. Note that while the project is called pgvector, the extension itself is named vector:

CREATE DATABASE vector_db;
-- connect to vector_db, then:
CREATE EXTENSION vector;

The pgvector extension we're enabling is already installed in this container, since we pulled from the pgvector Docker image. If you're not using this image, you will need to install pgvector separately - see the instructions on the GitHub repository here.

Now, let’s make a connection to PostgreSQL in our Jupyter Notebook.

To do so, we need a few libraries:

!pip install psycopg2-binary pgvector

Once these are installed, we can take our document chunks from the State of the Union text, and embed these and store in the database.

To do so, we can import the PGVector object from langchain.vectorstores, and use its from_documents() function:

from langchain.vectorstores.pgvector import PGVector

CONNECTION_STRING = "postgresql+psycopg2://postgres:mysecretpassword@localhost:5432/vector_db"
COLLECTION_NAME = 'state_of_union_vectors'

db = PGVector.from_documents(
    embedding=embeddings,
    documents=texts,
    collection_name=COLLECTION_NAME,
    connection_string=CONNECTION_STRING,
)

This code sets the connection string, and the name of the collection, and passes these to the from_documents() function.

We also pass, as a first argument, the embedding object that will be responsible for generating vectors from the texts, and as a second argument, the chunked texts themselves.

Once we execute this code, the embeddings will be stored in the database, using pgvector to do so.

If you inspect the database with pgAdmin or another GUI, you should see something similar to the below:

Note that the table has an embedding column, of type vector - this is a new data type added by the pgvector extension. This stores the vectors that have been produced by OpenAI's embedding model.

So now we have all the chunks embedded, and stored in the database. This is beneficial as we now no longer have to call the embedding API to get embeddings for our chunks - we can call the API once, and store the results in the database, which will cut down on costs.

We can also now easily find the "most similar" chunks to a given query, using a distance metric. The pgvector extension supports the cosine distance, L2 distance and inner-product metrics for finding "similar" chunks.

So we can do this similarity check in the vector space, and we can use the following Langchain code to do so:

query = "What did the president say about Russia"
similar = db.similarity_search_with_score(query, k=2)

for doc, score in similar:
    print(doc)
    print(score, end="\n\n")

The db object has a similarity_search_with_score() function that takes a query, and optionally the number k of closest embeddings we want returned. Each result is a (document, score) pair, where a lower score indicates a closer match.

So this function will return the chunks that are "most similar" to the query we pass in. Implicitly, it'll embed the query with OpenAI's embedding model, and will then query the database to find the closest vectors in that 1536-dimensional vector-space.

pgvector adds the operators to perform these distance calculations in the vector-space. Let's now see how to do it in an SQL query!

Firstly, let's get the 1536-dimensional embedding for the above query, with this code:

vector = embeddings.embed_query(query)
print(vector)

Take the output of this query, and plug it into the following SQL statement (replacing the hard-coded vector [-0.020195066928863525, ..., -0.019898081198334694] below).

SELECT document, (embedding <=> '[-0.020195066928863525, ..., -0.019898081198334694]') as cosine_distance
FROM langchain_pg_embedding 
ORDER BY cosine_distance
LIMIT 2;

Here, we are selecting the document column, as well as the cosine distance between the value in the embedding column and the hard-coded vector provided on the right-hand side of the <=> operator. Note: this operator is added by the pgvector extension, and represents the cosine distance.

We are aliasing the cosine distances as cosine_distance, and then ordering by these values in the ORDER BY statement, and limiting only to the first 2 with the lowest values.

This gives us the most similar vectors to the one we pass in, because a smaller distance represents a more "similar" vector, from a geometric viewpoint.

Let's finish with a quick query that will aggregate all the vectors in the table, and calculate the "average vector" or the "centroid" from the data.

We can use the SQL AVG function for this:

SELECT AVG(embedding) FROM langchain_pg_embedding;

This will return the average vector by aggregating across each dimension, and from an NLP-perspective, this can be thought of as the "average chunk" from the data chunks that we've inserted into the table.
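
The per-dimension averaging that AVG performs can be mirrored in plain Python. Here's a minimal sketch with 3-dimensional vectors - the database does the same thing across our 1536 dimensions:

```python
def centroid(vectors):
    """Average a list of equal-length vectors, dimension by dimension."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

vecs = [
    [1.0, 2.0, 3.0],
    [3.0, 2.0, 1.0],
    [2.0, 2.0, 2.0],
]
print(centroid(vecs))  # [2.0, 2.0, 2.0]
```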

Benefits with LLMs

So we have a data-store containing all our vectors. This can be used when we have a lot of context, and want to only select the most relevant or similar chunks as context when querying and prompting language-models.

Language models tend to have a context length, or context window, which limits the number of tokens they can consider at a time.

If we upload a repository of 1000 documents, we cannot realistically pass all that context in a call to the language-model. The vector database allows us to select chunks that closely match what we've provided in the prompt, and only pass certain segments of the text.

This is powerful, as it allows us to take advantage of greater amounts of text content. We can embed a large corpus of documents into chunks, and can select the relevant segments in our code.
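
To sketch how retrieval feeds a prompt, the following shows one possible way to assemble retrieved chunks into the context section of a prompt. The retrieved_chunks list is a stand-in for the page_content of documents returned by a similarity search, and the template is our own illustrative format, not a Langchain API:

```python
def build_prompt(question, retrieved_chunks):
    """Assemble a prompt that grounds the model in retrieved context."""
    context = "\n\n".join(retrieved_chunks)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )

# Stand-ins for the text of the chunks a similarity search would return:
retrieved_chunks = [
    "First retrieved chunk of the speech.",
    "Second retrieved chunk of the speech.",
]
print(build_prompt("What did the president say about Russia?", retrieved_chunks))
```

The resulting string can then be sent to the language model, which answers using only the most relevant slices of the corpus rather than the entire text.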

Summary

In this post, we've learned how to take a piece of text, split it into chunks, embed the chunks using OpenAI's embedding API, and then store the resulting embeddings in PostgreSQL by leveraging the pgvector extension.

We saw how to set up PostgreSQL with the pgvector extension in a Docker container, and how to enable the extension on a database.

And finally, we have also shown how to store the embeddings using utilities from Langchain, and how to manually query the embedding table to find similar vectors.

If you enjoyed this post, please subscribe to our YouTube channel and follow us on Twitter to keep up with our new content!

Please also consider buying us a coffee, to encourage us to create more posts and videos!
