Legislation Chatbot

Streamlit App
Vector database snapshot
Cosine similarity code

Note: the video is out of date; the project is live! Check it out on Streamlit or HuggingFace!

Inspiration

Through his public policy advocacy and data research, our teammate, Nick, wanted to empower people with knowledge of the laws drafted by our government. This, compounded by the slow process of reading thousands of bills, each thousands of words long, led to the creation of this project. We at Legislation Chatbot intend to make the process of reading and understanding legislation easier for everyone.

What it is

An intelligent, ChatGPT-like chatbot for asking about US legislation, built on cutting-edge LLMs and a database of thousands of Congressional documents. It is also accessible as an API.

Future enhancements include:

Gathering more legislation (hundreds of thousands of bills). We were limited by the Congress API's request limits.
Fetching new Congressional bills daily and adding them to the database. Users could then query for the latest proposed bills and receive notifications when new legislation is related to their interests.
Clustering similar bills and creating a network diagram of connected bills.
Detecting anomalies in two ways: within bills, to expose clauses that have been discreetly added to larger bills to pass without proper coverage, and contextually, to highlight particularly important or concerning bills.

We hope the project will engage people in the political process, especially those traditionally underrepresented, and improve government transparency. People must know what their elected representatives are proposing.

Why?

Federal legislation impacts the entire nation, but there are hundreds of thousands of documents to sift through. Unfortunately, that makes it possible to hide important changes from the people. Use Legislation Chatbot to learn more about issues you care about, or even as a research assistant.

How we built it

Tech: Sentence-Transformers (HF), Google Cloud SQL (PostgreSQL pgvector), Google Cloud Functions, Anthropic's Claude-2 API

Using the all-MiniLM-L6-v2 sentence-transformer from HuggingFace, we partitioned all 750 federal bills downloaded from api.congress.gov and encoded them into language embeddings. The context for these embeddings was then recursively selected. Bill names, context, and embeddings were stored in a PostgreSQL database (with vector support from pgvector), hosted on Google Cloud SQL.
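As a rough illustration of that storage layout, the sketch below builds pgvector DDL and formats an embedding as a pgvector text literal. The table name `bill_chunks`, the column names, and the helper `to_pgvector_literal` are assumptions made for this example, not the project's actual schema. The one fixed detail is the vector size: all-MiniLM-L6-v2 produces 384-dimensional embeddings.

```python
# Illustrative sketch only: table and column names are hypothetical,
# not the schema actually used by the project.
EMBEDDING_DIM = 384  # output size of all-MiniLM-L6-v2

CREATE_TABLE_SQL = f"""
CREATE EXTENSION IF NOT EXISTS vector;
CREATE TABLE IF NOT EXISTS bill_chunks (
    id BIGSERIAL PRIMARY KEY,
    bill_name TEXT NOT NULL,
    context TEXT NOT NULL,
    embedding vector({EMBEDDING_DIM}) NOT NULL
);
"""

# Parameterized insert in the %s placeholder style used by DB-API
# drivers such as psycopg2; the embedding is passed as text and cast.
INSERT_SQL = (
    "INSERT INTO bill_chunks (bill_name, context, embedding) "
    "VALUES (%s, %s, %s::vector);"
)


def to_pgvector_literal(embedding):
    """Format a sequence of numbers as pgvector's text form, e.g. '[0.1,0.2]'."""
    if len(embedding) != EMBEDDING_DIM:
        raise ValueError(
            f"expected {EMBEDDING_DIM} values, got {len(embedding)}"
        )
    return "[" + ",".join(repr(float(x)) for x in embedding) + "]"
```

With a layout like this, each row pairs one block of bill text with its embedding, so a similarity search can return the readable context directly.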

A front-end interface, built and hosted with Streamlit, allows the user to enter a query, which is sent to our public API. The API is an HTTP Google Cloud Function, which embeds the query and uses cosine similarity to find context from the database. The API then generates an answer using Anthropic's Claude-2 API and returns both the generated answer and the original context. Providing the full context enables researchers to understand how the answer was generated and gives them more resources for further investigation. The front end then displays the API's response to the user.
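The retrieval step can be sketched in plain Python to show the arithmetic. This is a minimal illustration under stated assumptions, not the deployed function: in practice the comparison may run inside the database through pgvector, and the function names, the row layout, and the response field names (`answer`, `context`, `bill_name`, `text`, `score`) are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)


def top_k_context(query_embedding, rows, k=3):
    """Rank rows of (bill_name, context, embedding) by similarity to the query.

    Returns up to k tuples of (score, bill_name, context), best first.
    """
    scored = [
        (cosine_similarity(query_embedding, embedding), bill_name, context)
        for bill_name, context, embedding in rows
    ]
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored[:k]


def build_response(answer, ranked):
    """Bundle the generated answer with the context it was based on."""
    return {
        "answer": answer,
        "context": [
            {"bill_name": name, "text": text, "score": score}
            for score, name, text in ranked
        ],
    }
```

The top-ranked context blocks would be placed in the prompt sent to the LLM, and the same blocks are returned alongside the answer so the reader can check the sources.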

Challenges we ran into

Data

As with most machine learning applications, getting and preprocessing the data took the most time. We are grateful for the transparent API provided by Congress to download bills, but the API service had a limit on requests, capping the raw-text bills we have at 750. Further, due to the length of each bill, we had 20,000+ lines of SQL data.

Context Splitting & Word Embeddings Model

First, we needed to decide how to partition each of the linguistically complex bills for best embedding and cosine similarity performance. Capturing each block of context accurately is extremely important and warrants further optimization.
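One common way to partition long text is a fixed-size word window with overlap, so that a clause cut at a boundary still appears whole in the neighboring block. The sketch below shows that technique only as an example; it is not necessarily the splitting strategy used in the project, and the function name and the default sizes are assumptions.

```python
def split_into_chunks(text, max_words=100, overlap=20):
    """Split text into blocks of at most max_words words.

    Consecutive blocks share `overlap` words. The sizes are
    illustrative defaults, not tuned values.
    """
    if max_words <= 0:
        raise ValueError("max_words must be positive")
    if not 0 <= overlap < max_words:
        raise ValueError("overlap must be in [0, max_words)")

    words = text.split()
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # the last window already reached the end of the text
    return chunks
```

Larger windows keep more of a section together but dilute the embedding; smaller windows are more precise but can lose the surrounding clause, which is the trade-off described above.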

We also needed to find a sentence transformer model that balanced legal comprehension ability and word processing speed.

Google Cloud

Although the two of us have experience with AWS and Oracle Cloud, respectively, translating that knowledge to Google Cloud in 24 hours was arduous. We planned to create Lambda-like functions to retrieve and process the bills in the cloud, but we were unable to do so due to time and compute constraints.

Therefore, we decided to do the bulk of the processing locally, using our laptops for compute. Unfortunately, setting up the SQL database and connecting to it from our machines was also a challenge, since it was our first time doing so. The same goes for creating the cosine similarity API function in the cloud.

Accomplishments that we're proud of

New ideas

Perhaps the highlight is one others can't see. We spent a good Friday night debating which of our ideas would be most beneficial to society, including an NLP-powered file explorer, an essay writing helper, and a city planning image generator. We hope to continue this discussion and create more utilitarian projects in the future.

Cloud architecture proposal

Using our cloud knowledge, we created a scalable cloud architecture, using Lambda-like functions for efficiency, queues for scalability under heavy load, and an optimized vector database for data storage.

New Google Cloud experience

We created a PostgreSQL database on Google Cloud and uploaded 750 bills in about 45 minutes. We also created an embedding and similarity search process on Google Cloud, providing essential context for the LLM.

Streamlit frontend

We created a front end using Streamlit where the user can easily input queries to our API. The process was mind-blowingly simple. We love Streamlit!

What we learned

Data

Data preparation always takes more time than expected. Data is inconsistent, messy, and difficult to organize. We learned that it is important to have a clear idea of what data we need before we start collecting it.

Google Cloud

Cloud is hard. Even if you have prior experience with other cloud platforms, the knowledge does not transfer as well as you would expect, and new cloud technologies always come up.

Language

The study of NLP is a never-ending one. We have used word embeddings for the most basic purposes, but each time we go deeper, another unexpected quirk of language comes up, like how a bill is structured or knowing what the customer would want to know.

Built With

anthropic
claude-2
google-cloud
langchain
llm
natural-language-processing
postgresql
python
sentence-transformers
similarity-search

Submitted to

AI ATL (Atlanta)
- Winner HuggingFace Best Use of Open Source Models - Overall

Created by

I worked on all things LLM-related: text chunking, context extraction, data processing, model selection, creating embeddings, similarity search, and prompt engineering! I also set up the API as a Google Cloud Function and created the Streamlit app to give it a user interface.

Nicholas Polimeni
I worked on the backend, particularly data collection, preprocessing, and uploading to Google Cloud's PostgreSQL database, which I helped set up.

Faris Durrani
Master's in CS @ GT
Justin Singh

Updates

Nicholas Polimeni started this project — Nov 18, 2023 08:32 PM EST

Leave feedback in the comments!
