This repo compares a Graph RAG with a SQL RAG on a dataset with Formula 1 results.
-
Clone the repository:
git clone <repo-url> cd graph_rag
-
Create and activate a virtual environment:
python -m venv .venv source .venv/bin/activate # On Windows use: .venv\Scripts\activate
-
Install dependencies:
pip install -r requirements.txt
-
Download Formula 1 results Create a
resdirectory in this repo and download all files from here in this directory. -
Set up a neo4j graph database Follow these instructions to set up a neo4j graph database.
-
Define configuration Create a
config.pyfile in thesrcdirectory and set following constants to connect to the neo4j and the SQL database:# path to database DATABASE_PATH = "../res/f1-results.db" # Neo4j connection details NEO4J_URL = NEO4J_USERNAME = NEO4J_PASSWORD = DATABASE_NAME = 'f1-results'
-
Populate databases Run the notebooks
notebooks/create_sql_db.ipynbandnotebooks/build_graph_db.ipynbto write the data from the files to the SQL and graph databases. -
To use the OpenAI API for generating questions, set up the API key as follows:
-
Sign up or log in to an OpenAI account at OpenAI’s website.
-
Open the API Keys section under account settings.
-
Generate a new API key and copy it.
-
In your terminal, set the OPENAI_API_KEY environment variable to your key:
On Linux/macOS:
export OPENAI_API_KEY=your-api-keyOn Windows (Command Prompt):
set OPENAI_API_KEY=your-api-key
On Windows (PowerShell):
$env:OPENAI_API_KEY="your-api-key"
Keep the key secure and never share it publicly.
-
The file src/qa_chain contains the class GraphQAChain which transfroms a question to a SQL or cypher query, executes the query and transforms the result to an answer.
This class is initialized by providing a langchain_neo4j.Neo4jGraph objected connected to the graph database or langchain_community.utilities.SQLDatabase objected connected to the SQL database. The interface to the LLM model is provided via a langchain_openai.ChatOpenAI object.
import sys
sys.path.append('../src')
from langchain_community.utilities import SQLDatabase
from langchain_openai import ChatOpenAI
from qa_chain import GraphQAChain
from config import DATABASE_PATH
# connect to database
connection_string = f"sqlite:///{DATABASE_PATH}"
db = SQLDatabase.from_uri(connection_string)
# llm interface
llm = ChatOpenAI(temperature=0, model="gpt-5")
# initialize chain
chain = GraphQAChain(llm, db, db_type='SQL', verbose=True)
# invoke chain
chain.invoke("Give me the driver names and years when the driver on pole position won the Singapore GP.")prints:
{'write_query': {'query': "SELECT d.forename, d.surname, ra.year\nFROM races ra\nJOIN qualifying q ON q.raceId = ra.raceId AND q.position = 1\nJOIN results re ON re.raceId = ra.raceId AND re.positionOrder = 1 AND re.driverId = q.driverId\nJOIN drivers d ON d.driverId = re.driverId\nWHERE ra.name = 'Singapore Grand Prix'\nORDER BY ra.year\nLIMIT 10;"}}
{'execute_query': {'result': "[('Lewis', 'Hamilton', 2009), ('Fernando', 'Alonso', 2010), ('Sebastian', 'Vettel', 2011), ('Sebastian', 'Vettel', 2013), ('Lewis', 'Hamilton', 2014), ('Sebastian', 'Vettel', 2015), ('Nico', 'Rosberg', 2016), ('Lewis', 'Hamilton', 2018), ('Carlos', 'Sainz', 2023), ('Lando', 'Norris', 2024)]"}}
{'generate_answer': {'answer': '- 2009 — Lewis Hamilton\n- 2010 — Fernando Alonso\n- 2011 — Sebastian Vettel\n- 2013 — Sebastian Vettel\n- 2014 — Lewis Hamilton\n- 2015 — Sebastian Vettel\n- 2016 — Nico Rosberg\n- 2018 — Lewis Hamilton\n- 2023 — Carlos Sainz\n- 2024 — Lando Norris'}}
returns:
{'query': 'Give me the driver names and years when the driver on pole position won the Singapore GP.',
'result': '- 2009 — Lewis Hamilton\n- 2010 — Fernando Alonso\n- 2011 — Sebastian Vettel\n- 2013 — Sebastian Vettel\n- 2014 — Lewis Hamilton\n- 2015 — Sebastian Vettel\n- 2016 — Nico Rosberg\n- 2018 — Lewis Hamilton\n- 2023 — Carlos Sainz\n- 2024 — Lando Norris'}
import sys
sys.path.append('../src')
from langchain_neo4j import Neo4jGraph
from langchain_openai import ChatOpenAI
from qa_chain import GraphQAChain
from config import NEO4J_URL, NEO4J_USERNAME, NEO4J_PASSWORD, DATABASE_NAME
# connect to graph database
graph = Neo4jGraph(url=NEO4J_URL, username=NEO4J_USERNAME, password=NEO4J_PASSWORD, database=DATABASE_NAME,
enhanced_schema=True)
# llm interface
llm = ChatOpenAI(temperature=0, model="gpt-5")
# initialize qa chain
chain = GraphQAChain(llm, graph, db_type='Cypher', verbose=True)
# invoke chain
chain.invoke("Give me the driver names and years when the driver on pole position won the Singapore GP.")prints:
{'write_query': {'query': "MATCH (r:Race)-[:AT]->(ci:Circuit)\nWHERE r.name CONTAINS 'Singapore' OR ci.country = 'Singapore'\nMATCH (car:Car)-[q:QUALIFIED]->(r)\nWHERE q.position = 1\nMATCH (car)-[res:RACED]->(r)\nWHERE res.position = 1\nMATCH (d:Driver)-[:DROVE]->(car)\nRETURN DISTINCT d.forename + ' ' + d.surname AS driverName, r.year AS year\nORDER BY year DESC\nLIMIT 10"}}
{'execute_query': {'result': [{'driverName': 'Lando Norris', 'year': 2024}, {'driverName': 'Carlos Sainz', 'year': 2023}, {'driverName': 'Lewis Hamilton', 'year': 2018}, {'driverName': 'Nico Rosberg', 'year': 2016}, {'driverName': 'Sebastian Vettel', 'year': 2015}, {'driverName': 'Lewis Hamilton', 'year': 2014}, {'driverName': 'Sebastian Vettel', 'year': 2013}, {'driverName': 'Sebastian Vettel', 'year': 2011}, {'driverName': 'Fernando Alonso', 'year': 2010}, {'driverName': 'Lewis Hamilton', 'year': 2009}]}}
{'generate_answer': {'answer': '- 2024: Lando Norris\n- 2023: Carlos Sainz\n- 2018: Lewis Hamilton\n- 2016: Nico Rosberg\n- 2015: Sebastian Vettel\n- 2014: Lewis Hamilton\n- 2013: Sebastian Vettel\n- 2011: Sebastian Vettel\n- 2010: Fernando Alonso\n- 2009: Lewis Hamilton'}}
returns:
{'query': 'Give me the driver names and years when the driver on pole position won the Singapore GP.',
'result': '- 2024: Lando Norris\n- 2023: Carlos Sainz\n- 2018: Lewis Hamilton\n- 2016: Nico Rosberg\n- 2015: Sebastian Vettel\n- 2014: Lewis Hamilton\n- 2013: Sebastian Vettel\n- 2011: Sebastian Vettel\n- 2010: Fernando Alonso\n- 2009: Lewis Hamilton'}
Following LLM models were used to invoke the SQL and graph database: gpt-3.5-turbo, gpt-4, gpt-5. Each model-database combination answered following questions in notebooks/cypher_qa.ipynb and notebooks/sql_qa.ipynb:
- How many drivers are from Brazil?
- What's the total number of races?
- Which constructor competed in the most races?
- When was the first F1 season?
- Which circuit hosted the most F1 races?
- What driver won the 1992 Belgian Grand Prix?
- Which team had the fastest pit stop at the Italian Grand Prix 2022?
- How many drivers drove more than 100 races?
- Which driver had the most race retirements?
- Which constructor won the most driver world championships?
- What circuits were included in both: the first and the most recent season?
- How many races had the same driver who became first in qualifying and race?
- Which teams won more than 3 constructor world championships?
- What are the top 5 driver nationalities who competed in the most races?
- What's the driver with the highest number of wins at the same circuit?
- Which driver won the most world championships?
- Who won the Monaco Grand Prix in 2020?
- Who won the Formula 2 championship in 2017?
- Which seasons had more than 25 races?
- In which lap did Michael Schumacher overtake Ayrton Senna at the Brazilian Grand Prix 1995?
The results saved in qa_results.csv.