This repository demonstrates graph virtualization, specifically the ability to run a query against an Amazon Neptune graph database that brings in additional data residing outside the Neptune database in a data lake. In our example, the Neptune database contains a knowledge graph of weather stations, represented in Resource Description Framework (RDF) form. The data lake contains temperature readings from these stations. The readings are not kept in the Neptune database, but we wish to run queries from Neptune that bring in the readings as if they did reside in Neptune. In effect, we want the readings to virtually reside in Neptune.
The next figure shows our design.
First, we decide to split the data as follows. Weather station data goes into a Neptune database. Having it there allows us to maintain relationships between stations and their locations. In general we want to be able to navigate those relationships.
Neptune supports two well known graph representations: labelled property graph and Resource Description Framework (RDF). In this example, we use RDF. The query language for RDF is called SPARQL.
The readings themselves are collected into an Amazon Simple Storage Service (S3) bucket. This data is huge and grows larger by the minute. The data is in Apache Parquet format. We define a table structure for this data using the AWS Glue Data Catalog. The Amazon Athena service allows us to query the data using SQL. The combination of S3, Glue Data Catalog, and Athena forms a data lake.
Here is the flow:
-
A data SME uses a Neptune Jupyter notebook running on Amazon Sagemaker to explore the Neptune database. In this demo, the SME runs a SPARQL query that combines station and readings data. This is a federated query because it hits not only Neptune but also the lake. Significantly, SPARQL supports federated query in a standard way. We leverage that capability here. The SME runs the query on Neptune. Neptune finds the station data and then implicitly queries the lake to get the readings data.
-
The data lake can be queried using SQL. It cannot be directly queried using SPARQL. SPARQL federation federates SPARQL endpoints only. But we leverage a third-party tool Ontop Virtual Knowledge Graph (VKG), which maps SPARQL to SQL. It does this by offering a SPARQL endpoint, enabling Neptune to send SPARQL queries to it. We configure Ontop with a mapping, instructing it how to map the incoming SPARQL query to a SQL query.
-
Ontop in turn sends the SQL query to a target database, in this case the data lake via Athena. Ontop uses JDBC, enabling it to connect to a variety of relational databases. We deploy Ontop as a Docker container on Amazon Elastic Container Service (ECS).
-
Athena gets the readings table structure from the Glue data catalog.
-
Athena gets the data from the S3 bucket.
We are presenting this demo in several public presentations and are writing a post for the AWS Database blog on this topic. We recommend these to better understand the motivation for virtualization and our design.
- Download cfn/ontop_main.yaml.
- Go to the CloudFormation console in your AWS account. Create a CloudFormation stack from your local copy of ontop_main.yaml. The stack creates a Neptune cluster, a SageMaker notebook instance, an ECS Cluster, a Glue database and crawler, an S3 bucket, and some IAM roles. The notebook instance serves two purposes:
- It enables you to bulk-load data to Neptune and run SPARQL queries against Neptune and the ontop endpoint.
- It acts as a build machine to build the container. This is optional; use it if you do not wish to use your own build machine.
- (Optional) If using the SageMaker instance as a build machine, add permissions needed to build.
- In the SageMaker console, locate the notebook instance created from the list of notebooks. Select it to show details of the instance.

- In the SageMaker console, locate the IAM role for the instance under
Permissions and encryption. Select it to open it in the IAM console.
- In the IAM console, add permissions similar to the following:
AmazonECS_FullAccessmanaged policy or some subset of it to enable creation of ECS resources.AWSCloudFormationFullAccessmanaged policy or some subset of it to enable viewing your existing stack and creating a new stack.CloudWatchLogsFullAccessmanaged policy or some subset of it to enable ECS stack to setup log group.EC2InstanceProfileForImageBuilderECRContainerBuildsmanaged policy or some subset of it to enable creation of ECR image.IAMFullAccessmanaged policy or some subset of it to enable ECS stack to create IAM role.
- In the SageMaker console, locate the notebook instance created from the list of notebooks. Select it to show details of the instance.
- (Optional) If using notebook instance as a build machine, open Jupyter or Jupyter Labs in the notebook instance. Once in, open a terminal.

- From a terminal on the build machine, run the following:
# get the code
git clone https://github.com/aws-samples/amazon-neptune-graph-virtualization.git
# make scripts executable
cd amazon-neptune-graph-virtualization/scripts/
chmod +x *.sh
# Copy data files to your bucket. This might take several minutes.
./s3cp.sh <yourstackname>
Keep the terminal open. Will come back to this.
-
Next, in a separate tab in your browser, go to the Glue console. Under Crawlers, check you have a crawler called ClimateCrawler. Run the crawler. Wait for the crawler to complete.
-
Still in Glue console, check Tables under Data Catalog. There should be a table called climate. Confirm.
-
Go to the Athena console. Go to the Query Editor. Under Settings, click Manage and set the location for query results to the data bucket created by CFN.
-
Still in Athena console, go to the query editor, select the climate table, and run a preview query. Confirm that valid temperature readings comes back in the result set.
-
Return to the teriminal on the build machine. Run the following:
./build.sh <yourstackname>
./deploy.sh <yourstackname> ecs1
-
In a separate tab in your browser, go to the ECS console. Find the running task for the ontop-lake table. Get its private IP address and copy it.
-
If Jupyter is not already open, from the SageMaker console, open Jupyter or Jupyter Lab. In Jupyter, open climate-data-queryes.ipynb and replace any IP addresses referenced in the queries with the private IP address of the ECS task. Run the queries.
To avoid costs, delete the resources as follows:
- Remove the ECS service containing Ontop by deleting the CloudFormation stack that you created using the deploy.sh step during setup. That stack is based on template cfn/ecs-task.yaml.
- Remove the main CloudFormation stack based on template cfn/ontop_main.yaml. This removes the Neptune cluster and notebook instance, the Glue catalog database, as well the ECS cluster.
See CONTRIBUTING for more information.
This library is licensed under the MIT-0 License. See the LICENSE file.
