COVID-19 is a global crisis that has affected us all. A massive research effort is underway to gain knowledge on every facet of the virus, including symptoms, treatments, and risk factors. To aid in the relief effort, AWS has created the public COVID-19 data lake, which contains various datasets you can use to help in the fight against the pandemic. For more information, see A public data lake for analysis of COVID-19 data and Exploring the public AWS COVID-19 data lake.
A large amount of data on coronavirus exists in research publications. One of the datasets in the data lake is a massive corpus of these publications, which the Allen Institute for AI aggregates and updates. The problem lies in how to find and extract the information you need. This post walks you through solving this problem using knowledge graphs.
Amazon Neptune is a fast, reliable, fully managed graph database service that makes it easy to build and run applications that work with highly connected datasets. The core of Neptune is a purpose-built, high-performance graph database engine optimized for storing billions of relationships and querying the graph with millisecond latency. Neptune supports popular graph models Property Graph and W3C’s RDF, and their respective query languages Apache TinkerPop Gremlin and SPARQL, which allows you to easily build queries that efficiently navigate highly connected datasets. In this walkthrough, you build a Property Graph and use Apache TinkerPop Gremlin to query the data.
To appreciate a highly connected network, you often need to see it. An excellent counterpart to Neptune is Tom Sawyer Software. The Tom Sawyer Graph Database Browser allows you to easily view and interact with data stored in a graph database. It’s an end-to-end visualization application that can import data directly from Amazon Simple Storage Service (Amazon S3) into Neptune, which removes the need for command line tools. You can connect to your database and immediately begin exploring the data, inspecting the properties of graph elements, and modifying the appearance of nodes and edges to fit your needs.
Graph databases excel at connecting pieces of data together. A knowledge graph isn’t simply a place for storing research publications; it’s a place for linking pieces of semantically important data so you can create queries that reveal interesting and relevant information. This post uses the COVID-19 Open Research Dataset, which contains tens of thousands of papers and associated metadata, such as the authors, publication date, journal, and Digital Object Identifier. This metadata is a good fit for the graph because you can link papers based on common authors and citations.
However, for this use case, you want to link papers based on their content. Semantically linking papers is tricky, but luckily there is a tool that provides precisely what you need. Amazon Comprehend Medical is a natural language processing service that makes it easy to use machine learning to extract relevant medical information from unstructured text. You can use Amazon Comprehend Medical to enrich the graph data with concepts extracted from every paper by running the text of each paper through the Amazon Comprehend Medical analysis endpoint. The data is prepared and ready for you to use after you set up your graph database.
The following table summarizes the nodes and descriptions for the graph schema.
|Node Name||Description|
|Paper||A paper from the dataset.|
|Author||Authors of the paper.|
|Institution||Institutions that authors are affiliated with.|
|Concept||Entities returned from Amazon Comprehend Medical.|
|Topic||Classifications returned from running the papers through a multi-label classifier. Topics are assigned from a pre-defined ontology of 10 possible topics.|
The following table summarizes the edge information for the graph schema.
|Edge Name||Source Node||Destination Node||Edge Weight|
|Paper||Concept||Amazon Comprehend Medical confidence scores|
|Paper||Topic||Amazon Comprehend Medical confidence scores|
The solution contains three main components: the Neptune database, the data processing and ingestion, and the two methods of interfacing with the knowledge graph.
The following diagram illustrates the solution architecture.
The data processing and ingestion follows these sequential steps:
- Data is added to an S3 bucket.
- This event triggers an AWS Lambda function that loads the data and begins processing.
- The data is deduplicated, and the relevant metadata is extracted for the Institution nodes that are added to the graph (for example,
- The full text for every paper goes to Amazon Comprehend Medical, which returns a set of entities (called concepts in the graph), which represent semantic pieces of information present in that paper.
- The data is also sent through a multi-label classifier, which returns topics, a list of semantic information for every paper (not shown in the architecture). Topics are similar to concepts.
- The nodes and edges are stored as CSVs following the format Neptune uses. For more information, see Gremlin Load Data Format.
- The Lambda function places the data back in Amazon S3, ready for ingestion.
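As a rough illustration of the CSV step above, the following sketch writes one node file and one edge file using the system column headers that the Gremlin load data format requires (`~id` and `~label` for nodes; `~id`, `~from`, `~to`, and `~label` for edges). The sample records, the property names `title` and `score`, and the edge label `associated_concept` are assumptions made for this example, not values taken from the data lake.

```python
import csv
import io

# Hypothetical sample records; the real IDs, property names, and scores
# in the data lake may differ.
papers = [{"id": "paper-1", "title": "Example paper title"}]
paper_concept_edges = [
    {"id": "edge-1", "from": "paper-1", "to": "concept-1", "score": 0.97},
]


def nodes_to_csv(rows, label, prop):
    """Serialize nodes with the ~id and ~label system columns plus one String property."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["~id", "~label", f"{prop}:String"])
    for row in rows:
        writer.writerow([row["id"], label, row[prop]])
    return buf.getvalue()


def edges_to_csv(rows, label):
    """Serialize edges with the ~id, ~from, ~to, ~label columns plus a Double weight."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["~id", "~from", "~to", "~label", "score:Double"])
    for row in rows:
        writer.writerow([row["id"], row["from"], row["to"], label, row["score"]])
    return buf.getvalue()


paper_csv = nodes_to_csv(papers, "Paper", "title")
# "associated_concept" is an assumed edge label used only for this sketch.
edge_csv = edges_to_csv(paper_concept_edges, "associated_concept")
print(paper_csv)
print(edge_csv)
```

Keeping the edge weight as a typed property column (`score:Double`) is one way to carry the confidence scores into the graph so that queries can filter or sort on them.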
This data processing is quite expensive, and the processed data is already available for you in the data lake. You ingest it into the graph using an HTTPS request against one of the Neptune API endpoints.
The Neptune database is a cluster that consists of a writer and a set of read replicas. This use case creates a small cluster with a single read instance, but you can expand upon the database’s capabilities. For more information, see Adding Neptune Replicas to a DB Cluster.
Neptune is always launched inside a VPC, and you can further enhance its security by requiring AWS Identity and Access Management (IAM) authentication for access.
The two methods of interfacing with this Neptune graph are Tom Sawyer Software’s graph visualization tool and Jupyter notebooks hosted on Amazon SageMaker. This post reviews both interfaces.
Deploying the Neptune graph using AWS CloudFormation
You first create the database (and its associated network infrastructure) that you need to create the knowledge graph. You also simultaneously create the Amazon SageMaker notebook instance that you use for querying later. You accomplish this with an AWS CloudFormation template, which this post provides: Launch Template
At the time of this writing, you can only launch the stack in
The CloudFormation stack performs the following actions:
- Creates a Neptune DB cluster.
- Sets up a VPC with private and public subnets, which gives Neptune access to the internet and protects it from unauthorized access.
- Creates an Amazon SageMaker notebook instance and gives it permission to access the Neptune cluster within the VPC, which allows you to easily interact with the graph. When the stack is complete, you can see the libraries it loads in the Lifecycle configurations section on the Amazon SageMaker console.
The stack can take up to 10 minutes to deploy. When it’s complete, on the AWS CloudFormation console, choose the base stack, and then choose the Outputs tab. You need some of the information on this tab later in the walkthrough.
You’re now ready to ingest the graph data.
Ingesting data into the graph using Python and Amazon SageMaker notebooks
You can use these same instructions to ingest future datasets. Complete the following steps:
- On the Amazon SageMaker console, choose Notebook instances.
- Choose the notebook instance that contains Neptune in the name.
- Choose Open Jupyter.
After it loads, you see a few notebooks in the menu bar.
- Choose the notebook Ingesting Data.
In the first few cells, you can see how to create a connection to the graph using WebSocket. The next cell executes a function called bulkLoad(). This function tells the Neptune graph to pull data from a location in Amazon S3, and is the fastest method of loading data into the graph. All the data for this graph is stored in the public COVID-19 data lake.
When you run the bulk load cell, your notebook sends an HTTPS request to the Neptune cluster to reach out to the Amazon S3 location that you specified and pull any well-formatted data that exists there.
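To make the request concrete, here is a minimal sketch of what a bulk load call to the Neptune loader endpoint can look like. It is not the notebook's bulkLoad() implementation; the function name `build_bulk_load_request`, the cluster endpoint, the S3 location, and the IAM role ARN are all placeholders assumed for illustration. The request is only built and printed, because sending it works only from inside the cluster's VPC.

```python
import json
import urllib.request


def build_bulk_load_request(cluster_endpoint, s3_source, iam_role_arn, region, port=8182):
    """Build (but do not send) a POST request for the Neptune loader endpoint."""
    payload = {
        "source": s3_source,         # S3 prefix that holds the node and edge CSVs
        "format": "csv",             # Gremlin load data format
        "iamRoleArn": iam_role_arn,  # role that lets Neptune read from Amazon S3
        "region": region,            # Region of the S3 bucket
        "failOnError": "FALSE",      # keep loading if a single record fails
    }
    return urllib.request.Request(
        url=f"https://{cluster_endpoint}:{port}/loader",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# All values below are placeholders, not the real data lake location or role.
request = build_bulk_load_request(
    cluster_endpoint="my-cluster.cluster-example.us-east-1.neptune.amazonaws.com",
    s3_source="s3://example-bucket/example-prefix/",
    iam_role_arn="arn:aws:iam::123456789012:role/ExampleNeptuneLoadRole",
    region="us-east-1",
)
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))

# To send it from a host inside the VPC:
# with urllib.request.urlopen(request) as response:
#     print(response.read().decode("utf-8"))
```

The loader responds with a load ID, which you can use to poll the same endpoint for the status of the job.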
When the request is complete, scroll to the bottom of the notebook and run the two cells that define and run the graph_status function. Observe the number and types of every node and edge in the graph.
Keep these notebooks open in a separate tab to use later in the walkthrough. You’re now ready to create your visualization.
Visualizing with Tom Sawyer Software
The Tom Sawyer Graph Database Browser begins with a 5-day free trial, after which you incur charges. For more information, see Tom Sawyer Graph Database Browser for Amazon Neptune, Neo4j, & TinkerPop. You can unsubscribe at any time. After you unsubscribe, your visualization instance no longer works, but you can still access the data through your notebook.
Setting up the Tom Sawyer Graph Database Browser
To set up the Tom Sawyer Graph Database Browser, complete the following steps.
- Navigate to Tom Sawyer Graph Database Browser for Amazon Neptune, Neo4j, & TinkerPop on AWS Marketplace.
- In the Additional Resources pane, choose Graph Database Browser AMI Deployment Guide.
- Follow the instructions in the Quick Start Instructions
- For Instance type, choose a size appropriate for your graph and within your desired cost window.
- This post uses m5.4xlarge, but a t3.* instance also allows you to run queries.
- When launching your instance, refer to the Outputs tab of your CloudFormation stack for the VPC and subnet that you should launch your instance in.
- To access the application for the first time and sign in, follow the instructions in the Using the Application
Connecting the data
Before you can visualize the data, you must specify the Neptune endpoint connection details. Complete the following steps:
- From the Graph Database Browser Databases page, choose Add Database.
- For Vendor, choose Amazon.
Neptune automatically populates in the Database field.
- Enter a name for the database (for example,
- Choose Save.
- On the Databases page, select the
- From the Actions drop-down menu, choose Connections.
- Choose Add connection.
- On the New connection page, enter the following connection details for your database.
|Query Language||Gremlin|
|Protocol||Web Socket|
|Cluster Endpoint||Choose your base stack on the AWS CloudFormation console, and choose the Outputs tab. Copy the value of DBClusterEndpoint and enter it here.|
|Port Number||8182|
|IAM DB Authentication||For this post, leave this deselected. If you use IAM authentication, you can restrict access to a specific AWS access key ID, AWS secret access key, and service Region.|
|SSL||Select this check box.|
- Choose Save.
- To connect to the database and start browsing the data, choose Actions and Connect.
The following screenshot shows the Tom Sawyer Graph Database Browser after executing a Gremlin query.
You can now use the Query view to enter a query and run it to visualize the data.
Using the Gremlin traversal language
Gremlin is a query language for graph databases. It allows you to create queries that traverse the graph, and to filter and sort edges and nodes according to your needs. For more information about building queries, see PRACTICAL GREMLIN: An Apache TinkerPop Tutorial. To start using Gremlin, complete the following steps:
- In the Settings tree view, choose Result Limit Per Loading.
- Enter 500.
This step makes sure that your query returns enough results.
- Enter the following code:
This query looks for all nodes labeled as Topic, finds all the connected Paper nodes, and looks for edges attached to said
- Choose Load Data.
The following screenshot shows the output as a visualization. Because graph data continuously updates, this screenshot may be different than your actual results.
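As a hedged sketch of the kind of query described in the preceding steps, the following snippet assembles a Gremlin traversal as a plain string that you could paste into the Query view. The labels Topic and Paper come from the graph schema; the choice of `both()` (so the sketch doesn't depend on edge direction) and the exact shape of the traversal are assumptions, not necessarily the query this post used.

```python
# A Gremlin traversal built as a string so each step can carry a comment.
TOPIC_PAPER_EDGES_QUERY = (
    "g.V().hasLabel('Topic')"    # start at every Topic node
    ".both().hasLabel('Paper')"  # step to the connected Paper nodes
    ".inE()"                     # return the edges that point at those papers
)
print(TOPIC_PAPER_EDGES_QUERY)
```

Ending on an edge step such as `inE()` gives the browser both endpoints of every returned edge, which is what lets it draw a connected picture rather than isolated nodes.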
Tips for using Tom Sawyer Software
The following tips may be useful when using Tom Sawyer Software:
- Before each new query, clear all elements in the graph view from the previous query by choosing Clear All.
- Queries work best when they end with an edge component such as
- The query editor provides syntax highlighting to make it easier to read and write queries. For Gremlin queries, either use Ctrl+Space or the period key to access an auto-complete list of expressions. Hover over each expression to view a tooltip about each one. To select an expression, choose it and enter () at the end of the populated string.
- To change the default node limit, under General, choose the current default number after clearing the graph and change it. When starting out, it’s recommended to change the limit to 500.
- A default template sets the appearance of graph elements, but you can change settings like text color, font, type (the node’s shape), and color by choosing a node (right-click) and choosing Edit Appearance Rule. The preceding graph visualization has modified node types.
- Use one of the zoom options to get a closer look at the Author nodes. To see which institutions are connected to a particular author, choose the author (right-click) and choose Load Neighbors. Try loading the neighbors of other nodes to incrementally expand the graph and explore the data.
To search for all Paper nodes with associated concepts drawn from a pre-defined set, enter the following code:
To find all Institution nodes and the Author nodes affiliated with them, enter the following code:
To find all Concept nodes with fewer than 20 Paper nodes and display all connected papers, enter the following code:
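The three searches described above could be written along the following lines. This is a sketch under stated assumptions rather than the exact queries from this post: the helper names are hypothetical, the Concept property name `entity` is a guess you should check against your graph, and the traversals are built as strings for pasting into the Query view. Each one ends on edges, using the common `bothE().where(__.otherV()...)` pattern to keep only edges whose other endpoint has the wanted label.

```python
def edges_to(label):
    """Gremlin fragment: keep only edges whose other endpoint has the given label."""
    return f".bothE().where(__.otherV().hasLabel('{label}'))"


def papers_for_concepts(concepts, concept_property="entity"):
    """Edges between Paper nodes and Concept nodes whose value is in a fixed set.

    concept_property is a hypothetical property name; check your graph's schema.
    """
    values = ", ".join(f"'{c}'" for c in concepts)  # simple quoting, no escaping
    return (
        f"g.V().hasLabel('Concept').has('{concept_property}', within({values}))"
        + edges_to("Paper")
    )


def institutions_with_authors():
    """Edges between Institution nodes and the Author nodes attached to them."""
    return "g.V().hasLabel('Institution')" + edges_to("Author")


def rare_concepts_with_papers(max_papers=20):
    """Concept nodes linked to fewer than max_papers papers, plus those paper edges."""
    return (
        "g.V().hasLabel('Concept')"
        f".where(__.both().hasLabel('Paper').count().is(lt({max_papers})))"
        + edges_to("Paper")
    )


print(papers_for_concepts(["fever", "cough"]))
print(institutions_with_authors())
print(rare_concepts_with_papers())
```

Because every query returns edges, remember to choose Clear All between runs so results from one search don't mix with the next.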
Gremlin queries from notebooks for data exploration
Using Gremlin from a notebook is slightly different than using it in the Tom Sawyer Graph Database Browser. Instead of ending with
inE, all queries must end with a terminal step. The list of terminal steps includes
.toSet(). For a complete list, see PRACTICAL GREMLIN: An Apache TinkerPop Tutorial.
To use Gremlin queries from notebooks, open the notebook from the menu Gremlin Queries. You can experiment with the queries in the notebook to get a better idea of what you can do with the graph.
Gremlin also offers a method of gauging query performance through the profiling tool. To use the profiling tool, invoke the %%gremlin profile magic before you launch a query. See the following code example:
The preceding code is a complex query that finds papers related to COVID-19 and risk factors, and orders the papers based on how prolific their authors are. Using the profiling tool, you can see where most of the time is spent when Neptune processes the query. You can choose Gremlin Profile to experiment with another example and see how query optimization can result in significant time savings.
Analyzing with Amazon Comprehend Medical
Within the knowledge graph, papers can be related to each other by using shared concepts as intermediate connectors. Papers that share many of the same concepts are likely to be related to each other. The following diagram illustrates the connections between two papers.
The papers form a network of similarity relationships, which you can use to garner paper recommendations with properly formulated queries. You can experiment with Gremlin to link papers as shown in the preceding diagram (stay tuned for later posts on this topic).
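To illustrate the idea outside the graph, the following sketch counts shared concepts between papers in plain Python and ranks recommendations by that count. The paper IDs and concepts are toy data invented for this example, and the function names are hypothetical; in practice you would express the same paper-concept-paper hop as a Gremlin traversal.

```python
from collections import Counter
from itertools import combinations

# Toy data, made up for illustration: paper ID -> set of extracted concepts.
paper_concepts = {
    "paper-A": {"fever", "cough", "hypertension"},
    "paper-B": {"fever", "cough", "diabetes"},
    "paper-C": {"vaccine"},
}


def shared_concept_counts(paper_concepts):
    """Count how many concepts each pair of papers has in common."""
    counts = Counter()
    for a, b in combinations(sorted(paper_concepts), 2):
        shared = len(paper_concepts[a] & paper_concepts[b])
        if shared:
            counts[(a, b)] = shared
    return counts


def recommend(paper_id, paper_concepts, top_n=5):
    """Rank other papers by the number of concepts they share with paper_id."""
    scored = []
    for (a, b), n in shared_concept_counts(paper_concepts).items():
        if paper_id == a:
            scored.append((b, n))
        elif paper_id == b:
            scored.append((a, n))
    # Most shared concepts first; ties broken by paper ID for stable output.
    return sorted(scored, key=lambda item: (-item[1], item[0]))[:top_n]


print(recommend("paper-A", paper_concepts))  # [('paper-B', 2)]
```

A raw count favors papers with many concepts; weighting each shared concept by its confidence score is one possible refinement.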
When you’re finished, you should remove the resources you created to avoid incurring additional charges.
- On the AWS CloudFormation console, delete the base stack (which automatically deletes nested stacks and resources).
- To remove your Tom Sawyer Graph Database Browser instance, on the AWS Marketplace website, cancel your subscription to the Tom Sawyer Graph Database Browser.
- On the Amazon Elastic Compute Cloud (Amazon EC2) console, choose the instance that the Tom Sawyer Graph Database Browser ran on.
- Choose Actions.
- Choose Terminate.
This post walked you through using graph databases and highly connected data. You can launch the CloudFormation stack in this post and see what answers you get by formulating queries.
For more information about using Gremlin with Neptune, see Accessing the Neptune Graph with Gremlin. For a deeper dive on doing analysis with the Covid Knowledge Graph, see Building and querying the AWS COVID-19 knowledge graph. For more information about using Neptune with Amazon SageMaker notebooks, see Analyze Amazon Neptune Graphs using Amazon SageMaker Jupyter Notebooks.
If you have any questions or comments about this post, leave your thoughts in the comments.
About the Authors
George Price is a Deep Learning Architect at the Amazon Machine Learning Solutions Lab, where he helps build models and architectures for AWS customers. Previously, he was a software engineer working on Amazon Alexa.
Colby Wise is a Data Scientist and manager at the Amazon Machine Learning Solutions Lab, where he helps AWS customers across different industries accelerate their AI and cloud adoption.
Miguel Romero is a Data Scientist at the Amazon Machine Learning Lab, where he helps AWS customers address business problems with AI and cloud capabilities. Most recently, he has built CV and NLP solutions for sports and healthcare.
Ninad Kulkarni is a Data Scientist at the Amazon Machine Learning Solutions Lab. He helps customers adopt ML and AI solutions by building solutions to address their business problems. Most recently, he has built predictive models for sports customers for on-screen consumption to improve fan engagement.