Data We are given a graph in which each node represents a page from a social media website, and two pages are connected if there is a link between them. Each page has a type (a categorical value between 1 and 4) and a description (a list of integers representing encrypted words).

We therefore have several types of data: relationship data, categorical data, and sparse, potentially noisy sentence data. Many preprocessing techniques are possible here to extract features and relevant information from this data.

Our task is, given two nodes, to predict whether there is a link between them. Edges were removed from the ground-truth graph to produce the training graph we are given; our job is to learn from the training graph and predict the missing edges of the ground-truth test graph.

How do we train and test There's a lot we can do to extract information, but if we only have one graph, how do we train and test on it? Our approach was to artificially remove edges from the training graph, train on that reduced graph, and then check whether we could predict that each held-out edge should exist. For this it's important to have both positive and negative samples, so the classification task isn't biased towards either type of prediction. We removed 10% of the training graph's edges and labeled them as positive samples, and constructed an equal-sized set of negative samples (node pairs with no edge between them). A sketch of this split is shown below.
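This is only a sketch of the procedure described above; the function name, the NetworkX graph representation, the fraction, and the seed are our own illustrative choices.

```python
import random
import networkx as nx

def make_link_prediction_split(G, frac=0.10, seed=42):
    """Hold out `frac` of the edges as positive samples and sample an
    equal number of non-edges as negative samples."""
    rng = random.Random(seed)
    edges = list(G.edges())
    n_test = int(frac * len(edges))

    # positive samples: edges removed from the training graph
    pos_samples = rng.sample(edges, n_test)
    G_train = G.copy()
    G_train.remove_edges_from(pos_samples)

    # negative samples: random node pairs with no edge in the original graph
    nodes = list(G.nodes())
    neg_samples = set()
    while len(neg_samples) < n_test:
        u, v = rng.sample(nodes, 2)
        if not G.has_edge(u, v):
            neg_samples.add((u, v))

    return G_train, pos_samples, list(neg_samples)
```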

Intuition At first we just wanted to see what would work. Our hypothesis was that "a page is likely to be connected to another page if they have many overlapping connections". One way to express this is Jaccard similarity (the size of the intersection of the two neighbor sets over the size of their union). So we tried a naive model that, given two nodes, computes the Jaccard similarity between them and predicts an edge if the similarity exceeds a threshold (sketched below). This yields 58% accuracy: not great, but also not bad for a first step.
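The baseline itself is only a few lines. The threshold value below is illustrative; in practice it would be tuned against the held-out edges.

```python
def jaccard_similarity(G, u, v):
    """Size of the intersection of the neighbor sets over the size of their union."""
    nu, nv = set(G.neighbors(u)), set(G.neighbors(v))
    union = nu | nv
    return len(nu & nv) / len(union) if union else 0.0

def predict_edge_jaccard(G, u, v, threshold=0.1):
    # predict a link if the two pages share enough neighbors
    return jaccard_similarity(G, u, v) >= threshold
```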

Embedding-Based Approach Next we turned to embedding-based approaches. What if we could map the nodes into a vector space such that two nodes whose embeddings are close together have a high likelihood of sharing an edge? In that case we could try to learn this mapping from nodes to vectors, using each node, its neighbors, and some features about it.

Our first attempt was to initialize a random embedding for each node, then loop through all the edges in the graph and, for each edge, move the two endpoints closer to each other by some step size. The hope is that this converges to a good embedding space; if not, we could also sample negative edges and push unconnected nodes further apart. A rough sketch follows.
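A minimal sketch of that idea, assuming a NetworkX graph and NumPy vectors; the dimension, step size, and epoch count are arbitrary.

```python
import numpy as np

def naive_embeddings(G, dim=32, step=0.01, epochs=5, seed=0):
    """Random init, then repeatedly pull the endpoints of each edge together.
    Without a repulsion term (negative edges) this can collapse all vectors
    toward a single point, which is why the negative-sampling idea matters."""
    rng = np.random.default_rng(seed)
    emb = {n: rng.normal(size=dim) for n in G.nodes()}
    for _ in range(epochs):
        for u, v in G.edges():
            delta = emb[v] - emb[u]
            emb[u] += step * delta   # move u toward v
            emb[v] -= step * delta   # move v toward u
    return emb
```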

Next, we looked at libraries that do this for us; DeepWalk and node2vec seemed like good candidates. These techniques use random walks, and node2vec can additionally capture not only homophily but also structural equivalence, which seemed promising.

So we used node2vec to get node embeddings and predicted an edge whenever two nodes were within a certain distance of each other in the embedding space. With hyperparameter tuning this gave 61% accuracy.
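As a sketch, using the open-source `node2vec` package (which wraps gensim); the dimensions, walk parameters, and distance threshold below are illustrative, not the values we tuned to.

```python
import numpy as np
from node2vec import Node2Vec

# fit node2vec on the reduced training graph (a NetworkX graph)
n2v = Node2Vec(G_train, dimensions=64, walk_length=30, num_walks=100, p=1, q=1)
model = n2v.fit(window=10, min_count=1)

def predict_edge_node2vec(u, v, max_dist=0.8):
    # predict a link if the two embeddings are within a tuned distance
    eu, ev = model.wv[str(u)], model.wv[str(v)]
    return np.linalg.norm(eu - ev) <= max_dist
```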

Link Prediction using Graph Neural Networks For this approach we build a feature vector for each node by concatenating its one-hot encoded description with the node's own classification; we decompressed the given one-hot encoded file into a sparse matrix.
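A rough sketch of this feature construction, assuming node ids 0..n-1, a dict of description word ids per node, and the 1-4 page type; the function name and shapes are our assumptions.

```python
import numpy as np
import torch
from scipy.sparse import lil_matrix

def build_node_features(descriptions, types, vocab_size, num_types=4):
    """descriptions: {node: [word ids]}, types: {node: type in 1..4}.
    Multi-hot description vector concatenated with a one-hot page type."""
    n = len(descriptions)
    feats = lil_matrix((n, vocab_size + num_types), dtype=np.float32)
    for node, words in descriptions.items():
        for w in words:
            feats[node, w] = 1.0
        feats[node, vocab_size + types[node] - 1] = 1.0
    return torch.from_numpy(feats.toarray())
```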

We used the Deep Graph Library (DGL) to represent the graph and built a model consisting of two GraphSAGE layers; DGL provides library functions that assist with creating these layers.
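The model itself is short; a sketch using DGL's built-in SAGEConv layer, where the hidden size and the mean aggregator are our own choices.

```python
import torch.nn as nn
import torch.nn.functional as F
from dgl.nn import SAGEConv

class GraphSAGE(nn.Module):
    def __init__(self, in_feats, hidden_feats):
        super().__init__()
        self.conv1 = SAGEConv(in_feats, hidden_feats, 'mean')
        self.conv2 = SAGEConv(hidden_feats, hidden_feats, 'mean')

    def forward(self, g, features):
        # two rounds of neighborhood aggregation produce the node embeddings
        h = F.relu(self.conv1(g, features))
        return self.conv2(g, h)
```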

For our learning rule we used the dot product to create new edge features (used in the learning process) from the existing node features: the score of a candidate edge is the dot product of the representations of its two endpoints.
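This follows the standard DGL link-prediction pattern, where `apply_edges` with a dot-product message function attaches a score to every edge of a positive or negative graph; the class name is ours.

```python
import torch.nn as nn
import dgl.function as fn

class DotPredictor(nn.Module):
    def forward(self, g, h):
        with g.local_scope():
            g.ndata['h'] = h
            # score(u, v) = h_u . h_v for every edge (u, v) in g
            g.apply_edges(fn.u_dot_v('h', 'h', 'score'))
            return g.edata['score'][:, 0]
```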

For our loss function we used binary cross-entropy, and for the evaluation metric we used the area under the ROC curve (AUC).
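With the edge scores for the positive and negative samples, the loss and metric look roughly like this (helper names are ours; the scores are raw dot products, so the logits form of BCE is used).

```python
import torch
import torch.nn.functional as F
from sklearn.metrics import roc_auc_score

def compute_loss(pos_score, neg_score):
    scores = torch.cat([pos_score, neg_score])
    labels = torch.cat([torch.ones(pos_score.shape[0]),
                        torch.zeros(neg_score.shape[0])])
    return F.binary_cross_entropy_with_logits(scores, labels)

def compute_auc(pos_score, neg_score):
    scores = torch.cat([pos_score, neg_score]).detach().numpy()
    labels = torch.cat([torch.ones(pos_score.shape[0]),
                        torch.zeros(neg_score.shape[0])]).numpy()
    return roc_auc_score(labels, scores)
```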

Roadblocks We struggled a lot to find a good model that we could understand, implement, and reason about. Moreover, after settling on the graph neural network model, we struggled for hours to make it work because of data types that were not compatible with TensorFlow.

TODO: Compute accuracy from positive score and negative score instead of just the AUC metric.
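One possible way to do that, assuming the raw dot-product scores are treated as logits so 0 is the natural decision boundary:

```python
import torch

def compute_accuracy(pos_score, neg_score, threshold=0.0):
    scores = torch.cat([pos_score, neg_score])
    labels = torch.cat([torch.ones(pos_score.shape[0]),
                        torch.zeros(neg_score.shape[0])])
    preds = (scores > threshold).float()
    return (preds == labels).float().mean().item()
```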
