Link & Recommendation, Rice Datathon 2022

By Tianjian Sun, Yuhan Yang, Haijiao Lu and Yun Sun.

Project Description

This repository contains our project for Rice Datathon 2022, in the track offered by Bill.com. Graphs are widely used to represent data with many interrelations. In most cases, however, it is impossible to obtain a graph that shows every edge that exists, so it is necessary to build a model that learns from a partial graph in which most edges are missing and then predicts whether potential edges exist. This technique is a natural choice for building recommendation systems.

What it does

Given part of an undirected graph with node and edge information, our project has two goals. First, it predicts whether an edge (link) exists between two nodes when that edge is not in our training set. Second, it finds the k nodes most likely to be connected to a given node, which is a good simulation of a real-world recommendation system.

How we built it

Each node's features are a variable number of different "words" (indices), so to give our model a structured input we first vectorize them. Specifically, we map each node to a 200-dimensional vector. This is done with a Doc2Vec model, which places two nodes closer together if their features are more similar (or related).
We use negative sampling to define weights for non-existent edges (node pairs with no known edge between them). In addition, we split our training data into three sets (a training set, a validation set and a test set) to keep track of how well our model performs.
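The sampling and splitting step can be sketched roughly as follows. This is a minimal illustration using only the standard library, not the project's actual code: the helper names `sample_negative_edges` and `split_examples`, the 0/1 labels, and the split fractions are all assumptions made for the example.

```python
import random

def sample_negative_edges(num_nodes, positive_edges, num_samples, seed=0):
    """Draw distinct node pairs that are NOT in the known edge set (hypothetical helper)."""
    rng = random.Random(seed)
    # Store undirected edges with sorted endpoints so (u, v) and (v, u) match.
    known = {tuple(sorted(edge)) for edge in positive_edges}
    max_possible = num_nodes * (num_nodes - 1) // 2 - len(known)
    if num_samples > max_possible:
        raise ValueError("not enough non-edges to sample from")
    negatives = set()
    while len(negatives) < num_samples:
        u, v = rng.sample(range(num_nodes), 2)  # two distinct node ids
        pair = (min(u, v), max(u, v))
        if pair not in known:
            negatives.add(pair)
    return sorted(negatives)

def split_examples(examples, val_frac=0.1, test_frac=0.1, seed=0):
    """Shuffle labelled examples and cut them into train/validation/test lists."""
    rng = random.Random(seed)
    shuffled = list(examples)
    rng.shuffle(shuffled)
    n_val = int(len(shuffled) * val_frac)
    n_test = int(len(shuffled) * test_frac)
    val = shuffled[:n_val]
    test = shuffled[n_val:n_val + n_test]
    train = shuffled[n_val + n_test:]
    return train, val, test

# Toy graph with 6 nodes and 5 known edges (made-up data).
positive = [(0, 1), (1, 2), (2, 3), (3, 4), (0, 4)]
negative = sample_negative_edges(num_nodes=6, positive_edges=positive, num_samples=5)

# Label existing edges 1 and sampled non-edges 0 (an assumed labelling scheme).
examples = [(u, v, 1) for u, v in positive] + [(u, v, 0) for u, v in negative]
train, val, test = split_examples(examples, val_frac=0.2, test_frac=0.2)
```

Sampling as many negatives as positives keeps the two classes balanced in this toy setup; the ratio used in the project is not stated.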
An ordinary CNN or RNN works well if we train only on node features, but it fails to make use of edge information. For this reason we use graph neural networks from PyTorch Geometric (torch_geometric). Our neural network consists of two parts: an "encoder" that embeds nodes, and a "decoder" that calculates a "score" for each pair of nodes. The encoder uses two different GNN layers, one graph attention network and one graph convolutional network, connected by a tanh activation layer.
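To make the encoder/decoder idea concrete, here is a dependency-free toy sketch. It is a simplification under several assumptions, not the project's model: both encoder layers use the plain graph-convolution propagation rule (the attention layer is replaced by this simpler stand-in), the weight matrices `w1` and `w2` are fixed illustrative numbers rather than trained parameters, and the decoder is assumed to be an inner product between node embeddings, which the description above does not specify.

```python
import math

def gcn_propagate(features, edges, weight):
    """One graph-convolution step: D^-1/2 (A + I) D^-1/2 X W, without activation."""
    n = len(features)
    neighbors = [{i} for i in range(n)]  # every node gets a self-loop
    for u, v in edges:
        neighbors[u].add(v)
        neighbors[v].add(u)
    degree = [len(nb) for nb in neighbors]
    out_dim = len(weight[0])
    # Project every node's features first: X W.
    projected = [
        [sum(x[k] * weight[k][j] for k in range(len(x))) for j in range(out_dim)]
        for x in features
    ]
    output = []
    for i in range(n):
        row = [0.0] * out_dim
        for j in neighbors[i]:
            coeff = 1.0 / math.sqrt(degree[i] * degree[j])  # symmetric normalization
            for d in range(out_dim):
                row[d] += coeff * projected[j][d]
        output.append(row)
    return output

def encode(features, edges, w1, w2):
    """Two propagation layers joined by tanh (stand-in for the GAT + GCN encoder)."""
    hidden = gcn_propagate(features, edges, w1)
    hidden = [[math.tanh(value) for value in row] for row in hidden]
    return gcn_propagate(hidden, edges, w2)

def decode(embeddings):
    """Assumed inner-product decoder: score[i][j] = <z_i, z_j>."""
    return [
        [sum(a * b for a, b in zip(zi, zj)) for zj in embeddings]
        for zi in embeddings
    ]

# Toy input: 3 nodes with 2 features each, connected in a path 0 - 1 - 2.
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
edges = [(0, 1), (1, 2)]
w1 = [[0.5, -0.2], [0.1, 0.3]]  # illustrative, untrained weights
w2 = [[1.0, 0.0], [0.0, 1.0]]
scores = decode(encode(features, edges, w1, w2))
```

In a trained model the weights would be learned so that pairs with an edge receive high scores; with the fixed weights here, `scores` only demonstrates the shapes and data flow.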
After reading the test node pairs, our project looks up the corresponding scores in the score matrix. A positive score means "likely to have an edge", while a negative score means the opposite.
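The lookup itself is simple. The sketch below assumes that node ids can be used directly as row and column indices of the score matrix; the function name `predict_edges` and the toy scores are made up for illustration.

```python
def predict_edges(score_matrix, node_pairs, threshold=0.0):
    """Return (u, v, has_edge) for each pair; a score above the threshold means 'edge'."""
    return [(u, v, score_matrix[u][v] > threshold) for u, v in node_pairs]

# Toy symmetric score matrix for 3 nodes (made-up numbers).
toy_scores = [
    [0.0, 1.7, -0.4],
    [1.7, 0.0, 0.2],
    [-0.4, 0.2, 0.0],
]
predictions = predict_edges(toy_scores, [(0, 1), (0, 2), (1, 2)])
# predictions == [(0, 1, True), (0, 2, False), (1, 2, True)]
```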
To make the project easier to use, we also built a GUI that offers edge prediction and node recommendation (printing the k nodes with the highest scores) based on the node ids and the integer k entered by the user. Some necessary data is stored locally on first use.
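The recommendation step amounts to a top-k selection over one row of the score matrix. A minimal sketch, assuming node ids index the matrix directly and that the query node itself should be excluded (the function name `recommend` is hypothetical):

```python
import heapq

def recommend(score_matrix, node_id, k):
    """Return the ids of the k nodes with the highest score against node_id."""
    row = score_matrix[node_id]
    # Pair each score with its node id and skip the query node itself.
    candidates = ((score, other) for other, score in enumerate(row) if other != node_id)
    return [other for score, other in heapq.nlargest(k, candidates)]

# Toy score rows (made-up numbers); only row 0 is queried here.
toy_scores = [
    [9.0, 0.5, 2.0, -1.0, 1.5],
    [0.5, 9.0, 0.1, 0.3, 0.2],
    [2.0, 0.1, 9.0, 0.4, 0.6],
    [-1.0, 0.3, 0.4, 9.0, 0.7],
    [1.5, 0.2, 0.6, 0.7, 9.0],
]
top_two = recommend(toy_scores, node_id=0, k=2)
# top_two == [2, 4]
```

`heapq.nlargest` avoids sorting the whole row, which helps when the graph is large and k is small.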
We also tried an HTML website as an alternative style of interface, which we can improve in the future.

Challenges we met

It was hard for us to build a GNN, as all of us had almost no experience with or knowledge of them.
Running time is long because the model and the score matrix are so large, which makes debugging slow. Even with a smaller dataset, it takes quite a long time for PyTorch to set up in PyCharm and VS Code.
After we chose to store some variables locally, the files were also quite large and difficult to share with teammates because they exceed GitHub's file size limit.
It was hard to find a way to add regularization to avoid overfitting.
We did not have enough time to read papers and compare different GNN models in PyTorch.

Accomplishments that we're proud of

Our project achieves high accuracy on the validation and test data (both split from the training data), and we successfully implemented finding the k nodes with the highest scores for a node number entered by the user.
We built a nice GUI for our project.
After the first search in the GUI, subsequent searches take much less time (within 10 seconds).

What's next for L&R

Try to deploy the project to a cloud workspace, which will save the time spent setting up the GUI and loading local variables.
Make the GUI more user-friendly.
Train multiple models and use the mean of all models' predictions as our output rather than a single prediction.
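The averaging idea in the last item could look roughly like this. It is only a sketch of one possible approach: it assumes every model produces a score matrix of the same shape, and the function name `ensemble_scores` and the toy matrices are invented for the example.

```python
def ensemble_scores(score_matrices):
    """Element-wise mean of several equally shaped score matrices."""
    n_models = len(score_matrices)
    if n_models == 0:
        raise ValueError("need at least one score matrix")
    size = len(score_matrices[0])
    return [
        [sum(matrix[i][j] for matrix in score_matrices) / n_models for j in range(size)]
        for i in range(size)
    ]

# Two toy models that disagree about the edge between nodes 0 and 1.
model_a = [[0.0, 2.0], [2.0, 0.0]]
model_b = [[0.0, -1.0], [-1.0, 0.0]]
averaged = ensemble_scores([model_a, model_b])
# averaged == [[0.0, 0.5], [0.5, 0.0]]
```

The averaged matrix can then be thresholded or ranked exactly like a single model's scores.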

Built With

numpy & matplotlib: basic vector calculation and plotting functions
Qt Designer: build the GUI
Prettier: build the HTML
gensim: build the Doc2Vec model to vectorize nodes
pandas: load CSV files into our program
PyTorch Geometric: GNN-related functions
other tools: Jupyter, PyCharm, VS Code
