Improve Performance when Creating Large Graphs from Pandas DataFrames #419
Description
When creating large graphs from Pandas DataFrames, the helper method Graph.DataFrame becomes painfully slow. For my current project, I need to create a graph with 700k vertices and 2.4 million edges, with many attributes on both. This takes more than 3 hours with the current implementation of Graph.DataFrame.
I have run some experiments to analyze the runtime complexity of this method. Each experiment consists of the following steps:
- Create a ring graph of a given size (equal numbers of vertices and edges)
- Add some dummy attributes to the vertices and edges
- Get the edge and vertex DataFrames of the graph (mapping vertex IDs to names if use_vids=False)
- Reconstruct the graph from the DataFrames using Graph.DataFrame
The experiment code is given below.
```python
from igraph import Graph


def run_experiment(n: int, use_vids: bool):
    test_g = Graph.Ring(n=n, directed=True)
    test_g.vs["name"] = [f"v{i}" for i in range(test_g.vcount())]
    test_g.vs["x"] = [float(i) for i in range(test_g.vcount())]
    test_g.es["w"] = [1.0] * test_g.ecount()
    edges = test_g.get_edge_dataframe()
    vertices = test_g.get_vertex_dataframe()
    if not use_vids:
        edges.iloc[:, 0] = edges.iloc[:, 0].map(vertices["name"])
        edges.iloc[:, 1] = edges.iloc[:, 1].map(vertices["name"])
    g = Graph.DataFrame(edges, vertices=vertices, use_vids=use_vids)
    return g
```

I tracked the runtime of Graph.DataFrame using line_profiler while varying the graph size n and the DataFrame format (use_vids). The observed runtimes are given in the table below.
| | n=1,000 | n=10,000 | n=100,000 | n=1,000,000 |
|---|---|---|---|---|
| use_vids=True | 0.4 s | 3.6 s | 35.9 s | 365.7 s |
| use_vids=False | 0.4 s | 4.7 s | 157.8 s | 12,949.3 s |
With use_vids=True, the runtime grows linearly with the number of vertices, edges, and attributes. The critical parts of the code are the for-loops that transfer vertex and edge attributes from the DataFrames to the graph.
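To illustrate the per-row overhead, here is a minimal pandas-only sketch; the DataFrame below is made-up stand-in data, not igraph's internals:

```python
import pandas as pd

# Hypothetical stand-in for a vertex DataFrame with attribute columns.
vertices = pd.DataFrame({
    "name": [f"v{i}" for i in range(5)],
    "x": [float(i) for i in range(5)],
})

# Row-wise transfer, analogous to a per-vertex for-loop:
# one indexed lookup per cell.
names_slow = [vertices.iloc[i]["name"] for i in range(len(vertices))]

# Column-wise transfer: one bulk conversion per attribute column,
# which avoids the per-row Python overhead.
names_fast = vertices["name"].tolist()

assert names_slow == names_fast
```

Both variants produce the same list, but the column-wise form touches each column once instead of once per row.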
With use_vids=False, the runtime additionally grows quadratically with the number of vertices. This growth originates from a call to np.setdiff1d used to validate the vertices DataFrame against the edges DataFrame. For larger graphs, this DataFrame validation easily dominates the runtime.
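For intuition, the validation can be sketched as follows. This is a minimal numpy sketch with made-up data; the exact call site inside Graph.DataFrame may differ:

```python
import numpy as np

# Made-up vertex names and edge endpoints for illustration.
vertex_names = np.array([f"v{i}" for i in range(1_000)])
edge_endpoints = np.array([f"v{i}" for i in range(1_000)])

# setdiff1d-style validation: find endpoints with no matching vertex.
# setdiff1d sorts its inputs, which becomes costly on large string arrays.
missing = np.setdiff1d(edge_endpoints, vertex_names)
assert missing.size == 0

# A hash-based membership check is a near-linear alternative.
known = set(vertex_names)
missing_fast = [v for v in edge_endpoints if v not in known]
assert not missing_fast
```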
For details, see the attached profiling report.
In PR #418, I propose a reimplementation of Graph.DataFrame that is ~500 times faster than the current implementation and shows the expected (near-)linear growth. The new runtimes for the same experiments are shown in the table below.
| | n=1,000 | n=10,000 | n=100,000 | n=1,000,000 |
|---|---|---|---|---|
| use_vids=True | 9.5 ms | 14.6 ms | 98.0 ms | 677.4 ms |
| use_vids=False | 10.4 ms | 18.1 ms | 145.5 ms | 1,839.6 ms |
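The timings above were collected with line_profiler. For a rough reproduction without it, a plain perf_counter wrapper around the experiment is enough; the sketch below uses a trivial stand-in function in place of run_experiment:

```python
import time


def stand_in(n: int):
    # Placeholder for run_experiment(n, use_vids); swap in the real call.
    return list(range(n))


start = time.perf_counter()
stand_in(100_000)
elapsed = time.perf_counter() - start
print(f"stand_in(100000) took {elapsed:.3f} s")
```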