
Improve Performance when Creating Large Graphs from Pandas DataFrames #419

@fwitter

Description


When creating large graphs from Pandas DataFrames, the helper method Graph.DataFrame becomes painfully slow. For my current project, I need to create a graph with 700k vertices (each carrying many attributes) and 2.4 million edges; this takes more than 3 hours with the current implementation of Graph.DataFrame.

I have run some experiments to analyze the runtime complexity of this method. Each experiment consists of the following steps:

  • Create a ring graph of given size (equal number of vertices and edges)
  • Add some dummy attributes to vertices and edges
  • Get the edge and vertex DataFrames of the graph (mapping vertex IDs to names if use_vids=False)
  • Reconstruct graph from DataFrames using Graph.DataFrame

The experiment code is given below.

```python
from igraph import Graph

def run_experiment(n: int, use_vids: bool):
    # Create a ring graph with an equal number of vertices and edges.
    test_g = Graph.Ring(n=n, directed=True)
    # Add some dummy attributes to vertices and edges.
    test_g.vs["name"] = [f"v{i}" for i in range(test_g.vcount())]
    test_g.vs["x"] = [float(i) for i in range(test_g.vcount())]
    test_g.es["w"] = [1.0] * test_g.ecount()
    # Export the graph as edge and vertex DataFrames.
    edges = test_g.get_edge_dataframe()
    vertices = test_g.get_vertex_dataframe()
    if not use_vids:
        # Make the edge list refer to vertices by name instead of ID.
        edges.iloc[:, 0] = edges.iloc[:, 0].map(vertices["name"])
        edges.iloc[:, 1] = edges.iloc[:, 1].map(vertices["name"])
    # Reconstruct the graph from the DataFrames.
    g = Graph.DataFrame(edges, vertices=vertices, use_vids=use_vids)
    return g
```

I tracked the runtime of Graph.DataFrame with line_profiler while varying the graph size n and the DataFrame format (use_vids). The observed runtimes are given in the table below.

|                | n=1,000 | n=10,000 | n=100,000 | n=1,000,000 |
| -------------- | ------- | -------- | --------- | ----------- |
| use_vids=True  | 0.4 s   | 3.6 s    | 35.9 s    | 365.7 s     |
| use_vids=False | 0.4 s   | 4.7 s    | 157.8 s   | 12,949.3 s  |

If use_vids=True, the runtime grows linearly with the number of vertices, edges and their attributes. The critical parts of the code are the for-loops for transferring vertex and edge attributes from the DataFrames to the graph.
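
The per-row loop overhead can be illustrated in isolation. A column-wise transfer pulls each attribute out of the DataFrame once as a list, which in igraph could then be assigned in bulk via `g.vs[col] = values`. The sketch below contrasts the two access patterns on hypothetical example data (not the actual implementation in Graph.DataFrame):

```python
import pandas as pd

# Hypothetical vertex table standing in for the DataFrames in the issue.
vertices = pd.DataFrame({"name": [f"v{i}" for i in range(4)],
                         "x": [float(i) for i in range(4)]})

# Slow pattern: one Python-level operation per cell, as in a per-row loop.
slow = {col: [row[col] for _, row in vertices.iterrows()]
        for col in vertices.columns}

# Fast pattern: extract each column once, in bulk; the resulting lists
# could then be assigned to an igraph Graph with g.vs[col] = values.
fast = {col: vertices[col].tolist() for col in vertices.columns}
```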
If use_vids=False, the runtime additionally grows quadratically with the number of vertices. This growth originates from a call to np.setdiff1d used to validate the vertices DataFrame against the edges DataFrame. For larger graphs, this DataFrame validation easily dominates the runtime.
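
The validation itself does not require a sort-based array difference; a hash-based membership check over the vertex names runs in expected linear time. A minimal sketch of such a check, using made-up example data (not the code from the PR):

```python
import pandas as pd

# Hypothetical tables: one edge refers to a vertex ("d") that is missing
# from the vertex table, which the validation should detect.
vertices = pd.DataFrame({"name": ["a", "b", "c"]})
edges = pd.DataFrame({"source": ["a", "b"], "target": ["b", "d"]})

# Hash-based set difference, expected O(n), instead of np.setdiff1d.
known = set(vertices["name"])
referenced = set(edges["source"]) | set(edges["target"])
missing = referenced - known  # vertices referenced by edges but undefined
```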
For details, see attached profiling report.

With PR #418, I propose a reimplementation of Graph.DataFrame that is ~500 times faster than the current implementation and shows the expected (near-)linear growth. The new runtimes for the same experiments are shown in the table below.

|                | n=1,000 | n=10,000 | n=100,000 | n=1,000,000 |
| -------------- | ------- | -------- | --------- | ----------- |
| use_vids=True  | 9.5 ms  | 14.6 ms  | 98.0 ms   | 677.4 ms    |
| use_vids=False | 10.4 ms | 18.1 ms  | 145.5 ms  | 1,839.6 ms  |
