
Improve Performance when Creating Large Graphs from Pandas DataFrames #419

@fwitter

Description


When creating large graphs from Pandas DataFrames, the helper method Graph.DataFrame becomes painfully slow. For my current project, I need to create a graph with 700k vertices (each carrying many attributes) and 2.4 million edges; this takes more than 3 hours with the current implementation of Graph.DataFrame.

I have run some experiments to analyze the runtime complexity of this method. Each experiment consists of the following steps:

  • Create a ring graph of given size (equal number of vertices and edges)
  • Add some dummy attributes to vertices and edges
  • Get the edge and vertex DataFrames of the graph (mapping vertex IDs to names if use_vids=False)
  • Reconstruct graph from DataFrames using Graph.DataFrame

The experiment code is given below.

```python
from igraph import Graph

def run_experiment(n: int, use_vids: bool):
    # Create a ring graph with an equal number of vertices and edges.
    test_g = Graph.Ring(n=n, directed=True)
    # Add some dummy attributes to vertices and edges.
    test_g.vs["name"] = [f"v{i}" for i in range(test_g.vcount())]
    test_g.vs["x"] = [float(i) for i in range(test_g.vcount())]
    test_g.es["w"] = [1.0] * test_g.ecount()
    # Export the graph as edge and vertex DataFrames.
    edges = test_g.get_edge_dataframe()
    vertices = test_g.get_vertex_dataframe()
    if not use_vids:
        # Make the edge list refer to vertices by name instead of ID.
        edges.iloc[:, 0] = edges.iloc[:, 0].map(vertices["name"])
        edges.iloc[:, 1] = edges.iloc[:, 1].map(vertices["name"])
    # Reconstruct the graph from the DataFrames.
    g = Graph.DataFrame(edges, vertices=vertices, use_vids=use_vids)
    return g
```

I tracked the runtime of Graph.DataFrame with line_profiler while varying the graph size n and the DataFrame format (use_vids). The observed runtimes are given in the table below.

|                | n=1,000 | n=10,000 | n=100,000 | n=1,000,000 |
| -------------- | ------- | -------- | --------- | ----------- |
| use_vids=True  | 0.4 s   | 3.6 s    | 35.9 s    | 365.7 s     |
| use_vids=False | 0.4 s   | 4.7 s    | 157.8 s   | 12,949.3 s  |

If use_vids=True, the runtime grows linearly with the number of vertices, edges and their attributes. The critical parts of the code are the for-loops for transferring vertex and edge attributes from the DataFrames to the graph.
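
The per-row loop overhead can be illustrated in isolation. A column-wise transfer pulls each attribute out of the DataFrame once as a list, which in igraph could then be assigned in bulk via `g.vs[col] = values`. The sketch below contrasts the two access patterns on hypothetical example data (not the actual implementation in Graph.DataFrame):

```python
import pandas as pd

# Hypothetical vertex table standing in for the DataFrames in the issue.
vertices = pd.DataFrame({"name": [f"v{i}" for i in range(4)],
                         "x": [float(i) for i in range(4)]})

# Slow pattern: one Python-level operation per cell, as in a per-row loop.
slow = {col: [row[col] for _, row in vertices.iterrows()]
        for col in vertices.columns}

# Fast pattern: extract each column once, in bulk; the resulting lists
# could then be assigned to an igraph Graph with g.vs[col] = values.
fast = {col: vertices[col].tolist() for col in vertices.columns}
```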
If use_vids=False, the runtime additionally grows quadratically with the number of vertices. This growth originates from a call to np.setdiff1d used to validate the vertices DataFrame against the edges DataFrame. For larger graphs, this DataFrame validation easily dominates the runtime.
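
The validation itself does not require a sort-based array difference; a hash-based membership check over the vertex names runs in expected linear time. A minimal sketch of such a check, using made-up example data (not the code from the PR):

```python
import pandas as pd

# Hypothetical tables: one edge refers to a vertex ("d") that is missing
# from the vertex table, which the validation should detect.
vertices = pd.DataFrame({"name": ["a", "b", "c"]})
edges = pd.DataFrame({"source": ["a", "b"], "target": ["b", "d"]})

# Hash-based set difference, expected O(n), instead of np.setdiff1d.
known = set(vertices["name"])
referenced = set(edges["source"]) | set(edges["target"])
missing = referenced - known  # vertices referenced by edges but undefined
```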
For details, see attached profiling report.

With PR #418, I propose a reimplementation of Graph.DataFrame that is ~500 times faster than the current implementation and shows the expected (near-)linear growth. The new runtimes for the same experiments are shown in the table below.

|                | n=1,000 | n=10,000 | n=100,000 | n=1,000,000 |
| -------------- | ------- | -------- | --------- | ----------- |
| use_vids=True  | 9.5 ms  | 14.6 ms  | 98.0 ms   | 677.4 ms    |
| use_vids=False | 10.4 ms | 18.1 ms  | 145.5 ms  | 1,839.6 ms  |
