Skip to content

High-Performance Bulk Edge Creation #3621

@lvca

Description

@lvca

Summary

New engine-level GraphBatchImporter that provides 6-11x speedup over the standard edge creation API for bulk graph loading scenarios. Designed to close the performance gap with Neo4j's batch importer.

Problem

Creating edges one-at-a-time through the standard Vertex API (even with transaction batching) is bottlenecked by:

  • Random I/O to edge segments (each edge touches source + destination vertex)
  • Per-record overhead: lock acquisition, validation, event callbacks, page space search
  • Unsorted access patterns causing page cache thrashing at scale

Solution

  GraphBatchImporter buffers edges in flat primitive arrays, then flushes them in optimized sorted passes:

  try (GraphBatchImporter importer = GraphBatchImporter.builder(db)
      .withExpectedEdgeCount(1_000_000)  // auto-tunes batch size
      .withLightEdges(false)
      .withWAL(false)
      .build()) {

    RID[] vertices = importer.createVertices("Person", 100_000);

    for (Edge e : edgeSource)
      importer.newEdge(vertices[e.src], "KNOWS", vertices[e.dst],
          "weight", e.weight, "timestamp", e.ts);
  }
  // All edges connected (both OUT and IN) when close() returns

Key Optimizations

  ┌───────────────────────────┬───────────────┬──────────────────────────────────────────────────────────────────────────────────────┐
  │       Optimization        │    Impact     │                                     Description                                      │
  ├───────────────────────────┼───────────────┼──────────────────────────────────────────────────────────────────────────────────────┤
  │ Sorted flush              │ ~2-3x         │ Outgoing edges sorted by source vertex → sequential segment access                   │
  ├───────────────────────────┼───────────────┼──────────────────────────────────────────────────────────────────────────────────────┤
  │ Deferred incoming edges   │ ~1.5x         │ IN edges accumulated across flushes, connected in one sorted pass at close()         │
  ├───────────────────────────┼───────────────┼──────────────────────────────────────────────────────────────────────────────────────┤
  │ Vectorized segment writes │ ~1.3x         │ addManyAtEndDirect() writes N edges with single getUsed/setUsed pair                 │
  ├───────────────────────────┼───────────────┼──────────────────────────────────────────────────────────────────────────────────────┤
  │ Exact-sized segments      │ ~1.2x         │ Pre-compute VLQ bytes needed → no wasted space in segments                           │
  ├───────────────────────────┼───────────────┼──────────────────────────────────────────────────────────────────────────────────────┤
  │ Fill-then-persist         │ ~1.2x         │ Segments filled in memory before createRecord (1 I/O vs 2)                           │
  ├───────────────────────────┼───────────────┼──────────────────────────────────────────────────────────────────────────────────────┤
  │ Deferred vertex updates   │ ~1.4x         │ Head chunk pointers batch-updated at close() instead of per-edge                     │
  ├───────────────────────────┼───────────────┼──────────────────────────────────────────────────────────────────────────────────────┤
  │ Bulk record creation      │ ~1.3x (props) │ LocalBucket.createRecordsBulk() bypasses per-record lock/validate/findAvailableSpace │
  ├───────────────────────────┼───────────────┼──────────────────────────────────────────────────────────────────────────────────────┤
  │ Auto-tuned batch size     │ ~1.5-2x       │ withExpectedEdgeCount() sets optimal batch = edgeCount/5 (100K-5M range)             │
  └───────────────────────────┴───────────────┴──────────────────────────────────────────────────────────────────────────────────────┘

Benchmark Results

500K edges, 50K vertices:

  ┌──────────────────────────────────┬───────────┬───────────┬─────────┐
  │              Method              │ Time (ms) │ Edges/sec │ Speedup │
  ├──────────────────────────────────┼───────────┼───────────┼─────────┤
  │ Standard API (tx/1000)           │ 12,162    │ 41K       │ 1.0x    │
  ├──────────────────────────────────┼───────────┼───────────┼─────────┤
  │ Old GraphImporter (integration)  │ 8,623     │ 58K       │ 1.4x    │
  ├──────────────────────────────────┼───────────┼───────────┼─────────┤
  │ GraphBatchImporter (light edges) │ 2,031     │ 246K      │ 6.0x    │
  ├──────────────────────────────────┼───────────┼───────────┼─────────┤
  │ Standard API + properties        │ 14,267    │ 35K       │ 1.0x    │
  ├──────────────────────────────────┼───────────┼───────────┼─────────┤
  │ GraphBatchImporter + properties  │ 2,149     │ 233K      │ 6.6x    │
  └──────────────────────────────────┴───────────┴───────────┴─────────┘

Scaling with batch size (500K edges):

  ┌────────────┬───────────┬─────────┐
  │ Batch Size │ Edges/sec │ Speedup │
  ├────────────┼───────────┼─────────┤
  │ 50K        │ 175K      │ 4.0x    │
  ├────────────┼───────────┼─────────┤
  │ 100K       │ 320K      │ 7.3x    │
  ├────────────┼───────────┼─────────┤
  │ 200K       │ 366K      │ 8.4x    │
  ├────────────┼───────────┼─────────┤
  │ 400K       │ 421K      │ 9.6x    │
  ├────────────┼───────────┼─────────┤
  │ 800K       │ 504K      │ 11.5x   │
  └────────────┴───────────┴─────────┘

10M edges, 1M vertices (best batch = 2M):

  ┌────────────┬───────────┬────────────────────┐
  │ Batch Size │ Edges/sec │      Speedup       │
  ├────────────┼───────────┼────────────────────┤
  │ 2M         │ 153K      │ best at this scale │
  └────────────┴───────────┴────────────────────┘

API

Builder Options

  ┌────────────────────────────────────┬─────────┬──────────────────────────────────────────────────────┐
  │               Method               │ Default │                     Description                      │
  ├────────────────────────────────────┼─────────┼──────────────────────────────────────────────────────┤
  │ withBatchSize(int)                 │ 100,000 │ Max edges buffered before auto-flush                 │
  ├────────────────────────────────────┼─────────┼──────────────────────────────────────────────────────┤
  │ withExpectedEdgeCount(int)         │ —       │ Auto-tunes batch size to edgeCount/5 (100K-5M)       │
  ├────────────────────────────────────┼─────────┼──────────────────────────────────────────────────────┤
  │ withLightEdges(boolean)            │ true    │ Use light edges when no properties                   │
  ├────────────────────────────────────┼─────────┼──────────────────────────────────────────────────────┤
  │ withBidirectional(boolean)         │ true    │ Connect IN edges (deferred to close())               │
  ├────────────────────────────────────┼─────────┼──────────────────────────────────────────────────────┤
  │ withWAL(boolean)                   │ false   │ Write-Ahead Logging during import                    │
  ├────────────────────────────────────┼─────────┼──────────────────────────────────────────────────────┤
  │ withCommitEvery(int)               │ 50,000  │ Edges per commit within a flush (0 = once per flush) │
  ├────────────────────────────────────┼─────────┼──────────────────────────────────────────────────────┤
  │ withEdgeListInitialSize(int)       │ 2048    │ Initial edge segment size in bytes                   │
  ├────────────────────────────────────┼─────────┼──────────────────────────────────────────────────────┤
  │ withPreAllocateEdgeChunks(boolean) │ true    │ Pre-allocate edge segments at vertex creation        │
  ├────────────────────────────────────┼─────────┼──────────────────────────────────────────────────────┤
  │ withParallelFlush(boolean)         │ false   │ Parallel edge connection (experimental)              │
  └────────────────────────────────────┴─────────┴──────────────────────────────────────────────────────┘

Key Methods

  • createVertices(typeName, count) — Bulk vertex creation, returns RID array
  • newEdge(srcRID, edgeType, dstRID, properties...) — Buffer an edge (auto-flushes at batch size)
  • flush() — Manual flush of buffered edges
  • close() — Flushes remaining edges, connects all deferred IN edges, batch-updates vertex head chunks

Files Changed

  • New: engine/.../graph/GraphBatchImporter.java (1704 lines) — Main importer
  • New: engine/.../graph/GraphBatchImporterTest.java (616 lines) — 8 unit tests
  • New: integration/.../GraphImporterBenchmarkTest.java (445 lines) — 3-way benchmark
  • Modified: engine/.../engine/LocalBucket.java — Added createRecordsBulk() for bulk page writes
  • Modified: engine/.../graph/EdgeSegment.java — Added addAtEndDirect(), addManyAtEndDirect() interfaces
  • Modified: engine/.../graph/MutableEdgeSegment.java — Implemented vectorized segment writes

Recommended Batch Sizes

  ┌─────────────┬─────────────────────────────┬──────────────────┐
  │ Graph Size  │         Recommended         │ RAM for Importer │
  ├─────────────┼─────────────────────────────┼──────────────────┤
  │ < 1M edges  │ Use withExpectedEdgeCount() │ ~200MB           │
  ├─────────────┼─────────────────────────────┼──────────────────┤
  │ 1-5M edges  │ 500K-1M                     │ ~300MB           │
  ├─────────────┼─────────────────────────────┼──────────────────┤
  │ 5-20M edges │ 1-2M                        │ ~500MB           │
  ├─────────────┼─────────────────────────────┼──────────────────┤
  │ 20M+ edges  │ 2-5M                        │ ~1GB             │
  └─────────────┴─────────────────────────────┴──────────────────┘

Future Optimizations

  • Pre-serialized edge template: For same-schema edges, pre-compute binary layout once, stamp values per edge
  • Direct page cursor: Maintain page+offset cursor instead of per-record findAvailableSpace
  • Buffer pool for serialization: Avoid copyOfContent() per edge by using a buffer pool
  • Parallel sort+flush: Sort and connect edges across multiple threads partitioned by bucket

Metadata

Metadata

Assignees

Projects

No projects

Milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions