Add GraphBatchLoad client-streaming RPC to gRPC module #3678

@robfrank

Description

Add a new GraphBatchLoad client-streaming gRPC RPC that exposes the same GraphBatch-based bulk graph loading as the HTTP POST /api/v1/batch/{database} endpoint (see #3675).

Motivation

The HTTP batch endpoint allows high-performance bulk loading of vertices and edges, but gRPC clients have no equivalent. The existing gRPC Insert RPCs (BulkInsert, InsertStream, InsertBidirectional) are generic record-insert operations with no graph awareness — no temporary ID mapping, no vertex→edge linking, no GraphBatch tuning parameters.

Design

RPC signature: rpc GraphBatchLoad (stream GraphBatchChunk) returns (GraphBatchResult)

Client-streaming: the client sends one or more GraphBatchChunk messages, then receives a single GraphBatchResult when the import completes.
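To make the client-streaming shape concrete, here is a minimal sketch of how a client might split records into chunks. It uses plain dicts as stand-ins for the generated GraphBatchChunk messages; the function name `chunk_stream` and the `chunk_size` parameter are hypothetical and not part of the proposal.

```python
from itertools import islice
from typing import Any, Dict, Iterable, Iterator, List, Optional


def chunk_stream(
    database: str,
    records: Iterable[Dict[str, Any]],
    chunk_size: int = 1000,
    options: Optional[Dict[str, Any]] = None,
) -> Iterator[Dict[str, Any]]:
    """Yield dict stand-ins for GraphBatchChunk messages.

    Only the first chunk carries `database` (and `options`, if given);
    later chunks carry records only.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    it = iter(records)
    first = True
    while True:
        batch: List[Dict[str, Any]] = list(islice(it, chunk_size))
        if not batch and not first:
            return  # input ended exactly on a chunk boundary
        chunk: Dict[str, Any] = {"records": batch}
        if first:
            chunk["database"] = database
            if options:
                chunk["options"] = options
            first = False
        yield chunk
        if len(batch) < chunk_size:
            return  # short (or empty) chunk means the input is exhausted
```

With a real generated stub, a generator like this would typically be passed as the request iterator of the client-streaming call, and the single GraphBatchResult would be the call's return value.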

Proto messages

message GraphBatchOptions {
  int32          batch_size               = 1;  // default 100000
  bool           light_edges              = 2;  // default false
  bool           wal                      = 3;  // default false
  optional bool  parallel_flush           = 4;  // default true (unset = true)
  optional bool  pre_allocate_edge_chunks = 5;  // default true (unset = true)
  int32          edge_list_initial_size   = 6;  // default 2048
  optional bool  bidirectional            = 7;  // default true (unset = true)
  int32          commit_every             = 8;  // default 50000
  int32          expected_edge_count      = 9;  // default 0
}

message GraphBatchRecord {
  enum Kind { VERTEX = 0; EDGE = 1; }
  Kind   kind      = 1;
  string type_name = 2;
  string temp_id   = 3;  // vertex temp ID (for edge references)
  string from_ref  = 4;  // edge source: temp ID or "#bucket:pos"
  string to_ref    = 5;  // edge target: temp ID or "#bucket:pos"
  map<string, GrpcValue> properties = 6;
}

message GraphBatchChunk {
  string database                    = 1;
  DatabaseCredentials credentials    = 2;
  GraphBatchOptions options          = 3;
  repeated GraphBatchRecord records  = 4;
}

message GraphBatchResult {
  int64 vertices_created         = 1;
  int64 edges_created            = 2;
  int64 elapsed_ms               = 3;
  map<string, string> id_mapping = 4;  // temp_id → RID
}

Protocol

  1. First chunk must contain database (and optionally credentials and options)
  2. All VERTEX records must appear before any EDGE records (across all chunks)
  3. Vertices can have temporary IDs (temp_id) that edges reference via from_ref/to_ref
  4. Edges can also reference existing database RIDs directly (e.g., #12:0)
  5. The response includes an id_mapping of temp IDs to assigned RIDs
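The ordering and reference rules above can be checked client-side before streaming. The following is a sketch only: it works on dict stand-ins for the messages, the name `validate_chunks` is hypothetical, and the RID regular expression is an assumption based on the `#bucket:pos` form shown in the proto comments. How the server itself tells temp IDs from RIDs is not specified in this issue.

```python
import re
from typing import Any, Dict, Iterable, List

# Assumption: RIDs look like "#<bucket>:<position>", e.g. "#12:0".
RID_PATTERN = re.compile(r"^#\d+:\d+$")


def validate_chunks(chunks: Iterable[Dict[str, Any]]) -> List[str]:
    """Return a list of protocol violations (empty if the stream looks valid)."""
    errors: List[str] = []
    seen_edge = False
    temp_ids = set()
    for index, chunk in enumerate(chunks):
        # Rule 1: the first chunk must name the database.
        if index == 0 and not chunk.get("database"):
            errors.append("first chunk must contain database")
        for record in chunk.get("records", []):
            if record["kind"] == "VERTEX":
                # Rule 2: no VERTEX may follow an EDGE, across all chunks.
                if seen_edge:
                    errors.append(f"chunk {index}: VERTEX after EDGE")
                if record.get("temp_id"):
                    temp_ids.add(record["temp_id"])
            else:
                seen_edge = True
                # Rules 3 and 4: each ref is a known temp ID or a RID.
                for field in ("from_ref", "to_ref"):
                    ref = record.get(field, "")
                    if ref not in temp_ids and not RID_PATTERN.match(ref):
                        errors.append(
                            f"chunk {index}: unresolved {field} {ref!r}"
                        )
    return errors
```

A client could run this over its chunk list and refuse to open the stream if any errors are returned, which avoids a partially committed import caused by a malformed batch.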

Tuning parameters

All GraphBatch.Builder parameters from the HTTP endpoint are exposed via GraphBatchOptions:

| Parameter | Default | Description |
|---|---|---|
| batch_size | 100000 | Max edges buffered before auto-flush |
| light_edges | false | Property-less edges stored as connectivity only |
| wal | false | Enable Write-Ahead Logging |
| parallel_flush | true | Parallelize edge connection across async threads |
| pre_allocate_edge_chunks | true | Pre-allocate edge segments on vertex creation |
| edge_list_initial_size | 2048 | Initial segment size in bytes (64–8192) |
| bidirectional | true | Connect both outgoing and incoming edges |
| commit_every | 50000 | Edges per sub-transaction within a flush |
| expected_edge_count | 0 | Hint for auto-tuning batch size |
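The defaults above interact with proto3 field presence: the `optional bool` fields can distinguish "unset" from an explicit `false`, while the plain `int32` fields cannot distinguish "unset" from `0`. The sketch below shows one way the effective values might be resolved. It assumes that a zero integer means "use the default" and that an out-of-range `edge_list_initial_size` is rejected; neither behavior is stated in this issue, and the name `effective_options` is hypothetical.

```python
from typing import Any, Dict

# Defaults as listed in the tuning-parameter table.
DEFAULTS: Dict[str, Any] = {
    "batch_size": 100_000,
    "light_edges": False,
    "wal": False,
    "parallel_flush": True,
    "pre_allocate_edge_chunks": True,
    "edge_list_initial_size": 2048,
    "bidirectional": True,
    "commit_every": 50_000,
    "expected_edge_count": 0,
}


def effective_options(sent: Dict[str, Any]) -> Dict[str, Any]:
    """Merge client-supplied options over the documented defaults.

    `None` models an unset `optional bool`; an explicit False is kept.
    Assumption: an integer value of 0 is treated as "unset".
    """
    resolved = dict(DEFAULTS)
    for name, value in sent.items():
        if name not in DEFAULTS:
            raise KeyError(f"unknown option: {name}")
        if isinstance(DEFAULTS[name], bool):
            if value is not None:
                resolved[name] = bool(value)
        elif value:
            resolved[name] = int(value)
    # Assumption: values outside the documented 64..8192 range are rejected.
    if not 64 <= resolved["edge_list_initial_size"] <= 8192:
        raise ValueError("edge_list_initial_size must be within 64..8192")
    return resolved
```

For example, `effective_options({"parallel_flush": False, "batch_size": 0})` keeps `parallel_flush` disabled but falls back to the default `batch_size`, which is why the three default-true flags are declared `optional` in the proto.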

Important notes

  • The RPC is NOT atomic by design (same as the HTTP batch endpoint). GraphBatch commits internally in chunks for maximum throughput, so a failure partway through can leave earlier chunks committed.
  • For very large batches with many temp IDs, the id_mapping response may exceed the default gRPC message size limit (4 MB). Callers should increase maxInboundMessageSize or avoid temp IDs when mapping is not needed.
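To judge whether a response is likely to hit the 4 MB limit, the encoded size of id_mapping can be estimated from the protobuf wire format, where each map entry is a length-delimited sub-message with the key in field 1 and the value in field 2. The helper names below are hypothetical, and the estimate ignores the three small int64 counters in GraphBatchResult.

```python
from typing import Dict

DEFAULT_GRPC_LIMIT = 4 * 1024 * 1024  # 4 MB default inbound message limit


def varint_len(n: int) -> int:
    """Number of bytes protobuf needs to encode n as a varint."""
    length = 1
    while n >= 0x80:
        n >>= 7
        length += 1
    return length


def map_entry_size(key: str, value: str) -> int:
    """Encoded size of one map<string, string> entry in field number 4."""
    k = len(key.encode("utf-8"))
    v = len(value.encode("utf-8"))
    # Inner entry: 1-byte tag + length + bytes, for key and for value.
    inner = 1 + varint_len(k) + k + 1 + varint_len(v) + v
    # Outer: 1-byte tag for field 4 + length prefix + inner entry.
    return 1 + varint_len(inner) + inner


def estimate_mapping_bytes(mapping: Dict[str, str]) -> int:
    """Approximate encoded size of the whole id_mapping field."""
    return sum(map_entry_size(k, v) for k, v in mapping.items())


def fits_default_limit(mapping: Dict[str, str]) -> bool:
    """True if the mapping alone stays within the default 4 MB limit."""
    return estimate_mapping_bytes(mapping) <= DEFAULT_GRPC_LIMIT
```

Under this estimate, 10-byte temp IDs mapped to 10-byte RIDs cost 26 bytes per entry, so the default limit is reached at roughly 160,000 vertices with temp IDs. In grpc-java the limit is raised with `maxInboundMessageSize` on the channel builder; in Python clients the corresponding channel option is `grpc.max_receive_message_length`.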
