
Batch HTTP endpoint #3675

@lvca

Description


Not everybody can use Java embedded to do fast batch graph imports, so we should add a new streaming HTTP endpoint to load large numbers of vertices and edges in CSV and JSONL format.

POST /api/v1/batch/{database}

Should support two input formats: JSONL (newline-delimited JSON) and CSV. Both are streamed — the server never loads the entire payload into memory, so you can push millions of records in a single request.
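On the client side, streaming means the payload can be produced lazily as well. A minimal sketch of what that could look like with a generator (the helper name `jsonl_stream` is hypothetical, not part of the endpoint):

```python
import json

def jsonl_stream(records):
    """Lazily encode records as newline-delimited JSON, one line each,
    so an HTTP client can upload them chunked instead of buffering all of them."""
    for record in records:
        yield (json.dumps(record, separators=(",", ":")) + "\n").encode("utf-8")

# A client supporting chunked uploads can consume the generator directly,
# e.g. requests.post(url, data=jsonl_stream(rows), headers=...).
```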

JSONL Format

{"@type":"vertex","@class":"Person","@id":"t1","name":"Alice","age":30}
{"@type":"vertex","@class":"Person","@id":"t2","name":"Bob","age":25}
{"@type":"edge","@class":"KNOWS","@from":"t1","@to":"t2","since":2020}

CSV Format

@type,@class,@id,name,age
vertex,Person,t1,Alice,30
vertex,Person,t2,Bob,25
---
@type,@class,@from,@to,since
edge,KNOWS,t1,t2,2020
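For illustration, the two-section layout above can be generated with Python's csv module (the file layout is per this issue; nothing here is an ArcadeDB API):

```python
import csv
import io

# Build the vertex section, the '---' separator, then the edge section.
buf = io.StringIO()
writer = csv.writer(buf, lineterminator="\n")
writer.writerow(["@type", "@class", "@id", "name", "age"])
writer.writerow(["vertex", "Person", "t1", "Alice", 30])
writer.writerow(["vertex", "Person", "t2", "Bob", 25])
buf.write("---\n")
writer.writerow(["@type", "@class", "@from", "@to", "since"])
writer.writerow(["edge", "KNOWS", "t1", "t2", 2020])
payload = buf.getvalue()
```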

In both formats, vertices come first, then edges. Vertices can have temporary IDs (@id) that edges reference via @from/@to. Edges can also reference existing database RIDs directly (e.g., #12:0).
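A minimal JSONL payload mixing a temporary ID with an existing RID might look like this (the RID #12:0 is just a placeholder for a record already in the database):

```python
import json

records = [
    {"@type": "vertex", "@class": "Person", "@id": "t1", "name": "Alice"},
    # @to references a pre-existing record by RID instead of a temporary ID
    {"@type": "edge", "@class": "KNOWS", "@from": "t1", "@to": "#12:0"},
]
payload = "\n".join(json.dumps(r, separators=(",", ":")) for r in records) + "\n"
```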

Temporary ID Mapping

The response includes an idMapping object so you know what RIDs were assigned:

{
  "verticesCreated": 2,
  "edgesCreated": 1,
  "elapsedMs": 42,
  "idMapping": {"t1": "#9:0", "t2": "#9:1"}
}
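Client code can then translate its temporary IDs into RIDs for follow-up queries. A sketch against the sample response above (the helper `rid_for` is hypothetical):

```python
response = {
    "verticesCreated": 2,
    "edgesCreated": 1,
    "elapsedMs": 42,
    "idMapping": {"t1": "#9:0", "t2": "#9:1"},
}

def rid_for(temp_id, resp):
    """Look up the RID assigned to a temporary ID, failing loudly if absent."""
    mapping = resp["idMapping"]
    if temp_id not in mapping:
        raise KeyError(f"temporary id {temp_id!r} was not in idMapping")
    return mapping[temp_id]
```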

Tuning via Query Parameters

All GraphBatch configuration options are exposed as query parameters:

Parameter              Default   Description
batchSize              100000    Max edges buffered before auto-flush
lightEdges             false     Property-less edges stored as connectivity only (saves ~33% I/O)
wal                    false     Enable Write-Ahead Logging for crash safety
parallelFlush          true      Parallelize edge connection across async threads
preAllocateEdgeChunks  true      Pre-allocate edge segments on vertex creation
edgeListInitialSize    2048      Initial segment size in bytes (64–8192)
bidirectional          true      Connect both outgoing and incoming edges
commitEvery            50000     Edges per sub-transaction within a flush
expectedEdgeCount      0         Hint for auto-tuning batch size
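Assembling the URL from these parameters is plain query-string work. A sketch using only parameter names from the table above (the non-default values are arbitrary examples):

```python
from urllib.parse import urlencode

params = {"lightEdges": "true", "batchSize": 500000, "commitEvery": 100000}
url = "http://localhost:2480/api/v1/batch/mydb?" + urlencode(params)
```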

Examples

curl (JSONL):

curl -X POST "http://localhost:2480/api/v1/batch/mydb?lightEdges=true" \
  -u root:password \
  -H "Content-Type: application/x-ndjson" \
  --data-binary @graph-data.jsonl

curl (CSV):

curl -X POST "http://localhost:2480/api/v1/batch/mydb" \
  -u root:password \
  -H "Content-Type: text/csv" \
  --data-binary @graph-data.csv

Python:

import requests

data = (
    '{"@type":"vertex","@class":"Person","@id":"p1","name":"Alice"}\n'
    '{"@type":"vertex","@class":"Person","@id":"p2","name":"Bob"}\n'
    '{"@type":"edge","@class":"KNOWS","@from":"p1","@to":"p2"}\n'
)

resp = requests.post(
    "http://localhost:2480/api/v1/batch/mydb?lightEdges=true",
    auth=("root", "password"),
    headers={"Content-Type": "application/x-ndjson"},
    data=data,
)
print(resp.json())
# {'verticesCreated': 2, 'edgesCreated': 1, 'elapsedMs': 15, 'idMapping': {'p1': '#9:0', 'p2': '#9:1'}}

JavaScript (Node.js):

const resp = await fetch("http://localhost:2480/api/v1/batch/mydb", {
  method: "POST",
  headers: {
    "Content-Type": "application/x-ndjson",
    Authorization: "Basic " + btoa("root:password"),
  },
  body: [
    '{"@type":"vertex","@class":"Person","@id":"p1","name":"Alice"}',
    '{"@type":"vertex","@class":"Person","@id":"p2","name":"Bob"}',
    '{"@type":"edge","@class":"KNOWS","@from":"p1","@to":"p2"}',
  ].join("\n"),
});
console.log(await resp.json());

Tip: For maximum throughput, group vertices by type in the input. The endpoint batches consecutive same-type vertices into a single createVertices() call. Interleaving types forces smaller batches.
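A pre-sort on @class achieves that grouping client-side. A minimal sketch (Python's sort is stable, so input order within each class is preserved):

```python
records = [
    {"@type": "vertex", "@class": "Person", "@id": "p1"},
    {"@type": "vertex", "@class": "City", "@id": "c1"},
    {"@type": "vertex", "@class": "Person", "@id": "p2"},
]

# Consecutive same-class vertices now form the longest possible runs.
grouped = sorted(records, key=lambda r: r["@class"])
```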

Tip: The endpoint is NOT atomic by design — GraphBatch commits internally in chunks for maximum throughput. Treat it as a bulk-loading operation, not a transactional one. The response tells you exactly how many records were committed.
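Because a failure mid-stream can leave a partial load committed, a client can at least compare the reported counts with what it sent. A sketch (the helper `load_complete` is hypothetical):

```python
def load_complete(resp, sent_vertices, sent_edges):
    """True only if the server reports committing every record that was sent."""
    return (resp["verticesCreated"] == sent_vertices
            and resp["edgesCreated"] == sent_edges)
```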
