#3678 feat(grpc): add GraphBatchLoad RPC for bulk graph loading #3680
Conversation
Summary of Changes

This pull request significantly enhances the gRPC API by introducing a client-streaming endpoint for efficient bulk loading of graph data. This new capability allows users to stream large volumes of vertices and edges, including support for temporary ID resolution and configurable batching options, improving performance and flexibility for graph data ingestion.

Highlights
Code Review

Good addition — the streaming design is sound and the integration tests cover the happy paths well. A few issues need addressing before merge:

🐛 Bug: Proto3 boolean defaults break options that default to true. A plain proto3 bool reads as false when unset, so the server cannot distinguish "not specified, use the default" from an explicit false for true-default options such as parallel_flush, pre_allocate_edge_chunks, and bidirectional.
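A sketch of the fix applied in the follow-up commit: marking the fields `optional` in proto3 enables explicit presence tracking, so the generated `has…()` accessors let the server apply the true default only when a field is genuinely absent (field numbers here are illustrative):

```protobuf
message GraphBatchOptions {
  // 'optional' generates hasParallelFlush() etc., distinguishing
  // "unset" (apply default true) from an explicit false.
  optional bool parallel_flush = 1;
  optional bool pre_allocate_edge_chunks = 2;
  optional bool bidirectional = 3;
}
```

Server side, the read then becomes e.g. `options.hasParallelFlush() ? options.getParallelFlush() : true`.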
Code Review
This pull request introduces a new client-streaming gRPC RPC, GraphBatchLoad, for bulk graph loading, mirroring the HTTP POST /api/v1/batch/{database} endpoint. It adds new proto messages for options, records, chunks, and results, supporting streaming vertices and edges with temporary ID mapping and all GraphBatch tuning parameters. Integration tests cover the vertex+edge, vertex-only, and empty stream scenarios.
```java
        counts[1]++;
      } else {
        if (inEdgePhase.get())
          throw new IllegalArgumentException("Vertex record received after edges. All vertices must appear before edges");
      }
    }
  } catch (final Exception e) {
    resp.onError(Status.INTERNAL.withDescription("graphBatchLoad: " + e.getMessage()).asException());
  }
```

```java
} catch (final Exception e) {
  resp.onError(Status.INTERNAL.withDescription("graphBatchLoad: " + e.getMessage()).asException());
  closeQuietly(batchRef.get());
```

```java
if (batch != null) {
  try { batch.close(); } catch (final Exception ignored) { }
}
```
The closeQuietly method catches Exception but doesn't log it. At least log the exception at the DEBUG level to understand why the batch is failing to close.
```java
try { batch.close(); } catch (final Exception ignored) { LogManager.instance().log(this, Level.DEBUG, "Error during GraphBatch close", ignored); }
```
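A runnable sketch of such a closeQuietly helper, using java.util.logging in place of ArcadeDB's LogManager (the class and method names here are illustrative, not the PR's actual code):

```java
import java.util.logging.Level;
import java.util.logging.Logger;

public class CloseUtil {
  private static final Logger LOG = Logger.getLogger(CloseUtil.class.getName());

  // Swallows close() failures but records them at DEBUG (FINE) level,
  // so a failing GraphBatch close is still diagnosable from the logs.
  public static void closeQuietly(final AutoCloseable resource) {
    if (resource == null)
      return;
    try {
      resource.close();
    } catch (final Exception e) {
      LOG.log(Level.FINE, "Error during close", e);
    }
  }

  public static void main(String[] args) {
    closeQuietly(null); // no-op
    closeQuietly(() -> { throw new IllegalStateException("boom"); }); // logged, not thrown
    System.out.println("ok");
  }
}
```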
Code Review

Good addition — the feature mirrors the HTTP batch endpoint cleanly and the test coverage is solid. A few issues worth addressing before merging:

Bug:
Coverage summary from Codacy

See diff coverage on Codacy

Coverage variation details

Coverage variation is the difference between the coverage for the head and common ancestor commits of the pull request branch.

Diff coverage details

Diff coverage is the percentage of lines that are covered by tests out of the coverable lines that the pull request added or modified.
Code Review

This PR bundles two distinct features: a new gRPC GraphBatchLoad RPC and a set of OpenCypher count push-down optimizations.

gRPC GraphBatchLoad

Potential bug — mid-stream exception leaves transaction open

Hardcoded batch size — the vertex flush threshold of

"Vertices before edges" protocol constraint — the proto definition requires all vertex records to appear before edge records in the stream, and this is enforced server-side. This constraint is non-obvious and should be documented in the proto file itself (a comment on the
The HTTP

Test coverage

The new tests (
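A sketch of how that ordering constraint could be stated in the proto file itself. The message name GraphBatchRecord comes from the PR; the field layout is assumed:

```protobuf
// A single record in a GraphBatchLoad stream.
//
// Ordering constraint: ALL vertex records must be streamed before the
// first edge record. A vertex arriving after an edge is rejected with
// INVALID_ARGUMENT, because temporary vertex IDs must be resolvable
// by the time edges referencing them are processed.
message GraphBatchRecord {
  string type_name = 1;
  // ... vertex/edge payload fields elided ...
}
```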
OpenCypher Count Push-Down Optimizations
```java
if (variable instanceof RID rid) {
  final Object rawValue = rid.asVertex().get(propertyName);
  return convertFromStorage(rawValue);
}
```

```java
final Vertex v = rid.asVertex();
if (v == null)
  return null;
final Object rawValue = v.get(propertyName);
```

Also: algorithm procedures may legitimately store edge RIDs. If an edge RID ends up in this branch,
For star-join patterns, the count is computed as the product of per-arm degrees:

```java
count *= armDegree; // (conceptually)
```

For a high-degree central node (e.g., 10M+ edges per arm), the product can overflow. Guard before multiplying:

```java
if (count > Long.MAX_VALUE / armDegree) return Long.MAX_VALUE; // or throw
count *= armDegree;
```
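The guard can be packaged as a saturating multiply helper. This is a runnable sketch (the class and method names are illustrative, not the PR's code):

```java
public class SaturatingMath {
  // Multiplies two non-negative counts, clamping to Long.MAX_VALUE on
  // overflow instead of silently wrapping to a negative value.
  public static long saturatingMultiply(final long a, final long b) {
    try {
      return Math.multiplyExact(a, b);
    } catch (final ArithmeticException overflow) {
      return Long.MAX_VALUE;
    }
  }

  public static void main(String[] args) {
    System.out.println(saturatingMultiply(10, 10));                // 100
    System.out.println(saturatingMultiply(Long.MAX_VALUE / 2, 3)); // clamped to Long.MAX_VALUE
  }
}
```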
The tolerance in
The
The new optimizer path handles star-join patterns with OPTIONAL MATCH. Tests cover the happy path (
Inserting a new fast-path before the existing optimizer dispatch is fragile. If the star-join pattern detection has a false positive (matches a query it shouldn't), it silently produces wrong results instead of falling through to the correct path. Consider adding a

Minor / Style
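One way to keep a false positive recoverable is to gate the fast path behind a kill switch, so operators can fall back to the always-correct general planner. A hypothetical sketch (all names invented, not ArcadeDB's code):

```java
public class CountPlanner {
  // Hypothetical kill switch: lets operators disable the star-join
  // fast path if a pattern-detection false positive ships to the field.
  public static boolean starJoinFastPathEnabled = true;

  // Returns which plan was chosen; the real planner would return an
  // execution plan object instead of a marker string.
  public static String plan(final boolean looksLikeStarJoin) {
    if (starJoinFastPathEnabled && looksLikeStarJoin)
      return "fast-star-join-count"; // specialized count push-down
    return "general-count";          // always-correct fallback path
  }

  public static void main(String[] args) {
    System.out.println(plan(true));  // uses the fast path
    starJoinFastPathEnabled = false; // operator disables the optimization
    System.out.println(plan(true));  // falls back to the general planner
  }
}
```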
Summary
The gRPC feature and the OpenCypher optimizations are both well-structured and show clear design intent. Addressing the NPE in

🤖 Generated with Claude Code
…ading

Add a new GraphBatchLoad RPC to the gRPC module that exposes the same GraphBatch-based bulk loading as the HTTP POST /api/v1/batch endpoint. Supports streaming vertices and edges with temporary ID mapping, and all GraphBatch tuning parameters (batchSize, lightEdges, WAL, etc.).

Closes #3678

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Use optional bool for proto fields that default to true (parallel_flush, pre_allocate_edge_chunks, bidirectional) to fix proto3 default semantics
- Guard against empty from_ref/to_ref in resolveRef
- Guard against empty database name on first chunk
- Replace unnecessary AtomicBoolean/AtomicReference with plain arrays
- Document vertex-before-edge constraint in proto comments
- Add error-path tests (unknown temp ID, vertex-after-edge, direct RID refs)
- Fix elapsed_ms assertion to avoid potential flake

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Pass database explicitly to RID constructor in resolveRef instead of relying on thread-local (safer for gRPC thread pool)
- Skip call.request(1) after sending error to avoid spurious requests
- Document partial-commit behavior of closeQuietly (by design, same as HTTP)
- Document configureGraphBatchOptions asymmetry for true-default fields
- Update credentials proto comment to reflect optional channel-level auth
- Add message-size warning for large id_mapping responses
- Remove sv() alias, use stringValue() consistently in tests
- Add clarifying comment to direct RID reference test

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
36109e0 to 425efec
test

full review here
Code Review - Part 1 of 2

The prior reviews covered the main design issues well. This pass focuses on a few items not yet addressed.

Bug: Double-close risk on GraphBatch when transport error follows server error

When onNext throws (e.g. unknown temp ID), the server calls resp.onError() and then closeQuietly(batchRef.get()). The batch is now closed. However, if a transport error (client disconnect, network reset) arrives shortly after, gRPC will invoke the server-side onError callback on the request stream, which calls closeQuietly(batchRef.get()) on the already-closed batch. GraphBatch.close() calls flush(), connectDeferredIncomingEdges(), and batchUpdateVertexHeadChunks(). If any of these are non-idempotent (edge buffer state not zeroed after close), a second call can double-commit deferred incoming edges or corrupt vertex head pointers.

Fix: null out the reference immediately after closing, e.g. `closeQuietly(batchRef.getAndSet(null));`, so a second call is a no-op. Use this everywhere closeQuietly is currently called (onNext catch, onError, onCompleted catch).
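The idempotent-close pattern can be sketched in isolation. AtomicReference.getAndSet(null) guarantees close() runs at most once even if two callbacks race; the GraphBatch is simulated here by a counting AutoCloseable:

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicReference;

public class IdempotentClose {
  public static final AtomicInteger CLOSES = new AtomicInteger();

  static void closeQuietly(final AutoCloseable c) {
    if (c == null) return;               // second caller sees null: no-op
    try { c.close(); } catch (final Exception ignored) { }
  }

  public static void main(String[] args) {
    final AtomicReference<AutoCloseable> batchRef =
        new AtomicReference<AutoCloseable>(() -> CLOSES.incrementAndGet());

    // Simulates onError firing after onNext's catch already cleaned up:
    closeQuietly(batchRef.getAndSet(null)); // closes the batch
    closeQuietly(batchRef.getAndSet(null)); // ref already null: no-op
    System.out.println(CLOSES.get());       // closed exactly once
  }
}
```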
SPDX-License-Identifier: Apache-2.0

ArcadeDB

This product includes software developed at

================================================================================

This product includes components originally developed for OrientDB, an Open

ArcadeDB was created by Luca Garulli, the same founder of OrientDB, and written

OrientDB source code is available at: https://github.com/orientechnologies/orientdb

================================================================================

This product includes openCypher grammar files and Technology Compatibility Kit

Copyright (c) "Neo4j"

Licensed under the Apache License, Version 2.0 (the "License");

openCypher project website: https://opencypher.org/

Cypher® is a registered trademark of Neo4j Inc.

The included openCypher TCK tests contain the following attribution notice:
"This work was created by the collective efforts of the openCypher community.

The following files contain openCypher content:

================================================================================

This product includes software from the OpenCypher-Gremlin project.
Copyright 2018-2019 Neueda

The following file is a security-patched override of unmaintained upstream code:

OpenCypher-Gremlin project: https://github.com/opencypher/cypher-for-gremlin

================================================================================

This product includes software developed by The Apache Software Foundation

Apache Lucene

This product includes software developed by

Includes software from other Apache copyright owners, as follows:

Apache ANTLR4
Apache TinkerPop (Gremlin)
Apache Commons Compress
Apache Commons Lang
Apache Commons Codec
Apache HttpComponents

================================================================================

This product includes software developed by GraalVM (Oracle).
GraalVM SDK, Truffle API, and JavaScript implementation

================================================================================

This product includes software developed by Google Inc.
Protocol Buffers
Guava Libraries
Gson
gRPC

================================================================================

This product includes software developed by Red Hat, Inc.
Undertow

================================================================================

This product includes software developed by the Netty project.
Netty

================================================================================

This product includes JUnit 5 test framework.
JUnit 5

================================================================================

This product includes Testcontainers framework.
Testcontainers

================================================================================

This product includes software developed by FasterXML, LLC.
Jackson JSON processor
https://github.com/FasterXML/jackson

================================================================================

For the complete list of third-party components and their licenses, see the
Code Review test
How to contribute to ArcadeDB

Contribution License

ArcadeDB is licensed under the Apache License 2.0. By submitting a contribution (pull request, patch, or any other form), you agree that your contribution is licensed under the same terms, as stated in Section 5 of the Apache License 2.0. This means:
By submitting a contribution, you represent that you have the right to license it under the Apache License 2.0 — i.e., the work is your own, or you have permission from the copyright holder.

Did you find a bug?
Did you write a patch that fixes a bug?
Did you fix whitespace, format code, or make a purely cosmetic patch?

Changes that are cosmetic in nature and do not add anything substantial to the stability, functionality, or testability of the project will generally not be accepted.

Do you intend to add a new feature or change an existing one?
Are you adding a new dependency?

When adding new third-party dependencies to ArcadeDB:
For questions about licensing, please ask in the discussion forum before adding the dependency.

Prepare your environment

Pre-commit

This project uses pre-commit. Every developer should install it locally, please

Developer guide

In order to use the latest distribution before the official release, you have to build it from the latest SNAPSHOT version in
It's super easy: just follow these steps.

(1) Download the sources

If you have git installed, execute the following command on a shell or command prompt:

```shell
$ git clone git@github.com:ArcadeData/arcadedb.git
```

With this command above you have just cloned the

(2) Now build the project using Maven.

Assure you have Apache Maven installed on your computer and execute the following commands:

```shell
$ cd arcadedb
$ mvn clean install -DskipTests
```

The build should take less than a minute.

(3) Use the fresh built distribution

You will find a fresh built distribution under the directory

For example, if the latest release is the

Some build targets

Build and run default tests:

```shell
$ mvn clean install
```

To run additional integration test locally use:

```shell
$ mvn clean install -P integration
```

Rebuild Docker images for

```shell
$ mvn clean install -DskipTests -Pdocker
```

Build and run in fail-at-end mode:

```shell
$ mvn clean install -fae
```

Print dependency tree:

```shell
$ mvn dependency:tree
```
ArcadeDB Plugin Architecture

Overview

ArcadeDB supports a plugin architecture that allows extending the server functionality through isolated plugins. Each plugin runs in its own class loader, enabling plugins to have different versions of dependencies without conflicts.

Plugin Types

ArcadeDB includes the following built-in plugins:

Architecture

Class Loading

The plugin system uses isolated class loaders with the following strategy:

This approach ensures:

Components

PluginManager

PluginClassLoader

PluginDescriptor

Plugin Lifecycle

Installation Priorities

Plugins are started in phases based on their installation priority:
Creating a Plugin

1. Implement ServerPlugin Interface

```java
package com.example.myplugin;

import com.arcadedb.ContextConfiguration;
import com.arcadedb.server.ArcadeDBServer;
import com.arcadedb.server.ServerPlugin;

public class MyPlugin implements ServerPlugin {
  private ArcadeDBServer server;
  private ContextConfiguration configuration;

  @Override
  public void configure(ArcadeDBServer arcadeDBServer, ContextConfiguration configuration) {
    this.server = arcadeDBServer;
    this.configuration = configuration;
    // Initialize your plugin configuration
  }

  @Override
  public void startService() {
    // Start your plugin services
    System.out.println("MyPlugin started!");
  }

  @Override
  public void stopService() {
    // Stop your plugin services and clean up resources
    System.out.println("MyPlugin stopped!");
  }

  @Override
  public INSTALLATION_PRIORITY getInstallationPriority() {
    return INSTALLATION_PRIORITY.AFTER_HTTP_ON;
  }
}
```

2. Create Service Provider Configuration

Create file:

Content:

3. Build Plugin JAR

```shell
mvn clean package
```

4. Deploy Plugin

Copy the plugin JAR to the

```shell
cp target/myplugin-1.0.0.jar $ARCADEDB_HOME/lib/plugins/
```

5. Start ArcadeDB

The plugin will be automatically discovered and loaded when ArcadeDB starts:

```shell
cd $ARCADEDB_HOME
bin/server.sh
```

Check the logs for:

Plugin Dependencies

Server API Dependencies

Plugin POMs should include server dependencies with

```xml
<dependency>
  <groupId>com.arcadedb</groupId>
  <artifactId>arcadedb-server</artifactId>
  <version>${arcadedb.version}</version>
  <scope>provided</scope>
</dependency>
```

Plugin-Specific Dependencies

Plugin-specific dependencies use normal

```xml
<dependency>
  <groupId>com.example</groupId>
  <artifactId>my-library</artifactId>
  <version>1.0.0</version>
  <scope>compile</scope>
</dependency>
```

Building Distributions with Plugins

Maven Assembly

The Maven assembly descriptor automatically places plugin JARs in

```xml
<dependencySet>
  <outputDirectory>lib/plugins</outputDirectory>
  <includes>
    <include>com.arcadedb:arcadedb-gremlin</include>
    <include>com.arcadedb:arcadedb-postgresw</include>
    <include>com.arcadedb:arcadedb-mongodbw</include>
    <include>com.arcadedb:arcadedb-redisw</include>
    <include>com.arcadedb:arcadedb-grpcw</include>
  </includes>
  <useTransitiveDependencies>false</useTransitiveDependencies>
</dependencySet>
```

Advanced Topics

Accessing Server Resources

Plugins have full access to the ArcadeDB server instance:

```java
@Override
public void configure(ArcadeDBServer arcadeDBServer, ContextConfiguration configuration) {
  this.server = arcadeDBServer;

  // Access databases
  ServerDatabase db = server.getDatabase("mydb");

  // Access HTTP server for custom endpoints
  HttpServer httpServer = server.getHttpServer();

  // Access security
  ServerSecurity security = server.getSecurity();
}
```

Thread Context Class Loader

The PluginManager automatically sets the thread context class loader during plugin operations:

This ensures proper class loading for frameworks that use the thread context class loader.

HTTP Endpoint Registration

Plugins can register custom HTTP endpoints:

```java
@Override
public void registerAPI(HttpServer httpServer, PathHandler routes) {
  routes.addExactPath("/api/myplugin", exchange -> {
    exchange.getResponseHeaders().put(Headers.CONTENT_TYPE, "application/json");
    exchange.getResponseSender().send("{\"status\":\"ok\"}");
  });
}
```

Troubleshooting

Plugin Not Discovered

Check that:
ClassNotFoundException

If you see

If you see

Plugin Conflicts

If two plugins have conflicting dependencies:
Migration from Legacy Plugin Loading

The new plugin system is backward compatible with the legacy configuration-based loading. Both systems can coexist:

Legacy Method (still supported)

```properties
arcadedb.server.plugins=gremlin:com.arcadedb.server.gremlin.GremlinServerPlugin
```

New Method (recommended)
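The service-provider discovery described above can be sketched with plain java.util.ServiceLoader (the interface name below is a stand-in for ArcadeDB's ServerPlugin; with no META-INF/services entry on the classpath, discovery simply yields nothing):

```java
import java.util.ServiceLoader;

public class PluginDiscovery {
  // Stand-in for com.arcadedb.server.ServerPlugin.
  public interface ServerPluginLike {
    void startService();
  }

  public static int countDiscovered() {
    int found = 0;
    // ServiceLoader reads META-INF/services/<interface-FQN> entries from
    // each (plugin) class loader; here the default loader has none.
    for (final ServerPluginLike plugin : ServiceLoader.load(ServerPluginLike.class))
      found++;
    return found;
  }

  public static void main(String[] args) {
    System.out.println(countDiscovered()); // 0 without a provider file
  }
}
```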
Best Practices
Examples

See the built-in plugins for complete examples:
Code Review

The prior reviews covered the main design issues well. This pass focuses on a few items not yet addressed.

Bug: Double-close risk on GraphBatch when transport error follows server error

When onNext throws (e.g. unknown temp ID), the server calls resp.onError() and then closeQuietly(batchRef.get()). The batch is now closed. However, if a transport error (client disconnect, network reset) arrives shortly after, gRPC will invoke the server-side onError callback on the request stream, which calls closeQuietly(batchRef.get()) on the already-closed batch. GraphBatch.close() calls flush(), connectDeferredIncomingEdges(), and batchUpdateVertexHeadChunks(). If any of these are non-idempotent (edge buffer state not zeroed after close), a second call can double-commit deferred incoming edges or corrupt vertex head pointers.

Fix: null out the reference immediately after closing, so a second call is a no-op:

```java
closeQuietly(batchRef.getAndSet(null));
```

Use this everywhere closeQuietly is currently called (onNext catch, onError, onCompleted catch).

Missing validation: type_name is never checked for blank

If a client sends a GraphBatchRecord without setting type_name (proto3 default ""), the call propagates to GraphBatch.createVertices("", ...) and fails with a cryptic schema exception or NPE deep inside the engine. A guard at the record-processing level produces a clear error:

```java
if (rec.getTypeName().isEmpty())
```

Minor: Test query assertion fragile for multi-page results

```java
assertThat(queryResp.getResultsList().get(0).getRecordsList()).hasSize(2);
```

getResultsList().get(0) assumes a single-page response. For 2 records this is fine in practice, but the intent is clearer if aggregated across all pages:

```java
final long total = queryResp.getResultsList().stream()
```

Positive notes
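The suggested blank-type guard can be sketched in isolation. Here the gRPC record is simulated by a plain string, and IllegalArgumentException stands in for the gRPC INVALID_ARGUMENT status the real handler would return (names illustrative):

```java
public class RecordValidation {
  // Rejects the proto3 default "" early, instead of letting it fail
  // deep inside the engine with a cryptic schema exception.
  public static String requireTypeName(final String typeName) {
    if (typeName == null || typeName.isEmpty())
      throw new IllegalArgumentException("type_name must be set on every GraphBatchRecord");
    return typeName;
  }

  public static void main(String[] args) {
    System.out.println(requireTypeName("Person")); // valid name passes through
    try {
      requireTypeName("");
    } catch (final IllegalArgumentException expected) {
      System.out.println("rejected blank type_name");
    }
  }
}
```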
Codecov Report

❌ Patch coverage is

Additional details and impacted files

```diff
@@            Coverage Diff             @@
##               main    #3680    +/-   ##
==========================================
- Coverage     65.63%   65.62%   -0.01%
==========================================
  Files          1557     1557
  Lines        111125   111265    +140
  Branches      23299    23333     +34
==========================================
+ Hits          72933    73019     +86
- Misses        28354    28384     +30
- Partials       9838     9862     +24
```

☔ View full report in Codecov by Sentry.
Summary
- GraphBatchLoad: client-streaming gRPC RPC that mirrors the HTTP POST /api/v1/batch/{database} endpoint (issue Batch HTTP endpoint #3675)
- New proto messages: GraphBatchOptions, GraphBatchRecord, GraphBatchChunk, GraphBatchResult
- Supports all GraphBatch tuning parameters

Test plan
- mvn generate-sources (in grpc module)
- mvn clean compile
- graphBatchLoadVerticesAndEdges — streams 2 vertices + 1 edge, verifies counts and ID mapping
- graphBatchLoadVerticesOnly — streams 2 vertices only, verifies 0 edges
- graphBatchLoadEmptyStream — immediately completes, verifies empty result

🤖 Generated with Claude Code