
Cypher load csv #3450

Merged
lvca merged 2 commits into main from CypherLoadCSV
Feb 16, 2026

Conversation

@lvca
Member

@lvca lvca commented Feb 16, 2026

Issue #3441

@lvca lvca self-assigned this Feb 16, 2026
@lvca lvca added this to the 26.2.1 milestone Feb 16, 2026
@lvca lvca merged commit 9037185 into main Feb 16, 2026
8 of 10 checks passed
@lvca lvca deleted the CypherLoadCSV branch February 16, 2026 05:53
@gemini-code-assist
Contributor

Summary of Changes

Hello @lvca, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances ArcadeDB's data ingestion capabilities by introducing the LOAD CSV clause to Cypher queries. This feature allows users to seamlessly import data from various CSV sources, including local files, HTTP/S URLs, and compressed archives, directly into the database. It provides flexible parsing options, robust security controls for file access, and introduces transactional batching for subqueries, making large-scale data imports more efficient and reliable.

Highlights

  • New LOAD CSV Cypher Clause: Introduced the LOAD CSV clause, enabling direct import of data from CSV files into ArcadeDB, supporting both local files and URLs.
  • CSV Parsing Options: Implemented support for WITH HEADERS to use the first row as column names and FIELDTERMINATOR for custom delimiters. Robust parsing handles quoted fields, embedded delimiters, newlines, and backslash escaping.
  • File Access Security: Added global configurations arcadedb.opencypher.loadCsv.allowFileUrls and arcadedb.opencypher.loadCsv.importDirectory to control local file access and prevent path traversal vulnerabilities.
  • Contextual Functions for LOAD CSV: Provided new Cypher functions file() and linenumber() to retrieve the current CSV file URL and 1-based line number during processing.
  • Transactional Batching for Subqueries: Enhanced CALL {} subqueries with IN TRANSACTIONS support, allowing operations within subqueries to be committed in configurable batches, improving performance and memory usage for large imports.
  • Compressed File Support: Enabled automatic decompression for GZIP (.gz) and ZIP (.zip) compressed CSV files, simplifying data ingestion from compressed sources.
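
A sketch of how the pieces above combine in a query. The file name, label, and property names are illustrative, and the exact `IN TRANSACTIONS OF n ROWS` batch syntax is an assumption based on the standard Cypher form rather than taken from this PR:

```cypher
// Hypothetical people.csv with a header row: name;city
LOAD CSV WITH HEADERS FROM 'file:///people.csv' AS row FIELDTERMINATOR ';'
CALL {
  WITH row
  CREATE (:Person {name: row.name, city: row.city,
                   source: file(), importedAtLine: linenumber()})
} IN TRANSACTIONS OF 500 ROWS
```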


Changelog
  • engine/src/main/java/com/arcadedb/GlobalConfiguration.java
    • Added OPENCYPHER_LOAD_CSV_ALLOW_FILE_URLS to control local file URL access.
    • Added OPENCYPHER_LOAD_CSV_IMPORT_DIRECTORY to define a root directory for CSV imports and prevent path traversal.
  • engine/src/main/java/com/arcadedb/function/cypher/LoadCSVFileFunction.java
    • Added new stateless function file() to return the current CSV file URL.
  • engine/src/main/java/com/arcadedb/function/cypher/LoadCSVLineNumberFunction.java
    • Added new stateless function linenumber() to return the current CSV line number.
  • engine/src/main/java/com/arcadedb/query/opencypher/ast/ClauseEntry.java
    • Added LOAD_CSV as a new ClauseType to the Cypher Abstract Syntax Tree (AST).
  • engine/src/main/java/com/arcadedb/query/opencypher/ast/LoadCSVClause.java
    • Added new AST class LoadCSVClause to represent the LOAD CSV statement and its parameters.
  • engine/src/main/java/com/arcadedb/query/opencypher/ast/SubqueryClause.java
    • Modified SubqueryClause to include inTransactions and batchSize properties for transactional subqueries.
  • engine/src/main/java/com/arcadedb/query/opencypher/executor/CypherExecutionPlan.java
    • Imported LoadCSVClause and LoadCSVStep.
    • Integrated LOAD_CSV clause into the execution plan builder, creating a LoadCSVStep.
  • engine/src/main/java/com/arcadedb/query/opencypher/executor/CypherFunctionFactory.java
    • Registered file() and linenumber() as Cypher-specific functions.
    • Added factory methods to create instances of LoadCSVFileFunction and LoadCSVLineNumberFunction.
  • engine/src/main/java/com/arcadedb/query/opencypher/executor/ExpressionEvaluator.java
    • Added logic to propagate __loadCSV_file and __loadCSV_linenumber context variables for file() and linenumber() functions.
  • engine/src/main/java/com/arcadedb/query/opencypher/executor/steps/LoadCSVStep.java
    • Added new execution step LoadCSVStep to handle reading, parsing, and processing of CSV data.
    • Implemented methods for opening raw input streams with security checks, resolving file paths, and handling compressed files (GZIP, ZIP).
    • Developed robust CSV line reading and parsing logic, including support for quoted fields and various escape sequences.
  • engine/src/main/java/com/arcadedb/query/opencypher/executor/steps/SubqueryStep.java
    • Modified syncPull to dispatch to syncPullNormal or syncPullInTransactions based on the subquery clause.
    • Added syncPullInTransactions method to handle batch processing and committing within subqueries.
    • Implemented resolveBatchSize to determine the batch size for transactional subqueries.
    • Updated prettyPrint to reflect the IN TRANSACTIONS clause.
  • engine/src/main/java/com/arcadedb/query/opencypher/parser/ClauseDispatcher.java
    • Registered loadCSVClause to be handled by handleLoadCSV.
    • Added handleLoadCSV method to add LoadCSVClause to the statement builder.
  • engine/src/main/java/com/arcadedb/query/opencypher/parser/CypherASTBuilder.java
    • Modified visitSubqueryClause to parse IN TRANSACTIONS parameters, including batch size.
    • Added visitLoadCSVClause method to parse the LOAD CSV clause from the ANTLR grammar.
  • engine/src/main/java/com/arcadedb/query/opencypher/parser/CypherSemanticValidator.java
    • Updated validateBoundVariables and validateVariableScope to correctly handle variables introduced by the LOAD_CSV clause.
  • engine/src/main/java/com/arcadedb/query/opencypher/parser/FunctionValidator.java
    • Registered file and linenumber functions with their argument counts for validation.
  • engine/src/main/java/com/arcadedb/query/opencypher/parser/StatementBuilder.java
    • Added addLoadCSV method to add a LoadCSVClause to the statement's clause order.
  • engine/src/test/java/com/arcadedb/query/opencypher/OpenCypherLoadCSVTest.java
    • Added a new comprehensive test suite for LOAD CSV functionality.
    • Included tests for basic CSV loading, headers, custom terminators, node creation, file() and linenumber() functions, quoted fields, empty files, parameterized URLs, bare file paths, and compressed files (GZIP, ZIP).
    • Added security tests for disabled file URLs and import directory path traversal prevention.
    • Included tests for CALL {} IN TRANSACTIONS with and without specified batch sizes.

@mergify
Contributor

mergify bot commented Feb 16, 2026

🧪 CI Insights

Here's what we observed from your CI run for b693c08.

🟢 All jobs passed!

But CI Insights is watching 👀

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces LOAD CSV functionality to the Cypher engine, a significant feature for data import. It also adds CALL {} IN TRANSACTIONS for batched writes within subqueries, which is very useful for large data loading operations.

The implementation of LOAD CSV is comprehensive, supporting various URL schemes, security controls to prevent path traversal, transparent decompression of .gz and .zip files, and contextual functions like file() and linenumber(). The IN TRANSACTIONS feature for CALL subqueries is also well-implemented, enabling periodic commits crucial for memory management during large imports.

The code is well-structured with new execution steps, AST nodes, and parser rules, and is accompanied by a thorough test suite. I've identified a high-severity issue concerning a potential resource leak in LoadCSVStep when handling compressed files, along with a minor improvement for processing ZIP archives. Overall, this is an excellent contribution that significantly enhances ArcadeDB's data import capabilities via Cypher.

Comment on lines +286 to +300
  static BufferedReader openReader(final String url, final CommandContext context) throws IOException {
    InputStream is = openRawInputStream(url, context);

    if (url.endsWith(".gz"))
      is = new GZIPInputStream(is);
    else if (url.endsWith(".zip")) {
      final ZipInputStream zis = new ZipInputStream(is);
      final ZipEntry entry = zis.getNextEntry();
      if (entry == null)
        throw new CommandExecutionException("ZIP file is empty: " + url);
      is = zis;
    }

    return new BufferedReader(new InputStreamReader(is, StandardCharsets.UTF_8));
  }
Contributor


high

The current implementation of openReader has a potential resource leak. If an exception occurs while wrapping the InputStream (e.g., in new GZIPInputStream(is) or zis.getNextEntry()), the underlying stream obtained from openRawInputStream will not be closed.

Additionally, when reading from a .zip file, the code only gets the first entry and does not handle cases where that entry might be a directory, which would result in reading an empty stream.

I suggest refactoring this method to use a try-catch block to ensure resource cleanup on failure and to correctly handle directories within ZIP archives by iterating until the first file entry is found.

  static BufferedReader openReader(final String url, final CommandContext context) throws IOException {
    InputStream is = openRawInputStream(url, context);
    try {
      if (url.endsWith(".gz")) {
        is = new GZIPInputStream(is);
      } else if (url.endsWith(".zip")) {
        final ZipInputStream zis = new ZipInputStream(is);
        ZipEntry entry = zis.getNextEntry();
        while (entry != null && entry.isDirectory()) {
          entry = zis.getNextEntry();
        }

        if (entry == null) {
          zis.close();
          throw new CommandExecutionException("ZIP file contains no files: " + url);
        }
        is = zis;
      }
      return new BufferedReader(new InputStreamReader(is, StandardCharsets.UTF_8));
    } catch (IOException | RuntimeException e) {
      try {
        is.close();
      } catch (IOException ignored) {
        // suppress
      }
      throw e;
    }
  }

@claude

claude bot commented Feb 16, 2026

PR Review: Cypher LOAD CSV Implementation

Overview

This PR implements Cypher's LOAD CSV clause along with CALL {} IN TRANSACTIONS support for batch processing. The implementation adds ~1350 lines of well-tested code addressing issue #3441.

Positive Aspects

1. Excellent Test Coverage

  • Comprehensive test suite (531 lines) covering: basic CSV loading with/without headers; custom field terminators; quoted fields and edge cases; security features (file URL restrictions, path traversal protection); compression support (.gz, .zip); batch transaction processing; empty files; and parameter-based URLs
  • Uses AssertJ assertions matching project style

2. Strong Security Implementation

  • Two-layer security model: OPENCYPHER_LOAD_CSV_ALLOW_FILE_URLS to disable file access in multi-tenant environments, and OPENCYPHER_LOAD_CSV_IMPORT_DIRECTORY for sandboxed import directory with path traversal protection
  • Path normalization prevents ../ attacks (LoadCSVStep.java:271-276)
  • Well-documented security tests
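
The normalization check described above can be sketched roughly as follows. The class and method names here are illustrative, not the actual LoadCSVStep code:

```java
import java.nio.file.Path;
import java.nio.file.Paths;

public class ImportPathGuard {
  // Resolves a user-supplied relative path against the configured import
  // directory and rejects anything that escapes it (e.g. via "../").
  public static Path resolveInsideImportDir(final String importDirectory, final String userPath) {
    final Path root = Paths.get(importDirectory).toAbsolutePath().normalize();
    final Path resolved = root.resolve(userPath).normalize();
    // After normalize(), any "../" segments have been collapsed, so a path
    // that traversed out of the sandbox no longer starts with the root.
    if (!resolved.startsWith(root))
      throw new SecurityException("Path escapes import directory: " + userPath);
    return resolved;
  }
}
```

Path.startsWith compares whole path components, so sibling directories with a shared name prefix (e.g. /data vs /database) are not mistaken for one another.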

3. Robust CSV Parsing

  • RFC 4180 compliant with extensions: handles quoted fields with embedded delimiters, multiline field support, both "" and " escape sequences, custom field terminators
  • Efficient parsing with proper edge case handling
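
A minimal parser along the lines the review describes, handling quoted fields, embedded delimiters, and `""` escapes. This is an illustrative sketch, not the PR's parseCSVLine (backslash escaping and multiline fields are omitted):

```java
import java.util.ArrayList;
import java.util.List;

public class SimpleCsv {
  // Splits one CSV line on the given delimiter, honoring RFC 4180 quoting:
  // fields may be wrapped in double quotes, and "" inside a quoted field
  // denotes a literal quote character.
  public static List<String> parseLine(final String line, final char delimiter) {
    final List<String> fields = new ArrayList<>();
    final StringBuilder current = new StringBuilder();
    boolean inQuotes = false;
    for (int i = 0; i < line.length(); i++) {
      final char c = line.charAt(i);
      if (inQuotes) {
        if (c == '"') {
          if (i + 1 < line.length() && line.charAt(i + 1) == '"') {
            current.append('"'); // "" -> literal quote
            i++;
          } else
            inQuotes = false; // closing quote
        } else
          current.append(c); // delimiters inside quotes are literal
      } else if (c == '"')
        inQuotes = true;
      else if (c == delimiter) {
        fields.add(current.toString());
        current.setLength(0);
      } else
        current.append(c);
    }
    fields.add(current.toString()); // last field has no trailing delimiter
    return fields;
  }
}
```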

4. Good Architecture

  • Clean separation: AST classes, execution steps, functions
  • Follows existing Cypher execution patterns
  • Proper resource cleanup with try-finally blocks
  • Memory-efficient streaming (doesn't load entire CSV into memory)

5. Feature Completeness

  • file() and linenumber() functions for debugging
  • Transparent compression support
  • HTTP/HTTPS URL support
  • Batch transaction support via CALL {} IN TRANSACTIONS
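
The transparent GZIP handling can be exercised in isolation with a round-trip through java.util.zip, independent of LoadCSVStep:

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class GzipCsvRoundTrip {
  // Compresses a CSV string the way a .gz file on disk would be stored.
  public static byte[] gzip(final String text) throws IOException {
    final ByteArrayOutputStream bos = new ByteArrayOutputStream();
    try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
      gz.write(text.getBytes(StandardCharsets.UTF_8));
    }
    return bos.toByteArray();
  }

  // Mirrors the reader-side wrapping: GZIPInputStream under a UTF-8 reader.
  public static String gunzipFirstLine(final byte[] compressed) throws IOException {
    try (BufferedReader r = new BufferedReader(new InputStreamReader(
        new GZIPInputStream(new ByteArrayInputStream(compressed)), StandardCharsets.UTF_8))) {
      return r.readLine();
    }
  }
}
```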

Issues & Concerns

1. Resource Leak Risk (High Priority)

Location: LoadCSVStep.java:209-222

The closeReader() method swallows IOException in the catch block. If the outer close() is called during exception handling and closeReader() fails, the exception is lost.

Recommendation: Log the IOException at minimum to avoid silent failures.

2. Transaction Management (Medium Priority)

Location: SubqueryStep.java:273-299

In syncPullInTransactions, the code calls database.begin() without checking if a transaction is already active. This could lead to nested transaction issues.

Recommendation: Check transaction state before beginning or document that IN TRANSACTIONS should only be used outside existing transactions.
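
The guard suggested above could look roughly like this. The Database methods shown are assumptions about ArcadeDB's transaction API, illustrated against a minimal stand-in interface:

```java
public class TransactionGuard {
  // Minimal stand-in for the database transaction API
  // (method names are assumptions for illustration).
  interface Database {
    boolean isTransactionActive();
    void begin();
    void commit();
  }

  // Refuses to start a batched transaction when one is already open,
  // instead of silently nesting transactions.
  public static void runBatch(final Database db, final Runnable batch) {
    if (db.isTransactionActive())
      throw new IllegalStateException("IN TRANSACTIONS cannot run inside an open transaction");
    db.begin();
    batch.run();
    db.commit();
  }
}
```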

3. Performance: Inefficient Quote Counting (Low Priority)

Location: LoadCSVStep.java:330-342

countUnescapedQuotes() is called on progressively larger strings in the multiline loop (line 322), creating O(n²) behavior for files with very long multiline fields.

Recommendation: Track quote count incrementally rather than recounting the entire accumulated string.
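
Tracking the count incrementally, as recommended, can be sketched like this. It is an illustrative sketch, not the PR's countUnescapedQuotes, and it ignores backslash escaping for simplicity:

```java
public class QuoteBalance {
  // Counts double quotes in one physical line; a "" pair contributes an
  // even count, so escaped quotes don't change the parity.
  public static int quoteCount(final String line) {
    int count = 0;
    for (int i = 0; i < line.length(); i++)
      if (line.charAt(i) == '"')
        count++;
    return count;
  }

  // A quoted field is still open when the running total of quotes seen so
  // far is odd. Adding each new line's count to a running total costs
  // O(length of that line), instead of re-scanning the whole accumulated
  // multiline string on every iteration.
  public static boolean fieldStillOpen(final int runningQuoteCount) {
    return (runningQuoteCount & 1) == 1;
  }
}
```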

4. Edge Case: ZIP with Multiple Entries (Low Priority)

Location: LoadCSVStep.java:291-296

The code only reads the first entry in a ZIP file. If a ZIP contains multiple files, there's no way to specify which one to load.

Consideration: Document this limitation or add support for file.zip!/entry.csv syntax.

5. Missing Import Validation (Medium Priority)

Location: LoadCSVStep.java:238-258

When file URLs are enabled but no import directory is set, any file path is allowed. This could be a security risk in server deployments.

Recommendation: Add a warning in logs when allowFileUrls=true but importDirectory is empty, or consider making the import directory mandatory when file access is enabled.

6. Potential NPE (Low Priority)

Location: LoadCSVStep.java:226-231

createOutputRow iterates inputRow.getPropertyNames() without null checking inputRow. Add null check to prevent NPE.

7. ParseCSVLine Edge Case (Low Priority)

Location: LoadCSVStep.java:354-360

The trailing delimiter check at line 357 could throw StringIndexOutOfBoundsException if delimLen > i. Add bounds check.

Minor Suggestions

Code Style

  1. LoadCSVStep.java:147 - Consider extracting header mapping to a separate method for readability
  2. Variable naming: currentLineNumber is 0-based internally but 1-based in output - consider renaming for clarity

Documentation

  1. Add JavaDoc examples to LoadCSVStep class showing common usage patterns
  2. Document behavior when URL expression evaluates to different URLs per input row
  3. Add security best practices to GlobalConfiguration descriptions

Testing Gaps

  1. No test for very large CSV files (performance/memory validation)
  2. No test for malformed CSV (e.g., mismatched quote counts)
  3. No test for HTTP URLs (currently only file URLs tested)
  4. No test for concurrent LOAD CSV operations

Performance Considerations

Good: Streaming approach avoids loading entire CSV into memory
Good: Buffered reading with appropriate batch sizes
Concern: No limit on individual field size - malicious CSV with giant quoted field could cause OOM
Concern: HTTP URLs have no timeout configuration (relies on JVM defaults)

Recommendation: Add configuration for max field size and HTTP timeout.
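
For the HTTP timeout part of the recommendation, java.net.URLConnection already exposes the two relevant knobs; the timeout values below are illustrative, and for URLConnection the JVM default of 0 means wait indefinitely:

```java
import java.net.URL;
import java.net.URLConnection;

public class CsvHttpSource {
  // Opens a connection with explicit timeouts instead of relying on the
  // JVM defaults. openConnection() does not contact the server; the
  // network round trip happens only when the caller reads the stream.
  public static URLConnection openWithTimeouts(final String url, final int connectMillis,
      final int readMillis) throws Exception {
    final URLConnection conn = new URL(url).openConnection();
    conn.setConnectTimeout(connectMillis); // time allowed to establish the TCP connection
    conn.setReadTimeout(readMillis);       // time allowed between reads of the response body
    return conn;
  }
}
```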

Security Assessment

Strengths:

  • Path traversal protection via normalization
  • Configurable file access restrictions
  • Clear security boundaries

Recommendations:

  1. Consider making import directory mandatory in server mode
  2. Add rate limiting for HTTP URL fetches (prevent SSRF amplification)
  3. Document that file:/// access should be disabled in multi-tenant deployments
  4. Consider adding a whitelist for allowed HTTP/HTTPS domains

Overall Assessment

This is a high-quality implementation that follows ArcadeDB coding standards well. The code is clean, well-tested, and addresses a real need for CSV import functionality. The security model is well-designed with appropriate safeguards.

Priority Fixes Before Merge

  1. Address resource leak in closeReader() (add logging)
  2. Fix/document transaction management in IN TRANSACTIONS mode
  3. Add bounds check in parseCSVLine trailing delimiter logic

Recommended Follow-ups

  1. Add max field size configuration
  2. Add HTTP timeout configuration
  3. Document security best practices for production deployments
  4. Consider performance testing with large CSV files

Recommendation

Approve with minor fixes - The implementation is solid and ready for merge after addressing the high-priority resource leak logging and potential edge cases. The security model is appropriate for the feature.


Review generated by Claude Code

@codacy-production

Coverage summary from Codacy

See diff coverage on Codacy

Coverage variation: -10.46%    Diff coverage: 76.12%

Coverage variation details
                                     Coverable lines   Covered lines   Coverage
  Common ancestor commit (58da446)   85339             60830           71.28%
  Head commit (b693c08)              115250 (+29911)   70090 (+9260)   60.82% (-10.46%)

Coverage variation is the difference between the coverage for the head and common ancestor commits of the pull request branch: <coverage of head commit> - <coverage of common ancestor commit>

Diff coverage details
                         Coverable lines   Covered lines   Diff coverage
  Pull request (#3450)   356               271             76.12%

Diff coverage is the percentage of lines that are covered by tests out of the coverable lines that the pull request added or modified: <covered lines added or modified>/<coverable lines added or modified> * 100%


@codecov

codecov bot commented Feb 16, 2026

Codecov Report

❌ Patch coverage is 65.16854% with 124 lines in your changes missing coverage. Please review.
✅ Project coverage is 62.87%. Comparing base (58da446) to head (b693c08).
⚠️ Report is 1 commits behind head on main.

Files with missing lines Patch % Lines
...b/query/opencypher/executor/steps/LoadCSVStep.java 61.05% 56 Missing and 25 partials ⚠️
.../query/opencypher/executor/steps/SubqueryStep.java 56.33% 21 Missing and 10 partials ⚠️
.../arcadedb/query/opencypher/ast/SubqueryClause.java 50.00% 6 Missing ⚠️
...dedb/query/opencypher/parser/CypherASTBuilder.java 89.47% 0 Missing and 2 partials ⚠️
.../arcadedb/function/cypher/LoadCSVFileFunction.java 66.66% 1 Missing ⚠️
...edb/function/cypher/LoadCSVLineNumberFunction.java 66.66% 1 Missing ⚠️
...query/opencypher/executor/CypherExecutionPlan.java 83.33% 0 Missing and 1 partial ⚠️
...query/opencypher/executor/ExpressionEvaluator.java 66.66% 0 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3450      +/-   ##
==========================================
+ Coverage   62.37%   62.87%   +0.50%     
==========================================
  Files        1291     1295       +4     
  Lines       85339    85692     +353     
  Branches    17532    17614      +82     
==========================================
+ Hits        53226    53882     +656     
+ Misses      24535    24144     -391     
- Partials     7578     7666      +88     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.


robfrank pushed a commit that referenced this pull request Feb 17, 2026
* feat: OpenCypher -> load csv

* feat: added LOAD CSV in OpenCypher

Issue #3441

(cherry picked from commit 9037185)