
Cypher load csv #3450

Merged
lvca merged 2 commits into main from CypherLoadCSV
Feb 16, 2026

Conversation

@lvca
Member

@lvca lvca commented Feb 16, 2026

Issue #3441

@lvca lvca self-assigned this Feb 16, 2026
@lvca lvca added this to the 26.2.1 milestone Feb 16, 2026
@lvca lvca merged commit 9037185 into main Feb 16, 2026
8 of 10 checks passed
@lvca lvca deleted the CypherLoadCSV branch February 16, 2026 05:53
@gemini-code-assist
Contributor

Summary of Changes

Hello @lvca, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances ArcadeDB's data ingestion capabilities by introducing the LOAD CSV clause to Cypher queries. This feature allows users to seamlessly import data from various CSV sources, including local files, HTTP/S URLs, and compressed archives, directly into the database. It provides flexible parsing options, robust security controls for file access, and introduces transactional batching for subqueries, making large-scale data imports more efficient and reliable.

Highlights

  • New LOAD CSV Cypher Clause: Introduced the LOAD CSV clause, enabling direct import of data from CSV files into ArcadeDB, supporting both local files and URLs.
  • CSV Parsing Options: Implemented support for WITH HEADERS to use the first row as column names and FIELDTERMINATOR for custom delimiters. Robust parsing handles quoted fields, embedded delimiters, newlines, and backslash escaping.
  • File Access Security: Added global configurations arcadedb.opencypher.loadCsv.allowFileUrls and arcadedb.opencypher.loadCsv.importDirectory to control local file access and prevent path traversal vulnerabilities.
  • Contextual Functions for LOAD CSV: Provided new Cypher functions file() and linenumber() to retrieve the current CSV file URL and 1-based line number during processing.
  • Transactional Batching for Subqueries: Enhanced CALL {} subqueries with IN TRANSACTIONS support, allowing operations within subqueries to be committed in configurable batches, improving performance and memory usage for large imports.
  • Compressed File Support: Enabled automatic decompression for GZIP (.gz) and ZIP (.zip) compressed CSV files, simplifying data ingestion from compressed sources.
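
A sketch of how the pieces above combine in a query. The file name, label, and property names are illustrative, and the exact `IN TRANSACTIONS OF n ROWS` batch syntax is an assumption based on the standard Cypher form rather than taken from this PR:

```cypher
// Hypothetical people.csv with a header row: name;city
LOAD CSV WITH HEADERS FROM 'file:///people.csv' AS row FIELDTERMINATOR ';'
CALL {
  WITH row
  CREATE (:Person {name: row.name, city: row.city,
                   source: file(), importedAtLine: linenumber()})
} IN TRANSACTIONS OF 500 ROWS
```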


Changelog
  • engine/src/main/java/com/arcadedb/GlobalConfiguration.java
    • Added OPENCYPHER_LOAD_CSV_ALLOW_FILE_URLS to control local file URL access.
    • Added OPENCYPHER_LOAD_CSV_IMPORT_DIRECTORY to define a root directory for CSV imports and prevent path traversal.
  • engine/src/main/java/com/arcadedb/function/cypher/LoadCSVFileFunction.java
    • Added new stateless function file() to return the current CSV file URL.
  • engine/src/main/java/com/arcadedb/function/cypher/LoadCSVLineNumberFunction.java
    • Added new stateless function linenumber() to return the current CSV line number.
  • engine/src/main/java/com/arcadedb/query/opencypher/ast/ClauseEntry.java
    • Added LOAD_CSV as a new ClauseType to the Cypher Abstract Syntax Tree (AST).
  • engine/src/main/java/com/arcadedb/query/opencypher/ast/LoadCSVClause.java
    • Added new AST class LoadCSVClause to represent the LOAD CSV statement and its parameters.
  • engine/src/main/java/com/arcadedb/query/opencypher/ast/SubqueryClause.java
    • Modified SubqueryClause to include inTransactions and batchSize properties for transactional subqueries.
  • engine/src/main/java/com/arcadedb/query/opencypher/executor/CypherExecutionPlan.java
    • Imported LoadCSVClause and LoadCSVStep.
    • Integrated LOAD_CSV clause into the execution plan builder, creating a LoadCSVStep.
  • engine/src/main/java/com/arcadedb/query/opencypher/executor/CypherFunctionFactory.java
    • Registered file() and linenumber() as Cypher-specific functions.
    • Added factory methods to create instances of LoadCSVFileFunction and LoadCSVLineNumberFunction.
  • engine/src/main/java/com/arcadedb/query/opencypher/executor/ExpressionEvaluator.java
    • Added logic to propagate __loadCSV_file and __loadCSV_linenumber context variables for file() and linenumber() functions.
  • engine/src/main/java/com/arcadedb/query/opencypher/executor/steps/LoadCSVStep.java
    • Added new execution step LoadCSVStep to handle reading, parsing, and processing of CSV data.
    • Implemented methods for opening raw input streams with security checks, resolving file paths, and handling compressed files (GZIP, ZIP).
    • Developed robust CSV line reading and parsing logic, including support for quoted fields and various escape sequences.
  • engine/src/main/java/com/arcadedb/query/opencypher/executor/steps/SubqueryStep.java
    • Modified syncPull to dispatch to syncPullNormal or syncPullInTransactions based on the subquery clause.
    • Added syncPullInTransactions method to handle batch processing and committing within subqueries.
    • Implemented resolveBatchSize to determine the batch size for transactional subqueries.
    • Updated prettyPrint to reflect the IN TRANSACTIONS clause.
  • engine/src/main/java/com/arcadedb/query/opencypher/parser/ClauseDispatcher.java
    • Registered loadCSVClause to be handled by handleLoadCSV.
    • Added handleLoadCSV method to add LoadCSVClause to the statement builder.
  • engine/src/main/java/com/arcadedb/query/opencypher/parser/CypherASTBuilder.java
    • Modified visitSubqueryClause to parse IN TRANSACTIONS parameters, including batch size.
    • Added visitLoadCSVClause method to parse the LOAD CSV clause from the ANTLR grammar.
  • engine/src/main/java/com/arcadedb/query/opencypher/parser/CypherSemanticValidator.java
    • Updated validateBoundVariables and validateVariableScope to correctly handle variables introduced by the LOAD_CSV clause.
  • engine/src/main/java/com/arcadedb/query/opencypher/parser/FunctionValidator.java
    • Registered file and linenumber functions with their argument counts for validation.
  • engine/src/main/java/com/arcadedb/query/opencypher/parser/StatementBuilder.java
    • Added addLoadCSV method to add a LoadCSVClause to the statement's clause order.
  • engine/src/test/java/com/arcadedb/query/opencypher/OpenCypherLoadCSVTest.java
    • Added a new comprehensive test suite for LOAD CSV functionality.
    • Included tests for basic CSV loading, headers, custom terminators, node creation, file() and linenumber() functions, quoted fields, empty files, parameterized URLs, bare file paths, and compressed files (GZIP, ZIP).
    • Added security tests for disabled file URLs and import directory path traversal prevention.
    • Included tests for CALL {} IN TRANSACTIONS with and without specified batch sizes.

@mergify
Contributor

mergify bot commented Feb 16, 2026

🧪 CI Insights

Here's what we observed from your CI run for b693c08.

🟢 All jobs passed!

But CI Insights is watching 👀

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces LOAD CSV functionality to the Cypher engine, a significant feature for data import. It also adds CALL {} IN TRANSACTIONS for batched writes within subqueries, which is very useful for large data loading operations.

The implementation of LOAD CSV is comprehensive, supporting various URL schemes, security controls to prevent path traversal, transparent decompression of .gz and .zip files, and contextual functions like file() and linenumber(). The IN TRANSACTIONS feature for CALL subqueries is also well-implemented, enabling periodic commits crucial for memory management during large imports.

The code is well-structured with new execution steps, AST nodes, and parser rules, and is accompanied by a thorough test suite. I've identified a high-severity issue concerning a potential resource leak in LoadCSVStep when handling compressed files, along with a minor improvement for processing ZIP archives. Overall, this is an excellent contribution that significantly enhances ArcadeDB's data import capabilities via Cypher.

Comment on lines +286 to +300
  static BufferedReader openReader(final String url, final CommandContext context) throws IOException {
    InputStream is = openRawInputStream(url, context);

    if (url.endsWith(".gz"))
      is = new GZIPInputStream(is);
    else if (url.endsWith(".zip")) {
      final ZipInputStream zis = new ZipInputStream(is);
      final ZipEntry entry = zis.getNextEntry();
      if (entry == null)
        throw new CommandExecutionException("ZIP file is empty: " + url);
      is = zis;
    }

    return new BufferedReader(new InputStreamReader(is, StandardCharsets.UTF_8));
  }
Contributor


high

The current implementation of openReader has a potential resource leak. If an exception occurs while wrapping the InputStream (e.g., in new GZIPInputStream(is) or zis.getNextEntry()), the underlying stream obtained from openRawInputStream will not be closed.

Additionally, when reading from a .zip file, the code only gets the first entry and does not handle cases where that entry might be a directory, which would result in reading an empty stream.

I suggest refactoring this method to use a try-catch block to ensure resource cleanup on failure and to correctly handle directories within ZIP archives by iterating until the first file entry is found.

  static BufferedReader openReader(final String url, final CommandContext context) throws IOException {
    InputStream is = openRawInputStream(url, context);
    try {
      if (url.endsWith(".gz")) {
        is = new GZIPInputStream(is);
      } else if (url.endsWith(".zip")) {
        final ZipInputStream zis = new ZipInputStream(is);
        ZipEntry entry = zis.getNextEntry();
        while (entry != null && entry.isDirectory()) {
          entry = zis.getNextEntry();
        }

        if (entry == null) {
          zis.close();
          throw new CommandExecutionException("ZIP file contains no files: " + url);
        }
        is = zis;
      }
      return new BufferedReader(new InputStreamReader(is, StandardCharsets.UTF_8));
    } catch (IOException | RuntimeException e) {
      try {
        is.close();
      } catch (IOException ignored) {
        // suppress
      }
      throw e;
    }
  }

@claude

claude bot commented Feb 16, 2026

PR Review: Cypher LOAD CSV Implementation

Overview

This PR implements Cypher's LOAD CSV clause along with CALL {} IN TRANSACTIONS support for batch processing. The implementation adds ~1350 lines of well-tested code addressing issue #3441.

Positive Aspects

1. Excellent Test Coverage

  • Comprehensive test suite (531 lines) covering: basic CSV loading with/without headers; custom field terminators; quoted fields and edge cases; security features (file URL restrictions, path traversal protection); compression support (.gz, .zip); batch transaction processing; empty files; and parameter-based URLs
  • Uses AssertJ assertions matching project style

2. Strong Security Implementation

  • Two-layer security model: OPENCYPHER_LOAD_CSV_ALLOW_FILE_URLS to disable file access in multi-tenant environments, and OPENCYPHER_LOAD_CSV_IMPORT_DIRECTORY for sandboxed import directory with path traversal protection
  • Path normalization prevents ../ attacks (LoadCSVStep.java:271-276)
  • Well-documented security tests
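
The normalization check described above can be sketched roughly as follows. The class and method names here are illustrative, not the actual LoadCSVStep code:

```java
import java.nio.file.Path;
import java.nio.file.Paths;

public class ImportPathGuard {
  // Resolves a user-supplied relative path against the configured import
  // directory and rejects anything that escapes it (e.g. via "../").
  public static Path resolveInsideImportDir(final String importDirectory, final String userPath) {
    final Path root = Paths.get(importDirectory).toAbsolutePath().normalize();
    final Path resolved = root.resolve(userPath).normalize();
    // After normalize(), any "../" segments have been collapsed, so a path
    // that traversed out of the sandbox no longer starts with the root.
    if (!resolved.startsWith(root))
      throw new SecurityException("Path escapes import directory: " + userPath);
    return resolved;
  }
}
```

Path.startsWith compares whole path components, so sibling directories with a shared name prefix (e.g. /data vs /database) are not mistaken for one another.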

3. Robust CSV Parsing

  • RFC 4180 compliant with extensions: handles quoted fields with embedded delimiters, multiline field support, both "" and " escape sequences, custom field terminators
  • Efficient parsing with proper edge case handling
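
A minimal parser along the lines the review describes, handling quoted fields, embedded delimiters, and `""` escapes. This is an illustrative sketch, not the PR's parseCSVLine (backslash escaping and multiline fields are omitted):

```java
import java.util.ArrayList;
import java.util.List;

public class SimpleCsv {
  // Splits one CSV line on the given delimiter, honoring RFC 4180 quoting:
  // fields may be wrapped in double quotes, and "" inside a quoted field
  // denotes a literal quote character.
  public static List<String> parseLine(final String line, final char delimiter) {
    final List<String> fields = new ArrayList<>();
    final StringBuilder current = new StringBuilder();
    boolean inQuotes = false;
    for (int i = 0; i < line.length(); i++) {
      final char c = line.charAt(i);
      if (inQuotes) {
        if (c == '"') {
          if (i + 1 < line.length() && line.charAt(i + 1) == '"') {
            current.append('"'); // "" -> literal quote
            i++;
          } else
            inQuotes = false; // closing quote
        } else
          current.append(c); // delimiters inside quotes are literal
      } else if (c == '"')
        inQuotes = true;
      else if (c == delimiter) {
        fields.add(current.toString());
        current.setLength(0);
      } else
        current.append(c);
    }
    fields.add(current.toString()); // last field has no trailing delimiter
    return fields;
  }
}
```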

4. Good Architecture

  • Clean separation: AST classes, execution steps, functions
  • Follows existing Cypher execution patterns
  • Proper resource cleanup with try-finally blocks
  • Memory-efficient streaming (doesn't load entire CSV into memory)

5. Feature Completeness

  • file() and linenumber() functions for debugging
  • Transparent compression support
  • HTTP/HTTPS URL support
  • Batch transaction support via CALL {} IN TRANSACTIONS
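
The transparent GZIP handling can be exercised in isolation with a round-trip through java.util.zip, independent of LoadCSVStep:

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class GzipCsvRoundTrip {
  // Compresses a CSV string the way a .gz file on disk would be stored.
  public static byte[] gzip(final String text) throws IOException {
    final ByteArrayOutputStream bos = new ByteArrayOutputStream();
    try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
      gz.write(text.getBytes(StandardCharsets.UTF_8));
    }
    return bos.toByteArray();
  }

  // Mirrors the reader-side wrapping: GZIPInputStream under a UTF-8 reader.
  public static String gunzipFirstLine(final byte[] compressed) throws IOException {
    try (BufferedReader r = new BufferedReader(new InputStreamReader(
        new GZIPInputStream(new ByteArrayInputStream(compressed)), StandardCharsets.UTF_8))) {
      return r.readLine();
    }
  }
}
```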

Issues & Concerns

1. Resource Leak Risk (High Priority)

Location: LoadCSVStep.java:209-222

The closeReader() method swallows IOException in the catch block. If the outer close() is called during exception handling and closeReader() fails, the exception is lost.

Recommendation: Log the IOException at minimum to avoid silent failures.

2. Transaction Management (Medium Priority)

Location: SubqueryStep.java:273-299

In syncPullInTransactions, the code calls database.begin() without checking if a transaction is already active. This could lead to nested transaction issues.

Recommendation: Check transaction state before beginning or document that IN TRANSACTIONS should only be used outside existing transactions.
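
The guard suggested above could look roughly like this. The Database methods shown are assumptions about ArcadeDB's transaction API, illustrated against a minimal stand-in interface:

```java
public class TransactionGuard {
  // Minimal stand-in for the database transaction API
  // (method names are assumptions for illustration).
  interface Database {
    boolean isTransactionActive();
    void begin();
    void commit();
  }

  // Refuses to start a batched transaction when one is already open,
  // instead of silently nesting transactions.
  public static void runBatch(final Database db, final Runnable batch) {
    if (db.isTransactionActive())
      throw new IllegalStateException("IN TRANSACTIONS cannot run inside an open transaction");
    db.begin();
    batch.run();
    db.commit();
  }
}
```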

3. Performance: Inefficient Quote Counting (Low Priority)

Location: LoadCSVStep.java:330-342

countUnescapedQuotes() is called on progressively larger strings in the multiline loop (line 322), creating O(n²) behavior for files with very long multiline fields.

Recommendation: Track quote count incrementally rather than recounting the entire accumulated string.
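
Tracking the count incrementally, as recommended, can be sketched like this. It is an illustrative sketch, not the PR's countUnescapedQuotes, and it ignores backslash escaping for simplicity:

```java
public class QuoteBalance {
  // Counts double quotes in one physical line; a "" pair contributes an
  // even count, so escaped quotes don't change the parity.
  public static int quoteCount(final String line) {
    int count = 0;
    for (int i = 0; i < line.length(); i++)
      if (line.charAt(i) == '"')
        count++;
    return count;
  }

  // A quoted field is still open when the running total of quotes seen so
  // far is odd. Adding each new line's count to a running total costs
  // O(length of that line), instead of re-scanning the whole accumulated
  // multiline string on every iteration.
  public static boolean fieldStillOpen(final int runningQuoteCount) {
    return (runningQuoteCount & 1) == 1;
  }
}
```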

4. Edge Case: ZIP with Multiple Entries (Low Priority)

Location: LoadCSVStep.java:291-296

The code only reads the first entry in a ZIP file. If a ZIP contains multiple files, there's no way to specify which one to load.

Consideration: Document this limitation or add support for file.zip!/entry.csv syntax.

5. Missing Import Validation (Medium Priority)

Location: LoadCSVStep.java:238-258

When file URLs are enabled but no import directory is set, any file path is allowed. This could be a security risk in server deployments.

Recommendation: Add a warning in logs when allowFileUrls=true but importDirectory is empty, or consider making the import directory mandatory when file access is enabled.

6. Potential NPE (Low Priority)

Location: LoadCSVStep.java:226-231

createOutputRow iterates inputRow.getPropertyNames() without null checking inputRow. Add null check to prevent NPE.

7. ParseCSVLine Edge Case (Low Priority)

Location: LoadCSVStep.java:354-360

The trailing delimiter check at line 357 could throw StringIndexOutOfBoundsException if delimLen > i. Add bounds check.

Minor Suggestions

Code Style

  1. LoadCSVStep.java:147 - Consider extracting header mapping to a separate method for readability
  2. Variable naming: currentLineNumber is 0-based internally but 1-based in output - consider renaming for clarity

Documentation

  1. Add JavaDoc examples to LoadCSVStep class showing common usage patterns
  2. Document behavior when URL expression evaluates to different URLs per input row
  3. Add security best practices to GlobalConfiguration descriptions

Testing Gaps

  1. No test for very large CSV files (performance/memory validation)
  2. No test for malformed CSV (e.g., mismatched quote counts)
  3. No test for HTTP URLs (currently only file URLs tested)
  4. No test for concurrent LOAD CSV operations

Performance Considerations

Good: Streaming approach avoids loading entire CSV into memory
Good: Buffered reading with appropriate batch sizes
Concern: No limit on individual field size - malicious CSV with giant quoted field could cause OOM
Concern: HTTP URLs have no timeout configuration (relies on JVM defaults)

Recommendation: Add configuration for max field size and HTTP timeout.
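
For the HTTP timeout part of the recommendation, java.net.URLConnection already exposes the two relevant knobs; the timeout values below are illustrative, and for URLConnection the JVM default of 0 means wait indefinitely:

```java
import java.net.URL;
import java.net.URLConnection;

public class CsvHttpSource {
  // Opens a connection with explicit timeouts instead of relying on the
  // JVM defaults. openConnection() does not contact the server; the
  // network round trip happens only when the caller reads the stream.
  public static URLConnection openWithTimeouts(final String url, final int connectMillis,
      final int readMillis) throws Exception {
    final URLConnection conn = new URL(url).openConnection();
    conn.setConnectTimeout(connectMillis); // time allowed to establish the TCP connection
    conn.setReadTimeout(readMillis);       // time allowed between reads of the response body
    return conn;
  }
}
```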

Security Assessment

Strengths:

  • Path traversal protection via normalization
  • Configurable file access restrictions
  • Clear security boundaries

Recommendations:

  1. Consider making import directory mandatory in server mode
  2. Add rate limiting for HTTP URL fetches (prevent SSRF amplification)
  3. Document that file:/// access should be disabled in multi-tenant deployments
  4. Consider adding a whitelist for allowed HTTP/HTTPS domains

Overall Assessment

This is a high-quality implementation that follows ArcadeDB coding standards well. The code is clean, well-tested, and addresses a real need for CSV import functionality. The security model is well-designed with appropriate safeguards.

Priority Fixes Before Merge

  1. Address resource leak in closeReader() (add logging)
  2. Fix/document transaction management in IN TRANSACTIONS mode
  3. Add bounds check in parseCSVLine trailing delimiter logic

Recommended Follow-ups

  1. Add max field size configuration
  2. Add HTTP timeout configuration
  3. Document security best practices for production deployments
  4. Consider performance testing with large CSV files

Recommendation

Approve with minor fixes - The implementation is solid and ready for merge after addressing the high-priority resource leak logging and potential edge cases. The security model is appropriate for the feature.


Review generated by Claude Code

@codacy-production

Coverage summary from Codacy

See diff coverage on Codacy

Coverage variation: -10.46%    Diff coverage: 76.12%

Coverage variation details
                                     Coverable lines   Covered lines   Coverage
  Common ancestor commit (58da446)   85339             60830           71.28%
  Head commit (b693c08)              115250 (+29911)   70090 (+9260)   60.82% (-10.46%)

Coverage variation is the difference between the coverage for the head and common ancestor commits of the pull request branch: <coverage of head commit> - <coverage of common ancestor commit>

Diff coverage details
                         Coverable lines   Covered lines   Diff coverage
  Pull request (#3450)   356               271             76.12%

Diff coverage is the percentage of lines that are covered by tests out of the coverable lines that the pull request added or modified: <covered lines added or modified>/<coverable lines added or modified> * 100%


@codecov

codecov bot commented Feb 16, 2026

Codecov Report

❌ Patch coverage is 65.16854% with 124 lines in your changes missing coverage. Please review.
✅ Project coverage is 62.87%. Comparing base (58da446) to head (b693c08).
⚠️ Report is 1 commits behind head on main.

Files with missing lines Patch % Lines
...b/query/opencypher/executor/steps/LoadCSVStep.java 61.05% 56 Missing and 25 partials ⚠️
.../query/opencypher/executor/steps/SubqueryStep.java 56.33% 21 Missing and 10 partials ⚠️
.../arcadedb/query/opencypher/ast/SubqueryClause.java 50.00% 6 Missing ⚠️
...dedb/query/opencypher/parser/CypherASTBuilder.java 89.47% 0 Missing and 2 partials ⚠️
.../arcadedb/function/cypher/LoadCSVFileFunction.java 66.66% 1 Missing ⚠️
...edb/function/cypher/LoadCSVLineNumberFunction.java 66.66% 1 Missing ⚠️
...query/opencypher/executor/CypherExecutionPlan.java 83.33% 0 Missing and 1 partial ⚠️
...query/opencypher/executor/ExpressionEvaluator.java 66.66% 0 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3450      +/-   ##
==========================================
+ Coverage   62.37%   62.87%   +0.50%     
==========================================
  Files        1291     1295       +4     
  Lines       85339    85692     +353     
  Branches    17532    17614      +82     
==========================================
+ Hits        53226    53882     +656     
+ Misses      24535    24144     -391     
- Partials     7578     7666      +88     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.


robfrank pushed a commit that referenced this pull request Feb 17, 2026
* feat: OpenCypher -> load csv

* feat: added LOAD CSV in OpenCypher

Issue #3441

(cherry picked from commit 9037185)