Skip to content

fix: prevent incorrect filter pushdown for WHERE predicates with nest…#3534

Merged
robfrank merged 3 commits intomainfrom
fix/faiing-cypher-tests
Feb 28, 2026
Merged

fix: prevent incorrect filter pushdown for WHERE predicates with nest…#3534
robfrank merged 3 commits intomainfrom
fix/faiing-cypher-tests

Conversation

@robfrank
Copy link
Collaborator

…ed variable references

The filter pushdown optimization in WhereClause was using a text-based heuristic (collectFromText) to detect referenced variables in InExpression and FunctionCallExpression. This heuristic only found variables followed by a dot (e.g., n.name) but missed bare variable references like b in labels(b). This caused WHERE predicates referencing variables bound in later MATCH steps to be incorrectly pushed down to earlier node scans, where those variables were not yet available, resulting in empty results.

Fixed by properly traversing sub-expressions of InExpression, FunctionCallExpression, ListComprehensionExpression, ArithmeticExpression, ListExpression, and BooleanWrapperExpression to collect all referenced variables accurately.

…ed variable references

The filter pushdown optimization in WhereClause was using a text-based
heuristic (collectFromText) to detect referenced variables in InExpression
and FunctionCallExpression. This heuristic only found variables followed
by a dot (e.g., n.name) but missed bare variable references like b in
labels(b). This caused WHERE predicates referencing variables bound in
later MATCH steps to be incorrectly pushed down to earlier node scans,
where those variables were not yet available, resulting in empty results.

Fixed by properly traversing sub-expressions of InExpression,
FunctionCallExpression, ListComprehensionExpression, ArithmeticExpression,
ListExpression, and BooleanWrapperExpression to collect all referenced
variables accurately.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@gemini-code-assist
Copy link
Contributor

Summary of Changes

Hello, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request resolves an issue in the query optimizer where WHERE predicates containing nested variable references were sometimes incorrectly pushed down, resulting in empty result sets. The core of the fix involves enhancing the variable collection mechanism within the abstract syntax tree (AST) traversal to ensure all variables are properly identified across various expression types. This prevents premature filtering and guarantees correct query execution, as demonstrated by the addition of a new, targeted test case.

Highlights

  • Filter Pushdown Fix: Addressed a critical bug where WHERE predicates with nested variable references were incorrectly pushed down to earlier query steps, leading to erroneous empty results. This was due to an incomplete variable detection heuristic.
  • Improved Variable Collection Logic: Refactored the collectVariablesRecursive and collectExpressionVariables methods to accurately identify all referenced variables within complex expression types, including InExpression, FunctionCallExpression, ListComprehensionExpression, ArithmeticExpression, ListExpression, and BooleanWrapperExpression. This replaces a less robust text-based heuristic.
  • New Test Case Added: Introduced a new test case, listComprehensionInWhereWithLabelsAndToLower, to specifically validate the corrected filter pushdown behavior for scenarios involving list comprehensions, labels(), and toLower() functions within WHERE clauses.

🧠 New Feature in Public Preview: You can now enable Memory to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console.

Changelog
  • engine/src/main/java/com/arcadedb/query/opencypher/ast/WhereClause.java
    • Updated collectVariablesRecursive to correctly traverse InExpression elements for variable collection.
    • Modified collectExpressionVariables to recursively process arguments of FunctionCallExpression and sub-expressions of ListComprehensionExpression, ArithmeticExpression, ListExpression, and BooleanWrapperExpression to accurately gather all referenced variables.
    • Removed reliance on text-based heuristics (collectFromText) for variable detection in these complex expression types.
  • engine/src/test/java/com/arcadedb/query/opencypher/OpenCypherExpressionTest.java
    • Added a new test method, listComprehensionInWhereWithLabelsAndToLower, to verify the fix for filter pushdown with list comprehensions and nested function calls in WHERE clauses.
Activity
  • No human activity has been recorded on this pull request yet.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature Command Description
Code Review /gemini review Performs a code review for the current pull request in its current state.
Pull Request Summary /gemini summary Provides a summary of the current pull request in its current state.
Comment @gemini-code-assist Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help /gemini help Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@mergify
Copy link
Contributor

mergify bot commented Feb 28, 2026

🧪 CI Insights

Here's what we observed from your CI run for 4a6709f.

🟢 All jobs passed!

But CI Insights is watching 👀

Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request effectively addresses a critical bug regarding incorrect filter pushdown by replacing a fragile text-based variable detection heuristic with a robust AST traversal. This is a significant improvement for query correctness. The new test case is also well-designed to validate the fix. I have one suggestion for a further refinement in WhereClause.java to make the variable collection even more accurate for ListComprehensionExpression, which could improve optimization opportunities in the future.

Comment on lines +166 to +171
} else if (expr instanceof ListComprehensionExpression listComp) {
collectExpressionVariables(listComp.getListExpression(), vars);
if (listComp.getWhereExpression() != null)
collectExpressionVariables(listComp.getWhereExpression(), vars);
if (listComp.getMapExpression() != null)
collectExpressionVariables(listComp.getMapExpression(), vars);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

This correctly traverses the sub-expressions of ListComprehensionExpression. However, it doesn't account for the fact that the iteration variable is local to the comprehension. As a result, the local variable is collected as if it were an external dependency.

This is safe because it will prevent pushdown, thus avoiding the original bug of incorrect results. However, it is not fully accurate and can be overly conservative, preventing valid optimizations where a predicate could be pushed down.

For example, in [x IN n.items | x > 10], x is local. The expression only depends on n. The current logic will collect both {n, x}. Since x is not defined in the outer scope, this will prevent pushdown. A more accurate collection would only yield {n}, allowing for pushdown.

To fix this, the variable collection logic would need to be aware of local scopes introduced by constructs like list comprehensions. This might require a larger refactoring (e.g., passing a set of locally-scoped variables to ignore during traversal). While maybe out of scope for this PR, it's a potential future improvement for the optimizer.

@claude
Copy link

claude bot commented Feb 28, 2026

Code Review

This PR correctly identifies and fixes a class of bugs where filter pushdown in WhereClause used text-based heuristics (collectFromText) to detect referenced variables, missing bare variable references like b in labels(b). Replacing those heuristics with proper AST traversal is the right approach.

Positive Aspects

  • The root cause is correctly diagnosed and the fix is minimal and focused.
  • Using getExpression()/getList()/getArguments() etc. instead of text scanning is the correct design direction.
  • The test follows the project's assertion style (assertThat(...).isTrue()) and includes a clear comment referencing the TCK scenario.
  • The fix for ListComprehensionExpression correctly processes all three sub-expressions (listExpression, whereExpression, mapExpression) without mistakenly treating the iteration variable as an outer-scope reference.

Remaining collectFromText() Usages (Same Bug Class)

After this PR, the following expression types still fall back to the text heuristic in collectVariablesRecursive(), which has the same potential for incorrect filter pushdown:

} else if (expr instanceof StringMatchExpression) {
    collectFromText(expr.getText(), vars);  // ⚠️ heuristic
} else if (expr instanceof RegexExpression) {
    collectFromText(expr.getText(), vars);  // ⚠️ heuristic
} else if (expr instanceof LabelCheckExpression) {
    collectFromText(expr.getText(), vars);  // ⚠️ heuristic
} else if (expr instanceof PatternPredicateExpression) {
    collectFromText(expr.getText(), vars);  // ⚠️ heuristic

These have direct accessor methods that could be used instead:

  • StringMatchExpressiongetExpression(), getPattern()
  • RegexExpressiongetExpression(), getPattern()
  • LabelCheckExpressiongetVariableExpression()

PatternPredicateExpression is more complex, but the others are straightforward. These are pre-existing issues not introduced by this PR, but since you are already in this area it would be worth fixing them together, or at minimum opening a follow-up issue so they don't get forgotten.

Expression Types With No Handling At All

A few expression types that implement Expression (not BooleanExpression) have no branch in collectExpressionVariables(): ExistsExpression, ListPredicateExpression (all(), any(), none(), single()). If any of these appear inside a pushdown predicate, the filter pushdown will silently treat them as having no variable references. Again, pre-existing and outside this PR's scope, but worth a follow-up.

Test Coverage

The single test covers the exact scenario from the bug report (TCK List12). It would strengthen the regression suite to add at least one more case directly exercising the other newly-fixed paths, e.g., a bare FunctionCallExpression reference (WHERE someFunc(b) = ...) and an ArithmeticExpression case (WHERE b.age + 1 > 5 with b bound in a later step). Without them, regressions in those code paths would not be caught automatically.

Minor Nit

The test does not explicitly assert the returned node is (:C). Given it could theoretically match (:B) with a different data shape, a property assertion on the returned result would make the test more precise:

final Result result = resultSet.next();
assertThat(result.<String>getProperty("@type")).isEqualTo("C"); // or check a distinguishing property

Summary

The fix is correct and the approach is sound. The two main suggestions are: (1) fix or track the remaining collectFromText() usages for StringMatchExpression/RegexExpression/LabelCheckExpression while in this area, and (2) add test cases for the other newly-covered expression types. Neither is a blocker for this PR.

…versal

Address code review feedback: replace text-based variable detection in
StringMatchExpression, RegexExpression, and LabelCheckExpression with
proper accessor-based traversal. Also assert the returned node type
in the regression test.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@claude
Copy link

claude bot commented Feb 28, 2026

Code Review

This is a well-targeted fix that replaces an unreliable text-based heuristic with proper AST traversal. The root cause (the collectFromText fallback only detecting variable.property patterns, not bare variable references) is clearly diagnosed and the approach is correct.

Strengths

  • The fix correctly uses Java pattern-matching instanceof throughout, consistent with the existing code style in this file.
  • Proper use of final on loop variables and parameters.
  • The test case is well-structured, validates both the positive match and the absence of spurious results, and links to the originating TCK scenario.
  • assertThat(...).as("...") with a descriptive message is a nice touch for test diagnostics.

Issues

1. List comprehension loop variable leaks into outer variable set

In collectExpressionVariables for ListComprehensionExpression, the loop variable (e.g., x in [x IN labels(b) | toLower(x)]) is recursed into the map and where expressions without filtering it out. This means x gets added to vars as if it were an outer query variable.

} else if (expr instanceof ListComprehensionExpression listComp) {
    collectExpressionVariables(listComp.getListExpression(), vars);  // correctly collects `b`
    if (listComp.getWhereExpression() != null)
        collectExpressionVariables(listComp.getWhereExpression(), vars);
    if (listComp.getMapExpression() != null)
        collectExpressionVariables(listComp.getMapExpression(), vars);  // also collects `x` ← false positive
}

The VariableExpression for x inside toLower(x) will match the vars.add(varExpr.getVariableName()) branch and add it. The loop variable should be excluded. A correct approach would be:

} else if (expr instanceof ListComprehensionExpression listComp) {
    collectExpressionVariables(listComp.getListExpression(), vars);
    final Set<String> inner = new HashSet<>(vars);
    if (listComp.getWhereExpression() != null)
        collectExpressionVariables(listComp.getWhereExpression(), inner);
    if (listComp.getMapExpression() != null)
        collectExpressionVariables(listComp.getMapExpression(), inner);
    inner.remove(listComp.getVariable()); // strip loop variable
    vars.addAll(inner);
}

In practice this is a conservative over-approximation (prevents pushdown when the query happens to have a match variable with the same name as a loop variable), but it's technically wrong semantics. For example, a query like:

MATCH (x)-->(b) WHERE x.name IN [x IN labels(b) | x] RETURN b

...would incorrectly report that the predicate references x as both an outer match variable and a list-comprehension loop variable, but the outer x is the real variable here. This edge-case shadowing could cause confusing behavior.

2. PatternPredicateExpression still uses text fallback

The collectFromText heuristic is preserved for PatternPredicateExpression:

} else if (expr instanceof PatternPredicateExpression) {
    // Pattern predicates reference multiple variables, collect from text
    collectFromText(expr.getText(), vars);
}

This is the same incomplete heuristic that caused the original bug — it only detects var.property patterns, not bare variable references. Pattern predicates like EXISTS { (n)-[:R]->(m) } do reference bare variables. This is an existing gap, but worth tracking as a follow-up since the same class of bug can still occur here.

3. Missing expression types

There are other Expression subtypes not yet handled (e.g., CaseExpression, UnaryExpression, MapLiteralExpression, ReduceExpression, etc.). For these, collectExpressionVariables silently does nothing, which means variables inside them are missed. This is a pre-existing issue — not introduced here — but the PR's approach of enumerating concrete types leaves an unbounded set of unhandled cases. A visitor/interface pattern on the AST node types would be more robust long-term.

Test Coverage

The single test case covers the reported scenario well. Consider adding:

  • A case with StringMatchExpression (e.g., WHERE n.name STARTS WITH b.prefix) to cover the newly fixed branch.
  • A case with RegexExpression (e.g., WHERE n.name =~ b.pattern) similarly.
  • A case verifying that a predicate referencing only early-bound variables is still correctly pushed down (regression guard for the optimization itself).

Minor

The @Test annotation is missing @DisplayName — the other tests in the file use method names directly, which is consistent, so this isn't a requirement, just an observation.


Overall: the core fix is correct and necessary. The list comprehension loop variable scoping issue (point 1) is the main thing worth addressing before merge; the others are either pre-existing or minor.

…ization

ExplainStatementExecutionTest: match actual plan output "SCAN WITH FILTER BUCKET"
SelectStatementExecutionTest: accept FetchFromTypeWithFilterStep as valid no-index step

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@claude
Copy link

claude bot commented Feb 28, 2026

Follow-up Review Notes

Corrections and observations not covered in the earlier reviews:

Correction to First Review

My first review comment incorrectly stated that StringMatchExpression, RegexExpression, and LabelCheckExpression still fall back to collectFromText. Looking at the diff again, these three are fully fixed in this PR — they use proper AST accessors (getExpression(), getPattern(), getVariableExpression()). That portion of the first review was wrong. Apologies for the noise.

SQL Test Changes Need Explanation

The PR description focuses on the Cypher filter pushdown fix, but two SQL execution plan tests are also modified:

ExplainStatementExecutionTest — assertion changed from:

assertThat(executionPlanAsString).contains("FILTER ITEMS WHERE");

to:

assertThat(executionPlanAsString).contains("SCAN WITH FILTER BUCKET");

SelectStatementExecutionTest — three assertions relaxed from FetchFromTypeExecutionStep only to accepting FetchFromTypeExecutionStep || FetchFromTypeWithFilterStep.

These changes suggest the SQL query planner is now generating FetchFromTypeWithFilterStep (scan-with-filter) where it previously emitted a separate filter step. This is unrelated to the Cypher WhereClause change, so they are presumably incidental adaptations from rebasing onto a main branch that already had a SQL planner change.

It would help to clarify in the PR description or commit message whether these SQL test changes:

  1. Are purely incidental (tests needed updating due to another change already on main), or
  2. Are intentional and tie to this fix somehow

If incidental, it's worth noting since reviewers might otherwise wonder if the SQL planner was accidentally changed.

Loop Variable Issue (Clarification)

The loop variable over-approximation noted in the previous review (e.g., x from [x IN labels(b) | toLower(x)] leaking into the outer variable set) is a conservative over-approximation, not a correctness bug. In the worst case it prevents a valid pushdown (missed optimization), but it will never cause an incorrect pushdown. This is a minor issue that can be addressed as a follow-up.

@robfrank robfrank merged commit 5614a32 into main Feb 28, 2026
20 of 23 checks passed
@codacy-production
Copy link

codacy-production bot commented Feb 28, 2026

Coverage summary from Codacy

See diff coverage on Codacy

Coverage variation Diff coverage
-9.07% 79.31%
Coverage variation details
Coverable lines Covered lines Coverage
Common ancestor commit (6292916) 102731 67295 65.51%
Head commit (5216c63) 133299 (+30568) 75225 (+7930) 56.43% (-9.07%)

Coverage variation is the difference between the coverage for the head and common ancestor commits of the pull request branch: <coverage of head commit> - <coverage of common ancestor commit>

Diff coverage details
Coverable lines Covered lines Diff coverage
Pull request (#3534) 29 23 79.31%

Diff coverage is the percentage of lines that are covered by tests out of the coverable lines that the pull request added or modified: <covered lines added or modified>/<coverable lines added or modified> * 100%

See your quality gate settings    Change summary preferences

robfrank added a commit that referenced this pull request Mar 5, 2026
…ed variable references (#3534)

(cherry picked from commit 5614a32)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant