Skip to content

fix: IN (SELECT ...) returns wrong results when field has an index (#3565)#3566

Merged
robfrank merged 3 commits intomainfrom
fix/3565-in-subquery-wrong-results-with-index
Mar 3, 2026
Merged

fix: IN (SELECT ...) returns wrong results when field has an index (#3565)#3566
robfrank merged 3 commits intomainfrom
fix/3565-in-subquery-wrong-results-with-index

Conversation

@robfrank
Copy link
Collaborator

@robfrank robfrank commented Mar 3, 2026

The index-based execution path for IN (SELECT ...) failed to find matches because FetchFromIndexStep.init() evaluated the subquery alias before GlobalLetQueryStep had populated it in the context. Additionally, subquery aliases were misidentified as field identifiers (for BY-ITEM inverted syntax), and subquery Result objects were not unwrapped to their raw property values before index lookup.

…3565)

The index-based execution path for IN (SELECT ...) failed to find matches
because FetchFromIndexStep.init() evaluated the subquery alias before
GlobalLetQueryStep had populated it in the context. Additionally, subquery
aliases were misidentified as field identifiers (for BY-ITEM inverted
syntax), and subquery Result objects were not unwrapped to their raw
property values before index lookup.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@robfrank robfrank added this to the 26.3.1 milestone Mar 3, 2026
@robfrank robfrank linked an issue Mar 3, 2026 that may be closed by this pull request
@gemini-code-assist
Copy link
Contributor

Summary of Changes

Hello, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request resolves a critical bug where IN (SELECT ...) queries produced incorrect results when the target field was indexed. The core issue stemmed from improper timing of subquery result evaluation and a failure to correctly unwrap subquery Result objects into their raw values before performing index lookups. Additionally, the system was misidentifying subquery aliases as field identifiers, impacting query processing. The changes ensure that subquery data is correctly prepared and utilized during index-based query execution, leading to accurate results.

Highlights

  • Query Execution Order: Adjusted the execution order within FetchFromIndexStep to ensure that subquery results are fully populated in the context before the index condition is evaluated, preventing incorrect lookups.
  • Subquery Result Unwrapping: Introduced a mechanism to unwrap Result objects returned by subqueries, extracting their raw property values for accurate comparison against indexed fields. This addresses cases where IN (SELECT ...) failed to match due to type mismatch between Result objects and raw index values.
  • Field Identifier Recognition: Refined the logic for identifying field expressions in InCondition to correctly differentiate between actual field identifiers and internal subquery aliases. This prevents subquery aliases from being mistakenly treated as field identifiers, which could lead to incorrect query planning.
  • Regression Tests: Added comprehensive regression tests to validate the fix for IN (SELECT ...) queries across various scenarios, including unique, non-unique, and no-index conditions, as well as multiple subquery results.

🧠 New Feature in Public Preview: You can now enable Memory to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console.

Changelog
  • engine/src/main/java/com/arcadedb/query/sql/executor/FetchFromIndexStep.java
    • Reordered pullPrevious call before init in syncPull to ensure context variables are populated.
    • Applied unwrapSubQueryResult to values used in createCursor within processInCondition and cartesianProduct.
    • Added unwrapSubQueryResult static method to extract raw property values from single-property Result objects.
  • engine/src/main/java/com/arcadedb/query/sql/parser/InCondition.java
    • Updated isRightSideFieldIdentifier logic to better distinguish field identifiers from subquery aliases, specifically by checking for '$' characters.
  • engine/src/test/java/com/arcadedb/query/sql/executor/SelectStatementExecutionTest.java
    • Added inSubqueryWithUniqueIndex test case to verify IN (SELECT ...) with a unique index.
    • Added inSubqueryWithNotUniqueIndex test case to verify IN (SELECT ...) with a non-unique index.
    • Added inSubqueryWithoutIndex test case to verify IN (SELECT ...) without an index.
    • Added inSubqueryWithMultipleValuesAndIndex test case to verify IN (SELECT ...) with multiple subquery results and an index.
Activity
  • No explicit activity (comments, reviews, or progress updates) has been recorded for this pull request yet.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature Command Description
Code Review /gemini review Performs a code review for the current pull request in its current state.
Pull Request Summary /gemini summary Provides a summary of the current pull request in its current state.
Comment @gemini-code-assist Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help /gemini help Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@mergify
Copy link
Contributor

mergify bot commented Mar 3, 2026

🧪 CI Insights

Here's what we observed from your CI run for a560f17.

🟢 All jobs passed!

But CI Insights is watching 👀

Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request provides a solid fix for an issue where IN (SELECT ...) queries with an indexed field returned incorrect results. The changes correctly address the three root causes described: the initialization order of query steps is fixed, subquery aliases are no longer misidentified as field identifiers, and subquery Result objects are now properly unwrapped to their raw values before index lookups. The addition of a comprehensive set of regression tests for various index configurations is a great way to ensure this bug does not reappear. The code is clean and the fixes are well-targeted. I have no further suggestions.

@claude
Copy link

claude bot commented Mar 3, 2026

Code Review

The fix correctly addresses the three root causes described in the PR. Here are my observations:

Correctness

Execution order fix (FetchFromIndexStep.java, syncPull) — Moving pullPrevious before init is the right fix. GlobalLetQueryStep.syncPull calls calculate which populates the context variable via context.setVariable. The previous order evaluated init (and therefore processInCondition) before the subquery variable existed in the context, yielding an empty/null result set.

unwrapSubQueryResult — The guard !result.isElement() correctly limits unwrapping to projection results (not full documents). The single-property restriction is a reasonable heuristic for the common SELECT field FROM Type pattern. However, there is a silent failure mode: when a subquery returns multiple properties (e.g. SELECT name, age FROM Person), the method returns the raw Result object unchanged and the index lookup will still silently fail to find a match. It would be worth noting in the Javadoc that multi-column subqueries are unsupported in this path, or throwing a meaningful CommandExecutionException so the user knows why.

isRightSideFieldIdentifier regex — The regex approach is cleaner than the previous isBaseIdentifier() shortcut. The pattern [a-zA-Z_][a-zA-Z0-9_](.[a-zA-Z_][a-zA-Z0-9_])* correctly rejects subquery aliases containing dollar signs and properly handles nested property paths.

Style / Conventions

  1. Fully qualified type name in unwrapSubQueryResult:
    The declaration uses java.util.Set but java.util.* is already imported — this should just be Set, per the project convention of never using FQNs when an import is present.

  2. Regex compiled on every call in isRightSideFieldIdentifier:
    String.matches compiles the Pattern on every invocation. Given this is called during query planning it is a small cost, but the idiomatic fix is a static compiled constant:

    private static final Pattern FIELD_IDENTIFIER_PATTERN =
    Pattern.compile("[a-zA-Z_][a-zA-Z0-9_](\.[a-zA-Z_][a-zA-Z0-9_])*");
    // ...
    return FIELD_IDENTIFIER_PATTERN.matcher(normalizedRightStr).matches();

Minor Inconsistency

In the anonymous Iterator created inside processInCondition, getKey() returns the original item (potentially a wrapped Result), while createCursor was given unwrapSubQueryResult(item). If any downstream step uses the key stored on the intermediate result it would see the wrapped object rather than the raw value. This is not a regression from the pre-existing code, but worth noting.

Test Coverage

The four regression tests cover the important cases well (unique index, non-unique index, no index, multiple values). One path that lacks a test is the inverted syntax guarded by isRightSideFieldIdentifier — e.g. SELECT FROM T WHERE 'hello' IN tags where tags is a multi-value field — to confirm the regex change does not accidentally break that execution path.

Summary

The three independent bugs (execution order, result unwrapping, alias misclassification) are correctly identified and fixed. Please address the FQN usage and the per-call regex compilation before merging.

Address code review feedback:
- Remove FQN java.util.Set in favor of short form (java.util.* is imported)
- Precompile the field-identifier regex into a static Pattern constant
  to avoid recompilation on every call to isRightSideFieldIdentifier

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@claude
Copy link

claude bot commented Mar 3, 2026

Code Review

The fix correctly identifies and addresses all three root causes described in the PR description. The logic is sound and the test coverage is good. A few things worth noting:

Style violation

InCondition.java, line 46-47: The java.util.regex.Pattern type is used as a fully qualified name instead of being imported. Per CLAUDE.md: "don't use fully qualified names if possible, always import the class and just use the name". An import java.util.regex.Pattern; should be added to the imports.

unwrapSubQueryResult — silent fallthrough for multi-column subqueries

The method only unwraps a Result when it has exactly one property:

if (propertyNames.size() == 1)
    return result.getProperty(propertyNames.iterator().next());

When a subquery selects multiple columns (e.g. WHERE a IN (SELECT x, y FROM B)), the Result object is returned as-is and the index lookup silently finds nothing. This is arguably the correct behavior for an ill-formed query, but a comment noting that multi-column results are intentionally not unwrapped would help future maintainers, and a test documenting the expected behavior would guard against accidental changes.

Minor: getKey() inconsistency in anonymous Map.Entry

In processInCondition(), the anonymous Map.Entry.getKey() returns the original item (the raw Result object before unwrapping), while the cursor is created with the unwrapped value. The index cursor itself yields the correct key via cursor.getKeys() (used in fetchNextEntry()), but the customIterator path puts entry.getKey() (the un-unwrapped Result) into nextEntry.first, which then becomes the key property in the returned ResultInternal. This is a pre-existing inconsistency rather than a regression — key is metadata and does not affect record retrieval — but worth noting.

Missing test cases

  • Inverted syntax regression: isRightSideFieldIdentifier is the most logic-altering change in InCondition. A test covering the existing value IN list_field (BY-ITEM index) scenario would confirm the refactor does not silently break it.
  • Multi-column subquery: a test documenting expected behavior (presumably: returns no results, no exception) would guard the edge case mentioned above.

Execution order fix — correctness confirmed

Reviewing GlobalLetQueryStep: it uses an executed flag so the subquery only runs once, then populates the context variable. Calling pullPrevious() before init() correctly triggers that population, and subsequent syncPull calls are idempotent due to both the executed and inited guards. The fix is correct.

Summary

The core fix is sound and the four regression tests cover the key scenarios. The main actionable items are the missing import for Pattern (style) and adding an inverted-syntax test to cover the isRightSideFieldIdentifier change.

@codacy-production
Copy link

codacy-production bot commented Mar 3, 2026

Coverage summary from Codacy

See diff coverage on Codacy

Coverage variation Diff coverage
-9.97% 73.81%
Coverage variation details
Coverable lines Covered lines Coverage
Common ancestor commit (be74dc6) 103048 77039 74.76%
Head commit (a560f17) 133602 (+30554) 86555 (+9516) 64.79% (-9.97%)

Coverage variation is the difference between the coverage for the head and common ancestor commits of the pull request branch: <coverage of head commit> - <coverage of common ancestor commit>

Diff coverage details
Coverable lines Covered lines Diff coverage
Pull request (#3566) 42 31 73.81%

Diff coverage is the percentage of lines that are covered by tests out of the coverable lines that the pull request added or modified: <covered lines added or modified>/<coverable lines added or modified> * 100%

See your quality gate settings    Change summary preferences

@claude
Copy link

claude bot commented Mar 3, 2026

Code Review

The fix is well-targeted and addresses a real bug. Here is my analysis of each change.


FetchFromIndexStep.syncPull() — execution order swap ✅

// Before
init(context.getDatabase());
pullPrevious(context, nRecords);

// After
pullPrevious(context, nRecords);
init(context.getDatabase());

This is the correct fix. init() evaluates the IN-condition and builds index cursors, which requires subquery aliases (e.g. _$$$SUBQUERY$$$_0) to already be present in the context. GlobalLetQueryStep (which populates those aliases) runs as part of the upstream chain and is only triggered by pullPrevious(). Calling init() first meant it would always see an empty/stale context for those values.

Since init() is guarded by an inited flag, subsequent calls to syncPull() on the same step are not affected.


unwrapSubQueryResult() — new helper ⚠️

The logic is sound: a subquery like SELECT name FROM TypeA returns Result objects wrapping {name: "hello"}, while the index stores the raw string "hello". Unwrapping the single-property result is the right approach.

A few observations:

  1. Silent no-op for multi-column subqueries: If someone writes WHERE ref IN (SELECT name, age FROM TypeA), the method silently returns the raw Result object (2 properties → no unwrap), and the index lookup will produce no matches. This is arguably correct behaviour (the expression is semantically ambiguous), but it would help to add a comment explaining why multi-property results are not unwrapped.

  2. Null property value: If the single property is null (e.g. SELECT nullField FROM TypeA), the method returns null. The downstream createCursor(equals, null, context) must handle a null key gracefully. Worth verifying that the index lookup for a null key is either handled or a defensive null-check is added here.

  3. isElement() guard: Correctly excludes real document/edge/vertex Results from unwrapping, preventing RID-valued results from being converted to a raw property. Good.


InCondition.isRightSideFieldIdentifier() — regex-based approach ✅ with one note

private static final Pattern FIELD_IDENTIFIER_PATTERN =
    Pattern.compile("[a-zA-Z_][a-zA-Z0-9_]*(\\.[a-zA-Z_][a-zA-Z0-9_]*)*");

Fix is correct: the old code returned true immediately from isBaseIdentifier() without checking for $, so subquery aliases like _$$$SUBQUERY$$$_0 were mistakenly treated as field identifiers. The regex correctly rejects them.

Performance improvement: using a pre-compiled static final Pattern instead of an inline String.matches() (which recompiles the regex on every call) is a welcome micro-optimisation in a query hot path.

One edge case to confirm: The old code stripped backtick-quoting and then fell through to the regex. The new code does the same (.replace("", "")). However, if ArcadeDB ever supports backtick-escaped identifiers containing characters outside [a-zA-Z0-9_](e.g. ``my-field` ``), they would now be treated as non-field expressions. This is probably fine given ArcadeDB's identifier rules, but worth a quick sanity-check.


Regression tests in SelectStatementExecutionTest

The four new test cases are good:

  • inSubqueryWithUniqueIndex — directly covers the reported bug
  • inSubqueryWithNotUniqueIndex — non-unique LSM index path
  • inSubqueryWithoutIndex — ensures the no-index path still works (important regression guard)
  • inSubqueryWithMultipleValuesAndIndex — multiple rows from the subquery

One suggestion: consider adding a test for the null property edge case in unwrapSubQueryResult (subquery returns a row where the projected field is null). Even a simple SELECT FROM TypeB WHERE ref IN (SELECT nullField FROM TypeA) would clarify expected behaviour.


Import / style cleanups ✅

All the replacements of fully-qualified names with proper imports (java.io.FileFile, java.util.CollectionsCollections, com.arcadedb.schema.LocalTimeSeriesTypeLocalTimeSeriesType, etc.) are correct and align with the coding standard.


Summary

The core fix is correct and necessary. The two-part solution (execution ordering + result unwrapping) properly addresses both root causes described in the PR. Test coverage is solid. The minor points above are mostly hardening suggestions, not blockers.

@robfrank robfrank merged commit dd735f1 into main Mar 3, 2026
21 of 23 checks passed
@robfrank robfrank deleted the fix/3565-in-subquery-wrong-results-with-index branch March 3, 2026 14:18
@codecov
Copy link

codecov bot commented Mar 3, 2026

Codecov Report

❌ Patch coverage is 52.38095% with 20 lines in your changes missing coverage. Please review.
✅ Project coverage is 65.68%. Comparing base (0fbdfdb) to head (a560f17).
⚠️ Report is 2 commits behind head on main.

Files with missing lines Patch % Lines
...va/com/arcadedb/query/sql/antlr/SQLASTBuilder.java 55.55% 0 Missing and 4 partials ⚠️
...rcadedb/query/sql/executor/FetchFromIndexStep.java 55.55% 2 Missing and 2 partials ⚠️
...m/arcadedb/query/sql/executor/SaveElementStep.java 0.00% 3 Missing ⚠️
...edb/query/sql/executor/SelectExecutionPlanner.java 0.00% 2 Missing and 1 partial ⚠️
...m/arcadedb/engine/timeseries/TimeSeriesBucket.java 66.66% 1 Missing and 1 partial ⚠️
...main/java/com/arcadedb/database/LocalDatabase.java 0.00% 0 Missing and 1 partial ⚠️
...adedb/engine/timeseries/codec/DictionaryCodec.java 66.66% 1 Missing ⚠️
...rcadedb/engine/timeseries/codec/Simple8bCodec.java 66.66% 1 Missing ⚠️
.../java/com/arcadedb/index/hash/HashIndexBucket.java 0.00% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3566      +/-   ##
==========================================
+ Coverage   65.62%   65.68%   +0.05%     
==========================================
  Files        1506     1506              
  Lines      103035   103053      +18     
  Branches    21360    21366       +6     
==========================================
+ Hits        67620    67689      +69     
+ Misses      26175    26108      -67     
- Partials     9240     9256      +16     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

robfrank added a commit that referenced this pull request Mar 5, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

IN (SELECT ...) returns wrong results

1 participant