Skip to content

Resolve 3 SQL builder bugs found by user-path tests#578

Merged
auxten merged 5 commits into
mainfrom
fix/sql-builder-bugs-and-user-path-tests
May 21, 2026
Merged

Resolve 3 SQL builder bugs found by user-path tests#578
auxten merged 5 commits into
mainfrom
fix/sql-builder-bugs-and-user-path-tests

Conversation

@auxten

@auxten auxten commented May 21, 2026

Copy link
Copy Markdown
Member

Summary

Stacked on top of #577. Fixes three SQL builder bugs the new user-path /
property-based test sweep uncovered, and flips the three xfailed
regression slots in datastore/tests/journeys/ into permanent passing
guards.

Bugs fixed

  1. head(N).groupby().agg() returns "first N groups" instead of
    "agg over first N rows"
    (query_planner._build_layers).
    LIMIT was folded into the same SQL layer as GROUP BY. The planner
    now starts a new nested-subquery layer when LazyGroupByAgg
    follows LIMIT/OFFSET.

  2. result[result['count'] > 0] after groupby().agg(...) silently
    binds the WHERE to the source column of the same name

    (sql_executor._render_first_layer_sql).
    The post-aggregation conditions were being appended to the
    pre-agg WHERE list "as a defensive fallback". They are now emitted
    as a real HAVING clause (composed with any existing
    self.ds._having_condition).

  3. assign(x=...).groupby().agg(...) on a Python() DataFrame source
    emits SELECT *, expr AS x, agg(...) ... GROUP BY ...
    which
    ClickHouse rejects with NOT_AN_AGGREGATE
    (sql_executor._build_sql_for_dataframe, _build_layer).
    The single-layer DataFrame inline path was a parallel implementation
    that lacked the WHERE-by-position split, alias rewriting and
    include_star=False GROUP BY guard. Any plan containing
    groupby_agg / where_ops / multi-layer is now routed through the
    unified _build_layered_sql. _build_layer forces
    include_star=False whenever the layer has a groupby_agg.

Supporting fixes in the unified pipeline

  • _build_layer: append col IS NOT NULL for the groupby_cols when
    dropna=True (pandas default), so DataFrame source GROUP BY no
    longer leaks NULL groups - matches the legacy inline path.
  • _render_wrapper_layer_sql: new where_temp_alias_columns parameter
    so the wrapper layer renames __tmp_col__ back to user-visible
    names; postpone WHERE/ORDER BY/LIMIT/OFFSET to the outer rename
    layer so they bind to the masked values (LazyMask/LazyWhere
    semantic); suppress ORDER BY _row_id fallback when GROUP BY is
    present.
  • Planner splits a pure-projection SELECT into a new layer when an
    in-layer WHERE references a column the projection would drop (e.g.
    assign(x=...)[where x>0][['id']], as generated by ColumnExpr's
    per-column pruning).

Test coverage added

  • datastore/tests/journeys/test_mark_slack_amazon_reviews.py
    Verbatim mirror of the original Slack-reported pipeline (15 cases
    covering all 4 filter idioms x parquet/DataFrame sources, plus
    follow-on operations).
  • datastore/tests/journeys/test_agg_then_use_result.py
    Cartesian 5 aggregations x 5 follow-ons = 25 result-as-input
    user paths.
  • datastore/tests/journeys/test_kaggle_style_exploration.py
    4 end-to-end Kaggle-style EDA chains (top products by region,
    customer order histogram, top active users, multi-stage revenue
    filter).
  • datastore/tests/test_cartesian_result_followups.py
    Mechanically generated 8 builders x 6 follow-ups = 48 cases.
  • datastore/tests/test_naming_conflict_fuzz.py
    17 alias-collision scenarios.
  • datastore/tests/test_sql_snapshot_assertions.py
    17 structural SQL assertions (presence/absence of GROUP BY, WHERE,
    LIMIT, ORDER BY, __subq aliases).
  • datastore/tests/test_plan_invariants.py
    44 plan-shape invariants over the layered pipeline.
  • datastore/tests/test_property_based_chains.py
    Hypothesis-driven random op-chain fuzzer (filter / sort / head /
    select / agg). Now runs without skip filters for the two bugs that
    used to be tracked separately. Empty-result groupby cases (a
    pre-existing pandas-vs-DataStore column-set divergence) are
    explicitly short-circuited.
  • .cursor/rules/chdb-ds.mdc: added "Multi-Step Mirror for Bug-Fix
    Tests" + "Property-Based Sweep for Compositional Bugs" sections.
  • requirements-dev.txt: hypothesis>=6.0.0.

Test plan

  • pytest datastore/tests/ (excluding remote): 10473 passed,
    0 failed
    , 17 skipped, 87 xfailed, 2 xpassed.
  • Hypothesis sweep with max_examples=1000: passes.
  • All 47 journey tests pass, including the 3 flipped xfails:
    • test_head_before_agg_respects_limit
    • test_agg_output_name_shadows_source_column
    • test_top_three_revenue_dataframe_source_assign_before_groupby
  • No new linter errors in the touched files.

@auxten auxten requested a review from Copilot May 21, 2026 05:28
@auxten auxten changed the title fix(datastore): resolve 3 SQL builder bugs found by user-path tests Resolve 3 SQL builder bugs found by user-path tests May 21, 2026

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR fixes multiple SQL-planning / SQL-building edge cases uncovered by new user-journey and property-based tests, and adds a broad regression suite to prevent silent SQL-shape regressions (especially around LIMIT/GROUP BY layering, post-aggregation filtering, and alias collisions).

Changes:

  • Update QueryPlanner._build_layers to introduce new subquery layers for LIMIT/OFFSET -> groupby_agg and for projection/WHERE dependency cases.
  • Update the SQL builder to (a) emit post-aggregation predicates as HAVING, (b) apply dropna=True semantics by adding IS NOT NULL filters on group keys, and (c) route more DataFrame-source plans through the unified layered pipeline.
  • Add extensive regression coverage: journey-style mirrors, structural SQL snapshot/invariant tests, cartesian follow-up tests, and Hypothesis fuzz/property-based chains; add Hypothesis as a dev dependency.

Reviewed changes

Copilot reviewed 14 out of 14 changed files in this pull request and generated 1 comment.

Show a summary per file
File Description
requirements-dev.txt Adds hypothesis to dev dependencies for property-based testing.
datastore/query_planner.py Adds new layer-splitting rules to preserve pandas execution-order semantics.
datastore/sql_executor.py Adjusts layered SQL building (HAVING, dropna filters, wrapper-layer rendering, DataFrame routing).
datastore/tests/test_groupby_sql_pushdown.py Relaxes a SQL substring assertion to tolerate dropna-driven IS NOT NULL filters.
datastore/tests/test_sql_snapshot_assertions.py New structural SQL substring assertions (“shape” snapshots).
datastore/tests/test_plan_invariants.py New plan/SQL structural invariants (layer/subquery counts, alias invariants, etc.).
datastore/tests/test_property_based_chains.py New Hypothesis-generated op-chain property tests against pandas.
datastore/tests/test_naming_conflict_fuzz.py New systematic alias/name-collision regression coverage.
datastore/tests/test_cartesian_result_followups.py New cartesian “result-as-input” follow-up matrix against pandas.
datastore/tests/journeys/__init__.py Documents conventions for long-chain user-journey regressions.
datastore/tests/journeys/test_mark_slack_amazon_reviews.py New verbatim Slack-reported pipeline regression suite (parquet + DataFrame sources).
datastore/tests/journeys/test_kaggle_style_exploration.py New end-to-end “Kaggle-style” EDA chain regressions.
datastore/tests/journeys/test_agg_then_use_result.py New “aggregation result as input” journey regressions + explicit shadowing regression.
.cursor/rules/chdb-ds.mdc Adds contributor guidance for multi-step journey regressions and property-based sweeps.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread datastore/sql_executor.py
auxten added a commit that referenced this pull request May 21, 2026
Two follow-ups from automated review:

1. `_build_execution_sql` (core.py): the helper backing `to_sql()` /
   `explain()`'s SQL preview used `next(seg for seg in segments if
   seg.is_sql())`, which would pick a LATER `SQL-on-Python(__df__)`
   segment when the first segment was Pandas (cost-aware planner for
   `ORDER BY` without `LIMIT`). That returned a `Python(__df__)` SQL
   whose `FROM` does not match the source the executor actually
   issues first. Constrained to the FIRST segment: if it is SQL,
   return that segment's SQL; otherwise fall back to
   `_generate_select_sql` (the `SELECT * FROM <source>` the executor
   runs to materialize rows for Pandas).

2. `_build_layers` (query_planner.py): `mask(...).groupby(...).agg(...)`
   (or `where(...).groupby(...).agg(...)`) kept the LazyMask/LazyWhere
   in the SAME layer as LazyGroupByAgg. The wrapper-layer CASE-WHEN
   temp-alias machinery (`__tmp_<col>__`) then collided with the
   aggregation's SELECT list - the outer SELECT renamed `__tmp_*__`
   columns that the inner GROUP-BY SELECT never emitted, producing
   `UNKNOWN_IDENTIFIER` (DataFrame source) or wrong values
   (file/table source). Now LazyGroupByAgg starts a new nested layer
   when the current layer already holds a LazyWhere/LazyMask, so the
   mask runs in the inner subquery and GROUP BY runs on the masked
   output:
   `SELECT ... agg() FROM (SELECT CASE WHEN ... AS x) GROUP BY x`.

3. `_render_wrapper_layer_sql` (sql_executor.py): the
   `add_row_order_fallback` `ORDER BY _row_id` clause is intentionally
   skipped when wrapping with `__case_subq__`. The `Python(__df__)`
   virtual `_row_id` is not propagated through the inner CASE-WHEN
   SELECT, so referencing it on the outer rename layer would fail with
   `UNKNOWN_IDENTIFIER`. CASE-WHEN is row-preserving, so source order
   naturally flows up through the wrapper.

Regression tests:
- `tests/test_groupby_sql_pushdown.py::TestDispatchByOpType
  ::test_mask_then_groupby_splits_into_nested_layers` and
  `test_where_then_groupby_splits_into_nested_layers` pin both the
  plan structure (2 layers, mask in inner, groupby in outer) and the
  end-to-end value equality vs pandas.
- `tests/test_explain_method.py::TestBuildExecutionSQLFirstSegment`
  covers both branches: SQL-first chain returns source-bound SQL,
  Pandas-first chain returns the initial `SELECT * FROM <source>`
  (never `Python(__df__)`).

Verified: 10477 passed, 0 failed (full suite, remote tests excluded).
@auxten auxten changed the base branch from refactor/unify-sql-layer-builder to main May 21, 2026 05:53
auxten added 2 commits May 21, 2026 13:53
Fixes three bugs uncovered by the user-journey + property-based test
sweep introduced on top of PR #577's unified `_build_layer` dispatcher.

1. head-before-agg LIMIT ordering (`query_planner._build_layers`)
   `head(N).groupby().agg()` was folded into one SQL layer, so LIMIT
   ran after GROUP BY and limited the number of groups instead of the
   number of input rows. The planner now starts a new nested-subquery
   layer when `LazyGroupByAgg` follows LIMIT/OFFSET.

2. Post-agg WHERE folded back into pre-agg WHERE (`sql_executor`)
   `result[result['count'] > 0]` after a `groupby().agg(...)` had its
   condition appended to the pre-aggregation WHERE clause, which
   silently bound to the source column of the same name. The first-
   layer renderer now emits the post-agg conditions as a real HAVING
   clause (combined with any user-supplied `self.ds._having_condition`).

3. DataFrame source assign-before-groupby (`sql_executor._build_sql_for_dataframe`)
   `ds.assign(x=...).groupby(...).agg(...)` against a Python() DataFrame
   source emitted `SELECT *, expr AS x, agg(...) ... GROUP BY ...`,
   which ClickHouse rejects with NOT_AN_AGGREGATE. The single-layer
   DataFrame path now routes any plan containing `groupby_agg` /
   `where_ops` / multi-layer through the unified `_build_layered_sql`
   pipeline, and `_build_layer` forces `include_star=False` whenever
   the layer has a `groupby_agg` (the agg's emitted select_fields
   already cover the group keys + aggregations).

Supporting changes in the unified pipeline:
- `_build_layer`: append `col IS NOT NULL` filters for `groupby_cols`
  when `dropna=True` (pandas default) - matches the legacy inline
  DataFrame path that was previously the only place this lived.
- `_render_wrapper_layer_sql`: accept `where_temp_alias_columns` so
  the wrapper layer renames `__tmp_col__` back to user-visible
  column names; postpone WHERE/ORDER BY/LIMIT/OFFSET to the outer
  rename layer so they bind to the masked values (LazyMask/LazyWhere
  semantic); suppress `ORDER BY _row_id` fallback when GROUP BY is
  present (would be NOT_AN_AGGREGATE).
- `_collect_select_fields_from_sql_ops`: unchanged accumulating
  semantics; instead the planner splits a pure-projection SELECT
  into a new layer when an in-layer WHERE references a column the
  projection would drop (e.g. `assign(x=...)[where x>0][['id']]` as
  produced by ColumnExpr's per-column pruning path).

Test coverage:
- 3 previously-xfailed regression tests in `datastore/tests/journeys/`
  (`test_head_before_agg_respects_limit`,
  `test_agg_output_name_shadows_source_column`,
  `test_top_three_revenue_dataframe_source_assign_before_groupby`)
  flipped to permanent passing guards.
- `test_property_based_chains.py` no longer skips the
  head-before-agg / sort-or-select-before-agg chains; remaining
  empty-result groupby cases (pre-existing pandas-vs-DataStore
  column-set divergence) are short-circuited explicitly.
- New `datastore/tests/` files (journeys/, cartesian product,
  naming-conflict fuzz, SQL snapshot assertions, plan invariants,
  property-based chains) introduce ~150 additional regressions.
- Updated `.cursor/rules/chdb-ds.mdc` with multi-step-mirror and
  property-based-sweep test guidelines.
- `requirements-dev.txt`: add `hypothesis>=6.0.0`.

Verified:
- 10473 passed, 0 failed (full suite, remote tests excluded).
- 1000-example hypothesis property sweep passes.
- All 47 journey tests pass (including the 3 flipped xfails).
Two follow-ups from automated review:

1. `_build_execution_sql` (core.py): the helper backing `to_sql()` /
   `explain()`'s SQL preview used `next(seg for seg in segments if
   seg.is_sql())`, which would pick a LATER `SQL-on-Python(__df__)`
   segment when the first segment was Pandas (cost-aware planner for
   `ORDER BY` without `LIMIT`). That returned a `Python(__df__)` SQL
   whose `FROM` does not match the source the executor actually
   issues first. Constrained to the FIRST segment: if it is SQL,
   return that segment's SQL; otherwise fall back to
   `_generate_select_sql` (the `SELECT * FROM <source>` the executor
   runs to materialize rows for Pandas).

2. `_build_layers` (query_planner.py): `mask(...).groupby(...).agg(...)`
   (or `where(...).groupby(...).agg(...)`) kept the LazyMask/LazyWhere
   in the SAME layer as LazyGroupByAgg. The wrapper-layer CASE-WHEN
   temp-alias machinery (`__tmp_<col>__`) then collided with the
   aggregation's SELECT list - the outer SELECT renamed `__tmp_*__`
   columns that the inner GROUP-BY SELECT never emitted, producing
   `UNKNOWN_IDENTIFIER` (DataFrame source) or wrong values
   (file/table source). Now LazyGroupByAgg starts a new nested layer
   when the current layer already holds a LazyWhere/LazyMask, so the
   mask runs in the inner subquery and GROUP BY runs on the masked
   output:
   `SELECT ... agg() FROM (SELECT CASE WHEN ... AS x) GROUP BY x`.

3. `_render_wrapper_layer_sql` (sql_executor.py): the
   `add_row_order_fallback` `ORDER BY _row_id` clause is intentionally
   skipped when wrapping with `__case_subq__`. The `Python(__df__)`
   virtual `_row_id` is not propagated through the inner CASE-WHEN
   SELECT, so referencing it on the outer rename layer would fail with
   `UNKNOWN_IDENTIFIER`. CASE-WHEN is row-preserving, so source order
   naturally flows up through the wrapper.

Regression tests:
- `tests/test_groupby_sql_pushdown.py::TestDispatchByOpType
  ::test_mask_then_groupby_splits_into_nested_layers` and
  `test_where_then_groupby_splits_into_nested_layers` pin both the
  plan structure (2 layers, mask in inner, groupby in outer) and the
  end-to-end value equality vs pandas.
- `tests/test_explain_method.py::TestBuildExecutionSQLFirstSegment`
  covers both branches: SQL-first chain returns source-bound SQL,
  Pandas-first chain returns the initial `SELECT * FROM <source>`
  (never `Python(__df__)`).

Verified: 10477 passed, 0 failed (full suite, remote tests excluded).
@auxten auxten force-pushed the fix/sql-builder-bugs-and-user-path-tests branch from 889453f to 0ed1ed5 Compare May 21, 2026 05:53
auxten added 3 commits May 21, 2026 16:10
PR #578 added `hypothesis>=6.0.0` to `requirements-dev.txt` so
`datastore/tests/test_property_based_chains.py` can import
`hypothesis`. But the GitHub Actions matrices hardcode their pip
install command (`pytest pytest-cov [ray] $PANDAS_CONSTRAINT`) and
never source `requirements-dev.txt`, so collection aborted with
`ModuleNotFoundError: No module named 'hypothesis'` across all 4
platform jobs.

Add `hypothesis` inline to each workflow's install step:
- build_linux_x86_wheels.yml
- build_linux_arm64_wheels-gh.yml
- build_macos_arm64_wheels.yml
- build_macos_x86_wheels.yml
The previous empty-result short-circuit in
``test_random_chain_matches_pandas`` checked
``isinstance(ds_result, pd.DataFrame)``, but ``apply_chain`` returns
a lazy ``DataStore``, not a materialized ``pd.DataFrame`` - so the
isinstance branch was always False and the short-circuit never
fired. CI on commit 64f38b8 surfaced this when Hypothesis generated
the chain ``filter(v>0) -> filter(v>99) -> sort(w desc) ->
groupby(cat).agg({'v':'sum'})`` (empty intermediate after the
second filter; column drift between pandas' agg-projected output
and DataStore's source-bound short-circuit output).

Two changes:

- ``tests/test_property_based_chains.py``: replace the isinstance
  check with a duck-typed ``len(r) == 0`` helper that works on
  ``DataStore`` and ``pd.DataFrame`` alike. Reference the new
  journey test in the comment so future readers can find the
  tracked bug.

- ``tests/journeys/test_property_chain_followups.py`` (new): verbatim
  regression for the falsifying chain, marked
  ``@unittest.expectedFailure``. Follows the
  ``.cursor/rules/chdb-ds.mdc`` rule #6 contract: every Hypothesis
  falsifier becomes a permanent journey test; when the underlying
  empty-intermediate column-projection bug in the executor is
  fixed, flip the xfail and drop the property-test skip in the
  same commit.
…survives

The debug log truncated executed SQL at 200 chars with "...". PR
#578 added the ``col IS NOT NULL`` dropna filter to GROUP BY queries
on SQL sources too (previously only DataFrame source). For remote
sources this pushed the SQL from ~200 to ~220 chars, truncating
``GROUP BY`` to ``GROUP B...`` in the log output.

``tests/test_remote_lazy_chain_pushdown.py::test_assign_filter_groupby_agg``
parses the debug log and asserts ``'GROUP BY' in sql.upper()``, so
the truncation made the assertion fail across all 4 CI platforms.

The actual emitted SQL is correct; only the log line was truncated.
Bump the limit to 2000 chars (10x) so typical multi-clause queries
fit in full and the log-parsing tests work again. The full SQL is
still bounded so very large queries don't spam logs.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants