Resolve 3 SQL builder bugs found by user-path tests by auxten · Pull Request #578 · chdb-io/chdb

auxten · 2026-05-21T05:26:48Z

Summary

Stacked on top of #577. Fixes three SQL builder bugs the new user-path /
property-based test sweep uncovered, and flips the three xfailed
regression slots in datastore/tests/journeys/ into permanent passing
guards.

Bugs fixed

head(N).groupby().agg() returns "first N groups" instead of
"agg over first N rows" (query_planner._build_layers).
LIMIT was folded into the same SQL layer as GROUP BY. The planner
now starts a new nested-subquery layer when LazyGroupByAgg
follows LIMIT/OFFSET.
result[result['count'] > 0] after groupby().agg(...) silently
binds the WHERE to the source column of the same name
(sql_executor._render_first_layer_sql).
The post-aggregation conditions were being appended to the
pre-agg WHERE list "as a defensive fallback". They are now emitted
as a real HAVING clause (composed with any existing
self.ds._having_condition).
assign(x=...).groupby().agg(...) on a Python() DataFrame source
emits SELECT *, expr AS x, agg(...) ... GROUP BY ... which
ClickHouse rejects with NOT_AN_AGGREGATE
(sql_executor._build_sql_for_dataframe, _build_layer).
The single-layer DataFrame inline path was a parallel implementation
that lacked the WHERE-by-position split, alias rewriting and
include_star=False GROUP BY guard. Any plan containing
groupby_agg / where_ops / multi-layer is now routed through the
unified _build_layered_sql. _build_layer forces
include_star=False whenever the layer has a groupby_agg.

Supporting fixes in the unified pipeline

_build_layer: append col IS NOT NULL for the groupby_cols when
dropna=True (pandas default), so DataFrame source GROUP BY no
longer leaks NULL groups - matches the legacy inline path.
_render_wrapper_layer_sql: new where_temp_alias_columns parameter
so the wrapper layer renames __tmp_col__ back to user-visible
names; postpone WHERE/ORDER BY/LIMIT/OFFSET to the outer rename
layer so they bind to the masked values (LazyMask/LazyWhere
semantic); suppress ORDER BY _row_id fallback when GROUP BY is
present.
Planner splits a pure-projection SELECT into a new layer when an
in-layer WHERE references a column the projection would drop (e.g.
assign(x=...)[where x>0][['id']], as generated by ColumnExpr's
per-column pruning).

Test coverage added

datastore/tests/journeys/test_mark_slack_amazon_reviews.py
Verbatim mirror of the original Slack-reported pipeline (15 cases
covering all 4 filter idioms x parquet/DataFrame sources, plus
follow-on operations).
datastore/tests/journeys/test_agg_then_use_result.py
Cartesian 5 aggregations x 5 follow-ons = 25 result-as-input
user paths.
datastore/tests/journeys/test_kaggle_style_exploration.py
4 end-to-end Kaggle-style EDA chains (top products by region,
customer order histogram, top active users, multi-stage revenue
filter).
datastore/tests/test_cartesian_result_followups.py
Mechanically generated 8 builders x 6 follow-ups = 48 cases.
datastore/tests/test_naming_conflict_fuzz.py
17 alias-collision scenarios.
datastore/tests/test_sql_snapshot_assertions.py
17 structural SQL assertions (presence/absence of GROUP BY, WHERE,
LIMIT, ORDER BY, __subq aliases).
datastore/tests/test_plan_invariants.py
44 plan-shape invariants over the layered pipeline.
datastore/tests/test_property_based_chains.py
Hypothesis-driven random op-chain fuzzer (filter / sort / head /
select / agg). Now runs without skip filters for the two bugs that
used to be tracked separately. Empty-result groupby cases (a
pre-existing pandas-vs-DataStore column-set divergence) are
explicitly short-circuited.
.cursor/rules/chdb-ds.mdc: added "Multi-Step Mirror for Bug-Fix
Tests" + "Property-Based Sweep for Compositional Bugs" sections.
requirements-dev.txt: hypothesis>=6.0.0.

Test plan

pytest datastore/tests/ (excluding remote): 10473 passed,
0 failed, 17 skipped, 87 xfailed, 2 xpassed.
Hypothesis sweep with max_examples=1000: passes.
All 47 journey tests pass, including the 3 flipped xfails:
- test_head_before_agg_respects_limit
- test_agg_output_name_shadows_source_column
- test_top_three_revenue_dataframe_source_assign_before_groupby
No new linter errors in the touched files.

Copilot

Pull request overview

This PR fixes multiple SQL-planning / SQL-building edge cases uncovered by new user-journey and property-based tests, and adds a broad regression suite to prevent silent SQL-shape regressions (especially around LIMIT/GROUP BY layering, post-aggregation filtering, and alias collisions).

Changes:

Update QueryPlanner._build_layers to introduce new subquery layers for LIMIT/OFFSET -> groupby_agg and for projection/WHERE dependency cases.
Update the SQL builder to (a) emit post-aggregation predicates as HAVING, (b) apply dropna=True semantics by adding IS NOT NULL filters on group keys, and (c) route more DataFrame-source plans through the unified layered pipeline.
Add extensive regression coverage: journey-style mirrors, structural SQL snapshot/invariant tests, cartesian follow-up tests, and Hypothesis fuzz/property-based chains; add Hypothesis as a dev dependency.

Reviewed changes

Copilot reviewed 14 out of 14 changed files in this pull request and generated 1 comment.

Show a summary per file

File	Description
`requirements-dev.txt`	Adds `hypothesis` to dev dependencies for property-based testing.
`datastore/query_planner.py`	Adds new layer-splitting rules to preserve pandas execution-order semantics.
`datastore/sql_executor.py`	Adjusts layered SQL building (HAVING, dropna filters, wrapper-layer rendering, DataFrame routing).
`datastore/tests/test_groupby_sql_pushdown.py`	Relaxes a SQL substring assertion to tolerate `dropna`-driven `IS NOT NULL` filters.
`datastore/tests/test_sql_snapshot_assertions.py`	New structural SQL substring assertions (“shape” snapshots).
`datastore/tests/test_plan_invariants.py`	New plan/SQL structural invariants (layer/subquery counts, alias invariants, etc.).
`datastore/tests/test_property_based_chains.py`	New Hypothesis-generated op-chain property tests against pandas.
`datastore/tests/test_naming_conflict_fuzz.py`	New systematic alias/name-collision regression coverage.
`datastore/tests/test_cartesian_result_followups.py`	New cartesian “result-as-input” follow-up matrix against pandas.
`datastore/tests/journeys/__init__.py`	Documents conventions for long-chain user-journey regressions.
`datastore/tests/journeys/test_mark_slack_amazon_reviews.py`	New verbatim Slack-reported pipeline regression suite (parquet + DataFrame sources).
`datastore/tests/journeys/test_kaggle_style_exploration.py`	New end-to-end “Kaggle-style” EDA chain regressions.
`datastore/tests/journeys/test_agg_then_use_result.py`	New “aggregation result as input” journey regressions + explicit shadowing regression.
`.cursor/rules/chdb-ds.mdc`	Adds contributor guidance for multi-step journey regressions and property-based sweeps.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Two follow-ups from automated review: 1. `_build_execution_sql` (core.py): the helper backing `to_sql()` / `explain()`'s SQL preview used `next(seg for seg in segments if seg.is_sql())`, which would pick a LATER `SQL-on-Python(__df__)` segment when the first segment was Pandas (cost-aware planner for `ORDER BY` without `LIMIT`). That returned a `Python(__df__)` SQL whose `FROM` does not match the source the executor actually issues first. Constrained to the FIRST segment: if it is SQL, return that segment's SQL; otherwise fall back to `_generate_select_sql` (the `SELECT * FROM <source>` the executor runs to materialize rows for Pandas). 2. `_build_layers` (query_planner.py): `mask(...).groupby(...).agg(...)` (or `where(...).groupby(...).agg(...)`) kept the LazyMask/LazyWhere in the SAME layer as LazyGroupByAgg. The wrapper-layer CASE-WHEN temp-alias machinery (`__tmp_<col>__`) then collided with the aggregation's SELECT list - the outer SELECT renamed `__tmp_*__` columns that the inner GROUP-BY SELECT never emitted, producing `UNKNOWN_IDENTIFIER` (DataFrame source) or wrong values (file/table source). Now LazyGroupByAgg starts a new nested layer when the current layer already holds a LazyWhere/LazyMask, so the mask runs in the inner subquery and GROUP BY runs on the masked output: `SELECT ... agg() FROM (SELECT CASE WHEN ... AS x) GROUP BY x`. 3. `_render_wrapper_layer_sql` (sql_executor.py): the `add_row_order_fallback` `ORDER BY _row_id` clause is intentionally skipped when wrapping with `__case_subq__`. The `Python(__df__)` virtual `_row_id` is not propagated through the inner CASE-WHEN SELECT, so referencing it on the outer rename layer would fail with `UNKNOWN_IDENTIFIER`. CASE-WHEN is row-preserving, so source order naturally flows up through the wrapper. Regression tests: - `tests/test_groupby_sql_pushdown.py::TestDispatchByOpType ::test_mask_then_groupby_splits_into_nested_layers` and `test_where_then_groupby_splits_into_nested_layers` pin both the plan structure (2 layers, mask in inner, groupby in outer) and the end-to-end value equality vs pandas. - `tests/test_explain_method.py::TestBuildExecutionSQLFirstSegment` covers both branches: SQL-first chain returns source-bound SQL, Pandas-first chain returns the initial `SELECT * FROM <source>` (never `Python(__df__)`). Verified: 10477 passed, 0 failed (full suite, remote tests excluded).

Fixes three bugs uncovered by the user-journey + property-based test sweep introduced on top of PR #577's unified `_build_layer` dispatcher. 1. head-before-agg LIMIT ordering (`query_planner._build_layers`) `head(N).groupby().agg()` was folded into one SQL layer, so LIMIT ran after GROUP BY and limited the number of groups instead of the number of input rows. The planner now starts a new nested-subquery layer when `LazyGroupByAgg` follows LIMIT/OFFSET. 2. Post-agg WHERE folded back into pre-agg WHERE (`sql_executor`) `result[result['count'] > 0]` after a `groupby().agg(...)` had its condition appended to the pre-aggregation WHERE clause, which silently bound to the source column of the same name. The first- layer renderer now emits the post-agg conditions as a real HAVING clause (combined with any user-supplied `self.ds._having_condition`). 3. DataFrame source assign-before-groupby (`sql_executor._build_sql_for_dataframe`) `ds.assign(x=...).groupby(...).agg(...)` against a Python() DataFrame source emitted `SELECT *, expr AS x, agg(...) ... GROUP BY ...`, which ClickHouse rejects with NOT_AN_AGGREGATE. The single-layer DataFrame path now routes any plan containing `groupby_agg` / `where_ops` / multi-layer through the unified `_build_layered_sql` pipeline, and `_build_layer` forces `include_star=False` whenever the layer has a `groupby_agg` (the agg's emitted select_fields already cover the group keys + aggregations). Supporting changes in the unified pipeline: - `_build_layer`: append `col IS NOT NULL` filters for `groupby_cols` when `dropna=True` (pandas default) - matches the legacy inline DataFrame path that was previously the only place this lived. - `_render_wrapper_layer_sql`: accept `where_temp_alias_columns` so the wrapper layer renames `__tmp_col__` back to user-visible column names; postpone WHERE/ORDER BY/LIMIT/OFFSET to the outer rename layer so they bind to the masked values (LazyMask/LazyWhere semantic); suppress `ORDER BY _row_id` fallback when GROUP BY is present (would be NOT_AN_AGGREGATE). - `_collect_select_fields_from_sql_ops`: unchanged accumulating semantics; instead the planner splits a pure-projection SELECT into a new layer when an in-layer WHERE references a column the projection would drop (e.g. `assign(x=...)[where x>0][['id']]` as produced by ColumnExpr's per-column pruning path). Test coverage: - 3 previously-xfailed regression tests in `datastore/tests/journeys/` (`test_head_before_agg_respects_limit`, `test_agg_output_name_shadows_source_column`, `test_top_three_revenue_dataframe_source_assign_before_groupby`) flipped to permanent passing guards. - `test_property_based_chains.py` no longer skips the head-before-agg / sort-or-select-before-agg chains; remaining empty-result groupby cases (pre-existing pandas-vs-DataStore column-set divergence) are short-circuited explicitly. - New `datastore/tests/` files (journeys/, cartesian product, naming-conflict fuzz, SQL snapshot assertions, plan invariants, property-based chains) introduce ~150 additional regressions. - Updated `.cursor/rules/chdb-ds.mdc` with multi-step-mirror and property-based-sweep test guidelines. - `requirements-dev.txt`: add `hypothesis>=6.0.0`. Verified: - 10473 passed, 0 failed (full suite, remote tests excluded). - 1000-example hypothesis property sweep passes. - All 47 journey tests pass (including the 3 flipped xfails).

Two follow-ups from automated review: 1. `_build_execution_sql` (core.py): the helper backing `to_sql()` / `explain()`'s SQL preview used `next(seg for seg in segments if seg.is_sql())`, which would pick a LATER `SQL-on-Python(__df__)` segment when the first segment was Pandas (cost-aware planner for `ORDER BY` without `LIMIT`). That returned a `Python(__df__)` SQL whose `FROM` does not match the source the executor actually issues first. Constrained to the FIRST segment: if it is SQL, return that segment's SQL; otherwise fall back to `_generate_select_sql` (the `SELECT * FROM <source>` the executor runs to materialize rows for Pandas). 2. `_build_layers` (query_planner.py): `mask(...).groupby(...).agg(...)` (or `where(...).groupby(...).agg(...)`) kept the LazyMask/LazyWhere in the SAME layer as LazyGroupByAgg. The wrapper-layer CASE-WHEN temp-alias machinery (`__tmp_<col>__`) then collided with the aggregation's SELECT list - the outer SELECT renamed `__tmp_*__` columns that the inner GROUP-BY SELECT never emitted, producing `UNKNOWN_IDENTIFIER` (DataFrame source) or wrong values (file/table source). Now LazyGroupByAgg starts a new nested layer when the current layer already holds a LazyWhere/LazyMask, so the mask runs in the inner subquery and GROUP BY runs on the masked output: `SELECT ... agg() FROM (SELECT CASE WHEN ... AS x) GROUP BY x`. 3. `_render_wrapper_layer_sql` (sql_executor.py): the `add_row_order_fallback` `ORDER BY _row_id` clause is intentionally skipped when wrapping with `__case_subq__`. The `Python(__df__)` virtual `_row_id` is not propagated through the inner CASE-WHEN SELECT, so referencing it on the outer rename layer would fail with `UNKNOWN_IDENTIFIER`. CASE-WHEN is row-preserving, so source order naturally flows up through the wrapper. Regression tests: - `tests/test_groupby_sql_pushdown.py::TestDispatchByOpType ::test_mask_then_groupby_splits_into_nested_layers` and `test_where_then_groupby_splits_into_nested_layers` pin both the plan structure (2 layers, mask in inner, groupby in outer) and the end-to-end value equality vs pandas. - `tests/test_explain_method.py::TestBuildExecutionSQLFirstSegment` covers both branches: SQL-first chain returns source-bound SQL, Pandas-first chain returns the initial `SELECT * FROM <source>` (never `Python(__df__)`). Verified: 10477 passed, 0 failed (full suite, remote tests excluded).

PR #578 added `hypothesis>=6.0.0` to `requirements-dev.txt` so `datastore/tests/test_property_based_chains.py` can import `hypothesis`. But the GitHub Actions matrices hardcode their pip install command (`pytest pytest-cov [ray] $PANDAS_CONSTRAINT`) and never source `requirements-dev.txt`, so collection aborted with `ModuleNotFoundError: No module named 'hypothesis'` across all 4 platform jobs. Add `hypothesis` inline to each workflow's install step: - build_linux_x86_wheels.yml - build_linux_arm64_wheels-gh.yml - build_macos_arm64_wheels.yml - build_macos_x86_wheels.yml

The previous empty-result short-circuit in ``test_random_chain_matches_pandas`` checked ``isinstance(ds_result, pd.DataFrame)``, but ``apply_chain`` returns a lazy ``DataStore``, not a materialized ``pd.DataFrame`` - so the isinstance branch was always False and the short-circuit never fired. CI on commit 64f38b8 surfaced this when Hypothesis generated the chain ``filter(v>0) -> filter(v>99) -> sort(w desc) -> groupby(cat).agg({'v':'sum'})`` (empty intermediate after the second filter; column drift between pandas' agg-projected output and DataStore's source-bound short-circuit output). Two changes: - ``tests/test_property_based_chains.py``: replace the isinstance check with a duck-typed ``len(r) == 0`` helper that works on ``DataStore`` and ``pd.DataFrame`` alike. Reference the new journey test in the comment so future readers can find the tracked bug. - ``tests/journeys/test_property_chain_followups.py`` (new): verbatim regression for the falsifying chain, marked ``@unittest.expectedFailure``. Follows the ``.cursor/rules/chdb-ds.mdc`` rule #6 contract: every Hypothesis falsifier becomes a permanent journey test; when the underlying empty-intermediate column-projection bug in the executor is fixed, flip the xfail and drop the property-test skip in the same commit.

…survives The debug log truncated executed SQL at 200 chars with "...". PR #578 added the ``col IS NOT NULL`` dropna filter to GROUP BY queries on SQL sources too (previously only DataFrame source). For remote sources this pushed the SQL from ~200 to ~220 chars, truncating ``GROUP BY`` to ``GROUP B...`` in the log output. ``tests/test_remote_lazy_chain_pushdown.py::test_assign_filter_groupby_agg`` parses the debug log and asserts ``'GROUP BY' in sql.upper()``, so the truncation made the assertion fail across all 4 CI platforms. The actual emitted SQL is correct; only the log line was truncated. Bump the limit to 2000 chars (10x) so typical multi-clause queries fit in full and the log-parsing tests work again. The full SQL is still bounded so very large queries don't spam logs.

auxten requested a review from Copilot May 21, 2026 05:28

auxten changed the title ~~fix(datastore): resolve 3 SQL builder bugs found by user-path tests~~ Resolve 3 SQL builder bugs found by user-path tests May 21, 2026

Copilot started reviewing on behalf of auxten May 21, 2026 05:29 View session

Copilot AI reviewed May 21, 2026

View reviewed changes

Comment thread datastore/sql_executor.py

auxten changed the base branch from refactor/unify-sql-layer-builder to main May 21, 2026 05:53

auxten added 2 commits May 21, 2026 13:53

auxten force-pushed the fix/sql-builder-bugs-and-user-path-tests branch from 889453f to 0ed1ed5 Compare May 21, 2026 05:53

auxten added 3 commits May 21, 2026 16:10

auxten merged commit 79b4bde into main May 21, 2026
6 checks passed

This was referenced May 22, 2026

DataStore CI: ModuleNotFoundError: 'hypothesis' on chdb v4.1.8 (Linux x86_64 + macOS arm64, since 63ad199c) chdb-io/chdb-core#60

Closed

ci: install hypothesis for DataStore tests chdb-io/chdb-core#61

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Resolve 3 SQL builder bugs found by user-path tests#578

Resolve 3 SQL builder bugs found by user-path tests#578
auxten merged 5 commits into
mainfrom
fix/sql-builder-bugs-and-user-path-tests

auxten commented May 21, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

auxten commented May 21, 2026

Summary

Bugs fixed

Supporting fixes in the unified pipeline

Test coverage added

Test plan

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants