Resolve 3 SQL builder bugs found by user-path tests#578
Merged
Conversation
There was a problem hiding this comment.
Pull request overview
This PR fixes multiple SQL-planning / SQL-building edge cases uncovered by new user-journey and property-based tests, and adds a broad regression suite to prevent silent SQL-shape regressions (especially around LIMIT/GROUP BY layering, post-aggregation filtering, and alias collisions).
Changes:
- Update
QueryPlanner._build_layersto introduce new subquery layers forLIMIT/OFFSET -> groupby_aggand for projection/WHERE dependency cases. - Update the SQL builder to (a) emit post-aggregation predicates as
HAVING, (b) applydropna=Truesemantics by addingIS NOT NULLfilters on group keys, and (c) route more DataFrame-source plans through the unified layered pipeline. - Add extensive regression coverage: journey-style mirrors, structural SQL snapshot/invariant tests, cartesian follow-up tests, and Hypothesis fuzz/property-based chains; add Hypothesis as a dev dependency.
Reviewed changes
Copilot reviewed 14 out of 14 changed files in this pull request and generated 1 comment.
Show a summary per file
| File | Description |
|---|---|
requirements-dev.txt |
Adds hypothesis to dev dependencies for property-based testing. |
datastore/query_planner.py |
Adds new layer-splitting rules to preserve pandas execution-order semantics. |
datastore/sql_executor.py |
Adjusts layered SQL building (HAVING, dropna filters, wrapper-layer rendering, DataFrame routing). |
datastore/tests/test_groupby_sql_pushdown.py |
Relaxes a SQL substring assertion to tolerate dropna-driven IS NOT NULL filters. |
datastore/tests/test_sql_snapshot_assertions.py |
New structural SQL substring assertions (“shape” snapshots). |
datastore/tests/test_plan_invariants.py |
New plan/SQL structural invariants (layer/subquery counts, alias invariants, etc.). |
datastore/tests/test_property_based_chains.py |
New Hypothesis-generated op-chain property tests against pandas. |
datastore/tests/test_naming_conflict_fuzz.py |
New systematic alias/name-collision regression coverage. |
datastore/tests/test_cartesian_result_followups.py |
New cartesian “result-as-input” follow-up matrix against pandas. |
datastore/tests/journeys/__init__.py |
Documents conventions for long-chain user-journey regressions. |
datastore/tests/journeys/test_mark_slack_amazon_reviews.py |
New verbatim Slack-reported pipeline regression suite (parquet + DataFrame sources). |
datastore/tests/journeys/test_kaggle_style_exploration.py |
New end-to-end “Kaggle-style” EDA chain regressions. |
datastore/tests/journeys/test_agg_then_use_result.py |
New “aggregation result as input” journey regressions + explicit shadowing regression. |
.cursor/rules/chdb-ds.mdc |
Adds contributor guidance for multi-step journey regressions and property-based sweeps. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
auxten
added a commit
that referenced
this pull request
May 21, 2026
Two follow-ups from automated review: 1. `_build_execution_sql` (core.py): the helper backing `to_sql()` / `explain()`'s SQL preview used `next(seg for seg in segments if seg.is_sql())`, which would pick a LATER `SQL-on-Python(__df__)` segment when the first segment was Pandas (cost-aware planner for `ORDER BY` without `LIMIT`). That returned a `Python(__df__)` SQL whose `FROM` does not match the source the executor actually issues first. Constrained to the FIRST segment: if it is SQL, return that segment's SQL; otherwise fall back to `_generate_select_sql` (the `SELECT * FROM <source>` the executor runs to materialize rows for Pandas). 2. `_build_layers` (query_planner.py): `mask(...).groupby(...).agg(...)` (or `where(...).groupby(...).agg(...)`) kept the LazyMask/LazyWhere in the SAME layer as LazyGroupByAgg. The wrapper-layer CASE-WHEN temp-alias machinery (`__tmp_<col>__`) then collided with the aggregation's SELECT list - the outer SELECT renamed `__tmp_*__` columns that the inner GROUP-BY SELECT never emitted, producing `UNKNOWN_IDENTIFIER` (DataFrame source) or wrong values (file/table source). Now LazyGroupByAgg starts a new nested layer when the current layer already holds a LazyWhere/LazyMask, so the mask runs in the inner subquery and GROUP BY runs on the masked output: `SELECT ... agg() FROM (SELECT CASE WHEN ... AS x) GROUP BY x`. 3. `_render_wrapper_layer_sql` (sql_executor.py): the `add_row_order_fallback` `ORDER BY _row_id` clause is intentionally skipped when wrapping with `__case_subq__`. The `Python(__df__)` virtual `_row_id` is not propagated through the inner CASE-WHEN SELECT, so referencing it on the outer rename layer would fail with `UNKNOWN_IDENTIFIER`. CASE-WHEN is row-preserving, so source order naturally flows up through the wrapper. Regression tests: - `tests/test_groupby_sql_pushdown.py::TestDispatchByOpType ::test_mask_then_groupby_splits_into_nested_layers` and `test_where_then_groupby_splits_into_nested_layers` pin both the plan structure (2 layers, mask in inner, groupby in outer) and the end-to-end value equality vs pandas. - `tests/test_explain_method.py::TestBuildExecutionSQLFirstSegment` covers both branches: SQL-first chain returns source-bound SQL, Pandas-first chain returns the initial `SELECT * FROM <source>` (never `Python(__df__)`). Verified: 10477 passed, 0 failed (full suite, remote tests excluded).
Fixes three bugs uncovered by the user-journey + property-based test sweep introduced on top of PR #577's unified `_build_layer` dispatcher. 1. head-before-agg LIMIT ordering (`query_planner._build_layers`) `head(N).groupby().agg()` was folded into one SQL layer, so LIMIT ran after GROUP BY and limited the number of groups instead of the number of input rows. The planner now starts a new nested-subquery layer when `LazyGroupByAgg` follows LIMIT/OFFSET. 2. Post-agg WHERE folded back into pre-agg WHERE (`sql_executor`) `result[result['count'] > 0]` after a `groupby().agg(...)` had its condition appended to the pre-aggregation WHERE clause, which silently bound to the source column of the same name. The first- layer renderer now emits the post-agg conditions as a real HAVING clause (combined with any user-supplied `self.ds._having_condition`). 3. DataFrame source assign-before-groupby (`sql_executor._build_sql_for_dataframe`) `ds.assign(x=...).groupby(...).agg(...)` against a Python() DataFrame source emitted `SELECT *, expr AS x, agg(...) ... GROUP BY ...`, which ClickHouse rejects with NOT_AN_AGGREGATE. The single-layer DataFrame path now routes any plan containing `groupby_agg` / `where_ops` / multi-layer through the unified `_build_layered_sql` pipeline, and `_build_layer` forces `include_star=False` whenever the layer has a `groupby_agg` (the agg's emitted select_fields already cover the group keys + aggregations). Supporting changes in the unified pipeline: - `_build_layer`: append `col IS NOT NULL` filters for `groupby_cols` when `dropna=True` (pandas default) - matches the legacy inline DataFrame path that was previously the only place this lived. - `_render_wrapper_layer_sql`: accept `where_temp_alias_columns` so the wrapper layer renames `__tmp_col__` back to user-visible column names; postpone WHERE/ORDER BY/LIMIT/OFFSET to the outer rename layer so they bind to the masked values (LazyMask/LazyWhere semantic); suppress `ORDER BY _row_id` fallback when GROUP BY is present (would be NOT_AN_AGGREGATE). - `_collect_select_fields_from_sql_ops`: unchanged accumulating semantics; instead the planner splits a pure-projection SELECT into a new layer when an in-layer WHERE references a column the projection would drop (e.g. `assign(x=...)[where x>0][['id']]` as produced by ColumnExpr's per-column pruning path). Test coverage: - 3 previously-xfailed regression tests in `datastore/tests/journeys/` (`test_head_before_agg_respects_limit`, `test_agg_output_name_shadows_source_column`, `test_top_three_revenue_dataframe_source_assign_before_groupby`) flipped to permanent passing guards. - `test_property_based_chains.py` no longer skips the head-before-agg / sort-or-select-before-agg chains; remaining empty-result groupby cases (pre-existing pandas-vs-DataStore column-set divergence) are short-circuited explicitly. - New `datastore/tests/` files (journeys/, cartesian product, naming-conflict fuzz, SQL snapshot assertions, plan invariants, property-based chains) introduce ~150 additional regressions. - Updated `.cursor/rules/chdb-ds.mdc` with multi-step-mirror and property-based-sweep test guidelines. - `requirements-dev.txt`: add `hypothesis>=6.0.0`. Verified: - 10473 passed, 0 failed (full suite, remote tests excluded). - 1000-example hypothesis property sweep passes. - All 47 journey tests pass (including the 3 flipped xfails).
Two follow-ups from automated review: 1. `_build_execution_sql` (core.py): the helper backing `to_sql()` / `explain()`'s SQL preview used `next(seg for seg in segments if seg.is_sql())`, which would pick a LATER `SQL-on-Python(__df__)` segment when the first segment was Pandas (cost-aware planner for `ORDER BY` without `LIMIT`). That returned a `Python(__df__)` SQL whose `FROM` does not match the source the executor actually issues first. Constrained to the FIRST segment: if it is SQL, return that segment's SQL; otherwise fall back to `_generate_select_sql` (the `SELECT * FROM <source>` the executor runs to materialize rows for Pandas). 2. `_build_layers` (query_planner.py): `mask(...).groupby(...).agg(...)` (or `where(...).groupby(...).agg(...)`) kept the LazyMask/LazyWhere in the SAME layer as LazyGroupByAgg. The wrapper-layer CASE-WHEN temp-alias machinery (`__tmp_<col>__`) then collided with the aggregation's SELECT list - the outer SELECT renamed `__tmp_*__` columns that the inner GROUP-BY SELECT never emitted, producing `UNKNOWN_IDENTIFIER` (DataFrame source) or wrong values (file/table source). Now LazyGroupByAgg starts a new nested layer when the current layer already holds a LazyWhere/LazyMask, so the mask runs in the inner subquery and GROUP BY runs on the masked output: `SELECT ... agg() FROM (SELECT CASE WHEN ... AS x) GROUP BY x`. 3. `_render_wrapper_layer_sql` (sql_executor.py): the `add_row_order_fallback` `ORDER BY _row_id` clause is intentionally skipped when wrapping with `__case_subq__`. The `Python(__df__)` virtual `_row_id` is not propagated through the inner CASE-WHEN SELECT, so referencing it on the outer rename layer would fail with `UNKNOWN_IDENTIFIER`. CASE-WHEN is row-preserving, so source order naturally flows up through the wrapper. Regression tests: - `tests/test_groupby_sql_pushdown.py::TestDispatchByOpType ::test_mask_then_groupby_splits_into_nested_layers` and `test_where_then_groupby_splits_into_nested_layers` pin both the plan structure (2 layers, mask in inner, groupby in outer) and the end-to-end value equality vs pandas. - `tests/test_explain_method.py::TestBuildExecutionSQLFirstSegment` covers both branches: SQL-first chain returns source-bound SQL, Pandas-first chain returns the initial `SELECT * FROM <source>` (never `Python(__df__)`). Verified: 10477 passed, 0 failed (full suite, remote tests excluded).
889453f to
0ed1ed5
Compare
PR #578 added `hypothesis>=6.0.0` to `requirements-dev.txt` so `datastore/tests/test_property_based_chains.py` can import `hypothesis`. But the GitHub Actions matrices hardcode their pip install command (`pytest pytest-cov [ray] $PANDAS_CONSTRAINT`) and never source `requirements-dev.txt`, so collection aborted with `ModuleNotFoundError: No module named 'hypothesis'` across all 4 platform jobs. Add `hypothesis` inline to each workflow's install step: - build_linux_x86_wheels.yml - build_linux_arm64_wheels-gh.yml - build_macos_arm64_wheels.yml - build_macos_x86_wheels.yml
The previous empty-result short-circuit in ``test_random_chain_matches_pandas`` checked ``isinstance(ds_result, pd.DataFrame)``, but ``apply_chain`` returns a lazy ``DataStore``, not a materialized ``pd.DataFrame`` - so the isinstance branch was always False and the short-circuit never fired. CI on commit 64f38b8 surfaced this when Hypothesis generated the chain ``filter(v>0) -> filter(v>99) -> sort(w desc) -> groupby(cat).agg({'v':'sum'})`` (empty intermediate after the second filter; column drift between pandas' agg-projected output and DataStore's source-bound short-circuit output). Two changes: - ``tests/test_property_based_chains.py``: replace the isinstance check with a duck-typed ``len(r) == 0`` helper that works on ``DataStore`` and ``pd.DataFrame`` alike. Reference the new journey test in the comment so future readers can find the tracked bug. - ``tests/journeys/test_property_chain_followups.py`` (new): verbatim regression for the falsifying chain, marked ``@unittest.expectedFailure``. Follows the ``.cursor/rules/chdb-ds.mdc`` rule #6 contract: every Hypothesis falsifier becomes a permanent journey test; when the underlying empty-intermediate column-projection bug in the executor is fixed, flip the xfail and drop the property-test skip in the same commit.
…survives The debug log truncated executed SQL at 200 chars with "...". PR #578 added the ``col IS NOT NULL`` dropna filter to GROUP BY queries on SQL sources too (previously only DataFrame source). For remote sources this pushed the SQL from ~200 to ~220 chars, truncating ``GROUP BY`` to ``GROUP B...`` in the log output. ``tests/test_remote_lazy_chain_pushdown.py::test_assign_filter_groupby_agg`` parses the debug log and asserts ``'GROUP BY' in sql.upper()``, so the truncation made the assertion fail across all 4 CI platforms. The actual emitted SQL is correct; only the log line was truncated. Bump the limit to 2000 chars (10x) so typical multi-clause queries fit in full and the log-parsing tests work again. The full SQL is still bounded so very large queries don't spam logs.
This was referenced May 22, 2026
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Stacked on top of #577. Fixes three SQL builder bugs the new user-path /
property-based test sweep uncovered, and flips the three
xfailedregression slots in
datastore/tests/journeys/into permanent passingguards.
Bugs fixed
head(N).groupby().agg()returns "first N groups" instead of"agg over first N rows" (
query_planner._build_layers).LIMIT was folded into the same SQL layer as GROUP BY. The planner
now starts a new nested-subquery layer when
LazyGroupByAggfollows LIMIT/OFFSET.
result[result['count'] > 0]aftergroupby().agg(...)silentlybinds the WHERE to the source column of the same name
(
sql_executor._render_first_layer_sql).The post-aggregation conditions were being appended to the
pre-agg WHERE list "as a defensive fallback". They are now emitted
as a real
HAVINGclause (composed with any existingself.ds._having_condition).assign(x=...).groupby().agg(...)on a Python() DataFrame sourceemits
SELECT *, expr AS x, agg(...) ... GROUP BY ...whichClickHouse rejects with
NOT_AN_AGGREGATE(
sql_executor._build_sql_for_dataframe,_build_layer).The single-layer DataFrame inline path was a parallel implementation
that lacked the WHERE-by-position split, alias rewriting and
include_star=FalseGROUP BY guard. Any plan containinggroupby_agg/where_ops/ multi-layer is now routed through theunified
_build_layered_sql._build_layerforcesinclude_star=Falsewhenever the layer has agroupby_agg.Supporting fixes in the unified pipeline
_build_layer: appendcol IS NOT NULLfor thegroupby_colswhendropna=True(pandas default), so DataFrame source GROUP BY nolonger leaks NULL groups - matches the legacy inline path.
_render_wrapper_layer_sql: newwhere_temp_alias_columnsparameterso the wrapper layer renames
__tmp_col__back to user-visiblenames; postpone WHERE/ORDER BY/LIMIT/OFFSET to the outer rename
layer so they bind to the masked values (LazyMask/LazyWhere
semantic); suppress
ORDER BY _row_idfallback when GROUP BY ispresent.
in-layer WHERE references a column the projection would drop (e.g.
assign(x=...)[where x>0][['id']], as generated by ColumnExpr'sper-column pruning).
Test coverage added
datastore/tests/journeys/test_mark_slack_amazon_reviews.pyVerbatim mirror of the original Slack-reported pipeline (15 cases
covering all 4 filter idioms x parquet/DataFrame sources, plus
follow-on operations).
datastore/tests/journeys/test_agg_then_use_result.pyCartesian
5 aggregations x 5 follow-ons= 25 result-as-inputuser paths.
datastore/tests/journeys/test_kaggle_style_exploration.py4 end-to-end Kaggle-style EDA chains (top products by region,
customer order histogram, top active users, multi-stage revenue
filter).
datastore/tests/test_cartesian_result_followups.pyMechanically generated 8 builders x 6 follow-ups = 48 cases.
datastore/tests/test_naming_conflict_fuzz.py17 alias-collision scenarios.
datastore/tests/test_sql_snapshot_assertions.py17 structural SQL assertions (presence/absence of GROUP BY, WHERE,
LIMIT, ORDER BY,
__subqaliases).datastore/tests/test_plan_invariants.py44 plan-shape invariants over the layered pipeline.
datastore/tests/test_property_based_chains.pyHypothesis-driven random op-chain fuzzer (filter / sort / head /
select / agg). Now runs without skip filters for the two bugs that
used to be tracked separately. Empty-result groupby cases (a
pre-existing pandas-vs-DataStore column-set divergence) are
explicitly short-circuited.
.cursor/rules/chdb-ds.mdc: added "Multi-Step Mirror for Bug-FixTests" + "Property-Based Sweep for Compositional Bugs" sections.
requirements-dev.txt:hypothesis>=6.0.0.Test plan
pytest datastore/tests/(excluding remote): 10473 passed,0 failed, 17 skipped, 87 xfailed, 2 xpassed.
max_examples=1000: passes.test_head_before_agg_respects_limittest_agg_output_name_shadows_source_columntest_top_three_revenue_dataframe_source_assign_before_groupby