Unify SQL builder behind a single _build_layer dispatcher#577
Merged
Conversation
The previous nested-subquery SQL builder kept LazyGroupByAgg / LazyWhere / LazyMask in side-channel fields (plan.groupby_agg, plan.where_ops) and relied on each call site to re-inject them into the right subquery layer. When a GroupByAgg landed in a non-innermost layer the wrapper builder silently dropped it, producing queries like "SELECT * FROM (...) WHERE count > 5000" against a base table that has no count column. The DataFrame source had its own parallel _build_nested_sql_for_dataframe path with the same latent bug. Cross- layer alias rewriting (source-col vs agg-alias conflicts that span subquery boundaries) was patched in with snapshot/diff bookkeeping around a shared mutable plan.alias_renames dict. Replace the whole tangle with a single per-layer SQL builder: - _build_layer(layer_ops, from_source, is_first_layer, inherited_renames, previously_emitted_renames, ...) -> LayerBuildResult dispatches on op type and returns the SQL string plus the temp aliases it emitted. It is the only per-layer SQL function; the source and DataFrame paths chain it through _build_layered_sql. - _render_first_layer_sql / _render_wrapper_layer_sql cover the two rendering strategies (table-bound vs subquery / __df__-bound). - build_groupby_select_fields is now pure: it takes inherited_alias_renames read-only and returns (groupby_fields, select_fields, new_renames) so callers explicitly merge the renames back into their state. - assemble_sql takes an optional from_source; wrapper layers and the DataFrame execution path use it to override the default self.ds._table_function FROM. - Cross-layer renames travel through two explicit dicts (inherited_renames for alias-collision detection, previously_emitted_renames for WHERE/ORDER BY rewriting), no more shared mutable state. - _build_sql_for_dataframe routes multi-layer plans through the same _build_layered_sql loop with first_layer_from_source="__df__", which also fixes the GroupByAgg-in-nested-layer bug on Python() DataFrame sources. Functions removed (dead after the unification): - plan() and _find_sql_boundary in query_planner.py - build_simple_sql, build_nested_sql, _wrap_with_layer, _build_layer_sql, _build_wrapper_layer_sql, _build_simple_query_from_plan, _build_nested_query_from_plan, _build_nested_sql_for_dataframe, _assemble_simple_sql in sql_executor.py. _build_execution_sql in core.py now uses plan_segments and picks the first SQL segment's plan. QueryPlan docstring rewritten to describe the new data flow and the role of alias_renames (written during SQL building, consumed by post-execution rename-back). Tests: - TestPostAggregationFilterAfterLimit covers the original layer-N GroupByAgg bug reported on Slack (mask / loc / query / filter all match pandas). - TestGroupByInNonZeroLayer adds end-to-end and structural assertions for the wrapper-layer GroupByAgg case, including the new test_dataframe_source_groupby_agg_in_layer_n_matches_pandas which exercises the same fix on a Python() DataFrame source. - TestCrossLayerAliasRewriting covers the source-col / agg-alias conflict that crosses subquery boundaries. - TestDispatchByOpType drives _build_layer directly to lock in the dispatch behaviour for GroupByAgg vs plain relational layers. Full datastore suite: 9274 passed, 7 skipped, 88 xfailed, 1 xpassed (matching the baseline plus the 12 new regression tests).
7d6c15b to
69bac96
Compare
Member
Author
|
@copilot review |
4 tasks
There was a problem hiding this comment.
Pull request overview
This PR refactors SQL compilation to a single unified per-layer dispatcher (SQLExecutionEngine._build_layer) and a single layer-chaining pipeline (_build_layered_sql) shared by both file/table sources and the Python() DataFrame source. The goal is to prevent “side-channel” ops (notably LazyGroupByAgg, LazyWhere, LazyMask) from being dropped when they appear in non-innermost layers, and to make cross-layer alias rewriting explicit and reliable.
Changes:
- Replace multiple overlapping SQL builder paths with a single
_build_layer+_build_layered_sqlpipeline and explicit cross-layer rename propagation. - Introduce expression/order-by column ref rewriting helpers to support cross-layer temp-alias propagation.
- Add extensive regression/unit tests covering layer-N groupby/agg, post-aggregation filtering after head/sort, and cross-layer alias rewriting; update call sites to use segmented planning (
plan_segments).
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| datastore/sql_executor.py | Introduces unified per-layer SQL builder, cross-layer rewrite helpers, and refactors groupby select-field construction to be pure + return emitted rename mappings. |
| datastore/query_planner.py | Updates QueryPlan documentation/semantics for segmented planning and ensures layers retain “special ops” so SQL building can dispatch by op type. |
| datastore/core.py | Switches _build_execution_sql to segmented planning and selects a SQL segment plan for SQL generation. |
| datastore/tests/test_groupby_sql_pushdown.py | Adds regression/unit tests for the layer dispatcher, post-aggregation filtering, and cross-layer alias rewriting; updates tests to use plan_segments. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
auxten
added a commit
that referenced
this pull request
May 21, 2026
Two follow-ups from automated review: 1. `_build_execution_sql` (core.py): the helper backing `to_sql()` / `explain()`'s SQL preview used `next(seg for seg in segments if seg.is_sql())`, which would pick a LATER `SQL-on-Python(__df__)` segment when the first segment was Pandas (cost-aware planner for `ORDER BY` without `LIMIT`). That returned a `Python(__df__)` SQL whose `FROM` does not match the source the executor actually issues first. Constrained to the FIRST segment: if it is SQL, return that segment's SQL; otherwise fall back to `_generate_select_sql` (the `SELECT * FROM <source>` the executor runs to materialize rows for Pandas). 2. `_build_layers` (query_planner.py): `mask(...).groupby(...).agg(...)` (or `where(...).groupby(...).agg(...)`) kept the LazyMask/LazyWhere in the SAME layer as LazyGroupByAgg. The wrapper-layer CASE-WHEN temp-alias machinery (`__tmp_<col>__`) then collided with the aggregation's SELECT list - the outer SELECT renamed `__tmp_*__` columns that the inner GROUP-BY SELECT never emitted, producing `UNKNOWN_IDENTIFIER` (DataFrame source) or wrong values (file/table source). Now LazyGroupByAgg starts a new nested layer when the current layer already holds a LazyWhere/LazyMask, so the mask runs in the inner subquery and GROUP BY runs on the masked output: `SELECT ... agg() FROM (SELECT CASE WHEN ... AS x) GROUP BY x`. 3. `_render_wrapper_layer_sql` (sql_executor.py): the `add_row_order_fallback` `ORDER BY _row_id` clause is intentionally skipped when wrapping with `__case_subq__`. The `Python(__df__)` virtual `_row_id` is not propagated through the inner CASE-WHEN SELECT, so referencing it on the outer rename layer would fail with `UNKNOWN_IDENTIFIER`. CASE-WHEN is row-preserving, so source order naturally flows up through the wrapper. Regression tests: - `tests/test_groupby_sql_pushdown.py::TestDispatchByOpType ::test_mask_then_groupby_splits_into_nested_layers` and `test_where_then_groupby_splits_into_nested_layers` pin both the plan structure (2 layers, mask in inner, groupby in outer) and the end-to-end value equality vs pandas. - `tests/test_explain_method.py::TestBuildExecutionSQLFirstSegment` covers both branches: SQL-first chain returns source-bound SQL, Pandas-first chain returns the initial `SELECT * FROM <source>` (never `Python(__df__)`). Verified: 10477 passed, 0 failed (full suite, remote tests excluded).
auxten
added a commit
that referenced
this pull request
May 21, 2026
Fixes three bugs uncovered by the user-journey + property-based test sweep introduced on top of PR #577's unified `_build_layer` dispatcher. 1. head-before-agg LIMIT ordering (`query_planner._build_layers`) `head(N).groupby().agg()` was folded into one SQL layer, so LIMIT ran after GROUP BY and limited the number of groups instead of the number of input rows. The planner now starts a new nested-subquery layer when `LazyGroupByAgg` follows LIMIT/OFFSET. 2. Post-agg WHERE folded back into pre-agg WHERE (`sql_executor`) `result[result['count'] > 0]` after a `groupby().agg(...)` had its condition appended to the pre-aggregation WHERE clause, which silently bound to the source column of the same name. The first- layer renderer now emits the post-agg conditions as a real HAVING clause (combined with any user-supplied `self.ds._having_condition`). 3. DataFrame source assign-before-groupby (`sql_executor._build_sql_for_dataframe`) `ds.assign(x=...).groupby(...).agg(...)` against a Python() DataFrame source emitted `SELECT *, expr AS x, agg(...) ... GROUP BY ...`, which ClickHouse rejects with NOT_AN_AGGREGATE. The single-layer DataFrame path now routes any plan containing `groupby_agg` / `where_ops` / multi-layer through the unified `_build_layered_sql` pipeline, and `_build_layer` forces `include_star=False` whenever the layer has a `groupby_agg` (the agg's emitted select_fields already cover the group keys + aggregations). Supporting changes in the unified pipeline: - `_build_layer`: append `col IS NOT NULL` filters for `groupby_cols` when `dropna=True` (pandas default) - matches the legacy inline DataFrame path that was previously the only place this lived. - `_render_wrapper_layer_sql`: accept `where_temp_alias_columns` so the wrapper layer renames `__tmp_col__` back to user-visible column names; postpone WHERE/ORDER BY/LIMIT/OFFSET to the outer rename layer so they bind to the masked values (LazyMask/LazyWhere semantic); suppress `ORDER BY _row_id` fallback when GROUP BY is present (would be NOT_AN_AGGREGATE). - `_collect_select_fields_from_sql_ops`: unchanged accumulating semantics; instead the planner splits a pure-projection SELECT into a new layer when an in-layer WHERE references a column the projection would drop (e.g. `assign(x=...)[where x>0][['id']]` as produced by ColumnExpr's per-column pruning path). Test coverage: - 3 previously-xfailed regression tests in `datastore/tests/journeys/` (`test_head_before_agg_respects_limit`, `test_agg_output_name_shadows_source_column`, `test_top_three_revenue_dataframe_source_assign_before_groupby`) flipped to permanent passing guards. - `test_property_based_chains.py` no longer skips the head-before-agg / sort-or-select-before-agg chains; remaining empty-result groupby cases (pre-existing pandas-vs-DataStore column-set divergence) are short-circuited explicitly. - New `datastore/tests/` files (journeys/, cartesian product, naming-conflict fuzz, SQL snapshot assertions, plan invariants, property-based chains) introduce ~150 additional regressions. - Updated `.cursor/rules/chdb-ds.mdc` with multi-step-mirror and property-based-sweep test guidelines. - `requirements-dev.txt`: add `hypothesis>=6.0.0`. Verified: - 10473 passed, 0 failed (full suite, remote tests excluded). - 1000-example hypothesis property sweep passes. - All 47 journey tests pass (including the 3 flipped xfails).
auxten
added a commit
that referenced
this pull request
May 21, 2026
Two follow-ups from automated review: 1. `_build_execution_sql` (core.py): the helper backing `to_sql()` / `explain()`'s SQL preview used `next(seg for seg in segments if seg.is_sql())`, which would pick a LATER `SQL-on-Python(__df__)` segment when the first segment was Pandas (cost-aware planner for `ORDER BY` without `LIMIT`). That returned a `Python(__df__)` SQL whose `FROM` does not match the source the executor actually issues first. Constrained to the FIRST segment: if it is SQL, return that segment's SQL; otherwise fall back to `_generate_select_sql` (the `SELECT * FROM <source>` the executor runs to materialize rows for Pandas). 2. `_build_layers` (query_planner.py): `mask(...).groupby(...).agg(...)` (or `where(...).groupby(...).agg(...)`) kept the LazyMask/LazyWhere in the SAME layer as LazyGroupByAgg. The wrapper-layer CASE-WHEN temp-alias machinery (`__tmp_<col>__`) then collided with the aggregation's SELECT list - the outer SELECT renamed `__tmp_*__` columns that the inner GROUP-BY SELECT never emitted, producing `UNKNOWN_IDENTIFIER` (DataFrame source) or wrong values (file/table source). Now LazyGroupByAgg starts a new nested layer when the current layer already holds a LazyWhere/LazyMask, so the mask runs in the inner subquery and GROUP BY runs on the masked output: `SELECT ... agg() FROM (SELECT CASE WHEN ... AS x) GROUP BY x`. 3. `_render_wrapper_layer_sql` (sql_executor.py): the `add_row_order_fallback` `ORDER BY _row_id` clause is intentionally skipped when wrapping with `__case_subq__`. The `Python(__df__)` virtual `_row_id` is not propagated through the inner CASE-WHEN SELECT, so referencing it on the outer rename layer would fail with `UNKNOWN_IDENTIFIER`. CASE-WHEN is row-preserving, so source order naturally flows up through the wrapper. Regression tests: - `tests/test_groupby_sql_pushdown.py::TestDispatchByOpType ::test_mask_then_groupby_splits_into_nested_layers` and `test_where_then_groupby_splits_into_nested_layers` pin both the plan structure (2 layers, mask in inner, groupby in outer) and the end-to-end value equality vs pandas. - `tests/test_explain_method.py::TestBuildExecutionSQLFirstSegment` covers both branches: SQL-first chain returns source-bound SQL, Pandas-first chain returns the initial `SELECT * FROM <source>` (never `Python(__df__)`). Verified: 10477 passed, 0 failed (full suite, remote tests excluded).
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Replaces the SQL builder's tangle of partially-overlapping helpers (
_build_layer_sql/_build_wrapper_layer_sql/_build_simple_query_from_plan/_build_nested_query_from_plan/_build_nested_sql_for_dataframe/_assemble_simple_sql/build_simple_sql/build_nested_sql/_wrap_with_layer) with a single per-layer SQL builder.The previous architecture kept
LazyGroupByAgg/LazyWhere/LazyMaskin side-channel fields (plan.groupby_agg,plan.where_ops) and relied on each call site to re-inject them into the right subquery layer. When aGroupByAgglanded in a non-innermost layer the wrapper builder silently dropped it, producing queries likeSELECT * FROM (...) WHERE count > 5000against a base table that has nocountcolumn (the bug Mark reported on Slack). The DataFrame source had its own parallel_build_nested_sql_for_dataframepath with the same latent bug. Cross-layer alias rewriting (source-col vs agg-alias conflicts that span subquery boundaries) was patched in with snapshot/diff bookkeeping around a shared mutableplan.alias_renamesdict.Architecture
_build_layer(layer_ops, from_source, is_first_layer, inherited_renames, previously_emitted_renames, ...)is the only per-layer SQL function. Dispatches on op type, returnsLayerBuildResult(sql, renames_introduced). Used by both the source path and the DataFrame path._build_layered_sqlchains_build_layerover every layer in a plan; layer 0 reads the table source (or__df__for the DataFrame path), layers 1+ wrap(inner_sql) AS __subqN__._render_first_layer_sql/_render_wrapper_layer_sqlare the two rendering strategies (table-bound first layer with JOINs / row-order subquery / CASE WHEN subquery vs simple wrap over an explicitfrom_source).build_groupby_select_fieldsis now pure: takesinherited_alias_renamesread-only and returns(groupby_fields, select_fields, new_renames). No more mutation of shared state.assemble_sqltakes an optionalfrom_source. When set, JOINs are skipped (wrapper layers don't carry source-level joins) and the FROM is the supplied SQL fragment.inherited_renamesfor alias-collision detection insidebuild_groupby_select_fields, andpreviously_emitted_renamesfor WHERE/ORDER BY Field rewriting. No more snapshot/diff bookkeeping.Functions removed (dead after the unification)
query_planner.py:plan(),_find_sql_boundary()sql_executor.py:build_simple_sql,build_nested_sql,_wrap_with_layer,_build_layer_sql,_build_wrapper_layer_sql,_build_simple_query_from_plan,_build_nested_query_from_plan,_build_nested_sql_for_dataframe,_assemble_simple_sql_build_execution_sqlincore.pynow usesplan_segmentsand picks the first SQL segment's plan, mirroring_execute.Bugs fixed along the way
GroupByAggon file source:df[df['verified_purchase']==True].groupby('product_category')['star_rating'].agg(['mean','count']).sort_values('count', ascending=False).head(10)[result['count'] > 5000]now returns the aggregated result instead of crashing withUNKNOWN_IDENTIFIER: count.GroupByAggon DataFrame source: same fix, now applied to the Python() execution path through the unified dispatcher.SUM(int_col) AS __agg_int_col__due to a source-col/agg-alias conflict, the outerWHERE int_col > 50000is now rewritten to use the temp alias.Tests
TestPostAggregationFilterAfterLimit— covers Mark's original report with all four filter styles (mask/loc/query/filter) matching pandas.TestGroupByInNonZeroLayer— end-to-end and structural assertions for the wrapper-layerGroupByAggcase, plustest_dataframe_source_groupby_agg_in_layer_n_matches_pandasexercising the same fix on a Python() DataFrame source.TestCrossLayerAliasRewriting— source-col / agg-alias conflict spanning subquery boundaries (WHERE and ORDER BY variants), plus unit tests onrewrite_column_refs_in_expressionforBinaryCondition/CompoundCondition/UnaryCondition/InCondition/BetweenCondition.TestDispatchByOpType— drives_build_layerdirectly to lock in dispatch behaviour forLazyGroupByAggvs plain relational layers, plus the_extract_special_ops_from_layerand_split_wheres_by_groupby_positionhelpers.Test plan
pytest datastore/tests/test_groupby_sql_pushdown.py -v— 30 passed (12 new regression tests included)pytest datastore/tests/ -q --ignore=tests/test_concurrency.py— 9274 passed, 7 skipped, 88 xfailed, 1 xpassed (baseline 9262 + 12 new tests)head().filter().groupby().agg()pattern now returns the aggregated result instead of dropping the GROUP BY