Unify SQL builder behind a single _build_layer dispatcher by auxten · Pull Request #577 · chdb-io/chdb

auxten · 2026-05-21T02:25:38Z

Summary

Replaces the SQL builder's tangle of partially-overlapping helpers (_build_layer_sql / _build_wrapper_layer_sql / _build_simple_query_from_plan / _build_nested_query_from_plan / _build_nested_sql_for_dataframe / _assemble_simple_sql / build_simple_sql / build_nested_sql / _wrap_with_layer) with a single per-layer SQL builder.

The previous architecture kept LazyGroupByAgg / LazyWhere / LazyMask in side-channel fields (plan.groupby_agg, plan.where_ops) and relied on each call site to re-inject them into the right subquery layer. When a GroupByAgg landed in a non-innermost layer the wrapper builder silently dropped it, producing queries like SELECT * FROM (...) WHERE count > 5000 against a base table that has no count column (the bug Mark reported on Slack). The DataFrame source had its own parallel _build_nested_sql_for_dataframe path with the same latent bug. Cross-layer alias rewriting (source-col vs agg-alias conflicts that span subquery boundaries) was patched in with snapshot/diff bookkeeping around a shared mutable plan.alias_renames dict.

Depends on #576 — commit fe26a87 (the explain() display fix) is included here as a prerequisite. Once #576 lands, rebase will leave just the unification commit.

Architecture

_build_layer(layer_ops, from_source, is_first_layer, inherited_renames, previously_emitted_renames, ...) is the only per-layer SQL function. Dispatches on op type, returns LayerBuildResult(sql, renames_introduced). Used by both the source path and the DataFrame path.
_build_layered_sql chains _build_layer over every layer in a plan; layer 0 reads the table source (or __df__ for the DataFrame path), layers 1+ wrap (inner_sql) AS __subqN__.
_render_first_layer_sql / _render_wrapper_layer_sql are the two rendering strategies (table-bound first layer with JOINs / row-order subquery / CASE WHEN subquery vs simple wrap over an explicit from_source).
build_groupby_select_fields is now pure: takes inherited_alias_renames read-only and returns (groupby_fields, select_fields, new_renames). No more mutation of shared state.
assemble_sql takes an optional from_source. When set, JOINs are skipped (wrapper layers don't carry source-level joins) and the FROM is the supplied SQL fragment.
Cross-layer renames travel through two explicit dicts: inherited_renames for alias-collision detection inside build_groupby_select_fields, and previously_emitted_renames for WHERE/ORDER BY Field rewriting. No more snapshot/diff bookkeeping.

Functions removed (dead after the unification)

query_planner.py: plan(), _find_sql_boundary()
sql_executor.py: build_simple_sql, build_nested_sql, _wrap_with_layer, _build_layer_sql, _build_wrapper_layer_sql, _build_simple_query_from_plan, _build_nested_query_from_plan, _build_nested_sql_for_dataframe, _assemble_simple_sql

_build_execution_sql in core.py now uses plan_segments and picks the first SQL segment's plan, mirroring _execute.

Bugs fixed along the way

Layer-N GroupByAgg on file source: df[df['verified_purchase']==True].groupby('product_category')['star_rating'].agg(['mean','count']).sort_values('count', ascending=False).head(10)[result['count'] > 5000] now returns the aggregated result instead of crashing with UNKNOWN_IDENTIFIER: count.
Layer-N GroupByAgg on DataFrame source: same fix, now applied to the Python() execution path through the unified dispatcher.
Cross-layer alias rewriting: when the inner subquery emits SUM(int_col) AS __agg_int_col__ due to a source-col/agg-alias conflict, the outer WHERE int_col > 50000 is now rewritten to use the temp alias.

Tests

TestPostAggregationFilterAfterLimit — covers Mark's original report with all four filter styles (mask / loc / query / filter) matching pandas.
TestGroupByInNonZeroLayer — end-to-end and structural assertions for the wrapper-layer GroupByAgg case, plus test_dataframe_source_groupby_agg_in_layer_n_matches_pandas exercising the same fix on a Python() DataFrame source.
TestCrossLayerAliasRewriting — source-col / agg-alias conflict spanning subquery boundaries (WHERE and ORDER BY variants), plus unit tests on rewrite_column_refs_in_expression for BinaryCondition / CompoundCondition / UnaryCondition / InCondition / BetweenCondition.
TestDispatchByOpType — drives _build_layer directly to lock in dispatch behaviour for LazyGroupByAgg vs plain relational layers, plus the _extract_special_ops_from_layer and _split_wheres_by_groupby_position helpers.

Test plan

pytest datastore/tests/test_groupby_sql_pushdown.py -v — 30 passed (12 new regression tests included)
pytest datastore/tests/ -q --ignore=tests/test_concurrency.py — 9274 passed, 7 skipped, 88 xfailed, 1 xpassed (baseline 9262 + 12 new tests)
Manually verified Mark's original scenario returns pandas-matching output for all four filter styles
Manually verified the DataFrame-source head().filter().groupby().agg() pattern now returns the aggregated result instead of dropping the GROUP BY

The previous nested-subquery SQL builder kept LazyGroupByAgg / LazyWhere / LazyMask in side-channel fields (plan.groupby_agg, plan.where_ops) and relied on each call site to re-inject them into the right subquery layer. When a GroupByAgg landed in a non-innermost layer the wrapper builder silently dropped it, producing queries like "SELECT * FROM (...) WHERE count > 5000" against a base table that has no count column. The DataFrame source had its own parallel _build_nested_sql_for_dataframe path with the same latent bug. Cross- layer alias rewriting (source-col vs agg-alias conflicts that span subquery boundaries) was patched in with snapshot/diff bookkeeping around a shared mutable plan.alias_renames dict. Replace the whole tangle with a single per-layer SQL builder: - _build_layer(layer_ops, from_source, is_first_layer, inherited_renames, previously_emitted_renames, ...) -> LayerBuildResult dispatches on op type and returns the SQL string plus the temp aliases it emitted. It is the only per-layer SQL function; the source and DataFrame paths chain it through _build_layered_sql. - _render_first_layer_sql / _render_wrapper_layer_sql cover the two rendering strategies (table-bound vs subquery / __df__-bound). - build_groupby_select_fields is now pure: it takes inherited_alias_renames read-only and returns (groupby_fields, select_fields, new_renames) so callers explicitly merge the renames back into their state. - assemble_sql takes an optional from_source; wrapper layers and the DataFrame execution path use it to override the default self.ds._table_function FROM. - Cross-layer renames travel through two explicit dicts (inherited_renames for alias-collision detection, previously_emitted_renames for WHERE/ORDER BY rewriting), no more shared mutable state. - _build_sql_for_dataframe routes multi-layer plans through the same _build_layered_sql loop with first_layer_from_source="__df__", which also fixes the GroupByAgg-in-nested-layer bug on Python() DataFrame sources. Functions removed (dead after the unification): - plan() and _find_sql_boundary in query_planner.py - build_simple_sql, build_nested_sql, _wrap_with_layer, _build_layer_sql, _build_wrapper_layer_sql, _build_simple_query_from_plan, _build_nested_query_from_plan, _build_nested_sql_for_dataframe, _assemble_simple_sql in sql_executor.py. _build_execution_sql in core.py now uses plan_segments and picks the first SQL segment's plan. QueryPlan docstring rewritten to describe the new data flow and the role of alias_renames (written during SQL building, consumed by post-execution rename-back). Tests: - TestPostAggregationFilterAfterLimit covers the original layer-N GroupByAgg bug reported on Slack (mask / loc / query / filter all match pandas). - TestGroupByInNonZeroLayer adds end-to-end and structural assertions for the wrapper-layer GroupByAgg case, including the new test_dataframe_source_groupby_agg_in_layer_n_matches_pandas which exercises the same fix on a Python() DataFrame source. - TestCrossLayerAliasRewriting covers the source-col / agg-alias conflict that crosses subquery boundaries. - TestDispatchByOpType drives _build_layer directly to lock in the dispatch behaviour for GroupByAgg vs plain relational layers. Full datastore suite: 9274 passed, 7 skipped, 88 xfailed, 1 xpassed (matching the baseline plus the 12 new regression tests).

auxten · 2026-05-21T05:22:58Z

@copilot review

Copilot · 2026-05-21T05:25:07Z

@copilot review

Completed a review pass on commit 69bac96, checked recent workflow runs/logs, and re-ran the targeted regression suite (datastore/tests/test_groupby_sql_pushdown.py): no additional code changes are needed.

Copilot

Pull request overview

This PR refactors SQL compilation to a single unified per-layer dispatcher (SQLExecutionEngine._build_layer) and a single layer-chaining pipeline (_build_layered_sql) shared by both file/table sources and the Python() DataFrame source. The goal is to prevent “side-channel” ops (notably LazyGroupByAgg, LazyWhere, LazyMask) from being dropped when they appear in non-innermost layers, and to make cross-layer alias rewriting explicit and reliable.

Changes:

Replace multiple overlapping SQL builder paths with a single _build_layer + _build_layered_sql pipeline and explicit cross-layer rename propagation.
Introduce expression/order-by column ref rewriting helpers to support cross-layer temp-alias propagation.
Add extensive regression/unit tests covering layer-N groupby/agg, post-aggregation filtering after head/sort, and cross-layer alias rewriting; update call sites to use segmented planning (plan_segments).

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.

File	Description
datastore/sql_executor.py	Introduces unified per-layer SQL builder, cross-layer rewrite helpers, and refactors groupby select-field construction to be pure + return emitted rename mappings.
datastore/query_planner.py	Updates QueryPlan documentation/semantics for segmented planning and ensures layers retain “special ops” so SQL building can dispatch by op type.
datastore/core.py	Switches `_build_execution_sql` to segmented planning and selects a SQL segment plan for SQL generation.
datastore/tests/test_groupby_sql_pushdown.py	Adds regression/unit tests for the layer dispatcher, post-aggregation filtering, and cross-layer alias rewriting; updates tests to use `plan_segments`.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Two follow-ups from automated review: 1. `_build_execution_sql` (core.py): the helper backing `to_sql()` / `explain()`'s SQL preview used `next(seg for seg in segments if seg.is_sql())`, which would pick a LATER `SQL-on-Python(__df__)` segment when the first segment was Pandas (cost-aware planner for `ORDER BY` without `LIMIT`). That returned a `Python(__df__)` SQL whose `FROM` does not match the source the executor actually issues first. Constrained to the FIRST segment: if it is SQL, return that segment's SQL; otherwise fall back to `_generate_select_sql` (the `SELECT * FROM <source>` the executor runs to materialize rows for Pandas). 2. `_build_layers` (query_planner.py): `mask(...).groupby(...).agg(...)` (or `where(...).groupby(...).agg(...)`) kept the LazyMask/LazyWhere in the SAME layer as LazyGroupByAgg. The wrapper-layer CASE-WHEN temp-alias machinery (`__tmp_<col>__`) then collided with the aggregation's SELECT list - the outer SELECT renamed `__tmp_*__` columns that the inner GROUP-BY SELECT never emitted, producing `UNKNOWN_IDENTIFIER` (DataFrame source) or wrong values (file/table source). Now LazyGroupByAgg starts a new nested layer when the current layer already holds a LazyWhere/LazyMask, so the mask runs in the inner subquery and GROUP BY runs on the masked output: `SELECT ... agg() FROM (SELECT CASE WHEN ... AS x) GROUP BY x`. 3. `_render_wrapper_layer_sql` (sql_executor.py): the `add_row_order_fallback` `ORDER BY _row_id` clause is intentionally skipped when wrapping with `__case_subq__`. The `Python(__df__)` virtual `_row_id` is not propagated through the inner CASE-WHEN SELECT, so referencing it on the outer rename layer would fail with `UNKNOWN_IDENTIFIER`. CASE-WHEN is row-preserving, so source order naturally flows up through the wrapper. Regression tests: - `tests/test_groupby_sql_pushdown.py::TestDispatchByOpType ::test_mask_then_groupby_splits_into_nested_layers` and `test_where_then_groupby_splits_into_nested_layers` pin both the plan structure (2 layers, mask in inner, groupby in outer) and the end-to-end value equality vs pandas. - `tests/test_explain_method.py::TestBuildExecutionSQLFirstSegment` covers both branches: SQL-first chain returns source-bound SQL, Pandas-first chain returns the initial `SELECT * FROM <source>` (never `Python(__df__)`). Verified: 10477 passed, 0 failed (full suite, remote tests excluded).

Fixes three bugs uncovered by the user-journey + property-based test sweep introduced on top of PR #577's unified `_build_layer` dispatcher. 1. head-before-agg LIMIT ordering (`query_planner._build_layers`) `head(N).groupby().agg()` was folded into one SQL layer, so LIMIT ran after GROUP BY and limited the number of groups instead of the number of input rows. The planner now starts a new nested-subquery layer when `LazyGroupByAgg` follows LIMIT/OFFSET. 2. Post-agg WHERE folded back into pre-agg WHERE (`sql_executor`) `result[result['count'] > 0]` after a `groupby().agg(...)` had its condition appended to the pre-aggregation WHERE clause, which silently bound to the source column of the same name. The first- layer renderer now emits the post-agg conditions as a real HAVING clause (combined with any user-supplied `self.ds._having_condition`). 3. DataFrame source assign-before-groupby (`sql_executor._build_sql_for_dataframe`) `ds.assign(x=...).groupby(...).agg(...)` against a Python() DataFrame source emitted `SELECT *, expr AS x, agg(...) ... GROUP BY ...`, which ClickHouse rejects with NOT_AN_AGGREGATE. The single-layer DataFrame path now routes any plan containing `groupby_agg` / `where_ops` / multi-layer through the unified `_build_layered_sql` pipeline, and `_build_layer` forces `include_star=False` whenever the layer has a `groupby_agg` (the agg's emitted select_fields already cover the group keys + aggregations). Supporting changes in the unified pipeline: - `_build_layer`: append `col IS NOT NULL` filters for `groupby_cols` when `dropna=True` (pandas default) - matches the legacy inline DataFrame path that was previously the only place this lived. - `_render_wrapper_layer_sql`: accept `where_temp_alias_columns` so the wrapper layer renames `__tmp_col__` back to user-visible column names; postpone WHERE/ORDER BY/LIMIT/OFFSET to the outer rename layer so they bind to the masked values (LazyMask/LazyWhere semantic); suppress `ORDER BY _row_id` fallback when GROUP BY is present (would be NOT_AN_AGGREGATE). - `_collect_select_fields_from_sql_ops`: unchanged accumulating semantics; instead the planner splits a pure-projection SELECT into a new layer when an in-layer WHERE references a column the projection would drop (e.g. `assign(x=...)[where x>0][['id']]` as produced by ColumnExpr's per-column pruning path). Test coverage: - 3 previously-xfailed regression tests in `datastore/tests/journeys/` (`test_head_before_agg_respects_limit`, `test_agg_output_name_shadows_source_column`, `test_top_three_revenue_dataframe_source_assign_before_groupby`) flipped to permanent passing guards. - `test_property_based_chains.py` no longer skips the head-before-agg / sort-or-select-before-agg chains; remaining empty-result groupby cases (pre-existing pandas-vs-DataStore column-set divergence) are short-circuited explicitly. - New `datastore/tests/` files (journeys/, cartesian product, naming-conflict fuzz, SQL snapshot assertions, plan invariants, property-based chains) introduce ~150 additional regressions. - Updated `.cursor/rules/chdb-ds.mdc` with multi-step-mirror and property-based-sweep test guidelines. - `requirements-dev.txt`: add `hypothesis>=6.0.0`. Verified: - 10473 passed, 0 failed (full suite, remote tests excluded). - 1000-example hypothesis property sweep passes. - All 47 journey tests pass (including the 3 flipped xfails).

Two follow-ups from automated review: 1. `_build_execution_sql` (core.py): the helper backing `to_sql()` / `explain()`'s SQL preview used `next(seg for seg in segments if seg.is_sql())`, which would pick a LATER `SQL-on-Python(__df__)` segment when the first segment was Pandas (cost-aware planner for `ORDER BY` without `LIMIT`). That returned a `Python(__df__)` SQL whose `FROM` does not match the source the executor actually issues first. Constrained to the FIRST segment: if it is SQL, return that segment's SQL; otherwise fall back to `_generate_select_sql` (the `SELECT * FROM <source>` the executor runs to materialize rows for Pandas). 2. `_build_layers` (query_planner.py): `mask(...).groupby(...).agg(...)` (or `where(...).groupby(...).agg(...)`) kept the LazyMask/LazyWhere in the SAME layer as LazyGroupByAgg. The wrapper-layer CASE-WHEN temp-alias machinery (`__tmp_<col>__`) then collided with the aggregation's SELECT list - the outer SELECT renamed `__tmp_*__` columns that the inner GROUP-BY SELECT never emitted, producing `UNKNOWN_IDENTIFIER` (DataFrame source) or wrong values (file/table source). Now LazyGroupByAgg starts a new nested layer when the current layer already holds a LazyWhere/LazyMask, so the mask runs in the inner subquery and GROUP BY runs on the masked output: `SELECT ... agg() FROM (SELECT CASE WHEN ... AS x) GROUP BY x`. 3. `_render_wrapper_layer_sql` (sql_executor.py): the `add_row_order_fallback` `ORDER BY _row_id` clause is intentionally skipped when wrapping with `__case_subq__`. The `Python(__df__)` virtual `_row_id` is not propagated through the inner CASE-WHEN SELECT, so referencing it on the outer rename layer would fail with `UNKNOWN_IDENTIFIER`. CASE-WHEN is row-preserving, so source order naturally flows up through the wrapper. Regression tests: - `tests/test_groupby_sql_pushdown.py::TestDispatchByOpType ::test_mask_then_groupby_splits_into_nested_layers` and `test_where_then_groupby_splits_into_nested_layers` pin both the plan structure (2 layers, mask in inner, groupby in outer) and the end-to-end value equality vs pandas. - `tests/test_explain_method.py::TestBuildExecutionSQLFirstSegment` covers both branches: SQL-first chain returns source-bound SQL, Pandas-first chain returns the initial `SELECT * FROM <source>` (never `Python(__df__)`). Verified: 10477 passed, 0 failed (full suite, remote tests excluded).

auxten force-pushed the refactor/unify-sql-layer-builder branch from 7d6c15b to 69bac96 Compare May 21, 2026 02:42

Copilot started work on behalf of auxten May 21, 2026 05:23 View session

auxten requested a review from Copilot May 21, 2026 05:23

Copilot started reviewing on behalf of auxten May 21, 2026 05:23 View session

Copilot finished work on behalf of auxten May 21, 2026 05:25

auxten mentioned this pull request May 21, 2026

Resolve 3 SQL builder bugs found by user-path tests #578

Merged

4 tasks

Copilot AI reviewed May 21, 2026

View reviewed changes

Comment thread datastore/sql_executor.py

Comment thread datastore/sql_executor.py

Comment thread datastore/core.py

auxten merged commit ab5da29 into main May 21, 2026
11 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Unify SQL builder behind a single _build_layer dispatcher#577

Unify SQL builder behind a single _build_layer dispatcher#577
auxten merged 1 commit into
mainfrom
refactor/unify-sql-layer-builder

auxten commented May 21, 2026

Uh oh!

auxten commented May 21, 2026

Uh oh!

Copilot AI commented May 21, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Uh oh!

Conversation

auxten commented May 21, 2026

Summary

Architecture

Functions removed (dead after the unification)

Bugs fixed along the way

Tests

Test plan

Uh oh!

auxten commented May 21, 2026

Uh oh!

Copilot AI commented May 21, 2026

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants