Skip to content

Unify SQL builder behind a single _build_layer dispatcher#577

Merged
auxten merged 1 commit into
mainfrom
refactor/unify-sql-layer-builder
May 21, 2026
Merged

Unify SQL builder behind a single _build_layer dispatcher#577
auxten merged 1 commit into
mainfrom
refactor/unify-sql-layer-builder

Conversation

@auxten

@auxten auxten commented May 21, 2026

Copy link
Copy Markdown
Member

Summary

Replaces the SQL builder's tangle of partially-overlapping helpers (_build_layer_sql / _build_wrapper_layer_sql / _build_simple_query_from_plan / _build_nested_query_from_plan / _build_nested_sql_for_dataframe / _assemble_simple_sql / build_simple_sql / build_nested_sql / _wrap_with_layer) with a single per-layer SQL builder.

The previous architecture kept LazyGroupByAgg / LazyWhere / LazyMask in side-channel fields (plan.groupby_agg, plan.where_ops) and relied on each call site to re-inject them into the right subquery layer. When a GroupByAgg landed in a non-innermost layer the wrapper builder silently dropped it, producing queries like SELECT * FROM (...) WHERE count > 5000 against a base table that has no count column (the bug Mark reported on Slack). The DataFrame source had its own parallel _build_nested_sql_for_dataframe path with the same latent bug. Cross-layer alias rewriting (source-col vs agg-alias conflicts that span subquery boundaries) was patched in with snapshot/diff bookkeeping around a shared mutable plan.alias_renames dict.

Depends on #576 — commit fe26a87 (the explain() display fix) is included here as a prerequisite. Once #576 lands, rebase will leave just the unification commit.

Architecture

  • _build_layer(layer_ops, from_source, is_first_layer, inherited_renames, previously_emitted_renames, ...) is the only per-layer SQL function. Dispatches on op type, returns LayerBuildResult(sql, renames_introduced). Used by both the source path and the DataFrame path.
  • _build_layered_sql chains _build_layer over every layer in a plan; layer 0 reads the table source (or __df__ for the DataFrame path), layers 1+ wrap (inner_sql) AS __subqN__.
  • _render_first_layer_sql / _render_wrapper_layer_sql are the two rendering strategies (table-bound first layer with JOINs / row-order subquery / CASE WHEN subquery vs simple wrap over an explicit from_source).
  • build_groupby_select_fields is now pure: takes inherited_alias_renames read-only and returns (groupby_fields, select_fields, new_renames). No more mutation of shared state.
  • assemble_sql takes an optional from_source. When set, JOINs are skipped (wrapper layers don't carry source-level joins) and the FROM is the supplied SQL fragment.
  • Cross-layer renames travel through two explicit dicts: inherited_renames for alias-collision detection inside build_groupby_select_fields, and previously_emitted_renames for WHERE/ORDER BY Field rewriting. No more snapshot/diff bookkeeping.

Functions removed (dead after the unification)

query_planner.py: plan(), _find_sql_boundary()
sql_executor.py: build_simple_sql, build_nested_sql, _wrap_with_layer, _build_layer_sql, _build_wrapper_layer_sql, _build_simple_query_from_plan, _build_nested_query_from_plan, _build_nested_sql_for_dataframe, _assemble_simple_sql

_build_execution_sql in core.py now uses plan_segments and picks the first SQL segment's plan, mirroring _execute.

Bugs fixed along the way

  1. Layer-N GroupByAgg on file source: df[df['verified_purchase']==True].groupby('product_category')['star_rating'].agg(['mean','count']).sort_values('count', ascending=False).head(10)[result['count'] > 5000] now returns the aggregated result instead of crashing with UNKNOWN_IDENTIFIER: count.
  2. Layer-N GroupByAgg on DataFrame source: same fix, now applied to the Python() execution path through the unified dispatcher.
  3. Cross-layer alias rewriting: when the inner subquery emits SUM(int_col) AS __agg_int_col__ due to a source-col/agg-alias conflict, the outer WHERE int_col > 50000 is now rewritten to use the temp alias.

Tests

  • TestPostAggregationFilterAfterLimit — covers Mark's original report with all four filter styles (mask / loc / query / filter) matching pandas.
  • TestGroupByInNonZeroLayer — end-to-end and structural assertions for the wrapper-layer GroupByAgg case, plus test_dataframe_source_groupby_agg_in_layer_n_matches_pandas exercising the same fix on a Python() DataFrame source.
  • TestCrossLayerAliasRewriting — source-col / agg-alias conflict spanning subquery boundaries (WHERE and ORDER BY variants), plus unit tests on rewrite_column_refs_in_expression for BinaryCondition / CompoundCondition / UnaryCondition / InCondition / BetweenCondition.
  • TestDispatchByOpType — drives _build_layer directly to lock in dispatch behaviour for LazyGroupByAgg vs plain relational layers, plus the _extract_special_ops_from_layer and _split_wheres_by_groupby_position helpers.

Test plan

  • pytest datastore/tests/test_groupby_sql_pushdown.py -v — 30 passed (12 new regression tests included)
  • pytest datastore/tests/ -q --ignore=tests/test_concurrency.py9274 passed, 7 skipped, 88 xfailed, 1 xpassed (baseline 9262 + 12 new tests)
  • Manually verified Mark's original scenario returns pandas-matching output for all four filter styles
  • Manually verified the DataFrame-source head().filter().groupby().agg() pattern now returns the aggregated result instead of dropping the GROUP BY

The previous nested-subquery SQL builder kept LazyGroupByAgg / LazyWhere /
LazyMask in side-channel fields (plan.groupby_agg, plan.where_ops) and
relied on each call site to re-inject them into the right subquery layer.
When a GroupByAgg landed in a non-innermost layer the wrapper builder
silently dropped it, producing queries like
"SELECT * FROM (...) WHERE count > 5000" against a base table that has
no count column. The DataFrame source had its own parallel
_build_nested_sql_for_dataframe path with the same latent bug. Cross-
layer alias rewriting (source-col vs agg-alias conflicts that span
subquery boundaries) was patched in with snapshot/diff bookkeeping
around a shared mutable plan.alias_renames dict.

Replace the whole tangle with a single per-layer SQL builder:

- _build_layer(layer_ops, from_source, is_first_layer,
  inherited_renames, previously_emitted_renames, ...) -> LayerBuildResult
  dispatches on op type and returns the SQL string plus the temp aliases
  it emitted. It is the only per-layer SQL function; the source and
  DataFrame paths chain it through _build_layered_sql.
- _render_first_layer_sql / _render_wrapper_layer_sql cover the two
  rendering strategies (table-bound vs subquery / __df__-bound).
- build_groupby_select_fields is now pure: it takes
  inherited_alias_renames read-only and returns (groupby_fields,
  select_fields, new_renames) so callers explicitly merge the renames
  back into their state.
- assemble_sql takes an optional from_source; wrapper layers and the
  DataFrame execution path use it to override the default
  self.ds._table_function FROM.
- Cross-layer renames travel through two explicit dicts
  (inherited_renames for alias-collision detection,
  previously_emitted_renames for WHERE/ORDER BY rewriting), no more
  shared mutable state.
- _build_sql_for_dataframe routes multi-layer plans through the same
  _build_layered_sql loop with first_layer_from_source="__df__",
  which also fixes the GroupByAgg-in-nested-layer bug on Python()
  DataFrame sources.

Functions removed (dead after the unification):
- plan() and _find_sql_boundary in query_planner.py
- build_simple_sql, build_nested_sql, _wrap_with_layer,
  _build_layer_sql, _build_wrapper_layer_sql,
  _build_simple_query_from_plan, _build_nested_query_from_plan,
  _build_nested_sql_for_dataframe, _assemble_simple_sql in
  sql_executor.py.
_build_execution_sql in core.py now uses plan_segments and picks the
first SQL segment's plan.

QueryPlan docstring rewritten to describe the new data flow and the
role of alias_renames (written during SQL building, consumed by
post-execution rename-back).

Tests:
- TestPostAggregationFilterAfterLimit covers the original layer-N
  GroupByAgg bug reported on Slack (mask / loc / query / filter all
  match pandas).
- TestGroupByInNonZeroLayer adds end-to-end and structural assertions
  for the wrapper-layer GroupByAgg case, including the new
  test_dataframe_source_groupby_agg_in_layer_n_matches_pandas which
  exercises the same fix on a Python() DataFrame source.
- TestCrossLayerAliasRewriting covers the source-col / agg-alias
  conflict that crosses subquery boundaries.
- TestDispatchByOpType drives _build_layer directly to lock in the
  dispatch behaviour for GroupByAgg vs plain relational layers.

Full datastore suite: 9274 passed, 7 skipped, 88 xfailed, 1 xpassed
(matching the baseline plus the 12 new regression tests).
@auxten auxten force-pushed the refactor/unify-sql-layer-builder branch from 7d6c15b to 69bac96 Compare May 21, 2026 02:42
@auxten

auxten commented May 21, 2026

Copy link
Copy Markdown
Member Author

@copilot review

Copilot AI commented May 21, 2026

Copy link
Copy Markdown

@copilot review

Completed a review pass on commit 69bac96, checked recent workflow runs/logs, and re-ran the targeted regression suite (datastore/tests/test_groupby_sql_pushdown.py): no additional code changes are needed.

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR refactors SQL compilation to a single unified per-layer dispatcher (SQLExecutionEngine._build_layer) and a single layer-chaining pipeline (_build_layered_sql) shared by both file/table sources and the Python() DataFrame source. The goal is to prevent “side-channel” ops (notably LazyGroupByAgg, LazyWhere, LazyMask) from being dropped when they appear in non-innermost layers, and to make cross-layer alias rewriting explicit and reliable.

Changes:

  • Replace multiple overlapping SQL builder paths with a single _build_layer + _build_layered_sql pipeline and explicit cross-layer rename propagation.
  • Introduce expression/order-by column ref rewriting helpers to support cross-layer temp-alias propagation.
  • Add extensive regression/unit tests covering layer-N groupby/agg, post-aggregation filtering after head/sort, and cross-layer alias rewriting; update call sites to use segmented planning (plan_segments).

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.

File Description
datastore/sql_executor.py Introduces unified per-layer SQL builder, cross-layer rewrite helpers, and refactors groupby select-field construction to be pure + return emitted rename mappings.
datastore/query_planner.py Updates QueryPlan documentation/semantics for segmented planning and ensures layers retain “special ops” so SQL building can dispatch by op type.
datastore/core.py Switches _build_execution_sql to segmented planning and selects a SQL segment plan for SQL generation.
datastore/tests/test_groupby_sql_pushdown.py Adds regression/unit tests for the layer dispatcher, post-aggregation filtering, and cross-layer alias rewriting; updates tests to use plan_segments.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread datastore/sql_executor.py
Comment thread datastore/sql_executor.py
Comment thread datastore/core.py
auxten added a commit that referenced this pull request May 21, 2026
Two follow-ups from automated review:

1. `_build_execution_sql` (core.py): the helper backing `to_sql()` /
   `explain()`'s SQL preview used `next(seg for seg in segments if
   seg.is_sql())`, which would pick a LATER `SQL-on-Python(__df__)`
   segment when the first segment was Pandas (cost-aware planner for
   `ORDER BY` without `LIMIT`). That returned a `Python(__df__)` SQL
   whose `FROM` does not match the source the executor actually
   issues first. Constrained to the FIRST segment: if it is SQL,
   return that segment's SQL; otherwise fall back to
   `_generate_select_sql` (the `SELECT * FROM <source>` the executor
   runs to materialize rows for Pandas).

2. `_build_layers` (query_planner.py): `mask(...).groupby(...).agg(...)`
   (or `where(...).groupby(...).agg(...)`) kept the LazyMask/LazyWhere
   in the SAME layer as LazyGroupByAgg. The wrapper-layer CASE-WHEN
   temp-alias machinery (`__tmp_<col>__`) then collided with the
   aggregation's SELECT list - the outer SELECT renamed `__tmp_*__`
   columns that the inner GROUP-BY SELECT never emitted, producing
   `UNKNOWN_IDENTIFIER` (DataFrame source) or wrong values
   (file/table source). Now LazyGroupByAgg starts a new nested layer
   when the current layer already holds a LazyWhere/LazyMask, so the
   mask runs in the inner subquery and GROUP BY runs on the masked
   output:
   `SELECT ... agg() FROM (SELECT CASE WHEN ... AS x) GROUP BY x`.

3. `_render_wrapper_layer_sql` (sql_executor.py): the
   `add_row_order_fallback` `ORDER BY _row_id` clause is intentionally
   skipped when wrapping with `__case_subq__`. The `Python(__df__)`
   virtual `_row_id` is not propagated through the inner CASE-WHEN
   SELECT, so referencing it on the outer rename layer would fail with
   `UNKNOWN_IDENTIFIER`. CASE-WHEN is row-preserving, so source order
   naturally flows up through the wrapper.

Regression tests:
- `tests/test_groupby_sql_pushdown.py::TestDispatchByOpType
  ::test_mask_then_groupby_splits_into_nested_layers` and
  `test_where_then_groupby_splits_into_nested_layers` pin both the
  plan structure (2 layers, mask in inner, groupby in outer) and the
  end-to-end value equality vs pandas.
- `tests/test_explain_method.py::TestBuildExecutionSQLFirstSegment`
  covers both branches: SQL-first chain returns source-bound SQL,
  Pandas-first chain returns the initial `SELECT * FROM <source>`
  (never `Python(__df__)`).

Verified: 10477 passed, 0 failed (full suite, remote tests excluded).
@auxten auxten merged commit ab5da29 into main May 21, 2026
11 checks passed
auxten added a commit that referenced this pull request May 21, 2026
Fixes three bugs uncovered by the user-journey + property-based test
sweep introduced on top of PR #577's unified `_build_layer` dispatcher.

1. head-before-agg LIMIT ordering (`query_planner._build_layers`)
   `head(N).groupby().agg()` was folded into one SQL layer, so LIMIT
   ran after GROUP BY and limited the number of groups instead of the
   number of input rows. The planner now starts a new nested-subquery
   layer when `LazyGroupByAgg` follows LIMIT/OFFSET.

2. Post-agg WHERE folded back into pre-agg WHERE (`sql_executor`)
   `result[result['count'] > 0]` after a `groupby().agg(...)` had its
   condition appended to the pre-aggregation WHERE clause, which
   silently bound to the source column of the same name. The first-
   layer renderer now emits the post-agg conditions as a real HAVING
   clause (combined with any user-supplied `self.ds._having_condition`).

3. DataFrame source assign-before-groupby (`sql_executor._build_sql_for_dataframe`)
   `ds.assign(x=...).groupby(...).agg(...)` against a Python() DataFrame
   source emitted `SELECT *, expr AS x, agg(...) ... GROUP BY ...`,
   which ClickHouse rejects with NOT_AN_AGGREGATE. The single-layer
   DataFrame path now routes any plan containing `groupby_agg` /
   `where_ops` / multi-layer through the unified `_build_layered_sql`
   pipeline, and `_build_layer` forces `include_star=False` whenever
   the layer has a `groupby_agg` (the agg's emitted select_fields
   already cover the group keys + aggregations).

Supporting changes in the unified pipeline:
- `_build_layer`: append `col IS NOT NULL` filters for `groupby_cols`
  when `dropna=True` (pandas default) - matches the legacy inline
  DataFrame path that was previously the only place this lived.
- `_render_wrapper_layer_sql`: accept `where_temp_alias_columns` so
  the wrapper layer renames `__tmp_col__` back to user-visible
  column names; postpone WHERE/ORDER BY/LIMIT/OFFSET to the outer
  rename layer so they bind to the masked values (LazyMask/LazyWhere
  semantic); suppress `ORDER BY _row_id` fallback when GROUP BY is
  present (would be NOT_AN_AGGREGATE).
- `_collect_select_fields_from_sql_ops`: unchanged accumulating
  semantics; instead the planner splits a pure-projection SELECT
  into a new layer when an in-layer WHERE references a column the
  projection would drop (e.g. `assign(x=...)[where x>0][['id']]` as
  produced by ColumnExpr's per-column pruning path).

Test coverage:
- 3 previously-xfailed regression tests in `datastore/tests/journeys/`
  (`test_head_before_agg_respects_limit`,
  `test_agg_output_name_shadows_source_column`,
  `test_top_three_revenue_dataframe_source_assign_before_groupby`)
  flipped to permanent passing guards.
- `test_property_based_chains.py` no longer skips the
  head-before-agg / sort-or-select-before-agg chains; remaining
  empty-result groupby cases (pre-existing pandas-vs-DataStore
  column-set divergence) are short-circuited explicitly.
- New `datastore/tests/` files (journeys/, cartesian product,
  naming-conflict fuzz, SQL snapshot assertions, plan invariants,
  property-based chains) introduce ~150 additional regressions.
- Updated `.cursor/rules/chdb-ds.mdc` with multi-step-mirror and
  property-based-sweep test guidelines.
- `requirements-dev.txt`: add `hypothesis>=6.0.0`.

Verified:
- 10473 passed, 0 failed (full suite, remote tests excluded).
- 1000-example hypothesis property sweep passes.
- All 47 journey tests pass (including the 3 flipped xfails).
auxten added a commit that referenced this pull request May 21, 2026
Two follow-ups from automated review:

1. `_build_execution_sql` (core.py): the helper backing `to_sql()` /
   `explain()`'s SQL preview used `next(seg for seg in segments if
   seg.is_sql())`, which would pick a LATER `SQL-on-Python(__df__)`
   segment when the first segment was Pandas (cost-aware planner for
   `ORDER BY` without `LIMIT`). That returned a `Python(__df__)` SQL
   whose `FROM` does not match the source the executor actually
   issues first. Constrained to the FIRST segment: if it is SQL,
   return that segment's SQL; otherwise fall back to
   `_generate_select_sql` (the `SELECT * FROM <source>` the executor
   runs to materialize rows for Pandas).

2. `_build_layers` (query_planner.py): `mask(...).groupby(...).agg(...)`
   (or `where(...).groupby(...).agg(...)`) kept the LazyMask/LazyWhere
   in the SAME layer as LazyGroupByAgg. The wrapper-layer CASE-WHEN
   temp-alias machinery (`__tmp_<col>__`) then collided with the
   aggregation's SELECT list - the outer SELECT renamed `__tmp_*__`
   columns that the inner GROUP-BY SELECT never emitted, producing
   `UNKNOWN_IDENTIFIER` (DataFrame source) or wrong values
   (file/table source). Now LazyGroupByAgg starts a new nested layer
   when the current layer already holds a LazyWhere/LazyMask, so the
   mask runs in the inner subquery and GROUP BY runs on the masked
   output:
   `SELECT ... agg() FROM (SELECT CASE WHEN ... AS x) GROUP BY x`.

3. `_render_wrapper_layer_sql` (sql_executor.py): the
   `add_row_order_fallback` `ORDER BY _row_id` clause is intentionally
   skipped when wrapping with `__case_subq__`. The `Python(__df__)`
   virtual `_row_id` is not propagated through the inner CASE-WHEN
   SELECT, so referencing it on the outer rename layer would fail with
   `UNKNOWN_IDENTIFIER`. CASE-WHEN is row-preserving, so source order
   naturally flows up through the wrapper.

Regression tests:
- `tests/test_groupby_sql_pushdown.py::TestDispatchByOpType
  ::test_mask_then_groupby_splits_into_nested_layers` and
  `test_where_then_groupby_splits_into_nested_layers` pin both the
  plan structure (2 layers, mask in inner, groupby in outer) and the
  end-to-end value equality vs pandas.
- `tests/test_explain_method.py::TestBuildExecutionSQLFirstSegment`
  covers both branches: SQL-first chain returns source-bound SQL,
  Pandas-first chain returns the initial `SELECT * FROM <source>`
  (never `Python(__df__)`).

Verified: 10477 passed, 0 failed (full suite, remote tests excluded).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants