Skip to content

Optimize count()/count_rows() to strip ORDER BY and use correct SQL for lazy ops#544

Merged
auxten merged 4 commits into
mainfrom
fix/count-strips-lazy-ops
Mar 18, 2026
Merged

Optimize count()/count_rows() to strip ORDER BY and use correct SQL for lazy ops#544
auxten merged 4 commits into
mainfrom
fix/count-strips-lazy-ops

Conversation

@auxten

@auxten auxten commented Mar 18, 2026

Copy link
Copy Markdown
Member

Summary

  • Strip ORDER BY from count() / count_rows() SQL: ORDER BY is unnecessary for counting and forces remote ClickHouse servers to fully sort data before counting, causing hangs on large tables via remote() table function.
  • Fix PANDAS_FILTER detection: count() and count_rows() now correctly treat LazyRelationalOp with op_type == "PANDAS_FILTER" as non-SQL-pushable, falling back to DataFrame execution instead of generating incorrect SQL.
  • Clean subquery generation: COUNT queries now produce minimal SQL — just SELECT ... WHERE ... without ORDER BY or rowNumberInAllBlocks() overhead.

Context

When calling ds[filter1][filter2].sort_values(...).count() on a DataStore.from_clickhouse(), the count query sent to the remote server included an unnecessary ORDER BY clause. For large remote tables, this forced a full sort before counting, causing the query to hang indefinitely.

Test plan

  • All 328 count-related tests pass
  • Full test suite: 9242 passed (5 pre-existing failures from missing fixture file, unrelated)
  • Verified SQL output for from_clickhouse() path includes WHERE but no ORDER BY

auxten added 4 commits March 18, 2026 17:24
…mize SQL generation

- Added handling for "PANDAS_FILTER" operations to ensure proper execution fallback.
- Refactored COUNT query generation to exclude unnecessary ORDER BY clauses, improving performance.
- Updated subquery construction for efficient column name retrieval using LIMIT 1.
…ter + sort

Covers the scenario: ds[cond1][cond2].sort_values([...]).count()
which previously hung on remote ClickHouse due to unnecessary ORDER BY.
Remote/external database sources (remote, mysql, postgresql, mongodb,
redis, sqlite) have no meaningful "original row order". Adding
rowNumberInAllBlocks() forces a full table scan BEFORE any WHERE
filtering, which is catastrophic for large remote tables.

Mark all remote DB table functions as preserves_row_order=True so the
execution engine skips the expensive row-numbering subquery.
@auxten auxten changed the title fix: optimize count()/count_rows() to strip ORDER BY and use correct SQL for lazy ops Optimize count()/count_rows() to strip ORDER BY and use correct SQL for lazy ops Mar 18, 2026
@auxten auxten merged commit 92e3978 into main Mar 18, 2026
5 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant