Optimize count()/count_rows() to strip ORDER BY and use correct SQL for lazy ops by auxten · Pull Request #544 · chdb-io/chdb

auxten · 2026-03-18T09:25:18Z

Summary

Strip ORDER BY from count() / count_rows() SQL: ORDER BY is unnecessary for counting and forces remote ClickHouse servers to fully sort data before counting, causing hangs on large tables via remote() table function.
Fix PANDAS_FILTER detection: count() and count_rows() now correctly treat LazyRelationalOp with op_type == "PANDAS_FILTER" as non-SQL-pushable, falling back to DataFrame execution instead of generating incorrect SQL.
Clean subquery generation: COUNT queries now produce minimal SQL — just SELECT ... WHERE ... without ORDER BY or rowNumberInAllBlocks() overhead.

Context

When calling ds[filter1][filter2].sort_values(...).count() on a DataStore.from_clickhouse(), the count query sent to the remote server included an unnecessary ORDER BY clause. For large remote tables, this forced a full sort before counting, causing the query to hang indefinitely.

Test plan

All 328 count-related tests pass
Full test suite: 9242 passed (5 pre-existing failures from missing fixture file, unrelated)
Verified SQL output for from_clickhouse() path includes WHERE but no ORDER BY

…mize SQL generation - Added handling for "PANDAS_FILTER" operations to ensure proper execution fallback. - Refactored COUNT query generation to exclude unnecessary ORDER BY clauses, improving performance. - Updated subquery construction for efficient column name retrieval using LIMIT 1.

…ter + sort Covers the scenario: ds[cond1][cond2].sort_values([...]).count() which previously hung on remote ClickHouse due to unnecessary ORDER BY.

Remote/external database sources (remote, mysql, postgresql, mongodb, redis, sqlite) have no meaningful "original row order". Adding rowNumberInAllBlocks() forces a full table scan BEFORE any WHERE filtering, which is catastrophic for large remote tables. Mark all remote DB table functions as preserves_row_order=True so the execution engine skips the expensive row-numbering subquery.

auxten added 4 commits March 18, 2026 17:24

test: add regression tests for count()/count_rows() after chained fil…

1d9fd0a

…ter + sort Covers the scenario: ds[cond1][cond2].sort_values([...]).count() which previously hung on remote ClickHouse due to unnecessary ORDER BY.

Update .gitignore to include .cursor/hooks/ directory for local rules

da0792d

auxten changed the title ~~fix: optimize count()/count_rows() to strip ORDER BY and use correct SQL for lazy ops~~ Optimize count()/count_rows() to strip ORDER BY and use correct SQL for lazy ops Mar 18, 2026

auxten merged commit 92e3978 into main Mar 18, 2026
5 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Optimize count()/count_rows() to strip ORDER BY and use correct SQL for lazy ops#544

Optimize count()/count_rows() to strip ORDER BY and use correct SQL for lazy ops#544
auxten merged 4 commits into
mainfrom
fix/count-strips-lazy-ops

auxten commented Mar 18, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

auxten commented Mar 18, 2026

Summary

Context

Test plan

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant