Optimize count()/count_rows() to strip ORDER BY and use correct SQL for lazy ops#544
Merged
Conversation
…mize SQL generation - Added handling for "PANDAS_FILTER" operations to ensure proper execution fallback. - Refactored COUNT query generation to exclude unnecessary ORDER BY clauses, improving performance. - Updated subquery construction for efficient column name retrieval using LIMIT 1.
…ter + sort Covers the scenario: ds[cond1][cond2].sort_values([...]).count() which previously hung on remote ClickHouse due to unnecessary ORDER BY.
Remote/external database sources (remote, mysql, postgresql, mongodb, redis, sqlite) have no meaningful "original row order". Adding rowNumberInAllBlocks() forces a full table scan BEFORE any WHERE filtering, which is catastrophic for large remote tables. Mark all remote DB table functions as preserves_row_order=True so the execution engine skips the expensive row-numbering subquery.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
count()/count_rows()SQL: ORDER BY is unnecessary for counting and forces remote ClickHouse servers to fully sort data before counting, causing hangs on large tables viaremote()table function.count()andcount_rows()now correctly treatLazyRelationalOpwithop_type == "PANDAS_FILTER"as non-SQL-pushable, falling back to DataFrame execution instead of generating incorrect SQL.SELECT ... WHERE ...without ORDER BY orrowNumberInAllBlocks()overhead.Context
When calling
ds[filter1][filter2].sort_values(...).count()on aDataStore.from_clickhouse(), the count query sent to the remote server included an unnecessaryORDER BYclause. For large remote tables, this forced a full sort before counting, causing the query to hang indefinitely.Test plan
from_clickhouse()path includes WHERE but no ORDER BY