-
Notifications
You must be signed in to change notification settings - Fork 190
[RFC] Support Append command in PPL #4078
Description
Problem Statement
Implement append command to simulate SQL standard union all operation. The append command syntax is quite similar to the appendcols command, which is already implemented during Calcite V3 engine development. Although their syntax are similar, their final combined result formats are quite different. The key difference is that the append command functions as a special union operator in SQL, while the appendcols operates as a special join operator in SQL.
Syntax
append <subsearch>
Core Functionality
OpenSearch doesn’t have similar functionality as append command. So we need to prioritize the new command development.
The append command appends the results of subsearch to the current results vertically. Say, the current search outputs 1000 rows with fields schema [a, b] and the subsearch outputs another 1000 rows with different fields schema [a, c, d], the final output will contain 2000 rows with the combined fields of two searches, aka [a, b, c, d].
The following table illustrates the concept. The green colored results are the main query. The yellow colored results are the subquery output. When combining results, the command fill any missing fields with NULL values.
Implementation
OpenSearch PPL has already supported subsearch syntax starting with search command, follows a serious of other commands. To support the core functionality, we just need to implement union operation over two result rows for the main query and the subquery. Standard SQL union operator (like that in Calcite) only supports union two datasets with the same field schema. But we can achieve the functionality by a simple workaround approach. The problem can be translated to a similar SQL query like:
select a, b, NULL as c, NULL as d from main_query
union all
select a, NULL as b, c, d from sub_query
Proposed Solution
- Analyze current query in Calcite RelNode visitor. Get its rowType (output fields schema).
- Analyze subquery in Calcite RelNode visitor. Get its rowType.
- Merge two schemas and add projects over two queries separately. The missing fields in each project will be filled with constant NULL values
- Perform union all operation over two projected RelNodes.
- Support different indices query in subsearch
Long-Term Goals
- Support subsearch timeout to avoid time consuming query
- Allow users to set maximum output rows
- Allow users to inherit timerange columns from upstream.
Pushdown Support
Don’t support it. It’s an operation that is suitable for execution on coordinator node. We should let the pushdown work in main query and subquery internally.
Limitation
Standard SQL union operator requires the exactly the same field schema(including the field types). If both the main query and the subquery has a same name field but with different types, the standard union could error out. A workaround could be using different field alias for conflicting field.
Metadata
Metadata
Assignees
Labels
Type
Projects
Status
Status