feat: Support structured data types and type casting#2582
Conversation
There was a problem hiding this comment.
Pull request overview
Adds a YAML-driven column_expression mechanism to allow checks/metrics to operate on user-provided SQL expressions (e.g., casts / structured field extraction) instead of only raw column names, and begins plumbing this through the contracts verification implementation.
Changes:
- Parse
column_expressionfrom contract YAML at both column- and check-level. - Introduce
column_expressionplumbing in verification impl (ColumnImpl/CheckImpl → MetricImpl) and update missing check to use it. - Update missing/validity “missing” predicate builder to accept SQL AST expressions.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
soda-core/src/soda_core/contracts/impl/contract_yaml.py |
Adds column_expression parsing on column and check YAML objects. |
soda-core/src/soda_core/contracts/impl/contract_verification_impl.py |
Adds column_expression properties/plumbing and adjusts missing-expression builder signature. |
soda-core/src/soda_core/contracts/impl/check_types/missing_check.py |
Uses the resolved column_expression when building the missing-count condition. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| @@ -1535,6 +1553,8 @@ def __init__( | |||
| if data_source_impl: | |||
| self.data_source_impl = data_source_impl | |||
|
|
|||
| self.column_expression: Optional[SqlExpressionStr | COLUMN] = column_expression | |||
|
|
|||
There was a problem hiding this comment.
MetricImpl now accepts/stores column_expression, but it is not incorporated into the metric ID (see _get_id_properties). Metrics with different column_expression values can therefore collide and be incorrectly de-duplicated by MetricsResolver, producing wrong SQL/results. Include column_expression (or its string form) in the ID properties when set (and consider whether it should replace/augment the column property).
There was a problem hiding this comment.
Good point from copilot. I guess we'll have ID collisions with current implementation?
There was a problem hiding this comment.
Nice catch indeed 👏 .
Should we add a test for this specific scenario?
There was a problem hiding this comment.
I dismissed this with an incorrect test first, but I revisited it now and I found a different way to break it and indeed, there would be a clash. I updated the MetricImpl to add the column expression to identity. I added a test case as well.
|
Current state:
SonarCloud is complaining about a bunch of stuff because old tests are considered new files. I will go over the output and address things that are actually new if any. |
| else: | ||
| if self.column_impl: |
| @@ -1535,6 +1553,8 @@ def __init__( | |||
| if data_source_impl: | |||
| self.data_source_impl = data_source_impl | |||
|
|
|||
| self.column_expression: Optional[SqlExpressionStr | COLUMN] = column_expression | |||
|
|
|||
There was a problem hiding this comment.
Good point from copilot. I guess we'll have ID collisions with current implementation?
Niels-b
left a comment
There was a problem hiding this comment.
Left some comments for clarification. And the issue mentioned by CoPilot
Otherwise good work 👍 👏 !
Thank you for taking the initial step for splitting up the tests.
Are we sure that this wouldn't cause any issues? It might be good to also run the tests in unit on SqlServer 🤔 . That also runs in a docker container on CI and provides coverage for another Dialect.
| return SqlExpressionStr(self.check_yaml.column_expression) | ||
| else: | ||
| if self.column_impl: | ||
| return self.column_impl.column_expression |
There was a problem hiding this comment.
Is this always guaranteed to be of the correct type SqlExpressionStr | COLUMN, so not a str anymore?
There was a problem hiding this comment.
yes, ColumnImpl.column_expression is guaranteed to return SqlExpressionStr | COLUMN so CheckImpl.column_expression can guarantee that as well.
| @@ -1535,6 +1553,8 @@ def __init__( | |||
| if data_source_impl: | |||
| self.data_source_impl = data_source_impl | |||
|
|
|||
| self.column_expression: Optional[SqlExpressionStr | COLUMN] = column_expression | |||
|
|
|||
There was a problem hiding this comment.
Nice catch indeed 👏 .
Should we add a test for this specific scenario?
|
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 7 out of 58 changed files in this pull request and generated 2 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
|
thanks for the reviews! I addressed all the comments and some more. @Niels-b regarding
I wouldn't do it for unit tests, they really don't do anything with datasources, most of them even mock them or don't use them at all. Perhaps you meant the new |
|
@m1n0, Indeed I meant the |
|
Approved the PR already, nice work on this. For the testing |

Description
Adds support for structured data and type casting.
The approach:
column_expressionproperty function as an abstraction over column/column expression, and the override logic so that we can reliably use that further down from check impl and not care what value it contains, it will always be correctly rendered into sql including quotes etcChecklist
Initial PR is a WIP to share the approach, what still needs to happen:
column_expression- it makes sense on yaml level as that is what the config property from yaml is called, but the same name also makes sense on impl level as it literally holds sql expressions. Is there a less confusing name to distinguish these?Example:
(this uses
column_expressionon column level, we allow overrides on check level as well)will produce