
feat: Support structured data types and type casting#2582

Merged
m1n0 merged 5 commits into main from dtl-1657-support-structured-data-and-type-casting on Feb 17, 2026

Conversation

Contributor

@m1n0 m1n0 commented Feb 12, 2026

Description

Adds support for structured data and type casting.

The approach:

  • Capture raw user input in check/column YAML classes
  • Transform this input into its SQL AST representation in the check/column implementation classes. This way it gets correctly quoted or wrapped in parentheses depending on whether it is a plain column name or a user-provided full expression.
  • Use the column_expression property as an abstraction over column vs. column expression, together with the override logic, so that code further down from the check implementation can use it reliably without caring what value it contains; it is always rendered into correct SQL, including quoting.
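The property/override logic described above could be sketched roughly like this. Note that SqlExpressionStr, Column, ColumnImpl and CheckImpl here are simplified stand-ins for illustration, not the actual soda-core classes:

```python
from dataclasses import dataclass
from typing import Optional, Union


@dataclass
class SqlExpressionStr:
    """A user-provided SQL expression; rendered wrapped in parentheses."""
    expression: str

    def render(self) -> str:
        return f"({self.expression})"


@dataclass
class Column:
    """A plain column name; rendered with identifier quoting."""
    name: str

    def render(self) -> str:
        return f'"{self.name}"'


class ColumnImpl:
    def __init__(self, name: str, column_expression: Optional[str] = None):
        self.name = name
        self._column_expression = column_expression

    @property
    def column_expression(self) -> Union[SqlExpressionStr, Column]:
        # Raw YAML input is turned into an AST node here, so callers
        # never have to care whether it was an expression or a name.
        if self._column_expression:
            return SqlExpressionStr(self._column_expression)
        return Column(self.name)


class CheckImpl:
    def __init__(self, column_impl: ColumnImpl, column_expression: Optional[str] = None):
        self.column_impl = column_impl
        self._column_expression = column_expression

    @property
    def column_expression(self) -> Union[SqlExpressionStr, Column]:
        # A check-level expression overrides the column-level one.
        if self._column_expression:
            return SqlExpressionStr(self._column_expression)
        return self.column_impl.column_expression


check = CheckImpl(ColumnImpl("employee_key", 'CAST("employee_key" AS STRING)'))
print(check.column_expression.render())  # (CAST("employee_key" AS STRING))
```

Downstream code only ever calls `render()` (or the real AST equivalent) and gets correctly quoted SQL either way.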

Checklist

  • I added a test to verify the new functionality.
  • I verified this PR does not break soda-extensions.

The initial PR is a WIP to share the approach. What still needs to happen:

  • the properties are called column_expression. The name makes sense at the YAML level, as that is what the config property is called, but it also makes sense at the impl level, as it literally holds SQL expressions. Is there a less confusing name to distinguish the two?
  • expand to all check types
  • add tests - this feels like a good feature to start adding tests that only run on one datasource for efficiency

Example:

(this uses column_expression at the column level; overrides at the check level are supported as well)

...
columns:
  - name: employee_key
    column_expression: CAST("employee_key" AS STRING)
    checks:
      - missing:
      - missing:
          qualifier: regex
          missing_format:
            regex: ^\s*$
            name: empty string

will produce

WITH
"_soda_filtered_dataset" AS (
SELECT *
FROM "main"."dim_employee"
WHERE (hire_date < '2030-01-01')
)
SELECT COUNT(*),
       SUM(CASE WHEN (CAST("employee_key" AS STRING)) IS NULL THEN 1 END),
       SUM(CASE WHEN ((CAST("employee_key" AS STRING)) IS NULL OR REGEXP_MATCHES((CAST("employee_key" AS STRING)), '^\s*$')) THEN 1 END)
FROM "_soda_filtered_dataset";

Copilot AI left a comment

Pull request overview

Adds a YAML-driven column_expression mechanism to allow checks/metrics to operate on user-provided SQL expressions (e.g., casts / structured field extraction) instead of only raw column names, and begins plumbing this through the contracts verification implementation.

Changes:

  • Parse column_expression from contract YAML at both column- and check-level.
  • Introduce column_expression plumbing in verification impl (ColumnImpl/CheckImpl → MetricImpl) and update missing check to use it.
  • Update missing/validity “missing” predicate builder to accept SQL AST expressions.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 5 comments.

File Description
soda-core/src/soda_core/contracts/impl/contract_yaml.py Adds column_expression parsing on column and check YAML objects.
soda-core/src/soda_core/contracts/impl/contract_verification_impl.py Adds column_expression properties/plumbing and adjusts missing-expression builder signature.
soda-core/src/soda_core/contracts/impl/check_types/missing_check.py Uses the resolved column_expression when building the missing-count condition.


Comment on lines 1540 to +1557
@@ -1535,6 +1553,8 @@ def __init__(
if data_source_impl:
self.data_source_impl = data_source_impl

self.column_expression: Optional[SqlExpressionStr | COLUMN] = column_expression

Copilot AI commented Feb 12, 2026
MetricImpl now accepts/stores column_expression, but it is not incorporated into the metric ID (see _get_id_properties). Metrics with different column_expression values can therefore collide and be incorrectly de-duplicated by MetricsResolver, producing wrong SQL/results. Include column_expression (or its string form) in the ID properties when set (and consider whether it should replace/augment the column property).

Contributor
Good point from copilot. I guess we'll have ID collisions with current implementation?

Contributor
Nice catch indeed 👏 .

Should we add a test for this specific scenario?

Contributor Author

@m1n0 m1n0 commented Feb 16, 2026
I dismissed this at first based on an incorrect test, but I revisited it now, found a different way to break it, and indeed there would be a clash. I updated MetricImpl to add the column expression to the identity, and added a test case as well.

@m1n0 m1n0 marked this pull request as ready for review February 16, 2026 14:41
Contributor Author

m1n0 commented Feb 16, 2026

Current state:

  • missing check and aggregate check are supported
  • added tests for both, running only on postgres, as we only need to verify that expressions get applied; we are not trying to test the SQL itself
  • split the test codebase into unit/feature/integration, with the idea:
      • unit: what was called "components" before; no need for our own term, let's go with the standard naming. These run on postgres only for efficiency (this was already the case).
      • feature: we previously had "features" which were really feature+integration tests; these are now split out to do deep feature testing, not datasource integration. They run on postgres only for efficiency. This is a new category.
      • integration: previously called "features"; true integration tests with datasources, run for the whole matrix of python versions and datasources. We still have tests in datasource-specific packages that verify e.g. connectivity or very niche datasource behaviour. Integration tests should normally be shared across datasources and test features in a more basic way, with a smaller footprint, so that CI runs faster.
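On disk, the split described above might look roughly like this (hypothetical paths; the actual layout in the repo may differ):

```
soda-core/tests/
├── unit/         # formerly "components"; postgres only, mostly mocked datasources
├── feature/      # deep feature tests, postgres only (new category)
└── integration/  # formerly "features"; full python × datasource matrix
```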

SonarCloud is complaining about a bunch of things because old tests are considered new files. I will go over the output and address anything that is actually new, if any.

Contributor

@mivds mivds left a comment
Overall looks good, only larger concern is the identity collision raised by copilot. Holding off on approving until that one is resolved. The other stuff is minor.

Also, thank you for taking the first steps in more specific test cases! 🙌

Comment on lines +1380 to +1381
else:
if self.column_impl:
Contributor
Nit: elif


Contributor

@Niels-b Niels-b left a comment

Left some comments for clarification, and there is the issue mentioned by Copilot. Otherwise good work 👍 👏!

Thank you for taking the initial step of splitting up the tests.
Are we sure that this won't cause any issues? It might be good to also run the tests in unit on SqlServer 🤔. That also runs in a docker container on CI and provides coverage for another dialect.

return SqlExpressionStr(self.check_yaml.column_expression)
else:
if self.column_impl:
return self.column_impl.column_expression
Contributor
Is this always guaranteed to be of the correct type SqlExpressionStr | COLUMN, so not a str anymore?

Contributor Author

Yes, ColumnImpl.column_expression is guaranteed to return SqlExpressionStr | COLUMN, so CheckImpl.column_expression can guarantee that as well.



sonarqubecloud bot commented Feb 16, 2026

Quality Gate passed

Issues
74 New issues
5 Accepted issues

Measures
0 Security Hotspots
No data about Coverage
3.3% Duplication on New Code

See analysis details on SonarQube Cloud

Copilot AI left a comment

Pull request overview

Copilot reviewed 7 out of 58 changed files in this pull request and generated 2 comments.



Contributor Author

m1n0 commented Feb 16, 2026

Thanks for the reviews! I addressed all the comments and some more.

@Niels-b regarding

Are we sure that this wouldn't cause any issues? It might be good to also run the tests in unit on SqlServer 🤔 . That also runs in a docker container on CI and provides coverage for another Dialect.

I wouldn't do it for unit tests; they really don't do anything with datasources, and most of them even mock them or don't use them at all. Perhaps you meant the new feature category?

Contributor

Niels-b commented Feb 16, 2026

@m1n0, indeed I meant the feature category. Sorry for the confusion 😅.
Unit tests are indeed fine as they are 👍

Contributor

Niels-b commented Feb 16, 2026

Approved the PR already, nice work on this.

As for testing the feature category on SqlServer: we can always check/change this later. It doesn't need to block merging this, but we also shouldn't lose track of the discussion.

@m1n0 m1n0 merged commit 95896a5 into main Feb 17, 2026
41 checks passed
@m1n0 m1n0 deleted the dtl-1657-support-structured-data-and-type-casting branch February 17, 2026 08:15