
feat: Support structured data types and type casting#2582

Merged
m1n0 merged 5 commits into main from dtl-1657-support-structured-data-and-type-casting on Feb 17, 2026

Conversation

Contributor

@m1n0 m1n0 commented Feb 12, 2026

Description

Adds support for structured data and type casting.

The approach:

  • Capture raw user input in check/column YAML classes
  • Transform this input into its SQL AST representation in the check/column implementation classes. This way it gets correctly quoted or wrapped in parentheses depending on whether it is a plain column name or a user-provided full expression.
  • Use the column_expression property as an abstraction over column vs. column expression, together with the override logic, so that code further down from the check implementation can use it reliably without caring what value it contains; it is always rendered into correct SQL, including quoting.
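The property/override logic described above could be sketched roughly like this. Note that SqlExpressionStr, Column, ColumnImpl and CheckImpl here are simplified stand-ins for illustration, not the actual soda-core classes:

```python
from dataclasses import dataclass
from typing import Optional, Union


@dataclass
class SqlExpressionStr:
    """A user-provided SQL expression; rendered wrapped in parentheses."""
    expression: str

    def render(self) -> str:
        return f"({self.expression})"


@dataclass
class Column:
    """A plain column name; rendered with identifier quoting."""
    name: str

    def render(self) -> str:
        return f'"{self.name}"'


class ColumnImpl:
    def __init__(self, name: str, column_expression: Optional[str] = None):
        self.name = name
        self._column_expression = column_expression

    @property
    def column_expression(self) -> Union[SqlExpressionStr, Column]:
        # Raw YAML input is turned into an AST node here, so callers
        # never have to care whether it was an expression or a name.
        if self._column_expression:
            return SqlExpressionStr(self._column_expression)
        return Column(self.name)


class CheckImpl:
    def __init__(self, column_impl: ColumnImpl, column_expression: Optional[str] = None):
        self.column_impl = column_impl
        self._column_expression = column_expression

    @property
    def column_expression(self) -> Union[SqlExpressionStr, Column]:
        # A check-level expression overrides the column-level one.
        if self._column_expression:
            return SqlExpressionStr(self._column_expression)
        return self.column_impl.column_expression


check = CheckImpl(ColumnImpl("employee_key", 'CAST("employee_key" AS STRING)'))
print(check.column_expression.render())  # (CAST("employee_key" AS STRING))
```

Downstream code only ever calls `render()` (or the real AST equivalent) and gets correctly quoted SQL either way.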

Checklist

  • I added a test to verify the new functionality.
  • I verified this PR does not break soda-extensions.

The initial PR is a WIP to share the approach. What still needs to happen:

  • the properties are called column_expression. The name makes sense at the YAML level, as that is what the config property is called, but it also makes sense at the impl level, as it literally holds SQL expressions. Is there a less confusing name to distinguish the two?
  • expand to all check types
  • add tests - this feels like a good feature to start adding tests that only run on one datasource for efficiency

Example:

(this uses column_expression at the column level; overrides at the check level are supported as well)

...
columns:
  - name: employee_key
    column_expression: CAST("employee_key" AS STRING)
    checks:
      - missing:
      - missing:
          qualifier: regex
          missing_format:
            regex: ^\s*$
            name: empty string

will produce

WITH
"_soda_filtered_dataset" AS (
SELECT *
FROM "main"."dim_employee"
WHERE (hire_date < '2030-01-01')
)
SELECT COUNT(*),
       SUM(CASE WHEN (CAST("employee_key" AS STRING)) IS NULL THEN 1 END),
       SUM(CASE WHEN ((CAST("employee_key" AS STRING)) IS NULL OR REGEXP_MATCHES((CAST("employee_key" AS STRING)), '^\s*$')) THEN 1 END)
FROM "_soda_filtered_dataset";

Copilot AI left a comment

Pull request overview

Adds a YAML-driven column_expression mechanism to allow checks/metrics to operate on user-provided SQL expressions (e.g., casts / structured field extraction) instead of only raw column names, and begins plumbing this through the contracts verification implementation.

Changes:

  • Parse column_expression from contract YAML at both column- and check-level.
  • Introduce column_expression plumbing in verification impl (ColumnImpl/CheckImpl → MetricImpl) and update missing check to use it.
  • Update missing/validity “missing” predicate builder to accept SQL AST expressions.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 5 comments.

File Description
soda-core/src/soda_core/contracts/impl/contract_yaml.py Adds column_expression parsing on column and check YAML objects.
soda-core/src/soda_core/contracts/impl/contract_verification_impl.py Adds column_expression properties/plumbing and adjusts missing-expression builder signature.
soda-core/src/soda_core/contracts/impl/check_types/missing_check.py Uses the resolved column_expression when building the missing-count condition.


Comment on lines 1540 to +1557
@@ -1535,6 +1553,8 @@ def __init__(
if data_source_impl:
self.data_source_impl = data_source_impl

self.column_expression: Optional[SqlExpressionStr | COLUMN] = column_expression

Copilot AI commented Feb 12, 2026
MetricImpl now accepts/stores column_expression, but it is not incorporated into the metric ID (see _get_id_properties). Metrics with different column_expression values can therefore collide and be incorrectly de-duplicated by MetricsResolver, producing wrong SQL/results. Include column_expression (or its string form) in the ID properties when set (and consider whether it should replace/augment the column property).

Contributor
Good point from copilot. I guess we'll have ID collisions with current implementation?

Contributor
Nice catch indeed 👏 .

Should we add a test for this specific scenario?

Contributor Author

@m1n0 m1n0 commented Feb 16, 2026
I dismissed this at first based on an incorrect test, but I revisited it now, found a different way to break it, and indeed there would be a clash. I updated MetricImpl to add the column expression to the identity, and added a test case as well.

@m1n0 m1n0 marked this pull request as ready for review February 16, 2026 14:41
Contributor Author

m1n0 commented Feb 16, 2026

Current state:

  • missing check and aggregate check are supported
  • added tests for both, running only on postgres, as we only need to verify that expressions get applied; we are not trying to test the SQL itself
  • split the test codebase into unit/feature/integration, with the idea:
      • unit: what was called "components" before; no need for our own term, let's go with the standard naming. These run on postgres only for efficiency (this was already the case).
      • feature: we previously had "features" which were really feature+integration tests; these are now split out to do deep feature testing, not datasource integration. They run on postgres only for efficiency. This is a new category.
      • integration: previously called "features"; true integration tests with datasources, run for the whole matrix of python versions and datasources. We still have tests in datasource-specific packages that verify e.g. connectivity or very niche datasource behaviour. Integration tests should normally be shared across datasources and test features in a more basic way, with a smaller footprint, so that CI runs faster.
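On disk, the split described above might look roughly like this (hypothetical paths; the actual layout in the repo may differ):

```
soda-core/tests/
├── unit/         # formerly "components"; postgres only, mostly mocked datasources
├── feature/      # deep feature tests, postgres only (new category)
└── integration/  # formerly "features"; full python × datasource matrix
```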

SonarCloud is complaining about a bunch of things because old tests are considered new files. I will go over the output and address anything that is actually new, if any.

Contributor

@mivds mivds left a comment
Overall looks good, only larger concern is the identity collision raised by copilot. Holding off on approving until that one is resolved. The other stuff is minor.

Also, thank you for taking the first steps in more specific test cases! 🙌

Comment on lines +1380 to +1381
else:
if self.column_impl:
Contributor
Nit: elif


Contributor

@Niels-b Niels-b left a comment

Left some comments for clarification, and there is the issue mentioned by Copilot. Otherwise good work 👍 👏!

Thank you for taking the initial step of splitting up the tests.
Are we sure that this won't cause any issues? It might be good to also run the tests in unit on SqlServer 🤔. That also runs in a docker container on CI and provides coverage for another dialect.

return SqlExpressionStr(self.check_yaml.column_expression)
else:
if self.column_impl:
return self.column_impl.column_expression
Contributor
Is this always guaranteed to be of the correct type SqlExpressionStr | COLUMN, so not a str anymore?

Contributor Author

Yes, ColumnImpl.column_expression is guaranteed to return SqlExpressionStr | COLUMN, so CheckImpl.column_expression can guarantee that as well.



sonarqubecloud bot commented Feb 16, 2026

Quality Gate passed

Issues
74 New issues
5 Accepted issues

Measures
0 Security Hotspots
No data about Coverage
3.3% Duplication on New Code

See analysis details on SonarQube Cloud

Copilot AI left a comment

Pull request overview

Copilot reviewed 7 out of 58 changed files in this pull request and generated 2 comments.



Contributor Author

m1n0 commented Feb 16, 2026

Thanks for the reviews! I addressed all the comments and some more.

@Niels-b regarding

Are we sure that this wouldn't cause any issues? It might be good to also run the tests in unit on SqlServer 🤔 . That also runs in a docker container on CI and provides coverage for another Dialect.

I wouldn't do it for unit tests; they really don't do anything with datasources, and most of them even mock them or don't use them at all. Perhaps you meant the new feature category?

Contributor

Niels-b commented Feb 16, 2026

@m1n0, indeed I meant the feature category. Sorry for the confusion 😅.
Unit tests are indeed fine as they are 👍

Contributor

Niels-b commented Feb 16, 2026

Approved the PR already, nice work on this.

As for testing the feature category on SqlServer: we can always check/change this later. It doesn't need to block merging this, but we also shouldn't lose track of the discussion.

@m1n0 m1n0 merged commit 95896a5 into main Feb 17, 2026
41 checks passed
@m1n0 m1n0 deleted the dtl-1657-support-structured-data-and-type-casting branch February 17, 2026 08:15