It’s the first T-SQL Tuesday for 2025 and I feel like it’s a good time to get back to participating, since it’s been a while for me. My friend Rob Farley (blog | Bluesky) has a great topic for us this month. He invites us to talk about integrity, applying it to whatever topic we see fit.
I want to talk about it in terms of the data itself. I know that doesn’t narrow things down at all.
The first way to think of data integrity is a small, literal interpretation: making sure the data in the database is good. In many ways, this is easy to enforce; you add constraints. Primary keys ensure that you know what makes each row unique. Unique constraints capture what would make each record unique if the primary key, which these days is often a surrogate key, didn’t exist or offered different options. Together these help enforce the database design: are tables properly normalized or in a star schema? Then we can start to enforce the integrity of the relationships between tables using foreign keys. NOT NULL makes sure the data that is supposed to exist actually exists, although I suppose there could be some debate about calling NULLability a constraint. We can enforce the values of data with CHECK constraints. And let’s not forget the importance of choosing the correct data type to help make sure the right data is stored in the table, because no one should store calendar dates in a character string field and expect no issues.
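To make that concrete, here is a minimal sketch of what all of those constraints look like together. The table and column names are hypothetical, purely for illustration:

```sql
-- Hypothetical tables, just to illustrate the constraints discussed above
CREATE TABLE dbo.Student
(
    StudentId     int IDENTITY(1,1) NOT NULL,  -- surrogate key
    StudentNumber varchar(20)       NOT NULL,  -- natural key, still enforced as unique
    BirthDate     date              NOT NULL,  -- a real date type, not a string
    CONSTRAINT PK_Student PRIMARY KEY (StudentId),
    CONSTRAINT UQ_Student_StudentNumber UNIQUE (StudentNumber),
    CONSTRAINT CK_Student_BirthDate CHECK (BirthDate > '1900-01-01')
);

CREATE TABLE dbo.Enrollment
(
    EnrollmentId int IDENTITY(1,1) NOT NULL,
    StudentId    int               NOT NULL,
    CourseCode   varchar(10)       NOT NULL,
    CONSTRAINT PK_Enrollment PRIMARY KEY (EnrollmentId),
    CONSTRAINT FK_Enrollment_Student FOREIGN KEY (StudentId)
        REFERENCES dbo.Student (StudentId)  -- relationship integrity between tables
);
```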
To me, these are the simplest and most fundamental ways we can make sure our data is good and has integrity. By integrity, I mean we can trust the data. The added benefit is that most of these also help SQL Server come up with better execution plans, which helps performance.
(As a side note, I actually have a presentation on this very topic.)
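As a quick illustration of the performance point: when a CHECK constraint is trusted, the optimizer can use it to detect contradictions and skip reading the table entirely. A hypothetical example:

```sql
-- Hypothetical table with a trusted CHECK constraint
CREATE TABLE dbo.Payment
(
    PaymentId int IDENTITY(1,1) NOT NULL PRIMARY KEY,
    Amount    decimal(10,2)     NOT NULL,
    CONSTRAINT CK_Payment_Amount CHECK (Amount >= 0)
);

-- Because the constraint guarantees Amount is never negative, the optimizer
-- can compile this to a plan that never touches the table (a constant scan)
SELECT PaymentId, Amount
FROM dbo.Payment
WHERE Amount < 0;
```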
The default constraint may be the one exception. But I think it plays another important role, one that leads into the other type of data integrity. The default helps define a column by saying what the data should be when the row is first created. It indicates a lifecycle for the data, or more particularly for that column. Sometimes that data may never change, for example the date and time a record is first created. But sometimes it indicates the starting point for data such as a status field, which will change through its lifecycle from New to Open to In Progress to Resolved. So while a default may not be about the integrity of the record directly, it does speak to the greater integrity of the data by helping define the column and, by extension, the table.
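Here’s a small, hypothetical sketch of that lifecycle idea expressed as defaults:

```sql
-- Hypothetical ticket table: defaults describe the start of the lifecycle
CREATE TABLE dbo.Ticket
(
    TicketId  int IDENTITY(1,1) NOT NULL PRIMARY KEY,
    CreatedAt datetime2(3) NOT NULL
        CONSTRAINT DF_Ticket_CreatedAt DEFAULT SYSUTCDATETIME(),  -- set once, then never changes
    Status    varchar(12)  NOT NULL
        CONSTRAINT DF_Ticket_Status DEFAULT 'New',                -- the starting point
    CONSTRAINT CK_Ticket_Status
        CHECK (Status IN ('New', 'Open', 'In Progress', 'Resolved'))  -- the allowed lifecycle values
);
```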
This leads into the bigger concept of data integrity, which is making sure you have good definitions for your data and that you understand its lifecycle, or how it moves and changes through the application. Our data is constantly being moved from one system to another. If you don’t believe me, look at the job description of a data engineer. And that makes it even harder to ignore the fact that the source and target destinations of our data aren’t necessarily a database, or even the same RDBMS if the data is moving between different databases. Throw in the modern software architecture which says everything should be a microservice and every microservice should have its own database. Services that are related and use the same base data, such as people who could be classified as students, teachers, staff, parents, or a combination of these, may end up stored in separate databases, or have their data split among those databases with the rest stored in yet another database for the general concept of a person.

It raises the question: is the data in this case really meant to be split into multiple databases (with no cross-database joins), or could schemas in the same database solve the problem, since we could still create the basic references to help enforce that integrity without making a lot of unnecessary calls to get the data we need? That would also help with keeping data consistent across multiple systems and prevent the proverbial Telephone Game, where the message at the end isn’t the same as the one at the beginning. We still have to look at the data as a whole when we figure out how to split or move it between systems.
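For example, a schema per service inside one database still lets a plain foreign key enforce the relationship back to the shared concept of a person. This is a sketch with made-up schema and table names:

```sql
-- One schema per service, one shared reference for the base data
CREATE SCHEMA People;
GO
CREATE SCHEMA Students;
GO

CREATE TABLE People.Person
(
    PersonId int IDENTITY(1,1) NOT NULL,
    FullName nvarchar(200)     NOT NULL,
    CONSTRAINT PK_Person PRIMARY KEY (PersonId)
);

CREATE TABLE Students.Student
(
    PersonId   int     NOT NULL,  -- same person; student-specific attributes live here
    GradeLevel tinyint NOT NULL,
    CONSTRAINT PK_Student PRIMARY KEY (PersonId),
    CONSTRAINT FK_Student_Person FOREIGN KEY (PersonId)
        REFERENCES People.Person (PersonId)  -- cross-schema FK still enforces integrity
);
```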
One of the ways we can fight this is by clearly defining and documenting our databases. In a microservices structure, teams must stay on top of this by creating data contracts for their apps and making sure they stay up to date. This may also mean working across teams throughout the entire organization to ensure that everyone has the same idea of what defines a staff member versus a teacher. For data engineers, it means making sure there is clear lineage so we can see where the data goes, and that calculations and transforms are understood by everyone who needs to consume the data.
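One lightweight way to keep those definitions close to the schema itself is extended properties. The description text here is mine, referencing the hypothetical table from earlier:

```sql
-- Attach a definition directly to the column so it travels with the database
EXEC sys.sp_addextendedproperty
    @name       = N'MS_Description',
    @value      = N'Lifecycle status: New -> Open -> In Progress -> Resolved.',
    @level0type = N'SCHEMA', @level0name = N'dbo',
    @level1type = N'TABLE',  @level1name = N'Ticket',
    @level2type = N'COLUMN', @level2name = N'Status';
```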
It all comes down to this: while it may appear we work in silos, from an application team to the different analytics databases for reports, it’s actually the exact opposite. We almost need to overcommunicate what we do to make sure we can keep our data intact. Documentation becomes more important to the work we do. And when our data does land in the database, the very least we can do is add our constraints. Our normal forms, star schemas, and execution plans will thank us for it. #TeamConstraints
Looking forward to seeing how integrity gets interpreted in other ways across the data platform. Thanks again to Rob for a great topic!