One of SQL Server 2022’s new features is something called Data Virtualization. It enables T-SQL to directly query files that reside in Azure object storage or S3-compatible object storage. In my opinion, it’s one of those underrated capabilities that many have glossed over since SQL Server 2022’s release. But I strongly believe that it is insanely useful and you should take a few minutes to learn more!
Data Bloat Anyone?
Ask yourself, do you have a database that has a large volume of data that will never change again? Sales order history data is a generic but very relatable example. Once an order is placed, it might be amended, but after a few months, it’s highly probable that it’ll never change again. We’ll pretend for the sake of our discussion that after 3 months, that data is considered set in stone.
Can We PLEASE Archive/Delete That Old Data?
Who has begged their business colleagues to let them de-bloat a database by archiving off and deleting old, ancient data from a primary production database? I see your raised hands and raise mine too. There are many practical reasons to de-bloat a database, from performance to maintenance. But then there are also the business folks who say “no we can’t because we MIGHT need to report on it.” Might… So like my stubborn beer belly fat, that data bloat remains.
Meet Parquet (… not butter)
We should all know what a Comma Separated Values (CSV) flat file is. Because of its flat, singular nature, it’s not very efficient to query: the entirety of the file must be scanned, consumed, processed, etc. But what if we had a different file format whose entirety did not need to be fully processed when querying? Wouldn’t it be awesome to have a similarly simple file format that could offer column elimination, row elimination, and compression for our data and queries? Let me introduce you to Parquet.
Pretend your sales order history table has 30 different columns of data. And let’s say you were to export all of your sales order history into individual CSVs per calendar year. If you wanted to query, say, all sales in May, June, & July of 2021 and get a SUM of total sales, you’d have to consume the entirety of the 2021 CSV file. But with a Parquet file, metadata is utilized to enable you to zero in on rows for May, June, and July only. And instead of all 30 columns, you can just retrieve sales date and sales amount ONLY, and not consume the other 28 columns! That amounts to a tremendous savings from a workload perspective!
Data Virtualization
So how does SQL Server 2022 come into play here? First, we create something called an External Data Source which in this case will point to S3 object storage somewhere. Then we will combine it with creating an External File Format for our Parquet file. The third piece of the solution is Create External Table as SELECT (aka CETAS).
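Here is a minimal sketch of what the first two pieces could look like, along with the one-time configuration that I'd expect to need beforehand. The names cetas_demo and parquet_file_format_object match the ones used in this post; the credential name, the password, the endpoint, the port, and the bucket are all placeholders that I made up for illustration, so swap in your own. It also assumes the PolyBase feature is already installed on the instance.

```sql
-- One-time instance configuration (assumes PolyBase is installed).
EXEC sp_configure 'polybase enabled', 1;
RECONFIGURE;
EXEC sp_configure 'allow polybase export', 1;  -- lets CETAS write files out
RECONFIGURE;
GO

-- A database master key protects the credential's secret.
CREATE MASTER KEY ENCRYPTION BY PASSWORD = '<strong password here>';
GO

-- For S3-compatible storage, IDENTITY is the literal string 'S3 Access Key'
-- and SECRET is '<access key id>:<secret access key>'.
CREATE DATABASE SCOPED CREDENTIAL s3_archive_credential  -- hypothetical name
WITH IDENTITY = 'S3 Access Key',
     SECRET   = '<access key id>:<secret access key>';   -- placeholder
GO

-- Piece 1: the External Data Source that points at the bucket.
CREATE EXTERNAL DATA SOURCE cetas_demo
WITH (
    LOCATION   = 's3://<endpoint>:<port>/<bucket>',      -- placeholder
    CREDENTIAL = s3_archive_credential
);
GO

-- Piece 2: the External File Format for Parquet.
CREATE EXTERNAL FILE FORMAT parquet_file_format_object
WITH (FORMAT_TYPE = PARQUET);
GO

-- A separate schema keeps the external tables easy to spot.
CREATE SCHEMA parquet;
GO
```

Putting the bucket in the data source LOCATION is a design choice on my part: it keeps the LOCATION of each external table down to a short relative path.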
The end result is that each Parquet file will now become available to us within T-SQL as an External Table entity. So I can write T-SQL to query a Parquet file and my code will look identical to a query against a regular table.
CREATE EXTERNAL TABLE parquet.SalesOrderHistory_2021
WITH (
LOCATION = '/SalesOrderHistory_2021.parquet',
DATA_SOURCE = cetas_demo,
FILE_FORMAT = parquet_file_format_object
)
AS
SELECT
[SalesID],
[SalesDate],
[ProductID],
[Email],
[PhoneNumber],
[OtherColumns]
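-- Read from the original table and export only the 2021 rows
FROM dbo.SalesOrderHistory
WHERE SalesDate >= '2021-01-01' AND SalesDate < '2022-01-01';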
GO
SELECT SalesID, SalesDate
FROM parquet.SalesOrderHistory_2021
WHERE SalesDate >= '2021-05-01' AND SalesDate < '2021-08-01'
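If you just want to peek at a Parquet file without defining an External Table first, OPENROWSET can read it ad hoc. The sketch below assumes the cetas_demo data source from this post already exists and that the path matches whatever LOCATION you used when exporting; treat it as an illustration, not the only way to do it.

```sql
-- Ad hoc read of the Parquet data; only the two columns we ask for come back.
SELECT TOP (10)
       r.SalesID,
       r.SalesDate
FROM OPENROWSET(
         BULK '/SalesOrderHistory_2021.parquet',  -- same path as the export
         FORMAT = 'PARQUET',
         DATA_SOURCE = 'cetas_demo'
     ) AS r
WHERE r.SalesDate >= '2021-05-01' AND r.SalesDate < '2021-08-01';
```

I find this handy for a quick sanity check of an export, while the External Table remains the better fit for anything your applications will query regularly.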
Now you may have noticed that we have a singular table for 2021 data, or 1 table per Parquet file. You might be thinking, will that be a problem if I have many years’ worth of data, thus many Parquet files? This is where another old T-SQL trick comes into play: partitioned views. I can have multiple Parquet files virtualized via multiple External Tables, but overlay them all with a single partitioned view, which then enables me to query everything as I once did.
CREATE OR ALTER VIEW dbo.SalesOrderHistory
AS
SELECT
[SalesID], [SalesDate], [ProductID], [Email], [PhoneNumber], [OtherColumns]
FROM parquet.SalesOrderHistory_2020
UNION ALL
SELECT
[SalesID], [SalesDate], [ProductID], [Email], [PhoneNumber], [OtherColumns]
FROM parquet.SalesOrderHistory_2021
UNION ALL
SELECT
[SalesID], [SalesDate], [ProductID], [Email], [PhoneNumber], [OtherColumns]
FROM parquet.SalesOrderHistory_2022
UNION ALL
SELECT
[SalesID], [SalesDate], [ProductID], [Email], [PhoneNumber], [OtherColumns]
FROM parquet.SalesOrderHistory_2023
UNION ALL
SELECT
[SalesID], [SalesDate], [ProductID], [Email], [PhoneNumber], [OtherColumns]
FROM parquet.SalesOrderHistory_2024
UNION ALL
SELECT
[SalesID], [SalesDate], [ProductID], [Email], [PhoneNumber], [OtherColumns]
FROM dbo.SalesOrderHistory_Current
GO
SELECT SalesID, SalesDate
FROM dbo.SalesOrderHistory
WHERE SalesDate >= '2021-05-01' AND SalesDate < '2021-08-01'
Be sure to look closely at the final SELECT in the partitioned view. That’s a reference to a traditional SQL Server table! What this means is that instead of having a dbo.SalesOrderHistory table that contains everything including new orders, I’ve physically partitioned the data out into 5 Parquet files, virtualized via External Tables, AND still included all current sales order data as well. So existing T-SQL queries written against dbo.SalesOrderHistory (before the Parquet split) have no idea and do NOT need to be changed!
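For the view to take over the dbo.SalesOrderHistory name, the original table has to give it up first. Below is a rough sketch of how that cutover might go, under my own assumptions: every year through 2024 has already been exported, the batch size and cutoff date are values I picked for illustration, and you would run this in a maintenance window after testing it somewhere safe.

```sql
-- 1. Sanity check: the archived row count should match the source for that year.
SELECT
    (SELECT COUNT_BIG(*)
       FROM dbo.SalesOrderHistory
      WHERE SalesDate >= '2021-01-01' AND SalesDate < '2022-01-01') AS SourceRows,
    (SELECT COUNT_BIG(*)
       FROM parquet.SalesOrderHistory_2021)                         AS ArchivedRows;
GO

-- 2. Remove archived rows in small batches to keep the transaction log in check.
DECLARE @BatchSize int  = 5000;          -- assumption: tune for your system
DECLARE @Cutoff    date = '2025-01-01';  -- assumption: older rows live in Parquet

WHILE 1 = 1
BEGIN
    DELETE TOP (@BatchSize)
    FROM dbo.SalesOrderHistory
    WHERE SalesDate < @Cutoff;

    IF @@ROWCOUNT = 0 BREAK;             -- nothing left to delete
END;
GO

-- 3. Free up the original name so the view can take it over.
EXEC sp_rename 'dbo.SalesOrderHistory', 'SalesOrderHistory_Current';
GO

-- 4. Now run the CREATE OR ALTER VIEW dbo.SalesOrderHistory statement.
```

Repeat the count check in step 1 for each exported year before deleting anything; once the rows are gone from the table, the Parquet files are your only copy.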
You Mentioned Azure & S3… That Means I Need to Cloud, Right?
Whenever people (myself included, for the longest time) hear “S3” and/or “object storage,” they typically mentally default to AWS S3 Object Storage. But let’s break that apart. In brief, “object storage” is essentially a storage architecture geared towards unstructured data like images, videos, and documents. “S3” is Amazon’s “Simple Storage Service,” which I’d loosely describe as Amazon’s implementation approach for object storage. The good news here is that you’re not just stuck with Amazon. S3-compatible object storage is also a thing. This is where other vendors have their own object storage implementations but leverage the Amazon S3 API.
But my data is on-prem and staying that way. This is where another common misconception comes in. Because “Amazon S3 object storage” has become such a common term, many do not realize that S3-compatible on-prem storage is a thing! And yes, this is where Pure Storage comes into the picture with our FlashBlade series. You can have all of this awesomeness on-prem without having to consider compromising with the cloud.
I Want To See This in Action!
Hopefully you’re chomping at the bit to give this a go. I’d like to leave you with three resources to help you get jump-started:
- Watch… A New Solution for Data Archiving using Data Virtualization
This is my Pure Storage vendor session from PASS Summit 2023, that covers everything end to end with demos!
- Learn… Microsoft’s Introduction to SQL Server 2022 Data Virtualization learning path
Go learn straight from Microsoft!
- Try… Pure Storage’s Modern Storage Platforms for SQL Server workshop
Regardless of whether you’re a Pure Storage customer or not, you can review Module 4 of this workshop.
And if you are a Pure Storage customer WITH a FlashBlade, use this workshop to try it out for yourself!
And if you are a Pure Storage customer but do not have a FlashBlade, you can contact your Account Team and request a Test Drive voucher for this workshop. This will give you access to a virtual lab with hardware, to work through the lab workshop.
Thanks for reading!






