Combat Database Bloat with Data Virtualization

One of SQL Server 2022’s new features is something called Data Virtualization. It enables T-SQL to directly query files that reside in Azure object storage or S3-compatible object storage. Since SQL Server 2022’s release, it’s been one of those underrated capabilities that many have glossed over. But I strongly believe it is insanely useful and you should take a few minutes to learn more!

Data Bloat Anyone?

Ask yourself: do you have a database with a large volume of data that will never change again? Sales order history data is a generic but very relatable example. Once an order is placed, it might be amended, but after a few months it’s highly probable that it’ll never change again. We’ll pretend for the sake of our discussion that after 3 months, that data is considered set in stone.

Can We PLEASE Archive/Delete That Old Data?

Who has begged their business colleagues to let them de-bloat a database by archiving off and deleting old, ancient data from a primary production database? I see your raised hands and raise mine too. There are many practical reasons to de-bloat a database, from performance to maintenance. But then there are also the business folks who say “no we can’t, because we MIGHT need to report on it.” Might… So like my stubborn beer belly fat, that data bloat remains.

Meet Parquet (… not butter)

We should all know what a Comma Separated Values (CSV) flat file is. Because of its flat, singular nature, it’s not very efficient to query, because the entirety of the file must be scanned, consumed, processed, etc. But what if we had a different file format whose entirety did not need to be fully processed when querying? Wouldn’t it be awesome to have a flat file format that could offer column elimination, row elimination, and compression for our data and queries? Let me introduce you to Parquet.

Pretend your sales order history table has 30 different columns of data. And let’s say you were to export all of your sales order history into individual CSVs per calendar year. If you wanted to query, say, all sales in May, June, and July of 2021 and get a SUM of total sales, you’d have to consume the entirety of the 2021 CSV file. But with a Parquet file, metadata is utilized to enable you to zero in on rows for May, June, and July only. And instead of all 30 columns, you can retrieve sales date and sales amount ONLY, and not consume the other 28 columns! That amounts to a tremendous savings from a workload perspective!

Data Virtualization

So how does SQL Server 2022 come into play here? First, we create something called an External Data Source, which in this case will point to S3 object storage somewhere. Then we combine it with an External File Format for our Parquet files. The third piece of the solution is CREATE EXTERNAL TABLE AS SELECT (aka CETAS).
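Here’s a minimal sketch of what those first two pieces might look like against S3-compatible storage. The endpoint, bucket, secrets, and credential name below are placeholders of my own; the data source and file format names match the CETAS example that follows. Note that the PolyBase feature must be installed and enabled on the instance, and CETAS additionally requires the export option.

-- One-time instance configuration (requires the PolyBase feature to be installed)
EXEC sp_configure 'polybase enabled', 1;
RECONFIGURE;
EXEC sp_configure 'allow polybase export', 1;
RECONFIGURE;
GO

-- Database master key + a credential holding the S3 access key and secret (placeholders)
CREATE MASTER KEY ENCRYPTION BY PASSWORD = '<StrongPasswordHere>';
GO
CREATE DATABASE SCOPED CREDENTIAL s3_credential
WITH IDENTITY = 'S3 Access Key',
	SECRET = '<AccessKeyID>:<SecretAccessKey>';
GO

-- External Data Source pointing at an S3-compatible endpoint and bucket (placeholder URL)
CREATE EXTERNAL DATA SOURCE cetas_demo
WITH (
	LOCATION = 's3://s3.contoso.lab:9000/sqlarchive',
	CREDENTIAL = s3_credential
);
GO

-- External File Format for Parquet
CREATE EXTERNAL FILE FORMAT parquet_file_format_object
WITH (FORMAT_TYPE = PARQUET);
GO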

The end result is that each Parquet file will now become available to us within T-SQL as an External Table entity. So I can write T-SQL to query a Parquet file, and my code will look identical to querying any regular table.

-- Export the 2021 slice of the original table out to a Parquet file,
-- and expose it back to T-SQL as an External Table
CREATE EXTERNAL TABLE parquet.SalesOrderHistory_2021
WITH (
	LOCATION = '/SalesOrderHistory_2021.parquet', 
	DATA_SOURCE = cetas_demo, 
	FILE_FORMAT = parquet_file_format_object
)
AS 
SELECT 
	[SalesID], 
	[SalesDate],
	[ProductID],
	[Email],
	[PhoneNumber],
	[OtherColumns]
FROM dbo.SalesOrderHistory
WHERE SalesDate >= '2021-01-01' AND SalesDate < '2022-01-01';
GO

SELECT SalesID, SalesDate
FROM parquet.SalesOrderHistory_2021
WHERE SalesDate >= '2021-05-01' AND SalesDate < '2021-08-01'

Now you may have noticed that we have a singular table for 2021 data, or 1 table per Parquet file. You might be thinking, will that be a problem if I have many years’ worth of data, and thus many Parquet files? This is where another old T-SQL trick comes into play – partitioned views. I can have multiple Parquet files virtualized via multiple External Tables, but overlay them all with a single partitioned view, to then enable me to query everything as I once did.

CREATE OR ALTER VIEW dbo.SalesOrderHistory
AS
SELECT 
	[SalesID], [SalesDate], [ProductID], [Email], [PhoneNumber], [OtherColumns]
FROM parquet.SalesOrderHistory_2020
UNION ALL
SELECT 
	[SalesID], [SalesDate], [ProductID], [Email], [PhoneNumber], [OtherColumns]
FROM parquet.SalesOrderHistory_2021
UNION ALL
SELECT 
	[SalesID], [SalesDate], [ProductID], [Email], [PhoneNumber], [OtherColumns]
FROM parquet.SalesOrderHistory_2022
UNION ALL
SELECT 
	[SalesID], [SalesDate], [ProductID], [Email], [PhoneNumber], [OtherColumns]
FROM parquet.SalesOrderHistory_2023
UNION ALL
SELECT 
	[SalesID], [SalesDate], [ProductID], [Email], [PhoneNumber], [OtherColumns]
FROM parquet.SalesOrderHistory_2024
UNION ALL
SELECT 
	[SalesID], [SalesDate], [ProductID], [Email], [PhoneNumber], [OtherColumns]
FROM dbo.SalesOrderHistory_Current
GO

SELECT SalesID, SalesDate
FROM dbo.SalesOrderHistory
WHERE SalesDate >= '2021-05-01' AND SalesDate < '2021-08-01'

Be sure to look closely at the final SELECT in the partitioned view. That’s a reference to a traditional SQL Server table! What this means is that instead of having a dbo.SalesOrderHistory table that contains everything, including new orders, I’ve physically partitioned the data out into 5 Parquet files, virtualized via External Tables, AND still included all current sales order data as well. So existing T-SQL queries against dbo.SalesOrderHistory (written before the Parquet split) have no idea and do NOT need to be changed!
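For completeness, here’s a hedged sketch of how the de-bloat step itself might go once the Parquet copies exist: rename the original table, create the view under its old name, verify, then delete. The batch size and verification query are my own illustrative choices, not from the original post.

-- 1. Rename the original table so the partitioned view can take over its old name
EXEC sp_rename 'dbo.SalesOrderHistory', 'SalesOrderHistory_Current';
GO

-- 2. Create the partitioned view dbo.SalesOrderHistory (as shown above)

-- 3. Verify a given year's Parquet copy before deleting anything
SELECT COUNT(*) FROM parquet.SalesOrderHistory_2021;
SELECT COUNT(*) FROM dbo.SalesOrderHistory_Current
WHERE SalesDate >= '2021-01-01' AND SalesDate < '2022-01-01';
GO

-- 4. De-bloat: delete the archived year from the primary table, in batches to be kind to the log
DELETE TOP (100000)
FROM dbo.SalesOrderHistory_Current
WHERE SalesDate >= '2021-01-01' AND SalesDate < '2022-01-01';
-- repeat (or loop) until 0 rows are affected
GO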

You Mentioned Azure & S3… That Means I Need to Cloud, Right?

Whenever people (myself included, for the longest time) hear “S3” and/or “object storage,” they typically mentally default to AWS S3 Object Storage. But let’s break that apart. In brief, “object storage” is essentially a storage architecture geared towards unstructured data like images, videos, and documents. “S3” is Amazon’s “Simple Storage Service,” which I’d loosely describe as Amazon’s implementation approach for object storage. The good news here is that you’re not stuck with just Amazon. S3-compatible object storage is also a thing. This is where other vendors have their own object storage implementations but leverage the Amazon S3 API.

“But my data is on-prem and staying that way.” This is where another common misconception comes in. Because “Amazon S3 object storage” has become such a common term, many do not realize that S3-compatible on-prem storage is a thing! And yes, this is where Pure Storage comes into the picture with our FlashBlade series. You can have all of this awesomeness on-prem, without having to compromise and go to the cloud.

I Want To See This in Action!

Hopefully you’re chomping at the bit to give this a go. I’d like to leave you with three resources to help you get jump-started:

  1. Watch… A New Solution for Data Archiving using Data Virtualization
    This is my Pure Storage vendor session from PASS Summit 2023, that covers everything end to end with demos!
  2. Learn… Microsoft’s Introduction to SQL Server 2022 Data Virtualization learning path
    Go learn straight from Microsoft!
  3. Try… Pure Storage’s Modern Storage Platforms for SQL Server workshop
    Regardless of whether you’re a Pure Storage customer or not, you can review Module 4 of this workshop.
    And if you are a Pure Storage customer WITH a FlashBlade, use this workshop to try it out for yourself!
    And if you are a Pure Storage customer but do not have a FlashBlade, you can contact your Account Team and request a Test Drive voucher for this workshop. This will give you access to a virtual lab with hardware, to work through the lab workshop.

Thanks for reading!


T-SQL Tuesday #159: SQL Server 2022 & Purvi’s List

Welcome to another edition of T-SQL Tuesday. This month’s blog party is hosted by Deepthi Goguri (b|t) who asks participants to blog about their new favorite feature in SQL Server 2022.

In my opinion, 2022 is absolutely a major release with significant enhancements which should make it compelling to upgrade rather than wait for another release down the line. I’m thrilled for the improvements in Intelligent Query Processing, TempDB, and (after the training helped me ‘get it’) Arc-enabled SQL Servers. But that’s not what I want to blog about today.

It’s Often the Little Things

By sheer coincidence, I had the privilege of being invited to a private SQL Server 2022 workshop taught by Bob Ward last week. And through my job, I also had the privilege of doing some testing work around QAT backups and S3 Data Virtualization during the private preview phase last summer. So while I had exposure and access to SQL Server 2022 for much longer than others, there were many things that Microsoft loaded into the 2022 release that I had barely skimmed over or didn’t even know were there.

Towards the end of the workshop, Bob presented a slide called Purvi’s List. Purvi Shah is an engineer on the SQL performance team and as Bob said, “spends her time finding ways to make SQL Server and Azure SQL faster.” When Bob put up Purvi’s List, I let out an audible “holy shit,” much to Grant’s amusement.

So what caught me by surprise?

Instant File Initialization (IFI) for Transaction Logs

Okay, that’s cool!

For simplicity’s sake, I’ll just quote the documentation (as written today):

Transaction log files cannot be initialized instantaneously, however, starting with SQL Server 2022 (16.x), instant file initialization can benefit transaction log autogrowth events up to 64 MB. The default auto growth size increment for new databases is 64 MB. Transaction log file autogrowth events larger than 64 MB cannot benefit from instant file initialization.

https://learn.microsoft.com/en-us/sql/relational-databases/databases/database-instant-file-initialization?view=sql-server-ver16

So yeah, it is limited to 64MB growth size. But another entry on Purvi’s List is that the VLF algorithm has also been improved.

If growth is less than 64 MB, create 4 VLFs that cover the growth size (for example, for 1 MB growth, create 4 VLFs of size 256 KB). … and starting with SQL Server 2022 (16.x) (all editions), this is slightly different. If the growth is less than or equal to 64 MB, the Database Engine creates only 1 VLF to cover the growth size.

https://learn.microsoft.com/en-us/sql/relational-databases/sql-server-transaction-log-architecture-and-management-guide?view=sql-server-ver16

So do I think everyone should now change their Transaction Log autogrow sizes to 64MB? Of course not. But do I think that this kind of small but interesting improvement is still notable, and will hopefully be expanded to a larger scale in a future release? Absolutely!
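If you want to poke at this yourself, here’s a small sketch (the database and log file names are hypothetical): check whether IFI is enabled for the Database Engine service, set the log autogrowth to 64MB, and then keep an eye on the VLF count as the log grows.

-- Is instant file initialization enabled for the engine service account?
SELECT servicename, instant_file_initialization_enabled
FROM sys.dm_server_services;
GO

-- Set a 64MB autogrowth increment on the transaction log (hypothetical names)
ALTER DATABASE [MyDatabase]
MODIFY FILE (NAME = N'MyDatabase_log', FILEGROWTH = 64MB);
GO

-- Watch the VLF count as the log grows
SELECT COUNT(*) AS vlf_count
FROM sys.dm_db_log_info(DB_ID(N'MyDatabase'));
GO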

And there are a few other things on Purvi’s List too: reduced buffer pool I/O promotions and enhancements to the spinlock algorithms. All amazing work!

Thanks for reading.

SQL Server 2022 – QAT Backups

One of the amazing aspects of my job at Pure Storage is that I get opportunities to work with new and emerging tech, oftentimes before it is available to the general public. Such is the case with SQL Server 2022, where I got to help test QAT backups.

TL;DR

Using QAT compression for your native SQL Server backups will give you better compression with less CPU overhead than “legacy” compression. So I get smaller backup files, faster, with less CPU burn. What’s not to like?

Tell Me More About This Q… A… T… Thing!

So QAT stands for Intel’s QuickAssist Technology, which is a hardware accelerator for compression. It’s actually been around for many years, but most regular folks like myself never got exposed to it, because you need a QAT expansion card in your server to even have access to its powers. And for us SQL Server folks, we had nothing that took advantage of QuickAssist Technology… until now, thanks to SQL Server 2022.

In SQL Server 2022, Microsoft has introduced QAT support for Backup Compression. And as I demonstrated in this blog post, your backup files are essentially byte-for-byte copies of your data files (when not using compression or encryption). And I don’t know about you and your databases, but in the SQL Server environments I see these days, database sizes continue to grow and grow and grow… so I hope you use compression to save backup time and space!

But I Don’t Have QAT Cards In My SQL Servers

I said earlier that QAT has been around for a number of years, available as expansion cards. But because SQL Server had no hooks to use QAT, I strongly doubt that any of us splurged for QAT cards to be added into our SQL Servers. But there are a few things that’ll change all of that…

First, SQL Server 2022 has both QAT hardware support AND QAT software emulation. This means you can leverage QAT goodness WITHOUT a QAT expansion card.

Second, the next generation of Intel server processors will have QAT hardware support built in! So the next time you do a hardware refresh, and you buy the next gen of Intel server CPUs, you’ll have QAT support!

Third, if you cannot get the latest snazzy CPUs in your next hardware refresh, QAT cards are CHEAP. Like, less than $1k, just put it on a corporate charge card cheap.

IMPORTANT – QAT hardware support is an Enterprise Edition feature. But you can use QAT software mode with Standard Edition. And if you stay tuned, you’ll come to find that I’ve become a big fan of QAT software mode.
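Here’s a rough sketch of what enabling this looks like on an instance (the database name and backup path are hypothetical; check the documentation for your edition and build): turn on the hardware offload configuration, point it at QAT, restart the instance, then request QAT_DEFLATE on your backups. If no QAT hardware is present, software mode is used.

-- Enable integrated acceleration & offloading (an instance restart is required afterwards)
EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;
EXEC sp_configure 'hardware offload enabled', 1;
RECONFIGURE;
GO

ALTER SERVER CONFIGURATION
SET HARDWARE_OFFLOAD = ON (ACCELERATOR = QAT);
GO

-- After the restart, request QAT compression on a backup
BACKUP DATABASE [MyDatabase]
TO DISK = N'D:\Backups\MyDatabase.bak'
WITH COMPRESSION (ALGORITHM = QAT_DEFLATE), STATS = 10;
GO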

How’d You Test This Andy?

In my team’s lab, we have some older hardware lying around that I was able to leverage to test this out. Microsoft sent us an Intel 8970 QAT card, which we installed into one of our bare metal SQL Servers, an older Dell R720 with 2x Xeon E5-2697 CPUs and 512GB of RAM.

The database being backed up is 3.4TB, with the data spread across 9 data files on 9 data volumes. The data volumes were hosted on a FlashArray and the backup target was a FlashBlade.

To test, I used the above database and executed a bunch of BACKUP commands with different combinations of parameters. I leveraged Nic Cain’s BACKUP Test Harness to generate my T-SQL backup code. If you haven’t used it before, it’ll generate a bunch of permutations of BACKUP commands for you, mixing and matching different parameters and variables. I was particularly pleased that it also included baseline commands like a plain old BACKUP and a DISK = NUL variant. I did have to make some modifications to the test harness to add in COMPRESSION options: NO_COMPRESSION, MS_XPRESS (i.e. legacy COMPRESSION), and QAT_DEFLATE.

Sample BACKUP command used
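The command itself was shown as a screenshot in the original post; below is a hedged reconstruction of what one permutation looked like, with hypothetical FlashBlade VIP addresses, share paths, and database name standing in for the real targets, and the compression algorithm swapped per test.

BACKUP DATABASE [BigTestDB]	-- hypothetical name for the 3.4TB test database
TO	DISK = N'\\192.168.1.11\backups\BigTestDB_1.bak',
	DISK = N'\\192.168.1.12\backups\BigTestDB_2.bak',
	DISK = N'\\192.168.1.13\backups\BigTestDB_3.bak',
	DISK = N'\\192.168.1.14\backups\BigTestDB_4.bak',
	DISK = N'\\192.168.1.15\backups\BigTestDB_5.bak',
	DISK = N'\\192.168.1.16\backups\BigTestDB_6.bak',
	DISK = N'\\192.168.1.17\backups\BigTestDB_7.bak',
	DISK = N'\\192.168.1.18\backups\BigTestDB_8.bak'
WITH	COMPRESSION (ALGORITHM = QAT_DEFLATE),	-- or MS_XPRESS; the NO_COMPRESSION runs used WITH NO_COMPRESSION instead
	MAXTRANSFERSIZE = 4194304,	-- 4MB
	BUFFERCOUNT = 100,
	FORMAT, INIT, STATS = 10;
GO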

Tangent: Backup READER & WRITER Threads

So I’ve always known that if you specify more output backup files, that’ll decrease your backup time tremendously. But I never quite understood why until I started this exercise and Anthony Nocentino taught me a bit about BACKUP internals.

In a backup operation, there are reader threads that consume and process your data, and there are writer threads that push your data out to your backup target files. If you run a bare-bones basic BACKUP command, you get one READER thread and one WRITER thread to do your work. If you add additional DISK = ‘foobar.bak’ parameters, that’ll give you more WRITER threads; 1 per DISK target specified. If you want more READER threads, your database has to be split across multiple data VOLUMES (not files or filegroups).
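To make the thread math concrete, here’s a hedged illustration (names and paths hypothetical, and assuming the database’s data files sit on a single volume): the first command gets one reader and one writer; the second still gets one reader, but four writers thanks to the four DISK targets.

-- One READER thread, one WRITER thread
BACKUP DATABASE [MyDatabase]
TO DISK = N'\\backuptarget\share\MyDatabase.bak';
GO

-- Still one READER thread (single data volume), but four WRITER threads (one per DISK target)
BACKUP DATABASE [MyDatabase]
TO	DISK = N'\\backuptarget\share\MyDatabase_1.bak',
	DISK = N'\\backuptarget\share\MyDatabase_2.bak',
	DISK = N'\\backuptarget\share\MyDatabase_3.bak',
	DISK = N'\\backuptarget\share\MyDatabase_4.bak';
GO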

If you were paying attention above, you’ll note that my test database consists of 9 data files across 9 data volumes. I set it up this way because I wanted more READER threads available to me, to help drive the BACKUP harder and faster.

Keep in mind, there’s always a trade-off in SQL Server. In this case, the more threads you’re running, the more CPU you’ll burn. And if you’re doing a bunch of database backups in parallel, or trying to run your backups at the same time as something else CPU heavy (other maintenance tasks, nightly processing, etc.) you may crush your CPU.

Tangent: FlashBlade as a Backup Target

FlashBlade is a scale-out storage array whose super-power is parallel READ and WRITE of your data. Each chassis has multiple blades, and you can stripe your backup files across the different blades for amazing throughput. When you look at the sample BACKUP command, you’ll see different destination IP addresses. These multiple Virtual IPs all go to the same appliance, but they help stripe the backup data across multiple blades in the FlashBlade.

Test Results

Legend

Compression Type – Definition
NO_COMPRESSION – No BACKUP compression used at all.
MS_XPRESS – “Legacy” BACKUP compression.
QAT_DEFLATE (Software) – QAT BACKUP compression, software emulation mode.
QAT_DEFLATE – QAT BACKUP compression, hardware offloading.

Baseline: DISK = NUL

BACKUP summary results – DISK = NUL
All test results are the average of 3 executions per variable permutation.

Remember, when using DISK = NUL, we’re NOT writing any output – all of the backup file data is essentially thrown away. This is used to test our “best case” scenario, from a READ and BACKUP processing perspective.
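For reference, a DISK = NUL baseline run looks something like this (database name hypothetical); the backup stream is generated and compressed, but simply discarded rather than written anywhere.

BACKUP DATABASE [BigTestDB]
TO DISK = N'NUL'
WITH	COMPRESSION (ALGORITHM = QAT_DEFLATE),
	MAXTRANSFERSIZE = 4194304,
	BUFFERCOUNT = 100,
	STATS = 10;
GO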

It’s interesting to see that even without WRITE activity, QAT acceleration did help speed up our BACKUP execution vs legacy compression. And QAT does offer slightly better backup file compression vs legacy compression. But what I find most impactful is that CPU utilization, in both QAT hardware and software modes, is MUCH lower than with legacy compression!

Note the Backup Throughput column. We actually hit a bit of a bottleneck here on the READ side, due to an older Fibre Channel card in my test server and only having 8x PCIe lanes to read data from my FlashArray. The lab hardware I have access to isn’t cutting-edge tech for performance testing; it’s older hardware meant more for functionality testing. Moral of this story? Sometimes your I/O subsystem “issues” are caused by the network OR the underlying server infrastructure, like the PCIe lane and bandwidth limitations encountered here.

The Best: Files = 8; MTS = 4MB, BufferCount = 100

BACKUP summary results – Files = 8, MTS = 4MB, BufferCount = 100
All test results are the average of 3 executions per variable permutation.

I’m skipping over all of my various permutations to show the best results, which used 8 backup files, MAXTRANSFERSIZE = 4MB, and BUFFERCOUNT = 100.

Much like the DISK = NUL results, QAT yields superior compressed backup file size and CPU utilization. And in this case, Elapsed Time is now inverted – NO_COMPRESSION took the most time, whereas in the DISK = NUL results, NO_COMPRESSION took the least time. Why might that be? Well, in the DISK = NUL scenarios, we don’t have to send data over the wire to a backup target, whereas in this case we did. And using compression of any sort means we send and write less data to our backup target.

Stuck with TDE?

I also TDE-encrypted my test database, then re-ran more tests. It was interesting to see that the TDE database wound up taking more time across the board. I also found it interesting that with TDE + legacy compression, CPU usage was slightly lower but throughput was worse, vs non-TDE + legacy compression.

Parting Thoughts

Of course, the above is just a relatively small set of tests, against a single database. Yet, based on these results and other testing I’ve seen by Glenn Berry, I will admit that I’m VERY excited about SQL Server 2022 bringing QAT to the table to help improve BACKUP performance.

Even if you are stuck with older CPUs and do not have a QAT hardware card to offload to, QAT software mode beats legacy compression across the board.

I do need to test RESTORE next, because your BACKUPs are worthless if they cannot be restored successfully. But alas, that’s for another time and another blog post!

Thanks for reading!