SQL Server 2025 – New Backup Magic

SQL Server 2025 Public Preview is not even a week old, but I’m already impressed with another new capability that was released – a new backup compression algorithm: ZSTD. This one came as a surprise; despite my being part of the Private Preview, it was only released with the Public Preview.

TL;DR – ZSTD Can Be Fast!

Today, I’m only going to do a quick blog post to share some initial test results.

This is against a new database that I’m running on a VMware Workstation VM, on my laptop, on a Samsung 990 Pro. I’m backing up my new RecipesDemoDB, which is 25.13 GB in size.

DISK = NUL (x4 files)    | Speed
COMPRESSION              | 3057442 pages in 49.966 seconds (478.050 MB/sec)
ZSTD – Level = Low       | 3057442 pages in 44.711 seconds (534.236 MB/sec)
ZSTD – Level = Medium    | 3057442 pages in 33.428 seconds (714.558 MB/sec)
ZSTD – Level = High      | 3057442 pages in 73.147 seconds (326.551 MB/sec)

DISK = 4 Backup Files    | Speed                                            | Total Backup Size | % of original size
COMPRESSION              | 3057434 pages in 80.761 seconds (295.764 MB/sec) | 16.52 GB          | 65.74%
ZSTD – Level = Low       | 3057434 pages in 39.920 seconds (598.351 MB/sec) | 17.37 GB          | 69.13%
ZSTD – Level = Medium    | 3057434 pages in 56.676 seconds (421.451 MB/sec) | 15.93 GB          | 63.38%
ZSTD – Level = High      | 3057434 pages in 94.440 seconds (252.924 MB/sec) | 15.86 GB          | 63.10%

Observations and Thoughts

First, note that there are now three sub-options for the compression level: Low (the default), Medium, and High. It’s interesting to see that in this initial (SINGLE) test, ZSTD-Low was the fastest backup to disk, but the final output was slightly larger than MS_XPRESS (the legacy compression algorithm).
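For reference, here’s a hedged sketch of the kind of backup command I’m testing. The file paths are placeholders, and while the ALGORITHM and LEVEL sub-options below reflect my understanding of the preview syntax (it extends SQL Server 2022’s COMPRESSION (ALGORITHM = ...) form), double-check the SQL Server 2025 BACKUP documentation before relying on it.

-- Hedged sketch: full backup of RecipesDemoDB striped across 4 files, using ZSTD.
-- Paths are placeholders; verify the ZSTD/LEVEL keywords against the preview docs.
BACKUP DATABASE RecipesDemoDB
TO
     DISK = 'X:\Backups\RecipesDemoDB-1.bak',
     DISK = 'X:\Backups\RecipesDemoDB-2.bak',
     DISK = 'X:\Backups\RecipesDemoDB-3.bak',
     DISK = 'X:\Backups\RecipesDemoDB-4.bak'
WITH COMPRESSION (ALGORITHM = ZSTD, LEVEL = MEDIUM),
     CHECKSUM, INIT, FORMAT, STATS = 5;
GO

Swap the DISK targets for DISK = 'NUL' (x4) if you want to isolate read/compress throughput the way the first table above does.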

And a note about data composition… ~3-4GB consists of mostly text data (source) and the remaining ~20-21GB consists of vector embeddings + corresponding text chunks. Because vector embeddings are essentially high-entropy arrays of floats, they don’t compress well, which impacts the overall compression ratios. I’ll be releasing this new database very soon with additional supporting documentation.

I’ll be writing much more on this topic later, but wanted to share these initial findings. I find them extremely compelling and am looking forward to testing this with larger databases on higher end hardware.

Thanks for reading!

Practical AI in SQL Server 2025: Ollama Quick Start

With the announcement of SQL Server 2025 Public Preview, hopefully you are interested in test driving Vector Search.

Microsoft has already posted demo code, but it’s only for OpenAI on Azure, and many of us are wondering about running things locally. So I thought I’d share a step-by-step guide to getting Ollama set up and running locally on my laptop. End-to-end, these instructions should take less than 30 minutes to complete.

If you’re on the Docker bandwagon, jump over to Anthony Nocentino’s fast start with Docker Compose instead!

My Setup

My laptop is a ThinkPad P14s Gen 5 AMD, with a Ryzen 7 PRO 8840HS, 64GB of RAM, and 4TB of storage, running Windows 11 Pro. I also have a Razer Core X eGPU with an NVIDIA RTX 3070 connected to my laptop.

I run VMware Workstation on my laptop, so I have multiple SQL Server VMs, including a fresh one for SQL Server 2025. I want Ollama to use my eGPU, but because VMware Workstation does not support GPU passthrough (to my knowledge), I have to install Ollama on my base OS. As such, you’ll see some extra steps here that can be skipped if you’re just doing everything either locally or within a VM.

You’ll need:

  • Ollama for Windows
  • nginx for Windows
  • mkcert

Ollama Setup

First, just run the exe & click through the wizard. That’s easy!

Open a PowerShell prompt and run this:

ollama run llama3.2

This will download the llama3.2 model and begin an interactive session after it completes setup of the model. You can start typing and have a conversation with the model!


Once you’re done, you can exit interactive mode, but Ollama should still remain running in the background. Verify this by checking your System Tray for Ollama.

Now let’s confirm connectivity with this test. Run this in the same PowerShell Prompt:

curl http://localhost:11434

Next, you’ll want to pull an embedding model down. There are several to choose from; I’m choosing nomic-embed-text.

ollama pull nomic-embed-text

Now let’s test again. Run this in your PowerShell Prompt:

Invoke-RestMethod  -Uri http://localhost:11434/api/embeddings `
     -Method Post  `
     -Body '{"model": "nomic-embed-text", "prompt": "This is a test sentence to embed."}' `
     -ContentType 'application/json'

NOTE: If you’re browsing other embedding models like mxbai-embed-large, you may see that there’s a curl REST API example command. When I was first playing around, I kept getting a “405 Method Not Allowed” error with that example command. This is because curl in PowerShell behaves differently than the curl binary on Linux (in Windows PowerShell, curl is an alias for Invoke-WebRequest), and the doc page assumes Linux. So I just used Invoke-RestMethod instead.

Next, we must address one annoyance that I have with Ollama – it only listens on HTTP, not HTTPS. SQL Server can only communicate over HTTPS (and rightfully so). To resolve this, we must set up a reverse proxy. This is where nginx and mkcert come into play.

mkcert Setup

Again in your PowerShell Prompt:

mkcert -install

dir C:\Users\<YOUR USERNAME>\AppData\Local\mkcert\rootCA.pem


Next we’ll create a certificate for our Windows machine. My IP address is 192.168.1.160.

mkcert <YOUR IP ADDRESS> localhost 127.0.0.1 ::1


Go to your nginx folder, create a new subfolder called certs, and copy the two new .pem files there.


Next we’ll set up nginx. Just unzip the download and switch to that folder. We’ll now modify the \conf\nginx.conf file inside the nginx folder.

In the default nginx.conf file, there will be a large segment that starts with http and is enclosed by { } braces. Replace all of that with this (BE SURE TO MAKE CHANGES FOR YOUR ENV):

http {
    server {
        listen 443 ssl;
        server_name <YOUR IP ADDRESS HERE>;

        ssl_certificate      ../certs/<YOUR CERT FILENAME HERE>.pem;
        ssl_certificate_key  ../certs/<YOUR CERT FILENAME HERE>-key.pem;

        location / {
            proxy_pass http://localhost:11434;
            proxy_http_version 1.1;
            proxy_set_header Host $host;
            proxy_set_header X-Real-IP $remote_addr;
            proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        }
    }
}

Save and start nginx.

nginx

NOTE: This will run nginx actively within the prompt session you have open. To kill it, you’ll want to run the following command in another session.

nginx -s stop

Later, if you want to start nginx and run it in the background:

start nginx

Now let’s quickly test again with curl. The first request will go directly to Ollama, while the second will go through nginx to Ollama.

curl http://localhost:11434

curl https://localhost

Invoke-RestMethod -Uri https://localhost/api/embeddings `
   -Method Post `
   -Body '{"model": "nomic-embed-text", "prompt": "This is a test sentence to embed."}' `
   -ContentType 'application/json'

Note the differences:

  • The first goes to the original HTTP URI with the Ollama port specified.
  • The second goes to HTTPS and does not need the Ollama port specified, because the nginx reverse proxy is listening on 443 and passing traffic to 11434.
  • And finally, the third hits the REST API via HTTPS and creates a vector embedding.

VMware Workstation Setup

Next, let’s switch to our VM and get remote access going.

First, we must import the root certificate that we created earlier to our VM.

  1. Copy C:\Users\<YOUR USERNAME>\AppData\Local\mkcert\rootCA.pem to somewhere on your VM.
  2. Open the Windows Certificate Manager (certmgr.msc)
    • Start -> “Manage computer certificates”
  3. Navigate to Trusted Root Certification Authorities -> Certificates
  4. Right click Certificates -> Import
  5. Navigate to your copied rootCA.pem (you’ll want to change the file type filter to *.pem)
  6. Complete the wizard


Now let’s test from within the VM:

curl https://192.168.1.160

Invoke-RestMethod -Uri https://192.168.1.160/api/embeddings `
  -Method Post `
  -Body '{"model": "nomic-embed-text", "prompt": "This is a test sentence to embed."}' `
  -ContentType 'application/json'

Test within SQL Server 2025

Finally, let’s confirm that this all works within SQL Server!

First, some setup stuff.

-----
-- SETUP
USE master;
GO

-- Enable rest endpoint
sp_configure 'external rest endpoint enabled', 1
GO
RECONFIGURE WITH OVERRIDE;
GO

-- Turn on Trace Flags for Vector Search
DBCC TRACEON (466, 474, 13981, -1) 
GO

-- Check trace flags status
DBCC TraceStatus
GO

Now we’ll use some new T-SQL to create an External Model. This will communicate with Ollama’s HTTPS REST API endpoint that we took the time to create earlier.

-----
-- Create external model
USE CookbookDemo
GO

DROP EXTERNAL MODEL ollama_nomic_embed_text
GO

CREATE EXTERNAL MODEL ollama_nomic_embed_text
WITH (
    LOCATION = 'https://192.168.1.160:443/api/embed',
    API_FORMAT = 'Ollama',
    MODEL_TYPE = EMBEDDINGS,
    MODEL = 'nomic-embed-text'
);
GO

-- Verify
SELECT *
FROM sys.external_models;
GO

Now we will create some vector embeddings using the new T-SQL function AI_GENERATE_EMBEDDINGS().

The below code uses my CookbookDemo database. Of course, you’re welcome to use AdventureWorks or any other database of your choosing, but just make sure to change the final example SELECT statement.

-----
-- Test!
SELECT AI_GENERATE_EMBEDDINGS(
        N'this is some text that I want to turn into a vector embedding' USE MODEL ollama_nomic_embed_text
    ) AS GeneratedEmbedding;


-----
-- Run against a table with some data
SELECT 
    TOP 5000
    RecipeName,
    Ingredients,
    AI_GENERATE_EMBEDDINGS(
            Ingredients USE MODEL ollama_nomic_embed_text
        ) AS Ingredients_Embeddings
FROM dbo.Recipes_Flat;


Note that Ollama is using my external GPU (GPU 1 – 3D) to process the Ingredients records into Vector Embeddings! It took about 3 minutes to generate 5000 embeddings with my hardware. I do plan on looking into tweaking Ollama and my setup to improve performance. But hey, this works!

What Next?

Hopefully this quick start guide helped you get Ollama up and running, so you can start playing around now.

If you want to keep diving deeper, you’ll want to create another table with a Vector datatype, and store a foreign key plus the embedding value. Then you can start using vector search T-SQL to do some KNN searching, or create a DiskANN Index to do some ANN search. Have no idea what I just said?!? Don’t worry – I’ll be writing more blog posts to walk you through all of this!
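To make that slightly more concrete, here’s a rough, hedged sketch of what I mean, reusing the ollama_nomic_embed_text model and Recipes_Flat table from above. The Recipes_Embeddings table, the RecipeID key, and the search phrase are all hypothetical illustrations (nomic-embed-text returns 768-dimension vectors), so treat this as a starting point rather than finished code.

-- Hypothetical sketch: persist embeddings in a side table keyed back to the source rows,
-- then do a brute-force KNN search with VECTOR_DISTANCE.
CREATE TABLE dbo.Recipes_Embeddings
(
    RecipeID              INT         NOT NULL PRIMARY KEY,  -- hypothetical key back to dbo.Recipes_Flat
    Ingredients_Embedding VECTOR(768) NOT NULL               -- nomic-embed-text produces 768 dimensions
);
GO

-- Embed a search phrase, then return the 10 nearest rows by cosine distance.
DECLARE @search VECTOR(768) =
    AI_GENERATE_EMBEDDINGS(N'spicy chicken with garlic' USE MODEL ollama_nomic_embed_text);

SELECT TOP (10)
    e.RecipeID,
    VECTOR_DISTANCE('cosine', e.Ingredients_Embedding, @search) AS CosineDistance
FROM dbo.Recipes_Embeddings AS e
ORDER BY CosineDistance;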

And don’t forget to check out my other blogs on Vector Search too!

Thanks for reading!

T-SQL Tuesday #184: My Top Mentoring Obstacle

T-SQL Tuesday Logo

Welcome to another edition of T-SQL Tuesday! This month’s host is Deborah Melkin, who asks bloggers to share some thoughts around mentoring.

My Experiences with Mentoring

Over the course of my career, I’ve had the wonderful opportunity to mentor others in addition to having a mentor myself. It has been a fantastic experience almost every single time, too.

I value mentorship because I believe that it benefits both parties: the mentor and the mentee. Some of those shared benefits result from the sharing of experiences and perspectives. For example, a mentee may bring a question to a mentor that presents a situation from an angle the mentor had never considered before. In general, it is an amazing experience for both parties involved and a wonderful opportunity for mutual growth, not just growth of the mentee seeking mentorship.

So Why Don’t I Do This More Often?

As I think back, I’ve mentored others far more often than I’ve been mentored myself. One might think this is rather peculiar, given how I professed the benefits of being mentored just a few moments ago.

“Andy, why haven’t you sought out a mentor more regularly? Is it because you have a difficult time finding a mentor (a commonly voiced challenge)?”

Because I hate setting goals.

For me, I don’t have a difficult time thinking of someone who I would like to have as a mentor. And I am perfectly okay ASKING to be mentored (another common challenge voiced by others). My biggest obstacle is: if that person says yes, then what?

One of the generally accepted best practices for a successful mentoring relationship is to think of some goals for what you’d like to get out of it. That helps create alignment for both parties, so they know what the time is being spent on. And that’s something I’m terrible at.

Unpopular Opinion…

In my opinion, a mentoring relationship CAN work just fine without a specifically set goal. But what’s the point then, Andy?

The point is that your mentor is your trusted advisor, there to help you work through whatever challenges life may throw at you that week. And just like work life for most of us, our priorities and challenges change frequently.

While I’ve framed this as an unpopular opinion, I will also fully admit that I still need to ask, but just set the expectation up front. It’s no different than saying, “hey, can you help me for a few hours to develop this presentation?” vs. “hey, can you help me just get a better handle on my career?” And the person being asked to mentor MAY be okay with a less focused mentoring relationship. What I’m guilty of is assuming that they would not be, since it is counter to the typical best practice.

Moral of the Story… is?

Simply ask anyway. But do be clear on what your expectations are, even if you do not have any specific goals in mind. Heck, the kick-off for the mentoring relationship could be a brainstorming conversation around goals!

(I should take my own advice sometimes… )

Thanks for reading – happy T-SQL Tuesday!

SQL Server Backups & Mirror To

I love that, while I’ve been working with SQL Server for far too many years, I’m still constantly learning new things, even about old aspects of SQL Server.

MirrorDirectory / MIRROR TO… Wut?

About a month ago, I was having a conversation with a Pure Storage customer about their SQL Server backups. They were using Ola Hallengren’s scripts but mentioned something about writing the backups to multiple locations at once. I presumed they were executing the backup operation as normal, then using another script of some sort to copy the backup files elsewhere. That’s when I learned that Ola has a parameter called MirrorDirectory, which exposes the MIRROR TO option of the native BACKUP command.
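For context, here’s roughly what that looks like when calling Ola’s DatabaseBackup procedure. This is a hedged sketch: the directories, database selection, and number of files are placeholders for illustration, so check Ola’s documentation for the full parameter list and defaults.

-- Hedged sketch: Ola Hallengren's DatabaseBackup writing to a primary directory
-- and mirroring to a second location. Paths and settings are placeholders.
EXECUTE dbo.DatabaseBackup
     @Databases       = 'USER_DATABASES',
     @Directory       = 'I:\AYunBackups',
     @MirrorDirectory = 'H:\Backup',
     @BackupType      = 'FULL',
     @Compress        = 'Y',
     @CheckSum        = 'Y',
     @NumberOfFiles   = 3;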

This conversation raised the question: what is really happening behind the scenes when this parameter is utilized? More specifically, are the write I/Os done in parallel, as when you stripe your backups across multiple backup files, or are they serial? Or maybe a combination thereof: parallel within a backup set, but serial to each different MIRROR TO destination?

Simple Test

I finally got some spare time, so I put together a simple test to dig deeper. First, here’s the backup command that I created:

-- Turn on Trace Flags
DBCC TRACEON(3604, 3004, 3014, 3213, -1);
GO

BACKUP DATABASE AutoDealershipDemo
TO
     DISK='I:\AYunBackups\AutoDealershipDemo-1a.bak',
     DISK='I:\AYunBackups\AutoDealershipDemo-1b.bak',
     DISK='I:\AYunBackups\AutoDealershipDemo-1c.bak'
MIRROR TO
     DISK='H:\Backup\AutoDealershipDemo-2a.bak',
     DISK='H:\Backup\AutoDealershipDemo-2b.bak',
     DISK='H:\Backup\AutoDealershipDemo-2c.bak'
MIRROR TO
     DISK='D:\Backup\AutoDealershipDemo-3a.bak',
     DISK='D:\Backup\AutoDealershipDemo-3b.bak',
     DISK='D:\Backup\AutoDealershipDemo-3c.bak'
MIRROR TO
     DISK='V:\Backup\AutoDealershipDemo-4a.bak',
     DISK='V:\Backup\AutoDealershipDemo-4b.bak',
     DISK='V:\Backup\AutoDealershipDemo-4c.bak'
WITH INIT, CHECKSUM, FORMAT, STATS = 3
GO 

If you’re wondering about the Trace Flags, I wrote about them here.

Then to monitor I/O, I opted for the simple, brute-force method of using Resource Monitor on the SQL Server itself. So I simply executed the above backup statement, switched windows to the SQL Server, and here’s what I saw:

Simple explanation – all writes to all backup sets and backup files were occurring simultaneously. Nothing kicked off serially. Neat!
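As a quick sanity check that each mirrored destination holds its own complete, restorable media set, you can point RESTORE VERIFYONLY at any one of the destinations. A sketch using the file names from my test above:

-- Verify the H:\ mirror copy of the backup set (run against any one destination).
RESTORE VERIFYONLY
FROM
     DISK = 'H:\Backup\AutoDealershipDemo-2a.bak',
     DISK = 'H:\Backup\AutoDealershipDemo-2b.bak',
     DISK = 'H:\Backup\AutoDealershipDemo-2c.bak'
WITH CHECKSUM;
GO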

Conclusion

So there you have it, folks. MIRROR TO can enable you to write backup files to multiple locations simultaneously. But be aware that it will take more CPU and you’ll increase the volume of write I/O, meaning you could bottleneck on your storage interconnect. It could be useful, but could become painful as your databases grow.

Thanks for reading!

Hyper-V: Cluster Shared Volumes, SQL Server, & FlashArray Snapshots

Many (arguably most) organizations are currently evaluating hypervisor alternatives to VMware, and one obvious contender is Microsoft Hyper-V. And if you’re a Pure Storage customer today, one thing you may be wondering is whether you can keep utilizing storage-array snapshots to do things like replicate SQL Server data from one VM to another. The good news is yes, absolutely! Let’s take a look at a common scenario as an example.

TL;DR – Show Me the Code!

A PowerShell example to orchestrate this workflow is now available on Pure’s GitHub in the SQL Server Scripts repository. In addition to having the proper setup detailed below, read the README file’s Important Usage Notes section, which outlines other prerequisites.

If you just want to take Hyper-V CSV snapshots, without SQL Server involved, check out this Github repository, where I’ve shared a PowerShell solution. This is particularly useful if you want to clone an entire VM and present a second iteration of it back to a Hyper-V cluster.

And to see this in action, watch this narrated demo recording!

Updating Non-Prod Databases with a New Prod Copy

One extremely common task and time sink for DBAs is needing to take a backup of a production database and restore it to a non-production environment. Because it is a size-of-data operation, it can be quite time consuming. To help with that, Pure offers solutions that I’ve showcased, which leverage FlashArray snapshots to orchestrate such workflows nearly instantaneously. But because (anecdotally) ~90% of all SQL Servers are on VMware, our examples were written for VMware. So now I want to show you how to do it in Hyper-V.

Hyper-V: Key Tidbits Overview

In Hyper-V, virtual disks take the form of VHDX files that reside on a Cluster Shared Volume (CSV) (or, generally pre-Windows Server 2012, VHDs that reside on a Clustered Disk). While CSVs were around in Windows Server 2008 R2, they only really became useful and relevant in their second iteration, released with Windows Server 2012; this linked blog has a fantastic deep dive into the CSV stack.

If you’re familiar with VMware, then I would say that a CSV is somewhat analogous to a VMFS datastore. A VMFS datastore contains one or more VMDK files, with each file representing a virtual disk for a virtual machine. With Hyper-V, each VHDX file is like a VMDK file.

So when you take a snapshot on FlashArray, you’re taking a snapshot of the entire volume – the CSV in Hyper-V or the VMFS datastore in VMware (unless you’re using vVols – a separate conversation for another time).

The Setup & Workflow

For this, imagine that you have two SQL Servers: SQL-Prod and SQL-NonProd. These are Hyper-V VMs and may reside on the same CSV or different CSVs; it does not really matter. The idea is that we do not want to snapshot the ENTIRE VM, but instead isolate one or more SQL Server user databases to clone from Prod to non-Prod.

So what you’ll do is create two more CSVs, one for each SQL Server VM. We’ll call them CSV-UserDB-Prod and CSV-UserDB-NonProd for the sake of discussion. Inside each CSV will be the VHDX file(s) that correspond to the virtual disks that contain your user database data files and log files. If you have, say, a D:\ drive for data and an L:\ drive for logs, then you should expect to have two VHDX files in your CSV.

The workflow is that we will take a snapshot of just CSV-UserDB-Prod, then clone that snapshot and overlay CSV-UserDB-NonProd with it, thus updating the underlying data and files. Production will not be interrupted at all by the snapshot, but because these are volume-level operations, non-Prod will be disrupted.

Thanks for reading – happy snapshotting!

Hyper-V Cluster Shared Volumes + Write Caching

This blog post will probably never apply to anyone except me, but I’m writing it anyway.

Let’s say you have some VMware VMs… and decide to run Hyper-V hosts/a Hyper-V cluster within those VMware VMs. And let’s say you start running into some headaches: you take crash-consistent snapshots, but those snapshots do not come back in a consistent state as you would expect. That’s been my headache recently.

Write Caching?

So there are several different layers of caching that can come into play, as documented here: “Hyper-V storage: Caching layers and implications for data consistency.” So I searched high and low to try and figure out if I had unexpected write caching set up on my Cluster Shared Volumes. Looking at Disk Management -> Properties (of a CSV), one would normally see this:

Write-caching policy -> Enable write caching on the device

However… that is what one would see IF their respective Hyper-V host was on a bare-metal server. That’s NOT what one sees though, if the Hyper-V host resides on a VMware VM backed by a vVol (Virtual Volume). THIS is what you get…

Note that there are TWO policies now – a Removal Policy in addition to the Write-caching policy.

Removal Policy?! Where did that come from? Doing some digging, I found that it is normally only present for removable storage like USB drives. And look at the default setting’s description… “enables write caching in Windows…”

So I disabled that on each of my Cluster Shared Volumes, bounced both Hyper-V hosts (not 100% sure if it was required but did so just to be sure), and HURRAY – my crash consistent snapshot code worked as expected!!!

One reason why this was so frustrating is that I was not doing most of this investigative work via the GUI, but rather via PowerShell commands. So I was digging around things like Cluster Parameters and turning CsvEnableBlockCache and CsvEnforceWriteThrough on and off (https://learn.microsoft.com/en-us/windows-server/failover-clustering/failover-cluster-csvs & https://learn.microsoft.com/en-us/previous-versions/windows/desktop/mscs/physical-disk-csvenforcewritethrough). But nuking this Removal Policy wound up doing the trick for me.

TL;DR

If you’re trying to disable any and all write caching, for whatever reason, be sure to check if Windows thinks the disk is removable media and has a Removal Policy applied (that also enables write caching).

Thanks for reading.

Combat Database Bloat with Data Virtualization

One of SQL Server 2022’s new features is something called Data Virtualization. It enables T-SQL to directly query files that reside in Azure object storage or S3-compatible object storage. Since SQL Server 2022’s release, it has been one of those underrated capabilities that I think many have glossed over. But I strongly believe that it is insanely useful and you should take a few minutes to learn more!

Data Bloat Anyone?

Ask yourself: do you have a database that has a large volume of data that will never change again? Sales order history data is a generic but very relatable example. Once an order is placed, it might be amended, but after a few months, it’s highly probable that it’ll never change again. We’ll pretend for the sake of our discussion that after 3 months, that data is considered set in stone.

Can We PLEASE Archive/Delete That Old Data?

Who has begged their business colleagues to let them de-bloat a database by archiving off and deleting ancient data from a primary production database? I see your raised hands and raise mine too. There are many practical reasons to de-bloat a database, from performance reasons to maintenance reasons. But then there are also the business folks who say “no we can’t, because we MIGHT need to report on it.” Might… So like my stubborn beer belly fat, that data bloat remains.

Meet Parquet (… not butter)

We should all know what a Comma Separated Values (CSV) flat file is. Because of its flat, singular nature, it’s not very efficient to query, because the entirety of the file must be scanned, consumed, processed, etc. But what if we had a different file format whose entirety did not need to be fully processed when querying? Wouldn’t it be awesome to have a flat file format that could offer column elimination, row elimination, and compression for our data and queries? Let me introduce you to Parquet.

Pretend your sales order history table has 30 different columns of data. And let’s say you were to export all of your sales order history into individual CSVs per calendar year. If you wanted to query, say, all sales in May, June, and July of 2021 and get a SUM of total sales, you’d have to consume the entirety of the 2021 CSV file. But with a Parquet file, metadata is utilized to enable you to zero in on rows for May, June, and July only. And instead of all 30 columns, you can retrieve the sales date and sales amount ONLY, and not consume the other 28 columns! That amounts to a tremendous savings from a workload perspective!

Data Virtualization

So how does SQL Server 2022 come into play here? First, we create something called an External Data Source, which in this case will point to S3 object storage somewhere. Then we combine it with an External File Format for our Parquet files. The third piece of the solution is CREATE EXTERNAL TABLE AS SELECT (aka CETAS).
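As a rough, hedged sketch of those first two pieces (the endpoint, bucket, and credential values are placeholders, and depending on your setup you may also need a database master key and the PolyBase feature enabled first), it looks something like this:

-- Hedged sketch: credential, External Data Source, and External File Format for
-- S3-compatible object storage. Endpoint, bucket, and keys are placeholders.
CREATE DATABASE SCOPED CREDENTIAL s3_cred
WITH IDENTITY = 'S3 Access Key',
     SECRET   = '<AccessKeyID>:<SecretKey>';
GO

CREATE EXTERNAL DATA SOURCE cetas_demo
WITH (
     LOCATION   = 's3://<your-s3-endpoint>/<your-bucket>',
     CREDENTIAL = s3_cred
);
GO

CREATE EXTERNAL FILE FORMAT parquet_file_format_object
WITH (FORMAT_TYPE = PARQUET);
GO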

The end result is that each Parquet file will now become available to us within T-SQL as an External Table entity. So I can write T-SQL to query a Parquet file, and my code will look identical to querying a regular table.

CREATE EXTERNAL TABLE parquet.SalesOrderHistory_2021
WITH (
	LOCATION = '/SalesOrderHistory_2021.parquet', 
	DATA_SOURCE = cetas_demo, 
	FILE_FORMAT = parquet_file_format_object
)
AS 
SELECT 
	[SalesID], 
	[SalesDate],
	[ProductID],
	[Email],
	[PhoneNumber],
	[OtherColumns]
FROM dbo.SalesOrderHistory
WHERE [SalesDate] >= '2021-01-01' AND [SalesDate] < '2022-01-01';
GO

SELECT SalesID, SalesDate
FROM parquet.SalesOrderHistory_2021
WHERE SalesDate >= '2021-05-01' AND SalesDate < '2021-08-01'

Now you may have noticed that we have a singular table for 2021 data, or one table per Parquet file. You might be thinking: will that be a problem if I have many years’ worth of data, and thus many Parquet files? This is where another old T-SQL trick comes into play – partitioned views. I can have multiple Parquet files virtualized via multiple External Tables, but overlay them all with a single partitioned view, to then enable me to query everything as I once did.

CREATE OR ALTER VIEW dbo.SalesOrderHistory
AS
SELECT 
	[SalesID], [SalesDate], [ProductID], [Email], [PhoneNumber], [OtherColumns]
FROM parquet.SalesOrderHistory_2020
UNION ALL
SELECT 
	[SalesID], [SalesDate], [ProductID], [Email], [PhoneNumber], [OtherColumns]
FROM parquet.SalesOrderHistory_2021
UNION ALL
SELECT 
	[SalesID], [SalesDate], [ProductID], [Email], [PhoneNumber], [OtherColumns]
FROM parquet.SalesOrderHistory_2022
UNION ALL
SELECT 
	[SalesID], [SalesDate], [ProductID], [Email], [PhoneNumber], [OtherColumns]
FROM parquet.SalesOrderHistory_2023
UNION ALL
SELECT 
	[SalesID], [SalesDate], [ProductID], [Email], [PhoneNumber], [OtherColumns]
FROM parquet.SalesOrderHistory_2024
UNION ALL
SELECT 
	[SalesID], [SalesDate], [ProductID], [Email], [PhoneNumber], [OtherColumns]
FROM dbo.SalesOrderHistory_Current
GO

SELECT SalesID, SalesDate
FROM dbo.SalesOrderHistory
WHERE SalesDate >= '2021-05-01' AND SalesDate < '2021-08-01'

Be sure to look closely at the final SELECT in the partitioned view. That’s a reference to a traditional SQL Server table! What this means is that instead of having a dbo.SalesOrderHistory table that contains everything, including new orders, I’ve physically partitioned the data out into 5 Parquet files, virtualized via External Tables, AND still included all current sales order data as well. So existing T-SQL queries against dbo.SalesOrderHistory (from before the Parquet split) have no idea and do NOT need to be changed!

You Mentioned Azure & S3… That Means I Need to Cloud, Right?

Whenever people (myself included, for the longest time) hear “S3” and/or “object storage,” they typically mentally default to AWS S3 Object Storage. But let’s break that apart. In brief, “object storage” is essentially a storage architecture geared towards unstructured data like images, videos, and documents. “S3” is Amazon’s “Simple Storage Service,” which I’d loosely describe as Amazon’s implementation approach for object storage. The good news here is that you’re not stuck with just Amazon. S3-compatible object storage is also a thing. This is where other vendors have their own object storage implementations but leverage the Amazon S3 API.

“But my data is on-prem and staying that way.” This is where another common misconception comes in. Because “Amazon S3 object storage” has become such a common term, many do not realize that S3-compatible on-prem storage is a thing! And yes, this is where Pure Storage comes into the picture with our FlashBlade series. You can have all of this awesomeness on-prem without having to compromise by moving to the cloud.

I Want To See This in Action!

Hopefully you’re chomping at the bit to give this a go. I’d like to leave you with three resources to help you get jump-started:

  1. Watch… A New Solution for Data Archiving using Data Virtualization
    This is my Pure Storage vendor session from PASS Summit 2023, that covers everything end to end with demos!
  2. Learn… Microsoft’s Introduction to SQL Server 2022 Data Virtualization learning path
    Go learn straight from Microsoft!
  3. Try… Pure Storage’s Modern Storage Platforms for SQL Server workshop
    Regardless of whether you’re a Pure Storage customer or not, you can review Module 4 of this workshop.
    And if you are a Pure Storage customer WITH a FlashBlade, use this workshop to try it out for yourself!
    And if you are a Pure Storage customer but do not have a FlashBlade, you can contact your Account Team and request a Test Drive voucher for this workshop. This will give you access to a virtual lab with hardware, to work through the lab workshop.

Thanks for reading!

T-SQL Tuesday #182: Integrity

T-SQL Tuesday Logo

Welcome back to another edition of T-SQL Tuesday! This month’s blog party is hosted by Rob Farley, who is asking bloggers to share some thoughts around a single word: Integrity.

To the Dictionary!

Like many words in the English language, Integrity has several meanings. I decided to look at a handful of dictionaries and am settling on this one from the Cambridge English Dictionary, which states:

the quality of being whole and complete

In the realm of data, what does it mean for data to be “whole” and “complete?” Well, if I’m storing a simple value like “$14,693.31” or “fourteen thousand and six hundred and ninety three dollars and thirty one cents”… I would be rather upset if the value that wound up being stored was “693.31” or “$1 ,6 3 .31”. That data is neither whole nor complete, thus the integrity of my data is compromised!

Garbage In, Garbage Out

I cannot tell you how many times I’ve encountered scenarios where “this data looks wrong.” Well… can one ensure that it is being retrieved and displayed correctly from the storage media that it resides on in the first place? Are you viewing/validating the data in question correctly? Whatever client/method you are using to review your data – that is suspect and its integrity is in question.

Assuming the answer is yes, well, did the data somehow get changed while it was “at rest” on its media? This is highly, highly unlikely – stable media like flash or even spinning disk very rarely flip bits when the media is idle. But in extremely rare cases, okay, your storage media has lost its ability to maintain your data integrity.

But more often than not, questionable data was written to your media in an incorrect fashion. That is where the saying, “garbage in, garbage out” is definitely applicable. So, what application is writing your data?

“Oh no Andy, it’s not the application – it must be the storage that’s corrupt!”

“If the storage were corrupt, we’d be seeing random garbage data all over the place. Not just these and only these specific dollar values. So that’s HIGHLY unlikely here.”

“It’s got to be the storage is corrupt, check again.” (hours pass)… “… okay, we found a bug in our code.”

Step by Step

What’s the moral of this simplistic story? As you determine the source of an integrity issue, you must carefully walk the stack and assess, at each point in the stack, where your integrity was compromised.

But Andy, shouldn’t your database have already had other things in place like constraints, foreign keys, etc. to also ensure that all data written down maintains its integrity? Absolutely… but that’s another story for another day (that I hope another T-SQL Tuesday’s blogger tackled too).

Thanks for reading.

Rapidly Recovering from SQL Server Data Mishaps with FlashArray Snapshots

Beginning of a Bad Day

Almost all of us have experienced a time when someone incorrectly modifies or deletes a subset of data in a database. And usually the only recourse is to run a database restore to get the data that we need back.

But… what if we cannot simply run a RESTORE operation over the entire database? Maybe the accident occurred a few hours prior and you cannot roll everything back. You need to do a side-by-side RESTORE.

But… what if the database in question is absolutely massive and will take hours to restore? And what if your server doesn’t have enough available storage to fit a second copy of that database at all?

FlashArray Snapshots to the Rescue!

If you’re fortunate to have your SQL Servers backed by Pure Storage, you can solve this quickly with minimal headache. The key is FlashArray snapshots.

This entire process does NOT require application-consistent snapshots either. You can utilize crash-consistent snapshots (which do NOT require VSS and do NOT stun your SQL Server when taken), and hopefully you’re already taking these today on your FlashArray. Because of how our snapshots work behind the scenes, you will not consume additional capacity on your storage array either. Best of all, it does not matter if your database is 50GB or 50TB – all of these actions will be nearly instantaneous.

I Need to Execute an Object-Level Restore RIGHT NOW!

Here is a step-by-step video walkthrough. Watch and pause it along the way, to emulate the steps I am doing.

Here are the functional steps (starting at 5m 23s):

  1. Add new volume(s) to your SQL Server, of the exact same size as the existing volume(s) that contain your database’s data and log files.
  2. In the FlashArray Protection Group UI, select a snapshot and find the volume(s) you wish to clone – aka “Copy Snapshot.” Overwrite the volume(s) created in step 1 with the member volume(s) of the Protection Group snapshot.
  3. In the Windows OS – Disk Management – find the newly added volume(s)/disk(s) and Set Online.
  4. Attach the newly cloned database files in SQL Server (with a different database name, of course – see the T-SQL sketch after this list)
  5. Use T-SQL to insert/update/copy from your restored database back to your main database.
  6. Clean-up: drop the restored database, set the volume(s)/disk(s) offline in Windows, delete the disks from your VM
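For step 4 specifically, the attach itself is quick. Here’s a hedged T-SQL sketch; the database name, drive letters, and file names are hypothetical placeholders for wherever your cloned volumes mount, and multi-file databases will need every file listed.

-- Hedged sketch for step 4: attach the cloned data/log files under a new database name.
-- Drive letters, paths, and names are hypothetical; point them at your cloned volumes.
CREATE DATABASE YourBigDB_Restored
ON  (FILENAME = 'R:\SQLData\YourBigDB.mdf'),
    (FILENAME = 'S:\SQLLog\YourBigDB_log.ldf')
FOR ATTACH;
GO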

Assumptions & Pre-Requisites:

  • Your SQL Server is a VMware VM – this will work with bare metal as well, but with a slight adjustment to the steps
  • Your VMware VM is using vVols – this will work with RDMs as well, but with a slight adjustment to the steps. VMFS will also work, but you’ll need a slightly different set of steps which I’ll cover in a future blog
  • The volumes containing your SQL Server data and log files are being snapshotted by FlashArray on a pre-scheduled basis, all as part of a Protection Group

Conclusion

If you’re NOT in the midst of an emergency right now, I’d encourage you to practice this on a non-Prod server. Once you grasp the steps and workflow, you’ll see how easy it is and can now add this tool/technique to your arsenal for the next time someone accidentally mangles some data in your database.

Thanks for reading!