TOP Archives | Darling Data

Does Using TOP In A SQL Server Common Table Expression Materialize It?

Posted on March 22, 2022May 16, 2022 by Erik Darling

Nope.

The sad news for you here is that nothing aside from selecting a CTE into a real or temporary table will materialize the result of the query within in it.

WITH 
    cte AS
(
    SELECT TOP (1)
        u.Id
    FROM dbo.Users AS u
)
SELECT
    c1.*
FROM cte AS c1
JOIN cte AS c2
    ON c1.Id = c2.Id
JOIN cte AS c3
    ON c1.Id = c3.Id;

This query will still have to touch the Users table three times. I’ve blogged about this part before, of course.

Bounce

You may notice something interesting in there, though, once you get past the disappointment of seeing three scans of the Users table.

Each scan is preceded by the TOP operator. This can sometimes be where people confuse the behavior of TOP in a Common Table Expression or Derived Table.

It’s not a physical manifestation of the data into an object, but (at least for now) it is a logical separation of the query semantics.

In short, it’s a fence.

The reason why it’s a fence is because using TOP sets a row goal, and the optimizer has to try to meet (but not exceed) that row goal for whatever part of the query is underneath it.

Strange

Take this query for example, which loads a bunch of work into a Common Table Expression with a TOP in it:

WITH
    Posts AS
(
    SELECT TOP (1000)
         p.*
    FROM dbo.Posts AS p
    WHERE p.PostTypeId = 1
    AND   p.Score > 500
    AND   EXISTS
          (
              SELECT
                  1/0
               FROM dbo.Users AS u
               WHERE u.Id = p.OwnerUserId
          )
    AND EXISTS
        (
            SELECT
                1/0
            FROM dbo.Badges AS b
            WHERE b.UserId = p.OwnerUserId
        )
    ORDER BY p.Score DESC
)
SELECT
    u.DisplayName,
    p.PostTypeId,
    p.Score
FROM dbo.Users AS u
JOIN Posts AS p
    ON p.OwnerUserId = u.Id
ORDER BY u.Reputation DESC;

And the plan for it looks like this:

All the work within the Common Table Expression is fenced by the top.

There are many times you can use this to your advantage, when you know certain joins or predicates can produce a very selective result.

Care Control

As a final note, just be really careful how you position your TOPs. They insert loads of semantic differences to the query.

If you don’t believe me, put a TOP in the wrong place and watch your results change dramatically.

Thanks for reading!

Going Further

If this is the kind of SQL Server stuff you love learning about, you’ll love my training. Blog readers get 25% off the Everything Bundle — over 100 hours of performance tuning content. Need hands-on help? I offer consulting engagements from targeted investigations to ongoing retainers. Want a quick sanity check before committing to a full engagement? Schedule a call — no commitment required.

A Parameterization Performance Puzzle With TOP PERCENT in SQL Server

Posted on March 19, 2021May 16, 2022 by Erik Darling

Lawdy

There was a three-part series of posts where I talked about a weird performance issue you can hit with parameterized top. While doing some query tuning for a client recently, I ran across a funny scenario where they were using TOP PERCENT to control the number of rows coming back from queries.

With a parameter.

So uh. Let’s talk about that.

Setup Time

Let’s start with a great index. Possibly the greatest index ever created.

CREATE INDEX whatever 
ON dbo.Votes
    (VoteTypeId, CreationDate DESC)
WITH
(
    MAXDOP = 8,
    SORT_IN_TEMPDB = ON
);
GO

Now let me show you this stored procedure. Hold on tight!

CREATE OR ALTER PROCEDURE dbo.top_percent_sniffer 
(
    @top bigint, 
    @vtid int
)
AS
SET NOCOUNT, XACT_ABORT ON;
BEGIN
    SELECT TOP (@top) PERCENT
        v.*
    FROM dbo.Votes AS v
    WHERE v.VoteTypeId = @vtid
    ORDER BY v.CreationDate DESC;

END;

Cool. Great.

Spool Hardy

When we execute the query, the plan is stupid.

EXEC dbo.top_percent_sniffer
    @top = 1,
    @vtid = 6;
GO

We don’t use our excellent index, and the optimizer uses an eager table spool to hold rows and pass the count to the TOP operator until we hit the correct percentage.

This is the least ideal situation we could possibly imagine.

Boot and Rally

A while back I posted some strange looking code on Twitter, and this is what it ended up being used for (among other things).

getting weird. pic.twitter.com/lRVQQr81Yi

— Erik Darling Data (@erikdarlingdata) February 27, 2021

The final version of the query looks like this:

CREATE OR ALTER PROCEDURE dbo.top_percent_sniffer 
(
    @top bigint, 
    @vtid int
)
AS
SET NOCOUNT, XACT_ABORT ON;
BEGIN;
    
    WITH pct AS
    (
        SELECT
            records = 
                CONVERT(bigint, 
                    CEILING(((@top * COUNT_BIG(*)) / 100.)))
        FROM dbo.Votes AS v
        WHERE v.VoteTypeId = @vtid
    )
    SELECT
        v.*
    FROM pct
    CROSS APPLY
    (
        SELECT TOP (pct.records)
            v.*
        FROM dbo.Votes AS v
        WHERE v.VoteTypeId = @vtid
        ORDER BY v.CreationDate DESC
    ) AS v;

END;
GO

Soul Bowl

This definitely has drawbacks, since the expression in the TOP always gives a 100 row estimate. For large numbers of rows, this plan could be a bad choice and we might need to do some additional tuning to get rid of that lookup.

There might also be occasions when using a column store index to generate the count would be benefit, and the nice thing here is that since we’re accessing the table in two different ways, we could use two different indexes.

But for reliably small numbers of rows, this is a pretty good solution.

Thanks for reading!

Going Further

A SQL Server Parameterization Puzzle With TOP: Part 1

Posted on October 21, 2020May 16, 2022 by Erik Darling

The Gift That Keeps On Giving

I love when a demo written for one purpose turns into an even better demo for another purpose.

While working with a client recently, they ran into a performance issue when trying to promote plan reuse by parameterizing the user-input number for TOP.

In part 1, I’m going to show you what happened and why, and in part 2 I’ll discuss some workarounds.

Regresso Chicken Face Soup

When executed with a literal value in the top, this query runs for around 10 seconds.

I’m not saying that’s great, but it’s a good enough starting place.

SELECT TOP (38)
        u.DisplayName,
        b.Name
FROM dbo.Users u
CROSS APPLY 
(
    SELECT TOP (1) 
            b.Name
    FROM dbo.Badges AS b
    WHERE b.UserId = u.Id
    ORDER BY b.Date DESC
) AS b
WHERE u.Reputation >= 10000
ORDER BY u.Reputation DESC;

If we take that same query, put it in a procedure, and run it with an identical value in the TOP, things will turn out not-so-well.

CREATE OR ALTER PROCEDURE dbo.SniffedTop (@Top INT)
AS
BEGIN
SET NOCOUNT, XACT_ABORT ON;

    SELECT TOP (@Top)
            u.DisplayName,
            b.Name
    FROM dbo.Users u
    CROSS APPLY 
    (
        SELECT TOP (1) 
                b.Name
        FROM dbo.Badges AS b
        WHERE b.UserId = u.Id
        ORDER BY b.Date DESC
    ) AS b
    WHERE u.Reputation >= 10000
    ORDER BY u.Reputation DESC;

END
GO 

EXEC dbo.SniffedTop @Top = 38;

The query runs for a significantly longer amount of time.

What Happened?

Unofficially, when TOP uses a constant and the constant is “small” (under 101), the exchange packets are allowed to send earlier than normal, as long as the exchange is below the TOP operator. They’re allowed to send as soon as they have a row, rather than waiting for them to fill up completely.

This can only happen with constants (or…!), and the behavior is true going back to 2005. It may change in the future, so if you’re reading this at some far off date, please don’t be too harsh on me in the comments.

When you parameterize TOP, it’s considered unsafe to send the exchange buffers early. After all, you could stick anything in there, up through the BIGINT max. In cases where you’ve got a BIG TOP, sending, say, 9,223,372,036,854,775,807 rows one at a time would be significantly ickier than sending over a smaller number of full exchange buffers.

If you’re surprised to hear that parallel exchange buffers can send at different times, you’re not alone. I was also surprised.

SQL Server: Full Of Surprises. Horrible surprises.

In the second query, where exchange buffers are sent when full, we spend a lot longer waiting for them to fill up. This isn’t exposed anywhere in the plan, and you’d need either a debugger or this blog post to figure it out.

Yep.

Yep. Yep. Yep. Yep. Yep. Yep.

Thanks for reading!

Going Further

Starting SQL: A Little TOP Can Have A Big Sort In SQL Server Query Plans

Posted on August 11, 2020May 14, 2022 by Erik Darling

Shortcuts

To illustrate this problem a little bit better, I need to use a slightly different query.

While a tidy group by was nice to illustrate some of the simpler points about lookups, it overly complicates things when we want to involve other columns.

The second we need anything else in the select or order by portions of a query, we need to apply aggregates to them, or include them in the group by.

And you see, once you set up a query to return the TOP N rows, there’s an expectation that users get to choose the order they start seeing rows in. As long as we stick to columns whose ordering is supported by an index, things will be pretty stable.

Once we go outside that, a TOP can be rough on a query.

Order By CreationDate

Even if we order by CreationDate in descending order, with the index created in ascending order, we don’t need an explicit sort operation to put data in order.

SELECT TOP (1000)
       p.CreationDate,
       p.OwnerUserId,
       p.Score
FROM dbo.Posts AS p
WHERE p.CreationDate >= '20131015'
ORDER BY p.CreationDate DESC;

There are some additional possibilities for this kind of thing with multi-key indexes that we’ll talk about later in the series, but for now this is a good enough illustration of indexes putting data in order.

Order By Score

Let’s ask for data in a different order now. Score is a convenient villain, because there are lots of times when you might want to see things by a highest whatever metric is commendable to be high. Like not blood pressure, probably.

SELECT TOP (1000)
       p.CreationDate,
       p.OwnerUserId,
       p.Score
FROM dbo.Posts AS p
WHERE p.CreationDate >= '20131015'
ORDER BY p.Score DESC;

Without an index that has Score in the key, we need to physically put the data in order to fit the requirements of the query. Note that in this case, the optimizer no longer sees any benefit to using our nonclustered index.

Why Do We Care About Sorts?

Without jumping too far ahead, Sorts need extra memory to run.

You know, that stuff you cache data in, and you don’t have enough of already?

Yeah, that memory.

The same memory that gives you a conniption every time PLE fluctuates.

How much they ask for is going to depend on:

How much memory you have
What max server memory is set to
The size of the data you need to sort

If we aren’t able to get enough memory, or if we don’t ask for enough up front, data could spill from memory to disk.

When spills get big enough, they can sometimes cause performance issues.

This is a good time to start talking a little bit more about indexes, so we can understand more about how they work and help us solve problems.

Thanks for reading!

Going Further

SQL Server T-SQL Protip: watch those TOPs without ORDER BY

Posted on June 4, 2019May 14, 2022 by Joe Obbish

In the documentation for TOP, the following is listed as a best practice:

In a SELECT statement, always use an ORDER BY clause with the TOP clause. Because, it’s the only way to predictably indicate which rows are affected by TOP.

Let’s work through a real world example.

The good

One of the great things about the “Top Resource Consuming Queries” query store SSMS report is that it is always able to render the query plan, even for very complex queries. I’m not aware of a pure T-SQL solution that can avoid requiring the end user to save xml to files in all cases. The report nearly always takes a long time to run, so it’s easy to capture the T-SQL that powers the grid details version:

DECLARE @results_row_count INT = 100,
@interval_start_time DATETIMEOFFSET = '2019-05-24 15:30:00 +00:00',
@interval_end_time DATETIMEOFFSET = '2019-05-24 18:00:00 +00:00';
 
SELECT TOP (@results_row_count)
    p.query_id query_id,
    q.object_id object_id,
    ISNULL(OBJECT_NAME(q.object_id),'') object_name,
    qt.query_sql_text query_sql_text,
    ROUND(CONVERT(float, SUM(rs.avg_duration*rs.count_executions))*0.001,2) total_duration,
    SUM(rs.count_executions) count_executions,
    COUNT(distinct p.plan_id) num_plans
FROM sys.query_store_runtime_stats rs
    JOIN sys.query_store_plan p ON p.plan_id = rs.plan_id
    JOIN sys.query_store_query q ON q.query_id = p.query_id
    JOIN sys.query_store_query_text qt ON q.query_text_id = qt.query_text_id
WHERE NOT (rs.first_execution_time > @interval_end_time OR rs.last_execution_time < @interval_start_time)
GROUP BY p.query_id, qt.query_sql_text, q.object_id
HAVING COUNT(distinct p.plan_id) >= 1
ORDER BY total_duration DESC;

Note the presence of the ORDER BY. I get exactly the results that I was expecting:

a41 good

The bad

If I ask for extra details (who doesn’t want more details?), a significantly more complex query is generated:

-- grid format query with additional details
-- grid format query with additional details
DECLARE @results_row_count INT = 100,
@interval_start_time DATETIMEOFFSET = '2019-05-24 15:30:00 +00:00',
@interval_end_time DATETIMEOFFSET = '2019-05-24 18:00:00 +00:00';
 
With wait_stats AS
(
SELECT
    ws.plan_id plan_id,
    ws.execution_type,
    ROUND(CONVERT(float, SUM(ws.total_query_wait_time_ms)/SUM(ws.total_query_wait_time_ms/ws.avg_query_wait_time_ms))*1,2) avg_query_wait_time,
    ROUND(CONVERT(float, SQRT( SUM(ws.stdev_query_wait_time_ms*ws.stdev_query_wait_time_ms*(ws.total_query_wait_time_ms/ws.avg_query_wait_time_ms))/SUM(ws.total_query_wait_time_ms/ws.avg_query_wait_time_ms)))*1,2) stdev_query_wait_time,
    CAST(ROUND(SUM(ws.total_query_wait_time_ms/ws.avg_query_wait_time_ms),0) AS BIGINT) count_executions,
    MAX(itvl.end_time) last_execution_time,
    MIN(itvl.start_time) first_execution_time
FROM sys.query_store_wait_stats ws
    JOIN sys.query_store_runtime_stats_interval itvl ON itvl.runtime_stats_interval_id = ws.runtime_stats_interval_id
WHERE NOT (itvl.start_time > @interval_end_time OR itvl.end_time < @interval_start_time)
GROUP BY ws.plan_id, ws.runtime_stats_interval_id, ws.execution_type ),
top_wait_stats AS
(
SELECT TOP (@results_row_count)
    p.query_id query_id,
    q.object_id object_id,
    ISNULL(OBJECT_NAME(q.object_id),'') object_name,
    qt.query_sql_text query_sql_text,
    ROUND(CONVERT(float, SUM(ws.avg_query_wait_time*ws.count_executions))*1,2) total_query_wait_time,
    SUM(ws.count_executions) count_executions,
    COUNT(distinct p.plan_id) num_plans
FROM wait_stats ws
    JOIN sys.query_store_plan p ON p.plan_id = ws.plan_id
    JOIN sys.query_store_query q ON q.query_id = p.query_id
    JOIN sys.query_store_query_text qt ON q.query_text_id = qt.query_text_id
WHERE NOT (ws.first_execution_time > @interval_end_time OR ws.last_execution_time < @interval_start_time)
GROUP BY p.query_id, qt.query_sql_text, q.object_id
),
top_other_stats AS
(
SELECT TOP (@results_row_count)
    p.query_id query_id,
    q.object_id object_id,
    ISNULL(OBJECT_NAME(q.object_id),'') object_name,
    qt.query_sql_text query_sql_text,
    ROUND(CONVERT(float, SUM(rs.avg_duration*rs.count_executions))*0.001,2) total_duration,
    ROUND(CONVERT(float, SUM(rs.avg_cpu_time*rs.count_executions))*0.001,2) total_cpu_time,
    ROUND(CONVERT(float, SUM(rs.avg_logical_io_reads*rs.count_executions))*8,2) total_logical_io_reads,
    ROUND(CONVERT(float, SUM(rs.avg_logical_io_writes*rs.count_executions))*8,2) total_logical_io_writes,
    ROUND(CONVERT(float, SUM(rs.avg_physical_io_reads*rs.count_executions))*8,2) total_physical_io_reads,
    ROUND(CONVERT(float, SUM(rs.avg_clr_time*rs.count_executions))*0.001,2) total_clr_time,
    ROUND(CONVERT(float, SUM(rs.avg_dop*rs.count_executions))*1,0) total_dop,
    ROUND(CONVERT(float, SUM(rs.avg_query_max_used_memory*rs.count_executions))*8,2) total_query_max_used_memory,
    ROUND(CONVERT(float, SUM(rs.avg_rowcount*rs.count_executions))*1,0) total_rowcount,
    ROUND(CONVERT(float, SUM(rs.avg_log_bytes_used*rs.count_executions))*0.0009765625,2) total_log_bytes_used,
    ROUND(CONVERT(float, SUM(rs.avg_tempdb_space_used*rs.count_executions))*8,2) total_tempdb_space_used,
    SUM(rs.count_executions) count_executions,
    COUNT(distinct p.plan_id) num_plans
FROM sys.query_store_runtime_stats rs
    JOIN sys.query_store_plan p ON p.plan_id = rs.plan_id
    JOIN sys.query_store_query q ON q.query_id = p.query_id
    JOIN sys.query_store_query_text qt ON q.query_text_id = qt.query_text_id
WHERE NOT (rs.first_execution_time > @interval_end_time OR rs.last_execution_time < @interval_start_time)
GROUP BY p.query_id, qt.query_sql_text, q.object_id
)
SELECT TOP (@results_row_count)
    A.query_id query_id,
    A.object_id object_id,
    A.object_name object_name,
    A.query_sql_text query_sql_text,
    A.total_duration total_duration,
    A.total_cpu_time total_cpu_time,
    A.total_logical_io_reads total_logical_io_reads,
    A.total_logical_io_writes total_logical_io_writes,
    A.total_physical_io_reads total_physical_io_reads,
    A.total_clr_time total_clr_time,
    A.total_dop total_dop,
    A.total_query_max_used_memory total_query_max_used_memory,
    A.total_rowcount total_rowcount,
    A.total_log_bytes_used total_log_bytes_used,
    A.total_tempdb_space_used total_tempdb_space_used,
    ISNULL(B.total_query_wait_time,0) total_query_wait_time,
    A.count_executions count_executions,
    A.num_plans num_plans
FROM top_other_stats A LEFT JOIN top_wait_stats B on A.query_id = B.query_id and A.query_sql_text = B.query_sql_text and A.object_id = B.object_id
WHERE A.num_plans >= 1
ORDER BY total_duration DESC
)

Now we have not 1, not 2, but THREE TOP operators! But only one of them has an ORDER BY. The results are completely different, and are pretty much useless:

a41 bad

The ugly

This has nothing to do with TOP as far as I know, but I included it just for fun:

a41 ugly

Final thoughts

All of you developers out there should watch your TOPs and make sure you’re using ORDER BY as needed. Otherwise, you might end up with annoyed end users writing blog posts about your code.

Thanks for reading!

Going Further

Are Self Joins Ever Better Than Key Lookups In SQL Server?

Posted on May 27, 2019May 14, 2022 by Erik Darling

Sorta Topical

Like most tricks, this has a specific use case, but can be quite effective when you spot it.

I’m going to assume you have a vague understanding of parameter sniffing with stored procedures going into this. If you don’t, the post may not make a lot of sense.

Or, heck, maybe it’ll give you a vague understanding of parameter sniffing in stored procedures.

One For The Money

Say I have a stored procedure that accepts a parameter called @Reputation.

The body of the procedure looks like this:

    SELECT TOP (1000) 
	        u.*
    FROM dbo.Users AS u
    WHERE u.Reputation = @Reputation
    ORDER BY u.CreationDate DESC;

In the users table, there are a lot of people with a Reputation of 1.

There are not so many with a Reputation of 2.

+------------+---------+
| Reputation | records |
+------------+---------+
|          1 | 1090043 |
|          2 |    1854 |
+------------+---------+

Two For The Slow

Data distributions like this matter. They change how SQL Server approaches coming up with an execution plan for a query.

Which indexes to use, what kind of joins to use, how to aggregate data, if the plan should be serial or parallel…

The list goes on and on.

In this case, we have a narrow-ish nonclustered index:

    CREATE INDEX whatever 
        ON dbo.Users (Reputation, Age, CreationDate);

When I run my stored procedure and look for Reputation = 2, the plan is very fast.

EXEC dbo.WORLDSTAR @Reputation = 2;

This is a great plan for a small number of rows.

When I run it for a large number of rows, it’s not nearly as fast.

EXEC dbo.WORLDSTAR @Reputation = 1;

We go from a fraction of a second to over three seconds.

This is bad parameter sniffing.

If we run it for Reputation = 1 first, we don’t have the same problem.

That’s good(ish) parameter sniffing.

Better For Everyone

Many things that prevent parameter sniffing will only give you a so-so plan. It may be better than the alternative, but it’s certainly not a “fix”.

It’s possible to get a better plan for everyone in this situation by re-writing the Key Lookup as a self join

    SELECT TOP (1000) 
	        u2.*
    FROM dbo.Users AS u
    JOIN dbo.Users AS u2
        ON u.Id = u2.Id
    WHERE u.Reputation = @Reputation
    ORDER BY u.CreationDate DESC;

The reason why is slightly complicated, but I’ll do my best to explain it simply.

Here’s what the bad parameter sniffing plan looks like for each query.

Note that the Key Lookup plan still runs for ~3 seconds, while the self-join plan runs for around half a second.

While it’s possible for Key Lookups to have Sorts introduced to optimize I/O… That doesn’t happen here.

The main difference between the two plans (aside from run time), is the position of the Sort.

In the Key Lookup plan (top), the Key Lookup between the nonclustered and clustered indexes runs to completion.

In other words, for everyone with a Reputation of 1, we go to the clustered index to retrieve the columns that aren’t part of the nonclustered index.

In the self-join plan (bottom), all rows go into the Sort, but only the 1000 come out.

Different World

The difference is more obvious when viewed with Plan Explorer.

In the Key Lookup plan, rows aren’t narrowed until the end so a seek occurs ~1mm times.

In the self-join plan, they’re eliminated directly after the Index Seek, so the join only runs for 1000 rows and produces 1000 seeks.

This doesn’t mean that Top N Sorts are bad, it just means that they may not produce the most optimal plans for Key Lookups.

When This Doesn’t Work

Without a TOP, the self-join pattern isn’t as dramatically faster, but it is about half a second better (4.3s vs. 3.8s) for the bad parameter sniffing scenario, and far less for the others.

Of course, an index change to put CreationDate as the second key column fixes the issue by removing the need to sort data at all.

    CREATE INDEX whatever --Current Index
        ON dbo.Users (Reputation, Age, CreationDate);
    GO 
    
    CREATE INDEX apathy --Better Index For This Query
	    ON dbo.Users (Reputation, CreationDate, Age);
    GO

But, you know, not everyone is able to make index changes easily, and changing the key column order can cause problems for other queries.

Thanks for reading!

Going Further

In SQL Server Stored Procedures, Parameters In TOP Are Sniffed Too

Posted on January 25, 2019January 20, 2026 by Erik Darling

Thank You For Your Business

In yesterday’s post, we looked at how row goals affected performance when data matching our join and where clauses didn’t exist.

Today we’re going to look at something similar, and perhaps a workaround to avoid the same issues.

Here’s our index setup:

CREATE INDEX whatever ON dbo.Votes(CreationDate DESC, VoteTypeId)
GO

Now we’re gonna wrap our query in a stored procedure.

CREATE OR ALTER PROCEDURE dbo.top_sniffer (@top INT, @vtid INT)
AS
BEGIN

    SELECT   TOP (@top) v.*
    FROM     dbo.Votes AS v
    WHERE    v.VoteTypeId = @vtid
    ORDER BY v.CreationDate DESC;

END;

Video Killed MTV

You know what? I don’t wanna write all this stuff. Let’s do a video.

Thanks for reading!

Video Summary

In this video, I dive into an interesting scenario involving parameterized `TOP` in SQL Server stored procedures. I explore how SQL Server handles rare data searches and the impact on query performance when using parameters like `TOP`. By creating a specific index and running various queries, I demonstrate how SQL Server’s caching mechanism can lead to suboptimal plans for rare data lookups, showcasing both the challenges and potential solutions. I also touch on some lesser-known behaviors, such as the parameter sniffing issues that arise with `OPTIMIZE FOR UNKNOWN` and dynamic SQL alternatives, providing insights into optimizing your stored procedures for better performance in real-world scenarios.

Full Transcript

Howdy folks, Erik Darling here with, well, with Erik Darling Data. And I wanted to record this as a video because I am far too lazy to blog about this today. It’s Sunday and I don’t want to do all that work screen capping and highlighting stuff and all that other things that go along with a great post. So I’m just going to record this video because I think that’s going to be a little bit easier on everyone. So what I got is something kind of funny to show you about what happens when we parameterize top. In other words, when we write a query, a stored procedure, or whatever you want to call it, and we allow people to pass a parameter like this to top like this. Now, I found some interesting stuff in this. Hopefully you find it interesting too. Now I’ve got an index here. You can ignore that little red squiggle. It’s because I already created it ahead of time because I know how much everyone hates waiting for indexes to create. And it’s on creation date descending and vote type ID. Now that is there mostly to satisfy this part of the query, my where clause on vote type ID and my order by on creation date. I know that for a bunch of the columns I’m selecting here, I’m going to have to do a key lookup. That’s not really what this is about. What I want to show you is what happens when SQL Server has to go looking for rare data. Now in the votes table, the way things break down is there are kind of a lot of rows with a vote type ID of one. There are about 3.7 million. There are far fewer with a vote type ID of four. There’s only about 733. The way that they break down in the index is fairly even. So when we look at the year of the creation date, remember that’s the leading column in the index. Oops, jumped around a little bit there. We can see that things break down fairly evenly across all of the possible years here. There’s not like any great big gaps. We have one in just about every year in the index. Everything’s pretty even there. Now when we run the plan, I’m just going to recompile this thing real quick to make sure I’m getting something fresh out the bag. And I look for the top 5,000 rows for vote type ID one. This runs fairly quickly. This runs in around about 23 milliseconds total. And it does about 15,000 logical reads. So that’s good.

And when we look at the execution plan. What I want to show you first is that when we look at the parameters that got passed in, top just like any other parameter was cached. So we have a parameter compile value of 5,000. And we also have a parameter runtime value of 5,000. If I get rid of this, and we go look at an execution where we have a 1 in as top, exactly what you think is going to happen happens. We hit F4. We go look at the plan properties. Got to jump around a little bit to get to the right one. And when I look at the parameter list. Now I have a compile time value of 5,000. And a runtime value of 1. So just like any other parameter in a store procedure in SQL Server, it gets cached when we run the query the first time.

And then it gets reused when we run the query subsequent times. Now, obviously, this could backfire if we recompile here, and we run the plan looking for a top one first. And then the next execution, some ding dong comes along and looks for a million rows. This is the obvious one. We’re going to reuse that tiny little plan that SQL Server came up with to find one row, and we’re going to use it to find a million rows. That’s the obvious bad part of parameter sniffing. The kind of less bad part, or the kind of more interesting bad part, is what happens when we look for data that is a little bit harder to find, that rare data. So now I’m going to look for a top one for the vote type ID of 4. Remember, this is the one that only has about 773 rows in the table. But when I look for top one, it’s fast.

Because it finds a row really early in the index that it can give back. Again, this is going to be early in the index, it’s order descending. So it finds this row from 2013 and says, good, we’re done. If I say I want the top 5,000 where vote type ID equals 4, this is going to be a little bit less snappy. Now we’re going to spend about two seconds, or actually close to three seconds, scanning the entire index and looking for data.

You can see that we spent an elapsed 2,700 milliseconds, close enough to three seconds for me. And we did, let’s see, that’s 146,534 reads. So we had to scan that entire index looking for that kind of spread out, kind of rare vote type ID for.

We had to start at the beginning, which is going to be the year 2013. And we had to go all the way to the end, which is 2008. So that’s one example.

Now what’s kind of interesting that I found is if I recompile things, if I clear out the plan for that store procedure, and I look for the top 5,000 first. Now even though I know that 5,000 rows don’t exist, that’s not really an issue here because the top will exit when the nested loops join says, I don’t have any more rows to give you. The top says, okay, well I wanted 5,000, thanks a lot.

So if we run this first, the plan changes a little. And it changes in that it goes parallel now. SQL Server says, I think I might have to spend a little bit more time reading this. I’m actually going to go parallel.

I’m going to use multiple threads to find these rows because they’re spread out everywhere. So that’s kind of a cool thing that SQL Server knew to do. And then when I rerun, of course, looking for a top 1, it’s going to reuse that parallel plan. And things are going to be fairly snappy still, right? It’s still okay. We’re still running in under a sec.

We’re still pretty well under a second there. Now, something else interesting that I came across when I was messing with stuff is that if you use optimize for unknown and you use a variable for top, you get kind of a funky guess just like you do with most other unknown guesses. Now, what I’ve done for this run of the store procedure is I’ve quoted out the vote type ID part of the where clause because I don’t want you thinking that the row estimate is coming from the vote type ID that I’ve chosen to pass in.

That’s not where it’s coming from. If I recompile the query without that where clause in it, and just for good luck, I’m going to make sure it’s extra double recompiled. I’m going to run this now looking for the top 5,000 rows.

And when I go look at the execution plan, rather than guessing 5,000, my top has a guess of 100. So when you use optimize for unknown with a parameterized top, you always get this guess of 100. I tried lots of different numbers, and the only time that it was never not 100 is if I recompile it through a recompile hint on that.

So that was a little funny. Anyway, if you use top in a parameterized sort of way, there are some things you might want to explore if you’re running into problems. If you’re recompiling the plan, if unparameterized or even non-parameterized, dynamic SQL might help.

Now, I say that, and I know people are going to crawl out of their skin because unparameterized dynamic SQL can lead to SQL injection attacks. I know that it’s a lesson to a degree because you would be using an integer with top, and likely just using a function like rtrim or something to produce a stringified version of the number to concatenate into the query.

But the risk is still there, so be careful. You might want to say, okay, if I know that under certain conditions I want a different plan for things, do I need to separate code into different store procedures?

Or do I have an optimized for a value? Like if I, rather than doing optimized for unknown, can I optimize for a value like 5,000, 10,000, 2.1 billion, whatever Adam Mechanics phone number is, and get a better plan that way? Do I need to change indexes?

Like maybe the order of the key of the index has been appropriate for the query? Do I need to have some included columns to get rid of that key lookup? Do I need to make an adjustment there? So there’s a lot of stuff that you could look at as ways to solve it, but this is just to let you know that that issue exists, and maybe if you keep reading my blog posts and stuff like that, maybe we’ll talk about all the ways that you can fix it coming up.

Anyway, I hope you enjoyed yourselves. I hope you learned some stuff. I mean, who knows? Maybe you didn’t. If you’re smart, you didn’t learn anything. I hope you were at least entertained if you didn’t learn anything. Anyway, I am Erik Darling with Erik Darling Data, and thanks for watching.

Bye.