Recently I was faced with the need to generate lots of data for a presentation. There is a temptation in situations like this to grab some data from a production system and obfuscate it somehow. My advice to you is DON’T DO IT! I know you think your obfuscation algorithm will render the data completely safe but security is a tricky thing. Why expose your data to people if you don’t have to? (There may also be legal requirements or corporate requirements that you not do this.)
Anyway, it turns out generating random data is not that hard. There are some gotchas. Brian Knight points out that simply relying on the RAND() function will get you in trouble because it only provides on number per result set. He does supply a work around to this using the NEWID() function. He also points to another solution using a view and a UDF that he attributes it Itzik Ben-Gan. We’ll be using both of these techniques in this post.
Let’s say for your demo you want a table of project start dates for a contracting company. You want a GUID for the project ID, DATETIMEs for the start time and end time of the project and an INT which is the ID for your hypothetical document management system.
Here is the schema
if object_id(N'tblProjectDates') is not null
drop table tblProjectDates
create table tblProjectDates (
project_id uniqueidentifier not null
,start_dt datetime
,end_dt datetime
,document_id int
)
We’ll also need a “numbers” table. (For more on these sorts of tables see this article or this article.
Here is a slightly modified version of the code from the simple-talk.com article:
if object_id(N'tblNumbers') is not null
drop table tblNumbers
BEGIN
CREATE TABLE [dbo].tblNumbers(
[number] [int],
CONSTRAINT [Index_Numbers] PRIMARY KEY CLUSTERED
(
[number] ASC
) ON [PRIMARY]
) ON [PRIMARY]
DECLARE @ii INT
SELECT @ii=1
WHILE (@ii<=10000)
BEGIN
INSERT INTO tblNumbers(NUMBER) SELECT @II
SELECT @II=@II+1
END
END
So, the first thing we need is the GUIDs. This is the easy part.
This query will give us 10000 GUIDs
select n.number, NEWID() as project_id
from tblNumbers n
Now we need to generate dates in the past for our start date. We’ll need the code from Brian Knight’s article:
if object_id(N'VRand') is not null
drop view VRand
go
CREATE VIEW VRand
AS
SELECT RAND() AS rnd
GO
if object_id(N'dbo.fn_row_rand') is not null
drop function dbo.fn_row_rand
go
CREATE FUNCTION dbo.fn_row_rand() RETURNS FLOAT
AS
BEGIN
RETURN (SELECT rnd FROM VRand)
END
go
I’m going to create a variable called @date_seed and use that to make the dates look nicer instead of calling GETDATE() in the body of the CTE
So, here is our insert statement:
declare @date_seed datetime
select @date_seed = convert(datetime,convert(nvarchar,getdate(),101))
;with start_dt as(
select top 10000 dateadd(d,0-floor(dbo.fn_row_rand() * 3651),@date_seed) as start_dt
,ROW_NUMBER() OVER(order by (select null)) as rn
from tblNumbers n
order by NEWID()
)
insert into tblProjectDates
(project_id, start_dt)
select NEWID() as project_id, s_d.start_dt
from tblNumbers n
inner join start_dt s_d
on n.number = s_d.rn
Let’s inspect the data:
select * from tblProjectDates
We get project IDs and start dates. (All the start dates are for the last 10 years. That’s where the 3651 number above comes into play.)
Now, we need end dates and those have to be in the future. So, we’ll just use DATEADD and the same fn_row_rand function to get those. Except we’ll be DATEADDing to the start_dt instead of GETDATE(). And we won’t use the “0 – ” since we want the end dates to be after the start dates. Also, we’ll convert all future dates to be NULLs and assume that means the project is on going if the end date is unknown. (You could make up some kind of place holder if you don’t want NULLs.) Lastly, we’ll assume that no project lasts more than 150 days so we’ll multiple the random number by 150.
Here’s the update for end_dt
;with end_dt as(
select pd.project_id
, dateadd(d,floor(dbo.fn_row_rand() * 150),pd.start_dt) as end_dt
from tblProjectDates pd
)
update tblProjectDates
set end_dt = case when e_d.end_dt <= getdate() then e_d.end_dt else null end
from tblProjectDates pd
inner join end_dt e_d
on pd.project_id = e_d.project_Id
And here is how our data looks:
select * from tblProjectDates
At this point I’m sure you can see how generating a random INT is trivial. So let’s wrap this up with the final update:
update tblProjectDates
set document_id = dbo.fn_row_rand() * 10000
And here is our data:
select * from tblProjectDates
I’m hoping to talk about generating random strings for names, etc. Obviously you could just cast a NEWID as an NVARCHAR to get a string but they don’t look much like names. I don’t know if there is a quick way to do it but I have some ideas. If anything looks worth sharing I’ll try and get it on here.
Any thoughts on this? Please leave a comment or email me!