Function to Generate a Random Data Set

Often I find myself needing data sets to try functions and code out on or for teaching purposes.  I have a few stand-bys such as the mtcars and CO2 data sets in the base packages of R but sometimes I need a long format data set or a bunch of categorical or a bunch of numeric or repeated measures or I want it to have missing values to test the function and I spend valuable time searching for the correct data set.  About a year ago my answer was to have a file with several data sets I knew could fit various situations but eventually I grew tired of the pain of loading a data set each time I needed to test something and created a randomly generated data set function with categorical, numeric, interval, and repeated measures data.  I recently extended the data set to contain optional missing values, long or wide format, and proportion data and attempted to give it some speed boosts for creating larger data sets.  It generally suits my needs and I think can probably serve others too.

The main function, DFgen, relies on two helper functions, props and NAins.  I do not place these helper functions inside of DFgen itself as they have useful properties in and of themselves.  I’ll briefly explain each function, provide the code, and give a few tests to try it out.

The props Function

The props function generates a data frame of proportions whose rows sum to 1.  It takes two arguments and an optional var.names argument.  The first two arguments are the dimensions of the dataframe and are pretty self explanatory.  The final argument optionally names the columns otherwise they are named X1..Xn.  One note on this function is that for many columns it is a poorer choice.  For a slower props function but better for numerous columns Dason of talkstats.com provides an alternative (LINK).

#############################################################
# function to generate random proportions whose rowSums = 1 #
#############################################################
props <- function(ncol, nrow, var.names=NULL){
    if (ncol < 2) stop("ncol must be greater than 1")
    p <- function(n){
        y <- 0
        z <- sapply(seq_len(n-1), function(i) {
                x <- sample(seq(0, 1-y, by=.01), 1)
                y <<- y + x
                return(x)
            }
        )
        w <- c(z , 1-sum(z))
        return(w)
    }
    DF <- data.frame(t(replicate(nrow, p(n=ncol))))
    if (!is.null(var.names)) colnames(DF) <- var.names
    return(DF)
}
##############
# TRY IT OUT #
##############
props(ncol=5, nrow=5)                                      
props(ncol=3, nrow=25)                                     
props(ncol=3, nrow=5, var.names=c("red", "blue", "green"))

The NAins Function

The NAins function takes a data frame and randomly inserts a certain proportion of missing (NA) values.  The function has two arguments: df which is the dataframe and prop which is the proportion of NA values to be inserted into the data frame (default is .1),

Special thanks again to Dason of talk.stats.com for helping with a speed boost with this function.  This function consumes considerable time in DFgen and he provided the code to really gain some speed.

################################################################
# RANDOMLY INSERT A CERTAIN PROPORTION OF NAs INTO A DATAFRAME #
################################################################
NAins <-  NAinsert <- function(df, prop = .1){
    n <- nrow(df)
    m <- ncol(df)
    num.to.na <- ceiling(prop*n*m)
    id <- sample(0:(m*n-1), num.to.na, replace = FALSE)
    rows <- id %/% m + 1
    cols <- id %% m + 1
    sapply(seq(num.to.na), function(x){
            df[rows[x], cols[x]] <<- NA
        }
    )
    return(df)
}
##############
# TRY IT OUT #
##############
NAins(mtcars, .1)

The DFgen Function

The DFgen function randomly generates an n-lenght data set with predefined variables.  The default DFgen() with no arguments specified will produce the following n=10 data set:

> set.seed(10)
> DFgen()
      id   group hs.grad  race gender age m.status   political n.kids income score time1 time2 time3
1   ID.1   treat     yes white   male  19    never  republican      1 111000 -1.24 51.39 52.15 53.76
2   ID.2 control     yes black   male  30 divorced independent      0 122000 -0.46 32.21 35.07 33.10
3   ID.3 control     yes white   male  32  married  republican      1   2000 -0.83 43.36 45.46 46.22
4   ID.4   treat      no white   male  30 divorced  republican      1  65000  0.34 71.63 72.06 74.49
5   ID.5 control     yes white female  18  married  republican      3  96000  1.07  9.26 12.24 11.02
6   ID.6   treat     yes asian female  30  married independent      3 135000  1.22 24.10 26.45 24.74
7   ID.7   treat     yes white female  26    never    democrat      5  16000  0.74 28.76 31.72 31.39
8   ID.8   treat     yes white   male  40  married  republican      1 113000 -0.48 28.24 29.10 37.12
9   ID.9   treat     yes white   male  23  married independent      2  80000  0.56 62.99 65.09 67.72
10 ID.10   treat      no asian   male  22  married    democrat      1  96000 -1.25 43.74 46.79 44.04

The function also takes optional:

  • type argument (default “wide” or “long”)
  • na.rate (a decimal value between 0 and 1; default is 0) that randomly inserts missing data (great for teaching demos and testing corner cases)
  • prop argument (takes TRUE or default FALSE )
  • digits that controls the number of degits (default is 2)
############################################################
# GENERATE A RANDOM DATA SET.  CAN BE SET TO LONG OR WIDE. #
# DATA SET HAS FACTORS AND NUMERIC VARIABLES AND CAN       #
# OPTIONALLY GIVE BUDGET EXPENDITURES AS A PROPORTION.     #
# CAN ALSO TELL A PROPORTION OF CELLS TO BE MISSING VALUES #
############################################################
# NOTE RELIES ON THE props FUNCTION AND THE NAins FUNCTION #
############################################################
DFgen <- DFmaker <- function(n=10, type=wide, digits=2, 
    proportion=FALSE, na.rate=0) {

    rownamer <- function(dataframe){
        x <- as.data.frame(dataframe)
        rownames(x) <- NULL
        return(x)
    }

    dfround <- function(dataframe, digits = 0){
      df <- dataframe
      df[,sapply(df, is.numeric)] <-round(df[,sapply(df, is.numeric)], digits) 
      return(df)
    }

    TYPE <- as.character(substitute(type))
    time1 <- sample(1:100, n, replace = TRUE) + abs(rnorm(n))
    DF <- data.frame(id = paste0("ID.", 1:n), 
        group= sample(c("control", "treat"), n, replace = TRUE),
        hs.grad = sample(c("yes", "no"), n, replace = TRUE), 
        race = sample(c("black", "white", "asian"), n, 
            replace = TRUE, prob=c(.25, .5, .25)), 
        gender = sample(c("male", "female"), n, replace = TRUE), 
        age = sample(18:40, n, replace = TRUE),
        m.status = sample(c("never", "married", "divorced", "widowed"), 
            n, replace = TRUE, prob=c(.25, .4, .3, .05)), 
        political = sample(c("democrat", "republican", 
            "independent", "other"), n, replace= TRUE, 
            prob=c(.35, .35, .20, .1)),
        n.kids = rpois(n, 1.5), 
        income = sample(c(seq(0, 30000, by=1000), 
            seq(0, 150000, by=1000)), n, replace=TRUE),
        score = rnorm(n), 
        time1, 
        time2 = c(time1 + 2 * abs(rnorm(n))), 
        time3 = c(time1 + (4 * abs(rnorm(n)))))
    if (proportion) {
        DF <- cbind (DF[, 1:10], 
            props(ncol=3, nrow=n, var.names=c("food", 
                "housing", "other")),
            DF[, 11:14])
    }
    if (na.rate!=0) {  
        DF <- cbind(DF[, 1, drop=FALSE], NAins(DF[, -1], 
            prop=na.rate))
    }
    DF <- switch(TYPE, 
        wide = DF, 
        long = {DF <- reshape(DF, direction = "long", idvar = "id",
                varying = c("time1","time2", "time3"),
                v.names = c("value"),
                timevar = "time", times = c("time1", "time2", "time3"))
            rownamer(DF)}, 
        stop("Invalid Data \"type\""))
    return(dfround(DF, digits=digits))
}
##############
# TRY IT OUT #
##############
DFgen()            
DFgen(type="long") 
DFmaker(20000)     
DFgen(prop=T)      
DFgen(na.rate=.3)
NOTE: This function relies on R.2.15.  If you don’t want to update R you must include a paste0 function found in the link below.



Click here for a .txt version of this demonstration

Posted in data, data generation | Tagged , , , , , , , | 12 Comments

The Need for paste2 (part III)

Final installment:  Part III of a multi part blog on the paste2 function…

In my first post on the paste2 function I promised proof of a few practical uses.  In Part II of this series we looked at using paste2 as a convenient way of making a look up table to look up ability scores from the ltm package.  In this post we’ll be looking at two recent uses of paste as a convenient helper function.

Use 1:

Recently on stackoverflow.com  a poster asked about determining what values in a data set  (data set A) are not also found in data set B.  Not much info was given and debate ensued about mixed mode data sets, speed and safety etc.

Here’s a data set similar to the one posters proposed as the challenge (Give it a whack and see if you can solve it) :

A <- data.frame(x = 1:8, y = as.character(1:8))
B <- data.frame(x = c(4:1, 6), y = as.character(c(4:1, 6)))

If you got a way please share, I’d be interested. Now be warned paste2 isn’t the fastest but it’s the safest method that was proposed.

#############################################################
# FIRST LOAD THE paste2 FUNCTION AND CONSTRUCT THE DATA SET #
#############################################################
load(url("http://dl.dropbox.com/u/61803503/paste2demo2.RData"))

A <- data.frame(x = 1:8, y = as.character(1:8))
B <- data.frame(x = c(4:1, 6), y = as.character(c(4:1, 6)))

#####################################################
# LET'S LOOK FIRST AT SOME EXAMPLES THAT DON'T WORK #
#####################################################
##############################
# only works on numeric data #
##############################
A[! data.frame(t(A)) %in% data.frame(t(B)), ] 

##########################################################
# relies on rownames being the same which the may not be #
# as in this case : (                                    #
##########################################################
A[- as.integer(rownames(B)),]

################################################
# safe paste2 way (handles character/numeric & #
# no relieance on row names                    #
################################################
A[!paste2(A)%in%paste2(B), ]

Use 2:

Recently I’ve been working with error bars on count data and inquired on talkstats.com  about used the paste2 function as a helper function to accomplish the task of finding standard errors for count data.  Here’s the function with paste2 in it.  It’s probably not necessary but makes life a lot easier.

SEcount <- function(dat, confidence = .95, se.dig=3, intv.digs=2){
    if (is.list(dat)) dat <- ftable(cbind(dat))
    dat2 <- as.data.frame(dat)
    len <- ncol(dat2)
    if (any(names(dat2) %in% "group")) {
        f <- names(dat2) %in% "group"
        name <- paste0("orig.", names(dat2) [f])
        names(dat2)[f] <- name
    }
    #HERE'S THE paste2 FUNCTION USE
    dat2$group <- factor(paste2(dat2[, 1:(len-1)], sep=":"))
    dat2 <- dat2[order(dat2$group), c(len + 1, 1:len)]
    dat2 <- dat2[dat2$Freq!=0 ,]
    se <- summary(glm(Freq ~  group -1, data=dat2, 
       family=poisson))[["coefficients"]][, 2]
    est <- summary(glm(Freq ~  group -1, data=dat2, 
       family=poisson))[["coefficients"]][, 1]
    dat2$SE <- round(summary(glm(Freq ~  group -1, data=dat2, 
       family=poisson))[["coefficients"]][, 2], digits=se.dig)
    n.SE <- qnorm(1-(1-confidence)/2)
    dat2$minus_1_SE <- round(dat2$Freq - se, digits=intv.digs)
    dat2$plus_1_SE <- round(dat2$Freq + se, digits=intv.digs)
    if (!is.null(confidence)){
        dat2$lower <- round(exp(est - n.SE*se), digits=intv.digs)
        dat2$upper <- round(exp(est + n.SE*se), digits=intv.digs)
    }
    rownames(dat2) <- 1:nrow(dat2)
    return(dat2)
}
###############
# TEST IT OUT #
###############
SEcount(mtcars[, c("cyl", "gear", "carb")])
SEcount(mtcars[, 8:11])
x <- ftable(mtcars[, 8:10])
SEcount(x) #takes ftables too

I hope you learned something about paste2 and may add it to your .Rprofile.  It’s not always the fastest choice but can be very convenient and easy to use.

Click here for a .txt version of this script used here

Posted in paste | Tagged , , , , , | Leave a comment

The Need for paste2 (part II)

This is Part II of a multi part blog on the paste2 function…

In my first post on the paste2 function I promised a proof of a few practical uses.  The first example I have comes from psychometrics and comes out of a need left behind by ltm’s factor.score function.  You can fit a survey data set with an IRT model and the ltm package will allow you to figure out ability estimates for people achieving an exact combination of binary scores.  So for instance on a 6 question test a person may have the following score:

item1 item2 item3 item4 item5 item6 
    0     0     1     1     1     0

…meaning they got questions 1 and 2 wrong, 3-5 correct and question 6 incorrect.  A total score of 3.  Now the IRT model says that that particular combination has a particular ability estimate and a person with 3 different correct responses may not have the same ability score. The problem is factor.score gives the estimates but doesn’t tell you the ability scores for specific observations in the data set.

paste2 to the rescue!

Solution: Create a function, based on paste2, that pastes together all the observations’ scores as well as the estimates from the factor.scores function used after fitting a 1PL or 2PL model.  Now we have two character strings that look something like this: "001110"   Then we can easily use some  slick indexing and we have a basic look up table of ability scores to assign to each observation.  Here’s the script that does that:

########################
# THE ABILITY FUNCTION #
########################
ability <- function(dataset, items.index, fact.score, digits = 3, full = TRUE){
    SD <- fact.score$score.dat
    nc <- ncol(SD)
    ncd <- ncol(dataset)
    IT <- SD[, -c((nc-4):nc)]
    SD$strata <- as.factor(paste2(IT)) #the 1st paste2
    dataset$strata <- as.factor(paste2(dataset[, items.index])) #the 2nd paste2
    key <- c(SD$z1);names(key) <- levels(SD$strata) 
    DF <- transform(dataset, ability=round(key[strata], digits = digits))
    DF$strata <- NULL 
    if (full){ 
        return(DF)
    } else {
        return(DF$ability)
    }
}
#===========================================================================
##############
# TRY IT OUT #
##############
#This loads the paste2 function and a binary tests data set called dat
load(url("http://dl.dropbox.com/u/61803503/paste2demo1.RData"))
library(ltm)

#####################################################
# FIT A 1PL AND 2PL MODEL AND GET ABILITY ESTIMATES #
#####################################################
mod.2PL <- ltm(dat[, -c(1, 22)]~z1)    #fit a 2PL model
(FS2PL <- ltm::factor.scores(mod.2PL)) #obtain the ability scores

mod.1PLcon <- rasch(dat[, -c(1, 22)],  #fit a 2PL model
    constraint = cbind(ncol(dat[, -c(1, 22)]) + 1, 1))
(FS1PLcon <- ltm::factor.scores(mod.1PLcon)) #obtain the ability scores
##############
# HERE IT IS #
##############
###############################################################################
# `dataset` -     This is the original data set you fit the model to.         #
#                 May have variables other than items.                        #
#                                                                             #
# `items.index` - This the item column numbers.  Can be supplied with         #
#                 the colon operator as in 2:21 or as a vector with concate   #
#                 as in c(1, 2, 3, 4, 5, 6, 7, 8, 9, ..., 21)                 #
#                                                                             #
# `fact.score`  - This is the object you assigned the factor.score outcome to #
###############################################################################
ability(dataset = dat, items.index = 2:21, fact.score = FS2PL)  #for the 2PL
(ab.1PL <- ability(dat, 2:21, FS1PLcon))                        #for the 1PL
head(ab.1PL)
######################################
# TO DO IT COMBINED; multiple models #
######################################
FSlist <- list(FS2PL, FS1PLcon)
LIST <- lapply(FSlist, function(x) ability(dat, 2:21, x, full=F))
new2 <- data.frame(dat, do.call('cbind', LIST))
names(new2)[names(new2)%in%c("X1", "X2")] <- c("ability_2PL", "ability_1PL")
head(new2)

In part III of this series we’ll explore a few more practical uses of the paste2 function!

Click here for a .txt version of this script used here

Posted in paste | Tagged , , , , , , | Leave a comment

The Need for paste2 (part I)

This is Part I of a multi part blog on the paste2 function…

I recently generated a new paste function that takes an unspecified list of equal length variables (a column) or multiple columns of a data frame  and pastes them together.  First let me thank Dason of Talk Stats for his help in this post that led to the creation of the paste2 function.  First let me convince you of the need for a paste2 function by showing you where the original paste falls short.  Then I’ll introduce to the function and some basics of what it can do.  In Part II of this paste2 blog series I’ll show you a few practical applications I’ve already encountered.

The main idea behind this function is the need to pass an unknown number of columns from a data frame or list and paste them together to generate an uber column that contains all the information of the original columns.  You may say well I think paste already does that.  Not in it’s home grown state it doesn’t.  What’s that prove it?  OK.  Try the following:

paste(CO2[, 1:3], sep=".")                            #1
paste(CO2[, 1:3], collapse=".")                       #2
paste(CO2[,1], CO2[, 2], CO2[, 3], sep=".")           #3
paste(list(CO2[,1], CO2[, 2], CO2[, 3]), sep=".")     #4

What do you get?  Well the third use of paste is the only one that results in pasting the columns together.  Why?  Because we specified the columns being passed to paste and paste is our friend.  If we try a sneak attack with an index of columns paste becomes scared and returns gobly gook.  If you try to be nice and give paste a list of columns he again gives gobly gook.  So what’s the need for pasting an unknown number of columns (either as an indexed data frame or as a list) together?  Often in functions the number of columns passed to paste can’t be specified in advance, hence out problem (I’ll show you more of those specific applications in Part II).

paste2 <- function(multi.columns, sep=".", handle.na=TRUE, trim=TRUE){
    if (trim) multi.columns <- lapply(multi.columns, function(x) {
            gsub("^\\s+|\\s+$", "", x)
        }
    )
    if (!is.data.frame(multi.columns) & is.list(multi.columns)) {
        multi.columns <- do.call('cbind', multi.columns)
      }
    m <- if(handle.na){
                 apply(multi.columns, 1, function(x){
                     if (any(is.na(x))){
                         NA
                     } else {
                         paste(x, collapse = sep)
                     }
                 }
             )   
         } else {
          apply(multi.columns, 1, paste, collapse = sep)
    }
    names(m) <- NULL
    return(m)
}

Now let’s see it in action:

paste2(CO2[, 1:3], sep=".")
paste2(CO2[, 1:3], sep=":")
paste2(list(CO2[,1], CO2[, 2], CO2[, 3]))
#shoot we can paste the whole data set if we want
paste2(CO2)
paste2(mtcars)

In Part II we’ll explore some practical uses of this new function!

Click HERE for a link to a .txt version of paste2

Posted in paste | Tagged , , , | 1 Comment

microbenchmarking with R

I love to benchmark.  Maybe I’m a bit weird but I love to bench  everything in R.  Recently I’ve had people raise accuracy challenges to the typical system.time and rbenchmark package approaches to benchmarking.  I saw Hadley Wickham promoting the package microbenchmarking and decided to give it a whirl.  This approach claims to improve accuracy and adjusts to your OS.  A nice box plot or a ggplot of the functions output can also aid in understanding and comparing functions.  Here’s a demo test:

library(microbenchmark); library(plyr) 
op <- microbenchmark(
    PLYR=ddply(mtcars, .(cyl, gear), summarise, 
        output = mean(hp)),
    AGGR=aggregate(hp ~ cyl + gear, mtcars, mean),
    TAPPLY = tapply(mtcars$hp, interaction(mtcars$cyl, 
        mtcars$gear), mean),
times=1000L)

print(op) #standard data frame of the output
boxplot(op) #boxplot of output
library(ggplot2) #nice log plot of the output
qplot(y=time, data=op, colour=expr) + scale_y_log10()

The output to the console window using print(op) yields like this:

Unit: milliseconds
    expr      min       lq   median       uq       max
1   AGGR 2.856758 2.972932 3.121999  3.48615 121.49828
2   PLYR 7.880229 8.497956 8.983880 10.71436 139.04940
3 TAPPLY 1.108085 1.159873 1.196731  1.30824  67.33326

The ggplot log plot from the output:

ggplot2 Plot of the Output

The boxplot from output:

Box Plot of the Output

Posted in benchmark | Tagged , , , | 7 Comments