This is the R script repository of the "Tools for Analytics Lab - R-track" course, part of the MSc in Business Analytics at CEU.
- Jan 25: Introduction to R and General Programming
- Jan 26: First Steps with Data Visualization
- Jan 27: Data Preparation
- Jan 28: Introduction to modeling
- Jan 30: Modeling & introduction to ML methods on qualitative data
- Sample exam questions
- Feb 01: Random forest and GBM with H2O
- Feb 02: Exam
- Feb 03: Dynamic Reports and Reproducible Research
- Feb 04: Interactive Data Analysis and Dashboards
- General overview of the R ecosystem: slides
- Introduction to R: variables, functions and vectors
Jan 26 (90 min): First Steps with Data Visualization
- Introducing `data.frame`
- Exploratory data analysis with histogram, boxplot, bar chart and scatterplot
- Plots outside of Excel: `dotchart` and `vioplot` examples
- The Grammar of Graphics in R with `ggplot2`
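
A minimal sketch of the plot types listed above, using the built-in `mtcars` dataset as a stand-in. The dataset and the plotted variables are assumptions for illustration, not necessarily what was shown in class, and `vioplot` is left out because it needs an extra package.

```r
# Base R plots on the built-in mtcars data.frame (assumed example data)
str(mtcars)                                      # inspect the structure of the data.frame

hist(mtcars$hp)                                  # histogram of horsepower
boxplot(hp ~ cyl, data = mtcars)                 # horsepower by number of cylinders
barplot(table(mtcars$gear))                      # bar chart of the number of gears
plot(mtcars$wt, mtcars$hp)                       # scatterplot of weight and horsepower
dotchart(mtcars$mpg, labels = rownames(mtcars))  # dot chart of miles per gallon per car

# The same scatterplot built with the Grammar of Graphics
library(ggplot2)
p <- ggplot(mtcars, aes(x = wt, y = hp)) + geom_point()
print(p)
```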
Jan 27 (140 min): Data Preparation
- `ggplot2` exercises
    - number of carburetors
    - horsepower
    - barplot of number of carburetors per transmission
    - boxplot of horsepower by the number of carburetors
    - horsepower and weight by the number of carburetors
    - horsepower and weight by the number of carburetors with a trend line
- Filtering and summarizing data with base R
- Intro to `data.table`
- `data.table` exercises with `hflights`
    - the number of cancelled flights
    - the shortest flight on each weekday
    - the average delay across all destinations
    - the average delay per destination
    - plot the departure and arrival delays
    - plot the average departure and arrival delays per destination
    - plot the average departure and arrival delays per flight + size
    - estimate the delay to Budapest
- Some quick examples on string and date manipulations
- Left joins
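
One possible starting point for the `hflights` exercises above, assuming the `hflights` and `data.table` packages are installed. Treating "shortest" as the smallest elapsed time and "delay" as arrival delay are assumptions; the exercises do not specify either.

```r
library(data.table)
library(hflights)   # assumed to be installed from CRAN

dt <- data.table(hflights)

# number of cancelled flights
n_cancelled <- dt[Cancelled == 1, .N]

# shortest flight on each weekday (assumption: measured by elapsed time)
shortest <- dt[!is.na(ActualElapsedTime),
               .(shortest = min(ActualElapsedTime)),
               by = DayOfWeek][order(DayOfWeek)]

# average arrival delay over all flights (assumption: arrival delay is meant)
overall_delay <- dt[, mean(ArrDelay, na.rm = TRUE)]

# average arrival and departure delay per destination, with the number of flights
delay_by_dest <- dt[, .(avg_arr_delay = mean(ArrDelay, na.rm = TRUE),
                        avg_dep_delay = mean(DepDelay, na.rm = TRUE),
                        flights       = .N),
                    by = Dest]
```

The `flights` column in `delay_by_dest` can later be mapped to point size when plotting the average delays.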
Jan 28 (140 min): Models
- Revisiting GitHub integration in RStudio:
    - Install git from https://git-scm.com/
    - Install R from https://www.r-project.org/
    - Install RStudio from https://www.rstudio.com/products/RStudio/#Desktop
    - Verify that in RStudio, you can see the path of the `git` executable binary in the Tools/Global Options menu's "Git/Svn" tab. If not, then you might have to restart RStudio (if you installed git after starting RStudio), or git might have been installed without adding it to the PATH on Windows. Either way, browse to the "git executable" manually (in some `bin` folder, look for the `git` executable file).
    - Create an RSA key (optionally with a passphrase for increased security, which you then have to enter every time you push to or pull from GitHub). Copy the public key and add it to the SSH keys on your GitHub profile.
    - Create a new project choosing "version control", then "git", and paste the SSH version of the repo URL copied from GitHub in the pop-up. RStudio should now be able to download the repo. If it asks you to accept GitHub's fingerprint, say "Yes".
    - If RStudio/git complains that you have to set your identity, click on the "Git" tab in the top-right panel, then click on the Gear icon and then "Shell". Here you can set your username and e-mail address in the command line, so that the RStudio/git integration can work. Use the following commands:

          $ git config --global user.name "Your Name"
          $ git config --global user.email "Your e-mail address"

      Close this window, commit, push changes, all set.
- `data.table` exercises with the following dataset:

      library(data.table)
      set.seed(42)
      tx <- data.table(
          item   = sample(letters[1:3], 10, replace = TRUE),
          time   = as.POSIXct(as.Date('2016-01-01')) - runif(10) * 36*60^2,
          amount = rpois(10, 25))
      prices <- data.table(
          item  = letters[1:3],
          date  = as.Date('2016-01-01') - 1:2,
          price = as.vector(outer(c(100, 200, 300), c(1, 1.2))))
      items <- data.table(
          item   = letters[1:3],
          color  = c('red', 'white', 'red'),
          weight = c(2, 4, 2.5))

    - filter for transactions with "b" items
    - filter for transactions with less than 25 items
    - filter for transactions with less than 25 "b" items
    - count the number of transactions for each item
    - count the number of transactions for each day
    - count the overall number of items sold on each day
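
A hedged sketch of how the transaction exercises above might be solved with `data.table`. The `tx` table is recreated from the dataset definition so the block runs on its own; the helper column `day` and the result names are made up for illustration.

```r
library(data.table)

# recreate the transactions table from the exercise description
set.seed(42)
tx <- data.table(
    item   = sample(letters[1:3], 10, replace = TRUE),
    time   = as.POSIXct(as.Date('2016-01-01')) - runif(10) * 36*60^2,
    amount = rpois(10, 25))

b_only  <- tx[item == 'b']                  # transactions with "b" items
small   <- tx[amount < 25]                  # transactions with less than 25 items
b_small <- tx[item == 'b' & amount < 25]    # both conditions at once

n_by_item <- tx[, .N, by = item]            # number of transactions per item

tx[, day := as.Date(time)]                  # hypothetical helper column: calendar day (UTC)
n_by_day     <- tx[, .N, by = day]                        # transactions per day
items_by_day <- tx[, .(amount = sum(amount)), by = day]   # items sold per day
```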
- Further `data.table` examples on
    - left joins
    - transforming wide and long tables with `reshape2`
    - rolling and overlap joins
- ANOVA
- Crosstable
- Simpson's paradox
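
As a rough illustration of the left and rolling joins mentioned above, here is a sketch on the `tx`, `prices` and `items` tables from the earlier exercise (recreated here so the block is self-contained). The `date` and `value` columns are hypothetical helpers, and this is only one of several ways to write these joins.

```r
library(data.table)

set.seed(42)
tx <- data.table(
    item   = sample(letters[1:3], 10, replace = TRUE),
    time   = as.POSIXct(as.Date('2016-01-01')) - runif(10) * 36*60^2,
    amount = rpois(10, 25))
prices <- data.table(
    item  = letters[1:3],
    date  = as.Date('2016-01-01') - 1:2,
    price = as.vector(outer(c(100, 200, 300), c(1, 1.2))))
items <- data.table(
    item   = letters[1:3],
    color  = c('red', 'white', 'red'),
    weight = c(2, 4, 2.5))

# left join: keep every transaction and attach the item attributes
tx_items <- merge(tx, items, by = 'item', all.x = TRUE)

# rolling join: for each transaction, take the price of the same item on that
# day, or the most recent earlier price if there is no exact match
tx[, date := as.Date(time)]
tx_priced <- prices[tx, on = .(item, date), roll = TRUE]
tx_priced[, value := amount * price]
```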
Datasets for the model examples:
Further data.table exercises on the nycflights13 dataset to practice for the exam:
- count the number of flights to LAX
- count the number of flights to LAX from JFK
- compute the average delay (in minutes) for flights from JFK to LAX
- which destination has the lowest average delay from JFK?
- plot the average delay to all destinations from JFK
- plot the distribution of all flight delays to all destinations from JFK
- compute a new variable in flights showing the day of the week
- plot the number of flights per weekday
- create a heatmap on the number of flights per weekday and hour of the day (see `geom_tile`)
- merge the `airports` dataset to `flights` on the FAA airport code
- order the `weather` dataset by `year`, `month`, `day` and `hour`
- plot the average temperature at noon in `EWR` for each month based on the `weather` dataset
- aggregate the `weather` dataset and store as `daily_temperatures` to show the daily average temperatures based on the `EWR` records
- merge the `daily_temperatures` dataset to `flights` on the date
- do the above two steps on daily + hourly temperature averages
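
A possible approach to a few of the `nycflights13` exercises above, assuming the `nycflights13` package is installed. Using `arr_delay` as "the delay" is an assumption, and the `weekday` and `avg_temp` column names are made up for this sketch.

```r
library(data.table)

# nycflights13 is assumed to be installed from CRAN
flights <- data.table(nycflights13::flights)
weather <- data.table(nycflights13::weather)

# number of flights to LAX, overall and from JFK
n_lax     <- flights[dest == 'LAX', .N]
n_jfk_lax <- flights[origin == 'JFK' & dest == 'LAX', .N]

# average delay from JFK to LAX (assumption: arrival delay, in minutes)
avg_delay_jfk_lax <- flights[origin == 'JFK' & dest == 'LAX',
                             mean(arr_delay, na.rm = TRUE)]

# destinations from JFK ordered by average delay; the first row has the lowest
jfk_delays <- flights[origin == 'JFK',
                      .(avg_delay = mean(arr_delay, na.rm = TRUE)),
                      by = dest][order(avg_delay)]

# day of the week for each flight
flights[, weekday := weekdays(as.Date(ISOdate(year, month, day)))]

# daily average temperature at EWR, merged back to the flights by date
daily_temperatures <- weather[origin == 'EWR',
                              .(avg_temp = mean(temp, na.rm = TRUE)),
                              by = .(year, month, day)]
flights <- merge(flights, daily_temperatures,
                 by = c('year', 'month', 'day'), all.x = TRUE)
```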
Jan 30 (180 + 100 min): Modeling & intro to ML methods on qualitative data
Basic models:
- Linear regression
- Diagnostic plots
- Extrapolation
- Polynomial regression
- Confounders
- Correlation & causality
- Importance of feature selection and engineering
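
To make the regression topics above concrete, here is a small sketch on the built-in `mtcars` dataset. The choice of dataset and variables is an assumption for illustration only.

```r
# simple linear regression: weight explained by horsepower
fit <- lm(wt ~ hp, data = mtcars)
summary(fit)

# the four standard diagnostic plots of an lm object
op <- par(mfrow = c(2, 2))
plot(fit)
par(op)

# prediction inside and outside the observed range:
# hp = 500 is an extrapolation, as the largest hp in mtcars is 335
predict(fit, newdata = data.frame(hp = c(100, 500)))

# 3rd degree polynomial regression, compared with the linear fit
fit3 <- lm(wt ~ poly(hp, 3), data = mtcars)
anova(fit, fit3)
```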
Clustering:
- distance matrix
- hierarchical clustering
- identifying the optimal number of clusters
- k-means clustering
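
The clustering steps above could look roughly like this in base R, again on `mtcars` as an assumed example. The elbow plot of within-cluster sums of squares is just one common heuristic for picking the number of clusters and may differ from the method used in class.

```r
x <- scale(mtcars)             # standardize so no single variable dominates the distances

# distance matrix and hierarchical clustering
d  <- dist(x)
hc <- hclust(d)
plot(hc)                       # dendrogram
groups <- cutree(hc, k = 3)    # cut the tree into 3 clusters

# elbow heuristic: total within-cluster sum of squares for k = 1..8
set.seed(42)
wss <- sapply(1:8, function(k) kmeans(x, centers = k, nstart = 10)$tot.withinss)
plot(1:8, wss, type = 'b', xlab = 'number of clusters', ylab = 'within-cluster SS')

# k-means with 3 clusters, compared with the hierarchical solution
km <- kmeans(x, centers = 3, nstart = 10)
table(hierarchical = groups, kmeans = km$cluster)
```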
Classification:
- confusion matrix
- k-Nearest Neighbors algorithm
- decision trees with `rpart`
- overfitting
- other decision tree algorithms in R and the `caret` package
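
A short sketch of the classification workflow above, using the built-in `iris` dataset and a random train/test split. Both are assumptions for illustration, as are the split size and `k = 5`. The `rpart` and `class` packages ship with standard R installations.

```r
library(rpart)
library(class)

# hold out a test set so that overfitting shows up in the comparison
set.seed(42)
idx   <- sample(nrow(iris), 100)
train <- iris[idx, ]
test  <- iris[-idx, ]

# decision tree with rpart
tree      <- rpart(Species ~ ., data = train)
pred_tree <- predict(tree, newdata = test, type = 'class')
cm_tree   <- table(actual = test$Species, predicted = pred_tree)   # confusion matrix

# k-Nearest Neighbors on the four numeric features
pred_knn <- knn(train = train[, 1:4], test = test[, 1:4],
                cl = train$Species, k = 5)
cm_knn   <- table(actual = test$Species, predicted = pred_knn)

# accuracy on the held-out data
acc_tree <- sum(diag(cm_tree)) / sum(cm_tree)
acc_knn  <- sum(diag(cm_knn)) / sum(cm_knn)
```

Comparing accuracy on `train` with accuracy on `test` is a simple way to spot an overfitted tree.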
Dimension reduction methods:
- Principal Component Analysis
- Multidimensional Scaling
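
Both dimension reduction methods are available in base R. The following sketch applies them to `mtcars`, which is an assumed example dataset.

```r
# Principal Component Analysis on standardized variables
pca <- prcomp(mtcars, scale. = TRUE)
summary(pca)                   # proportion of variance explained per component
head(pca$x[, 1:2])             # coordinates on the first two components

# classical Multidimensional Scaling on the Euclidean distance matrix
mds <- cmdscale(dist(scale(mtcars)), k = 2)
plot(mds, type = 'n', xlab = 'dimension 1', ylab = 'dimension 2')
text(mds[, 1], mds[, 2], rownames(mtcars), cex = 0.7)
```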
Datasets and references for the model examples:
- Load the content of the https://bit.ly/mtcars-csvO CSV file and save as `df` (check the variable names in the manual of `mtcars`)
- Transform `df` to a `data.table` object
- Count the number of cars with `4` gears
- Count the number of cars with `4` gears and less than 100 horsepower
- What's the overall weight of cars with `4` cylinders?
- Which car is the heaviest?
- Plot the distribution of weights
- Plot the distribution of gears
- Plot the distribution of weights per gears
- Plot the average weight per gears
- Which car has the best fuel consumption?
- Plot the weight and horsepower of cars
- Add a linear trend line to the above plot
- Add a 3rd degree polynomial model to the above plot
- Fit a linear model on `hp` to predict weight
- Estimate the weight based on the above model for `Lotus Europa`
- Compute a new variable in the dataset for the ratio of `wt` and `hp`
- Plot the distribution of this new variable on a boxplot
- Create an aggregated dataset on `mtcars` including the average `hp` and `wt` grouped by the number of gears
- Merge the average `hp` and `wt` per gears from the above dataset to the original `df` object based on the number of gears
- Compute a new variable for fuel consumption using the "liters per 100 kilometers" unit based on `mpg`
- Which car has the best fuel consumption?
- Compute `wt2` to store the weight in kilograms based on `wt`
- Apply k-means clustering on the dataset to split the observations into 3 groups
- Perform hierarchical clustering on the dataset and plot the dendrogram
- Build a decision tree to tell if a car has automatic or manual transmission
- Visualize the above decision tree
- Create a confusion matrix for the above model
- Use the k-NN algorithm to fit a similar model and decide on the best number of neighbors to use
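
A hedged sketch for some of the sample questions above. It uses the built-in `mtcars` dataset instead of downloading the CSV, so the `car` column (built from the row names) and the other new column names are assumptions. The unit conversions rely on `mpg` being in miles per US gallon and `wt` in 1000 lbs, as documented in `?mtcars`.

```r
library(data.table)

# built-in mtcars as a stand-in for the CSV file; keep the car names as a column
dt <- as.data.table(mtcars, keep.rownames = TRUE)
setnames(dt, 'rn', 'car')

dt[gear == 4, .N]                    # cars with 4 gears
dt[gear == 4 & hp < 100, .N]         # ... and less than 100 horsepower
dt[cyl == 4, sum(wt)]                # overall weight of cars with 4 cylinders
heaviest   <- dt[which.max(wt), car]
best_miles <- dt[which.max(mpg), car]   # best fuel consumption = highest mpg

# linear model on hp to predict weight, then estimate for the Lotus Europa
fit <- lm(wt ~ hp, data = dt)
predict(fit, newdata = dt[car == 'Lotus Europa'])

# new variables
dt[, wt_per_hp := wt / hp]                  # ratio of wt and hp
dt[, lp100km   := 235.215 / mpg]            # liters per 100 km from miles per US gallon
dt[, wt2       := wt * 1000 * 0.45359237]   # weight in kilograms

# averages per number of gears, merged back to the original rows
gears <- dt[, .(avg_hp = mean(hp), avg_wt = mean(wt)), by = gear]
dt    <- merge(dt, gears, by = 'gear')
```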
Slides:
Install the `h2o` package then start H2O in the R console:

    library(h2o)
    h2o.init()

If you get an error on the Blue Lab computers, then open a Windows Command Prompt in Start Menu/All Programs/Accessories and run the following command:

    java -jar "c:\Program Files\R\R-3.2.3\library\h2o\java\h2o.jar"

Then connect to H2O from R by rerunning the `h2o.init()` function from above.
- Transform the `mtcars` dataset to `data.table` and store as a new object
- Count the number of cars with less than `4` gears
- Count the number of cars with more than `4` gears and less than 100 horsepower
- What's the average weight of cars with `4` cylinders?
- Which car has the best fuel consumption?
- Plot the distribution of the number of carburetors
- Plot the distribution of the number of carburetors grouped by gears
- Plot the average weight grouped by the number of carburetors
- Plot the weight and horsepower of cars
- Add a linear trend line to the above plot
- Add a 3rd degree polynomial model to the above plot
- Fit a linear model on the weight of cars to predict fuel consumption
- What's the estimated fuel consumption of a car with `wt = 5`?
- Install the `ISLR` package and use its `Auto` for the below exercises
- Build and visualize a decision tree to tell if a car was made in America, Europe or Japan
- Apply k-means or hierarchical clustering on the dataset to split the observations into 3 groups
Bonus exercise: train a reasonable k-NN or other ML model classifying cars as American vs. other origin (aim for AUC > 0.95)
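
For the trend line and linear model exercises above, one possible sketch is shown below. The plot styling and the object names are assumptions, and the model uses `mpg` as the measure of fuel consumption.

```r
library(ggplot2)

# linear model on the weight of cars to predict fuel consumption (in mpg)
fit <- lm(mpg ~ wt, data = mtcars)
est <- predict(fit, newdata = data.frame(wt = 5))   # estimate for a car with wt = 5

# weight and horsepower with a linear trend line and a 3rd degree polynomial fit
p <- ggplot(mtcars, aes(x = wt, y = hp)) +
    geom_point() +
    geom_smooth(method = 'lm', formula = y ~ x, se = FALSE) +
    geom_smooth(method = 'lm', formula = y ~ poly(x, 3), se = FALSE, color = 'red')
print(p)
```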
Results:

    grades <- c(61, 62, 65, 73, 76, 78, 89, 90, 91, 93, 94, 94, 95, 95, 96, 99, 100, 100, 100)
    library(ggplot2)
    ggplot() + geom_histogram(aes(grades), fill = 'orange', binwidth = 5) + xlim(0, 100)

For a formal introduction, see my tutorial slides presented at useR! 2015, or for a quick intro, see 7.Rmd
Quick demo: Network analysis of the Hungarian interbank lending market
General Shiny demos and references
Example Shiny app we implemented in class: 8
Home assignment and contact info: slides