GitHub - daroczig/CEU-R-intro at 2016

This is the R script repository of the "Tools for Analytics Lab - R-track" course, part of the MSc in Business Analytics at CEU.

Jan 25: Introduction to R and General Programming
Jan 26: First Steps with Data Visualization
Jan 27: Data Preparation
Jan 28: Introduction to modeling
Jan 30: Modeling & introduction to ML methods on qualitative data
Sample exam questions
Feb 01: Random forest and GBM with H2O
Feb 02: Exam
Feb 03: Dynamic Reports and Reproducible Research
Feb 04: Interactive Data Analysis and Dashboards

List of optional take-home exercises

ggplot
data.table: 1, 2, 3

Jan 25 (90 min): Introduction to R and General Programming

General overview of the R ecosystem: slides
Introduction to R: variables, functions and vectors
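
A minimal sketch of the three topics above (variables, functions and vectors) in base R. The helper name `rescale01` is hypothetical and only serves as an illustration; it is not taken from the course scripts.

```r
# Variables: assign with <-
x <- 4
y <- x^2

# Vectors: most operations are vectorised
v <- c(1, 5, 10)
doubled <- v * 2
large   <- v[v > 3]        # logical indexing keeps matching elements

# Functions: a hypothetical helper rescaling a vector to the 0-1 range
rescale01 <- function(z) {
    (z - min(z)) / (max(z) - min(z))
}
scaled <- rescale01(v)
scaled
```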

Jan 26 (90 min): First Steps with Data Visualization

Introducing data.frame
Exploratory data analysis with histogram, boxplot, bar chart and scatterplot
Plots outside of Excel: dotchart and vioplot examples
The Grammar of Graphics in R with ggplot2
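
The plot types listed above could be sketched as follows on the built-in `mtcars` data.frame. This uses base graphics and, only if the package is installed, one ggplot2 layer; the choice of variables is an assumption and not necessarily what was shown in class.

```r
# mtcars ships with R as a data.frame
str(mtcars)

# Exploratory plots with base graphics
h <- hist(mtcars$hp, main = 'Horsepower', xlab = 'hp')
boxplot(hp ~ cyl, data = mtcars, xlab = 'cylinders', ylab = 'hp')
carb_counts <- table(mtcars$carb)
barplot(carb_counts, xlab = 'carburetors')
plot(mtcars$wt, mtcars$hp, xlab = 'weight (1000 lbs)', ylab = 'hp')
dotchart(mtcars$mpg, labels = rownames(mtcars), cex = 0.7)

# The same scatterplot in the Grammar of Graphics, if ggplot2 is available
if (requireNamespace('ggplot2', quietly = TRUE)) {
    library(ggplot2)
    p <- ggplot(mtcars, aes(x = wt, y = hp)) + geom_point()
    print(p)
}
```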

Jan 27 (140 min): Data Preparation

Register a GitHub account

ggplot2 exercises
- number of carburetors
- horsepower
- barplot of number of carburetors per transmission
- boxplot of horsepower by the number of carburetors
- horsepower and weight by the number of carburetors
- horsepower and weight by the number of carburetors with a trend line
Filtering and summarizing data with base R
Intro to data.table
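
Filtering and summarizing with base R, as mentioned above, might look like this sketch on `mtcars`. The thresholds are arbitrary assumptions chosen for illustration.

```r
# Filtering rows with a logical condition
four_cyl <- mtcars[mtcars$cyl == 4, ]
nrow(four_cyl)

# subset() is an equivalent form that is handy for interactive use
light_and_strong <- subset(mtcars, wt < 3 & hp > 100)

# Summaries for one variable, overall and by group
mean(mtcars$hp)
hp_by_carb <- aggregate(hp ~ carb, data = mtcars, FUN = mean)
hp_by_carb
```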

data.table exercises with hflights
- the number of cancelled flights
- the shortest flight on each weekday
- the average delay to all destinations
- the average delay per destination
- plot the departure and arrival delays
- plot the average departure and arrival delays per destination
- plot the average departure and arrival delays per flight + size
- estimate the delay to Budapest
Some quick examples on string and date manipulations
Left joins
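
A short sketch of the string, date and left join topics above, using base R only. The `flights` and `carriers` tables are made-up toy data, not the hflights dataset; `merge()` with `all.x = TRUE` works the same way on data.table objects.

```r
# String manipulation
carrier <- c('aa', 'ua', 'wn')
upper   <- toupper(carrier)
paste('Carrier', upper, sep = ': ')
sub('^a', 'A', carrier)            # replaces a leading "a" only
nchar('Budapest')

# Date manipulation
d <- as.Date('2016-01-27')
d + 7                              # one week later
ym <- format(d, '%Y/%m')
days_between <- as.numeric(as.Date('2016-02-04') - d)

# Left join: keep every flight, even without a matching carrier name
flights  <- data.frame(id = 1:4, carrier = c('AA', 'UA', 'XX', 'AA'))
carriers <- data.frame(carrier = c('AA', 'UA'),
                       name    = c('American', 'United'))
joined <- merge(flights, carriers, by = 'carrier', all.x = TRUE)
joined
```

Rows without a match (here the hypothetical carrier `XX`) are kept and get `NA` in the joined columns.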

Jan 28 (140 min): Models

Revisiting GitHub integration in RStudio:
1. Install git from https://git-scm.com/
2. Install R from https://www.r-project.org/
3. Install RStudio from https://www.rstudio.com/products/RStudio/#Desktop
4. Verify that in RStudio, you can see the path of the git executable in the "Git/SVN" tab of the Tools/Global Options menu. If not, you might have to restart RStudio (if you installed git after starting RStudio), or git was installed without being added to the PATH on Windows. Either way, you can browse for the "git executable" manually (look for the git executable file in a bin folder).
5. Create an RSA key (optionally with a passphrase for increased security -- that you have to enter every time you push and pull to and from GitHub). Copy the public key and add that to your SSH keys on your GitHub profile.
6. Create a new project choosing "version control", then "git" and paste the SSH version of the repo URL copied from GitHub in the pop-up -- now RStudio should be able to download the repo. If it asks you to accept GitHub's fingerprint, say "Yes".
7. If RStudio/git is complaining that you have to set your identity, click on the "Git" tab in the top-right panel, then click on the Gear icon and then "Shell" -- here you can set your username and e-mail address in the command line, so that RStudio/git integration can work. Use the following commands:
```
$ git config --global user.name "Your Name"
$ git config --global user.email "Your e-mail address"
```
  Close this window, commit, push changes, all set.

data.table exercises with the following dataset:

 library(data.table)
 set.seed(42)
 # transactions: item sold, timestamp and number of items
 tx <- data.table(
     item   = sample(letters[1:3], 10, replace = TRUE),
     time   = as.POSIXct(as.Date('2016-01-01')) - runif(10) * 36*60^2,
     amount = rpois(10, 25))
 # daily prices: item (length 3) and date (length 2) are recycled to the 6 prices
 prices <- data.table(
     item  = letters[1:3],
     date  = as.Date('2016-01-01') - 1:2,
     price = as.vector(outer(c(100, 200, 300), c(1, 1.2))))
 # item attributes
 items <- data.table(
     item   = letters[1:3],
     color  = c('red', 'white', 'red'),
     weight = c(2, 4, 2.5))

filter for transactions with "b" items
filter for transactions with less than 25 items
filter for transactions with less than 25 "b" items
count the number of transactions for each item
count the number of transactions for each day
count the overall number of items sold on each day
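
One possible way to approach the exercises above, as a sketch rather than the reference solution. It assumes that `amount` is the number of items in a transaction and that "day" means the calendar date of `time` as returned by `as.Date()`.

```r
library(data.table)

set.seed(42)
tx <- data.table(
    item   = sample(letters[1:3], 10, replace = TRUE),
    time   = as.POSIXct(as.Date('2016-01-01')) - runif(10) * 36*60^2,
    amount = rpois(10, 25))

# Filtering goes into the first argument of [
b_items <- tx[item == 'b']
small   <- tx[amount < 25]
small_b <- tx[item == 'b' & amount < 25]

# Counting and summing go into the second argument, grouping into by
per_item     <- tx[, .N, by = item]
per_day      <- tx[, .N, by = .(day = as.Date(time))]
sold_per_day <- tx[, .(sold = sum(amount)), by = .(day = as.Date(time))]
sold_per_day
```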

Further data.table examples on
- left joins
- transforming wide and long tables with reshape2
- rolling and overlap joins
ANOVA
Crosstable
Simpson's paradox
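
The ANOVA and crosstable items above can be tried with base R alone. This is a minimal sketch on `mtcars`; the variables were picked as an assumption for illustration.

```r
# One-way ANOVA: does the mean mpg differ across cylinder groups?
fit <- aov(mpg ~ factor(cyl), data = mtcars)
summary(fit)

# Crosstable of two categorical variables
ct <- table(cyl = mtcars$cyl, gear = mtcars$gear)
ct

# Row proportions: share of each gear count within a cylinder group
row_share <- prop.table(ct, margin = 1)
round(row_share, 2)
```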

Datasets for the model examples:

Further data.table exercises on the nycflights13 dataset to practice for the exam:

count the number of flights to LAX
count the number of flights to LAX from JFK
compute the average delay (in minutes) for flights from JFK to LAX
which destination has the lowest average delay from JFK?
plot the average delay to all destinations from JFK
plot the distribution of all flight delays to all destinations from JFK
compute a new variable in flights showing the day of the week
plot the number of flights per weekday
create a heatmap on the number of flights per weekday and hour of the day (see geom_tile)
merge the airports dataset to flights on the FAA airport code
order the weather dataset by year, month, day and hour
plot the average temperature at noon in EWR for each month based on the weather dataset
aggregate the weather dataset and store as daily_temperatures to show the daily average temperatures based on the EWR records
merge the daily_temperatures dataset to flights on the date
do the above two steps on daily + hourly temperature averages

Jan 30 (180 + 100 min): Modeling & intro to ML methods on qualitative data

Basic models:

Linear regression
Diagnostic plots
Extrapolation
Polynomial regression
Confounders
Correlation & causality
Importance of feature selection and engineering
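
A sketch of the linear regression items above on `mtcars`, assuming `mpg` as the response and `wt` as the predictor (the class examples may have used other data). It also shows why extrapolating far outside the observed range is risky.

```r
# Simple linear regression
fit <- lm(mpg ~ wt, data = mtcars)
summary(fit)
coef(fit)

# Diagnostic plots: residuals vs fitted, Q-Q, scale-location, leverage
par(mfrow = c(2, 2))
plot(fit)
par(mfrow = c(1, 1))

# Polynomial regression with a 3rd degree term
fit3 <- lm(mpg ~ poly(wt, 3), data = mtcars)

# Extrapolation: wt = 10 is far outside the observed weights (max is about 5.4)
pred <- predict(fit, newdata = data.frame(wt = c(3, 10)))
pred
```

The prediction for `wt = 10` is a negative fuel economy, which is a reminder that the model is only trustworthy within the range of the training data.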

Clustering:

distance matrix
hierarchical clustering
identifying the optimal number of clusters
k-means clustering
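
The clustering steps above could be sketched with the stats package that ships with R. The elbow plot is only one heuristic for picking the number of clusters and is an assumption here; the course may have used a different method.

```r
# Standardise so that no single variable dominates the distances
x  <- scale(mtcars)
dm <- dist(x)                      # Euclidean distance matrix
hc <- hclust(dm)                   # complete linkage by default
plot(hc)
groups_hc <- cutree(hc, k = 3)

# Elbow heuristic: total within-cluster sum of squares for k = 1..8
set.seed(42)
wss <- sapply(1:8, function(k) {
    kmeans(x, centers = k, nstart = 10)$tot.withinss
})
plot(1:8, wss, type = 'b', xlab = 'k', ylab = 'total within-cluster SS')

# k-means with 3 clusters, compared with the hierarchical solution
km <- kmeans(x, centers = 3, nstart = 10)
table(hclust = groups_hc, kmeans = km$cluster)
```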

Classification:

confusion matrix
k-Nearest Neighbors algorithm
decision trees with rpart
overfitting
other decision tree algorithms in R and the caret package
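
A minimal classification sketch covering the confusion matrix, k-NN and rpart items above. It assumes the built-in `iris` data and a random 100/50 train/test split; `accuracy` is a hypothetical helper and `k = 5` is an arbitrary choice.

```r
library(rpart)   # decision trees, a recommended package shipped with R
library(class)   # provides knn()

# Random train/test split
set.seed(42)
idx   <- sample(nrow(iris), 100)
train <- iris[idx, ]
test  <- iris[-idx, ]

# Decision tree and its confusion matrix on the test set
tree      <- rpart(Species ~ ., data = train)
pred_tree <- predict(tree, newdata = test, type = 'class')
cm_tree   <- table(actual = test$Species, predicted = pred_tree)

# k-NN on the four numeric columns
pred_knn <- knn(train[, 1:4], test[, 1:4], cl = train$Species, k = 5)
cm_knn   <- table(actual = test$Species, predicted = pred_knn)

# Hypothetical helper: share of correctly classified observations
accuracy <- function(cm) sum(diag(cm)) / sum(cm)
accuracy(cm_tree)
accuracy(cm_knn)
```

Evaluating on held-out rows rather than on the training data is what makes overfitting visible: a model that memorizes the training set scores well there and worse on the test set.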

Dimension reduction methods:

Principal Component Analysis
Multidimensional Scaling
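
Both dimension reduction methods above are available in base R. The sketch below assumes standardised `mtcars` variables and two output dimensions.

```r
# Principal Component Analysis on standardised variables
pca <- prcomp(mtcars, scale. = TRUE)
summary(pca)                       # proportion of variance per component
biplot(pca)

# Classical Multidimensional Scaling on the Euclidean distance matrix
mds <- cmdscale(dist(scale(mtcars)), k = 2)
plot(mds, type = 'n', xlab = 'dimension 1', ylab = 'dimension 2')
text(mds, labels = rownames(mtcars), cex = 0.7)
```

With Euclidean distances on the same standardised data, classical MDS recovers the first principal components up to sign, so the two plots show the same configuration of cars.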

Datasets and references for the model examples:

Sample exam questions

Load the content of the https://bit.ly/mtcars-csvO CSV file and save as df (check the variable names in the manual of mtcars)
Transform df to a data.table object
Count the number of cars with 4 gears
Count the number of cars with 4 gears and less than 100 horsepower
What's the overall weight of cars with 4 cylinders?
Which car is the heaviest?
Plot the distribution of weights
Plot the distribution of gears
Plot the distribution of weights per gears
Plot the average weight per gears
Which car has the best fuel consumption?
Plot the weight and horsepower of cars
Add a linear trend line to the above plot
Add a 3rd degree polynomial model to the above plot
Fit a linear model on hp to predict weight
Estimate the weight based on the above model for Lotus Europa
Compute a new variable in the dataset for the ratio of wt and hp
Plot the distribution of this new variable on a boxplot
Create an aggregated dataset on mtcars including the average hp and wt grouped by the number of gears
Merge the average hp and wt per gears from the above dataset to the original df object based on the number of gears
Compute a new variable for fuel consumption using the "liters per 100 kilometers" unit based on mpg
Which car has the best fuel consumption?
Compute wt2 to store the weight in kilograms based on wt
Apply k-means clustering on the dataset to split the observations into 3 groups
Perform hierarchical clustering on the dataset and plot the dendrogram
Build a decision tree to tell if a car has automatic or manual transmission
Visualize the above decision tree
Create a confusion matrix for the above model
Use the k-NN algorithm to fit a similar model and decide on the best number of neighbors to use
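
The two unit conversion questions above rely on fixed constants, so a worked sketch may help. It assumes the built-in `mtcars` as a stand-in for the CSV file, with `mpg` in miles per US gallon and `wt` in 1000 lbs as documented in `?mtcars`; the column name `l100km` is hypothetical.

```r
df <- mtcars

# mpg to liters per 100 km:
# 1 mile = 1.609344 km and 1 US gallon = 3.785411784 liters
df$l100km <- 100 * 3.785411784 / (1.609344 * df$mpg)

# wt is given in 1000 lbs; 1 lb = 0.45359237 kg
df$wt2 <- df$wt * 1000 * 0.45359237

# Lowest consumption and heaviest car
best     <- rownames(df)[which.min(df$l100km)]
heaviest <- rownames(df)[which.max(df$wt2)]
best
heaviest
```

Note that the best fuel consumption is the highest `mpg` but the lowest liters per 100 km, as the two units run in opposite directions.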

Feb 1 (90 min): Random forest and GBM with H2O

Slides:

Install the h2o package then start H2O in the R console:

library(h2o)
h2o.init()

If you get an error on the Blue Lab computers, then open a Windows Command Prompt in Start Menu/All Programs/Accessories and run the following command:

java -jar "c:\Program Files\R\R-3.2.3\library\h2o\java\h2o.jar"

Then connect to H2O from R by rerunning the h2o.init() call from above.

Feb 2 (90 min): Exam

Transform the mtcars dataset to data.table and store as a new object
Count the number of cars with less than 4 gears
Count the number of cars with more than 4 gears and less than 100 horsepower
What's the average weight of cars with 4 cylinders?
Which car has the best fuel consumption?
Plot the distribution of the number of carburetors
Plot the distribution of the number of carburetors grouped by gears
Plot the average weight grouped by the number of carburetors
Plot the weight and horsepower of cars
Add a linear trend line to the above plot
Add a 3rd degree polynomial model to the above plot
Fit a linear model on the weight of cars to predict fuel consumption
What's the estimated fuel consumption of a car with wt = 5?
Install the ISLR package and use its Auto for the below exercises
Build and visualize a decision tree to tell if a car was made in America, Europe or Japan
Apply k-means or hierarchical clustering on the dataset to split the observations into 3 groups

Bonus exercise: train a reasonable k-NN or other ML model classifying cars as American vs. other origin (target: AUC > 0.95)

Results:

grades <- c(61, 62, 65, 73, 76, 78, 89, 90, 91, 93, 94, 94, 95, 95, 96, 99, 100, 100, 100)
library(ggplot2)
ggplot() + geom_histogram(aes(grades), fill = 'orange', binwidth = 5) + xlim(0, 100)

Feb 3 (140 min): Dynamic Reports and Reproducible Research

For a formal introduction, see my tutorial slides presented at useR! 2015 or for a quick intro: 7.Rmd

Feb 4 (140 min): Interactive Data Analysis and Dashboards

Quick demo: Network analysis of the Hungarian interbank lending market

General Shiny demos and references

Example Shiny app we implemented in class: 8

Home assignment and contact info: slides
