This is the R script repository of the "Tools for Analytics Lab - R-track" course, part of the MSc in Business Analytics at CEU.
- Jan 28: Introduction to R and General Programming
- Jan 29: Data Visualization and Data Preparations
- Exercises on data preparations and visualization
- Feb 04: Modeling
- Exercises for the exam
- Feb 05: Exam, Introduction to R Markdown, Using git and GitHub, Introduction to Shiny
- Home Assignment
- General overview of the R ecosystem: slides
- Introduction to R: variables, functions and vectors
- Univariate plots in base R
- Data Visualization with `ggplot2`
- Filtering and summarizing data with `data.table`
- Wide and long tables
- Visualize the below variables from the `mtcars` dataset with `ggplot2`:
  - number of carburetors
  - horsepower
  - barplot on the number of carburetors per transmission
  - boxplot on the horsepower by the number of carburetors
  - horsepower and weight by the number of carburetors
  - horsepower and weight by the number of carburetors with a trend line
- `data.table` exercises using the `hflights` dataset:
  - compute the number of cancelled flights
  - compute the shortest flight on each weekday
  - compute the average delay to all destinations
  - compute the average delay to all destinations per origin
  - plot the average departure and arrival delays per destination
  - plot the percentage of cancelled flights per destination
- Further exercises on the `nycflights13` dataset:
  - count the number of flights to LAX
  - count the number of flights to LAX from JFK
  - compute the average delay (in minutes) for flights from JFK to LAX
  - which destination has the lowest average delay from JFK?
  - plot the average delay to all destinations from JFK
  - plot the distribution of all flight delays to all destinations from JFK
  - compute a new variable in flights showing the day of the week
  - plot the number of flights per weekday
  - create a heatmap on the number of flights per weekday and hour of the day
  - plot the average temperature at noon in `EWR` for each month based on the `weather` dataset
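
As a starting point for the `mtcars` visualization exercises above, here is a minimal sketch with `ggplot2`. The object names (`cars`, `p_bar`, `p_box`, `p_trend`) are made up for illustration, and the choice of dodged bars and a single linear trend line is only one of several reasonable readings of the tasks.

```r
library(ggplot2)

# mtcars ships with R; carb and am are stored as numbers, so convert them
# to factors to get discrete axes and fill colors
cars <- transform(mtcars,
                  carb = factor(carb),
                  am   = factor(am, labels = c("automatic", "manual")))

# barplot on the number of carburetors per transmission
p_bar <- ggplot(cars, aes(x = carb, fill = am)) +
  geom_bar(position = "dodge")

# boxplot on the horsepower by the number of carburetors
p_box <- ggplot(cars, aes(x = carb, y = hp)) +
  geom_boxplot()

# horsepower and weight, colored by the number of carburetors,
# with one linear trend line fitted on all cars
p_trend <- ggplot(cars, aes(x = wt, y = hp)) +
  geom_point(aes(color = carb)) +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)
```

Printing any of these objects (for example `print(p_trend)`) renders the plot on the active graphics device.
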
- General overview of the R ecosystem: slides
- Recap on linear models
- Hierarchical and k-means clustering
- Classification
- Intro to PCA with image processing
- Dimension reduction with PCA
- Multidimensional scaling
- High-level overview of decision trees, bagging, random forest and boosting
- Intro to `h2o`
- Plot the weight and horsepower of cars from the `mtcars` dataset (bundled with R, see `?mtcars`)
- Add a linear trend line to the above plot
- Add a 3rd degree polynomial model to the above plot
- Fit a linear model on `hp` to predict weight
- Estimate the weight based on the above model for a car with 98 horsepower
- Estimate the weight based on the above model for `Lotus Europa`
- What's the average fuel consumption?
- Build a linear model to describe fuel consumption based on the horsepower and weight
- Compute a new variable in the dataset for the ratio of `wt` and `hp`
- Plot the distribution of this new variable on a boxplot
- Create an aggregated dataset on `mtcars` including the average `hp` and `wt` grouped by the number of gears
- Compute a new variable for fuel consumption using the "liters per 100 kilometers" unit based on `mpg`
- Which car has the best fuel consumption?
- Compute `wt2` to store the weight in kilograms based on `wt`
- Apply k-means clustering on the dataset to split the observations into 3 groups
- Perform hierarchical clustering on the dataset and plot the dendrogram
- Compare the cluster memberships returned by the hierarchical and k-means methods
- Build a decision tree to tell if a car has automatic or manual transmission (hint: you might want to convert the number to factor first)
- Visualize the above decision tree
- Create a confusion matrix for the above model
- Use the k-NN algorithm to fit a similar model and decide on the best number of neighbors to use
- Did you use a training and validation dataset?
- Visualize the (dis)similarity of cars using PCA or MDS
- Load the weight.csv and build a model to classify observations by whether their BMI is above the normal threshold (25)
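
Several of the modeling exercises above can be approached with base R alone. The following is a rough sketch, not a reference solution: the names `fit`, `cars`, `l100km`, `km`, `hc` and `hc_groups` are illustrative, and scaling the variables before clustering is an assumption rather than something the exercises prescribe.

```r
# Linear model on hp to predict weight (wt is in 1000 lbs, see ?mtcars)
fit <- lm(wt ~ hp, data = mtcars)

# Estimated weight for a car with 98 horsepower
predict(fit, newdata = data.frame(hp = 98))

# Estimated weight for the Lotus Europa, looked up by row name
predict(fit, newdata = mtcars["Lotus Europa", ])

# Fuel consumption in liters per 100 km:
# 1 US gallon = 3.785411784 l and 1 mile = 1.609344 km,
# so l/100km = 100 * 3.785411784 / (1.609344 * mpg)
cars <- mtcars
cars$l100km <- 100 * 3.785411784 / (1.609344 * cars$mpg)
rownames(cars)[which.min(cars$l100km)]  # car with the lowest consumption

# Weight in kilograms (1 lb = 0.45359237 kg)
cars$wt2 <- cars$wt * 1000 * 0.45359237

# k-means with 3 clusters on the scaled original variables
set.seed(42)  # k-means starts from random centers
km <- kmeans(scale(mtcars), centers = 3)

# Hierarchical clustering on the same scaled data, cut into 3 groups
hc <- hclust(dist(scale(mtcars)))
plot(hc)  # dendrogram
hc_groups <- cutree(hc, k = 3)

# Cross-tabulate the two sets of cluster memberships
table(kmeans = km$cluster, hclust = hc_groups)
```

The cluster labels themselves are arbitrary, so the cross-table is read by checking whether each row concentrates in a single column rather than by comparing label numbers.
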
- Transform the `mtcars` dataset to a new `data.table` object called `dt`
- Count the number of cars with fewer than `4` gears
- Count the number of cars with more than `4` gears and less than 90 horsepower
- What's the average weight of cars with `4` gears?
- What's the weight of the car with the best fuel consumption?
- Plot the distribution of the number of cylinders
- Plot the distribution of the number of cylinders grouped by carburetors
- Plot the average weight of cars grouped by the number of cylinders
- Plot the distribution of the performance of the cars (horsepower) per number of cylinders
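
The counting and averaging tasks above map directly onto the `dt[i, j, by]` form of `data.table`. This is a minimal sketch that assumes "best fuel consumption" means the highest `mpg`; the `rn` column is where `as.data.table` stores the former row names.

```r
library(data.table)

# Keep the car names, which mtcars stores as row names (they land in column rn)
dt <- as.data.table(mtcars, keep.rownames = TRUE)

# Number of cars with fewer than 4 gears
dt[gear < 4, .N]

# Number of cars with more than 4 gears and less than 90 horsepower
dt[gear > 4 & hp < 90, .N]

# Average weight of cars with 4 gears
dt[gear == 4, mean(wt)]

# Weight of the car with the best fuel consumption (highest mpg)
dt[which.max(mpg), .(rn, wt)]

# Average weight grouped by the number of cylinders, ready for plotting
dt[, .(avg_wt = mean(wt)), by = cyl]
```
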
- Install and load the `ISLR` package and use its `Auto` dataset for the below exercises
- Plot the weight and horsepower of cars
- Add a linear trend line to the above plot
- Fit a linear model using the weight of cars to predict acceleration
- What's the estimated acceleration of a car with `weight = 3`?
- Filter for cars from America (1) and Europe (2) and store the results in a new object called `auto` (mind the lower case letters)
- Remove the `name` column
- Apply k-means or hierarchical clustering on this dataset to split the observations into 3 groups, and show the number of observations in the clusters
- Bonus points: Build and visualize a decision tree to tell if a car was made in America or Europe, show the confusion matrix, do the same with k-NN
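
A possible outline for the `Auto` exercises, assuming the `ISLR` package is already installed. Leaving `origin` out of the clustering input and scaling the remaining columns are choices made here for illustration; the exercise does not require either.

```r
library(ISLR)

# Linear model using the weight of cars to predict acceleration
fit <- lm(acceleration ~ weight, data = Auto)

# Estimated acceleration of a car with weight = 3
predict(fit, newdata = data.frame(weight = 3))

# Keep cars from America (1) and Europe (2), then drop the name column
auto <- Auto[Auto$origin %in% c(1, 2), ]
auto$name <- NULL

# k-means into 3 groups on the scaled numeric columns, leaving out origin
set.seed(42)  # k-means starts from random centers
km <- kmeans(scale(auto[, setdiff(names(auto), "origin")]), centers = 3)

# Number of observations in each cluster
table(km$cluster)
```
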
Create a new R Markdown document -- it might ask you to install a bunch of packages. If you get an error due to not being able to install jsonlite or rmarkdown, run the following commands before trying this again:

    install.packages('jsonlite', dependencies = TRUE)
    install.packages('rmarkdown', dependencies = TRUE)

Example document: intro-to-markdown.Rmd
Further examples and resources:
- Register an account at https://github.com
- Install git from https://git-scm.com/
- Install R from https://www.r-project.org/
- Install RStudio from https://www.rstudio.com/products/RStudio/#Desktop
- Verify that in RStudio, you can see the path of the `git` executable binary in the Tools/Global Options menu's "Git/Svn" tab -- if not, then either you have to restart RStudio (if you installed git after starting RStudio) or you installed git without adding it to the PATH on Windows. Either way, browse for the "git executable" manually (look for the `git` executable file in some `bin` folder).
- Create an RSA key (optionally with a passphrase for increased security, which you then have to enter every time you push to or pull from GitHub). Copy the public key and add it to your SSH keys on your GitHub profile.
- Create a new project choosing "version control", then "git" and paste the SSH version of the repo URL copied from GitHub in the pop-up -- now RStudio should be able to download the repo. If it asks you to accept GitHub's fingerprint, say "Yes".
- If RStudio/git is complaining that you have to set your identity, click on the "Git" tab in the top-right panel, then click on the Gear icon and then "Shell" -- here you can set your username and e-mail address in the command line, so that the RStudio/git integration can work. Use the following commands:

      $ git config --global user.name "Your Name"
      $ git config --global user.email "Your e-mail address"

  Close this window, commit, push changes, all set.

Find more resources in Jenny Bryan's "Happy Git and GitHub for the useR" tutorial
Example document:
Further examples and resources:
Use the nycflights13 dataset and create either an R Markdown document or a Shiny application to demonstrate your R skills. You can use any of the related datasets:

    > data(package = 'nycflights13')
    Data sets in package ‘nycflights13’:
    airlines    Airline names.
    airports    Airport metadata
    flights     Flights data
    planes      Plane metadata.
    weather     Hourly weather data

If you decide to write an R Markdown document, then include at least
- some exploratory data analysis on the available variables,
- feature engineering and enriching the `flights` dataset (e.g. weekday or grouped hour of the day),
- a model predicting if a flight will be late by more than 15 minutes at the destination.
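
To make the feature engineering and modeling requirements more concrete, here is a hedged sketch in base R. The column names `date`, `weekday`, `daypart` and `late`, the hour bins, and the choice of logistic regression are all assumptions for illustration; a real submission would use richer features and validate the model. It also assumes a version of `nycflights13` in which `hour` holds the scheduled departure hour.

```r
library(nycflights13)

fl <- as.data.frame(flights)

# Weekday from the year/month/day columns; wday runs 0 (Sunday) to 6 (Saturday)
fl$date    <- as.Date(ISOdate(fl$year, fl$month, fl$day))
fl$weekday <- as.POSIXlt(fl$date)$wday

# Grouped hour of the day, using left-closed bins [0,6), [6,12), [12,18), [18,24)
fl$daypart <- cut(fl$hour, breaks = c(0, 6, 12, 18, 24), right = FALSE,
                  labels = c("night", "morning", "afternoon", "evening"))

# Target: arrival delay over 15 minutes (NA where arr_delay is missing)
fl$late <- fl$arr_delay > 15

# A deliberately simple baseline: logistic regression on three features;
# rows with a missing target are dropped by glm's default NA handling
fit <- glm(late ~ factor(weekday) + daypart + distance,
           data = fl, family = binomial)
```

`predict(fit, type = "response")` then returns the estimated probability of a late arrival for the rows used in the fit.
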
If working on a Shiny application, then create a tool for exploratory data analysis on the flights dataset including
- inputs to filter data on date and distance,
- at least one static plot,
- and an HTML table.
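
The three requirements above could be wired together roughly as follows. This is a bare-bones single-file sketch, not a template for the assignment: the input and output IDs (`dates`, `distance`, `delays`, `preview`) are invented for illustration, and a histogram of arrival delays is just one possible static plot.

```r
library(shiny)
library(nycflights13)

fl <- as.data.frame(flights)
fl$date <- as.Date(ISOdate(fl$year, fl$month, fl$day))

ui <- fluidPage(
  sidebarLayout(
    sidebarPanel(
      # inputs to filter data on date and distance
      dateRangeInput("dates", "Date",
                     start = min(fl$date), end = max(fl$date),
                     min = min(fl$date), max = max(fl$date)),
      sliderInput("distance", "Distance (miles)",
                  min = min(fl$distance), max = max(fl$distance),
                  value = range(fl$distance))
    ),
    mainPanel(plotOutput("delays"), tableOutput("preview"))
  )
)

server <- function(input, output) {
  # rows matching the current date range and distance range
  filtered <- reactive({
    fl[fl$date >= input$dates[1] & fl$date <= input$dates[2] &
         fl$distance >= input$distance[1] & fl$distance <= input$distance[2], ]
  })

  # static plot: histogram of arrival delays in the filtered data
  output$delays <- renderPlot({
    req(any(!is.na(filtered()$arr_delay)))
    hist(filtered()$arr_delay, main = "Arrival delay", xlab = "minutes")
  })

  # HTML table: first 20 rows of the filtered data
  output$preview <- renderTable(
    head(filtered()[, c("origin", "dest", "distance", "arr_delay")], 20)
  )
}

app <- shinyApp(ui, server)
```

Calling `runApp(app)` starts the application. For the submission format described in this section, the `ui` object would go into `ui.R` and the `server` function into `server.R`.
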
Please upload your project to Moodle no later than Feb 28 2017. Your submission should include:
- the source of your R Markdown document along with a PDF export or
- the `ui.R` and `server.R` (and any other files required to run the application) in a zip archive