Getting Started with RStudio on Ubuntu Linux

RStudio is an integrated development environment (IDE) for the R programming language. It provides a suite of tools that make working with R much easier, including code completion, debugging, visualizations, and notebook publishing. This guide will walk through installing RStudio on Ubuntu 20.04 and getting started with some basic usage.

Installing R

R can be installed from Ubuntu’s default repositories. This provides the version packaged for your Ubuntu release, which may be older than the latest release on CRAN:

sudo apt update
sudo apt install r-base

However, if you need a specific version, R can also be compiled from source:

wget https://cran.r-project.org/src/base/R-4/R-4.2.2.tar.gz
tar -xf R-4.2.2.tar.gz
cd R-4.2.2
./configure
make
sudo make install

Compiling from source also lets you enable optimizations for a particular CPU architecture, such as AVX2.

Once installed, check the version:

R --version
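The same check can be made from inside an R session, which is useful in scripts that depend on a minimum version. A minimal sketch; the 4.0.0 threshold is an arbitrary value chosen for illustration, not a requirement of this guide:

```r
# Print the full version string of the running R interpreter
print(R.version.string)

# getRversion() returns a version object that can be compared to a string
min_version <- "4.0.0"  # assumed threshold, for illustration only
if (getRversion() >= min_version) {
  message("R is at least version ", min_version)
} else {
  message("R is older than version ", min_version)
}
```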

RStudio Alternatives

While RStudio is likely the most popular IDE for R, several alternatives are worth considering:

Jupyter Notebook – Code and document R in notebooks alongside Python and other languages
Emacs/Emacs Speaks Statistics – Extensively customize workflows with this text editor
Visual Studio Code – Microsoft’s free editor, with an R extension plus remote workspaces and collaboration features
Atom – Hackable open-source editor with packages for autocomplete, debugging, plotting (no longer developed; GitHub archived the project in December 2022)

However, RStudio still sets itself apart with tightly integrated reporting, plotting, packaging, collaboration and publishing.

Installing RStudio

Download the latest .deb package (currently rstudio-2022.12.0-353-amd64.deb) and install it using dpkg, then let apt pull in any missing dependencies:

wget https://download1.rstudio.org/desktop/bionic/amd64/rstudio-2022.12.0-353-amd64.deb
sudo dpkg -i rstudio-*.deb
sudo apt-get install -f

Alternatively, automated updates can be enabled by adding RStudio’s repository (check the key and repository URLs against the current RStudio download page before using them):

wget -qO- https://rstudio.org/download/latest/ubuntu/rstudio-key.asc | sudo tee /usr/share/keyrings/rstudio-keyring.asc &>/dev/null

echo "deb [signed-by=/usr/share/keyrings/rstudio-keyring.asc] https://download2.rstudio.org/server/bionic main" | sudo tee /etc/apt/sources.list.d/rstudio.list

sudo apt-get update
sudo apt-get install rstudio-desktop

Once installed, launch RStudio Desktop from the applications menu or command line with rstudio.

RStudio IDE Tour

When you first launch RStudio, you will be presented with a multi-pane interface:

[Figure: RStudio IDE]

The default layout comprises:

Source Pane – R script editor with syntax highlighting, smart code completion, and multiple-file editing
Console Pane – Read–eval–print loop (REPL) for running code line-by-line
Environment/History Pane – Explore variable contents, access history, manage objects
Files/Plots/Packages/Help Pane – GUI interfaces for key components like visualization, package management, documentation lookup

This layout can be customized extensively based on personal preference – panes can be added, removed, resized, and reordered via the View > Panes menu.

For example, you may opt to hide the console pane to maximize space for the script editor, or split documents horizontally to view multiple source files. Nearly any interface configuration is possible.

RStudio Projects

To keep work organized, RStudio introduces the concept of projects – self-contained workspaces storing related data, code, results and reports as a portable unit.

Creating a new project generates an associated directory containing an .Rproj file. Sub-folders such as data/ or figures/ are a common convention, but you create them yourself. Switching to a project sets the working directory to the project root and loads that project’s workspace and history.

This enables easily bundling up everything required to resume work later, share with others, or archive results submitted for publication.

Projects can be created from the File > New Project menu, or from the R console with a helper such as usethis::create_project().
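A consistent folder layout can also be scripted. The sketch below creates a small skeleton in a temporary directory using only base R; the my-analysis project name and the folder names are illustrative assumptions, not something RStudio requires:

```r
# Hypothetical project location; a temporary directory keeps the sketch harmless
project_dir <- file.path(tempdir(), "my-analysis")

# Illustrative sub-folders; rename them to suit the project
subdirs <- c("data", "figures", "R", "reports")

for (d in subdirs) {
  dir.create(file.path(project_dir, d), recursive = TRUE, showWarnings = FALSE)
}

# List what was created, relative to the project root
created <- list.dirs(project_dir, full.names = FALSE, recursive = FALSE)
print(sort(created))
```

Because every path is built with file.path() relative to the project root, the same script works unchanged wherever the project directory is copied.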

R Basics

Now that RStudio is set up, let’s go through some R basics – from simple arithmetic to data structures and analysis.

Math Operators

Common mathematical operators like +, -, *, / behave as expected:

2 + 2

## [1] 4

Use parentheses to dictate order of operations:

(2 + 3) * 4

## [1] 20

Exponents, logs, trig and other math functions are included:

sin(pi/2)

## [1] 1

See ?Math for the full list.
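A few more of these functions and operators in action, as a brief sketch with arbitrary input values:

```r
2^10                 # exponentiation: 1024
sqrt(16)             # square root: 4
exp(1)               # Euler's number e, about 2.718282
log(100, base = 10)  # logarithm with an explicit base: 2
17 %/% 5             # integer division: 3
17 %% 5              # remainder (modulo): 2
round(pi, 2)         # rounding to 2 decimal places: 3.14
```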

Variable Assignment

Use the <- arrow for assignment:

x <- 2 + 2
print(x)

## [1] 4

This assigns the result of the expression on the right to the variable name on the left.

In RStudio, the Alt + - keyboard shortcut inserts the assignment arrow.

Data Types

R includes common data types like:

numeric – decimal numbers (stored as doubles by default)
integer – whole numbers (written with an L suffix, e.g. 5L)
complex – complex numbers with real & imaginary parts
logical – boolean TRUE / FALSE values
character – string text

Check types with the class() function:

x <- 5       # numeric 
y <- 5L      # integer
z <- 5+3i    # complex
a <- TRUE    # logical / boolean
b <- "text"  # character

print(class(x)) 
print(class(y))
print(class(z))  
print(class(a))
print(class(b))

## [1] "numeric"
## [1] "integer"
## [1] "complex"
## [1] "logical"
## [1] "character"
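Beyond class(), it often helps to see how a value is stored and how R converts between types. A short sketch with illustrative values:

```r
x <- 5
y <- 5L

typeof(x)          # "double": how the value is stored internally
typeof(y)          # "integer"
is.numeric(y)      # TRUE: integers also count as numeric

# Explicit coercion between types
as.integer("42")   # character to integer: 42
as.character(3.5)  # numeric to character: "3.5"
as.logical(0)      # 0 becomes FALSE, any other number becomes TRUE

# Mixing types in c() coerces everything to the most general type
mixed <- c(1, TRUE, "a")
class(mixed)       # "character"
```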

Data Structures

Beyond atomic types, R includes data structures for storing data collections:

Vectors – Ordered collections, 1d arrays
Lists – Ordered, heterogeneous collections
Matrices – 2d rectangular collections of a single type
Arrays – Multidimensional generalizations of matrices
Data Frames – Tabular datasets composed of equal-length vectors
Factors – Nominal/ordinal categorical variables

Some usage examples:

vec <- c(1, 3, 5)               # vector
lst <- list(a = 1, b = "text")  # list 

matrix(1:6, nrow = 2, ncol = 3) # matrix

array(1:24, dim = c(3,4,2))    # 3D array

data.frame(x = 1:3, y = 4:6)    # data frame

Data frames (used most commonly) will be covered more below.

See ?vector, ?list, ?matrix, ?array, ?data.frame and ?factor for more details on each structure.
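Once created, these structures are accessed with a small set of indexing operators. A brief sketch using illustrative values (the sizes factor is a made-up example):

```r
vec <- c(10, 20, 30)
vec[2]                      # second element (R indexes from 1): 20

lst <- list(a = 1, b = "text")
lst$b                       # list element by name: "text"
lst[["a"]]                  # equivalent double-bracket access: 1

m <- matrix(1:6, nrow = 2, ncol = 3)
m[2, 3]                     # row 2, column 3 (filled column-wise): 6

df <- data.frame(x = 1:3, y = 4:6)
df$y                        # a column as a vector: 4 5 6
df[df$x > 1, ]              # rows where x exceeds 1

sizes <- factor(c("small", "large", "small"),
                levels = c("small", "large"))
levels(sizes)               # "small" "large"
table(sizes)                # counts per level
```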

Importing & Tidying Data

Loading datasets is essential to analysis in RStudio.

Data can originate locally from files, databases or spreadsheets, as well as externally via web APIs or scraping.

Importing Local Data

Common options for getting data locally into R include:

CSV files – read.csv()
Text/log files – read.delim()/read.fwf()/read.table()
Excel spreadsheets – readxl::read_excel()
JSON – jsonlite::fromJSON()
RDBMS – RMySQL, RPostgres, RSQLite packages
SPSS/SAS/Stata – haven::read_sas(), haven::read_spss() etc

For example reading a CSV:

df <- read.csv("data.csv")

Or Excel sheet:

library(readxl)
df <- read_excel("data.xlsx", sheet = "Sheet1")

Data can also be imported interactively through the File > Import Dataset menu.
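To try read.csv() without an existing file, the sketch below writes a tiny made-up table to a temporary CSV and reads it back. The column names and values are invented for illustration:

```r
# Write a tiny example table to a temporary CSV file (illustrative data)
csv_path <- tempfile(fileext = ".csv")
example <- data.frame(id = 1:3,
                      city = c("Oslo", "Lima", "Pune"),
                      temp_c = c(4.5, 19.0, 31.2))
write.csv(example, csv_path, row.names = FALSE)

# Read it back; stringsAsFactors = FALSE keeps text columns as character
df <- read.csv(csv_path, stringsAsFactors = FALSE)

str(df)    # column names and types
head(df)   # first rows
```

Inspecting the result with str() right after import is a quick way to catch columns that were read with an unexpected type.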

Tidy Data

Best practice is for data to be tidy before analyzing – meaning:

Each variable has its own column
Each observation forms a row
Each value sits in its own cell

This facilitates data manipulation using the "tidyverse" set of packages like dplyr and tidyr designed specifically for tidy data.

Untidy formats, such as a single column header that covers several variables, should be broken out:

Year, Location, Sales Reps (John, Jane, Jake), Revenue 
2010, East, 10, 20, 30, 100000

This would be broken into:

Year, Location, JohnSales, JaneSales, JakeSales, Revenue
2010, East, 10, 20, 30, 100000

Each value now sits under its own column header.

The tidyr pivot functions, such as pivot_longer() and pivot_wider(), are handy for reshaping data between wide and long layouts.
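As an illustration of such reshaping, the sketch below takes the table above one step further by turning the three per-rep columns into a Rep column and a Sales column. It uses base R's stack() so that it runs without extra packages; tidyr::pivot_longer() is the tidyverse counterpart. The Rep and Sales names are assumptions made for this example:

```r
# Wide table from the example above: one sales column per rep
wide <- data.frame(Year = 2010, Location = "East",
                   JohnSales = 10, JaneSales = 20, JakeSales = 30,
                   Revenue = 100000)

rep_cols <- c("JohnSales", "JaneSales", "JakeSales")

# stack() turns the rep columns into a values/ind pair, one row per rep
stacked <- stack(wide[rep_cols])

long <- data.frame(
  Year     = wide$Year,
  Location = wide$Location,
  Rep      = sub("Sales$", "", as.character(stacked$ind)),  # strip the suffix
  Sales    = stacked$values,
  Revenue  = wide$Revenue,
  stringsAsFactors = FALSE
)
print(long)
```

Note that Revenue is repeated on every row in the long layout, so it should not be summed across reps.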

Data Analysis Examples

Now let’s go through some examples working with datasets for visualization and modeling tasks.

The iris Dataset

A classic dataset for data analysis is Ronald Fisher’s Iris flower dataset. This contains measurements of 150 flowers across 3 species:

# Load dataset
iris <- datasets::iris 

# View column names
names(iris)

## [1] "Sepal.Length" "Sepal.Width"  "Petal.Length" "Petal.Width"  "Species"

It captures numeric measurements of sepal & petal dimensions, along with the categorical species classification.

First, let’s check out some summaries:

summary(iris)
str(iris)

This reveals variable ranges and the data types – a combination of numeric measurements and factors (categoricals) for the species.
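Summaries per species are often more informative than overall ones. A minimal sketch using base R's aggregate() and table():

```r
# Mean of every numeric column, grouped by species
species_means <- aggregate(. ~ Species, data = iris, FUN = mean)
print(species_means)

# Number of flowers recorded for each species
species_counts <- table(iris$Species)
print(species_counts)
```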

Visualizing

Now let’s visualize the iris data by plotting the petal dimensions grouped by species:

plot(iris$Petal.Width, 
     iris$Petal.Length,
     col = iris$Species)

[Figure: Iris Scatterplot]

And as a boxplot showing sepal width distributions per species:

boxplot(iris$Sepal.Width ~ iris$Species, 
        xlab = "Species", 
        ylab = "Sepal Width (cm)")

[Figure: Iris Boxplot]

These plots give a sense of how the petal measurements cluster by species and how sepal width ranges differ across the 3 iris species.

Many more advanced visualizations are possible with ggplot2 and other graphic packages.

Modeling

Given these measurements, we can train a classification model to predict the iris species from new input data.

First, we’ll split the data into 80% train and 20% test:

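set.seed(22519)  # fix the random seed so the split is reproducible

# Randomly pick 80% of the row indices for training
indexes <- sample(seq_len(nrow(iris)), size = 0.8 * nrow(iris))

train <- iris[indexes, ]   # 120 rows
test <- iris[-indexes, ]   # remaining 30 rows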
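A plain random sample does not guarantee that each species is equally represented in both parts. One possible variation is a stratified split, sketched below in base R; the train_strat and test_strat names are chosen here to avoid clashing with the objects above, and caret's createDataPartition() offers a packaged version of the same idea:

```r
set.seed(22519)

# Sample 80% of the row indices separately within each species
rows_by_species <- split(seq_len(nrow(iris)), iris$Species)
train_idx <- unlist(
  lapply(rows_by_species,
         function(rows) sample(rows, size = round(0.8 * length(rows)))),
  use.names = FALSE
)

train_strat <- iris[train_idx, ]
test_strat  <- iris[-train_idx, ]

table(train_strat$Species)  # 40 rows per species
table(test_strat$Species)   # 10 rows per species
```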

Then we can train a model – here a random forest, but many other options are available:

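library(randomForest)  # run install.packages("randomForest") first if it is missing

# Predict Species from all other columns
model <- randomForest(Species ~ ., data = train)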

Now predict on the held-out test data:

predictions <- predict(model, newdata = test)

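And evaluate accuracy, the share of the 30 held-out flowers whose species was predicted correctly:

mean(predictions == test$Species)

Random forests usually classify the large majority of iris test rows correctly, which is not bad for such a quick modeling exercise. The exact figure depends on the seed and the randomForest version.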
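A single accuracy number hides which species get confused with one another. A confusion matrix, built with base R's table(), shows this. To keep the sketch runnable without extra packages, it uses a hand-written two-threshold rule as a stand-in classifier (an assumption for illustration, not the random forest above) and scores it on the full dataset. With the model above, the equivalent call would be table(predictions, test$Species).

```r
# Stand-in classifier (NOT the random forest above): two hand-picked thresholds
rule_predict <- function(d) {
  pred <- ifelse(d$Petal.Length < 2.5, "setosa",
          ifelse(d$Petal.Width < 1.75, "versicolor", "virginica"))
  factor(pred, levels = levels(iris$Species))
}

predicted <- rule_predict(iris)

# Rows are predicted classes, columns are the true species
conf_mat <- table(Predicted = predicted, Actual = iris$Species)
print(conf_mat)

# Overall accuracy is the diagonal (correct predictions) over the total
accuracy <- sum(diag(conf_mat)) / sum(conf_mat)
print(accuracy)
```

Off-diagonal cells show where the errors fall, which for iris is typically between versicolor and virginica.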

We have barely scratched the surface of R’s machine learning capabilities. Check out the caret package and its workflows for comparing dozens of algorithms with parameter tuning, cross-validation and other best practices all built-in.

Reproducible Reporting

Once you have done your analysis, effectively communicating results and findings is critical.

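R Markdown documents combine R code, its results, narrative text and visualizations in a single file, which can be published as reports, presentations, papers and more. An R Notebook is an R Markdown document whose code chunks can be run interactively, with the output shown inline: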

[Figure: R Notebooks]

Output options range from HTML/PDF documents to notebooks, dashboards, books and journal articles.

R Notebooks are reproducible – the code + narrative allows regenerating any data, analysis or reports fully automatically.

This facilitates sharing more transparent, self-contained analyses for diagnosis, collaboration or publication of research in academia and industry.
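To make the format concrete, the sketch below writes a very small R Markdown file from R. The file name, title and chunk label are hypothetical, and the rendering step is left commented out because it needs the rmarkdown package and pandoc:

```r
# Build the chunk delimiter programmatically (three backticks)
fence <- strrep("`", 3)

rmd_lines <- c(
  "---",
  "title: \"Iris summary\"",
  "output: html_document",
  "---",
  "",
  "The iris dataset has `r nrow(iris)` rows.",   # inline R code in the text
  "",
  paste0(fence, "{r petal-plot}"),               # start of a code chunk
  "plot(iris$Petal.Width, iris$Petal.Length, col = iris$Species)",
  fence                                          # end of the code chunk
)

# Hypothetical file name, written to a temporary directory
rmd_path <- file.path(tempdir(), "iris-report.Rmd")
writeLines(rmd_lines, rmd_path)

# Rendering requires the rmarkdown package, so it is not run here:
# rmarkdown::render(rmd_path)
cat(readLines(rmd_path), sep = "\n")
```

The header between the --- lines sets the output format, and everything else is narrative text interleaved with code chunks that are re-run on each render.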

RStudio Server/Compute Options

While the RStudio IDE has been covered extensively here as a desktop application, RStudio Server editions are also available to allow access to remote development environments through a web browser.

RStudio Server can be deployed on a centralized server or on a cloud platform such as AWS or GCP, and can connect to computing resources and storage hosted elsewhere.

Some motivations for remote development environments:

Streamline administration without needing to install/update software on individual machines
Leverage more powerful server hardware like GPUs or large memory capacities when local machines are the constraint
Bring analyses to the data for scenarios where transferring data to local desktops may be restricted for regulatory reasons
Promote collaboration across geographic distances more easily

If opting to self-host RStudio server, some recommended Ubuntu server optimizations:

Fast processors – Prioritize high CPU clock speeds and core counts
Max RAM – Memory capacity is often the primary bottleneck
Fast storage – SSD storage helps with launching environments/reading data
Resource allocation – Control per-user RAM allocations depending on workload target
Scaling clusters – Horizontally scale RStudio jobs across auto-scaling server clusters to enable parallel computing of very large workloads
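Before scaling out across machines, the extra cores of a single server can already be put to work with the parallel package that ships with R. A minimal sketch, assuming a Linux host (mclapply() relies on process forking); slow_square is a toy stand-in for an expensive computation and the cap of 2 workers is arbitrary:

```r
library(parallel)  # part of the standard R distribution

# Toy workload standing in for an expensive computation
slow_square <- function(i) {
  Sys.sleep(0.1)
  i^2
}

# Use at most 2 workers here; detectCores() can return NA on some platforms
n_cores <- detectCores()
workers <- if (is.na(n_cores)) 1L else min(2L, n_cores)

# mclapply() forks worker processes; results come back in input order
results <- mclapply(1:8, slow_square, mc.cores = workers)
print(unlist(results))
```

Spreading work across several servers needs additional tooling beyond this single-machine sketch.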

Best Practices

To summarize some best practices covered in this guide for efficient workflows:

Organize projects in self-contained folders storing related data/code/results
Strive for making data tidy before analyzing
Control randomness/sampling via a consistent seed for reproducibility
Consider notebooks to unify code/comments/results in one view
Take advantage of built-in RStudio tools for version control, plotting, packages etc
Profile + optimize performance for intensive computing procedures
Use remote development servers to scale complex workloads on faster infrastructure

Following these and other good habits will ensure analyses run smoothly.
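The point about seeds is easy to verify directly. A short sketch, where the seed value 42 and the sampling range are arbitrary choices:

```r
# Same seed, same draws
set.seed(42)
first <- sample(1:100, 5)

set.seed(42)
second <- sample(1:100, 5)

identical(first, second)  # TRUE

# Without resetting the seed, the next call continues the random stream
third <- sample(1:100, 5)
print(third)
```

Setting the seed once at the top of a script is usually enough to make every later random step repeatable from run to run.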

Conclusion

This guide just scratched the surface of using RStudio for advanced analysis tasks on Ubuntu. Visit RStudio’s learning resources for more or refer to documentation for specifics on any aspect like visualization or modeling.

With such a breadth of tools and capabilities, backed by a large open-source community, RStudio provides an excellent environment for taking projects from data to insights using the R language.

So in summary:

🐧 RStudio desktop makes R much friendlier to work with
⚙️ Tweak and customize the IDE layout to your preference
📁 Use projects to organize all files for a given analysis
📊 Import, process and explore your datasets
🔬 Train models and create beautiful visualizations
🚀 Scale up development environments remotely with RStudio Server editions

RStudio + R let you turn data into knowledge, seamlessly taking ideas from conception all the way through publication and sharing – give it a try with your next analysis!
