corrgram Function in R

The corrgram function in R from the corrgram package is a powerful tool for creating correlation matrix visualizations. It combines numerical correlation values with graphical representations to help identify patterns, relationships, and outliers in multivariate data.

corrgram Package Installation

The corrgram package must be installed before use. The following commands install and load it.

# Install and load the package
install.packages("corrgram")
library(corrgram)

# Load additional packages for examples
library(datasets)  # Ships with base R
library(corrplot)  # For comparison; install first with install.packages("corrplot")

The corrplot package offers an alternative way to visualize correlation matrices.

Syntax of the corrgram Function

The general syntax of the corrgram function in the R Language is:

corrgram(x, order = FALSE, panel = panel.shade, lower.panel = panel, 
         upper.panel = panel, diag.panel = NULL, text.panel = textPanel,
         label.pos = c(0.5, 0.5), label.srt = 0, cex.labels = NULL,
         font.labels = 1, row1attop = TRUE, dir = "", gap = 0, abs = FALSE, ...)

corrgram Examples

The following example uses the mtcars data set to draw a correlation matrix visualization. A numerical correlation matrix is also computed with the cor() function and stored in the cor_matrix object.

# Load mtcars dataset
data(mtcars)

# Basic corrgram
corrgram(mtcars, 
         main = "Correlation Matrix of mtcars Dataset",
         cex.main = 1.2)

# Calculate numerical correlations
cor_matrix <- cor(mtcars)
round(cor_matrix, 3)

## OUTPUT
        mpg    cyl   disp     hp   drat     wt   qsec     vs     am   gear   carb
mpg   1.000 -0.852 -0.848 -0.776  0.681 -0.868  0.419  0.664  0.600  0.480 -0.551
cyl  -0.852  1.000  0.902  0.832 -0.700  0.782 -0.591 -0.811 -0.523 -0.493  0.527
disp -0.848  0.902  1.000  0.791 -0.710  0.888 -0.434 -0.710 -0.591 -0.556  0.395
hp   -0.776  0.832  0.791  1.000 -0.449  0.659 -0.708 -0.723 -0.243 -0.126  0.750
drat  0.681 -0.700 -0.710 -0.449  1.000 -0.712  0.091  0.440  0.713  0.700 -0.091
wt   -0.868  0.782  0.888  0.659 -0.712  1.000 -0.175 -0.555 -0.692 -0.583  0.428
qsec  0.419 -0.591 -0.434 -0.708  0.091 -0.175  1.000  0.745 -0.230 -0.213 -0.656
vs    0.664 -0.811 -0.710 -0.723  0.440 -0.555  0.745  1.000  0.168  0.206 -0.570
am    0.600 -0.523 -0.591 -0.243  0.713 -0.692 -0.230  0.168  1.000  0.794  0.058
gear  0.480 -0.493 -0.556 -0.126  0.700 -0.583 -0.213  0.206  0.794  1.000  0.274
carb -0.551  0.527  0.395  0.750 -0.091  0.428 -0.656 -0.570  0.058  0.274  1.000

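Note that the diagonal shows the variable names. Because the basic call above relies on the default panel.shade for both triangles, every off-diagonal cell is a shaded square whose color gives the direction of the correlation and whose intensity gives its strength. Pies, ellipses, or confidence intervals appear only when the corresponding panel functions (panel.pie, panel.ellipse, panel.conf) are requested.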

[Figure: corrgram of the mtcars dataset]

The visualization and numerical results show the following:

  • Dark blue cells indicate strong positive correlations
  • Dark red cells indicate strong negative correlations
  • Lighter colors indicate weaker correlations
  • For example, mpg and wt show a strong negative correlation (-0.87)
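
The visual reading above can be cross-checked numerically. The following is a minimal sketch that lists variable pairs by absolute correlation using base R only; the object names idx and pairs_df are illustrative and not part of the corrgram package.

```r
# Numerical companion to the corrgram: rank variable pairs by |r|
cor_matrix <- cor(mtcars)

# Keep each pair once by taking only the upper triangle
idx <- which(upper.tri(cor_matrix), arr.ind = TRUE)
pairs_df <- data.frame(
  var1 = rownames(cor_matrix)[idx[, "row"]],
  var2 = colnames(cor_matrix)[idx[, "col"]],
  r    = cor_matrix[idx]
)

# Sort from strongest to weakest absolute correlation
pairs_df <- pairs_df[order(-abs(pairs_df$r)), ]
head(pairs_df, 5)
```

According to the matrix shown earlier, the cyl and disp pair (0.902) should head the list, with mpg and wt (-0.868) close behind.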

Customizing Panel and Ordering

One can easily customize the panel functions and the variable ordering. For example:

# Custom panel functions
corrgram(mtcars, 
         order = TRUE,  # PCA ordering
         lower.panel = panel.pie,  # Pies in lower triangle
         upper.panel = panel.conf, # Confidence intervals in upper
         diag.panel = panel.density, # Density plots on diagonal
         main = "Customized Corrgram")

# Alternative with different panels
corrgram(mtcars,
         lower.panel = panel.shade,
         upper.panel = panel.pts,  # Scatter plots
         diag.panel = panel.minmax, # Min-max values
         cex.labels = 1.2)

In these plots:

  • PCA ordering groups highly correlated variables together
  • Pies show the proportion of correlation (filled portion = |r|)
  • Shading intensity indicates correlation strength
  • Scatter plots in the upper triangle show actual data relationships
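
To make the idea of PCA ordering concrete, here is a rough sketch of one common PCA-based ordering: variables are sorted by the angle they form in the plane of the first two eigenvectors of the correlation matrix. Whether corrgram uses exactly this rule internally is an assumption here; the sketch is only meant to show why correlated variables end up next to each other.

```r
# Sketch of a PCA-based variable ordering (assumed to resemble order = TRUE)
cmat <- cor(mtcars)

# First two eigenvectors of the correlation matrix
ev <- eigen(cmat)$vectors[, 1:2]
e1 <- ev[, 1]
e2 <- ev[, 2]

# Angle of each variable in the (e1, e2) plane
alpha <- ifelse(e1 > 0, atan(e2 / e1), atan(e2 / e1) + pi)

# Variables with similar angles are placed side by side
ord <- order(alpha)
colnames(mtcars)[ord]

# Reordered correlation matrix: similar variables now form blocks
round(cmat[ord, ord], 2)
```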

Best Practices and Interpretation Guidelines

The following are best practices and interpretation guidelines when using the corrgram function in R:

  1. Color Interpretation:
    • Blue = Positive correlation
    • Red = Negative correlation
    • Saturation intensity = Strength of correlation
    • White = No correlation
  2. Pattern Recognition:
    • Blocks of similar colors indicate variable clusters
    • Check for multicollinearity (high correlations among predictors)
    • Look for unexpected correlations that might indicate data issues
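  3. Statistical Considerations:
    • Consider statistical significance when interpreting patterns
    • Use a correlation method appropriate for your data type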
  4. When to Use Different Panels:
    • panel.shade: Quick overview of correlation structure
    • panel.pie: Emphasize correlation magnitude
    • panel.ellipse: Show confidence and data spread
    • panel.pts: Identify outliers and nonlinear patterns
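
Some of these guidelines can be checked with base R alongside the plot. The sketch below tests one correlation for significance, computes a rank-based correlation matrix, and flags strongly correlated pairs; the 0.8 cutoff is an illustrative assumption, not a rule from the corrgram package.

```r
# Significance test and confidence interval for a single pair
test_result <- cor.test(mtcars$mpg, mtcars$wt)
test_result$estimate   # Pearson correlation
test_result$p.value    # p-value for H0: correlation = 0
test_result$conf.int   # 95% confidence interval

# Rank-based alternative when relationships are monotonic but not linear
spearman_matrix <- cor(mtcars, method = "spearman")
round(spearman_matrix["mpg", "wt"], 3)

# Flag candidate multicollinearity: pairs with |r| above an assumed cutoff
cor_matrix <- cor(mtcars)
high <- which(abs(cor_matrix) > 0.8 & upper.tri(cor_matrix), arr.ind = TRUE)
data.frame(
  var1 = rownames(cor_matrix)[high[, "row"]],
  var2 = colnames(cor_matrix)[high[, "col"]],
  r    = round(cor_matrix[high], 3)
)
```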

Summary of using corrgram Function in R

The corrgram function in R is an excellent tool for exploratory data analysis, providing both visual and numerical insights into correlation structures. The key takeaways are:

  1. Start with basic plots and add customization as needed
  2. Always complement visual analysis with numerical correlation values
  3. Consider statistical significance when interpreting patterns
  4. Use appropriate correlation methods for your data type
  5. Combine corrgram() with other EDA tools for comprehensive analysis

The corrgram function in R is particularly valuable in the early stages of data analysis, helping to identify relationships, potential problems, and directions for further investigation.

Computing Z Scores in R

Learn how to calculate z scores in R with this step-by-step tutorial. Use R’s powerful functions to standardize your data and analyze its distribution.

Given a distribution with mean $\overline{x}$ and standard deviation $s$, a location-scale transformation known as a Z-score will shift the distribution to have mean 0 and scale the spread to have standard deviation 1:

$$Z = \frac{x - \overline{x}}{s}$$

Suppose the variable $x$ has a normal distribution with mean 100 and standard deviation 15, that is, $x\sim N(100, 15^2)$, and $Z$ has a standard normal distribution, that is, $Z\sim N(0, 1)$. One can easily transform $x$ to Z-scores in R and visualize the result.

Z-Score Transformation in R

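# Sample from a normal distribution
# with mean = 100 and SD = 15
set.seed(123)  # Any fixed seed makes the sample reproducible
df <- data.frame(x = rnorm(100, mean = 100, sd = 15))

# Z-score transformation
# scale() returns a one-column matrix, so convert it to a plain numeric vector
df$z <- as.numeric(scale(df$x))

## Descriptive Statistics
summary(df)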

[Figure: summary (descriptive statistics) output for x and z in R]
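
The effect of the transformation can be verified directly. The short sketch below, which uses an arbitrary seed chosen only for reproducibility, shows what scale() stores and confirms that the standardized values have mean 0 and standard deviation 1 up to floating-point error.

```r
set.seed(123)                      # arbitrary seed, for reproducibility only
x <- rnorm(100, mean = 100, sd = 15)

z <- scale(x)                      # one-column matrix with attributes
attr(z, "scaled:center")           # the value subtracted: mean(x)
attr(z, "scaled:scale")            # the value divided by: sd(x)

z_vec <- as.numeric(z)             # drop the matrix structure
mean(z_vec)                        # numerically zero
sd(z_vec)                          # 1
```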

Transforming a Variable to Z-Score in R

One can visualize the original variable $x$ and the Z-score variable using a histogram and a density estimation.

##  ggplot for original variable
library(ggplot2)
p1 <- ggplot(df, aes(x = x))

# Histogram with density instead of count on y-axis
# (after_stat(density) replaces the deprecated ..density.. notation)
p1 <- p1 + geom_histogram(aes(y = after_stat(density)))
p1 <- p1 + geom_density(alpha = .2, fill="yellow")
p1 <- p1 + geom_rug()
p1 <- p1 + labs(title = "X ~ N(100, 15)")
p1

[Figure: histogram with density curve for the original variable x]

## ggplot for z variable
p2 <- ggplot(df, aes(x = z))

# Histogram with density instead of count on y-axis
# (after_stat(density) replaces the deprecated ..density.. notation)
p2 <- p2 + geom_histogram(aes(y = after_stat(density)))
p2 <- p2 + geom_density(alpha = 0.2, fill = "yellow")
p2 <- p2 + geom_rug()
p2 <- p2 + labs(title = "Z ~ N(0, 1)")
p2

[Figure: histogram with density curve for the z variable]

One can combine these two graphs using the grid.arrange() function from the gridExtra package.

library(gridExtra)
grid.arrange(grobs = list(p1, p2), ncol = 2)

[Figure: histograms of x and z side by side]

Calculating Z Scores in R Manually

For full control or for educational purposes, one can calculate z scores in R manually using basic arithmetic together with the mean() and sd() functions. The formula is:

$$z=\frac{x-\mu }{\sigma }$$

where:

  • $x$ is a data point
  • $\mu$ is the mean of the data
  • $\sigma$ is the standard deviation of the data (sd() returns the sample standard deviation, which uses the divisor $n - 1$)

Example: Z-score for a data frame column

# Create a sample data frame
df <- data.frame(
  pressure = c(98, 102, 100, 99, 101),
  temperature = c(20, 22, 23, 21, 25)
)

# Calculate z-scores for the 'pressure' column manually
mean_pressure <- mean(df$pressure)
sd_pressure <- sd(df$pressure)
df$pressure_z <- (df$pressure - mean_pressure) / sd_pressure

# Print the data frame with the new z-score column
print(df)

## Output
  pressure temperature pressure_z
1       98          20 -1.2649111
2      102          22  1.2649111
3      100          23  0.0000000
4       99          21 -0.6324555
5      101          25  0.6324555
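
The same manual result can be reproduced for every numeric column at once with scale(). This is a minimal sketch; the object name z_all and the "_z" suffix are illustrative choices.

```r
# Same sample data as in the manual example
df <- data.frame(
  pressure = c(98, 102, 100, 99, 101),
  temperature = c(20, 22, 23, 21, 25)
)

# Standardize all columns in one step; scale() works column by column
z_all <- as.data.frame(scale(df))
names(z_all) <- paste0(names(df), "_z")

# Manual calculation for comparison
manual_z <- (df$pressure - mean(df$pressure)) / sd(df$pressure)
all.equal(z_all$pressure_z, manual_z)   # TRUE

cbind(df, z_all)
```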

R Language: A Quick Reference Guide – IV

R Quick Reference Guide

This Quick Reference Guide to the R language gives short descriptions of widely used commands. It is meant to help beginners and intermediate users of the R Programming Language look up functions quickly. The guide is organized into groups; this part is R Language: A Quick Reference – IV.

This Quick Reference will help in performing different descriptive statistics on vectors, matrices, lists, data frames, arrays, and factors.

Basic Descriptive Statistics in R Language

The following is a list of widely used functions that are helpful in computing descriptive statistics. They are not descriptive statistics functions in the strict sense, but they serve as building blocks for computing other descriptive statistics.

| R Command | Short Description |
|---|---|
| sum(x1, x2, … , xn) | Computes the sum/total of the $n$ numeric values given as arguments |
| prod(x1, x2, … , xn) | Computes the product of all $n$ numeric values given as arguments |
| min(x1, x2, … , xn) | Gives the smallest of all $n$ values given as arguments |
| max(x1, x2, …, xn) | Gives the largest of all $n$ values given as arguments |
| range(x1, x2, … , xn) | Gives both the smallest and largest of all $n$ values given as arguments |
| pmin(x1, x2, …) | Returns the parallel (element-wise) minima of the input vectors |
| pmax(x1, x2, …) | Returns the parallel (element-wise) maxima of the input vectors |
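
A short illustration of these functions follows, using two made-up vectors x and y. Note the difference between min(), which returns a single value across all arguments, and pmin(), which compares the vectors position by position.

```r
x <- c(4, 9, 2, 7)
y <- c(5, 1, 8, 3)

sum(x)       # 22
prod(x)      # 504
min(x, y)    # 1, smallest value across both vectors
max(x, y)    # 9, largest value across both vectors
range(x)     # 2 9

pmin(x, y)   # 4 1 2 3, element-wise minima
pmax(x, y)   # 5 9 8 7, element-wise maxima
```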

Statistical Descriptive Statistics in R Language

The following functions are used to compute measures of central tendency, measures of dispersion, and measures of positions.

| R Command | Short Description |
|---|---|
| mean(x) | Computes the arithmetic mean of all elements in $x$ |
| sd(x) | Computes the standard deviation of all elements in $x$ |
| var(x) | Computes the variance of all elements in $x$ |
| median(x) | Computes the median of all elements in $x$ |
| quantile(x) | Computes the median, quartiles, and extremes in $x$ |
| quantile(x, p) | Computes the quantiles specified by $p$ |
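
The following sketch applies these functions to a small made-up vector. The quantile results assume R's default quantile algorithm (type 7).

```r
x <- c(2, 4, 4, 5, 7, 9, 11)

mean(x)      # 6
median(x)    # 5
var(x)       # 10, sample variance with divisor n - 1
sd(x)        # square root of the variance, about 3.162

# Five-number summary: 0%, 25%, 50%, 75%, 100%
q <- quantile(x)
q            # 2 4 5 8 11

# Specific quantiles given as probabilities
q_p <- quantile(x, c(0.1, 0.9))
q_p          # 3.2 9.8
```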

Cumulative Summaries in R Language

The following functions are also helpful in computing the other descriptive calculations.

| R Command | Short Description |
|---|---|
| cumsum(x) | Computes the cumulative sum of $x$ |
| cumprod(x) | Computes the cumulative product of $x$ |
| cummin(x) | Computes the cumulative minimum of $x$ |
| cummax(x) | Computes the cumulative maximum of $x$ |
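
As a quick illustration with a made-up vector, each cumulative function returns a vector of the same length as its input, where element i summarizes the first i values.

```r
x <- c(3, 1, 4, 1, 5)

cumsum(x)    # 3 4 8 9 14
cumprod(x)   # 3 3 12 12 60
cummin(x)    # 3 1 1 1 1
cummax(x)    # 3 3 4 4 5
```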

Sorting and Ordering Elements in R Language

The sorting and ordering functions are especially useful in non-parametric methods.

| R Command | Short Description |
|---|---|
| sort(x) | Sorts all elements of $x$ in ascending order |
| sort(x, decreasing = TRUE) | Sorts all elements of $x$ in descending order |
| rev(x) | Reverses the elements in $x$ |
| order(x) | Gets the ordering permutation of $x$ |
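
The difference between sort() and order() is easiest to see on a small made-up vector: sort() returns the values, while order() returns the positions that would sort them.

```r
x <- c(30, 10, 20)

sort(x)                      # 10 20 30
sort(x, decreasing = TRUE)   # 30 20 10
rev(x)                       # 20 10 30, original order reversed

ord <- order(x)
ord                          # 2 3 1, positions of the sorted values
x[ord]                       # 10 20 30, same as sort(x)
```

Because order() returns indices, it can also be used to reorder the rows of a data frame by one of its columns.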

Sequence and Repetition of Elements in R Language

These functions are used to generate a sequence of numbers or repeat the set of numbers $n$ times.

| R Command | Short Description |
|---|---|
| a:b | Generates a sequence of numbers from $a$ to $b$ in steps of size 1 |
| seq(n) | Generates a sequence of numbers from 1 to $n$ |
| seq(a, b) | Generates a sequence of numbers from $a$ to $b$ in steps of size 1; it is the same as a:b |
| seq(a, b, by = s) | Generates a sequence of numbers from $a$ to $b$ in steps of size $s$ |
| seq(a, b, length.out = n) | Generates a sequence of $n$ equally spaced numbers from $a$ to $b$ |
| rep(x, n) | Repeats the whole of $x$ $n$ times |
| rep(x, each = n) | Repeats each element of $x$ $n$ times before moving to the next element |
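
The following sketch shows these commands on small made-up inputs. The full argument name length.out is used here; the shorter length also works through R's partial argument matching.

```r
2:6                         # 2 3 4 5 6
seq(5)                      # 1 2 3 4 5
seq(2, 6)                   # 2 3 4 5 6, same as 2:6

s_by  <- seq(0, 1, by = 0.25)        # 0.00 0.25 0.50 0.75 1.00
s_len <- seq(0, 1, length.out = 5)   # same five values, specified by count

rep(c(1, 2), 3)             # 1 2 1 2 1 2, whole vector repeated
rep(c(1, 2), each = 3)      # 1 1 1 2 2 2, each element repeated
```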

https://gmstat.com