advanced-data-wrangling-in-R-legacy/code/02_subsetting.Rmd at main · dlab-berkeley/advanced-data-wrangling-in-R-legacy

This repository was archived by the owner on Jan 17, 2022. It is now read-only.
---
title: 'Subsetting'
author: "D-Lab"
date: "`r format(Sys.time(), '%d %B, %Y')`"
output:
  html_document:
    df_print: paged
    number_sections: yes
    toc: yes
    toc_float: yes
  pdf_document:
    toc: yes
---
```{r include = FALSE}
# p_load() loads packages and, if necessary, installs missing ones first.
# install.packages() + library() = p_load()
# If you just want to install, then use p_install()
if (!require("pacman")) install.packages("pacman")
pacman::p_load(
  tidyverse, # for the tidyverse framework
  palmerpenguins,
  gapminder,
  kableExtra,
  flextable,
  nycflights13
)
```
# Subset Observations (Rows)
penguins <- palmerpenguins::penguins
## Choose rows by logical condition
- Single condition
penguins %>%
  filter(species == "Adelie") %>%
  arrange(desc(bill_length_mm))
The following filtering examples were inspired by [Suzan Baert's dplyr blog post](https://suzan.rbind.io/2018/02/dplyr-tutorial-3/).
- Multiple conditions (numeric)
# First example
penguins %>%
  filter(bill_length_mm <= 42, bill_length_mm > 35)
# Same as above: a comma between conditions means AND
penguins %>%
  filter(bill_length_mm <= 42 & bill_length_mm > 35)
# Not the same as above: `|` means OR, so a row needs to meet only one condition
penguins %>%
  filter(bill_length_mm < 42 | bill_length_mm > 35)
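To see why the third version differs from the first two, here is a minimal sketch on a small made-up tibble (`toy` and its values are hypothetical, not the penguins data). A comma or `&` keeps rows that meet both conditions, while `|` keeps rows that meet either one, which for these two conditions is every row with a non-missing value.

```r
library(dplyr)

# Hypothetical toy data standing in for penguins$bill_length_mm
toy <- tibble(bill_length_mm = c(33, 36, 42, 45, NA))

# AND: both conditions must hold (keeps 36 and 42)
both <- toy %>%
  filter(bill_length_mm <= 42, bill_length_mm > 35)

# OR: one condition is enough (keeps 33, 36, 42, and 45)
either <- toy %>%
  filter(bill_length_mm < 42 | bill_length_mm > 35)

nrow(both)   # 2
nrow(either) # 4
```

In both cases `filter()` drops the `NA` row, because a comparison with `NA` is neither `TRUE` nor `FALSE`.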
**Challenge 1**
Use `filter(between())` to (1) find penguins whose bill lengths are between 35.1 and 42.2 and (2) count the number of these observations.
#User code goes here
- Multiple conditions (character)
# Filter species whose names include "ent"; `grepl` is a base R function
penguins %>%
  filter(grepl("ent", tolower(species)))
# Or, if you prefer the tidyverse way (`str_detect` comes from stringr)
penguins %>%
  filter(str_detect(tolower(species), "ent"))
# Filter to Biscoe and Dream Islands
penguins %>%
  filter(island %in% c("Biscoe", "Dream"))
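The `grepl()` and `str_detect()` calls above lowercase the column before matching. As a hedged alternative, the sketch below asks the matcher itself to ignore case, using stringr's `regex(ignore_case = TRUE)` and base R's `ignore.case = TRUE`. The `toy` tibble is hypothetical and only mimics the `species` column.

```r
library(dplyr)
library(stringr)

# Hypothetical toy data; species is a plain character column here
toy <- tibble(species = c("Adelie", "Gentoo", "Chinstrap", NA))

# Lowercase first, then match (the approach used above)
via_tolower <- toy %>%
  filter(str_detect(tolower(species), "ent"))

# Let stringr ignore case instead of transforming the column
via_regex <- toy %>%
  filter(str_detect(species, regex("ENT", ignore_case = TRUE)))

# Base R equivalent
via_grepl <- toy %>%
  filter(grepl("ent", species, ignore.case = TRUE))

via_tolower$species # "Gentoo"
```

All three keep only the "Gentoo" row. The missing value is dropped either way: `str_detect()` returns `NA` for it and `grepl()` returns `FALSE`, and `filter()` keeps only rows where the condition is `TRUE`.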
**Challenge 2**
Use `str_detect()` to find all the penguins whose species includes "Chin".
##User Code goes here 
penguins %>%
## Choose rows by position (row index)
penguins %>%
  arrange(desc(bill_length_mm)) %>%
  slice(1:6)
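As a small sketch of the same idea, `slice_head()` is a dplyr shortcut for taking the first rows, so after sorting it should return the same rows as `slice()` with an index range. The `toy` tibble and its `id` column are hypothetical.

```r
library(dplyr)

# Hypothetical toy data with an id column to track which rows are kept
toy <- tibble(
  id = 1:8,
  bill_length_mm = c(40, 35, 50, 45, 38, 42, 47, 36)
)

# Select by explicit row positions after sorting
by_index <- toy %>%
  arrange(desc(bill_length_mm)) %>%
  slice(1:3)

# Same rows, written with slice_head()
by_head <- toy %>%
  arrange(desc(bill_length_mm)) %>%
  slice_head(n = 3)

by_index$id # 3 7 4
```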
## Sample by fraction
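# For reproducibility
set.seed(1234)
# Older interface: sample_frac() is superseded by slice_sample()
penguins %>%
  sample_frac(0.10,
              replace = FALSE) # Without replacement
set.seed(1234)
# Current interface
penguins %>%
  slice_sample(prop = 0.10,
               replace = FALSE) # Without replacement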
## Sample by number 
set.seed(1234)
# Older interface: sample_n() is superseded by slice_sample()
penguins %>%
  sample_n(20,
           replace = FALSE) # Without replacement
set.seed(1234)
# Current interface
penguins %>%
  slice_sample(n = 20,
               replace = FALSE) # Without replacement
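The sampling calls above rely on `set.seed()` for reproducibility. The sketch below checks that on a hypothetical 20-row tibble: resetting the seed reproduces the same sample, `prop` and `n` control the sample size, and sampling without replacement never repeats a row.

```r
library(dplyr)

# Hypothetical toy data: 20 rows identified by id
toy <- tibble(id = 1:20)

# Sample a quarter of the rows (0.25 * 20 = 5 rows)
set.seed(1234)
by_prop <- toy %>% slice_sample(prop = 0.25)

# Resetting the seed reproduces the same sample
set.seed(1234)
by_prop_again <- toy %>% slice_sample(prop = 0.25)

# Sample a fixed number of rows, without replacement by default
by_n <- toy %>% slice_sample(n = 5)

nrow(by_prop)                             # 5
identical(by_prop$id, by_prop_again$id)   # TRUE
anyDuplicated(by_n$id)                    # 0
```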
## Top 10 rows ordered by bill length
penguins %>%
  slice_max(bill_length_mm, n = 10) # Variable first, Argument second 
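One detail worth knowing: `slice_max()` keeps ties by default (`with_ties = TRUE`), so it can return more than `n` rows. A minimal sketch on hypothetical toy data:

```r
library(dplyr)

# Hypothetical toy data with a tie for second place
toy <- tibble(
  id = 1:4,
  bill_length_mm = c(50, 48, 48, 40)
)

# Default: ties are kept, so asking for 2 rows returns 3
with_ties <- toy %>%
  slice_max(bill_length_mm, n = 2)

# with_ties = FALSE returns exactly n rows
no_ties <- toy %>%
  slice_max(bill_length_mm, n = 2, with_ties = FALSE)

nrow(with_ties) # 3
nrow(no_ties)   # 2
```

Set `with_ties = FALSE` when the result must have exactly `n` rows.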
# Subset Variables (Columns)
names(penguins)
## Select only numeric columns
# Only numeric 
penguins %>%
  select(where(is.numeric)) 
**Challenge 3** 
Use `select(where())` to find only non-numeric columns 
#User code goes here
## Select the columns that include "bill" in their names 
penguins %>%
  select(contains("bill"))
## Select the columns that include either "bill" or "mm" in their names 
- Base R way
`grepl` is one of base R's pattern matching functions.
penguins[grepl('bill|mm', names(penguins))]
**Challenge 4**
Use `select(matches())` to find columns whose names include either "bill" or "mm". 
#User code goes here
## Select the columns that start with "b"
penguins %>%
  select(starts_with("b"))
## Select the columns that end with "mm"
penguins %>%
  select(ends_with("mm"))
## Select the columns using both beginning and end string patterns 
The key idea is that you can use Boolean operators (`!`, `&`, `|`) to combine different string pattern matching statements.
penguins %>%
  select(starts_with("b") & ends_with("mm"))
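A minimal sketch of all three operators, using a hypothetical one-row tibble whose column names mimic a few of the penguins columns: `&` keeps columns matched by both helpers, `|` keeps columns matched by either, and `!` keeps everything a helper does not match.

```r
library(dplyr)

# Hypothetical toy data with penguins-like column names
toy <- tibble(
  species = "Adelie",
  bill_length_mm = 39.1,
  bill_depth_mm = 18.7,
  flipper_length_mm = 181,
  body_mass_g = 3750
)

# AND: starts with "b" and ends with "mm"
and_cols <- names(select(toy, starts_with("b") & ends_with("mm")))

# OR: starts with "b" or ends with "mm"
or_cols <- names(select(toy, starts_with("b") | ends_with("mm")))

# NOT: everything that does not end with "mm"
not_cols <- names(select(toy, !ends_with("mm")))

and_cols # "bill_length_mm" "bill_depth_mm"
```

Here `or_cols` holds every column except `species`, and `not_cols` holds `species` and `body_mass_g`.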
## Select bill_length_mm and move it before everything 
# By specifying a column 
penguins %>%
  select(bill_length_mm, everything())
## Select variables from a character vector.
penguins %>%
  select(any_of(c("species", "bill_length"))) %>%
  colnames()
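Note that `"bill_length"` is not a column name in the example above, and `any_of()` silently skips names that do not exist. The sketch below contrasts this with `all_of()`, which raises an error when a name is missing. The `toy` tibble and the `wanted` vector are hypothetical.

```r
library(dplyr)

# Hypothetical toy data
toy <- tibble(species = "Adelie", bill_length_mm = 39.1)

# "bill_length" does not match any column name
wanted <- c("species", "bill_length")

# any_of(): missing names are ignored
lenient <- names(select(toy, any_of(wanted)))

# all_of(): missing names raise an error, caught here for illustration
strict <- tryCatch(
  names(select(toy, all_of(wanted))),
  error = function(e) "error"
)

lenient # "species"
strict  # "error"
```

`any_of()` suits optional columns, while `all_of()` is the safer choice when every listed column is required.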
## Select the variables named in the character + number pattern
# Use mutate() to create example columns that follow a character + number pattern.
penguins_example <- penguins %>% 
  mutate(obs1 = 1,
         obs2 = 2,
         obs3 = 3,
         obs4 = 4,
         obs5 = 5)
penguins_example %>%
  select(num_range("obs", c(1:4)))