Remarks

  • It seems like a lot of you are proficient in programming but not necessarily familiar with R, and are worried this may hinder your performance in this class. I looked for R resources for you all, but most of them were either too basic for those who already know programming or included more about R than you necessarily need (although it may be good to learn). Therefore, I created this no-nonsense guide that only includes concepts I think are essential/helpful for the course. I will try to draw parallels between Python/C++ and R whenever possible for those of you who are proficient in these languages.

  • I will constantly update this guide as the course progresses and students ask me more questions about R. Most of the R functions you will use are closely tied to the statistical concepts that you learn about in lecture. Therefore, I will add them to this guide after you have learned about the concepts in lecture to not confuse/stress you all out.

  • I have omitted programming concepts that are not specific to R, like for loops and if statements, since they can be easily googled.

  • If you learned C++ by taking EECS 280/281, don’t even stress out about the programming in this course. Come to office hours and teach me about C++ please.

  • You may use Python for the homework, but R may be worth learning because we will be using Stan later in the course and, in my opinion, it is easier to use from R. (There is a Python version of it though.)

tl;dr Don’t worry about the programming in this course. This guide includes most of the R concepts you need to know.

Data Types/Structures in R

  • R has 6 basic data types (character, numeric, integer, logical, complex, raw). We will primarily be concerned with the numeric, logical, and integer data types.

Data Structures in R

  1. atomic vector
  2. list
  3. matrix
  4. data frame
  5. factor

We will discuss these one-by-one below.

Atomic Vector

  • These are kind of like lists in Python and arrays/vectors in C++.

  • Atomic means they can only hold data of a single data type.

To create an atomic vector, we use the function c(). The “c” here stands for combine or concatenate.

a <- c(1,2,3,4)
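
Here is a small sketch of what "atomic" means in practice. The variable name mixed and the values are just made up for illustration. If you mix types inside c(), R does not give an error; it quietly converts everything to one common type.

```r
# Mixing types in c() silently coerces everything to one common type.
mixed <- c(1, "two", TRUE)
mixed
## [1] "1"    "two"  "TRUE"
class(mixed)
## [1] "character"

# Logical and numeric together become numeric (TRUE -> 1, FALSE -> 0).
c(1.5, TRUE, FALSE)
## [1] 1.5 1.0 0.0
```

If a vector of numbers unexpectedly prints with quotes around each element, this kind of coercion is a likely cause.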

Indexing/Subsetting

  • R uses one-based indexing so be careful of this if you are coming from a 0-based indexing language like Python or C++.

To index an element in an (atomic) vector, we use []. For example, to index the first element of a:

a[1]
## [1] 1

To index a vector with a vector of indices:

a[1:3] # Here 1:3 is the same as c(1, 2, 3). This will be discussed later. 
## [1] 1 2 3

a[-i] will return a with the element in the \(i^{th}\) index deleted. In general, a[-index_vec] will return a with the elements in the indices in index_vec deleted.

a[-2]
## [1] 1 3 4
a[-(1:3)]
## [1] 4

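Alternatively, you can index vectors using logical statements. There are two ways to do this: index with the logical condition directly, or use which() to get the indices where the condition is TRUE and then index with those.

a[a < 3]
## [1] 1 2
which(a < 3) # returns indices, not values (they happen to coincide here)
## [1] 1 2
a[which(a < 3)]
## [1] 1 2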

Useful vectors

There are several functions in R that output useful vectors. Some of the ones that will be useful for this course are: rep, seq.

rep replicates a given object a specified number of times.

rep(x = 2, times = 10) # or just rep(2, 10) 
##  [1] 2 2 2 2 2 2 2 2 2 2
# Can also replicate vectors 
rep(a, 2) # create vector with a replicated twice.
## [1] 1 2 3 4 1 2 3 4

seq creates (incremented) sequences. It is like range in Python, except that seq includes the endpoint (i.e. range(1,10) in Python only covers the numbers 1 - 9, whereas seq(1,10) returns a vector with the numbers 1 - 10).

seq(from = 1, to = 10, by = 2)
## [1] 1 3 5 7 9
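
Two more uses of seq that may come in handy, shown as a quick sketch (the name grid is just an illustrative choice): asking for a fixed number of evenly spaced points with length.out instead of a step size, and counting down with a negative by.

```r
# 5 evenly spaced points from 0 to 1 (both endpoints included)
grid <- seq(from = 0, to = 1, length.out = 5)
grid
## [1] 0.00 0.25 0.50 0.75 1.00

# A negative step counts down
seq(from = 10, to = 1, by = -3)
## [1] 10  7  4  1
```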

To make a sequence of consecutive integers, you can use a:b. (This is usually used in for loops.)

1:10
##  [1]  1  2  3  4  5  6  7  8  9 10
for (i in 1:4) {
  print(i)
}
## [1] 1
## [1] 2
## [1] 3
## [1] 4

Vector operations

  • You may add, subtract, multiply, divide vectors element-wise by using +, -, *, /.
# create a vector and add it to a 
b <- c(2, 3, 4, 5)
a + b
## [1] 3 5 7 9
  • BE CAREFUL: In R, adding vectors of different lengths is valid. The shorter vector is recycled (repeated) to match the length of the longer vector and then added to it. R only gives a warning when the longer length is not a multiple of the shorter length. This is also true for other vector operations.
c <- c(2, 3, 4, 5, 6, 7, 8)
a_plus_c <- a + c
## Warning in a + c: longer object length is not a multiple of shorter object
## length
a_plus_c
## [1]  3  5  7  9  7  9 11

In this example, a + c = (1, 2, 3, 4, 1, 2, 3) + (2, 3, 4, 5, 6, 7, 8).
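
The case that tends to cause silent bugs is when the longer length is an exact multiple of the shorter one, because then R recycles without any warning. A minimal sketch (the names short and long are just for illustration):

```r
short <- c(10, 20)
long <- c(1, 2, 3, 4)

long + short   # short is recycled to (10, 20, 10, 20); no warning
## [1] 11 22 13 24

long * 2       # a single number is a length-1 vector, recycled to every element
## [1] 2 4 6 8
```

Recycling a single number is usually what you want. Recycling a longer vector is usually a mistake, so it is worth checking length() when results look odd.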

Appending/Removing

  • The syntax for appending and removing elements isn’t as nice as in Python.
### To append
# There is an append() function, but this method of appending suffices.
a <- c(a, 5)
a
## [1] 1 2 3 4 5
# Alternatively,
a[6] <- 6
a
## [1] 1 2 3 4 5 6
a <- c(a, 7:9) # can also append vectors to vectors
a
## [1] 1 2 3 4 5 6 7 8 9
# Remove second element in R 
a <- a[-2]
a
## [1] 1 3 4 5 6 7 8 9
# Remove first 3 elements in R
a <- a[-(1:3)]
a
## [1] 5 6 7 8 9
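
For completeness, here is a short sketch of the append() function mentioned above. Its after argument lets you insert in the middle of a vector. I use a fresh vector v (an illustrative name) so the example does not depend on the current value of a.

```r
v <- c(1, 2, 3, 4)          # fresh vector so the example stands alone

append(v, 99)               # same result as c(v, 99)
## [1]  1  2  3  4 99
append(v, 99, after = 2)    # insert 99 after the 2nd element
## [1]  1  2 99  3  4

v                           # v itself is unchanged; reassign to keep the result
## [1] 1 2 3 4
```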

Misc.

  • Nesting vectors inside vectors will just create one large vector. In other words, (atomic) vectors are flat.
a <- c(1, 2, c(3, c(4)))
a
## [1] 1 2 3 4
length(a)
## [1] 4
  • length gives you the length of the vector.
a <- c(1,2,3,4)
length(a)
## [1] 4

Lists

  • Lists can hold elements of different types. Unnamed lists behave like Python lists; once their elements are named, they can be used like dictionaries in Python.
  • Lists are R’s most flexible objects, but they also use more memory than atomic vectors.

To create a list:

# Create empty list 
x <- list()
x
## list()

You can also create lists with specified values.

# Create list with one numerical vector and one character vector
x <- list(c(1,2,3), c("one", "two", "three"))
x
## [[1]]
## [1] 1 2 3
## 
## [[2]]
## [1] "one"   "two"   "three"

Indexing

  • Indexing a list is a little strange in R. There are 3 ways to index a list:
  1. x[[i]]: returns the \(i^{th}\) element of list x
  2. x[i]: returns a list with \(i^{th}\) element as the only element
    • Can index using a vector: x[ind.vec] will return a list with elements indexed by indices in ind.vec
  3. x$ind: returns element named by ind. Python users can think of the names as keys.
x[[1]]
## [1] 1 2 3
class(x[[1]])
## [1] "numeric"
x[1]
## [[1]]
## [1] 1 2 3
class(x[1])
## [1] "list"
x[c(1,2)]
## [[1]]
## [1] 1 2 3
## 
## [[2]]
## [1] "one"   "two"   "three"
# name elements in list
names(x) <- c("one", "two")
x
## $one
## [1] 1 2 3
## 
## $two
## [1] "one"   "two"   "three"
x$one
## [1] 1 2 3
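
You can also name the elements when you create the list, and add new elements by assigning to a new name. The sketch below uses a made-up list called student with made-up values, purely for illustration.

```r
# Names can be given when the list is created (similar to dictionary keys).
student <- list(name = "Alex", scores = c(90, 85, 77))   # hypothetical data

student$scores          # access by name
## [1] 90 85 77
student[["scores"]][2]  # [[ ]] also accepts a name; then index the vector
## [1] 85

student$year <- 3       # assigning to a new name appends an element
names(student)
## [1] "name"   "scores" "year"
```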

Matrices

  • A matrix is R’s two-dimensional array; all of its elements have the same atomic type. (We usually use matrices containing numeric elements.)

  • Syntax for creating a matrix: matrix(data, nrow, ncol, byrow, dimnames)
    • data: input vector
    • nrow/ncol: number of rows/columns
    • byrow: Fill matrix with input vector by row (default is fill by column)
    • dimnames: row and column names (We usually leave this blank.)
# Create 2 x 3 matrix with elements 1 - 6
mat <- matrix(c(1,2,3,4,5,6), nrow = 2, ncol = 3)
mat
##      [,1] [,2] [,3]
## [1,]    1    3    5
## [2,]    2    4    6
# Fill by row
mat <- matrix(1:6, nrow = 2, ncol = 3, byrow = TRUE)
mat
##      [,1] [,2] [,3]
## [1,]    1    2    3
## [2,]    4    5    6

Indexing

To index a matrix:

  • mat[i,]: \(i^{th}\) row of mat
  • mat[,j]: \(j^{th}\) column of mat
  • mat[i,j]: \((i,j)^{th}\) element of mat
mat[1,] # first row of mat
## [1] 1 2 3
mat[,1] # first column of mat
## [1] 1 4
mat[1,3] # (1,3) element of mat
## [1] 3
  • Can also index matrix with vectors
mat[,1:3] # first 3 columns of mat
##      [,1] [,2] [,3]
## [1,]    1    2    3
## [2,]    4    5    6

Matrix Operations

  • A %*% B: Matrix multiply A and B. (Requires dim(A) = [m,n] and dim(B) = [n,l]; the result has dimension [m,l].)
  • A * B: Multiply A and B element-wise
  • A + B, A - B, A / B: add, subtract, divide A and B element-wise
  • You can only multiply/add/subtract/divide matrices of the same dimensions element-wise (unlike vectors).
A <- matrix(1:4, nrow = 2, ncol = 2)
B <- matrix(1:6, nrow = 2, ncol = 3)
A %*% B
##      [,1] [,2] [,3]
## [1,]    7   15   23
## [2,]   10   22   34
# A * B # This will give an error

B <- matrix(2:5, nrow = 2, ncol = 2)
A + B
##      [,1] [,2]
## [1,]    3    7
## [2,]    5    9
A / B
##           [,1] [,2]
## [1,] 0.5000000 0.75
## [2,] 0.6666667 0.80
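
If you are unsure whether two matrices can be multiplied, dim() is a quick check. The sketch below rebuilds the 2 x 2 and 2 x 3 matrices from the start of this section and compares the dimensions.

```r
A <- matrix(1:4, nrow = 2, ncol = 2)   # 2 x 2
B <- matrix(1:6, nrow = 2, ncol = 3)   # 2 x 3

dim(A)
## [1] 2 2
dim(B)
## [1] 2 3
dim(A %*% B)   # (2 x 2) times (2 x 3) gives 2 x 3
## [1] 2 3

# B %*% A would fail: B has 3 columns but A has only 2 rows.
ncol(B) == nrow(A)
## [1] FALSE
```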

Useful functions

  • colSums, rowSums, colMeans, rowMeans: returns vectors containing sum/mean of each column/row.
rowSums(A)
## [1] 4 6
  • cbind, rbind: bind matrices by column/row.
A <- matrix(1:4, nrow = 2, ncol = 2)
B <- matrix(5:8, nrow = 2, ncol = 2)
A
##      [,1] [,2]
## [1,]    1    3
## [2,]    2    4
B
##      [,1] [,2]
## [1,]    5    7
## [2,]    6    8
cbind(A,B)
##      [,1] [,2] [,3] [,4]
## [1,]    1    3    5    7
## [2,]    2    4    6    8
rbind(A,B)
##      [,1] [,2]
## [1,]    1    3
## [2,]    2    4
## [3,]    5    7
## [4,]    6    8
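
The other summary functions work the same way as rowSums. A quick sketch on a small 2 x 3 matrix (the name M is just for illustration):

```r
M <- matrix(1:6, nrow = 2, ncol = 3)
M
##      [,1] [,2] [,3]
## [1,]    1    3    5
## [2,]    2    4    6

colSums(M)    # one sum per column
## [1]  3  7 11
rowMeans(M)   # one mean per row
## [1] 3 4
colMeans(M)   # one mean per column
## [1] 1.5 3.5 5.5
```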

Data Frames

  • A more structured matrix: a table whose columns are named variables, where each column can hold a different data type.

To create a data frame: data.frame(Var1 = var1.values, Var2 = var2.values, ...)

# Create example data frame
df <- data.frame(Names = c("Yang", "Daniel"), Title = c("Instructor", "GSI"), Coolness = c(0, 100), Evilness = c(100, 0))
# Jk hehe don't kill me. 

df
##    Names      Title Coolness Evilness
## 1   Yang Instructor        0      100
## 2 Daniel        GSI      100        0
  • summary(df): Statistical summary of columns of df
summary(df)
##     Names          Title      Coolness      Evilness  
##  Daniel:1   GSI       :1   Min.   :  0   Min.   :  0  
##  Yang  :1   Instructor:1   1st Qu.: 25   1st Qu.: 25  
##                            Median : 50   Median : 50  
##                            Mean   : 50   Mean   : 50  
##                            3rd Qu.: 75   3rd Qu.: 75  
##                            Max.   :100   Max.   :100

Reading data

  • In STATS 451, you will most likely be reading in data from text, csv, or RData files.

Two functions for reading in data:

  • read.csv(file_path, header = TRUE): read csv at file_path
  • read.table(file_path, header = FALSE): read text file at file_path

    • header: If TRUE, file contains the names of the variables as its first line. (The values shown above are R’s defaults: TRUE for read.csv and FALSE for read.table.)
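
To see what header actually changes, here is a self-contained sketch. It writes a tiny made-up CSV to a temporary file (the column names and numbers are invented for illustration) and reads it back both ways.

```r
# Write a tiny CSV to a temporary file so the example is self-contained.
tmp <- tempfile(fileext = ".csv")
writeLines(c("height,weight", "170,65", "182,80"), tmp)   # hypothetical data

with_header <- read.csv(tmp, header = TRUE)    # first line becomes column names
names(with_header)
## [1] "height" "weight"
nrow(with_header)
## [1] 2

no_header <- read.csv(tmp, header = FALSE)     # first line is treated as data
names(no_header)
## [1] "V1" "V2"
nrow(no_header)
## [1] 3
```

If your data frame has one more row than you expected and columns called V1, V2, ..., the header setting is the first thing to check.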

To be completed.

Saving data

To be updated.

Indexing

To be updated.

Factors

To be updated

Packages in R

Installing packages

  • Syntax for installing packages: install.packages(pkg.name)
install.packages("ggplot2") # install ggplot2 

Using packages

  • Loading packages: library(pkg)
  • Package help page: help(package = "pkg")
library(ggplot2) # load ggplot2

Packages used in STATS 451

To be updated.

Monte-Carlo Sampling

Simulating Random Variables

Helpful R functions:

  • Simulate Normal: rnorm(n, mean = 0, sd = 1)
  • Simulate Binomial: rbinom(n, size, prob)
  • Simulate Uniform: runif(n, min = 0, max = 1)
  • Simulate Exponential: rexp(n, rate = 1)
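
A short sketch of these functions in use. The parameter values and the seed 451 are arbitrary choices for illustration. Calling set.seed() first makes the "random" draws reproducible, which helps when you want to get the same numbers each time you rerun your homework code.

```r
set.seed(451)   # any fixed integer works; 451 is an arbitrary choice

z <- rnorm(5, mean = 0, sd = 1)        # 5 draws from a standard Normal
k <- rbinom(5, size = 10, prob = 0.3)  # 5 draws, each a count of successes in 10 trials
u <- runif(5, min = 0, max = 1)        # 5 draws from Uniform(0, 1)
w <- rexp(5, rate = 2)                 # 5 draws from Exponential with rate 2 (mean 1/2)

# Resetting the seed reproduces exactly the same draws.
set.seed(451)
z_again <- rnorm(5, mean = 0, sd = 1)
identical(z, z_again)
## [1] TRUE
```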

Approximating Expectations

Let \(X\) be a random variable with density \(f(x)\). Suppose you want to approximate \(E[g(X)]\) for some function \(g\) (e.g. mean, variance, etc). One way to do this is with the following procedure:

  1. Simulate \(X_1, \dots, X_N \sim f(x)\) for large \(N\).
  2. Approximate expectation with \[E[g(X)] = \int g(x) f(x) dx \approx \frac{1}{N} \sum_{i=1}^N g(X_i)\]

Examples

Example 1

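Let \(X \sim N(2, 5)\), i.e. Normal with mean 2 and standard deviation 5 (note that rnorm takes the standard deviation, not the variance), and suppose I want to approximate \(E[X^2]\). We know that \(E[X^2] = Var(X) + [E(X)]^2 = 5^2 + 2^2 = 29\).

N <- 1000000
X.samples <- rnorm(N, 2, 5) # Draw samples from Normal with mean 2, sd 5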
E_X2.approx <- mean(X.samples^2) # Take mean of square of samples 
E_X2.approx
## [1] 29.03194
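
The same procedure also approximates probabilities, because a probability is the expectation of an indicator function. The sketch below is my own illustrative choice rather than a course example: it takes \(X \sim\) Exponential(1) and approximates \(P(X > 2)\), whose exact value is \(e^{-2}\). It also reports the Monte Carlo standard error, which is the usual rough measure of how far off the approximation might be. The seed and the names Y.samples, p.approx, p.exact and se are arbitrary.

```r
set.seed(1)                      # arbitrary seed, only for reproducibility
N <- 1000000
Y.samples <- rexp(N, rate = 1)   # draws from Exponential(1)

# g(x) = 1 if x > 2, else 0, so E[g(X)] = P(X > 2)
p.approx <- mean(Y.samples > 2)
p.exact <- exp(-2)               # known answer, about 0.1353

# Monte Carlo standard error of the estimate
se <- sd(Y.samples > 2) / sqrt(N)
c(approx = p.approx, exact = p.exact, se = se)
```

The approximation should typically land within a few standard errors of the exact value; increasing N shrinks the standard error at rate \(1/\sqrt{N}\).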

Approximating Distributions

You can also use Monte-Carlo sampling to approximate distributions. Some canonical examples of distributions you may want to approximate are sampling distributions of test statistics (e.g. sample mean).

If you are reading this, you have probably heard of the Central Limit Theorem (CLT). The Central Limit Theorem tells us that, for a large sample size, the sampling distribution of the sample mean is approximately Normal (provided the population variance is finite). We can verify this with Monte-Carlo sampling.

I will post the code for this after Quiz 3. I want you all to try to code it by yourself first.

Plotting

Helpful R functions

  • Plot histogram: hist(x)
  • Plot scatterplot: plot(x,y)
    • The plot function is actually very general and what it plots depends on the class of the input. However, in most cases, you will just use it to plot scatter plots.
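
A minimal sketch of both functions on simulated data. The data, seed and labels are made up for illustration; main, xlab and ylab are standard arguments for setting the title and axis labels.

```r
set.seed(2)                          # arbitrary seed for reproducibility
x <- rnorm(1000, mean = 0, sd = 1)   # simulated data (illustrative only)
y <- 2 * x + rnorm(1000, sd = 0.5)   # y depends linearly on x plus noise

hist(x, main = "Histogram of x", xlab = "x")
plot(x, y, main = "Scatterplot of y against x", xlab = "x", ylab = "y")
```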

Logicals/Missing values

  • You will frequently encounter missing or special values (NA, NaN, NULL, -Inf, Inf) when you are working with data in R. For example, if you try to take the log of 0:
log(0)
## [1] -Inf
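
Until this section is filled in, here is a small sketch of the functions I would reach for first when a result comes back as NA. The vector x is made up for illustration.

```r
x <- c(3, NA, 7)        # illustrative vector with one missing value

is.na(x)                # TRUE where the value is missing
## [1] FALSE  TRUE FALSE
mean(x)                 # NA propagates through most computations
## [1] NA
mean(x, na.rm = TRUE)   # na.rm = TRUE drops missing values first
## [1] 5

is.finite(c(1, NA, Inf, -Inf))   # FALSE for NA, Inf and -Inf
## [1]  TRUE FALSE FALSE FALSE
```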

To be completed.

Advanced R Stuff

apply/sapply/lapply/mapply/tapply

To be updated.

Random

  • Unlike in Python or C++, you may use periods in variable names. For example stats.451 is a valid variable name. (I’m not a huge fan, but periods are common in R code and in base R function names such as read.csv. Some style guides, such as the tidyverse style guide, recommend underscores instead. Periods in names probably make no sense to you if you know Python/C++.)