It seems like a lot of you are proficient in programming but not necessarily familiar with R, and are worried that this may hinder your performance in this class. I looked for R resources for you, but most of them were either too basic for those who already know how to program or included more information about R than you need (although it may be good to learn). Therefore, I created this no-nonsense guide that only includes concepts I think are essential or helpful for the course. I will try to draw parallels between Python/C++ and R whenever possible for those of you who are proficient in these languages.
I will keep updating this guide as the course progresses and as students ask me more questions about R. Most of the R functions you will use are closely tied to the statistical concepts covered in lecture, so I will add them to this guide only after the corresponding lecture, to avoid confusing or stressing you out.
I have omitted programming concepts that are not specific to R (for loops, if statements, etc.) and can easily be googled.
If you learned C++ by taking EECS 280/281, don’t even stress out about the programming in this course. Come to office hours and teach me about C++ please.
You may use Python for the homework, but R may be worth learning because we will be using Stan later in the course and, in my opinion, it is easier to use from R. (There is a Python interface to it, though.)
tl;dr Don’t worry about the programming in this course. This guide includes most of the R concepts you need to know.
Data Structures in R
The main data structures we will use are atomic vectors, lists, matrices, and data frames. We will discuss these one by one below.
Atomic vectors are kind of like lists in Python and arrays/vectors in C++.
Atomic means they can only hold data of a single data type.
To create an atomic vector, we use the function c(). The “c” here stands for combine or concatenate.
a <- c(1,2,3,4)
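To see what "a single data type" means in practice, here is a minimal sketch (the variable names nums, mixed, and flags are just for illustration). If you mix types inside c(), R does not raise an error; it silently converts everything to one common type.

```r
# Mixing types in c() coerces every element to one common type.
nums  <- c(1, 2, 3)          # all numeric
mixed <- c(1, "a", TRUE)     # everything becomes a character string
flags <- c(TRUE, FALSE, 2)   # logicals become numeric (TRUE -> 1, FALSE -> 0)

class(nums)    # "numeric"
class(mixed)   # "character"
mixed          # "1" "a" "TRUE"
flags          # 1 0 2
```

This is the main difference from a Python list, which can hold mixed types without converting them.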
To index an element in an (atomic) vector, we use []. Note that indexing in R starts at 1, not 0. For example, to index the first element of a:
a[1]
## [1] 1
To index a vector with a vector of indices:
a[1:3] # Here 1:3 is the same as c(1, 2, 3). This will be discussed later.
## [1] 1 2 3
a[-i] will return a with the element at the \(i^{th}\) index deleted. In general, a[-index_vec] will return a with the elements at the indices in index_vec deleted. (Unlike in Python, a negative index does not count from the end.)
a[-2]
## [1] 1 3 4
a[-(1:3)]
## [1] 4
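Alternatively, you can index vectors using logical statements. There are two ways to do this: index directly with the logical condition, or use which(), which returns the indices where the condition is TRUE.
a[a < 3]
## [1] 1 2
which(a < 3) # indices (not values) where a < 3; a[which(a < 3)] returns the elements
## [1] 1 2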
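In the example above the indices and the values happen to coincide, which can hide the difference between the two approaches. Here is a small sketch with a different vector (v, mask, and idx are names I made up for illustration) where they do not:

```r
v <- c(10, 20, 30, 40)

mask <- v > 25          # logical vector: FALSE FALSE TRUE TRUE
v[mask]                 # values where mask is TRUE: 30 40

idx <- which(v > 25)    # positions where the condition holds: 3 4
v[idx]                  # same values as v[mask]: 30 40

# Conditions can be combined with & (and) and | (or).
middle <- v[v > 10 & v < 40]
middle                  # 20 30
```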
There are several functions in R that output useful vectors. Two that will be useful for this course are rep and seq.
rep replicates a given object a specified number of times.
rep(x = 2, times = 10) # or just rep(2, 10)
## [1] 2 2 2 2 2 2 2 2 2 2
# Can also replicate vectors
rep(a, 2) # create vector with a replicated twice.
## [1] 1 2 3 4 1 2 3 4
seq creates (incremented) sequences. It is like range in Python. However, the seq function is inclusive (i.e. range(1,10) only covers the numbers 1 - 9, whereas seq(1,10) will return a vector with the numbers 1 - 10).
seq(from = 1, to = 10, by = 2)
## [1] 1 3 5 7 9
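Both seq and rep take a few other arguments that tend to come up. The sketch below shows three that I find handy (length.out, a negative by, and each); the variable names are only for illustration.

```r
# length.out fixes the number of points instead of the step size.
grid <- seq(from = 0, to = 1, length.out = 5)
grid                      # 0.00 0.25 0.50 0.75 1.00

# A negative step counts down.
countdown <- seq(from = 10, to = 1, by = -3)
countdown                 # 10 7 4 1

# rep() with each = repeats every element before moving to the next one.
labels <- rep(c("a", "b"), each = 2)
labels                    # "a" "a" "b" "b"
```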
To make a sequence of consecutive numbers, you can use a:b. (This is usually used in for loops.)
1:10
## [1] 1 2 3 4 5 6 7 8 9 10
for (i in 1:4) {
  print(i)
}
## [1] 1
## [1] 2
## [1] 3
## [1] 4
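One pitfall worth knowing if you come from Python/C++: a:b also counts down when a > b, so looping over 1:length(x) misbehaves when x is empty. A minimal sketch of the problem and the usual fix with seq_along (the names empty, safe_count, and bad_count are made up for this example):

```r
empty <- numeric(0)      # a vector of length 0

1:length(empty)          # this is 1:0, which counts DOWN: 1 0
seq_along(empty)         # integer(0), i.e. nothing to loop over

safe_count <- 0
for (i in seq_along(empty)) {
  safe_count <- safe_count + 1   # never runs
}

bad_count <- 0
for (i in 1:length(empty)) {
  bad_count <- bad_count + 1     # runs twice (i = 1, then i = 0)
}

safe_count               # 0
bad_count                # 2
```

For a non-empty vector x, seq_along(x) and 1:length(x) give the same result, so seq_along is a safe default.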
+, -, *, / are applied element-wise to vectors.
# create a vector and add it to a
b <- c(2, 3, 4, 5)
a + b
## [1] 3 5 7 9
c <- c(2, 3, 4, 5, 6, 7, 8)
a_plus_c <- a + c
## Warning in a + c: longer object length is not a multiple of shorter object
## length
a_plus_c
## [1] 3 5 7 9 7 9 11
In this example, a + c = (1, 2, 3, 4, 1, 2, 3) + (2, 3, 4, 5, 6, 7, 8): the shorter vector a is recycled (repeated from its start) until it matches the length of c.
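Recycling is also what makes arithmetic with a single number work, and it gives no warning when the longer length is a multiple of the shorter one. A small sketch (doubled, shifted, and squared are illustrative names):

```r
a <- c(1, 2, 3, 4)

doubled <- a * 2           # 2 is recycled to (2, 2, 2, 2)
doubled                    # 2 4 6 8

shifted <- a + c(10, 20)   # (10, 20) is recycled to (10, 20, 10, 20), no warning
shifted                    # 11 22 13 24

squared <- a^2             # powers are element-wise too
squared                    # 1 4 9 16
```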
### To append
# There is an append() function, but this method of appending suffices.
a <- c(a, 5)
a
## [1] 1 2 3 4 5
# Alternatively,
a[6] <- 6
a
## [1] 1 2 3 4 5 6
a <- c(a, 7:9) # can also append vectors to vectors
a
## [1] 1 2 3 4 5 6 7 8 9
# Remove second element in R
a <- a[-2]
a
## [1] 1 3 4 5 6 7 8 9
# Remove first 3 elements in R
a <- a[-(1:3)]
a
## [1] 5 6 7 8 9
a <- c(1, 2, c(3, c(4)))
a
## [1] 1 2 3 4
length(a)
## [1] 4
length gives you the length of the vector.
a <- c(1,2,3,4)
length(a)
## [1] 4
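Unlike atomic vectors, lists can hold elements of different types (including vectors and other lists). To create a list: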
# Create empty list
x <- list()
x
## list()
You can also create lists with specified values.
# Create list with one numerical vector and one character vector
x <- list(c(1,2,3), c("one", "two", "three"))
x
## [[1]]
## [1] 1 2 3
##
## [[2]]
## [1] "one" "two" "three"
x[[i]]: returns the \(i^{th}\) element of list x
x[i]: returns a list with the \(i^{th}\) element as the only element
x[ind.vec]: returns a list with the elements indexed by the indices in ind.vec
x$ind: returns the element named ind. Python users can think of the names as keys.
x[[1]]
## [1] 1 2 3
class(x[[1]])
## [1] "numeric"
x[1]
## [[1]]
## [1] 1 2 3
class(x[1])
## [1] "list"
x[c(1,2)]
## [[1]]
## [1] 1 2 3
##
## [[2]]
## [1] "one" "two" "three"
# name elements in list
names(x) <- c("one", "two")
x
## $one
## [1] 1 2 3
##
## $two
## [1] "one" "two" "three"
x$one
## [1] 1 2 3
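Since named lists behave a lot like Python dictionaries, here is a short sketch of the usual dictionary operations (the list person and its fields are hypothetical, chosen only for illustration):

```r
# Names can be given directly when the list is created.
person <- list(name = "Ada", scores = c(90, 85))

person$name              # "Ada"
person[["scores"]]       # 90 85

# $ needs a literal name; use [[ ]] when the name is stored in a variable.
key <- "scores"
by_key <- person[[key]]

# Assigning to a new name adds an element.
person$year <- 2024
names_with_year <- names(person)   # "name" "scores" "year"

# Assigning NULL removes an element.
person$year <- NULL
names(person)                      # "name" "scores"
```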
A matrix is R’s two-dimensional array; it contains elements of the same atomic type. (We usually use matrices containing numeric elements.)
matrix(data, nrow, ncol, byrow, dimnames)
data: input vector
nrow/ncol: number of rows/columns
byrow: if TRUE, fill the matrix with the input vector by row (default is to fill by column)
dimnames: row and column names (We usually leave this blank.)
# Create 2 x 3 matrix with elements 1 - 6
mat <- matrix(c(1,2,3,4,5,6), nrow = 2, ncol = 3)
mat
##      [,1] [,2] [,3]
## [1,]    1    3    5
## [2,]    2    4    6
# Fill by row
mat <- matrix(1:6, nrow = 2, ncol = 3, byrow = TRUE)
mat
##      [,1] [,2] [,3]
## [1,]    1    2    3
## [2,]    4    5    6
To index a matrix:
mat[i,]: \(i^{th}\) row of mat
mat[,j]: \(j^{th}\) column of mat
mat[i,j]: \((i,j)^{th}\) element of mat
mat[1,] # first row of mat
## [1] 1 2 3
mat[,1] # first column of mat
## [1] 1 4
mat[1,3] # (1,3) element of mat
## [1] 3
mat[,1:3] # first 3 columns of mat
##      [,1] [,2] [,3]
## [1,]    1    2    3
## [2,]    4    5    6
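Notice that selecting a single row or column returns a plain vector, not a matrix. If you need to keep the matrix shape (for example, before a matrix multiplication), you can pass drop = FALSE. A minimal sketch, with row_vec and row_mat as illustrative names:

```r
mat <- matrix(1:6, nrow = 2, ncol = 3, byrow = TRUE)

row_vec <- mat[1, ]                 # plain vector: the dimensions are dropped
row_mat <- mat[1, , drop = FALSE]   # still a matrix, with 1 row and 3 columns

dim(mat)        # 2 3
dim(row_vec)    # NULL (vectors have no dim attribute)
dim(row_mat)    # 1 3
```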
A %*% B: Matrix multiply A and B (requires dim(A) = [m,n] and dim(B) = [n,l])
A * B: Multiply A and B element-wise
A + B, A - B, A / B: add, subtract, divide A and B element-wise
A <- matrix(1:4, nrow = 2, ncol = 2)
B <- matrix(1:6, nrow = 2, ncol = 3)
A %*% B
##      [,1] [,2] [,3]
## [1,]    7   15   23
## [2,]   10   22   34
# A * B # This will give an error
B <- matrix(2:5, nrow = 2, ncol = 2)
A + B
##      [,1] [,2]
## [1,]    3    7
## [2,]    5    9
A / B
##           [,1] [,2]
## [1,] 0.5000000 0.75
## [2,] 0.6666667 0.80
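A few other base R matrix functions may be useful alongside the operators above: t for the transpose and solve for inverses and linear systems. This is a small sketch of my own (A_inv, b, and x are illustrative names), using the same 2 x 2 matrix A:

```r
A <- matrix(1:4, nrow = 2, ncol = 2)

A_t <- t(A)              # transpose: rows become columns

A_inv <- solve(A)        # inverse of A (A must be square and invertible)
A %*% A_inv              # the 2 x 2 identity matrix, up to rounding error

b <- c(1, 1)
x <- solve(A, b)         # solves the linear system A %*% x = b
x                        # -0.5 0.5
```

When you only need the solution of a linear system, solve(A, b) is usually preferable to computing solve(A) %*% b.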
colSums, rowSums, colMeans, rowMeans: return vectors containing the sum/mean of each column/row.
rowSums(A)
## [1] 4 6
cbind, rbind: bind matrices by column/row.
A <- matrix(1:4, nrow = 2, ncol = 2)
B <- matrix(5:8, nrow = 2, ncol = 2)
A
##      [,1] [,2]
## [1,]    1    3
## [2,]    2    4
B
##      [,1] [,2]
## [1,]    5    7
## [2,]    6    8
cbind(A,B)
##      [,1] [,2] [,3] [,4]
## [1,]    1    3    5    7
## [2,]    2    4    6    8
rbind(A,B)
##      [,1] [,2]
## [1,]    1    3
## [2,]    2    4
## [3,]    5    7
## [4,]    6    8
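A data frame is a table in which each column is a variable; unlike in a matrix, different columns can have different types. To create a data frame: data.frame(Var1 = var1.values, Var2 = var2.values, ...)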
# Create example data frame
df <- data.frame(Names = c("Yang", "Daniel"), Title = c("Instructor", "GSI"), Coolness = c(0, 100), Evilness = c(100, 0))
# Jk hehe don't kill me.
df
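##    Names      Title Coolness Evilness
## 1   Yang Instructor        0      100
## 2 Daniel        GSI      100        0
summary(df): Statistical summary of the columns of df. (The output below is from R before version 4.0, where strings in a data frame were converted to factors by default; in newer versions the character columns are summarized by length and class instead of counts.)
summary(df)
##     Names          Title      Coolness      Evilness  
##  Daniel:1   GSI       :1   Min.   :  0   Min.   :  0  
##  Yang  :1   Instructor:1   1st Qu.: 25   1st Qu.: 25  
##                            Median : 50   Median : 50  
##                            Mean   : 50   Mean   : 50  
##                            3rd Qu.: 75   3rd Qu.: 75  
##                            Max.   :100   Max.   :100  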
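In practice you will mostly pull columns out of a data frame and filter its rows. Here is a minimal sketch on a made-up data frame (scores_df and its columns are hypothetical); I set stringsAsFactors = FALSE explicitly so the example behaves the same in old and new versions of R.

```r
scores_df <- data.frame(
  Name  = c("A", "B", "C"),
  Score = c(90, 72, 85),
  stringsAsFactors = FALSE
)

scores_df$Score                          # one column, as a vector: 90 72 85
scores_df[["Score"]]                     # same thing, list-style

# Rows are filtered with a logical condition, just like vectors.
high <- scores_df[scores_df$Score > 80, ]
high$Name                                # "A" "C"

# Assigning to a new name adds a column.
scores_df$Passed <- scores_df$Score >= 75

nrow(scores_df)                          # 3
ncol(scores_df)                          # 3
```

The comma in scores_df[condition, ] matters: the part before the comma selects rows and the (empty) part after it keeps all columns.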
Two functions for reading in data:
read.csv(file_path, header = TRUE): read the csv file at file_path
read.table(file_path, header = FALSE): read the text file at file_path
header: If TRUE, the file contains the names of the variables as its first line. (TRUE is the default for read.csv; FALSE is the default for read.table.)
To be completed.
To be updated.
To be updated.
To be updated
install.packages(pkg.name)
install.packages("ggplot2") # install ggplot2
library(pkg)
help(pkg)
library(ggplot2) # load ggplot2
To be updated.
Helpful R functions:
rnorm(n, mean = 0, sd = 1)
rbinom(n, size, prob)
runif(n, min = 0, max = 1)
rexp(n, rate = 1)
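Two things are worth knowing about these functions. First, the draws are random, so call set.seed if you want reproducible results. Second, each distribution also comes with d, p, and q versions (density, CDF, and quantile function). A small sketch for the normal distribution; the seed value and variable names are arbitrary choices of mine:

```r
# Same seed, same draws.
set.seed(451)
first_draw <- rnorm(3)
set.seed(451)
second_draw <- rnorm(3)
identical(first_draw, second_draw)   # TRUE

# d = density, p = CDF, q = quantile (inverse CDF), r = random draws.
density_at_0 <- dnorm(0)             # 1 / sqrt(2 * pi)
cdf_at_0     <- pnorm(0)             # 0.5
upper_q      <- qnorm(0.975)         # about 1.96
```

The same prefixes work for the other distributions listed above, e.g. dbinom, punif, qexp.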
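Let \(X\) be a random variable with density \(f(x)\). Suppose you want to approximate \(E[g(X)]\) for some function \(g\) (e.g. mean, variance, etc.). One way to do this is with the following procedure:
1. Draw \(N\) independent samples \(x_1, \dots, x_N\) from \(f\).
2. Approximate \(E[g(X)]\) by the sample average \(\frac{1}{N} \sum_{i=1}^{N} g(x_i)\).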
Example 1
Let \(X \sim N(2, 5^2)\) (i.e. mean 2 and standard deviation 5) and suppose I want to approximate \(E[X^2]\). We know that \(E[X^2] = Var(X) + [E(X)]^2 = 5^2 + 2^2 = 29\).
N <- 1000000
X.samples <- rnorm(N, 2, 5) # Draw samples from a normal with mean 2 and sd 5
E_X2.approx <- mean(X.samples^2) # Take mean of square of samples
E_X2.approx
## [1] 29.03194
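The same recipe can be wrapped in a small helper that also reports how precise the approximation is. This is only a sketch under my own assumptions: mc_expectation is a hypothetical helper (not a built-in R function), and the standard error is the usual sd / sqrt(n) estimate for a sample mean.

```r
set.seed(1)

# g: the function inside the expectation; sampler: draws n samples of X.
mc_expectation <- function(g, sampler, n) {
  draws <- g(sampler(n))
  list(
    estimate  = mean(draws),            # Monte Carlo approximation of E[g(X)]
    std_error = sd(draws) / sqrt(n)     # rough measure of its accuracy
  )
}

res <- mc_expectation(
  g       = function(x) x^2,
  sampler = function(n) rnorm(n, mean = 2, sd = 5),
  n       = 100000
)

res$estimate     # should be close to 29
res$std_error    # shrinks like 1 / sqrt(n)
```

Because the error shrinks like 1 / sqrt(n), you need about 100 times more samples to gain one extra digit of accuracy.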
You can also use Monte-Carlo sampling to approximate distributions. Some canonical examples of distributions you may want to approximate are sampling distributions of test statistics (e.g. sample mean).
If you are reading this, you have probably heard of the Central Limit Theorem (CLT). The Central Limit Theorem tells us that, when the sample size is large, the sampling distribution of the sample mean is approximately Normal (provided the population has finite variance). We can verify this with Monte-Carlo sampling.
I will post the code for this after Quiz 3. I want you all to try to code it by yourself first.
Helpful R functions
hist(x): plot a histogram of x
plot(x,y): plot y against x
The plot function is actually very general, and what it plots depends on the class of the input. However, in most cases, you will just use it to make scatter plots.
You may encounter special values (NA, NULL, -Inf, Inf) when you are working with data in R. For example, if you try to take the log of 0:
log(0)
## [1] -Inf
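Missing values (NA) are the ones you will run into most often, because they propagate through calculations. A minimal sketch of how to detect and handle them (x and the other names are illustrative):

```r
x <- c(1, NA, 3)

mean(x)                          # NA: any calculation involving NA gives NA
clean_mean <- mean(x, na.rm = TRUE)
clean_mean                       # 2: na.rm = TRUE drops the NAs first

missing <- is.na(x)              # FALSE TRUE FALSE (x == NA does NOT work)
observed <- x[!is.na(x)]         # 1 3

1 / 0                            # Inf
0 / 0                            # NaN ("not a number")
finite <- is.finite(c(1, Inf, NA, NaN))
finite                           # TRUE FALSE FALSE FALSE

length(NULL)                     # 0: NULL is an empty object, not a missing value
is.null(NULL)                    # TRUE
```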
To be completed.
To be updated.
stats.451 is a valid variable name. (I’m not a huge fan, but it is common in R, especially in base R, to use periods in names where Python/C++ programmers would use underscores. This probably looks strange to you if you know Python/C++, where a period is used to access a member.)