R
We can create variables by assigning them values:
firstvariable = 0
secondvariable = 10
thirdvariable = TRUE
fourthvariable = "hello"
If we want to print a variable, we can type its name. For example:
fourthvariable
## [1] "hello"
We can see the “type” of the variables using the function class:
class(firstvariable)
## [1] "numeric"
class(thirdvariable)
## [1] "logical"
class(fourthvariable)
## [1] "character"
The names of these types are fairly intuitive, but here’s an explanation:
numeric variables take on numerical values
logical variables take on the values TRUE and FALSE
character variables hold text (strings of characters)
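A related type, integer, shows up later when we inspect datasets. As a small sketch (the variable names below are made up for illustration), a number typed without a suffix is stored as numeric, while the L suffix asks for an integer:

```r
# Hypothetical variable names, used only for this illustration
plain_number = 5      # no suffix: stored as numeric
whole_number = 5L     # the L suffix asks for an integer
class(plain_number)   # "numeric"
class(whole_number)   # "integer"
```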
We can add, subtract, multiply, divide, and exponentiate numeric variables:
firstvariable+secondvariable
## [1] 10
firstvariable-secondvariable
## [1] -10
firstvariable*secondvariable
## [1] 0
firstvariable^secondvariable # exponentiation
## [1] 0
We can combine operations. For example, we can compute the average of firstvariable and secondvariable as
(firstvariable+secondvariable)/2
## [1] 5
R has a built-in mean function, which we’ll see later.
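As a quick preview, and assuming the two variables defined earlier, the sketch below checks that mean agrees with the manual computation. The values are first combined into a single vector with c (vectors are introduced later in these notes); manual_average and builtin_average are names made up for this example.

```r
firstvariable = 0
secondvariable = 10
manual_average = (firstvariable + secondvariable) / 2
# c() combines the two values into one vector, which is what mean expects
builtin_average = mean(c(firstvariable, secondvariable))
manual_average == builtin_average   # TRUE: both are 5
```

Note that mean(firstvariable, secondvariable) does not average the two numbers, because mean treats its second argument as an option rather than as data.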
We can’t add, subtract, multiply, divide, or exponentiate character variables (try it out: it’ll give you an error), but we can do all of these with logical variables. If the variable is TRUE, it’ll be treated as a 1; if it’s FALSE, it’ll be treated as a 0:
logi1 = TRUE
logi2 = FALSE
logi1+logi2
## [1] 1
logi1*logi2
## [1] 0
logi1/logi2
## [1] Inf
logi1^logi2
## [1] 1
We can combine logical and numeric variables in operations. Again, TRUE will be treated as 1, and FALSE will be treated as 0.
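A minimal sketch of mixing the two types, reusing the variables defined above. The last line is an extra illustration that is not in the original notes: because TRUE counts as 1, summing logical values counts how many of them are TRUE.

```r
secondvariable = 10
logi1 = TRUE
logi2 = FALSE
secondvariable + logi1      # 11, since TRUE counts as 1
secondvariable * logi2      # 0, since FALSE counts as 0
sum(c(TRUE, FALSE, TRUE))   # 2: the number of TRUE values
```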
Finally, we can do operations without using variables at all:
6/2*(2+1+TRUE)
## [1] 12
R has built-in functions such as sqrt, exp, log, …
sqrt(4)
## [1] 2
exp(firstvariable)
## [1] 1
log(10, base=2)
## [1] 3.321928
If you’re not sure how a function works, you can ask for help by writing ? before the name of the function (for example, ?sqrt).
We can define vectors as follows:
x1 = c(1, 2, 3, 4, 5, 6)
y1 = c("a","b","c","d","efg")
z1 = c("a", 2, 3, "e")
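All entries of a vector share a single type. The sketch below, which reuses the three vectors just defined, suggests what happens to z1, where text and numbers are mixed: R converts everything to character.

```r
x1 = c(1, 2, 3, 4, 5, 6)
y1 = c("a", "b", "c", "d", "efg")
z1 = c("a", 2, 3, "e")
class(x1)   # "numeric"
class(y1)   # "character"
class(z1)   # "character": the numbers were converted to text
z1          # "a" "2" "3" "e"
```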
And we can create ranges of values with the : operator:
1:10
## [1] 1 2 3 4 5 6 7 8 9 10
10:6
## [1] 10 9 8 7 6
We can add a number to every component of a numeric vector, or multiply or divide every component by the same number:
x1+5
## [1] 6 7 8 9 10 11
x1*5
## [1] 5 10 15 20 25 30
x1/5
## [1] 0.2 0.4 0.6 0.8 1.0 1.2
We can add and subtract vectors of the same length:
x2 = c(7, 8, 9, 10, 11, 12)
x1+x2
## [1] 8 10 12 14 16 18
x1-x2
## [1] -6 -6 -6 -6 -6 -6
Similarly, we can do componentwise multiplication and division:
x1*x2
## [1] 7 16 27 40 55 72
x1/x2
## [1] 0.1428571 0.2500000 0.3333333 0.4000000 0.4545455 0.5000000
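If two vectors are of different lengths, we have to be careful! R recycles the shorter vector to match the longer one, and it won’t give us a warning message when the longer length is a multiple of the shorter one: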
x1
## [1] 1 2 3 4 5 6
x3 = c(2,3)
x1+x3
## [1] 3 5 5 7 7 9
x1*x3
## [1] 2 6 6 12 10 18
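In the example above, the length of x1 (6) is a multiple of the length of x3 (2), so the shorter vector is repeated silently. When the longer length is not a multiple of the shorter one, R still recycles but issues a warning. The sketch below illustrates both cases; x4 and recycling_message are names made up for this example, and tryCatch is used only to capture the warning text.

```r
x1 = c(1, 2, 3, 4, 5, 6)
x3 = c(2, 3)          # length 2 divides length 6: recycled silently
x4 = c(1, 2, 3, 4)    # hypothetical vector; length 4 does not divide 6

x1 + x3               # same as x1 + c(2, 3, 2, 3, 2, 3)

# Capture the warning instead of printing it, so we can look at its message
recycling_message = tryCatch(
  { x1 + x4; "no warning" },
  warning = function(w) conditionMessage(w)
)
recycling_message
```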
We can compute the dot product of two numeric vectors of the same length:
t(x1)%*%x2
## [,1]
## [1,] 217
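The result above is a 1 x 1 matrix rather than a plain number. As an alternative sketch (not the approach used in these notes), the dot product can also be computed as the sum of the componentwise products, which returns an ordinary number. The names dot_matrix and dot_number are made up for this example.

```r
x1 = c(1, 2, 3, 4, 5, 6)
x2 = c(7, 8, 9, 10, 11, 12)
dot_matrix = t(x1) %*% x2   # a 1 x 1 matrix
dot_number = sum(x1 * x2)   # a plain number: 217
dim(dot_matrix)             # 1 1
dot_number
```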
We can compute means, standard deviations, variances, etc.:
mean(x1)
## [1] 3.5
sd(x1)
## [1] 1.870829
var(x1)
## [1] 3.5
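To see what var and sd compute, here is a small sketch that reproduces them by hand. It relies on the fact that var is the sample variance, which divides by n - 1 rather than n, and that sd is its square root. The names n, manual_var, and manual_sd are made up for this example.

```r
x1 = c(1, 2, 3, 4, 5, 6)
n = length(x1)
manual_var = sum((x1 - mean(x1))^2) / (n - 1)   # sample variance: divide by n - 1
manual_sd = sqrt(manual_var)
all.equal(manual_var, var(x1))   # TRUE
all.equal(manual_sd, sd(x1))     # TRUE
```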
length and concatenating
We can find the length of a vector with length:
length(x1)
## [1] 6
And we can add values to an existing vector as follows:
x1
## [1] 1 2 3 4 5 6
c(x1,10) # add at the end
## [1] 1 2 3 4 5 6 10
c(10, x1) # add at the beginning
## [1] 10 1 2 3 4 5 6
We can concatenate vectors, too:
c(x1, x2)
## [1] 1 2 3 4 5 6 7 8 9 10 11 12
We can look at particular entries of a vector using brackets.
x1
## [1] 1 2 3 4 5 6
x1[1] # first entry
## [1] 1
x1[4] # 4th entry
## [1] 4
x1[length(x1)] # last entry
## [1] 6
In R, indices start at 1 (in some other programming languages, indices start at 0).
We can access subsets of vectors using vectors of indices. For example, if we want to print the third and fifth entries of x1:
x1[c(3,5)]
## [1] 3 5
We can also subset using ranges created with the : operator. For instance, if we want to select the second, third, fourth, and fifth entries of x1:
x1[2:5]
## [1] 2 3 4 5
We can create matrices as follows:
A1 = matrix(c(1,2,3,4), nrow=2, ncol=2, byrow=TRUE) # read by row
A2 = matrix(c(1,3,2,4), nrow=2, ncol=2, byrow=FALSE) # read by column
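The byrow argument controls the order in which the values fill the matrix. The sketch below uses a made-up input (values) to show the difference, and then checks that A1 and A2, as defined above, end up being the same matrix built in two different ways.

```r
values = 1:4                                      # hypothetical input, for illustration only
by_row = matrix(values, nrow=2, byrow=TRUE)       # rows are (1, 2) and (3, 4)
by_col = matrix(values, nrow=2, byrow=FALSE)      # columns are (1, 2) and (3, 4)
by_row
by_col
identical(by_col, t(by_row))                      # TRUE: one is the transpose of the other

# A1 and A2 from the text hold the same matrix
A1 = matrix(c(1,2,3,4), nrow=2, ncol=2, byrow=TRUE)
A2 = matrix(c(1,3,2,4), nrow=2, ncol=2, byrow=FALSE)
identical(A1, A2)                                 # TRUE
```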
Doing operations with matrices is straightforward:
A1%*%A2 # matrix product
## [,1] [,2]
## [1,] 7 10
## [2,] 15 22
A1*A2 # componentwise product
## [,1] [,2]
## [1,] 1 4
## [2,] 9 16
A1+A2 # componentwise addition
## [,1] [,2]
## [1,] 2 4
## [2,] 6 8
log(A1) # taking the log of the components
## [,1] [,2]
## [1,] 0.000000 0.6931472
## [2,] 1.098612 1.3862944
Indexing matrices is similar to indexing vectors. For example, if we want to access the element in the first row and second column of A1:
A1[1,2] # accessing entries: rows first, then columns
## [1] 2
You can also index by full rows and columns. For example, if you want to select the first row of A1:
A1[1,]
## [1] 1 2
If you want to access the second column:
A1[,2]
## [1] 2 4
One reason statisticians use R is that there are many libraries that contain useful functions. We can install libraries with install.packages. For example, if we want to install ggplot2, which is a useful library for plotting:
install.packages('ggplot2')
Once the library is installed, we can load it using library(). If we want to load ggplot2, we need to type:
library(ggplot2)
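Installing is needed only once per computer, while loading is needed in every new R session. As a hedged sketch, one way to check whether a library is available without loading it fully is requireNamespace; the variable name has_ggplot2 is made up for this example.

```r
# requireNamespace returns TRUE if the package can be loaded (so it is installed),
# and FALSE otherwise
has_ggplot2 = requireNamespace("ggplot2", quietly = TRUE)
if (has_ggplot2) {
  message("ggplot2 is installed; load it with library(ggplot2)")
} else {
  message("ggplot2 is missing; install it with install.packages('ggplot2')")
}
```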
data.frames
We’ll use the dataset mpg, which is in the ggplot2 library. First, we load it:
data(mpg)
The class of the dataset is data.frame (among others). A data.frame is like a matrix whose columns can have different types.
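To make the idea concrete, here is a small sketch that builds a data.frame by hand. The dataset pets and its columns are invented for this example and have nothing to do with mpg.

```r
# A made-up dataset with one character, one numeric, and one logical column
pets = data.frame(
  name = c("Rex", "Tom", "Ada"),
  weight = c(30.5, 4.2, 5.1),
  vaccinated = c(TRUE, FALSE, TRUE),
  stringsAsFactors = FALSE   # keep text columns as character
)
class(pets)           # "data.frame"
sapply(pets, class)   # one type per column
dim(pets)             # 3 rows, 3 columns
```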
The function str gives us some information about the variables in the dataset:
str(mpg)
## Classes 'tbl_df', 'tbl' and 'data.frame': 234 obs. of 11 variables:
## $ manufacturer: chr "audi" "audi" "audi" "audi" ...
## $ model : chr "a4" "a4" "a4" "a4" ...
## $ displ : num 1.8 1.8 2 2 2.8 2.8 3.1 1.8 1.8 2 ...
## $ year : int 1999 1999 2008 2008 1999 1999 2008 1999 1999 2008 ...
## $ cyl : int 4 4 4 4 6 6 6 4 4 4 ...
## $ trans : chr "auto(l5)" "manual(m5)" "manual(m6)" "auto(av)" ...
## $ drv : chr "f" "f" "f" "f" ...
## $ cty : int 18 21 20 21 16 18 18 18 16 20 ...
## $ hwy : int 29 29 31 30 26 26 27 26 25 28 ...
## $ fl : chr "p" "p" "p" "p" ...
## $ class : chr "compact" "compact" "compact" "compact" ...
We can print the first and last 6 observations in the dataset using head and tail (6 is the default; the argument n changes it):
head(mpg)
## # A tibble: 6 x 11
## manufacturer model displ year cyl trans drv cty hwy fl class
## <chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
## 1 audi a4 1.8 1999 4 auto… f 18 29 p comp…
## 2 audi a4 1.8 1999 4 manu… f 21 29 p comp…
## 3 audi a4 2 2008 4 manu… f 20 31 p comp…
## 4 audi a4 2 2008 4 auto… f 21 30 p comp…
## 5 audi a4 2.8 1999 6 auto… f 16 26 p comp…
## 6 audi a4 2.8 1999 6 manu… f 18 26 p comp…
tail(mpg)
## # A tibble: 6 x 11
## manufacturer model displ year cyl trans drv cty hwy fl class
## <chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
## 1 volkswagen pass… 1.8 1999 4 auto… f 18 29 p mids…
## 2 volkswagen pass… 2 2008 4 auto… f 19 28 p mids…
## 3 volkswagen pass… 2 2008 4 manu… f 21 29 p mids…
## 4 volkswagen pass… 2.8 1999 6 auto… f 16 26 p mids…
## 5 volkswagen pass… 2.8 1999 6 manu… f 18 26 p mids…
## 6 volkswagen pass… 3.6 2008 6 auto… f 17 26 p mids…
We can index the rows and columns of mpg using the same syntax we used for indexing matrices:
mpg[3:7,c(1,4:5)]
## # A tibble: 5 x 3
## manufacturer year cyl
## <chr> <int> <int>
## 1 audi 2008 4
## 2 audi 2008 4
## 3 audi 1999 6
## 4 audi 1999 6
## 5 audi 2008 6
With data.frames we can extract variables using $. For example, if we want to look at year:
mpg$year
## [1] 1999 1999 2008 2008 1999 1999 2008 1999 1999 2008 2008 1999 1999 2008
## [15] 2008 1999 2008 2008 2008 2008 2008 1999 2008 1999 1999 2008 2008 2008
## [29] 2008 2008 1999 1999 1999 2008 1999 2008 2008 1999 1999 1999 1999 2008
## [43] 2008 2008 1999 1999 2008 2008 2008 2008 1999 1999 2008 2008 2008 1999
## [57] 1999 1999 2008 2008 2008 1999 2008 1999 2008 2008 2008 2008 2008 2008
## [71] 1999 1999 2008 1999 1999 1999 2008 1999 1999 1999 2008 2008 1999 1999
## [85] 1999 1999 1999 2008 1999 2008 1999 1999 2008 2008 1999 1999 2008 2008
## [99] 2008 1999 1999 1999 1999 1999 2008 2008 2008 2008 1999 1999 2008 2008
## [113] 1999 1999 2008 1999 1999 2008 2008 2008 2008 2008 2008 2008 1999 1999
## [127] 2008 2008 2008 2008 1999 2008 2008 1999 1999 1999 2008 1999 2008 2008
## [141] 1999 1999 1999 2008 2008 2008 2008 1999 1999 2008 1999 1999 2008 2008
## [155] 1999 1999 1999 2008 2008 1999 1999 2008 2008 2008 2008 1999 1999 1999
## [169] 1999 2008 2008 2008 2008 1999 1999 1999 1999 2008 2008 1999 1999 2008
## [183] 2008 1999 1999 2008 1999 1999 2008 2008 1999 1999 2008 1999 1999 1999
## [197] 2008 2008 1999 2008 1999 1999 2008 1999 1999 2008 2008 1999 1999 2008
## [211] 2008 1999 1999 1999 1999 2008 2008 2008 2008 1999 1999 1999 1999 1999
## [225] 1999 2008 2008 1999 1999 2008 2008 1999 1999 2008
We can also index by logical conditions. For instance, if we want to work with the subset of Toyota cars:
mpg[mpg$manufacturer == "toyota",]
## # A tibble: 34 x 11
## manufacturer model displ year cyl trans drv cty hwy fl cla…
## <chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <ch>
## 1 toyota 4run… 2.7 1999 4 manu… 4 15 20 r suv
## 2 toyota 4run… 2.7 1999 4 auto… 4 16 20 r suv
## 3 toyota 4run… 3.4 1999 6 auto… 4 15 19 r suv
## 4 toyota 4run… 3.4 1999 6 manu… 4 15 17 r suv
## 5 toyota 4run… 4 2008 6 auto… 4 16 20 r suv
## 6 toyota 4run… 4.7 2008 8 auto… 4 14 17 r suv
## 7 toyota camry 2.2 1999 4 manu… f 21 29 r mid…
## 8 toyota camry 2.2 1999 4 auto… f 21 27 r mid…
## 9 toyota camry 2.4 2008 4 manu… f 21 31 r mid…
## 10 toyota camry 2.4 2008 4 auto… f 21 31 r mid…
## # ... with 24 more rows
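It may help to see what the condition inside the brackets produces on its own. The sketch below uses a tiny invented data.frame (cars_small) in place of mpg: the comparison returns one TRUE or FALSE per row, and the brackets keep the rows where it is TRUE.

```r
# A made-up data.frame standing in for mpg
cars_small = data.frame(
  manufacturer = c("audi", "toyota", "ford", "toyota"),
  hwy = c(29, 27, 17, 31),
  stringsAsFactors = FALSE
)
is_toyota = cars_small$manufacturer == "toyota"
is_toyota                         # FALSE TRUE FALSE TRUE
sum(is_toyota)                    # 2 rows match
cars_small[is_toyota, ]           # keep the matching rows, all columns
mean(cars_small$hwy[is_toyota])   # 29
```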
factor is a variable type in R that is useful for encoding categorical variables. Defining a factor is easy:
fac1 = factor(c("dog","cat","cat","dog"))
We can use summary to create a quick table (note that summary doesn’t work well with character variables):
summary(fac1)
## cat dog
## 2 2
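A short sketch of the contrast, using the same four values stored once as character and once as factor. The names pets_chr and pets_fac are made up for this example.

```r
pets_chr = c("dog", "cat", "cat", "dog")   # stored as character
pets_fac = factor(pets_chr)                # the same values as a factor
summary(pets_chr)   # describes the vector (its length, class, and mode), not counts
summary(pets_fac)   # counts per category: cat 2, dog 2
```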
The default ordering of the categories in a factor is alphabetical, which isn’t always the best or most intuitive. We can see the different categories (in R lingo, levels) of a factor and their ordering using levels:
levels(fac1)
## [1] "cat" "dog"
Let’s use the hsb2 dataset (on the course website) to illustrate this point. The dataset contains a variable called ses, which is the socioeconomic status of the student. It can take on the values low, middle, and high. Unfortunately, the default ordering of the factor is alphabetical, that is:
levels(hsb2$ses)
## [1] "high" "low" "middle"
The problem with this ordering is that if we create tables, plots, etc., R will use this ordering, which is counterintuitive. For instance, if we create a two-way table of ses and race, we get
table(hsb2$ses, hsb2$race)
##
## african american asian hispanic white
## high 3 3 4 48
## low 11 3 9 24
## middle 6 5 11 73
This is not great: the rows run high, low, middle instead of low, middle, high.
How can we reorder the levels of a factor? The answer is
hsb2$ses = factor(hsb2$ses, ordered = TRUE, levels = c("low", "middle", "high"))
The code above rewrites the ses variable in hsb2 to an ordered factor whose levels are low, middle, and high (in that order).
If you don’t believe me (and you shouldn’t), here’s the code to verify that ses is now ordered:
levels(hsb2$ses)
## [1] "low" "middle" "high"
table(hsb2$ses, hsb2$race)
##
## african american asian hispanic white
## low 11 3 9 24
## middle 6 5 11 73
## high 3 3 4 48
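Since hsb2 lives on the course website, the sketch below uses a few invented values in its place. It suggests one extra thing that ordered = TRUE buys us beyond the display order: the levels can be compared with < and >, and max picks the highest level. (The levels argument alone is enough if we only want to change the display order.)

```r
# Made-up values standing in for hsb2$ses
ses = factor(c("low", "high", "middle", "low"),
             ordered = TRUE, levels = c("low", "middle", "high"))
levels(ses)     # "low" "middle" "high"
ses < "high"    # TRUE FALSE TRUE TRUE
table(ses)      # counts appear in the order low, middle, high
max(ses)        # high
```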
We won’t say much about lists, but they’re useful if we want to keep objects of different types in a single place.
For example, suppose that we have a vector and a matrix:
v = 1:6
m = matrix(c(1,0,0,1), byrow=TRUE, nrow=2) # TRUE is safer than T, which can be overwritten
Then, the following code creates a list whose entries are the vector v and the matrix m:
l = list(v,m)
We can access, say, the second element of the list with
l[[2]]
## [,1] [,2]
## [1,] 1 0
## [2,] 0 1
And we can do things such as
l[[2]][2,1]
## [1] 0
l[[1]][4]
## [1] 4
We can add a new element to the list by assigning to a new index:
v2 = 3:4
l[[3]] = v2
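To check that the assignment did what we expect, the following sketch rebuilds the list from the steps above and inspects it with length and str.

```r
v = 1:6
m = matrix(c(1, 0, 0, 1), byrow=TRUE, nrow=2)
l = list(v, m)
v2 = 3:4
l[[3]] = v2    # assigning to index 3 grows the list by one element
length(l)      # 3
str(l)         # a compact description of each element
l[[3]]         # 3 4
```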
We probably won’t see lists again in the course, but it’s good to know that they exist.
R
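We can get quick summaries of the variables in a dataset with summary. For numeric variables it reports the minimum, first quartile, median, mean, third quartile, and maximum: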
summary(mpg)
## manufacturer model displ year
## Length:234 Length:234 Min. :1.600 Min. :1999
## Class :character Class :character 1st Qu.:2.400 1st Qu.:1999
## Mode :character Mode :character Median :3.300 Median :2004
## Mean :3.472 Mean :2004
## 3rd Qu.:4.600 3rd Qu.:2008
## Max. :7.000 Max. :2008
## cyl trans drv cty
## Min. :4.000 Length:234 Length:234 Min. : 9.00
## 1st Qu.:4.000 Class :character Class :character 1st Qu.:14.00
## Median :6.000 Mode :character Mode :character Median :17.00
## Mean :5.889 Mean :16.86
## 3rd Qu.:8.000 3rd Qu.:19.00
## Max. :8.000 Max. :35.00
## hwy fl class
## Min. :12.00 Length:234 Length:234
## 1st Qu.:18.00 Class :character Class :character
## Median :24.00 Mode :character Mode :character
## Mean :23.44
## 3rd Qu.:27.00
## Max. :44.00
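summary also works on a single numeric vector, and its result can be indexed by name. A minimal sketch, using the vector x1 from earlier (s is a name made up for this example):

```r
x1 = c(1, 2, 3, 4, 5, 6)
s = summary(x1)
s
names(s)        # "Min." "1st Qu." "Median" "Mean" "3rd Qu." "Max."
s[["Median"]]   # 3.5
s[["Mean"]]     # 3.5
```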
We can tabulate the values of one variable, or cross-tabulate two variables, with table:
table(mpg$manufacturer)
##
## audi chevrolet dodge ford honda hyundai
## 18 19 37 25 9 14
## jeep land rover lincoln mercury nissan pontiac
## 8 4 3 4 13 5
## subaru toyota volkswagen
## 14 34 27
table(mpg$manufacturer,mpg$year)
##
## 1999 2008
## audi 9 9
## chevrolet 7 12
## dodge 16 21
## ford 15 10
## honda 5 4
## hyundai 6 8
## jeep 2 6
## land rover 2 2
## lincoln 2 1
## mercury 2 2
## nissan 6 7
## pontiac 3 2
## subaru 6 8
## toyota 20 14
## volkswagen 16 11
table(mpg$year)
##
## 1999 2008
## 117 117
We can make plots, too. For example, hist draws histograms:
hist(mpg$displ, main="Engine displacement (in litres)",
col=rainbow(20),
xlim=c(0,10))
You can learn more about how to change the attributes of the plot with ?hist.
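As one example of such an attribute, the breaks argument sets where the bins start and end. The sketch below uses a made-up vector (engine_sizes) in place of mpg$displ so that it runs without ggplot2. hist also returns an object whose counts element holds the number of observations in each bin.

```r
# Made-up engine sizes standing in for mpg$displ
engine_sizes = c(1.8, 2.0, 2.8, 3.1, 4.2, 5.3, 5.7, 6.0, 2.4, 3.3)
h = hist(engine_sizes,
         breaks = seq(0, 8, by = 1),   # one bin per litre, from 0 to 8
         main = "Engine displacement (in litres)",
         xlab = "Displacement",
         col = "lightblue")
h$counts   # number of observations in each bin
```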
We can create individual boxplots and boxplots grouped by values of categorical variables:
boxplot(mpg$displ)
boxplot(mpg$displ~mpg$manufacturer)
These plots are created using the graphics library. There are other libraries that you can use to produce plots. One of them is ggplot2, which we installed earlier. A nice thing about ggplot2 is that it has the function qplot, which produces good-looking plots by default. For example:
qplot(mpg$displ)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
qplot(mpg$manufacturer)
qplot is smart enough to produce different plots depending on the type of the object. We’ll cover ggplot2 in more detail later in the semester.