ggplot and intro to library(dplyr)
qplot
library(ggplot2)
data(diamonds)
We can create scatterplots with colored dots of different shapes:
qplot(x=carat, y=price, color=cut, shape=cut, data=diamonds)
## Warning: Using shapes for an ordinal variable is not advised
We can plot colored smoothed curves (potentially overlaid on points; not recommended, though).
qplot(x=carat, y=price, color=cut, geom='smooth', data=diamonds)+geom_point(alpha=0.02)
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
We can plot panels with colored dots:
qplot(x=carat, y=price, color=cut, facets=color ~ ., data=diamonds)
Double panels!
qplot(x=carat, y=price, facets=color~cut, data=diamonds)
Double panels of colored dots… [Q: how many variables are we plotting at once now?]
qplot(x=carat, y=price, color=clarity, facets=color~cut, data=diamonds)
We could redo all of these with smooth curves/lines instead.
How can we plot relationships between 3 numerical/quantitative variables? One option is to map the third variable to color:
qplot(x=depth, y=carat, color=price, data=diamonds)
Another option is to categorize one of the variables and then plot the bivariate relationships in panels. Below we partition the variable depth into 4 categories (defined by its quartiles):
diamonds$depthcat = cut(diamonds$depth, breaks=quantile(diamonds$depth), include.lowest = TRUE)
qplot(x=carat, y=price, facets=.~depthcat, data=diamonds)
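To see what the cut() call above is doing, here is a minimal sketch on a small made-up vector (x is hypothetical, not part of the diamonds data). quantile() returns the minimum, the three quartiles, and the maximum, and cut() uses those five values as interval boundaries. The option include.lowest = TRUE keeps the minimum in the first interval; without it, the minimum becomes NA.

```r
# Hypothetical toy vector, for illustration only
x = c(1, 2, 3, 4, 5, 6, 7, 8)

# Five break points: minimum, Q1, median, Q3, maximum
breaks = quantile(x)

# Four intervals; include.lowest = TRUE puts the minimum in the first one
xcat = cut(x, breaks = breaks, include.lowest = TRUE)

# Without include.lowest, the minimum falls outside (a, b] and becomes NA
xcat_na = cut(x, breaks = breaks)

table(xcat)           # number of observations in each quartile group
sum(is.na(xcat_na))   # number of values lost without include.lowest
```

With real data such as depth, ties at the quartiles mean the four groups will generally not have exactly the same size.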
grid.arrange
Sometimes it’s useful to have unrelated plots in one panel.
Exercise: With the base graphics package, we can do that with par(mfrow=c( , )). For example, if we want a plot that has 2 rows, one with a histogram of price and another with a bar plot of cut, how would we do that?
Answer
par(mfrow=c(2,1))
hist(diamonds$price)
plot(diamonds$cut)
Unfortunately, par(mfrow=c( , )) doesn’t work with ggplot. Fortunately, we have grid.arrange in library(gridExtra):
library(gridExtra)
p1 = qplot(price, data=diamonds)
p2 = qplot(cut, data = diamonds)
grid.arrange(p1, p2, nrow=2)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
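As a small variation on the example above, the sketch below puts the two plots side by side with ncol instead of stacking them with nrow. It builds the plots with ggplot() rather than qplot, and it uses arrangeGrob, the gridExtra function that builds the layout without drawing it. The names p1, p2, and g are just illustrative.

```r
library(ggplot2)
library(gridExtra)
data(diamonds)

# Same two plots as above, written with ggplot() and an explicit number of bins
p1 = ggplot(diamonds) + aes(x = price) + geom_histogram(bins = 30)
p2 = ggplot(diamonds) + aes(x = cut) + geom_bar()

# Side by side instead of stacked: one row, two columns
grid.arrange(p1, p2, ncol = 2)

# arrangeGrob builds the same layout without drawing it,
# which is handy if we want to keep it in a variable
g = arrangeGrob(p1, p2, ncol = 2)
```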
If you want to learn about more options, see the help page for grid.arrange (?grid.arrange).
library(GGally)
There are some good extra plots based on ggplot2 in library(GGally). One of them is the equivalent of plot(dataset), that is, a matrix of pairwise plots:
library(GGally)
ggpairs(diamonds[,5:10])
You can plot confidence intervals for regression coefficients easily:
data(mtcars)
mod = lm(mpg~wt+qsec+cyl,data=mtcars)
ggcoef(mod)
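If we want the numbers behind a coefficient plot like this one, base R gives point estimates with coef() and confidence intervals with confint(). The sketch below is a way to check the plot by hand; it is not necessarily the exact computation that ggcoef performs internally, and the 95% level is an assumption made here.

```r
# Same model as above, fitted on the built-in mtcars data
data(mtcars)
mod = lm(mpg ~ wt + qsec + cyl, data = mtcars)

# One row per coefficient: point estimate, lower and upper 95% limits
ci = cbind(estimate = coef(mod), confint(mod, level = 0.95))
ci
```

A coefficient whose interval does not contain 0 is significantly different from 0 at the 5% level.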
ggplot
Today we’ll cover more ggplot. Last time we used qplot, and today we’ll use the more general ggplot machinery. Plotting with ggplot can be more or less broken down into the following steps.
Read in the data
Add aesthetics (aes): Which variables go into the plot? What are the x and y? What variables are you using for color-coding?
Add geoms: What kind of plot do you want?
Do data transformations, if needed.
Change labels/theme.
We’ll see some concrete examples today.
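Before moving to diamonds, the steps above can be sketched one at a time on the built-in mtcars data. This is only an illustration: the choice of variables, the log scale used as the "transformation" step, and theme_bw are assumptions made for this example.

```r
library(ggplot2)

# Step 1: read in the data (mtcars ships with R)
data(mtcars)

# Step 2: aesthetics, i.e. which variables map to x, y, and color
p = ggplot(mtcars) + aes(x = wt, y = mpg, color = factor(cyl))

# Step 3: geom, here a scatterplot
p = p + geom_point()

# Step 4: a transformation, here a log10 scale on the y axis
p = p + scale_y_log10()

# Step 5: labels and theme
p = p + xlab("weight (1000 lbs)") + ylab("miles per gallon") + theme_bw()
p
```

Because each step is added with +, we can store an intermediate plot in a variable and keep building on it.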
Let’s read in the diamonds data first.
library(ggplot2)
data(diamonds)
Let’s create a stacked barplot of counts:
ggplot(diamonds) + aes(x=cut, fill=color) + geom_bar()
We can change the type of barplot easily by adding options in the geom_bar function. For example, position='fill' rescales every bar to the same height, giving a stacked % barplot:
ggplot(diamonds) + aes(x=cut, fill=color) + geom_bar(position='fill')
If you want side-by-side bars, you can use position = 'dodge':
ggplot(diamonds) + aes(x=cut, fill=color) + geom_bar(position='dodge')
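To connect the barplots to numbers, the sketch below computes by hand the quantities that the stacked and position='fill' versions display: counts of each color within each cut, and the same counts rescaled so that each cut sums to 1. The names counts and props are just illustrative.

```r
library(ggplot2)
data(diamonds)

# Counts of each color within each cut: the segment heights in the stacked barplot
counts = table(diamonds$cut, diamonds$color)

# Row-wise proportions: the segment heights when position = 'fill'
props = prop.table(counts, margin = 1)
round(props, 3)
```

Each row of props sums to 1, which is why all bars have the same height in the position='fill' plot.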
We can use the same structure to produce a color-coded scatterplot:
p = ggplot(diamonds) + aes(x=carat, y=price, color=cut) + geom_point()
p
Note that we used the option color in aes, so the color of each point depends on the value of cut.
We can break down plots in panels using facet_grid:
p = p + facet_grid(. ~ cut)
p
And you can combine geoms. For example:
ggplot(diamonds) + aes(x=price) + geom_histogram(binwidth = 500, aes(y=..density..))+geom_density(color='red', size=1)
To change the title, the axis labels, and the axis limits, the relevant commands are
ggtitle: for changing the title
xlab, ylab: \(x\) and \(y\) labels
xlim, ylim: limits / scale of the plot
p + xlab("carat (weight)") + ylab("price ($)") + ggtitle("Price vs carat vs quality") + xlim(c(0,10))+ylim(c(0,30000))
You can also transform the axes:
p + coord_trans(x='log',y='log')
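An alternative that is not used above is to transform the scales themselves with scale_x_log10 and scale_y_log10. As the ggplot2 documentation describes it, scale transformations are applied before statistics such as smoothers are computed, whereas coord_trans is applied afterwards. The sketch below adds a linear fit to make the difference visible: the line is fitted to the logged data. Note that these scales use base 10 while 'log' above is the natural log, which only changes the tick labels.

```r
library(ggplot2)
data(diamonds)

base = ggplot(diamonds) + aes(x = carat, y = price) + geom_point(alpha = 0.05)

# Scale transformation: the data are logged first, so the lm line
# is fitted to log10(price) against log10(carat)
p_scale = base + scale_x_log10() + scale_y_log10() + geom_smooth(method = 'lm')
p_scale
```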
library(dplyr)
We can manipulate / filter datasets easily with functions in library(dplyr). Let’s use the hsb2 dataset in library(openintro) to illustrate some of the functions.
library(dplyr)
library(openintro)
data(hsb2)
hsb2 = as_tibble(hsb2)
select: select / drop variables
You can create subsets of the data that only contain a few of the variables with select. For example, if you want to create a subset that only has the variables math, race, gender, and ses:
sub1 = hsb2 %>% select(math, race, gender, ses)
sub1
## # A tibble: 200 x 4
## math race gender ses
## <int> <chr> <chr> <fct>
## 1 41 white male low
## 2 53 white female middle
## 3 54 white male high
## 4 47 white male high
## 5 57 white male middle
## 6 51 white male middle
## 7 42 african american male middle
## 8 45 hispanic male middle
## 9 54 white male middle
## 10 52 african american male middle
## # ... with 190 more rows
The command %>% is the so-called pipe operator: it passes the result on its left as the first argument of the function on its right, which lets us chain functions in dplyr (it’s kind of the same idea as + in library(ggplot2)).
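As a quick check of what the pipe does, the sketch below writes the same computation twice, once with nested function calls and once with %>%. It uses the built-in mtcars data so that it runs on its own; the names nested and piped are just illustrative.

```r
library(dplyr)

# Nested calls: we have to read from the inside out
nested = head(select(mtcars, mpg, wt), 3)

# Same computation with the pipe: we read from left to right
piped = mtcars %>% select(mpg, wt) %>% head(3)

# Both give exactly the same object
identical(nested, piped)
```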
If, on the other hand, you want to create a subset that excludes the variables math, race, gender, and ses, put a minus sign in front of each name:
sub2 = hsb2 %>% select(-math,-race,-gender,-ses)
sub2
## # A tibble: 200 x 7
## id schtyp prog read write science socst
## <int> <fct> <fct> <int> <int> <int> <int>
## 1 70 public general 57 52 47 57
## 2 121 public vocational 68 59 63 61
## 3 86 public general 44 33 58 31
## 4 141 public vocational 63 44 53 56
## 5 172 public academic 47 52 53 61
## 6 113 public academic 44 52 63 61
## 7 50 public general 50 59 53 61
## 8 11 public academic 34 46 39 36
## 9 84 public general 63 57 58 51
## 10 48 public academic 57 55 50 51
## # ... with 190 more rows
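Since the only difference between keeping and dropping variables is the minus sign, it is easy to mix the two up. Here is a minimal sketch of both on a tiny made-up tibble (toy and its values are hypothetical, not taken from hsb2):

```r
library(dplyr)

# Hypothetical toy data, for illustration only
toy = tibble(
  id   = 1:3,
  math = c(50, 60, 70),
  read = c(55, 65, 75),
  ses  = c("low", "middle", "high")
)

# Keep only the named columns
kept = toy %>% select(id, math)

# Drop the named columns and keep everything else
dropped = toy %>% select(-id, -math)

names(kept)
names(dropped)
```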
filter: filter by logical conditions
This one is pretty self-explanatory. As a reminder, the comparison and logical operators in R are:
==: equal to
!=: not equal to
>=, >, <, <=: greater than or equal to, greater than, less than, less than or equal to
|: or
&: and
For example, if you want to create a subset that only contains people who went to public school and got a score in math greater than 70, you can do that as follows:
sub3 = hsb2 %>% filter(math > 70, schtyp == 'public')
sub3
## # A tibble: 8 x 11
## id gender race ses schtyp prog read write math science socst
## <int> <chr> <chr> <fct> <fct> <fct> <int> <int> <int> <int> <int>
## 1 95 male white high public academ… 73 60 71 61 71
## 2 143 male white middle public vocati… 63 63 75 72 66
## 3 132 male white middle public academ… 73 62 73 69 66
## 4 68 male white middle public academ… 73 67 71 63 66
## 5 57 female white middle public academ… 71 65 72 66 56
## 6 100 female white high public academ… 63 65 71 69 71
## 7 33 female asian low public academ… 57 65 72 54 56
## 8 161 female white low public academ… 57 62 72 61 61
If, on the other hand, you want to keep those who got a score in math greater than 70 or went to public school (or both):
hsb2 %>% filter(math > 70 | schtyp == 'public')
## # A tibble: 170 x 11
## id gender race ses schtyp prog read write math science socst
## <int> <chr> <chr> <fct> <fct> <fct> <int> <int> <int> <int> <int>
## 1 70 male white low public gene… 57 52 41 47 57
## 2 121 female white midd… public voca… 68 59 53 63 61
## 3 86 male white high public gene… 44 33 54 58 31
## 4 141 male white high public voca… 63 44 47 53 56
## 5 172 male white midd… public acad… 47 52 57 53 61
## 6 113 male white midd… public acad… 44 52 51 63 61
## 7 50 male africa… midd… public gene… 50 59 42 53 61
## 8 11 male hispan… midd… public acad… 34 46 45 39 36
## 9 84 male white midd… public gene… 63 57 54 58 51
## 10 48 male africa… midd… public acad… 57 55 52 50 51
## # ... with 160 more rows
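The two filter calls above differ only in how the conditions are combined. In filter, conditions separated by commas must all hold, which is the same as joining them with &. Here is a minimal sketch on a tiny made-up tibble (toy and its values are hypothetical):

```r
library(dplyr)

# Hypothetical toy data, for illustration only
toy = tibble(
  math   = c(75, 80, 60, 72),
  schtyp = c("public", "private", "private", "public")
)

# Conditions separated by commas must all hold ...
with_comma = toy %>% filter(math > 70, schtyp == "public")

# ... which is the same as joining them with &
with_and = toy %>% filter(math > 70 & schtyp == "public")

# With |, a row is kept if at least one condition holds
with_or = toy %>% filter(math > 70 | schtyp == "public")

identical(with_comma, with_and)
nrow(with_comma)
nrow(with_or)
```

Here the comma and & versions keep rows 1 and 4, while the | version also keeps row 2, which has a high math score but is from a private school.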
You can combine select and filter. For example, if you only want to keep the values of math and schtyp for people who went to public school and got a score in math greater than 70, you can do that as follows:
sub3 = sub3 %>% select(math, schtyp)
Or, equivalently, in a single chain:
sub3 = hsb2 %>% filter(math > 70, schtyp == 'public') %>% select(math, schtyp)
mutate: transform variables
We can use mutate if we want to transform variables or create new ones. For example, if we want to create a new variable called avg which contains the average score in read, write, science, and socst:
hsb2 = hsb2 %>% mutate(avg=(read+write+science+socst)/4)
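To see exactly what mutate returns, here is a minimal sketch on a tiny made-up tibble (toy and its values are hypothetical). It also shows that a variable created in a mutate call can be used later in the same call.

```r
library(dplyr)

# Hypothetical toy scores, for illustration only
toy = tibble(read = c(50, 60), write = c(70, 80))

# mutate adds the new columns and keeps the existing ones;
# above65 is computed from avg, which is created in the same call
toy2 = toy %>% mutate(avg = (read + write) / 2, above65 = avg > 65)
toy2
```

Note that mutate does not modify toy; we have to assign the result, as we did with hsb2 above.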
arrange: sort
You can use arrange to sort the data by the values of some variable (the default is ascending; you can use desc() to sort in descending order).
For example, if you want to sort by avg, which is the new variable we created in the previous section:
hsb2 %>% arrange(avg)
## # A tibble: 200 x 12
## id gender race ses schtyp prog read write math science socst
## <int> <chr> <chr> <fct> <fct> <fct> <int> <int> <int> <int> <int>
## 1 45 female afri… low public voca… 34 35 41 29 26
## 2 67 male white low public voca… 37 37 42 33 32
## 3 108 male white midd… public gene… 34 33 41 36 36
## 4 53 male afri… midd… public voca… 34 37 46 39 31
## 5 15 male hisp… high public voca… 39 39 44 26 42
## 6 133 male white midd… public voca… 50 31 40 34 31
## 7 51 female afri… high public gene… 42 36 42 31 39
## 8 16 male hisp… low public voca… 47 31 44 36 36
## 9 164 male white midd… public voca… 31 36 46 39 46
## 10 107 male white low public voca… 47 39 47 42 26
## # ... with 190 more rows, and 1 more variable: avg <dbl>
If you want to sort in descending order:
hsb2 %>% arrange(desc(avg))
## # A tibble: 200 x 12
## id gender race ses schtyp prog read write math science socst
## <int> <chr> <chr> <fct> <fct> <fct> <int> <int> <int> <int> <int>
## 1 61 female white high public acad… 76 63 60 67 66
## 2 132 male white midd… public acad… 73 62 73 69 66
## 3 192 male white high priva… acad… 65 67 63 66 71
## 4 68 male white midd… public acad… 73 67 71 63 66
## 5 100 female white high public acad… 63 65 71 69 71
## 6 157 male white midd… public gene… 68 59 58 74 66
## 7 95 male white high public acad… 73 60 71 61 71
## 8 180 female white high priva… acad… 71 65 69 58 71
## 9 143 male white midd… public voca… 63 63 75 72 66
## 10 93 female white high public acad… 73 67 62 58 66
## # ... with 190 more rows, and 1 more variable: avg <dbl>
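arrange also accepts more than one variable, in which case the later variables are used to break ties in the earlier ones. Here is a minimal sketch on a tiny made-up tibble (toy and its values are hypothetical):

```r
library(dplyr)

# Hypothetical toy data with a tie in avg, for illustration only
toy = tibble(
  name = c("a", "b", "c", "d"),
  avg  = c(60, 70, 60, 50),
  math = c(55, 80, 65, 40)
)

# Sort by avg in descending order; break ties in avg by math in ascending order
sorted = toy %>% arrange(desc(avg), math)
sorted$name
```

Rows a and c are tied on avg, so they are ordered by math, with a (55) before c (65).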
group_by and summarize: obtain summaries by variables
We can create objects which contain summaries for different groups by combining group_by and summarize:
hsb2 %>% group_by(race) %>% summarize(medMath = median(math), sdMath = sd(math))
## # A tibble: 4 x 3
## race medMath sdMath
## <chr> <dbl> <dbl>
## 1 african american 45 6.49
## 2 asian 61 10.1
## 3 hispanic 47 6.98
## 4 white 54 9.38
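To check by hand what group_by and summarize compute, here is a minimal sketch on a tiny made-up tibble (toy and its values are hypothetical). The result has one row per group and one column per summary.

```r
library(dplyr)

# Hypothetical toy data, for illustration only
toy = tibble(
  group = c("a", "a", "b", "b", "b"),
  score = c(10, 20, 1, 2, 6)
)

# One row per group: group size, median, and mean of score
by_group = toy %>%
  group_by(group) %>%
  summarize(n = n(), med = median(score), avg = mean(score))
by_group
```

Group a has scores 10 and 20, so its median and mean are both 15; group b has scores 1, 2, and 6, so its median is 2 and its mean is 3.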
And you can combine these functions with the other functions we learned today. For example:
hsb2 %>% group_by(race) %>% filter(math > 70) %>% summarize(n=n())
## # A tibble: 2 x 2
## race n
## <chr> <int>
## 1 asian 1
## 2 white 9
This tells us that there are 10 people who got a math score greater than 70, and that 1 of them is asian and 9 of them are white.
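Groups with no rows left after the filter do not show up in the result, which is why only two races appear above. The sketch below reproduces this on a tiny made-up tibble (toy and its values are hypothetical) and shows two equivalent ways of writing the same count: filtering before grouping, and the dplyr shorthand count().

```r
library(dplyr)

# Hypothetical toy data, for illustration only
toy = tibble(
  race = c("asian", "white", "white", "white", "hispanic"),
  math = c(72, 75, 71, 50, 60)
)

# Group first, then filter, as in the example above
group_first = toy %>% group_by(race) %>% filter(math > 70) %>% summarize(n = n())

# Filter first, then group
filter_first = toy %>% filter(math > 70) %>% group_by(race) %>% summarize(n = n())

# count() is a shorthand for group_by followed by summarize(n = n())
counted = toy %>% filter(math > 70) %>% count(race)

group_first
```

All three give 1 for asian and 2 for white; hispanic is absent because no row in that group has math greater than 70.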
Next time