In this chapter, we’ll cover the basics of ggplot2. There are many good resources on the topic.
Some of them are:
We will use the diamonds dataset in library(ggplot2), so we load the library and the dataset.
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.5.2
data(diamonds)
The dataset has 53940 observations and 10 columns, with column names
## [1] "carat" "cut" "color" "clarity" "depth" "table" "price"
## [8] "x" "y" "z"
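The dimensions and column names quoted above can be checked directly. This is a minimal sketch using base R functions; the variable names dims and col_names are only illustrative.

```r
library(ggplot2)
data(diamonds)

# dim() returns c(number of rows, number of columns)
dims <- dim(diamonds)
dims

# names() lists the column names in order
col_names <- names(diamonds)
col_names

# str() gives a compact overview of each column's type
str(diamonds)
```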
You can get more information on what the variables are by running the R command
?diamonds
qplot
We have seen how to create some basic plots with qplot already. Now it’s time to summarize what we learned and go a little deeper. The good thing about qplot is that it has sensible defaults for many common situations.
One variable at a time!
The most common plots for univariate categorical data are pie charts and bar plots. People who have done research on data visualization generally advise against pie charts. For example, if you get help for the function pie (which produces pie charts in the base graphics library in R), you get the following note:
Pie charts are a very bad way of displaying information. The eye is good at judging linear measures and bad at judging relative areas. A bar chart or dot chart is a preferable way of displaying this type of data.
Cleveland (1985), page 264: “Data that can be shown by pie charts always can be shown by a dot chart. This means that judgements of position along a common scale can be made instead of the less accurate angle judgements.” This statement is based on the empirical investigations of Cleveland and McGill as well as investigations by perceptual psychologists.
It’s possible to create pie charts with ggplot2 (see e.g. this link) but we won’t cover them here.
Creating a bar plot with qplot is very easy:
qplot(cut, data=diamonds)
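For readers on a recent ggplot2 release, note that qplot was deprecated in version 3.4.0, so the call above may print a deprecation warning. As a rough equivalent, the same bar plot can be written with the general ggplot interface covered later in this chapter. The name bar_cut is hypothetical.

```r
library(ggplot2)
data(diamonds)

# Map cut to the x axis and let geom_bar() count the rows in each category.
bar_cut <- ggplot(diamonds, aes(x = cut)) + geom_bar()
bar_cut
```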
In ggplot2, plots can be saved as variables. This is a useful feature because, in ggplot2, we create visualizations sequentially by adding layers to a plot. It can also make our code look cleaner. Let’s save the plot in a variable and add layers and settings to it.
barcut = qplot(cut, data=diamonds)
Sometimes I feel like the default font size of the plots is too small. We can change the font size as follows:
barcut = barcut+theme(text=element_text(size=15))
barcut
We can change the color of the bars:
barcut = barcut+geom_bar(fill='steelblue')
barcut
And we can flip the coordinates:
barcut = barcut+coord_flip()
barcut
We can add a title, too!
barcut = barcut+ggtitle("Quality of cut")
barcut
We can change labels:
barcut = barcut+xlab("quality of cut")+ylab("count")
barcut
We can center the title:
barcut + theme(plot.title = element_text(hjust = 0.5))
Since we were saving the changes as we produced them, all the previous changes to the plot were saved. An equivalent chunk of code to produce the plot is
qplot(cut, data=diamonds)+
  theme(text=element_text(size=15))+
  geom_bar(fill='steelblue')+
  coord_flip()+
  ggtitle("Quality of cut")+
  xlab("quality of cut")+ylab("count")+
  theme(plot.title = element_text(hjust = 0.5))
We can change the general theme quite easily as well:
qplot(cut, data=diamonds)+theme_minimal()
You can find a list of default themes here. You’ll have access to more themes if you install the ggthemes package and load it with library(ggthemes). See this reference for more details.
In qplot, the default plot for quantitative data is a histogram:
qplot(price,data=diamonds)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
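The message above is a prompt to choose the bin width yourself rather than rely on the default of 30 bins. A minimal sketch, assuming that a width of 500 (dollars) is a reasonable choice for price; the value and the name hist_price are illustrative only.

```r
library(ggplot2)
data(diamonds)

# binwidth is passed through qplot to the histogram layer,
# so each bar covers a 500-dollar range of price.
hist_price <- qplot(price, data = diamonds, geom = "histogram", binwidth = 500)
hist_price
```

With an explicit binwidth the stat_bin message is no longer printed.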
We can also make density plots.
qplot(price, geom='density', data=diamonds)
We can change the color of the density plot as follows:
qplot(price, geom='density', data=diamonds) + geom_density(fill='steelblue')
Creating stacked barplots is easy.
q1 = qplot(x=cut, fill=color, data=diamonds)
q1
Alternatively,
q2 = qplot(x=color, fill=cut, data=diamonds)
q2
We can play around with the color palette with scale_fill_brewer.
q2 + scale_fill_brewer(palette="Spectral")
Different options here! But some of them are bad. For example, I think this is a bad plot:
qplot(x=price, fill=cut, data=diamonds)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
How can we create a better plot? A less bad plot is
qplot(x=price, color=cut, geom='density', data=diamonds)
Side-by-side boxplots are a better alternative:
qplot(x=cut, y=price, geom='boxplot', data=diamonds)+coord_flip()
We can add color using fill:
qplot(x=cut, y=price, fill=cut, geom='boxplot', data=diamonds)+
theme(legend.position="none")+
coord_flip()
The code theme(legend.position="none") gets rid of the legend.
Another option is using facets.
qplot(price, facets = cut ~ ., data=diamonds)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
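Because the cut categories contain very different numbers of diamonds, a shared count axis can flatten the smaller panels. One possible variation, sketched here, adds facet_grid explicitly with scales = "free_y" so each panel gets its own vertical scale. The bin width of 500 and the name facet_hist are assumptions for illustration.

```r
library(ggplot2)
data(diamonds)

# One histogram per cut, stacked vertically, each with its own y axis.
facet_hist <- qplot(price, data = diamonds, geom = "histogram", binwidth = 500) +
  facet_grid(cut ~ ., scales = "free_y")
facet_hist
```

The trade-off is that bar heights are no longer comparable across panels; only the shapes of the distributions are.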
Scatterplots are your best bet here.
qplot(x=carat, y=price, data=diamonds)
We can add a smoothed trend:
qplot(x=carat, y=price, data=diamonds)+geom_smooth()
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
And we can fit a linear trend:
qplot(x=carat, y=price, data=diamonds)+geom_smooth(method='lm')
The linear fit isn’t very good. The relationship looks much closer to linear on the log-log scale.
qplot(x=log(carat), y=log(price), data=diamonds)+geom_smooth(method='lm')
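As a rough numerical check of the claim above, we can compare the R-squared of the two fits, keeping in mind that the two values are computed on different response scales and so are only loosely comparable. The sketch also shows an alternative way to draw the log-log plot with scale_x_log10 and scale_y_log10, which keeps the axis labels in the original units. All variable names here are illustrative.

```r
library(ggplot2)
data(diamonds)

# Fit on the raw scale and on the log-log scale, then compare R-squared.
fit_linear <- lm(price ~ carat, data = diamonds)
fit_loglog <- lm(log(price) ~ log(carat), data = diamonds)
r2_linear <- summary(fit_linear)$r.squared
r2_loglog <- summary(fit_loglog)$r.squared
c(linear = r2_linear, loglog = r2_loglog)

# Same relationship with log10 axes: tick labels stay in carats and dollars,
# and geom_smooth() fits the line on the transformed values.
loglog_plot <- qplot(x = carat, y = price, data = diamonds) +
  scale_x_log10() +
  scale_y_log10() +
  geom_smooth(method = 'lm')
loglog_plot
```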
The relevant commands here are
ggtitle: for changing the title
xlab, ylab: \(x\) and \(y\) labels
xlim, ylim: limits / scale of the plot
qplot(x=carat, y=price, data=diamonds)+
xlab("carat (weight)") +
ylab("price ($)") +
ggtitle("Price vs carat") +
xlim(c(0,10))+
ylim(c(0,30000))
qplot
We can create scatterplots with colored dots of different shapes:
qplot(x=carat, y=price, color=cut, shape=cut, data=diamonds)
## Warning: Using shapes for an ordinal variable is not advised
We can plot colored smoothed curves (potentially overlaid on points; not recommended, though).
qplot(x=carat, y=price, color=cut, geom='smooth', data=diamonds)+geom_point(alpha=0.02)
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
We can plot panels with colored dots:
qplot(x=carat, y=price, color=cut, facets=color ~ ., data=diamonds)
Double panels!
qplot(x=carat, y=price, facets=color~cut, data=diamonds)
Double panels of colored dots… [Q: how many variables are we plotting at once now?]
qplot(x=carat, y=price, color=clarity, facets=color~cut, data=diamonds)
We could redo all of these with smooth curves/lines instead.
How can we plot relationships between 3 numerical/quantitative variables? This is one option
qplot(x=depth, y=carat, color=price, data=diamonds)
Another option is categorizing one of the variables, and then plotting bivariate relationships in panels. Below we partition the variable depth into 4 categories (defined by its quartiles):
diamonds$depthcat = cut(diamonds$depth, breaks=quantile(diamonds$depth), include.lowest = TRUE)
qplot(x=carat, y=price, facets=.~depthcat, data=diamonds)
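To see what the categorization step does, it can help to look at the breaks and the resulting groups on their own. This sketch works on separate vectors (depth_breaks and depth_cat, both hypothetical names) rather than modifying diamonds.

```r
library(ggplot2)
data(diamonds)

# quantile() with default settings returns the 0%, 25%, 50%, 75% and 100% points.
depth_breaks <- quantile(diamonds$depth)
depth_breaks

# include.lowest = TRUE closes the first interval on the left, so the
# minimum depth is assigned to a category instead of becoming NA.
depth_cat <- cut(diamonds$depth, breaks = depth_breaks, include.lowest = TRUE)

# Number of diamonds in each quartile-based category
table(depth_cat)
```

Here cut refers to the base R function, not to the cut column of diamonds.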
grid.arrange
Sometimes it’s useful to have unrelated plots in one panel. Some of you have seen base R plots, and have used par(mfrow=c(,)) before. Unfortunately, par(mfrow=c(,)) doesn’t work with ggplot. Fortunately, we have grid.arrange in library(gridExtra):
library(gridExtra)
p1 = qplot(price, data=diamonds)
p2 = qplot(cut, data = diamonds)
grid.arrange(p1, p2, nrow=2)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
If you want to learn more options, you can go here.
library(GGally)
There are some good extra plots based on ggplot2 in library(GGally). One of them is the equivalent of plot(dataset):
library(GGally)
ggpairs(diamonds[,5:10])
Another useful function is ggcorr, which creates heatmaps (which are helpful for visualizing correlation matrices). For example, we can create a heatmap for the variables mpg, cyl, hp, and wt in data(mtcars):
library(dplyr) # provides %>% and select()
ggcorr(mtcars %>% select(mpg, cyl, hp, wt), label = TRUE)
You can plot confidence intervals for regression coefficients easily:
data(mtcars)
mod = lm(mpg~wt+qsec+cyl,data=mtcars)
ggcoef(mod)
ggplot
So far, we used qplot. Now, we’ll use the more general ggplot machinery. Plotting with ggplot can be more or less broken down into the following steps.
Read in the data.
Add aesthetics (aes): Which variables go into the plot? What are the x and y? What variables are you using for color-coding/shapes/etc.?
Add geoms: What kind of plot do you want?
Do data transformations, if needed.
Change labels/theme.
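The steps above can be sketched in a single chain. This is only one illustrative arrangement: the choice of variables, the log10 axes used as the transformation step, the alpha value, and the name p_steps are all assumptions made for the example.

```r
library(ggplot2)

# Step 1: read in the data
data(diamonds)

p_steps <- ggplot(diamonds) +
  # Step 2: aesthetics, i.e. which variables map to x, y and color
  aes(x = carat, y = price, color = cut) +
  # Step 3: geom, here a scatterplot with semi-transparent points
  geom_point(alpha = 0.2) +
  # Step 4: transformations, here log10 scales on both axes
  scale_x_log10() +
  scale_y_log10() +
  # Step 5: labels and theme
  labs(title = "Price vs carat", x = "carat (weight)", y = "price ($)") +
  theme_minimal()
p_steps
```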
Let’s read in the diamonds and make some plots.
library(ggplot2)
data(diamonds)
Let’s create a stacked barplot of counts:
ggplot(diamonds) + aes(x=cut, fill=color) + geom_bar()
We can change the type of barplot easily by adding options in the geom_bar function. For example, position='fill' gives a stacked percentage barplot:
ggplot(diamonds) + aes(x=cut, fill=color) + geom_bar(position='fill')
If you want side-by-side bars, you can use position = 'dodge':
ggplot(diamonds) + aes(x=cut, fill=color) + geom_bar(position='dodge')
We can use the same structure to produce a color-coded scatterplot:
p = ggplot(diamonds) + aes(x=carat, y=price, color=cut) + geom_point()
p
Note that we used the option color in aes.
We can break down plots in panels using facet_grid:
p = p + facet_grid(. ~ cut)
p
And we can combine geoms. For example:
ggplot(diamonds) + aes(x=price) + geom_histogram(binwidth = 500, aes(y=..density..))+geom_density(color='red', size=1)
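On recent ggplot2 releases (3.4.0 and later), the ..density.. notation and the size argument for lines are deprecated and may produce warnings. A sketch of the same plot in the newer style, assuming such a version is installed, uses after_stat(density) and linewidth instead. The name hist_dens is hypothetical.

```r
library(ggplot2)
data(diamonds)

hist_dens <- ggplot(diamonds) + aes(x = price) +
  # after_stat(density) rescales bar heights so the histogram has total area 1
  geom_histogram(aes(y = after_stat(density)), binwidth = 500) +
  # linewidth controls the thickness of the density curve
  geom_density(color = 'red', linewidth = 1)
hist_dens
```

Putting the histogram on the density scale is what lets the two layers share a y axis: both then describe areas that sum to 1.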