In this chapter, we’ll cover the basics of ggplot2. There are many good resources on the topic.

We will use the diamonds dataset in library(ggplot2), so we load the library and the dataset.

library(ggplot2)
data(diamonds)

The dataset has 53,940 observations and 10 columns, with the following column names:

##  [1] "carat"   "cut"     "color"   "clarity" "depth"   "table"   "price"  
##  [8] "x"       "y"       "z"
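
The dimensions and column names quoted above can be checked directly. A minimal sketch using base R functions on the loaded dataset (`dims` and `cols` are just illustrative names):

```r
library(ggplot2)
data(diamonds)

# Number of rows (observations) and columns
dims = dim(diamonds)
dims

# Column names of the dataset
cols = names(diamonds)
cols
```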

You can get more information on what the variables are by running the R command

?diamonds

Basic plots with qplot

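We have seen how to create some basic plots with qplot already. Now it’s time to summarize what we learned and go a little deeper. The good thing about qplot is that it has sensible defaults for a fair number of situations. (In ggplot2 3.4.0 and later, qplot still works but is deprecated in favor of ggplot, which we cover later in this chapter.)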

Univariate plots

One variable at a time!

Categorical

The most common plots for univariate categorical data are pie charts and bar plots. People who have done research on data visualization agree that pie charts are bad. For example, if you get help for the function pie (which produces pie charts in base R graphics), you get the following note:

Pie charts are a very bad way of displaying information. The eye is good at judging linear measures and bad at judging relative areas. A bar chart or dot chart is a preferable way of displaying this type of data.

Cleveland (1985), page 264: “Data that can be shown by pie charts always can be shown by a dot chart. This means that judgements of position along a common scale can be made instead of the less accurate angle judgements.” This statement is based on the empirical investigations of Cleveland and McGill as well as investigations by perceptual psychologists.

It’s possible to create pie charts with ggplot2, but we won’t cover them here.

Creating a bar plot with qplot is very easy:

qplot(cut, data=diamonds)
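
The height of each bar is the number of diamonds with that cut. As a quick check, the same counts can be tabulated in base R. This is only a sketch, and `cut_counts` is an illustrative name:

```r
library(ggplot2)
data(diamonds)

# Count the diamonds in each cut category; these are the bar heights
cut_counts = table(diamonds$cut)
cut_counts
```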

In ggplot2, plots can be saved as variables. This is a useful feature because, in ggplot2, we create visualizations sequentially by adding layers to a plot. It can also make our code look cleaner. Let’s save the plot in a variable and add stuff to it.

barcut = qplot(cut, data=diamonds)

Sometimes I feel like the default font size of the plots is too small. We can change the font size as follows:

barcut = barcut+theme(text=element_text(size=15))
barcut

We can change the color of the bars:

barcut = barcut+geom_bar(fill='steelblue')
barcut

And we can flip the coordinates:

barcut = barcut+coord_flip()
barcut

We can add a title, too!

barcut = barcut+ggtitle("Quality of cut")
barcut 

We can change labels:

barcut = barcut+xlab("quality of cut")+ylab("count")
barcut 

We can center the title:

barcut + theme(plot.title = element_text(hjust = 0.5))

Since we were saving the changes as we produced them, all the previous changes to the plot were saved. As you might imagine, an equivalent chunk of code to produce the plot is

qplot(cut, data=diamonds)+
  theme(text=element_text(size=15))+
  geom_bar(fill='steelblue')+
  coord_flip()+
  ggtitle("Quality of cut")+
  xlab("quality of cut")+ylab("count")+
  theme(plot.title = element_text(hjust = 0.5))

We can change the general theme quite easily as well:

qplot(cut, data=diamonds)+theme_minimal()

ggplot2 comes with several built-in themes, such as theme_bw() and theme_classic(). You’ll have access to more themes if you install the ggthemes package.

Quantitative

In qplot, the default plot for quantitative data is a histogram:

qplot(price,data=diamonds)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
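
The message printed above is a reminder that the default of 30 bins is arbitrary. A minimal sketch of setting the bin width explicitly follows; the value 500 (in dollars) is an illustrative choice, not a recommendation, and `hist_price` is an illustrative name:

```r
library(ggplot2)
data(diamonds)

# Each bar now covers a price range of 500
hist_price = qplot(price, data = diamonds, binwidth = 500)
hist_price
```

Trying a few different bin widths is usually worthwhile, since features of the distribution can appear or vanish depending on the choice.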

We can also make density plots.

qplot(price, geom='density', data=diamonds) 

We can fill the area under the density curve with a color as follows:

qplot(price, geom='density', data=diamonds) + geom_density(fill='steelblue')

Bivariate

Categorical vs Categorical

Creating stacked barplots is easy.

q1 = qplot(x=cut, fill=color, data=diamonds)
q1

Alternatively,

q2 = qplot(x=color, fill=cut, data=diamonds)
q2 

We can play around with the color palette with scale_fill_brewer.

q2 + scale_fill_brewer(palette="Spectral")

Categorical vs Quantitative

Different options here! But some of them are bad. For example, I think this is a bad plot:

qplot(x=price, fill=cut, data=diamonds)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

How can we create a better plot? Overlaid density curves are less bad:

qplot(x=price, color=cut, geom='density', data=diamonds)

Side-by-side boxplots are a better alternative:

qplot(x=cut, y=price, geom='boxplot', data=diamonds)+coord_flip()

We can add color using fill:

qplot(x=cut, y=price, fill=cut, geom='boxplot', data=diamonds)+
  theme(legend.position="none")+
  coord_flip()

The code theme(legend.position="none") gets rid of the legend.
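
Boxplots summarize each group by its quartiles. A sketch of computing the per-cut medians behind the plots above, using base R's tapply (`median_price` is an illustrative name):

```r
library(ggplot2)
data(diamonds)

# Median price within each cut category (the line inside each box)
median_price = tapply(diamonds$price, diamonds$cut, median)
median_price
```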

Another option is using facets.

qplot(price, facets = cut ~ ., data=diamonds)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Quantitative vs Quantitative

Scatterplots are your best bet here.

qplot(x=carat, y=price, data=diamonds)

We can add a smoothed trend:

qplot(x=carat, y=price, data=diamonds)+geom_smooth()
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

And we can fit a linear trend:

qplot(x=carat, y=price, data=diamonds)+geom_smooth(method='lm')

The linear fit isn’t very good. The relationship looks much closer to linear on the log-log scale:

qplot(x=log(carat), y=log(price), data=diamonds)+geom_smooth(method='lm')
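
The straight line on the log-log plot corresponds to a linear model of log(price) on log(carat). A minimal sketch of fitting that model directly with lm, so that the intercept and slope can be inspected (`loglog_fit` is an illustrative name):

```r
library(ggplot2)
data(diamonds)

# Same model that geom_smooth(method='lm') fits on the log-log scale
loglog_fit = lm(log(price) ~ log(carat), data = diamonds)
coef(loglog_fit)
```

On this scale, the slope can be read approximately as the percentage change in price associated with a one percent change in carat.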

Axes, titles and labels

The relevant commands here are xlab, ylab, ggtitle, xlim, and ylim:

qplot(x=carat, y=price, data=diamonds)+ 
  xlab("carat (weight)") + 
  ylab("price ($)") + 
  ggtitle("Price vs carat") +
  xlim(c(0,10))+
  ylim(c(0,30000))
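
One caveat, stated as general ggplot2 behavior rather than something the plot above runs into: xlim and ylim drop observations that fall outside the limits before anything is drawn, which can change fitted trends. coord_cartesian zooms in without dropping data. A sketch with illustrative limits (`zoomed` is an illustrative name):

```r
library(ggplot2)
data(diamonds)

zoomed = qplot(x = carat, y = price, data = diamonds) +
  geom_smooth(method = "lm") +
  # Zoom in on diamonds up to 3 carats; all points still enter the fit
  coord_cartesian(xlim = c(0, 3), ylim = c(0, 20000))
zoomed
```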

More qplot

More than 2 variables at once

We can create scatterplots with colored dots of different shapes:

qplot(x=carat, y=price, color=cut, shape=cut, data=diamonds)
## Warning: Using shapes for an ordinal variable is not advised

We can plot colored smoothed curves (potentially overlaid on points; not recommended, though).

qplot(x=carat, y=price, color=cut, geom='smooth', data=diamonds)+geom_point(alpha=0.02)
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

We can plot panels with colored dots:

qplot(x=carat, y=price, color=cut, facets=color ~ ., data=diamonds)

Double panels!

qplot(x=carat, y=price, facets=color~cut, data=diamonds)

Double panels of colored dots… [Q: how many variables are we plotting at once now?]

qplot(x=carat, y=price, color=clarity, facets=color~cut, data=diamonds)

We could redo all of these with smooth curves/lines instead.

How can we plot relationships between 3 numerical/quantitative variables? One option is to map the third variable to color:

qplot(x=depth, y=carat, color=price, data=diamonds)

Another option is categorizing one of the variables, and then plotting bivariate relationships in panels. Below we partition the variable depth into 4 categories (defined by its quartiles):

diamonds$depthcat = cut(diamonds$depth, breaks=quantile(diamonds$depth), include.lowest = TRUE)
qplot(x=carat, y=price, facets=.~depthcat, data=diamonds)
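
Since the breaks are quartiles, each depth category should hold roughly a quarter of the diamonds (ties at the breakpoints make the split inexact). A sketch that checks this without modifying the dataset; here `depthcat` is a standalone vector rather than a new column:

```r
library(ggplot2)
data(diamonds)

# Same binning as above, kept in a separate vector
depthcat = cut(diamonds$depth,
               breaks = quantile(diamonds$depth),
               include.lowest = TRUE)

# Number of diamonds in each of the 4 depth categories
table(depthcat)
```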

grid.arrange

Sometimes it’s useful to have unrelated plots in one panel. Some of you have seen base R plots, and have used par(mfrow=c(,)) before. Unfortunately, par(mfrow=c(,)) doesn’t work with ggplot. Fortunately, we have grid.arrange in library(gridExtra):

library(gridExtra)
p1 = qplot(price, data=diamonds)
p2 = qplot(cut, data = diamonds)
grid.arrange(p1, p2, nrow=2)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Some cool plots in library(GGally)

There are some good extra plots based on ggplot2 in library(GGally). One of them is ggpairs, the equivalent of plot(dataset), which draws a matrix of pairwise plots:

library(GGally)
ggpairs(diamonds[,5:10])

Another useful function is ggcorr, which draws a heatmap of a correlation matrix. For example, we can create a heatmap for the variables mpg, cyl, hp, and wt in data(mtcars):

library(dplyr)  # provides %>% and select()
ggcorr(mtcars %>% select(mpg, cyl, hp, wt), label = TRUE)
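
The heatmap displays pairwise correlations. A sketch of computing the same kind of matrix numerically with base R's cor, which needs no extra packages (`cor_mat` is an illustrative name, and Pearson correlation is cor's default):

```r
data(mtcars)

# Pairwise Pearson correlations between the four variables
cor_mat = cor(mtcars[, c("mpg", "cyl", "hp", "wt")])
round(cor_mat, 2)
```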

You can plot confidence intervals for regression coefficients easily:

data(mtcars)
mod = lm(mpg~wt+qsec+cyl,data=mtcars)
ggcoef(mod)
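
The intervals in the plot can also be read off as numbers. A sketch using base R's confint on the same model; the 95% level is confint's default and is assumed here to match what the plot shows (`ci` is an illustrative name):

```r
data(mtcars)
mod = lm(mpg ~ wt + qsec + cyl, data = mtcars)

# 95% confidence intervals for the intercept and each coefficient
ci = confint(mod)
ci
```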

More ggplot

So far, we used qplot. Now, we’ll use the more general ggplot machinery. Plotting with ggplot can be more or less broken down into the following steps.

  1. Read in the data

  2. Add aesthetics (aes): Which variables go into the plot? What are the x and y? What variables are you using for color-coding/shapes/etc.?

  3. Add geoms: What kind of plot do you want?

  4. Do data transformations, if needed.

  5. Change labels/theme.
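
As a sketch of how these steps map onto code (the particular variables, transformation, labels, and the name `steps_plot` are illustrative choices):

```r
library(ggplot2)
data(diamonds)                               # 1. read in the data

steps_plot = ggplot(diamonds) +
  aes(x = carat, y = price, color = cut) +   # 2. aesthetics
  geom_point(alpha = 0.3) +                  # 3. geom
  scale_x_log10() + scale_y_log10() +        # 4. transformation
  labs(x = "carat (log scale)",
       y = "price (log scale)") +            # 5. labels
  theme_minimal()                            #    and theme
steps_plot
```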

Let’s read in the diamonds data and make some plots.

library(ggplot2)
data(diamonds)

Let’s create a stacked barplot of counts:

ggplot(diamonds) + aes(x=cut, fill=color) + geom_bar()

We can change the type of barplot easily by adding options in the geom_bar function. For example, position='fill' produces a stacked percentage barplot:

ggplot(diamonds) + aes(x=cut, fill=color) + geom_bar(position='fill')
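
With position='fill', the vertical axis shows proportions even though its default label still reads "count". A sketch of relabeling it; scales::percent comes from the scales package, which is installed alongside ggplot2, and the axis title is an illustrative choice:

```r
library(ggplot2)
data(diamonds)

fill_plot = ggplot(diamonds) +
  aes(x = cut, fill = color) +
  geom_bar(position = "fill") +
  # Show 0 to 1 as 0% to 100% and rename the axis
  scale_y_continuous(labels = scales::percent) +
  ylab("share of diamonds")
fill_plot
```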

If you want side-by-side bars, you can use position = 'dodge':

ggplot(diamonds) + aes(x=cut, fill=color) + geom_bar(position='dodge')

We can use the same structure to produce a color-coded scatterplot:

p = ggplot(diamonds) + aes(x=carat, y=price, color=cut) + geom_point()
p

Note that we mapped cut to color inside aes, so each cut gets its own color and a legend is added automatically.

We can break down plots in panels using facet_grid:

p = p + facet_grid(. ~ cut)
p
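
facet_grid(. ~ cut) puts all five panels in a single row, which can get cramped. A sketch of an alternative, facet_wrap, which wraps the panels over several rows; nrow = 2 and the name `wrapped` are illustrative choices:

```r
library(ggplot2)
data(diamonds)

wrapped = ggplot(diamonds) +
  aes(x = carat, y = price, color = cut) +
  geom_point() +
  # One panel per cut, laid out over 2 rows
  facet_wrap(~ cut, nrow = 2)
wrapped
```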

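And we can combine geoms. For example, we can overlay a density curve on a histogram drawn on the density scale (after_stat and linewidth require ggplot2 3.4.0 or later; older versions use ..density.. and size instead):

ggplot(diamonds) + aes(x=price) + geom_histogram(binwidth = 500, aes(y=after_stat(density))) +
  geom_density(color='red', linewidth=1)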