More ggplot and intro to the library(dplyr)

More `qplot`

library(ggplot2)
data(diamonds)

More than 2 variables at once

We can create scatterplots with colored dots of different shapes:

qplot(x=carat, y=price, color=cut, shape=cut, data=diamonds)

## Warning: Using shapes for an ordinal variable is not advised

We can plot colored smoothed curves, one per cut. They can be overlaid on the points, though that’s not recommended:

qplot(x=carat, y=price, color=cut, geom='smooth', data=diamonds)+geom_point(alpha=0.02)

## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

We can plot panels with colored dots:

qplot(x=carat, y=price, color=cut, facets=color ~ ., data=diamonds)

Double panels!

qplot(x=carat, y=price, facets=color~cut, data=diamonds)

Double panels of colored dots… [Q: how many variables are we plotting at once now?]

qplot(x=carat, y=price, color=clarity, facets=color~cut, data=diamonds)

We could redo all of these with smooth curves/lines instead.

How can we plot relationships between 3 numerical/quantitative variables? One option is to map the third variable to color:

qplot(x=depth, y=carat, color=price, data=diamonds)

Another option is categorizing one of the variables, and then plotting bivariate relationships in panels. Below we partition the variable depth into 4 categories (defined by its quartiles):

diamonds$depthcat = cut(diamonds$depth, breaks=quantile(diamonds$depth), include.lowest = TRUE)
qplot(x=carat, y=price, facets=.~depthcat, data=diamonds)
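To see what cut and quantile are doing here, it can help to try them on a tiny vector. The sketch below uses a made-up vector x (not part of the diamonds data) and shows why include.lowest = TRUE matters: without it, the minimum falls outside the first interval and becomes NA.

```r
# Made-up toy vector, only for illustration
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9)

# quantile() returns the minimum, the three quartiles, and the maximum
breaks <- quantile(x)

# Intervals are (a, b] by default; include.lowest = TRUE closes the first one
with_lowest <- cut(x, breaks = breaks, include.lowest = TRUE)
without_lowest <- cut(x, breaks = breaks)

table(with_lowest)           # number of observations in each of the 4 categories
sum(is.na(without_lowest))   # how many values were left without a category
```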

`grid.arrange`

Sometimes it’s useful to have unrelated plots in one panel.

Exercise: With the base graphics package, we can do that with par(mfrow=c( , )). For example, if we want a plot that has 2 rows, one with a histogram of price and another with a bar plot of cut, how would we do that?

Answer

par(mfrow=c(2,1))
hist(diamonds$price)
plot(diamonds$cut)

Unfortunately, par(mfrow=c(,)) doesn’t work with ggplot. Fortunately, we have grid.arrange in library(gridExtra):

library(gridExtra)
p1 = qplot(price, data=diamonds)
p2 = qplot(cut, data = diamonds)
grid.arrange(p1, p2, nrow=2)

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
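As a small variation, here is a sketch that puts the same two plots side by side with ncol = 2. It also uses arrangeGrob, which builds the layout without drawing it, so the layout can be stored in an object. The names p_price, p_cut and side_by_side are placeholders made up for this example.

```r
library(ggplot2)
library(gridExtra)
data(diamonds)

# The same two plots as before, written with ggplot() instead of qplot()
p_price <- ggplot(diamonds) + aes(x = price) + geom_histogram(bins = 30)
p_cut <- ggplot(diamonds) + aes(x = cut) + geom_bar()

# arrangeGrob() only builds the layout (1 row, 2 columns)
side_by_side <- arrangeGrob(p_price, p_cut, ncol = 2)

# grid.arrange() builds the same layout and draws it
grid.arrange(p_price, p_cut, ncol = 2)
```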

If you want to learn more options, see the documentation of the gridExtra package.

Some cool plots in `library(GGally)`

There are some good extra plots based on ggplot2 in library(GGally). One of them is the equivalent of plot(dataset):

library(GGally)
ggpairs(diamonds[,5:10])

You can plot confidence intervals for regression coefficients easily:

data(mtcars)
mod = lm(mpg~wt+qsec+cyl,data=mtcars)
ggcoef(mod)
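If you want the numbers behind a plot like this, base R’s confint gives confidence intervals for the coefficients of the same model. This is only a sketch: it assumes the usual 95% level, which may or may not match the settings ggcoef uses in your version of GGally.

```r
data(mtcars)
mod <- lm(mpg ~ wt + qsec + cyl, data = mtcars)

# One row per coefficient; the columns are the lower and upper limits
ci <- confint(mod, level = 0.95)

# Put the point estimates next to the interval limits
cbind(estimate = coef(mod), ci)
```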

More `ggplot`

Today we’ll cover more ggplot. Last time we used qplot and today we’ll use the more general ggplot machinery. Plotting with ggplot can be more or less broken down into the following steps.

Read in the data
Add aesthetics (aes): Which variables go into the plot? What are the x and y? What variables are you using for color-coding?
Add geoms: What kind of plot do you want?
Do data transformations, if needed.
Change labels/theme.
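As a rough sketch, the steps above could look like this for the diamonds data. The particular choices (log scales, the label text, theme_minimal) and the name p_steps are just for illustration.

```r
library(ggplot2)

data(diamonds)                                    # 1. read in the data

p_steps <- ggplot(diamonds) +
  aes(x = carat, y = price, color = cut) +        # 2. aesthetics
  geom_point(alpha = 0.1) +                       # 3. geom
  scale_x_log10() + scale_y_log10() +             # 4. data transformation
  labs(x = "carat (log scale)",
       y = "price (log scale)",
       title = "Price vs carat by cut") +         # 5. labels
  theme_minimal()                                 # 5. theme

p_steps
```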

We’ll see some concrete examples today.

Let’s read in the diamonds data first.

library(ggplot2)
data(diamonds)

Let’s create a stacked barplot of counts:

ggplot(diamonds) + aes(x=cut, fill=color) + geom_bar()

We can change the type of barplot easily by adding options in the geom_bar function. For example, position = 'fill' gives a stacked % barplot:

ggplot(diamonds) + aes(x=cut, fill=color) + geom_bar(position='fill')
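With position = 'fill', every bar is rescaled to have height 1, so what we see is the distribution of color within each cut. Here is a sketch of the same numbers in base R (counts and props are names made up for this example):

```r
library(ggplot2)   # only needed for the diamonds data
data(diamonds)

# Rows are cuts, columns are colors
counts <- table(diamonds$cut, diamonds$color)

# margin = 1 divides each row by its total, so each cut adds up to 1
props <- prop.table(counts, margin = 1)

round(props, 2)
rowSums(props)
```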

If you want side-by-side bars, you can use position = 'dodge'

ggplot(diamonds) + aes(x=cut, fill=color) + geom_bar(position='dodge')

We can use the same structure to produce a color-coded scatterplot:

p = ggplot(diamonds) + aes(x=carat, y=price, color=cut) + geom_point()
p

Note that we used the option color in aes.

We can break down plots in panels using facet_grid:

p = p + facet_grid(. ~ cut)
p

And you can combine geoms. For example, we can overlay a density curve on a histogram drawn on the density scale:

ggplot(diamonds) + aes(x=price) + geom_histogram(binwidth = 500, aes(y=after_stat(density))) + geom_density(color='red', linewidth=1)

Axes, titles and labels

The relevant commands here are

ggtitle: for changing the title
xlab, ylab: \(x\) and \(y\) labels
xlim, ylim: limits / scale of the plot

p + xlab("carat (weight)") + ylab("price ($)") + ggtitle("Price vs carat vs quality") + xlim(c(0,10))+ylim(c(0,30000))

You can also transform the axes, for example to a log scale:

p + coord_trans(x='log',y='log')
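A related option is scale_x_log10 and scale_y_log10. The difference is in the timing: scale transformations are applied to the data before any statistics are computed, whereas coord_trans is applied afterwards and only changes how the result is drawn. Here is a sketch of the two, starting from a fresh scatterplot (p_base, p_coord and p_scale are placeholder names):

```r
library(ggplot2)
data(diamonds)

p_base <- ggplot(diamonds) + aes(x = carat, y = price) + geom_point(alpha = 0.05)

# Transform the coordinate system: the data stay on the original scale
p_coord <- p_base + coord_trans(x = "log10", y = "log10")

# Transform the scales: the data are logged before stats (e.g. smoothers) run
p_scale <- p_base + scale_x_log10() + scale_y_log10()

p_coord
p_scale
```

For a plain scatterplot the two look very similar; the choice matters more once a geom that computes something (like a smoother) is added.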

Intro to the `library(dplyr)`

We can manipulate / filter datasets easily with functions in library(dplyr). Let’s use the hsb2 dataset in library(openintro) to illustrate some of the functions.

library(dplyr)
library(openintro)
data(hsb2)
hsb2 = as_tibble(hsb2)

`select`: select / drop variables

You can create subsets of the data that only contain a few of the variables with select. For example, if you want to create a subset that only has the variables math, race, gender, and ses:

sub1 = hsb2 %>% select(math, race, gender, ses)
sub1

## # A tibble: 200 x 4
##     math race             gender ses
##    <int> <chr>            <chr>  <fct>
##  1    41 white            male   low
##  2    53 white            female middle
##  3    54 white            male   high
##  4    47 white            male   high
##  5    57 white            male   middle
##  6    51 white            male   middle
##  7    42 african american male   middle
##  8    45 hispanic         male   middle
##  9    54 white            male   middle
## 10    52 african american male   middle
## # ... with 190 more rows

The operator %>% is the so-called pipe operator: it passes the object on its left as the first argument of the function on its right, so it can be used to chain functions in dplyr (it’s kind of the same idea as + in library(ggplot2)).

If, on the other hand, you want to create a subset that excludes the variables math, race, gender, and ses:

sub2 =  hsb2 %>% select(-math,-race,-gender,-ses)
sub2

## # A tibble: 200 x 7
##       id schtyp prog        read write science socst
##    <int> <fct>  <fct>      <int> <int>   <int> <int>
##  1    70 public general       57    52      47    57
##  2   121 public vocational    68    59      63    61
##  3    86 public general       44    33      58    31
##  4   141 public vocational    63    44      53    56
##  5   172 public academic      47    52      53    61
##  6   113 public academic      44    52      63    61
##  7    50 public general       50    59      53    61
##  8    11 public academic      34    46      39    36
##  9    84 public general       63    57      58    51
## 10    48 public academic      57    55      50    51
## # ... with 190 more rows
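To make the pipe a bit more concrete, here is a minimal sketch on made-up data (the tibble scores is not part of hsb2). It writes the same two steps with nested function calls and with %>%, and checks that the results agree.

```r
library(dplyr)

# Made-up data, only for illustration
scores <- tibble(id = 1:4,
                 read = c(57, 68, 44, 63),
                 math = c(41, 53, 54, 47))

# Without the pipe: the calls are nested and read inside-out
nested <- head(select(scores, id, math), 2)

# With the pipe: the same steps, read left to right
piped <- scores %>% select(id, math) %>% head(2)

identical(nested, piped)
```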

`filter`: filter by logical conditions

This one is pretty self-explanatory. As a reminder, the comparison and logical operators in R are:

==: equal to
!=: not equal to
>=, >, <, <=: greater than or equal to, greater than, less than, less than or equal to
|: or
&: and

For example, if you want to create a subset that only contains people who went to public school and got a score in math greater than 70, you can do that as follows:

sub3 = hsb2 %>% filter(math > 70, schtyp == 'public') 
sub3

## # A tibble: 8 x 11
##      id gender race  ses    schtyp prog     read write  math science socst
##   <int> <chr>  <chr> <fct>  <fct>  <fct>   <int> <int> <int>   <int> <int>
## 1    95 male   white high   public academ…    73    60    71      61    71
## 2   143 male   white middle public vocati…    63    63    75      72    66
## 3   132 male   white middle public academ…    73    62    73      69    66
## 4    68 male   white middle public academ…    73    67    71      63    66
## 5    57 female white middle public academ…    71    65    72      66    56
## 6   100 female white high   public academ…    63    65    71      69    71
## 7    33 female asian low    public academ…    57    65    72      54    56
## 8   161 female white low    public academ…    57    62    72      61    61

If, on the other hand, you want to filter those who got a score in math greater than 70 or went to public school (or both):

hsb2 %>% filter(math > 70 | schtyp == 'public')

## # A tibble: 170 x 11
##       id gender race    ses   schtyp prog   read write  math science socst
##    <int> <chr>  <chr>   <fct> <fct>  <fct> <int> <int> <int>   <int> <int>
##  1    70 male   white   low   public gene…    57    52    41      47    57
##  2   121 female white   midd… public voca…    68    59    53      63    61
##  3    86 male   white   high  public gene…    44    33    54      58    31
##  4   141 male   white   high  public voca…    63    44    47      53    56
##  5   172 male   white   midd… public acad…    47    52    57      53    61
##  6   113 male   white   midd… public acad…    44    52    51      63    61
##  7    50 male   africa… midd… public gene…    50    59    42      53    61
##  8    11 male   hispan… midd… public acad…    34    46    45      39    36
##  9    84 male   white   midd… public gene…    63    57    54      58    51
## 10    48 male   africa… midd… public acad…    57    55    52      50    51
## # ... with 160 more rows
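Conditions separated by commas in filter are combined with "and", so a comma and & give the same rows. Here is a small sketch on made-up data (scores is not part of hsb2) that compares the comma, &, and | versions:

```r
library(dplyr)

# Made-up data, only for illustration
scores <- tibble(
  id = 1:5,
  math = c(45, 72, 60, 81, 75),
  schtyp = c("public", "public", "private", "private", "public")
)

# Comma and & both mean "and"
with_comma <- scores %>% filter(math > 70, schtyp == "public")
with_and <- scores %>% filter(math > 70 & schtyp == "public")

# | means "or": a row is kept if at least one condition holds
with_or <- scores %>% filter(math > 70 | schtyp == "public")

with_comma
with_or
```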

You can combine select and filter. For example, if you only want to keep the values of math and schtyp for people who went to public school and got a score in math greater than 70, you can do that as follows:

sub3 = sub3 %>% select(math, schtyp)

Or, equivalently

sub3 =  hsb2 %>% filter(math > 70, schtyp == 'public') %>% select(math, schtyp)

`mutate`: transform variables

We can use mutate if we want to transform/create new variables. For example, if we want to create a new variable called avg which contains the average score in read, write, science, and socst:

hsb2 =  hsb2 %>% mutate(avg=(read+write+science+socst)/4)
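A single call to mutate can create several variables, and later ones can use the ones defined just before them. Here is a sketch on made-up data (the tibble scores and the variable above_50 are invented for this example):

```r
library(dplyr)

# Made-up data, only for illustration
scores <- tibble(read = c(57, 68, 44), write = c(52, 59, 33))

scores2 <- scores %>%
  mutate(avg = (read + write) / 2,   # new variable
         above_50 = avg > 50)        # uses avg, created on the previous line

scores2
```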

`arrange`: sort

You can use arrange to sort the data by the values of some variable (default is ascending; you can use desc() to sort in descending order).

For example, if you want to sort by avg, which is the new variable we created in the previous section:

hsb2 %>% arrange(avg)

## # A tibble: 200 x 12
##       id gender race  ses   schtyp prog   read write  math science socst
##    <int> <chr>  <chr> <fct> <fct>  <fct> <int> <int> <int>   <int> <int>
##  1    45 female afri… low   public voca…    34    35    41      29    26
##  2    67 male   white low   public voca…    37    37    42      33    32
##  3   108 male   white midd… public gene…    34    33    41      36    36
##  4    53 male   afri… midd… public voca…    34    37    46      39    31
##  5    15 male   hisp… high  public voca…    39    39    44      26    42
##  6   133 male   white midd… public voca…    50    31    40      34    31
##  7    51 female afri… high  public gene…    42    36    42      31    39
##  8    16 male   hisp… low   public voca…    47    31    44      36    36
##  9   164 male   white midd… public voca…    31    36    46      39    46
## 10   107 male   white low   public voca…    47    39    47      42    26
## # ... with 190 more rows, and 1 more variable: avg <dbl>

If you want to sort in descending order, wrap the variable in desc():

hsb2 %>% arrange(desc(avg))

## # A tibble: 200 x 12
##       id gender race  ses   schtyp prog   read write  math science socst
##    <int> <chr>  <chr> <fct> <fct>  <fct> <int> <int> <int>   <int> <int>
##  1    61 female white high  public acad…    76    63    60      67    66
##  2   132 male   white midd… public acad…    73    62    73      69    66
##  3   192 male   white high  priva… acad…    65    67    63      66    71
##  4    68 male   white midd… public acad…    73    67    71      63    66
##  5   100 female white high  public acad…    63    65    71      69    71
##  6   157 male   white midd… public gene…    68    59    58      74    66
##  7    95 male   white high  public acad…    73    60    71      61    71
##  8   180 female white high  priva… acad…    71    65    69      58    71
##  9   143 male   white midd… public voca…    63    63    75      72    66
## 10    93 female white high  public acad…    73    67    62      58    66
## # ... with 190 more rows, and 1 more variable: avg <dbl>
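arrange also accepts more than one variable: the data are sorted by the first one, and the following ones are used within groups of equal values. Here is a sketch on made-up data (scores is not part of hsb2) that sorts by school type and then by avg in descending order within each type:

```r
library(dplyr)

# Made-up data, only for illustration
scores <- tibble(
  id = 1:4,
  schtyp = c("public", "private", "public", "private"),
  avg = c(50, 60, 70, 40)
)

# First by schtyp (ascending), then by avg (descending) within each schtyp
sorted <- scores %>% arrange(schtyp, desc(avg))
sorted
```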

`group_by` and `summarize`: obtain summaries by variables

We can create objects which contain summaries for different groups by combining group_by and summarize:

hsb2 %>% group_by(race) %>% summarize(medMath = median(math), sdMath = sd(math))

## # A tibble: 4 x 3
##   race             medMath sdMath
##   <chr>              <dbl>  <dbl>
## 1 african american      45   6.49
## 2 asian                 61  10.1 
## 3 hispanic              47   6.98
## 4 white                 54   9.38

And you can combine these functions with the other functions we learned today. For example:

hsb2 %>% group_by(race) %>% filter(math > 70) %>% summarize(n=n())

## # A tibble: 2 x 2
##   race      n
##   <chr> <int>
## 1 asian     1
## 2 white     9

This tells us that 10 people got a math score greater than 70: 1 of them is asian and 9 of them are white.
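Note that groups with no rows left after the filter do not show up in the summary at all. The sketch below shows this on made-up data (scores is not part of hsb2), and also uses count, a dplyr shortcut for this kind of group-and-count summary.

```r
library(dplyr)

# Made-up data, only for illustration
scores <- tibble(
  race = c("asian", "white", "white", "hispanic", "white"),
  math = c(72, 75, 50, 45, 71)
)

# Same pattern as in the example with hsb2
by_group <- scores %>% group_by(race) %>% filter(math > 70) %>% summarize(n = n())

# Shortcut: filter first, then count the rows for each race
with_count <- scores %>% filter(math > 70) %>% count(race)

by_group     # no row for hispanic: nobody in that group has math > 70
with_count
```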

More?

Next time
dplyr cheat sheet

Víctor Peña