Missing data with dplyr

In R, missing data are coded as NA. You can filter out missing data easily using dplyr.

We’re going to use the dataset speed_gender_height.csv, which contained heights, genders, and maximum driving speeds for a sample of 1325 invididuals. The variable speed had some missing values and 0s, which corresponded to people who had never driven before. We can filter out people who never drove and people whose value of speed is missing as follows:

speed = as_tibble(read.csv('./speed_gender_height.csv'))
speed = speed %>% filter(!is.na(speed), speed > 0) %>% select(-X)

In general, we can omit rows that have missing values using na.omit:

speed = speed %>% na.omit 

Combining data sets

In lecture, we learned about left_join, right_join, inner_join, and full_join. Our starting point is 2 datasets a and b, which have a variable in common named x. We wish to merge the datasets by matching values of x.

To make this concrete, let’s work with a specific example. The tibbles below contain names of drinks:

nonalcoholic = tibble(name=c("g&t","mimosa","rum and coke","calimocho","polar"), 
                      mixer=c("tonic","orange juice", "coke", "coke","seltzer"))
alcoholic = tibble(name=c("g&t","mimosa","rum and coke","calimocho","IPA","dark and stormy"), 
                    spirit=c("gin","sparkling wine", "rum", "red wine","beer","rum"))

Let’s try out what happens when we try the different joins on the tibbles:

left_join(nonalcoholic, alcoholic, by='name') 
## # A tibble: 5 x 3
##   name         mixer        spirit        
##   <chr>        <chr>        <chr>         
## 1 g&t          tonic        gin           
## 2 mimosa       orange juice sparkling wine
## 3 rum and coke coke         rum           
## 4 calimocho    coke         red wine      
## 5 polar        seltzer      <NA>
right_join(nonalcoholic, alcoholic, by='name') 
## # A tibble: 6 x 3
##   name            mixer        spirit        
##   <chr>           <chr>        <chr>         
## 1 g&t             tonic        gin           
## 2 mimosa          orange juice sparkling wine
## 3 rum and coke    coke         rum           
## 4 calimocho       coke         red wine      
## 5 IPA             <NA>         beer          
## 6 dark and stormy <NA>         rum
inner_join(nonalcoholic, alcoholic, by='name') 
## # A tibble: 4 x 3
##   name         mixer        spirit        
##   <chr>        <chr>        <chr>         
## 1 g&t          tonic        gin           
## 2 mimosa       orange juice sparkling wine
## 3 rum and coke coke         rum           
## 4 calimocho    coke         red wine
full_join(nonalcoholic, alcoholic, by='name')
## # A tibble: 7 x 3
##   name            mixer        spirit        
##   <chr>           <chr>        <chr>         
## 1 g&t             tonic        gin           
## 2 mimosa          orange juice sparkling wine
## 3 rum and coke    coke         rum           
## 4 calimocho       coke         red wine      
## 5 polar           seltzer      <NA>          
## 6 IPA             <NA>         beer          
## 7 dark and stormy <NA>         rum

We can also use joins for filtering. For example, semi_join(a, b, by='x') keeps all the rows in a that have a matching value of x in b (it doesn’t create more columns with the information in b), and anti_join(a, b, by='x') keeps the rows in a that don’t have a matching value of x in b:

semi_join(nonalcoholic, alcoholic, by='name')
## # A tibble: 4 x 2
##   name         mixer       
##   <chr>        <chr>       
## 1 g&t          tonic       
## 2 mimosa       orange juice
## 3 rum and coke coke        
## 4 calimocho    coke
anti_join(nonalcoholic, alcoholic, by='name')
## # A tibble: 1 x 2
##   name  mixer  
##   <chr> <chr>  
## 1 polar seltzer

Suppose we get prices for the drinks:

nonalcoholic = nonalcoholic %>% mutate(price=c(8, 6, 7, 5, 5))
alcoholic = alcoholic  %>% mutate('$'=c(8, 6, 7, 5, 5, NA))

How can we join the new tibbles nicely?

First, we can rename the price variable in alcoholic:

alcoholic = alcoholic %>% rename(price='$')

And then, we can full_join the tibbles by their name and price (we want a double match):

join = full_join(nonalcoholic, alcoholic, by=c('name','price'))
join
## # A tibble: 7 x 4
##   name            mixer        price spirit        
##   <chr>           <chr>        <dbl> <chr>         
## 1 g&t             tonic            8 gin           
## 2 mimosa          orange juice     6 sparkling wine
## 3 rum and coke    coke             7 rum           
## 4 calimocho       coke             5 red wine      
## 5 polar           seltzer          5 <NA>          
## 6 IPA             <NA>             5 beer          
## 7 dark and stormy <NA>            NA rum

If we want to reorder the columns so that the spirit comes first, followed by the mixer, we can use select:

join = join %>% select(name, spirit, mixer, price)
join
## # A tibble: 7 x 4
##   name            spirit         mixer        price
##   <chr>           <chr>          <chr>        <dbl>
## 1 g&t             gin            tonic            8
## 2 mimosa          sparkling wine orange juice     6
## 3 rum and coke    rum            coke             7
## 4 calimocho       red wine       coke             5
## 5 polar           <NA>           seltzer          5
## 6 IPA             beer           <NA>             5
## 7 dark and stormy rum            <NA>            NA

Now, suppose we got a new tibble with the extra components that were missing for dark and stormy

ds = tibble(name='dark and stormy', spirit='ginger beer', price = 8)

How can we incorporate this new information?

We can start by full_joining

full_join(join, ds, by='name')
## # A tibble: 7 x 6
##   name            spirit.x       mixer        price.x spirit.y    price.y
##   <chr>           <chr>          <chr>          <dbl> <chr>         <dbl>
## 1 g&t             gin            tonic              8 <NA>             NA
## 2 mimosa          sparkling wine orange juice       6 <NA>             NA
## 3 rum and coke    rum            coke               7 <NA>             NA
## 4 calimocho       red wine       coke               5 <NA>             NA
## 5 polar           <NA>           seltzer            5 <NA>             NA
## 6 IPA             beer           <NA>               5 <NA>             NA
## 7 dark and stormy rum            <NA>              NA ginger beer       8

This join has a few problems. We’d like to combine price.x and price.y. We can do that using the function coalesce, and then getting rid of price.x and price.y:

full_join(join, ds, by='name') %>%  
  mutate(price=coalesce(price.x,price.y)) %>% select(-price.x, -price.y)
## # A tibble: 7 x 5
##   name            spirit.x       mixer        spirit.y    price
##   <chr>           <chr>          <chr>        <chr>       <dbl>
## 1 g&t             gin            tonic        <NA>            8
## 2 mimosa          sparkling wine orange juice <NA>            6
## 3 rum and coke    rum            coke         <NA>            7
## 4 calimocho       red wine       coke         <NA>            5
## 5 polar           <NA>           seltzer      <NA>            5
## 6 IPA             beer           <NA>         <NA>            5
## 7 dark and stormy rum            <NA>         ginger beer     8

This is mostly fine. If we want to combine spirit.x and spirit.y:

full_join(join, ds, by='name') %>%  
  mutate(price=coalesce(price.x,price.y)) %>% select(-price.x, -price.y) %>%
    replace(is.na(.),"") %>%
    unite(spirit, spirit.x, spirit.y, sep="")
## # A tibble: 7 x 4
##   name            spirit         mixer        price
##   <chr>           <chr>          <chr>        <dbl>
## 1 g&t             gin            tonic            8
## 2 mimosa          sparkling wine orange juice     6
## 3 rum and coke    rum            coke             7
## 4 calimocho       red wine       coke             5
## 5 polar           ""             seltzer          5
## 6 IPA             beer           ""               5
## 7 dark and stormy rumginger beer ""               8

First, we replaced the NAs by empty spaces, and then we joined the variables spirit.x, and spirit.y using unite. The formatting isn’t great, unforunately. We can play around a little bit to make it look nicer:

ds = ds %>% mutate(spirit=paste(" ", spirit,sep=""))
final = full_join(join, ds, by='name') %>%  
  mutate(price=coalesce(price.x,price.y)) %>%
  select(-price.x, -price.y)  %>%
  replace(is.na(.),"") %>%
  unite(spirit, spirit.x, spirit.y, sep="")
final
## # A tibble: 7 x 4
##   name            spirit          mixer        price
##   <chr>           <chr>           <chr>        <dbl>
## 1 g&t             gin             tonic            8
## 2 mimosa          sparkling wine  orange juice     6
## 3 rum and coke    rum             coke             7
## 4 calimocho       red wine        coke             5
## 5 polar           ""              seltzer          5
## 6 IPA             beer            ""               5
## 7 dark and stormy rum ginger beer ""               8

Now suppose that we’re at some bar where the total price for a mixed drink is equal to the sum of the prices of the components. The prices are

nonalcoholic = tibble(name=c("g&t","mimosa","rum and coke","calimocho","polar"), 
                      mixer=c("tonic","orange juice", "coke", "coke","seltzer"))  %>% 
                      mutate(price=c(3, 2, 2, 2, 5))
alcoholic = tibble(name=c("g&t","mimosa","rum and coke","calimocho","IPA"), 
                    spirit=c("gin","sparkling wine", "rum", "red wine","beer")) %>% 
                    mutate(price=c(5, 4, 5, 3, 5))

How can we merge the data and combine (add) prices?

We can start by full_joining by name

full_join(nonalcoholic, alcoholic, by='name')
## # A tibble: 6 x 5
##   name         mixer        price.x spirit         price.y
##   <chr>        <chr>          <dbl> <chr>            <dbl>
## 1 g&t          tonic              3 gin                  5
## 2 mimosa       orange juice       2 sparkling wine       4
## 3 rum and coke coke               2 rum                  5
## 4 calimocho    coke               2 red wine             3
## 5 polar        seltzer            5 <NA>                NA
## 6 IPA          <NA>              NA beer                 5

Unfortunately, a simple mutate instruction isn’t satisfactory because the sum of an NA + a number is equal to NA:

full_join(nonalcoholic, alcoholic, by='name') %>% mutate(total=price.x+price.y)
## # A tibble: 6 x 6
##   name         mixer        price.x spirit         price.y total
##   <chr>        <chr>          <dbl> <chr>            <dbl> <dbl>
## 1 g&t          tonic              3 gin                  5     8
## 2 mimosa       orange juice       2 sparkling wine       4     6
## 3 rum and coke coke               2 rum                  5     7
## 4 calimocho    coke               2 red wine             3     5
## 5 polar        seltzer            5 <NA>                NA    NA
## 6 IPA          <NA>              NA beer                 5    NA

We can solve this problem in different ways. One way is using coalesce to replace the NAs by zeros in price.x and price.y, and then adding:

full_join(nonalcoholic, alcoholic, by='name') %>% 
  mutate(price.x=coalesce(price.x,0), price.y=coalesce(price.y,0), total=price.x+price.y) %>% 
  select(name, spirit, mixer, total)
## # A tibble: 6 x 4
##   name         spirit         mixer        total
##   <chr>        <chr>          <chr>        <dbl>
## 1 g&t          gin            tonic            8
## 2 mimosa       sparkling wine orange juice     6
## 3 rum and coke rum            coke             7
## 4 calimocho    red wine       coke             5
## 5 polar        <NA>           seltzer          5
## 6 IPA          beer           <NA>             5

In general, the way that R deals with NAs can be a little frustrating, as these examples show.