Combining datasets

Joining data
Set operations and binding

Joining data

In lecture, we learned about left_join, right_join, inner_join, and full_join. Our starting point is 2 datasets a and b, which have a variable in common named x. We wish to merge the datasets by matching values of x.

left_join(a, b, by='x'): Keep all the rows in a, find rows in dataset b that have matching values of x in a, and include the extra columns in b that don’t appear in a. Don’t include rows in b that have no matching value of x in b.
right_join(a, b, by='x'): Keep all the rows in b, find rows in dataset a that have matching values of x in b, and include the extra columns in a that don’t appear in b. Don’t include rows in a that have no matching value of x in b.
inner_join(a, b, by='x'): Keep rows that have values of x that appear in both a and b.
full_join(a, b, by='x'): Keep all the rows in a and b, even if they don’t have matching values in x.

To make this concrete, let’s work with a specific example. The tibbles below contain names of drinks:

nonalcoholic = tibble(name=c("g&t","mimosa","rum and coke","calimocho","polar"), 
                      mixer=c("tonic","orange juice", "coke", "coke","seltzer"))
alcoholic = tibble(name=c("g&t","mimosa","rum and coke","calimocho","IPA","dark and stormy"), 
                    spirit=c("gin","sparkling wine", "rum", "red wine","beer","rum"))

Let’s try out what happens when we try the different joins on the tibbles:

left_join(nonalcoholic, alcoholic, by='name')

## # A tibble: 5 x 3
##   name         mixer        spirit        
##   <chr>        <chr>        <chr>         
## 1 g&t          tonic        gin           
## 2 mimosa       orange juice sparkling wine
## 3 rum and coke coke         rum           
## 4 calimocho    coke         red wine      
## 5 polar        seltzer      <NA>

right_join(nonalcoholic, alcoholic, by='name')

## # A tibble: 6 x 3
##   name            mixer        spirit        
##   <chr>           <chr>        <chr>         
## 1 g&t             tonic        gin           
## 2 mimosa          orange juice sparkling wine
## 3 rum and coke    coke         rum           
## 4 calimocho       coke         red wine      
## 5 IPA             <NA>         beer          
## 6 dark and stormy <NA>         rum

inner_join(nonalcoholic, alcoholic, by='name')

## # A tibble: 4 x 3
##   name         mixer        spirit        
##   <chr>        <chr>        <chr>         
## 1 g&t          tonic        gin           
## 2 mimosa       orange juice sparkling wine
## 3 rum and coke coke         rum           
## 4 calimocho    coke         red wine

full_join(nonalcoholic, alcoholic, by='name')

## # A tibble: 7 x 3
##   name            mixer        spirit        
##   <chr>           <chr>        <chr>         
## 1 g&t             tonic        gin           
## 2 mimosa          orange juice sparkling wine
## 3 rum and coke    coke         rum           
## 4 calimocho       coke         red wine      
## 5 polar           seltzer      <NA>          
## 6 IPA             <NA>         beer          
## 7 dark and stormy <NA>         rum

We can also use joins for filtering. For example, semi_join(a, b, by='x') keeps all the rows in a that have a matching value of x in b (it doesn’t create more columns with the information in b), and anti_join(a, b, by='x') keeps the rows in a that don’t have a matching value of x in b:

semi_join(nonalcoholic, alcoholic, by='name')

## # A tibble: 4 x 2
##   name         mixer       
##   <chr>        <chr>       
## 1 g&t          tonic       
## 2 mimosa       orange juice
## 3 rum and coke coke        
## 4 calimocho    coke

anti_join(nonalcoholic, alcoholic, by='name')

## # A tibble: 1 x 2
##   name  mixer  
##   <chr> <chr>  
## 1 polar seltzer

Suppose we get prices for the drinks:

nonalcoholic = nonalcoholic %>% mutate(price=c(8, 6, 7, 5, 5))
alcoholic = alcoholic  %>% mutate('$'=c(8, 6, 7, 5, 5, NA))

How can we join the new tibbles nicely?

First, we can rename the price variable in alcoholic:

alcoholic = alcoholic %>% rename(price='$')

And then, we can full_join the tibbles by their name and price (we want a double match):

join = full_join(nonalcoholic, alcoholic, by=c('name','price'))
join

## # A tibble: 7 x 4
##   name            mixer        price spirit        
##   <chr>           <chr>        <dbl> <chr>         
## 1 g&t             tonic            8 gin           
## 2 mimosa          orange juice     6 sparkling wine
## 3 rum and coke    coke             7 rum           
## 4 calimocho       coke             5 red wine      
## 5 polar           seltzer          5 <NA>          
## 6 IPA             <NA>             5 beer          
## 7 dark and stormy <NA>            NA rum

If we want to reorder the columns so that the spirit comes first, followed by the mixer, we can use select:

join = join %>% select(name, spirit, mixer, price)
join

## # A tibble: 7 x 4
##   name            spirit         mixer        price
##   <chr>           <chr>          <chr>        <dbl>
## 1 g&t             gin            tonic            8
## 2 mimosa          sparkling wine orange juice     6
## 3 rum and coke    rum            coke             7
## 4 calimocho       red wine       coke             5
## 5 polar           <NA>           seltzer          5
## 6 IPA             beer           <NA>             5
## 7 dark and stormy rum            <NA>            NA

Now, suppose we got a new tibble with the extra components that were missing for dark and stormy

ds = tibble(name='dark and stormy', spirit='ginger beer', price = 8)

How can we incorporate this new information?

We can start by full_joining

full_join(join, ds, by='name')

## # A tibble: 7 x 6
##   name            spirit.x       mixer        price.x spirit.y    price.y
##   <chr>           <chr>          <chr>          <dbl> <chr>         <dbl>
## 1 g&t             gin            tonic              8 <NA>             NA
## 2 mimosa          sparkling wine orange juice       6 <NA>             NA
## 3 rum and coke    rum            coke               7 <NA>             NA
## 4 calimocho       red wine       coke               5 <NA>             NA
## 5 polar           <NA>           seltzer            5 <NA>             NA
## 6 IPA             beer           <NA>               5 <NA>             NA
## 7 dark and stormy rum            <NA>              NA ginger beer       8

This join has a few problems. We’d like to combine price.x and price.y. We can do that using the function coalesce, and then getting rid of price.x and price.y:

full_join(join, ds, by='name') %>%  
  mutate(price=coalesce(price.x,price.y)) %>% select(-price.x, -price.y)

## # A tibble: 7 x 5
##   name            spirit.x       mixer        spirit.y    price
##   <chr>           <chr>          <chr>        <chr>       <dbl>
## 1 g&t             gin            tonic        <NA>            8
## 2 mimosa          sparkling wine orange juice <NA>            6
## 3 rum and coke    rum            coke         <NA>            7
## 4 calimocho       red wine       coke         <NA>            5
## 5 polar           <NA>           seltzer      <NA>            5
## 6 IPA             beer           <NA>         <NA>            5
## 7 dark and stormy rum            <NA>         ginger beer     8

This is mostly fine. If we want to combine spirit.x and spirit.y:

full_join(join, ds, by='name') %>%  
  mutate(price=coalesce(price.x,price.y)) %>% select(-price.x, -price.y) %>%
    replace(is.na(.),"") %>%
    unite(spirit, spirit.x, spirit.y, sep="")

## # A tibble: 7 x 4
##   name            spirit         mixer        price
##   <chr>           <chr>          <chr>        <dbl>
## 1 g&t             gin            tonic            8
## 2 mimosa          sparkling wine orange juice     6
## 3 rum and coke    rum            coke             7
## 4 calimocho       red wine       coke             5
## 5 polar           ""             seltzer          5
## 6 IPA             beer           ""               5
## 7 dark and stormy rumginger beer ""               8

First, we replaced the NAs by empty spaces, and then we joined the variables spirit.x, and spirit.y using unite. The formatting isn’t great, unforunately. We can play around a little bit to make it look nicer:

ds = ds %>% mutate(spirit=paste(" ", spirit,sep=""))
final = full_join(join, ds, by='name') %>%  
  mutate(price=coalesce(price.x,price.y)) %>%
  select(-price.x, -price.y)  %>%
  replace(is.na(.),"") %>%
  unite(spirit, spirit.x, spirit.y, sep="")
final

## # A tibble: 7 x 4
##   name            spirit          mixer        price
##   <chr>           <chr>           <chr>        <dbl>
## 1 g&t             gin             tonic            8
## 2 mimosa          sparkling wine  orange juice     6
## 3 rum and coke    rum             coke             7
## 4 calimocho       red wine        coke             5
## 5 polar           ""              seltzer          5
## 6 IPA             beer            ""               5
## 7 dark and stormy rum ginger beer ""               8

Now suppose that we’re at some bar where the total price for a mixed drink is equal to the sum of the prices of the components. The prices are

nonalcoholic = tibble(name=c("g&t","mimosa","rum and coke","calimocho","polar"), 
                      mixer=c("tonic","orange juice", "coke", "coke","seltzer"))  %>% 
                      mutate(price=c(3, 2, 2, 2, 5))
alcoholic = tibble(name=c("g&t","mimosa","rum and coke","calimocho","IPA"), 
                    spirit=c("gin","sparkling wine", "rum", "red wine","beer")) %>% 
                    mutate(price=c(5, 4, 5, 3, 5))

How can we merge the data and combine (add) prices?

We can start by full_joining by name

full_join(nonalcoholic, alcoholic, by='name')

## # A tibble: 6 x 5
##   name         mixer        price.x spirit         price.y
##   <chr>        <chr>          <dbl> <chr>            <dbl>
## 1 g&t          tonic              3 gin                  5
## 2 mimosa       orange juice       2 sparkling wine       4
## 3 rum and coke coke               2 rum                  5
## 4 calimocho    coke               2 red wine             3
## 5 polar        seltzer            5 <NA>                NA
## 6 IPA          <NA>              NA beer                 5

Unfortunately, a simple mutate instruction isn’t satisfactory because the sum of an NA + a number is equal to NA:

full_join(nonalcoholic, alcoholic, by='name') %>% mutate(total=price.x+price.y)

## # A tibble: 6 x 6
##   name         mixer        price.x spirit         price.y total
##   <chr>        <chr>          <dbl> <chr>            <dbl> <dbl>
## 1 g&t          tonic              3 gin                  5     8
## 2 mimosa       orange juice       2 sparkling wine       4     6
## 3 rum and coke coke               2 rum                  5     7
## 4 calimocho    coke               2 red wine             3     5
## 5 polar        seltzer            5 <NA>                NA    NA
## 6 IPA          <NA>              NA beer                 5    NA

We can solve this problem in different ways. One way is using coalesce to replace the NAs by zeros in price.x and price.y, and then adding:

full_join(nonalcoholic, alcoholic, by='name') %>% 
  mutate(price.x=coalesce(price.x,0), price.y=coalesce(price.y,0), total=price.x+price.y) %>% 
  select(name, spirit, mixer, total)

## # A tibble: 6 x 4
##   name         spirit         mixer        total
##   <chr>        <chr>          <chr>        <dbl>
## 1 g&t          gin            tonic            8
## 2 mimosa       sparkling wine orange juice     6
## 3 rum and coke rum            coke             7
## 4 calimocho    red wine       coke             5
## 5 polar        <NA>           seltzer          5
## 6 IPA          beer           <NA>             5

In general, the way that R deals with NAs can be a little frustrating, as these examples show.

Set operations and binding

Source: Data wrangling cheat sheet

Consider the tibbles y and z, which have matching variable names x1 and x2 (this is important!):

## # A tibble: 3 x 2
##   x1       x2
##   <chr> <int>
## 1 A         1
## 2 B         2
## 3 C         3

## # A tibble: 3 x 2
##   x1       x2
##   <chr> <int>
## 1 B         2
## 2 C         3
## 3 D         4

Set operations

intersect(y,z): find rows that appear in both y and z

intersect(y,z)

## # A tibble: 2 x 2
##   x1       x2
##   <chr> <int>
## 1 B         2
## 2 C         3

union(y,z): find rows that appear in y or z (or both):

union(y,z)

## # A tibble: 4 x 2
##   x1       x2
##   <chr> <int>
## 1 D         4
## 2 C         3
## 3 B         2
## 4 A         1

setdiff(y, z): find rows that appear in y but not in z:

setdiff(y,z)

## # A tibble: 1 x 2
##   x1       x2
##   <chr> <int>
## 1 A         1

Binding rows and columns

If you want to bind the rows of y and z

bind_rows(y,z)

## # A tibble: 6 x 2
##   x1       x2
##   <chr> <int>
## 1 A         1
## 2 B         2
## 3 C         3
## 4 B         2
## 5 C         3
## 6 D         4

If you want to bind the columns of y and z:

bind_cols(y,z)

## # A tibble: 3 x 4
##   x1       x2 x11     x21
##   <chr> <int> <chr> <int>
## 1 A         1 B         2
## 2 B         2 C         3
## 3 C         3 D         4

Combining datasets

Víctor Peña

Joining data

Set operations and binding